
Hi,

Recently I was watching a clip from the WAN Show about Optane. That got me thinking: can we make a GPU that has an SSD (~128GB) and uses it, with caching, as the GPU's VRAM? The SSD would be soldered onto the GPU board.

What would be wrong with this? Do you guys see anything wrong with this?

https://linustechtips.com/topic/1638685-optane-but-for-gpus/

It's certainly possible, but SSDs are orders of magnitude slower than GDDR. So it would only make sense for a use case where memory capacity is way more important than memory speed.

https://linustechtips.com/topic/1638685-optane-but-for-gpus/#findComment-16905957

1 hour ago, DavidPatrascu said:

Hi,

Recently I was watching a clip from the WAN Show about Optane. That got me thinking: can we make a GPU that has an SSD (~128GB) and uses it, with caching, as the GPU's VRAM? The SSD would be soldered onto the GPU board.

What would be wrong with this? Do you guys see anything wrong with this?

Possible? Absolutely, AMD did it years ago with the Radeon Pro SSG.

Worth it? Not really. RAM is waaaay faster than any SSD at the moment, and that probably won't change in the near future.

It's more likely that some way to expand VRAM will appear sometime in the future, and even that is unlikely to happen. Soldering an SSD onto the card would also be a bad decision: it wouldn't solve the issue, it would simply drive GPU prices way up, and it would multiply the number of GPU models already on the market by something like 10x. You might end up with 7 different subtiers of each GPU. Think about a 5060 that could come with a 32GB, 64GB, 128GB, 256GB, 512GB, 1TB or 2TB SSD. This would be HELL.

https://linustechtips.com/topic/1638685-optane-but-for-gpus/#findComment-16905962

You can only use Optane for hosting the model and temporary parameters. The active context will still be generated and operated on by the GPU. I.e., it's a DRAM substitute, just a slower one. But it'll work fine with, say, a 5090.

https://linustechtips.com/topic/1638685-optane-but-for-gpus/#findComment-16905966

1 hour ago, DavidPatrascu said:

Hi,

Recently I was watching a clip from the WAN Show about Optane. That got me thinking: can we make a GPU that has an SSD (~128GB) and uses it, with caching, as the GPU's VRAM? The SSD would be soldered onto the GPU board.

What would be wrong with this? Do you guys see anything wrong with this?

 

The problem you're missing is they are talking about Optane in the DDR4 slots on a server platform, where the AI workload is a CPU/GPU hybrid. That setup is a 3060 and a fairly old CPU, which is why it only gets 4 tokens per second (which is super slow anyway). A mobile device with a dedicated NPU may only get that on a much smaller model.

 

The reality is that an LLM chatbot requires hundreds of gigabytes of dedicated memory to do anything useful, and going the Optane route doesn't actually improve the speed. It only makes it possible to run a full model instead of a quantized one, which in the context of an LLM might not really affect the result that much.

 

However, other kinds of ML like automatic speech recognition (ASR), text to speech (TTS), and RVC are heavily dependent on the amount of video memory on the GPU to run inference, because if you run them on the CPU, it's too slow to be used in real time. This is why most "AI voices" you hear have lag between the input and the output, and why they often sound robotic: speech has to be chopped up into 140-character pieces to work on an 8GB GPU. It's also why the speech generators are often still 22kHz/16-bit audio that sounds terrible, instead of 96kHz/24-bit audio that actually sounds like a real human in a real human environment.

 

All the AI TTS voices used by streamers on Twitch, be it Google's or Amazon's stock voices, Tangia, or something else, sound so unconvincing and robotic, even when it's the streamer's own voice being used, because they are not being run on the streamer's PC on a dedicated GPU. They're being run on some virtual machine that is spun up, generates the voice, and is spun down again if another request isn't made within a few minutes. Just spinning the machine up at all incurs a loading penalty to copy that 8GB model into the GPU. Then the GPU goes to 100% load for the 3-4 seconds it needs to generate the audio, and drops back down to the lowest power state the VM has configured for it. If you want no latency, that GPU has to live in the same room, on a device that is on the local network.

 

Fortunately, most streamers aren't actually using a TTS full-time; they're just using it for giggles and jokes. So if it takes 20 minutes to round-trip someone dumping the Bee Movie script into it, the streamer is just going to kill the audio after 10 seconds, so best not to waste your time on anything longer than 8 seconds.

 

Going back to the usage of Optane: if you have a CPU ML load that depends on a large model for accuracy, not latency, then yes, LLMs and speech generation can both benefit from it in the same way, by having models that support much longer prompts/input and much larger context windows to work with. But if it relies on speed (e.g. ASR), no, it will not work. ASR is one of the few ML things that can NOT run on the CPU with any level of accuracy. It has to run on the GPU to be accurate, and even then it has to be a GPU with 20GB+ of VRAM so the entire word scoring model can be loaded, not simply the sounds.

 

LLMs don't really pay a cost for high latency, because when humans talk to humans we still "think about it", so expecting a delay in a response is normal. But if the output has to go through a TTS, which then breaks everything up along sentences, or every 72 characters, in order to fit in the context space, then it sounds very robotic, even if the underlying voice is fairly human sounding. What happens is you get a break in the sentence: the first half sounds like a statement, then suddenly the second half sounds like a question, and it's awkward. One streamer who uses a TTS full time changed their setup from sentence-at-once to word-phrase-at-once, and it just sounds so awful now because it will tone shift multiple times.

 

Optane or SSDs on the video card can't improve the situation when it is a memory bandwidth/latency issue. They can only help where there is NO latency penalty to their use. So while an LLM is largely the reason to go in this direction, it's not the only use case. CV that merely needs to recognize a subject, and not react to the subject, is also a good use for that. If your CV needs to recognize dangerous situations, then no, you probably want dedicated GPU/NPU hardware with enough RAM to hold the entire model at once.

 

 

 

https://linustechtips.com/topic/1638685-optane-but-for-gpus/#findComment-16905967

1 hour ago, Kisai said:

The problem you're missing is they are talking about Optane in the DDR4

Sorry, but are you sure about this one? Optane was never meant as a substitute for on-board RAM. Whether it was the 32-96GB M.2 sticks of Rapid Storage cache/memory or the actual server-grade 900p+ PCIe expansion cards (faster than the sticks), it's all the same Optane memory, with the PCIe cards having 256GB+ capacity for super-low-latency storage (or any other caching operation). It can work in parallel with RAM, serving as the place to host the running model, but that's it.

https://linustechtips.com/topic/1638685-optane-but-for-gpus/#findComment-16905989

Optane cache for a GPU doesn't really make sense as is. It's meant as a cache for mechanical drives; I did build a system that took advantage of it.

[image]

 

Optane DIMMs make a lot more sense. At 10GB/s read per module they are quite bandwidth starved, even on large server CPUs, hence the low TPS, but they are a cheap way to get a huge amount of system memory. A modern server CPU with twelve- or sixteen-channel DDR5 can get between 500GB/s and 1TB/s of bandwidth.

[image]

 

One thing that was done is DirectStorage. GPUs can stream textures directly from storage without passing through RAM. It will take time for game engines to take advantage of that, and it will mostly speed up loading.

 

The other thing is High Bandwidth Flash. It replaces GDDR with flash on the GPU, and works for workloads where a huge amount of data has to be loaded once and then only read, never written. It could work for textures, but it is most useful for the parameters of AI models, which are read-only. Nothing prevents making flash DIMMs or, more likely, flash CAMM modules.

 

A Xeon 7 with 16 channels at 9000 MT/s has around 1.1TB/s of bandwidth. That's GPU territory. If flash DIMM modules were a thing, a setup with 8 channels of DDR5 and 8 channels of flash DIMMs could be a fairly incredible device for LLM inference.
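For anyone who wants to check where a figure like 1.1TB/s comes from, here is a rough sketch of the usual peak-bandwidth arithmetic. It assumes every channel carries 8 bytes (64 data bits) per transfer and that "speed 9000" means 9000 MT/s; the function name is made up for illustration, and real sustained bandwidth will be lower than this theoretical peak.

```python
def peak_bandwidth_gbs(channels: int, mega_transfers_per_s: int, bus_bytes: int = 8) -> float:
    """Theoretical peak memory bandwidth in GB/s.

    Assumption: each channel moves `bus_bytes` per transfer
    (8 bytes = a 64-bit data bus per channel).
    """
    return channels * mega_transfers_per_s * bus_bytes / 1000.0


# 16 channels at 9000 MT/s, the configuration mentioned in the post.
full_system = peak_bandwidth_gbs(16, 9000)

# The DRAM half of the hypothetical 8 + 8 split (8 channels of DDR5).
dram_half = peak_bandwidth_gbs(8, 9000)

print(f"16 channels @ 9000 MT/s: {full_system:.0f} GB/s")
print(f" 8 channels @ 9000 MT/s: {dram_half:.0f} GB/s")
```

The flash half of that split is left out on purpose, since nothing in the thread says how fast a flash DIMM channel would be.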

[image]

 

https://linustechtips.com/topic/1638685-optane-but-for-gpus/#findComment-16905993

2 hours ago, Timme said:

Sorry, but are you sure about this one? Optane was never meant as a substitute for on-board RAM. Whether it was the 32-96GB M.2 sticks of Rapid Storage cache/memory or the actual server-grade 900p+ PCIe expansion cards (faster than the sticks), it's all the same Optane memory, with the PCIe cards having 256GB+ capacity for super-low-latency storage (or any other caching operation). It can work in parallel with RAM, serving as the place to host the running model, but that's it.

Here's the Reddit post:

[image: screenshot of the Reddit post]

 

This is literally the story that was on the WAN Show.

 

Now if you pick it apart, you realize two problems with this:

- You need an Intel platform that supports Optane.

- It only works with one specific LLM that supports this mixed CPU/GPU configuration.

 

Obviously, if someone could afford 768GB in one GPU, that would beat the pants off this handily. Heck, let me show you another post from that subreddit just to drive the point home:

r/LocalLLaMA - PSA

 

That DGX Spark is $5000 and only has 128GB of RAM.

 

I was considering getting one at some point and then saw the bandwidth and was like "nope, that is worse than the GPU I have" (608GB/s on the 3070 Ti and 936GB/s on the 3090). As I already know the performance of running the stuff I have on these two GPUs, anything less than what the 3070 Ti (or the B70 in that image) has is not going to cut it. That 3060 from the Reddit post is 360GB/s.

 

Memory bandwidth and GPU compute determine how long things take. Capacity determines how big the model can be. So if you use Optane, in theory, you can run a really large model, but the memory bandwidth of Optane is already worse than that of regular RAM in the same slots, by about 75%. So if you ran the same benchmark with 1TB of RAM, the expectation would be 16 tokens/s, not 4.
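The 16 tokens/s estimate can be reproduced with a quick sketch. It assumes token generation is purely memory-bandwidth bound, so the token rate scales linearly with bandwidth; that is a simplification, and the function name is only illustrative.

```python
def scaled_tokens_per_s(measured_tok_s: float, bandwidth_fraction: float) -> float:
    """Estimate the token rate on faster memory.

    `bandwidth_fraction` is the slow memory's bandwidth as a fraction of the
    fast memory's bandwidth. Assumes generation is purely bandwidth bound.
    """
    return measured_tok_s / bandwidth_fraction


optane_tok_s = 4.0                     # measured rate from the Reddit post
optane_fraction_of_ram = 1.0 - 0.75    # "about 75% less" bandwidth than RAM

ram_estimate = scaled_tokens_per_s(optane_tok_s, optane_fraction_of_ram)
print(f"Estimated rate with plain RAM: {ram_estimate:.0f} tokens/s")
```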

 

The idea, I presume, is that you load the model into memory once, and the persistence keeps it from needing to be unloaded and reloaded as it would if it were simply RAM. From practical experience with some AI stuff, training constantly has to load stuff into the GPU and unload it, but inference only loads it once, and doesn't even unload it when the program ends. It only gets unloaded when the computer is shut down or it's forced out by a different GPU process, because the underlying driver, I assume, is expecting to run the same model again, and if it's still "warm" it can keep the already allocated memory assigned to it.

 

Or maybe it was just a weird artifact of what I was using. Point being, if AI ever hopes to be a "local" use situation, the amount of RAM in these devices has to go way, way up, and the manufacturers have to stop being so greedy, or they are literally leaving money on the table, as there are more customers than just data centers.

 

https://linustechtips.com/topic/1638685-optane-but-for-gpus/#findComment-16906030

I'm not going to bring up things that were as fast as Optane; non-volatile memory is just a weird market. PCM has been in research for decades, and there are thousands of papers on the stuff.

But yes, we are getting "Optane for GPUs", though it's not necessarily for GPUs: XL-FLASH. While that is specifically Kioxia's solution, Nvidia is spreading around a lot of money for storage class memory (that is your search term to research this with), to get drives with millions of IOPS to feed GPUs with larger models.

Optane was too early (or too late) for what the market demanded out of SCM.

Also, no, it wasn't meant to be a cache for HDDs. It was meant to be slotted between RAM and SSD in the hierarchy. That's like saying L3 cache is for SSDs, not for DRAM... that's just a weird way to think about it. @05032-Mendicant-Bias
Phase Change Memory (PCM) for High Density Storage Class Memory (SCM)  Applications

 

  

6 hours ago, DavidPatrascu said:

Hi,

Recently I was watching a clip from the WAN Show about Optane. That got me thinking: can we make a GPU that has an SSD (~128GB) and uses it, with caching, as the GPU's VRAM? The SSD would be soldered onto the GPU board.

What would be wrong with this? Do you guys see anything wrong with this?

But as for a GPU with SSD access, that is also already a thing now. That is the whole idea with DirectStorage, which the PS5 uses an equivalent of.

GPUs have direct read access to the SSD and can bypass the CPU. Would an SSD on board be much faster? Not really. How are you connecting the two? It's still going to be some kind of PCIe SerDes thing. (Notice who Jensen was hanging out with at Computex this last week.)

https://linustechtips.com/topic/1638685-optane-but-for-gpus/#findComment-16906036

5 hours ago, Kisai said:

- It only works with one specific LLM that supports this mixed CPU/GPU configuration.

You can do that with any LLM. Heck, you can even read those straight from disk (which is awfully slow, ofc).

They also used a Q2 quant of K2.5, at 375GB. That model is also an MoE with "only" 32B active params, which means that (at that quant) you only do a forward pass over ~12GB of weights per token (ignoring things like context and whatnot).
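As a sanity check on the ~12GB figure, here is a small sketch that works backwards from the numbers in this post. It assumes every weight is stored with the same number of bits, which real quants don't do exactly (they mix precisions), so treat the results as ballpark values; the variable names are illustrative.

```python
GB = 1e9  # decimal gigabytes, matching the file sizes quoted in the post

active_params = 32e9          # "only" 32B active params per token
q2_active_bytes = 12 * GB     # ~12GB of weights touched per token at Q2
q2_file_bytes = 375 * GB      # size of the whole Q2 quant


def bits_per_weight(size_bytes: float, n_params: float) -> float:
    """Average storage per parameter, assuming uniform precision."""
    return size_bytes * 8 / n_params


q2_bits = bits_per_weight(q2_active_bytes, active_params)

# Total parameter count implied by the file size at that average precision.
implied_total_params = q2_file_bytes * 8 / q2_bits

# Share of the model that is actually read for each generated token.
active_fraction = active_params / implied_total_params

print(f"Average bits per weight: {q2_bits:.1f}")
print(f"Implied total params:    {implied_total_params / 1e9:.0f}B")
print(f"Active fraction:         {active_fraction:.1%}")
```

So the 12GB number is consistent with roughly 3 bits per weight and only a few percent of the model being read per token, which is why an MoE is much less bandwidth hungry than its file size suggests.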

 

4tok/s is AWFULLY bad when you consider the above. It means that you're stuck at something like ~48GB/s of bandwidth. Ofc that's way faster than mmap'ing straight from disk (given that PCIe 5.0 x4 would top out at less than 16GB/s), but that system should be able to achieve ~127GB/s of memory bandwidth in theory (6-channels @2666MT/s).

A single DDR4 2666MT/s DIMM should in theory provide 21GB/s, but that specific Optane PMEM DIMM seems to cap at ~7GB/s according to this:

https://www.servethehome.com/intel-optane-dc-persistent-memory-guide-for-pmem-100-pmem-200-and-pmem-300-optane-dimms/

So that seems to track, Optane provides at most 1/3rd of the bandwidth of a regular DIMM, but does provide ~4x more capacity.
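The arithmetic behind those figures, as a rough sketch. It assumes each token reads the ~12GB of active weights exactly once and that a DDR4 channel moves 8 bytes per transfer; the ~7GB/s Optane cap is the value quoted from the ServeTheHome guide, not something measured here.

```python
def effective_bandwidth_gbs(tokens_per_s: float, active_gb: float) -> float:
    """Bandwidth implied by reading `active_gb` of weights once per token."""
    return tokens_per_s * active_gb


# Observed on the Optane system: 4 tok/s over ~12GB of active weights.
observed = effective_bandwidth_gbs(4, 12)

# Theoretical DDR4-2666: 2666 MT/s * 8 bytes per transfer, per channel.
per_channel = 2666 * 8 / 1000
six_channel_peak = 6 * per_channel

# Optane PMEM DIMM read cap relative to one regular DDR4 channel.
optane_per_dimm = 7.0
optane_ratio = optane_per_dimm / per_channel

print(f"Observed effective bandwidth: {observed:.0f} GB/s")
print(f"DDR4-2666 per channel:        {per_channel:.1f} GB/s")
print(f"DDR4-2666 six-channel peak:   {six_channel_peak:.1f} GB/s")
print(f"Optane DIMM vs DDR4 channel:  {optane_ratio:.2f}x")
```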

 

A modern server CPU with 12 memory channels can provide way more bandwidth and capacity, see:

That's running a Q4 quant, at 583GB, which is pretty much full quality given that the original model is INT4 @ 600GB. That means 18GB of active parameters, and at ~12tok/s means an effective 216GB/s of memory bandwidth WITHOUT a GPU. If they were to run that same Q2 model, they'd likely manage ~18tok/s, a 4.5x speedup without any GPUs involved.
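A quick sketch of that projection, under the same simplification as before: generation is bandwidth bound, and each token reads the active weights once. The Q2 rate is an estimate derived from the Q4 run, not a measurement.

```python
def tokens_per_s(bandwidth_gbs: float, active_gb: float) -> float:
    """Token rate if each token needs `active_gb` read at `bandwidth_gbs`."""
    return bandwidth_gbs / active_gb


# Q4 run on the EPYC: ~12 tok/s over ~18GB of active weights.
epyc_effective_bandwidth = 12 * 18   # GB/s actually achieved

# Same machine, but with the Q2 quant (~12GB of active weights).
q2_estimate = tokens_per_s(epyc_effective_bandwidth, 12)

# Compared with the 4 tok/s of the Optane + 3060 system.
speedup = q2_estimate / 4

print(f"Effective bandwidth: {epyc_effective_bandwidth} GB/s")
print(f"Estimated Q2 rate:   {q2_estimate:.0f} tok/s")
print(f"Speedup vs Optane:   {speedup:.1f}x")
```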

Ofc that's way less than the theoretical 576GB/s that setup should be able to provide, but adding even a simple 16GB GPU should be able to speed things up a lot (especially prefill). With an RTX PRO 6000 in hybrid mode with that same model, their prefill speed got a nice 6x boost and TG went from 12tok/s to almost 20tok/s.

One minor caveat is that the 9175F in that post has 16x CCDs, with only one core per CCD. Each core should be able to do ~52GB/s reads, whereas each CCD should be able to do ~106GB/s reads, so a single core can't really saturate all of the capability of a CCD, but with 16x CCDs that CPU should be more than able to saturate the total memory bandwidth. Reference:

https://chipsandcheese.com/p/amds-turin-5th-gen-epyc-launched
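The CCD caveat can be put into numbers with a small sketch. It assumes read bandwidth simply adds up across CCDs and that each CCD is limited by whichever is lower, its link or its active cores; real scaling will be less clean than this.

```python
ccds = 16
active_cores_per_ccd = 1
per_core_read_gbs = 52.0     # approximate read bandwidth of one core
per_ccd_read_gbs = 106.0     # approximate read bandwidth of one CCD
dram_peak_gbs = 576.0        # theoretical memory bandwidth of that setup

# With a single core per CCD, the core is the limit rather than the CCD link.
per_ccd_achievable = min(per_ccd_read_gbs, per_core_read_gbs * active_cores_per_ccd)

# Naive sum over all CCDs.
aggregate_read_gbs = ccds * per_ccd_achievable

can_saturate_dram = aggregate_read_gbs >= dram_peak_gbs

print(f"Per-CCD achievable: {per_ccd_achievable:.0f} GB/s")
print(f"Aggregate reads:    {aggregate_read_gbs:.0f} GB/s")
print(f"Saturates DRAM:     {can_saturate_dram}")
```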

 

6 hours ago, Kisai said:

That DGX Spark is $5000 and only has 128GB of RAM.

$3500*. That's how much the Asus model is currently going for on Amazon US.

6 hours ago, Kisai said:

The idea, I presume, is that you load the model into memory once, and the persistence keeps it from needing to be unloaded and reloaded as it would if it were simply RAM.

That's not even possible with Optane in memory config. In such a setup it works exactly like RAM, and data is lost once a reboot takes place.

Ofc one could use the persistent mode and then mmap out of it, but I'm not sure what performance would be like.
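For reference, this is roughly what "mmap out of it" looks like, as a minimal sketch using only the Python standard library. The weights file here is a small stand-in written to a temp directory; on a real persistent-mode setup the file would presumably sit on a filesystem backed by the Optane modules, which is an assumption and not something this sketch sets up or benchmarks.

```python
import mmap
import os
import struct
import tempfile

# Stand-in "weights" file: 1024 little-endian float32 values 0.0, 1.0, 2.0, ...
n_values = 1024
path = os.path.join(tempfile.mkdtemp(), "weights.bin")
with open(path, "wb") as f:
    f.write(struct.pack(f"<{n_values}f", *map(float, range(n_values))))


def read_weights(mapped: mmap.mmap, start: int, count: int) -> tuple:
    """Read `count` float32 values starting at index `start` from the mapping."""
    offset = start * 4  # 4 bytes per float32
    return struct.unpack_from(f"<{count}f", mapped, offset)


with open(path, "rb") as f:
    # Length 0 maps the whole file. Nothing is copied up front; pages are
    # only fetched from the backing storage when they are first touched.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        chunk = read_weights(mapped, 512, 4)

print(chunk)
```

The point of the design is that only the slices that get read are ever pulled in, so the speed you see is set by the backing storage, which is exactly the open question for persistent mode.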

https://linustechtips.com/topic/1638685-optane-but-for-gpus/#findComment-16906182