
Running Large Language Models (LLMs) Locally on GPU and CPU: Benchmarks and Next-Gen Prospects

I wanted to kick off a discussion about an increasingly popular topic in the AI and tech community: running large language models (LLMs) on different hardware.

 

One of the remarkable aspects of this trend is how accessible it has become.

 

I wanted to discuss the real game-changer – running LLMs not just on pricey GPUs, but on CPUs. I'd like to see LLM benchmarks covering GPUs, CPUs, and RAM configurations. Basic models like Llama 2 could serve as excellent candidates for measuring generation and processing speeds across these different hardware configurations. Since I'm looking to upgrade, this is a high priority for me.

 

1. Moving on to the details, I'd like to ask some questions about the performance of running LLMs on specific CPUs, RAM, and motherboards. Does RAM frequency play a key role in generation speed? And what about RAM timings – do they impact generation speed significantly? As far as I understand, the difference between CL40 and CL32 timings might be minimal or negligible. And motherboard chipsets – is there any reason to get a modern high-end one to avoid bandwidth issues of some kind (B760 vs. Z790, for example)? And then there's the standard Intel vs. AMD holy war for CPU processing, but more on that later.

 

2. I'm also intrigued by the idea of optimizing RAM setups for LLMs. For instance, is it more beneficial to have 4x24 GB or 4x16 GB sticks in single channel rather than 2x48 GB sticks in dual channel (at the maximum frequencies available on the market)? Could those arrangements improve bandwidth for LLM processing?

 

3. Looking ahead, it's exciting to consider the upcoming 14th-gen Intel and 8000-series AMD CPUs. Rumors suggest these processors will feature integrated GPUs. It would be really interesting to explore how productive they are for LLM processing without requiring any additional GPUs – at least for a low-budget enthusiast like me =). This could potentially be a game-changer.


I haven't found a similar thread searching for 'llm' or 'llama', nor a better place to ask these questions. I did find an opinion on Reddit that this stuff is for basement trolls. Sure, it's not as widespread as Blender (which, notably, also relies mostly on CPUs and RAM, especially for simulations, high-poly work, etc.). Also, English is not my native language, sorry in advance.


51 minutes ago, kex243 said:

2. I'm also intrigued by the idea of optimizing RAM setups for LLMs. For instance, is it more beneficial to have 4x24 GB or 4x16 GB sticks in single channel rather than 2x48 GB sticks in dual channel (at the maximum frequencies available on the market)? Could those arrangements improve bandwidth for LLM processing?

On a consumer board, 2x48, 4x24 or 4x16 should all run in dual channel, so I'm not quite sure what you're trying to get at.

 

Dual channel doubles your bandwidth compared to single channel, so I don't see how having single channel with more sticks would improve anything. If bandwidth is beneficial for a specific model (and latency is not an issue), then you'd want a server board with quad-channel (or more) instead.


2 minutes ago, Eigenvektor said:

On a consumer board, 2x48, 4x24 or 4x16 should all run in dual channel, so I'm not quite sure what you're trying to get at.

 

Dual channel doubles your bandwidth compared to single channel, so I don't see how having single channel with more sticks would improve anything. If bandwidth is beneficial for a specific model (and latency is not an issue), then you'd want a server board with quad-channel (or more) instead.

Also, the chances of GPUs on a CPU being significantly powerful (outside of AMD APU-type products) are very small IMO.


Here are two articles, both of which hint at memory bandwidth being critical:

https://finbarr.ca/how-is-llama-cpp-possible/

Quote

As the memory bandwidth is almost always much smaller than the number of FLOPS, memory bandwidth is the binding constraint. … Memory bandwidth is the limiting factor in almost everything to do with sampling from transformers

 

https://www.hardware-corner.net/guides/computer-to-run-llama-ai-model/

Quote

In situations where you use the CPU for inference, the bandwidth between the CPU and the memory is a critical factor. I'd like to emphasize its importance. When generating a single token, the entire model needs to be read from memory once. Suppose you have a Core i9-10900X (4 channel support) and DDR4-3600 memory; this means a throughput of 115 GB/s. If your model size is 13 GB, the inference speed will be around 9 tokens per second, regardless of how fast your CPU is or how many parallel cores it has.
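
To make that rule of thumb concrete, here is a minimal sketch of the estimate (my own illustration, not from either article): tokens/s is roughly memory bandwidth divided by model size.

def peak_bandwidth_gb_s(transfer_rate_mt_s: int, channels: int) -> float:
    """Theoretical peak: transfers per second x 8 bytes per transfer x channels."""
    return transfer_rate_mt_s * 8 * channels / 1000

def tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Rough upper bound: generating one token reads the entire model once."""
    return bandwidth_gb_s / model_size_gb

# The quoted example: quad-channel DDR4-3600 and a 13 GB model
bw = peak_bandwidth_gb_s(3600, 4)  # ~115.2 GB/s
print(f"~{tokens_per_second(bw, 13):.1f} tokens/s")  # ~8.9, matching the quoted ~9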

 


1 hour ago, Eigenvektor said:

On a consumer board, 2x48, 4x24 or 4x16 should all run in dual channel, so I'm not quite sure what you're trying to get at.

 

Dual channel doubles your bandwidth compared to single channel, so I don't see how having single channel with more sticks would improve anything. If bandwidth is beneficial for a specific model (and latency is not an issue), then you'd want a server board with quad-channel (or more) instead.

Thanks. The concept of ranks wasn't clear to me – how they relate to dual-channel mode, and whether having 4 ranks in total (2 dual-rank sticks, or 4 single-rank sticks) would somehow interfere with dual-channel-only operation, leading to lower performance when the number of ranks is higher than the number of channels, or to no visible difference at all. From all the information I've gathered, it has no impact at all. (Unless only one channel is populated, of course.)

 

 

1 hour ago, Eigenvektor said:

Here are two articles, both of which hint at memory bandwidth being critical:

https://finbarr.ca/how-is-llama-cpp-possible/

 

https://www.hardware-corner.net/guides/computer-to-run-llama-ai-model/

 

Thanks for the articles, I hadn't seen them. I've read them now and everything has become clearer. So it seems CPU frequency is not the key factor. As I understand it now, any CPU can manage without obvious problems regardless of core count, and not much difference should be visible between an i3/i5/i7/i9 (in theory). That question was also on my mind. But relative to model size, which can be up to 65 GB for a 70B 8-bit model, memory can become a problem – that's why I'm interested in upgrading to 96 GB in two sticks with a comparable CPU. Either way I'm going with something like an i7/i9 so it doesn't become a bottleneck later, rather than spending 10x the budget on GPUs now.
Yes, I know that running large models on the CPU is unconventional, but the hope of having lots of memory and relatively good processing on a low budget sounds too good. Maybe I should just come back to this thread once both new generations are released. The chances are small, but there are many ways they could improve – new research, papers, drivers, etc.
Sorry if I look like a kid among serious dudes; I'm really not an expert on either the hardware or the software side of the question. I just like LLMs, and I can see the difference between 34B Q8 and 70B Q2 – the 70B Q2 wins significantly despite taking up a similar amount of space on disk. Which is 'kinda a thing', at least for models created, trained, and processed similarly.


Let's try summoning @igormp, this thread seems to be right up his alley. 😄


3 minutes ago, kex243 said:

Thanks. The concept of ranks wasn't clear to me – how they relate to dual-channel mode, and whether having 4 ranks in total (2 dual-rank sticks, or 4 single-rank sticks) would somehow interfere with dual-channel-only operation, leading to lower performance when the number of ranks is higher than the number of channels, or to no visible difference at all. From all the information I've gathered, it has no impact at all. (Unless only one channel is populated, of course.)

Using two sticks on a CPU that has two memory channels means you're using dual channel, single rank per channel (one stick per channel).

Using four sticks with that CPU means dual channel, dual rank per channel (two sticks per channel).

 

However, depending on a memory stick's size, a single stick can already be dual rank (if it has memory ICs on both sides). That is typically only the case for 32 GB sticks or larger these days. So with two such sticks per channel you could be looking at quad rank.

 

You would probably need to find very specific benchmarks to find out how beneficial either of these configurations is. But there are likely much more worthwhile optimizations before such micro-optimizations start to make sense.


13 minutes ago, Biohazard777 said:

Let's try summoning @igormp, this thread seems to be right up his alley. 😄

Thanks for the summon haha

My actual area of research is computer vision using just GPUs, but I'll try to help with what I can 😛 

2 hours ago, kex243 said:

Does RAM frequency play a key role in generation speed?

Bandwidth sure is important for those LLMs, but the actual performance gains from going from 3200 to 3600MHz, or 5200MHz to 6000MHz should only net a 10~15% extra bandwidth that won't really translate into 10~15% extra perf. Going for more channels would be a better way to achieve more bandwidth (since you can actually double it).
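
To put rough numbers on that, here's a quick theoretical-peak sketch (my own illustration; real-world bandwidth lands below these figures):

# Theoretical peak bandwidth = transfer rate (MT/s) x 8 bytes per transfer x channels
for mt_s in (3200, 3600, 5200, 6000):
    print(f"DDR-{mt_s} dual channel: {mt_s * 8 * 2 / 1000:.1f} GB/s")
# -> 51.2, 57.6, 83.2 and 96.0 GB/s: 5200 -> 6000 is only ~15% more,
# while doubling the channel count doubles the figure outright.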

2 hours ago, kex243 said:

And what about RAM timings – do they impact generation speed significantly? As far as I understand, the difference between CL40 and CL32 timings might be minimal or negligible.

Yeah, it should be negligible, and at that point you'd be dealing with OC'ing and instabilities.

2 hours ago, kex243 said:

And motherboard chipsets – is there any reason to get a modern high-end one to avoid bandwidth issues of some kind (B760 vs. Z790, for example)? And then there's the standard Intel vs. AMD holy war for CPU processing, but more on that later.

No difference. Consumer platforms are dual channel only. The important thing is that if you're going with DDR5 and 128 or 192 GB, AMD is currently able to clock way higher than Intel and is more stable. AMD also has AVX512, which is a nice bonus.

2 hours ago, kex243 said:

2. I'm also intrigued by the idea of optimizing RAM setups for LLMs. For instance, is it more beneficial to have 4x24 GB or 4x16 GB sticks in single channel rather than 2x48 GB sticks in dual channel (at the maximum frequencies available on the market)? Could those arrangements improve bandwidth for LLM processing?

How much RAM do you need for your model? If you're going with Llama 70B quantized, then 64 GB should be more than enough, meaning you can go for 2x32GB at 6000MHz or more. However, if your model needs more RAM, then you'd have to sacrifice RAM speed to get more capacity, otherwise you'd be swapping out data and any extra frequency would be worthless.

 

As others said, any consumer platform is dual channel, and ranks won't really matter for this use case, so you should only worry about the amount of RAM needed per se.

2 hours ago, kex243 said:

3. Looking ahead, it's exciting to consider the upcoming 14th-gen Intel and 8000-series AMD CPUs. Rumors suggest these processors will feature integrated GPUs. It would be really interesting to explore how productive they are for LLM processing without requiring any additional GPUs – at least for a low-budget enthusiast like me =). This could potentially be a game-changer.

Still going to be slower than using a couple of 3090s/4090s. ROCm is also awful, and the iGPUs shouldn't really be that much faster than the current ones found in Ryzen 7000 and Intel 13th-gen CPUs. You don't see people using those currently, do you?

 

47 minutes ago, kex243 said:

Maybe I should just come back to this thread once both new generations are released. The chances are small, but there are many ways they could improve – new research, papers, drivers, etc.

I don't see how a new gen of CPUs will change anything; they're still going to be way slower than any GPU, and should only net a 10~20% perf increase over current CPUs for this use case.


25 minutes ago, Eigenvektor said:

However, depending on a memory stick's size, a single stick can already be dual rank (if it has memory ICs on both sides). That is typically only the case for 32 GB sticks or larger these days. So with two such sticks per channel you could be looking at quad rank.

 

You would probably need to find very specific benchmarks to find out how beneficial either of these configurations is. But there are likely much more worthwhile optimizations before such micro-optimizations start to make sense.

Yes, this – with 48 GB sticks it's clearly dual channel; I see no option other than having 4 ranks on a 2-channel consumer motherboard. Sad but true: I've seen no quad-channel DDR5 motherboards at all, at least in common consumer shops, never mind what they'd cost. Many thanks for clarifying ranks and timings.

Having read Igor's answer, I have less and less enthusiasm. Maybe I'll consider looking for some server solution with quad channel, but those prices... I wasn't expecting to do any overclocking; I just wanted fewer issues and a tested RAM frequency.
Maybe I have to lower my expectations of Q8 and Q4 quantization for large models and stop at 64 GB instead of 96, but with tighter timings and clocks (and price). I know about swap – I had to cut my Windows install down to fit a 70B Q2 into 32 GB and still had only 0.3 GB free, lol.
I'd rather listen to Igor, sure, but most articles reported that the Intel 13900 had higher stable RAM frequencies. And sure, AVX512 is the icing on the cake.
I'll report back here later when the new CPU generations are released – I just wanted Linus to do all the testing for me =). For now the closest test is archiving, which could be supplemented with LLM benchmarks, as you mentioned.
I'm only repeating rumors here, but even a 1660-level iGPU is better than nothing at such prices. I'll be waiting for both blue and red to have their turns.
Many thanks for doing God's work here.


12 minutes ago, kex243 said:

Having read Igor's answer, I have less and less enthusiasm. Maybe I'll consider looking for some server solution with quad channel, but those prices... I wasn't expecting to do any overclocking; I just wanted fewer issues and a tested RAM frequency.

Both Intel and AMD have high-channel-count memory platforms: for AMD it's the Threadripper platform with quad-channel DDR4, and Intel has its Xeon W with up to 56 cores and quad-channel DDR5.

AMD is expected to release new Threadrippers in the near future with updated features such as DDR5; they're expected with up to 64 cores, I believe, plus a Threadripper Pro with 8-channel memory and (maybe) up to 96 cores.


52 minutes ago, kex243 said:

but most articles reported that the Intel 13900 had higher stable RAM frequencies

For only 2 sticks, yes. But go for 4x32GB or 4x48GB and it goes the other way around.

 

53 minutes ago, kex243 said:

Maybe I'll consider looking for some server solution with quad channel, but those prices

You could look at older platforms, or even HEDT ones (like Xeon-W or Threadripper), but it's still going to be way more expensive than a regular platform (unless you go really old, and by then your CPU perf will be miles behind).

54 minutes ago, kex243 said:

I just wanted Linus to do all the testing for me

That's not really the kind of content LTT does. And even when they do, they often make mistakes or can't really be trusted for such tests. Better to look at Serve The Home, or the actual repos of the projects – people often post benchmarks there.

 

55 minutes ago, kex243 said:

I'm only repeating rumors here, but even a 1660-level iGPU is better than nothing at such prices. I'll be waiting for both blue and red to have their turns.

If you really want to do CPU inference, your best bet is actually to go with an Apple device lol

38 minutes ago, GOTSpectrum said:

Both Intel and AMD have high-channel-count memory platforms: for AMD it's the Threadripper platform with quad-channel DDR4, and Intel has its Xeon W with up to 56 cores and quad-channel DDR5.

Since TR 3000 there's been the Pro variant with octa-channel. The 7000-series TR should be able to have up to 12 channels if they just copy-paste their Epyc setup.

Current Xeon W offerings are also octa-channel.

39 minutes ago, GOTSpectrum said:

AMD is expected to release new Threadrippers in the near future with updated features such as DDR5; they're expected with up to 64 cores, I believe, plus a Threadripper Pro with 8-channel memory and (maybe) up to 96 cores.

I don't think non-Pro models will be a thing. 5000 series was already Pro-only, the non-pro variant makes no sense whatsoever.


58 minutes ago, igormp said:

or the actual repos of the projects

HEAVILY underrated in terms of looking at AI performance. It's either that or going to Elon Musk's Social Piss Factory, which is where I found the DDR5 performance boost on PyTorch.


  • 1 month later...
On 8/27/2023 at 9:27 PM, igormp said:

I don't think non-Pro models will be a thing. 5000 series was already Pro-only, the non-pro variant makes no sense whatsoever.

Well I'm glad I was on point with my prediction 


3 hours ago, GOTSpectrum said:

Well I'm glad I was on point with my prediction 

And I'm really salty at it, as you can see in their launch thread lol


32 minutes ago, igormp said:

And I'm really salty at it, as you can see in their launch thread lol

I agree... the right thing to do would be to just have one platform with all options open. 

 

It just feels like a money grab 


  • 2 weeks later...

I wish I had found this thread earlier. It has all the info I painstakingly gathered over days.

I too have been looking into running LLMs on CPU for inference (on a budget), since it's much cheaper to add a bunch of RAM than GPUs.

 

imo octa-channel is where it starts to get interesting, since memory bandwidth seems to be the limiting factor. TR 3000 is obviously not a budget option. TR 1000/2000 is only quad channel. I have not yet found "cheap" used Intel offerings.

 

Epyc 7002, though, offers octa-channel, goes up to a theoretical 204.8 GB/s of bandwidth, and can be found on eBay at somewhat reasonable prices. An Epyc 7302 + Gigabyte MZ32-AR0 motherboard can be had for ~400 USD plus customs. You'll need 3200 MT/s ECC RAM for that, which isn't necessarily cheap, but maybe you could get away with slower single-rank modules and overclocking.

 

Taking a 70B Q4_K_M model at ~44 GB of RAM, you should theoretically achieve 4.6 t/s; I'm not sure how much less it would be in practice. You'd need two 3090s to run this on GPU, and context could eat up the remaining 4 GB. A Q5_K_M at ~51 GB simply wouldn't fit. I'd consider 4 t/s to still be acceptable. Not sure about time to first token.
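
(Sanity-checking those numbers against the bandwidth-bound rule of thumb quoted earlier in the thread – theoretical upper bounds only:)

# Epyc 7002: 8 channels of DDR4-3200, theoretical peak
bandwidth = 8 * 3200 * 8 / 1000  # 204.8 GB/s

for name, size_gb in (("70B Q4_K_M", 44), ("70B Q5_K_M", 51)):
    print(f"{name}: ~{bandwidth / size_gb:.1f} t/s upper bound")
# -> ~4.7 and ~4.0 t/s; real throughput will land below these peaks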

 

Now what I'm still wondering is: would using a dual-socket motherboard with 2x Epyc 7002 also double the bandwidth, and can llama.cpp make use of it?

 

In the end I'm not sure I want to go for it, though. Adding in 8 sticks of 3200 MT/s ECC RAM, a cooler, case, PSU etc., the "budget" machine quickly gets closer to $1k, which is a bit much for a project purely for fun.


  • 3 weeks later...

Regarding Next-Gen Prospects (and even This-Gen Prospects): Xeon Sapphire Rapids has AMX (Advanced Matrix Extensions) instructions. These can give you a 10x performance boost when you compare FP32 with BF16, and a 4x performance boost when you compare AVX512 BF16 with AMX BF16.

With AMX enabled, Sapphire Rapids is much, much faster than EPYC Genoa in deep learning.
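
If you want to check whether a given Linux box actually exposes AMX before counting on it, here's a quick sketch (assuming the standard /proc/cpuinfo flag names on recent kernels):

# Check /proc/cpuinfo for the AMX feature flags (Linux)
with open("/proc/cpuinfo") as f:
    flags = set(f.read().split())

for feature in ("amx_tile", "amx_bf16", "amx_int8"):
    print(f"{feature}: {'present' if feature in flags else 'not found'}")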

 

 

This is my first post, hello to the Linus Tech Tips community 🙂

 


  • 2 months later...

I was thinking of the 8000-series Ryzens, but I'll wait for their Strix Halo, the '4060/4070-laptop analogue'. Maybe by then there will be software to use it with. Waiting for DDR6 would be either too costly or too late. I'll put a big fan against my motherboard to squeeze everything out of it.


  • 2 weeks later...

So how do you people reckon things will evolve for CPU-side LLMs? Now that stuff like 8x7B models are a thing, which afaik are better at utilizing RAM, will development continue in this direction? Just wondering if in a year or so the most cost-efficient consumer LLM build will be a decent amount of very fast RAM combined with a single fast GPU, rather than stacking up 3090s and 3060s. Hypothetically speaking, of course. No one should build a PC around such speculation.

 

If future models get better at prioritizing what to put into VRAM and what to cycle out to RAM, this could be the way of the future. The idea of forever-ballooning model sizes, VRAM requirements, and extreme quantization to make models somehow fit in someone's VRAM does not seem sustainable to me, especially for consumer self-hosted use cases.

I have no idea. I am just a tech tourist that uses the tools without understanding the theory behind any of this.


https://arxiv.org/abs/2402.17764

So this seems to be the latest hyped-up thing in the LLM sphere: ternary '1-bit' LLMs, which supposedly achieve quality very similar to FP16 models. If this pans out, the implications for running LLMs on CPU could be huge?


3 hours ago, CatHerder said:

https://arxiv.org/abs/2402.17764

So this seems to be the latest hyped-up thing in the LLM sphere: ternary '1-bit' LLMs, which supposedly achieve quality very similar to FP16 models. If this pans out, the implications for running LLMs on CPU could be huge?

That doesn't make things faster, it just makes them require less memory.

In fact, it's likely it actually makes things slower, since you'll have some overhead from converting the data to a unit of data suitable for processing (int8/fp8 or whatever your hardware has support for).

