
Intel releases new Advanced Performance Extensions and AVX10 extensions for their future CPUs

igormp

I'm not a fan of vector ISA extensions, especially GPU-like instructions such as wide vector multiplication. I prefer custom ISA extensions that you can implement for exactly what you need.

 

We are in the chiplet era. The massive ALUs are in the GPU; there's no need to make CPU cores less efficient to handle GPU-like workloads they are dreadful at. Why not have deeper integration between the CPU and the iGPU?

 

Make a special L3 cache shared with the iGPU, so it becomes cheap for the CPU and GPU to keep each other fed without going to memory.

Massive L3 cache chiplet, CPU chiplet and GPU chiplet all glued by the IO die. Let the iGPU handle everything that is vector MAC.

And because DDR is much cheaper than GDDR, you enable running big ML models on APUs. All without touching the AMD64 ISA.


12 minutes ago, 05032-Mendicant-Bias said:

I'm not a fan of vector ISA extensions, especially GPU-like instructions such as wide vector multiplication. I prefer custom ISA extensions that you can implement for exactly what you need.

 

We are in the chiplet era. The massive ALUs are in the GPU; there's no need to make CPU cores less efficient to handle GPU-like workloads they are dreadful at. Why not have deeper integration between the CPU and the iGPU?

 

Make a special L3 cache shared with the iGPU, so it becomes cheap for the CPU and GPU to keep each other fed without going to memory.

Massive L3 cache chiplet, CPU chiplet and GPU chiplet all glued by the IO die. Let the iGPU handle everything that is vector MAC.

And because DDR is much cheaper than GDDR, you enable running big ML models on APUs. All without touching the AMD64 ISA.

You just described AMD's MI300 lol

 

And Nvidia's Grace Hopper too, to a lesser degree.



29 minutes ago, 05032-Mendicant-Bias said:

We are in the chiplet era. The massive ALUs are in the GPU; there's no need to make CPU cores less efficient to handle GPU-like workloads they are dreadful at. Why not have deeper integration between the CPU and the iGPU?

For this type of case, execution isn't the biggest problem, but getting data to where it needs to be is.

 

The only way to have deep enough integration to make this worth doing is a high-end APU, like the Apple M series or the one in the PS5. Basically you'd have to give up GPU and RAM upgradability. For a lot of use cases, like laptops, this could be fine. How many would accept that route on a desktop?

 

It isn't happening on dGPUs. Bandwidth over PCIe isn't good enough. Even 5.0 x16 is less bandwidth than a typical DDR5 system, which itself is already inadequate.

 

My suspicion is that iGPUs are too feeble to replace the CPU in this use case. Taking my 5-year-old 7920X as an example, it runs at 3.4 GHz on AVX2 workloads, has 12 cores, and in theory has a peak rate of 8 FP64 operations per clock, double that if you count FMA as two, which is common. That's around 650 GFLOPS. Double the peak IPC again with AVX-512, but the clock drops to 2.9 GHz, for just over 1.1 TFLOPS.
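Working those peak numbers through (a quick sketch; the two-FMA-units-per-core figure is my assumption about the 7920X, and real workloads land well below these peaks):

```python
def peak_fp64_gflops(cores, clock_ghz, fma_units, fp64_lanes_per_unit):
    """Peak FP64 GFLOPS, counting each FMA lane as 2 FLOPs (multiply + add)."""
    return cores * clock_ghz * fma_units * fp64_lanes_per_unit * 2

# Figures quoted above: 12 cores, assumed 2x 256-bit FMA units per core
# (4 FP64 lanes each) at 3.4 GHz for AVX2, 2x 512-bit at 2.9 GHz for AVX-512.
print(peak_fp64_gflops(12, 3.4, 2, 4))  # ~653 GFLOPS with AVX2
print(peak_fp64_gflops(12, 2.9, 2, 8))  # ~1114 GFLOPS (~1.1 TFLOPS) with AVX-512
```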

 

My 4070? Rated at 504 GFLOPS. 7900 XT does a bit better, at 1.6 TFLOPS. Again, these are peak rates, and practical rates may be much lower. Basically most iGPUs would be really slow even before you factor in the penalties involved with moving data around.

 

Note that in the above I'm looking only at FP64, which is a strong point for CPUs and a weak point for consumer GPUs. If you drop to FP32, then lower-end GPUs might be viable.

 

With current tech, it basically introduces a lot of complexity for worse performance than keeping it on the CPU.



6 hours ago, igormp said:

But what I haven't properly wrapped my head around is that there are two subsets of AVX10: the 256-bit and the 512-bit one. The 256-bit one is going to be supported in all cores after its debut, but are the E cores able to run the 512-bit subset somehow? That wasn't clear from Intel's docs (at least to me).

I would say that for the near future E-cores will only support 256-bit, so that's also what client CPUs will be limited to. Server is where the 512-bit support will be. Intel is looking at doing all-E-core Xeons as well, so maybe at that point 512-bit will get supported, or that could also be a server-only subvariant.


4 hours ago, leadeater said:

I would say that for the near future E-cores will only support 256-bit, so that's also what client CPUs will be limited to. Server is where the 512-bit support will be. Intel is looking at doing all-E-core Xeons as well, so maybe at that point 512-bit will get supported, or that could also be a server-only subvariant.

While that happens, AMD supports avx512 on all the latest processors, including Zen4c, without power shenanigans like Intel. Not sure what they did, but it's fantastic work. 


51 minutes ago, Forbidden Wafer said:

While that happens, AMD supports avx512 on all the latest processors, including Zen4c, without power shenanigans like Intel. Not sure what they did, but it's fantastic work. 

Intel solved the power issues when doing avx512 some generations ago, if I'm not mistaken.



1 hour ago, Forbidden Wafer said:

While that happens, AMD supports avx512 on all the latest processors, including Zen4c, without power shenanigans like Intel. Not sure what they did, but it's fantastic work. 

 

40 minutes ago, igormp said:

Intel solved the power issues when doing avx512 some generations ago, if I'm not mistaken.

AVX512 power hasn't been a problem for ages, and the default clock offset is no longer set. I think the option still exists but it's not used by default. Per-core AVX512 on Intel is quite a lot faster; Intel's focus for more than a decade has been per-core performance rather than minimizing power or die area for the core. Intel CPUs are also way faster in small INT/FP workloads (AI/ML).

 

[Benchmark charts from the Phoronix Xeon Platinum 8490H review linked below]

https://www.phoronix.com/review/intel-xeon-platinum-8490h

 

AMD is faster for general/most workloads almost all the time, but there's stuff you want to run on Intel CPUs and not AMD, and also not on a GPU. Also, if you are under a per-core licensing model for anything, you'll be comparing CPUs with exactly the same core count rather than pitting 96-core CPUs against 60-core CPUs, etc.


2 hours ago, leadeater said:

 

AVX512 power hasn't been a problem for ages, and the default clock offset is no longer set. I think the option still exists but it's not used by default. Per-core AVX512 on Intel is quite a lot faster; Intel's focus for more than a decade has been per-core performance rather than minimizing power or die area for the core. Intel CPUs are also way faster in small INT/FP workloads (AI/ML).

 


https://www.phoronix.com/review/intel-xeon-platinum-8490h

 

AMD is faster for general/most workloads almost all the time, but there's stuff you want to run on Intel CPUs and not AMD, and also not on a GPU.

It's a niche market where this type of hardware will be hosted in the cloud (datacenter). Not sure why Intel doesn't break out the AI acceleration into a chiplet design bolted onto a Xeon CPU die. At least they're leaning in that direction already. But as it stands, they're half-assing the implementation regardless of the impressive numbers so far.

AI processing is a "go big or go home" venture.


5 minutes ago, StDragon said:

AI processing is a "go big or go home" venture.

Don't forget these 8490H's can be 2P, 4P, 6P & 8P 🙂

 

AMD is still 2P max.

 

Intel's approach will rapidly become less niche when they get HBM on Xeons sorted out in the market properly. AMD is right there now too, but they are mostly relying on the GPU die for it. It will be rather interesting to see practical workload performance evaluations that are realistic and include all the actual steps involved.

 

Everyone has assumed for a while now that GPUs are required and/or better for AI/ML, but it's quickly becoming apparent that isn't universally the case. People are now optimizing for CPU again and getting huge gains, along with memory capacity benefits and not needing to move data around as much.


12 hours ago, porina said:

For this type of case, execution isn't the biggest problem, but getting data to where it needs to be is.

Indeed.

Let's say the CPU fetches a 100MB batch from DDR that now sits in the shared L3, and the GPU is synced to work on that batch with its ALUs. When done, new data is fetched by the CPU from DDR and the previous results are retired, with the cache committing the changes to memory. One write and one read have been avoided thanks to the shared cache, and the GPU's beefy ALUs have done the operations quickly. The GPU basically worked as an accelerator glued to the CPU, like in the old 486 days with the math coprocessor that handled some math instructions outside the CPU.
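As a conceptual sketch of that flow (all the function names here are made up for illustration; no real API is implied):

```python
# Hypothetical double-buffered pipeline: while the GPU chews on one batch held
# in the shared L3, the CPU prefetches the next batch from DDR.
def run_pipeline(batches, fetch_from_ddr, gpu_compute, retire):
    current = fetch_from_ddr(batches[0])      # lands in the shared L3
    for nxt in batches[1:]:
        result = gpu_compute(current)         # GPU ALUs work out of the L3
        prefetched = fetch_from_ddr(nxt)      # CPU fetches the next batch meanwhile
        retire(result)                        # cache commits results to memory
        current = prefetched
    retire(gpu_compute(current))

# Dummy stand-ins, just to show the call order:
run_pipeline(
    batches=[b"batch0", b"batch1", b"batch2"],
    fetch_from_ddr=lambda b: b,
    gpu_compute=lambda b: b.upper(),
    retire=lambda r: print("retired", r),
)
```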

I find that more desirable than adding more generic vector instructions to the CPU, which add complexity to the already uber-complex x86-64 decoder. It's good that the new extension doesn't impose huge requirements on ALU width, but if someone is serious about doing batches of vector operations, they would rather have a serious ALU to back it up.

1 hour ago, StDragon said:

AI processing is a "go big or go home" venture.

Training, yes, but inference and fine-tuning are workloads that are gaining ground on consumer CPUs and GPUs. This extension, I think, is targeted at eventually having some vector support on P and E cores for these types of workload.

DDR channels are so much cheaper than GDDR channels. You can economically put together 64GB of DDR plus a CPU for much less than the cost of a 24GB GPU.

The endgame is getting to the point where a serious personal assistant with a bazillion parameters can run locally on a phone.


Another observation is that most of the data for inference is read only. I suspect a properly placed flash chiplet glued on top of an ALU would massively reduce the costs of storing parameters.


44 minutes ago, 05032-Mendicant-Bias said:

Let's say the CPU fetches a 100MB batch from DDR that now sits in the shared L3, and the GPU is synced to work on that batch with its ALUs. When done, new data is fetched by the CPU from DDR and the previous results are retired, with the cache committing the changes to memory. One write and one read have been avoided thanks to the shared cache, and the GPU's beefy ALUs have done the operations quickly. The GPU basically worked as an accelerator glued to the CPU, like in the old 486 days with the math coprocessor that handled some math instructions outside the CPU.

A GPU doesn't have access to the CPU's L3 cache; it doesn't even necessarily have access to what is in CPU memory, even when sharing the same physical RAM/HBM.

 

ROCm does have unified memory support, but not all "unified memory" implementations are equal, and since AMD GPUs see so little use in this industry, there is very little information on it.

 

44 minutes ago, 05032-Mendicant-Bias said:

I find that more desirable than adding more generic vector instructions to the CPU, which add complexity to the already uber-complex x86-64 decoder. It's good that the new extension doesn't impose huge requirements on ALU width, but if someone is serious about doing batches of vector operations, they would rather have a serious ALU to back it up.

As I've mentioned, GPUs being best is an assumption that is rapidly being broken. GPUs will need to deal with wider data execution to offset CPUs getting faster and the optimizations that use the wider capabilities of CPUs. The long-standing issue, however, is that the more you try to make a GPU do, the more complex and expensive it gets, while it still isn't the main system processor and doesn't hold main system memory.

 

Quote

GPUs are known for being significantly better than most CPUs when it comes to AI deep neural networks (DNNs) training simply because they have more execution units (or cores). But a new algorithm proposed by computer scientists from Rice University is claimed to actually flip the tables and make CPUs a whopping 15 times faster than some leading-edge GPUs. 

 

Quote

Anshumali Shrivastava, an assistant professor of computer science at Rice's Brown School of Engineering, and his colleagues have presented an algorithm that can greatly speed up DNN training on modern AVX512 and AVX512_BF16-enabled CPUs. 

 

Quote

To prove their point, the scientists took SLIDE (Sub-LInear Deep Learning Engine), a  C++ OpenMP-based engine that combines smart hashing randomized algorithms with modest multi-core parallelism on CPU, and optimized it heavily for Intel's AVX512 and AVX512-bfloat16-supporting processors. 

 

Quote

"We leveraged [AVX512 and AVX512_BF16] CPU innovations to take SLIDE even further, showing that if you aren't fixated on matrix multiplications, you can leverage the power in modern CPUs and train AI models four to 15 times faster than the best specialized hardware alternative."

 

Quote

The results they obtained with Amazon-670K, WikiLSHTC-325K, and Text8 datasets are indeed very promising with the optimized SLIDE engine. Intel's Cooper Lake (CPX) processor can outperform Nvidia's Tesla V100 by about 7.8 times with Amazon-670K, by approximately 5.2 times with WikiLSHTC-325K, and by roughly 15.5 times with Text8. In fact, even an optimized Cascade Lake (CLX) processor can be 2.55–11.6 times faster than Nvidia's Tesla V100.

https://www.tomshardware.com/news/cpu-vs-gpu-ai-performance-uplift-with-optimizations

 

Quote

Six months ago, Neural Magic shared remarkable MLPerf results, with a 175X increase in CPU performance, attained using sparsity. This breakthrough was achieved exclusively with software, using sparsity-aware inferencing techniques. The impressive outcomes showcased the potential of network sparsity to enhance the performance of machine learning models on readily available CPUs. This advancement empowers individuals and businesses to deploy scalable, high-speed, and accurate machine learning (ML) models without investing in costly hardware accelerators. Building upon our previous submission, we are thrilled to reveal results that showcase a 6X improvement, elevating our overall CPU performance boost to an astounding 1,000X.

 

Quote

This year, Neural Magic used 4th Gen AMD EPYC™ processors for our benchmark testing. Neural Magic’s software stack takes advantage of continued innovations in the 4th Gen AMD EPYC™ processors, such as AVX-512 and VNNI instructions, as well as advanced features like highly performant DDR5 memory and a core count up to 96 cores, to unlock new possibilities for delivering better than GPU speeds on x86 CPUs.

 

[Charts: Neural Magic Scales Up MLPerf Inference]

https://neuralmagic.com/blog/neural-magic-scales-up-mlperf-inference-performance-with-demonstrated-power-efficiency-no-gpus-needed/
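To make the VNNI angle concrete, here's a toy NumPy illustration of the int8 multiply-accumulate pattern that VNNI (VPDPBUSD) fuses into a single instruction. It's only a sketch of the data types involved, not Neural Magic's actual code, and the scales are made up:

```python
import numpy as np

# Toy int8 quantised dot product: multiply int8/uint8 pairs, accumulate into
# int32 - the pattern VNNI accelerates in hardware. Illustration only.
rng = np.random.default_rng(0)
w = rng.integers(-128, 127, size=4096, dtype=np.int8)   # quantised weights
x = rng.integers(0, 255, size=4096, dtype=np.uint8)     # quantised activations

acc = np.dot(w.astype(np.int32), x.astype(np.int32))    # int32 accumulator
scale_w, scale_x = 0.02, 0.05                            # made-up dequant scales
print(acc * scale_w * scale_x)                           # approximate FP result
```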

 

Your GPU isn't necessarily better than your CPU, particularly not when it's a consumer GPU with features locked out or hardware not present at all compared to datacenter compute GPUs. And this performance can be had even with your desktop CPU.


27 minutes ago, leadeater said:

Your GPU isn't necessarily better than your CPU, particularly not when it's a consumer GPU with features locked out or hardware not present at all compared to datacenter compute GPUs. And this performance can be had even with your desktop CPU.

 

Fixed logic is still fixed logic. If I buy a desktop CPU and it lacks AVX512, and I buy a GPU and it lacks CUDA, I'm basically up the creek with no AI paddles. Perhaps one day AMD will stick with a compute stack that others will develop against. It's taken quite a long time for AMD to be supported for GPU video encoding in OBS as well, which again is because AMD doesn't commit to one thing long enough.

 

Each type of data (text, music, speech, photos, video) doesn't always call for the same neural network type. Right now everyone is focused on transformers (which is what GPT, Stable Diffusion, and VITS are all based on). Something else may come along and prove to be better than the transformer.

 

https://www.intel.com/content/www/us/en/developer/articles/technical/optimize-transformer-model-inference-processors.html#gs.31ewr3

 

Intel has (as far as I can tell) upstreamed stuff into PyTorch and TensorFlow. But that still requires using that code path, whereas most ML stuff just goes straight to "Use CUDA? YES, else FAIL".
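i.e. something like this kind of device fallback, which a lot of scripts never bother with (a minimal PyTorch sketch; the CPU branch is where the AVX/oneDNN-backed kernels would end up being used):

```python
import torch

# What a lot of ML code does: hard-require CUDA. A friendlier fallback keeps
# a CPU path alive so AVX-512/VNNI-capable processors can still run the model.
def pick_device() -> torch.device:
    if torch.cuda.is_available():
        return torch.device("cuda")
    return torch.device("cpu")  # CPU path still benefits from AVX2/AVX-512 kernels

device = pick_device()
model = torch.nn.Linear(512, 512).to(device)
x = torch.randn(8, 512, device=device)
with torch.inference_mode():
    y = model(x)
print(device, y.shape)
```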

 

 


2 hours ago, Kisai said:

Fixed logic is still fixed logic. If I buy a desktop CPU and it lacks AVX512, and I buy a GPU and it lacks CUDA, I'm basically up the creek with no AI paddles.

All Zen 4 parts, including Ryzen, have it, and Intel will have it too with either subset of AVX10.

 

2 hours ago, Kisai said:

Perhaps one day AMD will stick with a compute stack that others will develop against

You should read my post above then, since it shows people optimizing for AMD EPYC, and that's also applicable to Ryzen as they have exactly the same capabilities.


7 hours ago, Forbidden Wafer said:

While that happens, AMD supports avx512 on all the latest processors, including Zen4c, without power shenanigans like Intel. Not sure what they did, but it's fantastic work. 

While I haven't directly tested Zen 4 personally, I doubt AMD have changed the laws of physics here. The last time I tested this was on Zen 2, and I could check on my Zen 3 laptop. Since AMD CPUs typically run on a relatively low power limit (or thermal limit with Zen 4), you don't see the higher power levels Intel CPUs may reach if set at higher limits, as is common on enthusiast configurations. Instead, AMD CPUs just drop clock. It isn't a fixed offset but varies with the set limiters. It's too long ago for me to recall the exact numbers, but even with AVX2, my old 3700X ran somewhere above 4 GHz with a light load such as Cinebench R15, yet put a heavy AVX load on it like Prime95 and it would drop several hundred MHz. I suspect similar will happen on Zen 4, especially with AVX-512. This can be easily tested by anyone with a Zen 4.
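A rough way anyone could eyeball it, assuming numpy (whose BLAS backend uses AVX2/AVX-512 where present) and psutil are installed; a proper test would use Prime95 plus a tool like HWiNFO, and psutil's frequency readout can be coarse or missing on some platforms:

```python
import threading, time
import numpy as np
import psutil

# Hammer the FP units with big matmuls (heavy AVX work via numpy's BLAS backend)
# while sampling the reported CPU clock, to see how far it droops under load.
def avx_load(stop_event):
    a = np.random.rand(2048, 2048)
    while not stop_event.is_set():
        a @ a  # result discarded; we only care about generating load

def clock_mhz():
    freq = psutil.cpu_freq()  # may be None or coarse on some platforms
    return freq.current if freq else float("nan")

stop = threading.Event()
worker = threading.Thread(target=avx_load, args=(stop,), daemon=True)

print("idle clock:", clock_mhz(), "MHz")
worker.start()
time.sleep(5)  # let clocks settle under load
print("loaded clock:", clock_mhz(), "MHz")
stop.set()
```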

 

4 hours ago, StDragon said:

Not sure why Intel doesn't break out the AI acceleration into a chiplet design bolted onto a Xeon CPU die. At least they're leaning in that direction already. But as it stands, they're half-assing the implementation regardless of the impressive numbers so far.

I guess it comes down to which bit of AI? There's already the VNNI extension. Weren't they looking at dedicated "accelerators" for various tasks on the latest Xeons? Not an area I pay close attention to.

 

2 hours ago, 05032-Mendicant-Bias said:

The GPU basically worked as an accelerator glued to the CPU, like in the old 486 days with the math coprocessor that handled some math instructions outside the CPU.

The FPU was eventually integrated as standard. Likewise, many other features that were external to the "CPU" got absorbed over time. The north bridge is now in the CPU. The L2 cache is in the CPU. Maybe some day we'll have big GPUs integrated as standard, but that's some way off.

 

Then again, if you integrate it closely enough, it becomes an extension of the CPU.

 

2 hours ago, 05032-Mendicant-Bias said:

I find that more desirable than adding more generic vector instructions to the CPU, which add complexity to the already uber-complex x86-64 decoder.

Like it or not, I feel that is still the main way to get more performance out of CPUs: find use cases that people could use, and make them faster. My personal interest focuses more on FP64, so AI is kinda pushing the industry in the wrong direction for me with its smaller data sizes. I had wondered what my "ideal" compute CPU would be like. Big FP64 units. Minimal other stuff supporting it. Slap it on a big lump of cache. Oh, I just described Knights Landing; Intel had already done it. Why not a GPU? For reasons I don't claim to understand, while some GPUs have massive FP64 throughput, they can't attain the speeds CPUs can. Something to do with the software scaling perhaps: one task spread over multiple cores is generally worse than one task per core, provided that core isn't externally resource constrained.

 

1 hour ago, leadeater said:

A GPU doesn't have access to the CPU's L3 cache; it doesn't even necessarily have access to what is in CPU memory, even when sharing the same physical RAM/HBM.

The earlier talk was about a hypothetical "closely integrated" CPU and GPU arrangement, perhaps along the lines of Apple M and PS5 SOC.



58 minutes ago, porina said:

The earlier talk was about a hypothetical "closely integrated" CPU and GPU arrangement, perhaps along the lines of Apple M and PS5 SOC.

Only Apple SoCs have a shared L3 cache between the CPU and GPU, since that exists in the SLC that sits in front of memory, which all memory accesses go through. The PS5 SoC has a dedicated L3 cache for the CPU only; the GPU doesn't utilize it at all.

 

For the most part you wouldn't want the LLC/L3 caches shared, since that is only useful for compute workloads and data sharing between the CPU and GPU; for everything else, the data in the LLC would only be useful to the CPU, and vice versa. So basically it would make sense on a datacenter SoC like AMD and Nvidia are doing, but not a whole lot on a consumer desktop, realistically.


30 minutes ago, leadeater said:

Only Apple SoCs have a shared L3 cache between the CPU and GPU, since that exists in the SLC that sits in front of memory, which all memory accesses go through. The PS5 SoC has a dedicated L3 cache for the CPU only; the GPU doesn't utilize it at all.

Does PS5 SOC apply similarly to AMD desktop/laptop APUs too? 

 

30 minutes ago, leadeater said:

For the most part you wouldn't want the LLC/L3 caches shared, since that is only useful for compute workloads and data sharing between the CPU and GPU; for everything else, the data in the LLC would only be useful to the CPU, and vice versa. So basically it would make sense on a datacenter SoC like AMD and Nvidia are doing, but not a whole lot on a consumer desktop, realistically.

That was the scenario painted by @05032-Mendicant-Bias in an earlier post - offload vector calculations to GPU instead of pushing more into the CPU. To minimise the cost of transferring data, utilise a shared cache between the two.



4 hours ago, Kisai said:

Each type of data (text, music, speech, photos, video) doesn't always call for the same neural network type. Right now everyone is focused on transformers (which is what GPT, Stable Diffusion, and VITS are all based on). Something else may come along and prove to be better than the transformer.

Minor correction: Stable diffusion is, well, a diffusion model, unrelated to transformers.



1 hour ago, porina said:

Does PS5 SOC apply similarly to AMD desktop/laptop APUs too? 

SoC design-wise it's the same as the APUs etc. The PS5 has customized FPUs in its CPU core architecture; it's Zen 2 with some Zen 3isms and a GDDR memory interface. Anyway, it's not similar to the Apple M series.

 

1 hour ago, porina said:

That was the scenario painted by @05032-Mendicant-Bias in an earlier post - offload vector calculations to GPU instead of pushing more into the CPU. To minimise the cost of transferring data, utilise a shared cache between the two.

I know; it's just that sharing caches between GPU and CPU is non-trivial, and if you can get more than acceptable performance out of a CPU, why go through all the effort and complexity? The hardware side of things isn't the only part of the picture, as evidenced by these recent CPU optimizations, but that along with GPUs isn't a good reason for things like VNNI and AVX10/AVX512 not to exist. Add to that, not everyone is going to have an "RTX 3060", but they will all have a CPU, and if you can get a middle-of-the-road CPU performing the same tasks as well as, say, an RTX 2060, does it matter that it's not as fast as an RTX 4090 or H100? If the performance is good enough, why worry that there is something faster available?

 

Those who buy an RTX 3060/4060 don't worry that much about the existence of RTX 4090s; they were never an option due to price.

 

So why lock all the capabilities behind exclusive hardware devices that aren't ubiquitous or affordable? I'm sure Nvidia and AMD/Radeon would love that, and they spend lots of marketing money to tell you so and make you believe it, but should we listen to them so much?

 

Consideration also has to be given to whether, if you can get these kinds of performance levels out of CPUs, you need to offload to the GPU anymore. Is the reason and desire to do so still valid, or are you doing it because it wasn't possible before? Is a solution being sought for a problem that no longer exists?


Intel had a problem with the AVX512 extension when trying to make E cores, and had to drop support on the P cores as well.

This latest AVX10 revision is meant to restore support in a way that E cores can implement. It's why I don't like standard ISA extensions that add expensive, not-so-ubiquitous instructions: now Intel has caused all sorts of compatibility and porting issues for the people who use those powerful vector instructions.

 

The x86-64 vector extensions don't even have trigonometric cos and sin instructions for this reason: they are too expensive for such non-ubiquitous use. I like the RISC-V and Arm approach of reserving instruction space for custom instructions for those who need it. Xeons even have FPGA options for implementing custom accelerators, which I like a lot.

 

It's why I would argue that CPUs would be better off doing only the ubiquitous instructions as efficiently as possible, and be paired with dedicated accelerators for the expensive workloads. I proposed a shared L3 cache between the CPU and iGPU as a way to pull off vector acceleration without adding complexity to the decoder and execution units of the CPU. I would prefer this approach over AVX10 because it would push the development of heterogeneous computing and open compute. You have all those iGPU transistors sitting there, mostly doing nothing. With better drivers and memory subsystem integration, they could be used to accelerate things like vector operations and so much more, and not be dead silicon. I think the chiplet approach lends itself very well to this. Keep the CPU lean, generic and efficient. Everything else is a chiplet.
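For instance, even today a plain vector MAC can be pushed to the iGPU through OpenCL. A minimal sketch with pyopencl, assuming an OpenCL driver for the iGPU is installed; it only illustrates offloading vector work, not the shared-L3 idea itself:

```python
import numpy as np
import pyopencl as cl

# Run a vector MAC (d = a*b + c) on whatever OpenCL device is available,
# e.g. the iGPU. Illustration of offloading vector work only.
ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

n = 1 << 20
a, b, c = (np.random.rand(n).astype(np.float32) for _ in range(3))
mf = cl.mem_flags
bufs = [cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=x) for x in (a, b, c)]
out = cl.Buffer(ctx, mf.WRITE_ONLY, a.nbytes)

prog = cl.Program(ctx, """
__kernel void mac(__global const float *a, __global const float *b,
                  __global const float *c, __global float *d) {
    int i = get_global_id(0);
    d[i] = a[i] * b[i] + c[i];
}
""").build()

prog.mac(queue, (n,), None, *bufs, out)
res = np.empty_like(a)
cl.enqueue_copy(queue, res, out)
print(np.allclose(res, a * b + c))  # sanity check against the CPU result
```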


11 hours ago, 05032-Mendicant-Bias said:

Training, yes, but inference and fine-tuning are workloads that are gaining ground on consumer CPUs and GPUs. This extension, I think, is targeted at eventually having some vector support on P and E cores for these types of workload.

DDR channels are so much cheaper than GDDR channels. You can economically put together 64GB of DDR plus a CPU for much less than the cost of a 24GB GPU.

The endgame is getting to the point where a serious personal assistant with a bazillion parameters can run locally on a phone.


Another observation is that most of the data for inference is read only. I suspect a properly placed flash chiplet glued on top of an ALU would massively reduce the costs of storing parameters.

Von Neumann computers aren't cut out for the kind of AI that can train itself. Everyone wants to beat around that bush, but the fact is you need compute (neural processing) and storage within the same unit.

Neurons do just that: they weigh information and store the results in long protein chains (long-term memory).


11 hours ago, 05032-Mendicant-Bias said:

It's why I would argue that CPUs would be better off doing only the ubiquitous instructions as efficiently as possible, and be paired with dedicated accelerators for the expensive workloads. I proposed a shared L3 cache between the CPU and iGPU as a way to pull off vector acceleration without adding complexity to the decoder and execution units of the CPU. I would prefer this approach over AVX10 because it would push the development of heterogeneous computing and open compute. You have all those iGPU transistors sitting there, mostly doing nothing. With better drivers and memory subsystem integration, they could be used to accelerate things like vector operations and so much more, and not be dead silicon. I think the chiplet approach lends itself very well to this. Keep the CPU lean, generic and efficient. Everything else is a chiplet.

You could have said the same thing about SSE, SSE2, SSE3, SSE4.1, AVX and AVX2, though, and all of them are widely used and present in every processor. Something cannot be used if it's not present in hardware, so not including it guarantees it won't be used; it's a self-fulfilling prophecy.

 

I'd probably agree we don't need 512bit.

 

The thing is, even the old implementation with a dedicated AVX-512 execution unit was only a really small part of the CPU core and thus a tiny fraction of the total die area. Making all FP execution units 512-bit wouldn't actually be that costly in die area, which means moving it off to a chiplet becomes prohibitively expensive: either the chiplet is, practically speaking, too small, with most of its area spent on interconnect, or you need to put all the dedicated 512-bit FP execution units into one or two chiplets and then run into interconnect bandwidth and/or cost problems.

 

AVX extensions are actually near the top of the list of most important extensions for modern CPUs; we can't live without them. Golden Cove has AVX-512 in hardware, by the way (Alder Lake). Golden Cove wasn't made only for Alder Lake; it's primarily for Xeon, which only had Golden Cove cores.

 

I don't see any reason not to do AVX10, since it's not adding anything extra that is a problem if you stay with 256-bit, and we get to move on from the limitations of AVX2.

 

I'm not sure pulling execution units out into chiplets is such a great idea compared to chiplet core dies; you'd end up wasting tons of space on interconnect, which is really costly, while also potentially having bandwidth issues. Having all the important stuff inside the die itself and scaling out the number of those dies is a step down in complexity.

 

Also, once you are at 16 cores and higher, the FP performance of these CPUs is actually higher than almost every iGPU, so offloading to the iGPU would be slower, not faster, oddly enough. So do you really want a bigger iGPU, or more or faster CPU cores? On the server side of things that answer is already loud and clear; for desktop and laptop it's a lot more difficult, but laptop would be the strongest case for the iGPU, since core counts and power limits are low, meaning the gap between CPU and iGPU is vastly wider.

 

Anyway, there is no free lunch here; more of anything will always cost more, so making the iGPU bigger and more important is only going to drive up cost, more so than AVX10 256-bit will anyway.

