GPU vs NPU?

Haswellx86

NPUs are more specialized towards accelerating AI workloads. But then why is Nvidia still making GPUs for AI acceleration?

 

Fundamentally, at the hardware level, how exactly are they different from each other? I couldn't find good sources on this myself.

 

Can we expect Nvidia or someone else to launch PCIe cards, looking like graphics cards, that carry much more powerful discrete NPUs than the ones integrated in the latest AI processors? I mean, we already had such AI accelerators, like the ones for TensorFlow, but AI has progressed much further since then.

Microsoft owns my soul.

 

Also, Dell is evil, but HP kinda nice.


NPUs are much smaller and designed more around power efficiency. They're not built for really heavy lifting, but they're fine for more limited tasks. In Intel's tech demos, the software can use a mix of the AI capabilities of the CPU, NPU and GPU to get a job done faster when power efficiency is not the priority.

 

For context, Copilot+ requires 40+ TOPS. A 4070 is rated at 466 TOPS. NPUs in this year's CPUs (Ryzen AI, Lunar Lake, Snapdragon X Elite) are in the 40-50 TOPS region. First-gen NPUs from AMD and Intel were around 10 TOPS, from memory.
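For a sense of where those numbers come from, a TOPS figure is basically MAC units × 2 ops per MAC × clock. A minimal sketch; the unit counts and clocks below are illustrative assumptions, not vendor specs:

```python
# Rough TOPS estimate: ops/s = MAC units * 2 (multiply + accumulate) * clock.
# The unit counts and clocks are illustrative assumptions, not vendor specs.
def tops(mac_units: int, clock_ghz: float) -> float:
    return mac_units * 2 * clock_ghz * 1e9 / 1e12

print(f"hypothetical NPU : {tops(mac_units=12_288, clock_ghz=1.8):.0f} TOPS")  # ~44
print(f"hypothetical dGPU: {tops(mac_units=65_536, clock_ghz=1.8):.0f} TOPS")  # ~236
```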

 

[Image: Intel Core Ultra 200V series (Lunar Lake) launch NPU summary]

https://www.servethehome.com/intel-core-ultra-200v-series-lunar-lake-launched/

 

Illustration of the breakout of "AI" capability in Lunar Lake.

Gaming system: R7 7800X3D, Asus ROG Strix B650E-F Gaming Wifi, Thermalright Phantom Spirit 120 SE ARGB, Corsair Vengeance 2x 32GB 6000C30, RTX 4070, MSI MPG A850G, Fractal Design North, Samsung 990 Pro 2TB, Alienware AW3225QF (32" 240 Hz OLED)
Productivity system: i9-7980XE, Asus X299 TUF mark 2, Noctua D15, 64GB ram (mixed), RTX 3070, NZXT E850, GameMax Abyss, Samsung 980 Pro 2TB, iiyama ProLite XU2793QSU-B6 (27" 1440p 100 Hz)
Gaming laptop: Lenovo Legion 5, 5800H, RTX 3070, Kingston DDR4 3200C22 2x16GB 2Rx8, Kingston Fury Renegade 1TB + Crucial P1 1TB SSD, 165 Hz IPS 1080p G-Sync Compatible


I think you cannot use NPUs for training, only for inference. So essentially a waste of sand. Shame on you, chip manufacturers.

If a post resolved/answered your question, please consider marking it as the solution. If multiple answers solved your question, mark the best one as answer.


29 minutes ago, Gat Pelsinger said:

NPUs are more specialized towards accelerating AI workloads.

They are meant for inference, and often only meant to run quantized models (like int4 or int8). You won't be using those for training.
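To illustrate what "quantized" means here, a toy symmetric int8 round-trip in NumPy; this is a sketch of the idea only, not any vendor's actual toolchain:

```python
import numpy as np

# Toy symmetric int8 quantization: map float weights onto [-127, 127] with a
# single per-tensor scale, then dequantize to see the error the NPU lives with.
w = np.random.randn(4, 4).astype(np.float32)

scale = np.abs(w).max() / 127.0                      # one scale for the whole tensor
w_q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
w_dq = w_q.astype(np.float32) * scale                # what inference effectively uses

print("max abs error:", np.abs(w - w_dq).max())      # small but non-zero
```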

29 minutes ago, Gat Pelsinger said:

But then why is Nvidia still making GPUs for AI acceleration?

Those are used for training given their flexibility.

30 minutes ago, Gat Pelsinger said:

how exactly are they different from each other?

NPUs are just "dumb" INT4/INT8/FP8/FP16 (often just 1 or 2 of those types) ALUs, without much more to it.

GPUs have tons of ALUs with tons of supported data types and different acceleration paths (like taking care of sparse matrices and whatnot), as well as proper scheduling and multiple "threads".
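The sparse-matrix path usually means 2:4 structured sparsity: in every group of four weights only two are kept, so the hardware can skip half the multiplies. A toy NumPy illustration of applying that pattern (illustrative only, not Nvidia's actual pruning tooling):

```python
import numpy as np

def prune_2_of_4(w: np.ndarray) -> np.ndarray:
    """Zero the 2 smallest-magnitude weights in every group of 4 (toy 2:4 sparsity)."""
    groups = w.reshape(-1, 4).copy()
    smallest = np.argsort(np.abs(groups), axis=1)[:, :2]   # 2 smallest per group
    np.put_along_axis(groups, smallest, 0.0, axis=1)
    return groups.reshape(w.shape)

w = np.random.randn(8, 8).astype(np.float32)
w_sparse = prune_2_of_4(w)
print("zero fraction:", (w_sparse == 0).mean())            # 0.5 -> half the MACs skippable
```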

32 minutes ago, Gat Pelsinger said:

Can we expect Nvidia or someone else to launch PCIe cards, looking like graphics cards, that carry much more powerful discrete NPUs than the ones integrated in the latest AI processors?

No, that would be useless. The whole point of NPUs is to be small and not consume much power.

A big device on PCIe defeats all of that purpose.

32 minutes ago, Gat Pelsinger said:

I mean, we already had such AI accelerators, like the ones for TensorFlow, but AI has progressed much further since then.

Do you mean stuff like TPUs? Those are still used to this day.

How has AI progressed much further? Your models are still trained with TensorFlow or PyTorch in most cases; the fundamentals have been the same for years.
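To that point, the training loop itself is still the same few lines of PyTorch it has been for years. A minimal sketch with random data, just to keep it self-contained:

```python
import torch
from torch import nn

# Minimal supervised training loop: model, loss, optimizer, backprop.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

x, y = torch.randn(256, 16), torch.randn(256, 1)
for step in range(100):
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()   # the backward pass is the part NPUs generally don't accelerate
    opt.step()
print("final loss:", loss.item())
```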

 

9 minutes ago, anirudthelinuxwIzard said:

So essentially a waste of sand. Shame on you, chip manufacturers.

Why? No regular user is going to be training models; it would actually be useless to add hardware meant for training at a higher cost and higher power usage.

FX6300 @ 4.2GHz | Gigabyte GA-78LMT-USB3 R2 | Hyper 212x | 3x 8GB + 1x 4GB @ 1600MHz | Gigabyte 2060 Super | Corsair CX650M | LG 43UK6520PSA
ASUS X550LN | i5 4210u | 12GB
Lenovo N23 Yoga


25 minutes ago, porina said:

A 4070 is rated at 466 TOPS

Now I know the latest Nvidia GPUs have AI capabilities in them, but they are still traditional graphics cards. So with all this, does this mean that GPUs the way they are built, are still perfect for AI tasks, and don't really need a whole separate type of hardware to accelerate it better (except NPUs which are mainly for power efficiency rather than performance)?

 

So do the actual AI accelerators like the H100 or the new GB200 still have the traditional CUDA cores, with the tensor and maybe RT cores (do they have RT? that would be just a waste, isn't it?)? So does this mean you can still game on them perfectly fine? I know you can't game on an NPU that easily (if a game were coded against the NPU instructions, then maybe).

Microsoft owns my soul.

 

Also, Dell is evil, but HP kinda nice.


2 minutes ago, Gat Pelsinger said:

Now I know the latest Nvidia GPUs have AI capabilities in them, but they are still traditional graphics cards. So with all this, does this mean that GPUs the way they are built, are still perfect for AI tasks, and don't really need a whole separate type of hardware to accelerate it better (except NPUs which are mainly for power efficiency rather than performance)?

 

So do the actual AI accelerators like the H100 or the new GB200 still have the traditional CUDA cores, with the tensor and maybe RT cores (do they have RT? that would be just a waste, isn't it?)? So does this mean you can still game on them perfectly fine? I know you can't game on an NPU that easily (if a game were coded against the NPU instructions, then maybe).

There are 2 groups of "AI" workloads, training and inference, and they need different hardware. Most of the big Nvidia GPUs are used for training workloads.

The big server GPUs from Nvidia are a different architecture than the desktop parts, with no ray tracing, no display out (I think) and more, to save die space for general compute and AI accelerators.

While you could build a better card for AI only, these cards are also bought by other kinds of customers, so it's probably cheaper to have one die for many customers than to try to split it up into more different products. Look at something like the Google TPUs for an AI-only training chip, but I'm pretty sure you can only rent them in Google Cloud, not buy them.


1 minute ago, igormp said:

Those are used for training given their flexibility.

But then what hardware is used for running those models, for say, OpenAI giving us ChatGPT services? I think still GPUs, right? Wouldn't it be more efficient to develop specialized hardware for it, just like the new NPUs but more powerful?

 

4 minutes ago, igormp said:

No, that would be useless. The whole point of NPUs is to be small and not consume much power.

A big device on PCIe defeats all of that purpose.

I am asking for a more hardware accelerated solution for AI rather than using traditional GPUs. I suppose something like this hasn't come out, because GPUs are still just perfect for AI tasks? I kind of doubt that.

Microsoft owns my soul.

 

Also, Dell is evil, but HP kinda nice.


7 minutes ago, igormp said:

 

Why? No regular user is going to be training models, it'd be actually useless to add hw meant for training at a higher cost and more power usage.

Fixed-function hardware can only do one thing; it's way less wasteful to have the same silicon be able to do more.

If a post resolved/answered your question, please consider marking it as the solution. If multiple answers solved your question, mark the best one as answer.


3 minutes ago, Gat Pelsinger said:

I am asking for a more hardware accelerated solution for AI rather than using traditional GPUs. I suppose something like this hasn't come out, because GPUs are still just perfect for AI tasks? I kind of doubt that.

I guess, what is the market for this? There really aren't good programs for on-device AI, and if you're using a desktop PC with slots, your GPU is likely way faster than the NPU on a laptop chip, which is what these models are made for. Probably less efficient, but a desktop PC probably isn't the best pick for efficiency either.


39 minutes ago, porina said:

A 4070 is rated at 466 TOPS.

Any source on that value you got? Comparing Nvidia products' "TOPS" seems quite misleading due to the unknown underlying data type.

Anyhow, in theory a 4070 should be rated at ~233 TOPS for INT8, double that for INT4, and double both values again if sparse matrices are on the table.

So yeah, hard to compare values given all variables.
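For what it's worth, the scaling is easy to tabulate. A quick sketch using the ~233 dense-INT8 figure above as the baseline (that baseline is this thread's estimate, not an official spec sheet):

```python
# How one baseline turns into several marketing "TOPS" numbers.
base_int8_dense = 233  # the ~233 dense INT8 estimate from above, not an official spec

variants = {
    "INT8 dense":  base_int8_dense,
    "INT8 sparse": base_int8_dense * 2,  # 2:4 sparsity doubles the headline rate
    "INT4 dense":  base_int8_dense * 2,  # half-width operands, double the rate
    "INT4 sparse": base_int8_dense * 4,
}
for name, value in variants.items():
    print(f"{name:11s} ~{value} TOPS")
```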

8 minutes ago, Gat Pelsinger said:

So with all this, does this mean that GPUs the way they are built, are still perfect for AI tasks, and don't really need a whole separate type of hardware to accelerate it better

Yes.

9 minutes ago, Gat Pelsinger said:

So do the actual AI accelerators like the H100 or the new GB200 still have the traditional CUDA cores, with the tensor

Yes, they have a pretty similar underlying arch to your regular GeForce cores, but with some differences in the number of units, memory controllers, and some other stuff related to video (for example, they have way fewer ROPs, since those are not really useful for what these chips are meant to do).

It has been like that for a long time. An A100 is still pretty similar too.

10 minutes ago, Gat Pelsinger said:

and maybe RT cores (do they have RT? that would be just a waste, isn't it?)?

No, they don't have RT cores on their dies.

You should just give a read at some of Nvidia's whitepapers:

Ampere consumer - https://www.nvidia.com/content/PDF/nvidia-ampere-ga-102-gpu-architecture-whitepaper-v2.pdf

Ampere x100 - https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf

Ada - https://images.nvidia.com/aem-dam/Solutions/Data-Center/l4/nvidia-ada-gpu-architecture-whitepaper-V2.02.pdf

Hopper - https://resources.nvidia.com/en-us-tensor-core

 

12 minutes ago, Gat Pelsinger said:

So does this mean you can still game on them perfectly fine?

Not perfectly fine, since they lack some stuff (like the ROPs, as I said above). But they can run games in theory, yes. They also lack the driver support needed on Windows to run games; LTT has done a video on the A100.

 

11 minutes ago, Electronics Wizardy said:

The big server GPUs from Nvidia are a different architecture than the desktop parts, with no ray tracing, no display out (I think) and more, to save die space for general compute and AI accelerators.

That only applies to the x100 chips. Those have a different arrangement of the units (such as no RT or display, as you said), but all of the rest of their lineup uses the exact same chip found in your gaming GPUs.

 

12 minutes ago, Gat Pelsinger said:

But then what hardware is used for running those models, for say, OpenAI giving us ChatGPT services? I think still GPUs, right?

Yup, afaik OpenAI is using a mix of Nvidia and AMD Instinct GPUs on Azure.

13 minutes ago, Gat Pelsinger said:

Wouldn't it be more efficient to develop specialized hardware for it, just like the new NPUs but more powerful?

Why? OpenAI is training models all the time. Within the same cluster they can do training for updating and building their models, and the remaining compute can be used to serve inference.

Having 2 distinct types would mean sub-utilization on both sides.

 

FWIW, they have all of their stuff on Azure. MS does have their own custom inference thing, but I'm not sure if it's widely used.

Since OpenAI is on the cloud, without any datacenter of their own, they don't need to care much about optimizing such things.

 

16 minutes ago, Gat Pelsinger said:

I am asking for a more hardware accelerated solution for AI rather than using traditional GPUs.

For what?

Training? Google has TPUs, and there are tons of other companies with dedicated hardware, such as Cerebras, Intel's Habana (Gaudi), Huawei's thing whose name I forget, Tenstorrent products, etc. Those are all fast, even faster than Nvidia's offerings, but software support is annoying and Nvidia does it best. Fast hardware is no good if you can't make proper use of it.

For inference, you have TPUs.

 

What else do you want?

 

16 minutes ago, anirudthelinuxwIzard said:

Fixed-function hardware can only do one thing; it's way less wasteful to have the same silicon be able to do more.

And all a consumer wants is to run inference locally for smaller models. It's wasteful to use more silicon that won't be used at all.

NPUs are also not fixed-function; they are fully fledged co-processors with tons of MAC units. Any task that can make use of such hardware (like running trained models) can take advantage of it.
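In practice these NPUs are usually reached through a runtime rather than programmed directly. A hedged sketch with ONNX Runtime: "model.onnx" is a placeholder, and which execution provider actually maps to your NPU depends on the installed vendor stack (QNN for Snapdragon, DirectML on Windows, etc.):

```python
import onnxruntime as ort

# Prefer an NPU-capable execution provider, falling back to CPU.
# "model.onnx" is a placeholder path; provider availability depends on the
# installed vendor stack (QNN for Snapdragon, DirectML on Windows, etc.).
preferred = ["QNNExecutionProvider", "DmlExecutionProvider", "CPUExecutionProvider"]
available = ort.get_available_providers()
providers = [p for p in preferred if p in available]

session = ort.InferenceSession("model.onnx", providers=providers)
print("running on:", session.get_providers()[0])
# outputs = session.run(None, {"input": input_array})
```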

FX6300 @ 4.2GHz | Gigabyte GA-78LMT-USB3 R2 | Hyper 212x | 3x 8GB + 1x 4GB @ 1600MHz | Gigabyte 2060 Super | Corsair CX650M | LG 43UK6520PSA
ASUS X550LN | i5 4210u | 12GB
Lenovo N23 Yoga


21 minutes ago, Gat Pelsinger said:

I am asking for a more hardware accelerated solution for AI rather than using traditional GPUs. I suppose something like this hasn't come out, because GPUs are still just perfect for AI tasks? I kind of doubt that.

My understanding of "AI" as an outsider is that it is mostly a throughput game. You do as many calcs as you can to get your eventual answer. Ideal scaling workload of GPUs/NPUs. Maybe you could further optimise certain parts of it, but the general case is good enough. Try to look up if anyone is trying to make something more specific. If no one is doing it in such a growth area, that says something.

 

If you really want custom silicon for AI, there's always these guys: https://cerebras.ai/product-chip/

Gaming system: R7 7800X3D, Asus ROG Strix B650E-F Gaming Wifi, Thermalright Phantom Spirit 120 SE ARGB, Corsair Vengeance 2x 32GB 6000C30, RTX 4070, MSI MPG A850G, Fractal Design North, Samsung 990 Pro 2TB, Alienware AW3225QF (32" 240 Hz OLED)
Productivity system: i9-7980XE, Asus X299 TUF mark 2, Noctua D15, 64GB ram (mixed), RTX 3070, NZXT E850, GameMax Abyss, Samsung 980 Pro 2TB, iiyama ProLite XU2793QSU-B6 (27" 1440p 100 Hz)
Gaming laptop: Lenovo Legion 5, 5800H, RTX 3070, Kingston DDR4 3200C22 2x16GB 2Rx8, Kingston Fury Renegade 1TB + Crucial P1 1TB SSD, 165 Hz IPS 1080p G-Sync Compatible


1 minute ago, igormp said:

Any source on that value you got? Comparing Nvidia products' "TOPS" seems quite misleading due to the unknown underlying data type.

Anyhow, in theory a 4070 should be rated at ~233 TOPS for INT8, double that for INT4, and double both values again if sparse matrices are on the table.

https://www.nvidia.com/en-gb/geforce/graphics-cards/40-series/rtx-4070-family/

Expand "all specs".

I was concerned about whether sparsity was included, but I wasn't motivated enough to dig deeper. I know I looked all that stuff up before, when the export limits on the 4090 going to China were announced. Still, any potential error is a factor of 2 from that, as opposed to the order-of-magnitude illustration I was making.

Gaming system: R7 7800X3D, Asus ROG Strix B650E-F Gaming Wifi, Thermalright Phantom Spirit 120 SE ARGB, Corsair Vengeance 2x 32GB 6000C30, RTX 4070, MSI MPG A850G, Fractal Design North, Samsung 990 Pro 2TB, Alienware AW3225QF (32" 240 Hz OLED)
Productivity system: i9-7980XE, Asus X299 TUF mark 2, Noctua D15, 64GB ram (mixed), RTX 3070, NZXT E850, GameMax Abyss, Samsung 980 Pro 2TB, iiyama ProLite XU2793QSU-B6 (27" 1440p 100 Hz)
Gaming laptop: Lenovo Legion 5, 5800H, RTX 3070, Kingston DDR4 3200C22 2x16GB 2Rx8, Kingston Fury Renegade 1TB + Crucial P1 1TB SSD, 165 Hz IPS 1080p G-Sync Compatible


NPUs don't have very good documentation. Here's a summary I wrote up, but haven't been able to find documentation for some bits:

[Attached: three summary slides comparing CPU, NPU and GPU; full write-up in cpuvsnpuvsgpu.pdf]


5 minutes ago, porina said:

https://www.nvidia.com/en-gb/geforce/graphics-cards/40-series/rtx-4070-family/

Expand "all specs".

I was concerned about whether sparsity was included, but I wasn't motivated enough to dig deeper. I know I looked all that stuff up before, when the export limits on the 4090 going to China were announced. Still, any potential error is a factor of 2 from that, as opposed to the order-of-magnitude illustration I was making.

Ohhh, nice to know where people are getting those TOPS numbers for Nvidia products from. Interesting to see they've populated those.

Comparing the listing on the 4090 on their website:

https://www.nvidia.com/en-gb/geforce/graphics-cards/40-series/

 

To the actual whitepaper:

https://images.nvidia.com/aem-dam/Solutions/Data-Center/l4/nvidia-ada-gpu-architecture-whitepaper-V2.02.pdf

 

Seems like they're listing the value for INT8/FP8 with sparsity. I still maintain that those TOPS numbers are quite misleading and don't represent much. You can't even compare them directly between the different NPUs claiming high values without knowing which data type they're referring to 🙄

 

5 minutes ago, sounds said:

NPUs don't have very good documentation. Here's a summary I wrote up, but haven't been able to find documentation for some bits:

[attached slides]

Apple's NPU can't really be used for training, same applies to AMD's XDNA.

Google's TPU is far different than those two.

 

Nvidia's GPUs also have open source drivers now. Their userland CUDA stuff still is proprietary tho.

Only proper support on tensorflow/pytorch is using CPU or going with Nvidia. ROCm is a pain, and Apple's GPU has no full support, lots of operations do not work properly.

You can also use Apple's GPU through Metal instead of Core ML (seems like you just copy-pasted the data from the NPU).
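For a sense of what that support gap looks like in practice, picking the compute backend in PyTorch is a one-liner; how well each path actually works is the real differentiator (CUDA is the mature one, MPS and ROCm still have gaps, as noted above). A minimal sketch:

```python
import torch

# Pick the best available backend: CUDA (Nvidia; ROCm builds also report "cuda"),
# then MPS (Apple GPU via Metal), then plain CPU. Which ops are actually
# implemented varies a lot per backend.
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

x = torch.randn(1024, 1024, device=device)
print(device, (x @ x).shape)
```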

 

Google's TPU also has top notch tensorflow support, given that tensorflow is also made by google.

FX6300 @ 4.2GHz | Gigabyte GA-78LMT-USB3 R2 | Hyper 212x | 3x 8GB + 1x 4GB @ 1600MHz | Gigabyte 2060 Super | Corsair CX650M | LG 43UK6520PSA
ASUS X550LN | i5 4210u | 12GB
Lenovo N23 Yoga


9 minutes ago, igormp said:

Apple's NPU can't really be used for training, same applies to AMD's XDNA.

Google's TPU is far different than those two.

 

Nvidia's GPUs also have open source drivers now. Their userland CUDA stuff still is proprietary tho.

Only proper support on tensorflow/pytorch is using CPU or going with Nvidia. ROCm is a pain, and Apple's GPU has no full support, lots of operations do not work properly.

You can also use Apple's GPU through Metal instead of Core ML (seems like you just copy-pasted the data from the NPU).

Yeah, you're correct, and packing so much information into some brief slides wasn't possible.

 

I summarized it. Feel free to alter/edit the info from the PDF I attached and make more slides. I am doing the slides because many folks do not get the tradeoffs of going with each different solution. There are a lot of "AI" solutions being pushed on different hardware.

Quote

Google's TPU also has top notch tensorflow support, given that tensorflow is also made by google.

Google's TPU has tensorflow support but all I said was it doesn't scale up... if you're going all-in with Google Cloud you can get really great tensorflow support and even scale it up massively. But I think most people here aren't the audience who's ready to drop 10 grand and start a Google Cloud tensorflow project... So I just left all that off and didn't say anything about "Google TPU + Tensorflow + $10,000 = maybe maybe maybe"


28 minutes ago, igormp said:

Seems like they're listing the value for INT8/FP8 with sparsity. I still maintain that those TOPS numbers are quite misleading and don't represent much. You can't even compare them directly between the different NPUs claiming high values without knowing which data type they're referring to 🙄

That whitepaper was what I was thinking of, but I was not motivated enough to find it for the purposes of this thread. At least the NPUs meeting Copilot+ should be measured similarly, since it is to a Microsoft spec.

Gaming system: R7 7800X3D, Asus ROG Strix B650E-F Gaming Wifi, Thermalright Phantom Spirit 120 SE ARGB, Corsair Vengeance 2x 32GB 6000C30, RTX 4070, MSI MPG A850G, Fractal Design North, Samsung 990 Pro 2TB, Alienware AW3225QF (32" 240 Hz OLED)
Productivity system: i9-7980XE, Asus X299 TUF mark 2, Noctua D15, 64GB ram (mixed), RTX 3070, NZXT E850, GameMax Abyss, Samsung 980 Pro 2TB, iiyama ProLite XU2793QSU-B6 (27" 1440p 100 Hz)
Gaming laptop: Lenovo Legion 5, 5800H, RTX 3070, Kingston DDR4 3200C22 2x16GB 2Rx8, Kingston Fury Renegade 1TB + Crucial P1 1TB SSD, 165 Hz IPS 1080p G-Sync Compatible


21 minutes ago, sounds said:

Google's TPU has tensorflow support but all I said was it doesn't scale up

What do you mean by "scale up"?

You can use multiple TPUs to train larger models. Apple has recently done so with those models of theirs.

46 minutes ago, sounds said:

if you're going all-in with Google Cloud you can get really great tensorflow support and even scale it up massively. But I think most people here aren't the audience who's ready to drop 10 grand and start a Google Cloud tensorflow project... So I just left all that off and didn't say anything about "Google TPU + Tensorflow + $10,000 = maybe maybe maybe"

I mean, you can only use the TPU on GCP anyway. And by this point, it should be compared to the likes of the A100/H100, which are even more expensive to use.

Most people here won't be training models at all tbh.

24 minutes ago, porina said:

That whitepaper was what I was thinking of, but I was not motivated enough to find it for the purposes of this thread. At least the NPUs meeting Copilot+ should be measured similarly, since it is to a Microsoft spec.

Seems like INT8 is the default baseline for the TOPS number (at least the Snapdragon X Elite's Hexagon claims 45 TOPS for INT8). However, there's no mention of whether sparsity can be used or not.

FX6300 @ 4.2GHz | Gigabyte GA-78LMT-USB3 R2 | Hyper 212x | 3x 8GB + 1x 4GB @ 1600MHz | Gigabyte 2060 Super | Corsair CX650M | LG 43UK6520PSA
ASUS X550LN | i5 4210u | 12GB
Lenovo N23 Yoga


@igormp

 

Completely forgot to mention actual enterprise AI accelerators like Intel's Gaudi AI accelerator. Is that what I am talking about, a much more powerful NPU? How are they different from Nvidia's GPU solutions?

Microsoft owns my soul.

 

Also, Dell is evil, but HP kinda nice.


29 minutes ago, Gat Pelsinger said:

@igormp

 

Completely forgot to mention actual enterprise AI accelerators like Intel's Gaudi AI accelerator. Is that what I am talking about, a much more powerful NPU? How are they different from Nvidia's GPU solutions?

I had mentioned those already:

On 9/16/2024 at 2:32 PM, igormp said:

For what?

Training? Google has TPUs, and there are tons of other companies with dedicated hardware, such as Cerebras, Intel's Habana (Gaudi), Huawei's thing whose name I forget, Tenstorrent products, etc. Those are all fast, even faster than Nvidia's offerings, but software support is annoying and Nvidia does it best. Fast hardware is no good if you can't make proper use of it.

 

For actual differences, you can look at its whitepaper, but basically it's a big chip with lots of SRAM as cache and just 2 kinds of compute units: matrix engines (MMEs) and tensor processor cores (TPCs).

https://www.intel.com/content/www/us/en/content-details/817486/intel-gaudi-3-ai-accelerator-white-paper.html


 

The TPCs are VLIW cores, so they really rely on the compiler to do the proper scheduling of instructions and just run the instruction stream as it comes, without any further optimization. GPUs departed from this idea ages ago.

The MMEs are actually the equivalent of Nvidia's tensor cores, whereas the TPCs are meant for more "general" use. Think of the TPCs as really dumb AVX2 units like the ones found in your CPU.

 

Nvidia's SMs (which should be the equivalent to Gaudi's TPCs) are way more powerful and can do many more operations. The GPU itself can also do better instruction scheduling and whatnot.

 

Perf wise, it's really meh for non matrix ops (barely does 30TFLOPS for BF16 while the H100 does over 100TFLOPS), and apparently really nice for BF16 tensor ops (1.8PFLOPS vs 1.5PFLOPS on H100), but the H100 takes the lead on FP8 data. 


FX6300 @ 4.2GHz | Gigabyte GA-78LMT-USB3 R2 | Hyper 212x | 3x 8GB + 1x 4GB @ 1600MHz | Gigabyte 2060 Super | Corsair CX650M | LG 43UK6520PSA
ASUS X550LN | i5 4210u | 12GB
Lenovo N23 Yoga


12 minutes ago, igormp said:

The TPCs are VLIW cores, so they really rely on the compiler to do the proper scheduling of instructions and just run the instruction stream as it comes, without any further optimization. GPUs departed from this idea ages ago.

The MMEs are actually the equivalent of Nvidia's tensor cores, whereas the TPCs are meant for more "general" use. Think of the TPCs as really dumb AVX2 units like the ones found in your CPU.

 

Nvidia's SMs (which should be the equivalent to Gaudi's TPCs) are way more powerful and can do many more operations. The GPU itself can also do better instruction scheduling and whatnot.

 

Perf wise, it's really meh for non matrix ops (barely does 30TFLOPS for BF16 while the H100 does over 100TFLOPS), and apparently really nice for BF16 tensor ops (1.8PFLOPS vs 1.5PFLOPS on H100), but the H100 takes the lead on FP8 data. 

Yeah, but what are the upsides if GPUs are much more powerful? Why would Intel make these, then?

Microsoft owns my soul.

 

Also, Dell is evil, but HP kinda nice.


21 minutes ago, Gat Pelsinger said:

Yeah, but what are the upsides if GPUs are much more powerful? Why would Intel make these, then?

Intel claims it's 30~70% faster than an H100.

Pricing is also a factor: a Gaudi 3 platform should be like half or even a third of the price of an H100 one.
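Put numbers on that and the pitch is basically performance per dollar. Toy arithmetic using the claims above; the relative prices are rough assumptions for illustration, not list prices:

```python
# Toy perf-per-dollar comparison. Performance and price ratios are rough
# assumptions based on the claims above, not measured or list values.
h100 = {"relative_perf": 1.0, "relative_price": 1.0}     # normalize H100 to 1.0
gaudi3 = {"relative_perf": 1.3, "relative_price": 0.5}   # "30-70% faster", ~half price

ratio = (gaudi3["relative_perf"] / gaudi3["relative_price"]) / \
        (h100["relative_perf"] / h100["relative_price"])
print(f"Gaudi 3 perf per dollar vs H100: ~{ratio:.1f}x")  # ~2.6x under these assumptions
```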

FX6300 @ 4.2GHz | Gigabyte GA-78LMT-USB3 R2 | Hyper 212x | 3x 8GB + 1x 4GB @ 1600MHz | Gigabyte 2060 Super | Corsair CX650M | LG 43UK6520PSA
ASUS X550LN | i5 4210u | 12GB
Lenovo N23 Yoga

