
Amdahl's Law

Daharen

A little-discussed topic that is becoming more relevant than Moore's Law in computing today. GPU processing is all about parallelization, CPU design is increasingly focused on core counts, and software is beginning to adapt accordingly. So Amdahl's Law, which has more to do with the gains from parallelization over time than with the growth of raw computing power, should become a point of discussion, don't you think?
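For anyone who wants to see the actual numbers, here is a minimal Python sketch of the speedup formula Amdahl's Law describes (the parallel fraction p and the core counts below are just illustrative values):

# Amdahl's Law: speedup when a fraction p of the work can be spread over n cores
# and the remaining (1 - p) stays serial.
def amdahl_speedup(p, n):
    return 1.0 / ((1.0 - p) + p / n)

# Illustrative example: a 95% parallel workload on 8, 64, and "unlimited" cores.
for n in (8, 64, 1_000_000):
    print(n, round(amdahl_speedup(0.95, n), 2))
# No matter how many cores you add, speedup is capped at 1 / (1 - p) = 20x here.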



What has this got to do with a forum feature?



Oh, huh, yeah, I definitely posted this in the wrong place; that's what happens when you just google what you're looking for and trust the link... I'll delete and repost in the right spot.

Or, in retrospect, I guess I can't delete it... Well, it's posted in the right spot, so I suppose when someone who can delete it sees it, they'll do me the favor. Sorry for the inconvenience.




The degree of parallelization starts with how much of the program is I/O bound or CPU bound. Most applications we use for daily tasks are I/O bound.

 

The multicore boom is likely due more to people trying to run more CPU-demanding tasks at once than to any one app actually taking advantage of more cores.

 

Edit: The funny thing with graphics is that the operation itself is actually pretty serial. The only reason it's parallel is that you're doing that operation over millions of pixels.
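A toy Python sketch of that idea (the per-pixel math here is made up, purely for illustration): the function itself is a short serial chain of steps, but it gets mapped independently over every pixel, and that mapping is where all the parallelism comes from.

from multiprocessing import Pool

def shade(pixel):
    # A hypothetical per-pixel operation: a few serial steps on one pixel.
    r, g, b = pixel
    return (min(r * 2, 255), g, b // 2)

if __name__ == "__main__":
    framebuffer = [(100, 150, 200)] * 1_000_000  # a million dummy pixels
    with Pool() as pool:
        # The same serial operation applied independently to millions of pixels.
        shaded = pool.map(shade, framebuffer)
    print(shaded[0])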


4 hours ago, Daharen said:

A little-discussed topic that is becoming more relevant than Moore's Law in computing today. GPU processing is all about parallelization, CPU design is increasingly focused on core counts, and software is beginning to adapt accordingly. So Amdahl's Law, which has more to do with the gains from parallelization over time than with the growth of raw computing power, should become a point of discussion, don't you think?

Consumers like simple performance metrics. Getting into Amdahl's law and how many picojoules per bit it costs to move data from a core to memory is complicated.

 

You should look at the HPC space if you want to find people talking about real performance measurements. For example, Linpack has been used for decades to get a FLOPS number, but it no longer does an accurate job of predicting real-world performance. HPCG FLOPS, and what percentage of the Linpack number the HPCG result represents, is more accurate. Computational efficiency is another often-overlooked aspect of systems.
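As a rough Python sketch of what those metrics mean (the numbers below are placeholders for illustration only, not real benchmark results):

# Computational efficiency = sustained performance / theoretical peak.
peak_tflops = 200000.0     # hypothetical theoretical peak of a system
linpack_tflops = 150000.0  # hypothetical HPL (Linpack) sustained result
hpcg_tflops = 3000.0       # hypothetical HPCG sustained result (typically a few % of HPL)
print("Linpack efficiency: {:.0%}".format(linpack_tflops / peak_tflops))
print("HPCG as a share of Linpack: {:.1%}".format(hpcg_tflops / linpack_tflops))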

 

Realistically, the consumer x86 CPUs are very far behind in the context of Amdahl's law, so looking at them is guaranteed to disappoint.

 

Consumers don't usually pay more than a few hundred dollars for a CPU, so x86 desktop CPUs are pretty standard and haven't radically changed to attack the real bottlenecks in a system.

 

In HPC and mainframes (yes, they're still some of the most advanced systems, especially in terms of bottleneck removal) these issues are aggressively attacked and no expense is spared.

 

For a few examples, look at:

 

2013's NEC SX-ACE CPU: 4 cores with 16-channel DDR3, which gets around that whole byte/FLOP bottleneck. The current SX-Aurora successor is even more radical, with 6 stacks of HBM (48GB, 1.2TB/s) per CPU.

 

Back in 2015, Fujitsu released the 32+2 core SPARC XIfx, which had 32GB of HMC DRAM. We still don't have a single consumer CPU with HMC or HBM. The ARM successor with HBM is in the works. Both of these support extensive optical interconnects as well.

 

People usually think of mainframes as being old school and low tech, but if you look at the z14, it's probably the single most advanced computer system that exists, and it eliminates every bottleneck possible. I would need to write pages about it to do it justice, so I'll just link to this.

 

https://fuse.wikichip.org/news/941/isscc-2018-the-ibm-z14-microprocessor-and-system-control-design/

 

But yes, Amdahl's law is very often discussed in regard to things outside the consumer space.

 


8 minutes ago, Canada EH said:

Hey, if it's cheap and a giant leap for us, I may buy or not.

Those chips I mentioned are several thousand to hundreds of thousands of dollars each, and they go into systems that cost multiple millions to over $1 billion USD.

 

For a consumer system that really gets around the problem of diminishing returns, there are other ways to attack the issue than adding more cores and parallelization.

 

Intel had it in the form of eDRAM L4 caches, but they basically gave it up because it would have eaten into higher-margin chip sales.

 

Adding larger on-chip L3 caches, possibly using MRAM instead of SRAM, would benefit the performance of most desktop workloads. Replacing that waste-of-space iGPU with a big slice of eDRAM, while still having 4-8 physical cores at 4.5-5GHz and a sufficiently competent quad-channel DDR4 controller, would make for the ultimate desktop chip without any radical new tech at all.

 

IBM already has PCIe 4.0 and CAPI 2.0 on its current CPUs. If you had something like that on x86, combined with the features I just mentioned, it'd be unbeatable. You wouldn't need an i9 with 18 cores. You could do most things desktop users care about faster with 4-8 cores if you didn't have the underlying bottlenecks.

 

 


1 hour ago, Amazonsucks said:

Those chips I mentioned are several thousand dollars

OK I will wait a decade or two then decide if we will purchase it or not.


1 hour ago, Canada EH said:

OK I will wait a decade or two then decide if we will purchase it or not.

Desktop CPUs are about 5 to 10 years behind mainframes and HPC, depending on what you're looking at.


10 minutes ago, Canada EH said:

Price is all that matters to us, we want it all for nothing!

I should try to configure an "ultimate desktop" for Windows. The DGX Station is pretty close, but I don't know if it can run anything other than Linux.


@Canada EH Genuine question: you seem to be going on about memory bandwidth like it's the limiting factor, despite the fact that LTT did a video last year where 3200 was pretty much the limit of usable speed. That's probably gone up some since the default DDR4 speed jumped from 2400 to 2666 in the meantime, but I'd be surprised if desktop and home workstation workloads have gotten that much more memory-bandwidth limited since. (I would, however, love to challenge LTT to do some benchmarking on things like the advantages of core count, memory amounts, and memory speeds with my typical use case, as I'm out on the fringes and I'd love to hear more about how my use case works out.)


8 hours ago, Canada EH said:

OK I will wait a decade or two then decide if we will purchase it or not.

With all that money saved on Canadian healthcare, you'll be right in a few months.

1 hour ago, CarlBar said:

@Canada EH Genuine question: you seem to be going on about memory bandwidth like it's the limiting factor, despite the fact that LTT did a video last year where 3200 was pretty much the limit of usable speed. That's probably gone up some since the default DDR4 speed jumped from 2400 to 2666 in the meantime, but I'd be surprised if desktop and home workstation workloads have gotten that much more memory-bandwidth limited since. (I would, however, love to challenge LTT to do some benchmarking on things like the advantages of core count, memory amounts, and memory speeds with my typical use case, as I'm out on the fringes and I'd love to hear more about how my use case works out.)

Bottlenecks depend on the application - most tasks are I/O bound, and bandwidth is the bottleneck. However, it depends on whether the application itself can take full advantage.


12 minutes ago, RorzNZ said:

Bottlenecks depend on the application - most tasks are I/O bound, and bandwidth is the bottleneck. However, it depends on whether the application itself can take full advantage.

 

I know it's application-dependent, hence my "for desktop and home workstation workloads". My impression from Linus's video is that current workloads used by normal everyday users and home workstation users aren't very memory bound and are instead CPU or GPU bound. Obviously, get into really extreme top-end supercomputers and it's a whole other ballgame; they have enormous amounts of extra processing power via add-on cards that desktop and home workstation users lack.


5 hours ago, Canada EH said:

Day dreaming can be fun, I guess.

 

I have to admit I do it sometimes.

Well, it's pretty hard to beat a DGX Station's 4x 32GB V100s in NVLink, with 256GB of DDR4 main memory, a 20-core Broadwell Xeon E5-2698 v4, and almost 8TB of SSDs by default, all liquid cooled.

 

Other than the old CPU, that's pretty good for something that sits on a desk and is very quiet. It's also currently $20,000 off, so it's $50,000 instead of $70,000.

 

If it ran Windows and had drivers that made games see the 4x V100s as one big GPU over NVLink, like they appear to workloads in Linux, then it would be unbeatable.

 

The Quadro RTX 8000 also supports NVLink, albeit only 2-way instead of the 4-way you get with the V100. It appears to the OS as a single GPU with 96GB of GDDR6, which is less than 4x V100s, but it does have the RT cores. Not sure how 2x RTX 8000s compare with 4x V100s for games, though.

 

If you were only playing games, then an i9-9980XE with 128GB of RAM, 2x Quadro RTX 8000s, and an Optane 905p SSD or two would probably be the ultimate desktop.


29 minutes ago, CarlBar said:

 

I know it's application-dependent, hence my "for desktop and home workstation workloads". My impression from Linus's video is that current workloads used by normal everyday users and home workstation users aren't very memory bound and are instead CPU or GPU bound. Obviously, get into really extreme top-end supercomputers and it's a whole other ballgame; they have enormous amounts of extra processing power via add-on cards that desktop and home workstation users lack.

Actually, the most powerful and efficient HPC systems tend to be CPU only. It's not finished yet, but the first true exascale computer will be the >$1 billion Post-K computer, which is the successor to K. Post-K is using custom ARMv8 CPUs. K used custom SPARC chips.

 

They never built a large system using the interim PrimeHPC FX10 or FX100, but those were also SPARC-CPU-only machines, and they're still some of the most advanced systems out there. The FX100's SPARC XIfx chip was the first CPU to use HMC as main memory.

 

All of those systems - the K, FX10 and FX100 - are >90% computationally efficient. Even really good heterogeneous systems are typically less than 70%; Summit is about 65%.

 

Most of the really powerful heterogeneous systems, like the current fastest supercomputer in the world, Summit, are using GPUs you can get for your desktop. In the case of Summit and Sierra, they're using Nvidia Volta V100s, which is what's in the Titan V.

 

As for the memory bottleneck, it's not straightforward in the case of either a desktop or a supercomputer. For some HPC workloads you need to communicate from the memory in one node to a node that's a hundred feet away, going several hops to get there. That's where the interconnect topology and the fabric itself come into play.

 

Byte/FLOP ratio balance is a big consideration. The people behind the best systems, like Tadashi Watanabe and a bunch of other guys whose names escape me now, like to keep around a 1/2 byte/FLOP ratio. That's quite high for HPC and much higher than most desktop configurations.

 

With a desktop the cores need to be fed too, but keeping data as close to the cores as possible is easier than in a massive system with tens or hundreds of thousands of cores and petabytes of RAM spread out over hundreds of racks.

 

Large on-chip or on-package caches help keep the cores fed and able to do more work, as well as allow for better memory utilisation.

 

If you look back at the 3GHz i7-5775C, it had amazing performance considering its low clock. 

 

Intel sadly never made any more normal desktop CPUs with L4 caches, but they still exist.

 

https://ark.intel.com/products/93742/Intel-Xeon-Processor-E3-1585-v5-8M-Cache-3-50-GHz-


2 minutes ago, Amazonsucks said:

Byte/FLOP ratio balance is a big consideration. The people behind the best systems, like Tadashi Watanabe and a bunch of other guys whose names escape me now, like to keep around a 1/2 byte/FLOP ratio. That's quite high for HPC and much higher than most desktop configurations.

 

Sorry, could you give me an explanation of what byte/FLOP is? I think I can translate it (bytes of data per FLOP of output?), but I really don't understand how or why it's important; it's a bit more technical than I'm used to.


21 minutes ago, CarlBar said:

 

Sorry, could you give me an explanation of what byte/FLOP is? I think I can translate it (bytes of data per FLOP of output?), but I really don't understand how or why it's important; it's a bit more technical than I'm used to.

It's the balance of how many bytes of data can get to the CPU or GPU per second (memory bandwidth) compared to how many floating point operations that CPU or GPU can do per second (FLOPS).

 

Usually it's a pretty low ratio of bytes to FLOPS, but some very specialized systems like the NEC SX-ACE have a 1:1 byte/FLOP ratio. Interestingly, since any one of the 4 cores on the SX-ACE could use all of the memory bandwidth, it could actually have a 4:1 byte/FLOP ratio in some workloads. Its successor has a 1:2 ratio, with 1.2TB/s of memory bandwidth and 2.45 TFLOPS per chip.
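As a quick back-of-the-envelope check in Python, using the SX-Aurora figures just mentioned:

# Byte/FLOP ratio = memory bandwidth (bytes per second) / peak compute (FLOP per second).
bandwidth = 1.2e12    # 1.2 TB/s of HBM bandwidth
peak_flops = 2.45e12  # 2.45 FP64 TFLOPS per chip
print(bandwidth / peak_flops)  # ~0.49 bytes per FLOP, i.e. roughly a 1:2 ratio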

 

For real-world performance in modern HPC workloads, which are often data intensive rather than compute intensive, it's nice to have a high byte/FLOP ratio, but memory capacity and bandwidth are much more expensive than FLOPS are, which is why you typically see significantly lower ratios.

 

Also, please note that in all these cases I am talking about double-precision FP64 FLOPS, NOT the single-precision FP32 TFLOPS you'll see people talking about for normal desktop GPUs, most of which have almost no double-precision capability. The Titan V is an exception, with lots of FP64 cores: 7.5 FP64 double-precision TFLOPS, 15 FP32 single-precision TFLOPS, and something like 110 trillion "deep learning" tensor core ops per second.

 

 


1 hour ago, Amazonsucks said:

It's the balance of how many bytes of data can get to the CPU or GPU per second (memory bandwidth) compared to how many floating point operations that CPU or GPU can do per second (FLOPS).

 

Usually it's a pretty low ratio of bytes to FLOPS, but some very specialized systems like the NEC SX-ACE have a 1:1 byte/FLOP ratio. Interestingly, since any one of the 4 cores on the SX-ACE could use all of the memory bandwidth, it could actually have a 4:1 byte/FLOP ratio in some workloads. Its successor has a 1:2 ratio, with 1.2TB/s of memory bandwidth and 2.45 TFLOPS per chip.

 

For real-world performance in modern HPC workloads, which are often data intensive rather than compute intensive, it's nice to have a high byte/FLOP ratio, but memory capacity and bandwidth are much more expensive than FLOPS are, which is why you typically see significantly lower ratios.

 

Also, please note that in all these cases I am talking about double-precision FP64 FLOPS, NOT the single-precision FP32 TFLOPS you'll see people talking about for normal desktop GPUs, most of which have almost no double-precision capability. The Titan V is an exception, with lots of FP64 cores: 7.5 FP64 double-precision TFLOPS, 15 FP32 single-precision TFLOPS, and something like 110 trillion "deep learning" tensor core ops per second.

 

 

 

OK, new techie question, since you're so amazing at explaining. What's the difference between single and double precision? Is it just the usual 32-bit vs 64-bit maximum number size thing, or something more complex?


2 minutes ago, CarlBar said:

 

OK, new techie question, since you're so amazing at explaining. What's the difference between single and double precision? Is it just the usual 32-bit vs 64-bit maximum number size thing, or something more complex?

I'm actually terrible at explaining anything math related, so... It's got to do with how many bits in total, 32 or 64, are used to represent each floating point number being worked on. It's kind of like the memory addressing range thing you mention with 32 vs 64 bits, but instead of determining how many addresses there are, it's the amount of precision used to represent the floating point numbers being worked on by the chip.

 

That's probably a terribly confusing explanation.
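Maybe this makes it more concrete; a quick Python illustration (Python's own floats are already 64-bit doubles, and round-tripping a value through a 32-bit float shows the precision that gets thrown away):

import struct

x = 1.0 / 3.0                                        # stored as a 64-bit double
single = struct.unpack('f', struct.pack('f', x))[0]  # squeezed through 32 bits and back
print(x)       # 0.3333333333333333  -> about 15-16 meaningful decimal digits
print(single)  # 0.3333333432674408  -> only about 7 digits are meaningful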

 

 


1 hour ago, Amazonsucks said:

It's the balance of how many bytes of data can get to the CPU or GPU per second (memory bandwidth) compared to how many floating point operations that CPU or GPU can do per second (FLOPS).

 

Usually it's a pretty low ratio of bytes to FLOPS, but some very specialized systems like the NEC SX-ACE have a 1:1 byte/FLOP ratio. Interestingly, since any one of the 4 cores on the SX-ACE could use all of the memory bandwidth, it could actually have a 4:1 byte/FLOP ratio in some workloads. Its successor has a 1:2 ratio, with 1.2TB/s of memory bandwidth and 2.45 TFLOPS per chip.

 

For real-world performance in modern HPC workloads, which are often data intensive rather than compute intensive, it's nice to have a high byte/FLOP ratio, but memory capacity and bandwidth are much more expensive than FLOPS are, which is why you typically see significantly lower ratios.

 

Also, please note that in all these cases I am talking about double-precision FP64 FLOPS, NOT the single-precision FP32 TFLOPS you'll see people talking about for normal desktop GPUs, most of which have almost no double-precision capability. The Titan V is an exception, with lots of FP64 cores: 7.5 FP64 double-precision TFLOPS, 15 FP32 single-precision TFLOPS, and something like 110 trillion "deep learning" tensor core ops per second.

 

 

 

Nope, I know floating point simply means a number with decimal places, so it's basically how many decimal places you can have.


Eventually, we will also hit the core count wall. The core clock ceiling is around 5GHz, so now we're making CPUs "wider". But you can only cram in so many CPU cores before you run out of space or have to sacrifice core speed to accommodate the core count (remember that Xeon Phi thing Linus did and how hard it sucked?). Well, that's the problem. You can make insane core counts, but they'll run at like 1.5GHz at best. Meaning this "MOAR CORES!!!!!111" approach will work for a while, but then we'll really have to change the architecture in some way. Not sure how yet. One option would be multi-socket consumer boards, probably the cheapest route with minimal radical changes for the manufacturers. But that won't happen for a minimum of 10 years, if not more.
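To put some rough numbers on that trade-off, here is a small Python sketch (the clock speeds and core counts are just the illustrative figures from above, and scaling performance linearly with clock is a simplification):

def effective_perf(clock_ghz, cores, p):
    # Single-core throughput scaled by the Amdahl speedup for a parallel fraction p.
    return clock_ghz * (1.0 / ((1.0 - p) + p / cores))

for p in (0.5, 0.9, 0.99):
    few_fast = effective_perf(5.0, 8, p)    # a handful of fast cores
    many_slow = effective_perf(1.5, 64, p)  # many slow cores
    print(p, round(few_fast, 1), round(many_slow, 1))
# The many-core, low-clock chip only pulls ahead when the workload is almost
# entirely parallel; otherwise the serial part dominates.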


That's basically what AMD is doing with their Rome architecture: multiple CPU dies in one package. OK, it's a bit more complex than that, but they're using chiplet technology to cram large numbers of cores into a small area. TBH, if they could find a better cooling solution, 3D chip stacking as used in HBM memory could really open things up. AMD could then stack multiple Rome sets vertically. If, as with HBM, they did 4 stacks, that would give them 256 cores. Hail Caesar!


8 hours ago, CarlBar said:

 

I know it's application-dependent, hence my "for desktop and home workstation workloads". My impression from Linus's video is that current workloads used by normal everyday users and home workstation users aren't very memory bound and are instead CPU or GPU bound. Obviously, get into really extreme top-end supercomputers and it's a whole other ballgame; they have enormous amounts of extra processing power via add-on cards that desktop and home workstation users lack.

The stuff you run on supercomputers, I would imagine, needs pretty large data sets, etc. Not an expert here, but you are right for the average user. It's the application that sets the limits in terms of need for bandwidth.

