
AMD holds onto top spot of Top 500 supercomputer list

porina


Summary

The November 2023 Top 500 supercomputer list has been announced. AMD silicon holds onto the top spot with the Frontier system. Intel silicon moves into 2nd place with Aurora, although the system isn't finished yet and its placement is based on a run of half the final machine. In 3rd spot is a Microsoft cloud system using Intel CPUs and Nvidia GPUs.

 

Quotes

Quote

Housed at the Oak Ridge National Laboratory (ORNL) in Tennessee, USA, Frontier leads the pack with an HPL score of 1.194 EFlop/s – unchanged from the June 2023 list. Frontier utilizes AMD EPYC 64C 2GHz processors and is based on the latest HPE Cray EX235a architecture. The system has a total of 8,699,904 combined CPU and GPU cores. Additionally, Frontier has an impressive power efficiency rating of 52.59 GFlops/watt and relies on HPE’s Slingshot 11 network for data transfer.

 

The new Aurora system at the Argonne Leadership Computing Facility in Illinois, USA, entered the list at the No. 2 spot – previously held by Fugaku – with an HPL score of 585.34 PFlop/s. That said, it is important to note that Aurora’s numbers were submitted with a measurement on half of the planned final system. Aurora is currently being commissioned and will reportedly exceed Frontier with a peak performance of 2 EFlop/s when finished.

 

My thoughts

The AMD-based Frontier system holds onto the top spot. It had been widely debated whether Aurora would overtake it, and it seems not at this time. Aurora has been subject to delays at Intel, and they weren't able to do a full-system run. I'm not up to speed on other upcoming supercomputers, but Aurora is expected to pass Frontier once it is fully running, perhaps at the next update in six months. It has been commented that the listed power draw may be for the full system while the performance was obtained running on only half of it, so looking at perf/W of these systems could be interesting. It is notable that many of the new entries use Xeon CPUs, but this may be down to the timing of availability of those CPUs.
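For anyone wanting to sanity-check the perf/W figures, efficiency is just Rmax divided by power. A minimal sketch using only the Frontier numbers quoted above, so the ~22.7 MW it prints is implied by those figures rather than a number I'm separately claiming:

# Perf/W sanity check using the Frontier figures quoted above.
rmax_eflops = 1.194                # HPL Rmax in EFlop/s
efficiency_gflops_per_w = 52.59    # quoted power efficiency in GFlops/watt

rmax_gflops = rmax_eflops * 1e9    # EFlop/s -> GFlop/s
implied_power_mw = rmax_gflops / efficiency_gflops_per_w / 1e6

print(f"Implied Frontier power draw: ~{implied_power_mw:.1f} MW")  # ~22.7 MW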

 

Sources

https://www.top500.org/news/frontier-remains-no-1-in-the-top500-but-aurora-with-intels-sapphire-rapids-chips-enters-with-a-half-scale-system-at-no-2/

https://www.top500.org/lists/top500/2023/11/

 

Gaming system: R7 7800X3D, Asus ROG Strix B650E-F Gaming Wifi, Thermalright Phantom Spirit 120 SE ARGB, Corsair Vengeance 2x 32GB 6000C30, RTX 4070, MSI MPG A850G, Fractal Design North, Samsung 990 Pro 2TB, Acer Predator XB241YU 24" 1440p 144Hz G-Sync + HP LP2475w 24" 1200p 60Hz wide gamut
Productivity system: i9-7980XE, Asus X299 TUF mark 2, Noctua D15, 64GB ram (mixed), RTX 3070, NZXT E850, GameMax Abyss, Samsung 980 Pro 2TB, random 1080p + 720p displays.
Gaming laptop: Lenovo Legion 5, 5800H, RTX 3070, Kingston DDR4 3200C22 2x16GB 2Rx8, Kingston Fury Renegade 1TB + Crucial P1 1TB SSD, 165 Hz IPS 1080p G-Sync Compatible


It just blows me away that a supercomputer is doing 1,200 petaflops when, just 12 years ago, the fastest supercomputer in the world was slower than the Folding@home distributed computing network when it broke the petaflop barrier.


These nerds can't hold a candle to the power of FOLDING!!!

 

https://www.tomshardware.com/news/folding-at-home-worlds-top-supercomputers-coronavirus-covid-19

"Put as much effort into your question as you'd expect someone to give in an answer"- @Princess Luna

Make sure to Quote posts or tag the person with @[username] so they know you responded to them!

 RGB Build Post 2019 --- Rainbow 🦆 2020 --- Velka 5 V2.0 Build 2021

Purple Build Post ---  Blue Build Post --- Blue Build Post 2018 --- Project ITNOS

CPU i7-4790k    Motherboard Gigabyte Z97N-WIFI    RAM G.Skill Sniper DDR3 1866mhz    GPU EVGA GTX1080Ti FTW3    Case Corsair 380T   

Storage Samsung EVO 250GB, Samsung EVO 1TB, WD Black 3TB, WD Black 5TB    PSU Corsair CX750M    Cooling Cryorig H7 with NF-A12x25


24 minutes ago, TVwazhere said:

These nerds can't hold a candle to the power of FOLDING!!!

I'd caution that all FLOPS are not the same. I'd guess that folding likely uses FP32, whereas supercomputers are tested with FP64. Consumer GPUs are pretty fast at FP32 but horrifically bad at FP64. For high-performance GPUs (not consumer tier), the FP64 rate is at best 1/2 the FP32 rate. The best consumer GPU for FP64 in recent times is the Radeon VII at 1/4 rate, only because it was a cut-down enterprise part. RDNA3 is now 1/32 rate, and Ada is 1/64 rate. CPUs can do the 1/2 rate, but if I'm not mistaken folding dropped CPU support a very long time ago.

 

Also, the rated FLOPS for supercomputers is the throughput achieved in a version of Linpack, which needs everything to coordinate with everything else. Folding is trivially parallel in comparison. As such, it may be more appropriate to compare against the supercomputer's peak rate.

 

Anyway, in general, if you got the supercomputer to run a trivially parallel FP32 workload, it would probably do worse than folding. If you got the folders to run an FP64 workload, it would be a LOT worse than a supercomputer.
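To put rough numbers on how much those ratios matter, here's a quick Python sketch; the FP32 figures are ballpark values I'm assuming for illustration, not official specs:

# Sketch: effect of the FP64:FP32 ratio on double-precision throughput.
# FP32 numbers are approximate and for illustration only.
gpus = {
    # name: (approx. FP32 TFLOPS, FP64:FP32 ratio)
    "Radeon VII (1/4 rate)":      (13.4, 1 / 4),
    "RDNA3 7900 XTX (1/32 rate)": (61.0, 1 / 32),
    "Ada RTX 4090 (1/64 rate)":   (82.6, 1 / 64),
    "Enterprise GPU (1/2 rate)":  (60.0, 1 / 2),
}

for name, (fp32, ratio) in gpus.items():
    print(f"{name:28s} FP32 ~{fp32:5.1f} TFLOPS -> FP64 ~{fp32 * ratio:5.2f} TFLOPS")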



[attached image]

 

Ouch, roughly half the Rmax for the same power. I really hope that HBM gives some real-world workflow improvement, which it should. But damn, on paper that looks bad.


2 minutes ago, leadeater said:

[attached image]

 

Ouch, roughly half the Rmax for the same power. I really hope that HBM gives some real-world workflow improvement, which it should. But damn, on paper that looks bad.

I did cover that in the 1st post. It is understood the listed power is the full system power, but the performance submission was running on half the system. It should improve when they retest under full configuration.



10 minutes ago, porina said:

but if I'm not mistaken folding dropped CPU support a very long time ago.

Oh no, CPU folding never went away; it's just highly not recommended from a power-efficiency standpoint. But there are folding work units that are CPU-only, or at least there were; I'm not sure if they reworked those to be able to run on GPU.


3 minutes ago, porina said:

I did cover that in the 1st post. It is understood the listed power is the full system power, but the performance submission was running on half the system. It should improve when they retest under full configuration.

I did see that, but it's also quite dubious. First, who actually measures power like that? And secondly, as far as I know, Top 500 doesn't treat those as valid power figures, so it wouldn't be allowed on the Green 500 list, not until there's a more verified deployment and facility power figure.

 

Even if that is projected power usage from half(ish) of the deployment, it can't properly take into account the extra power required for cooling or any loss in efficiency due to increased heat.

 

So not only do they need to double the running hardware, they also need to double the output from it to hit the projected performance target. Suffice to say, LINPACK is not treating this system favorably on paper. Hopefully it's like Fugaku, where the LINPACK Rmax is quite unrelated to what is actually run on it and how it performs compared to a different system architecture.


If the HPE Cray nodes are like the AMD HPE Cray EX235a, then it's 1 CPU and 4 GPUs per node, two nodes per blade.

 

Each node would be ~568 "cores" (CPU + GPU):

  • CPU: ~350W
  • GPU: 4x 600W (2400W total)
  • Power per node: ~2750W

4,742,808 cores / 568 cores per node ≈ 8,350 nodes

8,350 nodes × ~2,750 W ≈ 22,962,500 W (22.96 MW) purely from the nodes alone.

 

Either there are twice as many GPUs per CPU, they are running customized SKUs at something like half the power of the official public SKUs, or some combination of both, plus other unknowns.
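Written out as a quick sketch with those same assumptions (the per-node figures are guesses, not published specs):

# Back-of-the-envelope node power estimate using the assumed layout above
# (1 CPU + 4 GPUs per node, ~568 CPU+GPU "cores" per node). Inputs are guesses.
total_cores    = 4_742_808   # core count from the Top 500 entry
cores_per_node = 568         # assumed
cpu_watts      = 350         # assumed
gpu_watts      = 600         # assumed, per GPU
gpus_per_node  = 4

nodes      = round(total_cores / cores_per_node)     # ~8,350 nodes
node_watts = cpu_watts + gpus_per_node * gpu_watts   # ~2,750 W per node
total_mw   = nodes * node_watts / 1e6                # ~22.96 MW, nodes only

print(f"{nodes} nodes x {node_watts} W = {total_mw:.2f} MW (excludes cooling, network, storage)")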


50 minutes ago, leadeater said:

I did see that, but it's also quite dubious.

I'm just relaying what info we have from what I'd consider a reputable potato.

 

50 minutes ago, leadeater said:

Suffice to say, LINPACK is not treating this system favorably on paper.

I get the feeling the submission was just something to get on the scoreboard, and not something that was heavily optimised. It was stated it was only lightly optimised. These benchmarks aren't like Cinebench. Can't just press a button and get a good number at the end. I hate to imagine the tweaking needed to get it to run efficiently on what is not a simple topology. I've long wanted to run this on consumer-tier kit but I gave up just trying to understand how to run it. Note there are some pre-compiled binaries floating around, but they are not the same version, with unknown and likely very out-of-date math libraries, e.g. lacking modern CPU feature support.



4 hours ago, porina said:

These benchmarks aren't like Cinebench. Can't just press a button and get a good number at the end. I hate to imagine the tweaking needed to get it to run efficiently on what is not a simple topology.

LINPACK itself isn't actually that great either, not when it comes to the actual reality of usage and performance. The Rmax etc. numbers are actually quite meaningless for anything other than Top 500 🙃

 

Having HBM doesn't and won't increase Rmax at all, but for actual usage it'll allow large datasets that are memory-throughput bound to not run like garbage on CPU.


4 hours ago, leadeater said:

LINPACK itself isn't actually that great either, not when it comes to the actual reality of usage and performance. The Rmax etc. numbers are actually quite meaningless for anything other than Top 500 🙃

That problem isn't going away as long as a single benchmark is used. Either the Linpack or Top500 site does say the benchmark results represent the performance of running the benchmark, or words to that effect! Different systems will be used for different purposes. Linpack does have what I consider some nice characteristics: it's FP64-heavy, which matches traditional HPC usage, although I hear lower precision is gaining traction in some areas for the perf boost it can give. It is not trivially parallel, so node-to-node connectivity affects the results. I guess it also has the historic weight behind it. Has it always been used? A radical change in test methodology would make comparisons to past lists difficult.

 

4 hours ago, leadeater said:

Having HBM doesn't and won't increase Rmax at all, but for actual usage it'll allow large datasets that are memory-throughput bound to not run like garbage on CPU.

I was skimming the documentation again recently. I do see that to optimise throughput you apparently want to size the task to fill but not exceed available memory. So maybe HBM quantity might help to some degree. I'm guessing that once bandwidth is sufficient, more doesn't help as you're execution bound.
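If I'm reading the HPL tuning guidance right, the rule of thumb is to pick the problem size N so the 8·N² byte matrix fills most of, but not more than, available memory. A rough sketch, with made-up memory capacities:

import math

# Rule-of-thumb HPL problem size: the N x N FP64 matrix takes 8*N^2 bytes,
# so target ~80% of total memory. The memory capacities below are hypothetical.
def hpl_problem_size(total_mem_gib: float, fill_fraction: float = 0.8) -> int:
    mem_bytes = total_mem_gib * 1024**3
    return int(math.sqrt(fill_fraction * mem_bytes / 8))

print(hpl_problem_size(64))        # e.g. a node with 64 GiB of HBM only
print(hpl_problem_size(64 + 512))  # the same node with 512 GiB of DDR on top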



2 hours ago, porina said:

That problem isn't going away as long as a single benchmark is used. Either the Linpack or Top500 site does say the benchmark results represent the performance of running the benchmark, or words to that effect! Different systems will be used for different purposes. Linpack does have what I consider some nice characteristics: it's FP64-heavy, which matches traditional HPC usage, although I hear lower precision is gaining traction in some areas for the perf boost it can give. It is not trivially parallel, so node-to-node connectivity affects the results. I guess it also has the historic weight behind it. Has it always been used? A radical change in test methodology would make comparisons to past lists difficult.

Top 500 actually has an AI ranking list and another standardized benchmark ranking as well; I don't remember when those were added though. The Fugaku cluster actually jumped up to 3rd place for both, from memory; those custom vector units Fujitsu made are apparently rather good.

 

2 hours ago, porina said:

Linpack does have what I consider some nice characteristics: it's FP64-heavy, which matches traditional HPC usage

Yep, but the issue is that it's just a benchmark, and tailoring the execution size to fit perfectly within all the layers of cache just doesn't match real workloads, like simulating the decay of nuclear material. For example, how useful is it in scientific research to calculate Pi, and what benefit is there in optimizing code and rating performance based on that when it isn't actually what is going to be run?

 

We obviously have to just pick and choose how we benchmark systems like this, but it's also good to remember it's just a benchmark, and the rank in the list doesn't have a direct bearing on how good or bad each system/cluster would be for its designed workload. Fugaku is again a great example of that: it's really good at what it was made for, and an H100 is also really great at what it's designed for; neither would run particularly well what the other was designed to run.

 

2 hours ago, porina said:

I guess it also has the historic weight behind it. Has it always been used? A radical change in test methodology would make comparisons to past lists difficult.

Yep, but I'm not saying it should or even needs to change. It's no different to how Cinebench scores have no direct relation to how well a CPU will perform in games; 3D V-Cache, for example, does nothing at all for CB but does a great deal for games.

 

2 hours ago, porina said:

I'm guessing that once bandwidth is sufficient, more doesn't help as you're execution bound.

That's the big selling point for Intel Xeon Max with HBM: being able to work on a larger dataset and having the necessary bandwidth to benefit from it. Breaking things down into smaller data sizes may not always be wanted or possible, so it's nice to not have to do that.

 

I remember one of the big things for IBM Roadrunner was that it was the first time Boeing was able to visualize the entire 747 in complete detail, with every single part, which helped them better see how parts interacted in the plane. Before then they always did cross-sections, with the error/difficulty that brings, or used reduced-detail models. I'm fairly sure it was just a technical demo though, and they could do that today without as much trouble.


42 minutes ago, leadeater said:

Yep, but the issue is that it's just a benchmark, and tailoring the execution size to fit perfectly within all the layers of cache just doesn't match real workloads, like simulating the decay of nuclear material. For example, how useful is it in scientific research to calculate Pi, and what benefit is there in optimizing code and rating performance based on that when it isn't actually what is going to be run?

I agree that one piece of software does not necessarily represent other software. I'm not sure how restating it in many different ways moves us forward.

 

Calculating Pi I will comment on a bit more, because I think that is more about the journey than the destination. Is there a use for countless digits of Pi? Some people have fun looking for patterns in them, but otherwise those digits aren't necessary for general use. I have been following Y-cruncher, which has been used to set those records for some time, regardless of who is running it on whatever hardware. The question is: how do you split up the work so it is completed efficiently? It's a classic HPC-like problem. You have to decide how to break things down and keep all the compute in sync, as the data depends on everything else. Does it represent all other workloads? No! But techniques developed in implementing it efficiently could be carried over to other use cases.



7 minutes ago, porina said:

I agree that one piece of software does not necessarily represent other software. I'm not sure how restating it in many different ways moves us forward.

I know, it just seemed like the point wasn't clear: no matter how much anyone optimizes the LINPACK benchmark run to get a great Rmax, it's basically irrelevant for anything other than Top 500, that's what I'm saying. It's not a bad benchmark, it's just really easy to misuse the result 🙂


38 minutes ago, leadeater said:

I know, it just seemed like the point wasn't clear: no matter how much anyone optimizes the LINPACK benchmark run to get a great Rmax, it's basically irrelevant for anything other than Top 500, that's what I'm saying. It's not a bad benchmark, it's just really easy to misuse the result 🙂

It's like basing the productivity performance and quality of a graphics card on how fast that card can run Alan Wake 2. It's an irrelevant metric, but it's there.


17 hours ago, TVwazhere said:

These nerds can't hold a candle to the power of FOLDING!!!

 

https://www.tomshardware.com/news/folding-at-home-worlds-top-supercomputers-coronavirus-covid-19

Yeah, but you can't use Folding@home to do unproductive things like war simulations.

Specs: Motherboard: Asus X470-PLUS TUF gaming (Yes I know it's poor but I wasn't informed) RAM: Corsair VENGEANCE® LPX DDR4 3200Mhz CL16-18-18-36 2x8GB

            CPU: Ryzen 9 5900X          Case: Antec P8     PSU: Corsair RM850x                        Cooler: Antec K240 with two Noctura Industrial PPC 3000 PWM

            Drives: Samsung 970 EVO plus 250GB, Micron 1100 2TB, Seagate ST4000DM000/1F2168 GPU: EVGA RTX 2080 ti Black edition


6 hours ago, leadeater said:

Yep but the issue is that it's just a benchmark and tailoring the execution size to fit perfectly within all the layers of cache just doesn't fit well with real workloads, like simulating the decay for nuclear material. For example how actually useful is it in scientific research to calculate Pi and what benefit does that give to optimize code and rate performance based on that when that isn't actually what is going to be done.

A good example of this is training AI these days. At the Tesla AI Day, there was mention that the bottleneck had become all the data they were feeding into it, and that a lot of the time the cores were effectively just sitting there idling, waiting for data to come in to be processed. IIRC a similar thing happened with Folding@home, where they had so much extra work during a certain period that it didn't matter how many people were folding, because the backend effectively couldn't handle distributing all the work (I could be misremembering).

 

On a random tangent... the number of digits of Pi we have has been impractically large since like the 1400s, when it was calculated out to 16 digits, which is more than even NASA really needs (someone once said 15 digits of Pi are enough for almost everything). Even by 1706 it was calculated out to 100 digits... which is more digits than we have a practical use for. Every extra digit of Pi effectively gives a 10x increase in precision (if you had 1 cm accuracy before, adding one digit of Pi gives you ~1 mm accuracy).
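To illustrate the "one extra digit is roughly 10x precision" point, a quick sketch (the circle size and digit counts are arbitrary, purely for illustration):

from decimal import Decimal, getcontext, ROUND_DOWN

# Circumference error when Pi is truncated to n digits, for a circle roughly
# the size of Earth's orbit (radius ~1.5e11 m). Illustrative numbers only.
getcontext().prec = 50
PI_REF   = Decimal("3.141592653589793238462643383279")  # 30-digit reference value
radius_m = Decimal("1.5e11")

for digits in (5, 10, 15, 20):
    truncated = PI_REF.quantize(Decimal(1).scaleb(-digits), rounding=ROUND_DOWN)
    error_m = 2 * radius_m * (PI_REF - truncated)
    print(f"{digits:2d} digits -> circumference error ~{error_m:.2E} m")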

 

 

 

In regard to this topic, it's really hard to define what constitutes the top of a category like supercomputers... a super powerful supercomputer might be just as slow processing certain types of datasets. When the scale of things gets that large, it almost becomes either tuning it for the specific task or renting out the compute.

3735928559 - Beware of the dead beef


On 11/13/2023 at 4:44 PM, porina said:

I'd caution that all FLOPS are not the same-

Hey man I'm just trying to be a silly guy

Quote

if I'm not mistaken folding dropped CPU support a very long time ago.

F@H does still support CPU folding; it's just "inefficient" in points compared to GPU folding. However, CPU folding has different molecule work units associated with it than GPU folding does, so CPU folding is still important for research (thank you @leadeater)

"Put as much effort into your question as you'd expect someone to give in an answer"- @Princess Luna

Make sure to Quote posts or tag the person with @[username] so they know you responded to them!

 RGB Build Post 2019 --- Rainbow 🦆 2020 --- Velka 5 V2.0 Build 2021

Purple Build Post ---  Blue Build Post --- Blue Build Post 2018 --- Project ITNOS

CPU i7-4790k    Motherboard Gigabyte Z97N-WIFI    RAM G.Skill Sniper DDR3 1866mhz    GPU EVGA GTX1080Ti FTW3    Case Corsair 380T   

Storage Samsung EVO 250GB, Samsung EVO 1TB, WD Black 3TB, WD Black 5TB    PSU Corsair CX750M    Cooling Cryorig H7 with NF-A12x25

