Cray's new supercomputer Shasta powered by EPYC and Volta-Next

cj09beira

We got quite a lot of info today on this new supercomputer. As we already knew, it will be powered by EPYC, using the Milan architecture, and it will also use Nvidia's Volta-Next for GPU acceleration.

Quote

AMD recently announced its EPYC Rome processors, the first 7nm data center chips on the market, but the company is already moving forward with its next-generation products. Here at Supercomputer 2018, the US Department of Energy (DOE) announced that its Perlmutter supercomputer would come armed with AMD's unreleased EPYC Milan processors. The new supercomputer will also use Nvidia's "Volta-Next" GPUs, with the two combining to make an exascale-class machine that will be one of the fastest supercomputers in the world.

The Perlmutter supercomputer will be built using Cray's Shasta supercomputer platform, which was also on display here at the show. The supercomputer will be built with a mixture of both CPU and GPU nodes, with the CPU node pictured above. This watercooled chassis houses eight AMD Milan CPUs. We see four copper waterblocks that cover the Milan processors, while four more processors are mounted inverted on the PCBs between the DIMM slots. This system is designed for the ultimate in performance density, so all the DIMM sticks are also watercooled.

So it seems the supercomputer will be made of CPU and GPU nodes, all watercooled.

 

The CPU node will be made of 8 sockets, each with 8 DIMMs, which will eventually be populated with Milan CPUs. Just for reference, using Rome CPUs we would get 512 cores and 1,024 threads with up to 16 TB of RAM.

[Photo: Shasta CPU node]
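As a back-of-the-envelope check (the 64-core/128-thread Rome parts and 256 GB DIMMs below are my assumptions, not a published Cray spec), the numbers work out like this:

```python
# Rough per-blade math for the Shasta CPU node, using Rome as the stand-in.
# Assumptions (mine, not Cray's): 64-core SMT2 sockets and 256 GB DIMMs.
SOCKETS_PER_BLADE = 8
DIMMS_PER_SOCKET = 8
CORES_PER_SOCKET = 64      # top-end EPYC Rome part (assumed)
THREADS_PER_CORE = 2       # SMT2
DIMM_CAPACITY_GB = 256     # assumed, to reach the quoted 16 TB figure

cores = SOCKETS_PER_BLADE * CORES_PER_SOCKET
threads = cores * THREADS_PER_CORE
ram_tb = SOCKETS_PER_BLADE * DIMMS_PER_SOCKET * DIMM_CAPACITY_GB / 1024

print(f"{cores} cores, {threads} threads, {ram_tb:.0f} TB of RAM per blade")
# -> 512 cores, 1024 threads, 16 TB of RAM per blade
```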

 

The GPU node will be made of 4 Nvidia GPUs using a post-Volta architecture, all connected to a single AMD Milan CPU (25 GB/s of bandwidth to each GPU).

[Image: GPU node slide]
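A similar quick sanity check on the host links, assuming the quoted 25 GB/s is a per-GPU, per-direction figure (my reading of the slide, not a confirmed spec):

```python
# Aggregate CPU <-> GPU bandwidth for one Shasta GPU node.
# Assumption (mine): the 25 GB/s figure is per GPU link, per direction.
GPUS_PER_NODE = 4
LINK_BW_GB_S = 25   # GB/s from the Milan CPU to each GPU

aggregate_gb_s = GPUS_PER_NODE * LINK_BW_GB_S
print(f"{aggregate_gb_s} GB/s of aggregate host bandwidth across {GPUS_PER_NODE} GPUs")
# -> 100 GB/s of aggregate host bandwidth across 4 GPUs
```

For what it's worth, 25 GB/s per link is roughly in the ballpark of what a PCIe 4.0 x16 connection delivers in one direction, so nothing exotic would necessarily be needed on the host side.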

Here is what the integrated Slingshot switch behind the node looks like:

[Photo: integrated Slingshot switch behind the node]

Not everything is watercooled though; here is the top-of-rack network switch:

[Photo: top-of-rack network switch]

 

This supercomputer will use NAND flash only for its storage.

[Image: storage slide]

 

Thoughts:

I haven't seen many supercomputers before, but this chassis seems to allow for some great density. It will probably have even more cores, as AMD will likely increase core counts again with Milan.

Just a shame that we don't have photos of the GPU node; it would be interesting to see how that was done. Still, I'm very impressed with how much work goes into these machines.

This should help put AMD back in the eyes of customers.

Very interested to see more about this project.

 

Source: https://www.tomshardware.com/news/amd-epyc-milan-shasta-exascale,38067.html


11 minutes ago, cj09beira said:

Just for reference, using Rome CPUs we would get 512 cores and 1,024 threads with up to 16 TB of RAM

Captain Obvious here, informing you..... Holy shit, that's a lot of cores!

 

Also, is that 4 Volta GPUs for every CPU (meaning 32 GPUs total)?

"Put as much effort into your question as you'd expect someone to give in an answer"- @Princess Luna

Make sure to Quote posts or tag the person with @[username] so they know you responded to them!

 RGB Build Post 2019 --- Rainbow 🦆 2020 --- Velka 5 V2.0 Build 2021

Purple Build Post ---  Blue Build Post --- Blue Build Post 2018 --- Project ITNOS

CPU i7-4790k    Motherboard Gigabyte Z97N-WIFI    RAM G.Skill Sniper DDR3 1866mhz    GPU EVGA GTX1080Ti FTW3    Case Corsair 380T   

Storage Samsung EVO 250GB, Samsung EVO 1TB, WD Black 3TB, WD Black 5TB    PSU Corsair CX750M    Cooling Cryorig H7 with NF-A12x25

Link to comment
Share on other sites

Link to post
Share on other sites

4 minutes ago, TVwazhere said:

Captain Obvious here, informing you..... Holy shit, that's a lot of cores!

 

Also, is that 4 Volta GPUs for every CPU (meaning 32 GPUs total)?

I don't think so; in the GPU node slide they say 4 GPUs and one CPU. If the GPU node is the same size as the CPU node, I would say up to 2 CPUs and 8 GPUs, but that is just speculation; it might not have fit.

Edit: it's probably just 4 GPUs per node though.


5 minutes ago, TVwazhere said:

Captain Obvious here, informing you..... Holy shit, that's a lot of cores!

Yep, we'll be able to start talking about kilo cores in single machines soon enough - especially if AMD decides to be absolutely savage and launch Milan with 128 cores.


Why didn't they just call it Ampere? Nvidia, please, oh please.


46 minutes ago, TVwazhere said:

Captain Obvious here, informing you..... Holy shit, that's a lot of cores!

 

Also, is that 4 Volta GPUs for every CPU (meaning 32 GPUs total)?

Anyone feel like bringing back the MOAR CORES meme that was thrown at AMD during the Bulldozer days? :D


1 minute ago, RejZoR said:

Anyone feel like bringing back the MOAR CORES meme that was thrown at AMD during the Bulldozer days? :D

Maybe, but this time it's moar cores (that are actually fast).

The best one IMO is the Star Wars meme.


2 hours ago, Sauron said:

Yep, we'll be able to start talking about kilo cores in single machines soon enough - especially if AMD decides to be absolutely savage and launch Milan with 128 cores.

Single machines? We already had a million cores in 2013.

 

https://engineering.stanford.edu/magazine/article/stanford-researchers-break-million-core-supercomputer-barrier


3 minutes ago, Sauron said:

That's not a single machine, it's a cluster.

You mean a single rack, or a server chassis of x number of units? Because a distributed-memory supercomputer like that is still considered a single machine...

 

Hence the Top500 exists to benchmark large machines like that.


2 hours ago, RejZoR said:

Anyone feel like bringing back the MOAR CORES meme that was thrown at AMD during the Bulldozer days? :D

 


5 minutes ago, S w a t s o n said:

 

That's vaporware? Doesn't count, and doesn't provide more cores than what Intel could already do.


30 minutes ago, cj09beira said:

That's vaporware? Doesn't count, and doesn't provide more cores than what Intel could already do.

Well, the number of sockets does matter for licensing costs, but yeah, no one is really "looking forward" to Cascade Lake-AP at this point. AWS signed on with AMD for a reason.


1 hour ago, Amazonsucks said:

You mean a single rack, or a server chassis of x number of units? Because a distributed-memory supercomputer like that is still considered a single machine...

 

Hence the Top500 exists to benchmark large machines like that.

I think what I meant is pretty obvious, and debating meaningless semantics is pretty low on my list of priorities.


3 hours ago, S w a t s o n said:

Why didn't they just call it Ampere, nvidia please oh please

Ampere is the Turing respin on 7nm. We haven't heard the code name for the 7nm Volta-Next, though it's nice that Cray confirmed most of it for us.


9 minutes ago, Taf the Ghost said:

Ampere is the Turing respin on 7nm. We haven't heard the code name for the 7nm Volta-Next, though it's nice that Cray confirmed most of it for us.

With the sheer amount of misinformation surrounding both Ampere's and Turing's names, I'd wait and see.


6 hours ago, Sauron said:

I think what I meant is pretty obvious, and debating meaningless semantics is pretty low on my list of priorities.

No, it's not really clear what you meant. In the HPC community, Sequoia and other large HPC machines are typically referred to as a single machine...

 

Hence all one million or however many cores, and the rest of the hardware, including petabytes of RAM and disks, megawatts of power supplies, etc., are all called something like Sequoia, Summit, Sierra, Earth Simulator, Titan, K Computer, Oakforest-PACS, or some other name for the entire system. This thread specifically references Shasta systems like HLRS Hawk.

 

@cj09beira

 

I think the most interesting part of Shasta is its Slingshot interconnect. It'll be interesting to see how it compares to Tofu3 and other exascale interconnects.

 

https://www.cray.com/blog/meet-slingshot-an-innovative-interconnect-for-the-next-generation-of-supercomputers/

 


11 minutes ago, Nicnac said:

imagine folding on this^^

We can only dream. Back in 2011 someone folded on the French atomic energy supercomputer for a while; it was like in the top 50 or something and only produced 12 million a day.


14 hours ago, Amazonsucks said:

You mean a single rack, or a server chassis of x number of units? Because a distributed-memory supercomputer like that is still considered a single machine...

 

Hence the Top500 exists to benchmark large machines like that.

They are still treated as nodes; workload managers allocate jobs to nodes based on a lot of rules. CERN, for example, can only run up to 300k of their 500k CPUs due to power and cooling (there are multiple locations where nodes are located, all under OpenStack). When you submit a job you give it workload tags so the system knows where to run it, e.g. GPU nodes or memory-heavy nodes. The workload manager will make sure jobs don't get allocated to nodes in racks that are at max power capability, stuff like that.

 

Have a look into SLURM; it's the most popular workload manager. Most of the Top500 use it.

https://slurm.schedmd.com/

https://en.wikipedia.org/wiki/Slurm_Workload_Manager
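
As a rough illustration of those workload tags (the partition name, GRES string, and resource numbers below are made up for the example, not taken from any real cluster), a job declares what kind of node it needs in its batch directives and the scheduler places it accordingly. Here's a minimal sketch of driving that from Python:

```python
# Minimal sketch: build a SLURM batch script and hand it to sbatch.
# Partition name, GRES string, and resource numbers are invented for the example.
import subprocess
import tempfile

job_script = """#!/bin/bash
#SBATCH --job-name=demo-gpu-job
#SBATCH --partition=gpu        # "workload tag": run this on the GPU nodes
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:4           # 4 GPUs per node, like a 1 CPU + 4 GPU blade
#SBATCH --mem=256G             # or steer it toward memory-heavy nodes instead
#SBATCH --time=01:00:00

srun ./my_simulation           # hypothetical application binary
"""

with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as handle:
    handle.write(job_script)
    script_path = handle.name

# sbatch queues the job; SLURM only starts it once nodes matching the
# requested partition, GRES, and memory tags are free (and within any
# power/rack rules the site has configured).
subprocess.run(["sbatch", script_path], check=True)
```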

 

Intel and GPU-based systems don't combine into single logical entities; IBM, on the other hand, does have the capability for that type of thing. You don't actually want to make massive blocks of logical compute nodes across chassis and cabinets though, as you start to hit barriers like bandwidth and latency. It's more efficient, depending on the task, to move that logic up to your code and make it aware of the hardware boundaries, because you get much better control that way.
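
On that last point, here's a minimal mpi4py sketch of what "making the code aware of hardware boundaries" can look like: MPI can split the job's ranks into per-node groups, so the chatty communication stays inside a node and only reduced results cross the interconnect. (This is a generic illustration, not how any particular Shasta code is written.)

```python
# Minimal sketch: discover which MPI ranks share a physical node and build a
# per-node communicator, so heavy traffic can stay on-node.
from mpi4py import MPI

world = MPI.COMM_WORLD

# Group ranks by shared-memory domain, i.e. by physical node.
node_comm = world.Split_type(MPI.COMM_TYPE_SHARED)
node_rank = node_comm.Get_rank()          # rank within this node
host = MPI.Get_processor_name()

# Pattern: reduce locally first, then let one rank per node speak for its
# node-mates across the interconnect.
local_value = world.Get_rank()            # stand-in for real per-rank data
node_total = node_comm.reduce(local_value, op=MPI.SUM, root=0)

if node_rank == 0:
    print(f"{host}: node-local sum = {node_total}")
```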


9 minutes ago, leadeater said:

They are still treated as nodes; workload managers allocate jobs to nodes based on a lot of rules. CERN, for example, can only run up to 300k of their 500k CPUs due to power and cooling (there are multiple locations where nodes are located, all under OpenStack). When you submit a job you give it workload tags so the system knows where to run it, e.g. GPU nodes or memory-heavy nodes. The workload manager will make sure jobs don't get allocated to nodes in racks that are at max power capability, stuff like that.

 

Have a look into SLURM; it's the most popular workload manager. Most of the Top500 use it.

https://slurm.schedmd.com/

https://en.wikipedia.org/wiki/Slurm_Workload_Manager

 

Intel and GPU-based systems don't combine into single logical entities; IBM, on the other hand, does have the capability for that type of thing. You don't actually want to make massive blocks of logical compute nodes across chassis and cabinets though, as you start to hit barriers like bandwidth and latency. It's more efficient, depending on the task, to move that logic up to your code and make it aware of the hardware boundaries, because you get much better control that way.

I understand the difference between nodes and a whole system. I am merely stating that one section of the K Computer (as one random example) is still part of the same machine. Although HPC systems can be and are partitioned and not always fully utilized, it's accurate to state that the IBM Blue Gene Sequoia was the first million-core single machine.


1 hour ago, Nicnac said:

imagine folding on this^^

Russian scientists have already been arrested for doing that (mining, but same diff).

