Ampere Altra Max 128-Core Arm Processor Appears in the Wild

Lightwreather

Summary

The last few years have seen significant improvements in the performance of Arm-powered servers, and today's news is no exception. ServeTheHome (STH) has published images of an engineering sample of Ampere's Altra Max M128-30 processor, which has 128 Arm-based cores.

[Image: Ampere Altra Max M128-30, front close-up]

Quotes

Quote

Ampere Computing, a startup company founded in 2017, focuses on creating a more modern approach to server infrastructure. Using Arm-based processors, the company hopes to work with hyperscalers, enabling them to run all of the most intense workloads on their custom processors.

The company has already announced its Ampere Altra Max processors, which are supposed to offer up to 128 cores on a single die, along with some impressive specifications. Each of those 128 cores is based on Arm's v8.2 specification and can run at a maximum clock speed of 3 GHz. Each core has 64KB of L1 I-cache (instruction cache), 64KB of L1 D-cache (data cache), and 1MB of L2 cache. The system-level cache is 16MB, and each core has dual 128-bit SIMD units.
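To put those cache figures in perspective, here is a quick back-of-the-envelope tally of total on-die cache using only the sizes quoted above. It assumes KB = 1024 bytes, and the helper name `cache_totals` is just for illustration. The takeaway is that about 90% of the cache is private L1/L2, with the shared SLC a comparatively small slice.

```python
# Cache sizes quoted above for the Altra Max M128-30 (KB = 1024 bytes assumed).
CORES = 128
L1I_KB = 64          # L1 instruction cache, per core
L1D_KB = 64          # L1 data cache, per core
L2_KB = 1024         # 1MB L2, per core
SLC_KB = 16 * 1024   # 16MB system-level cache, shared by all cores


def cache_totals(cores: int, l1i: int, l1d: int, l2: int, slc: int) -> dict:
    """Return private (L1 + L2), shared (SLC) and overall cache capacity in KB."""
    private = cores * (l1i + l1d + l2)
    return {"private_kb": private, "shared_kb": slc, "total_kb": private + slc}


totals = cache_totals(CORES, L1I_KB, L1D_KB, L2_KB, SLC_KB)
for name, kb in totals.items():
    print(f"{name}: {kb} KB ({kb // 1024} MB)")
print(f"private share: {totals['private_kb'] / totals['total_kb']:.0%}")
```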

With this many cores, the chip needs a memory system to match, so Ampere uses an 8-channel, 72-bit DDR4-3200 memory controller that supports up to 16 DIMMs, translating to 4TB of RAM per socket. For connectivity, it has 128 lanes of PCIe Gen4 and four x16 CCIX links, so cache coherency is covered as well.
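The memory figures above can be turned into per-DIMM capacity and a theoretical peak bandwidth with a few lines of arithmetic. This is a rough sketch: it assumes 1TB = 1024GB and that each 72-bit channel carries 64 data bits plus 8 ECC bits, and the result is a theoretical ceiling rather than measured bandwidth. The per-core figure is relevant to the cache and bandwidth concerns raised later in the thread.

```python
# Figures quoted above; a 64-bit data path per channel is assumed
# (the 72-bit channel width includes 8 ECC bits).
CHANNELS = 8
MAX_DIMMS = 16
MAX_RAM_GB = 4 * 1024      # 4TB per socket, taking 1TB = 1024GB
MT_PER_S = 3200            # DDR4-3200: 3200 million transfers per second
DATA_BYTES = 64 // 8       # 8 bytes moved per transfer per channel
CORES = 128

dimms_per_channel = MAX_DIMMS // CHANNELS
gb_per_dimm = MAX_RAM_GB // MAX_DIMMS
# Theoretical peak only; sustained bandwidth in practice is lower.
peak_gb_s = CHANNELS * MT_PER_S * DATA_BYTES / 1000
per_core_gb_s = peak_gb_s / CORES

print(f"{dimms_per_channel} DIMMs per channel, {gb_per_dimm}GB per DIMM at 4TB")
print(f"peak bandwidth {peak_gb_s:.1f} GB/s, about {per_core_gb_s:.2f} GB/s per core")
```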

The appearance of this image suggests that the CPU is shipping and that customers may already have their hands on the Ampere Altra Max M128-30 128-core processor. This could mark the beginning of an era in which large cloud hyperscalers start purchasing Arm-based processors in addition to, or in place of, the x86 offerings that dominate the market today. If Ampere plays its cards right, the company could get looks from some big cloud service providers and bring more clients onto the Arm bandwagon.

Quote

This was a quick teaser, but an important one. Intel and AMD server CPUs get large launch events. Ampere is more focused on selling to cloud customers. As a result, its product launches are a bit different, to the point where folks may not know exactly when chips are out. We can confirm that the chips are out in the wild now (albeit this one is still marked as ES silicon).

My thoughts

So, this is interesting: Ampere Altra showing up in the wild on the heels of Intel's Rocket Lake Xeon announcement. Speaking of Intel: Intel, you really need to make a move quickly. AMD and Ampere now have 128-thread processors; you really need to step up your game. But yeah, this processor is really interesting, especially seeing as it is a socketable Arm processor. I wish Ampere all the best with their current and future products, since more competition is generally always better (not because I really like Arm).

Sources

Tom's Hardware

ServeTheHome

"A high ideal missed by a little, is far better than low ideal that is achievable, yet far less effective"

 

If you think I'm wrong, correct me. If I've offended you in some way, tell me what it is and how I can correct it. I want to learn, and along the way one can make mistakes. Being wrong helps you learn what's right.

Would be interesting to see some PPW numbers for this thing.

 

 

EDIT:// For clarification, the article says it’s marked as a 250W part, but it will be interesting to see how it performs at that power level.

1 hour ago, Spindel said:

Would be interesting to see some PPW numbers for this thing.

 

 

EDIT:// For clarification, the article says it’s marked as a 250W part, but it will be interesting to see how it performs at that power level.

https://www.anandtech.com/show/16315/the-ampere-altra-review

 

This is really just a product refresh to offer more cores. The review above covers the 80-core 250W CPU with the same Arm microarchitecture, so it's a rather good indicator of performance. Since the reviewed 80-core CPU is also a 250W part, you should only expect a moderate improvement: 60% more cores for probably ~30% more performance.
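Since the question was about perf-per-watt, here is what the ~30% estimate implies, sketched under the assumption that both parts actually draw about the same power because they share a 250W TDP. The 1.30 input is the estimate from the post, not a measurement, and `implied_scaling` is a made-up helper name. Under that assumption the perf/W gain equals the performance gain, while per-core throughput drops by roughly a fifth.

```python
def implied_scaling(core_ratio: float, perf_ratio: float, power_ratio: float = 1.0):
    """Per-core throughput and perf/W ratios implied by an overall perf estimate."""
    per_core = perf_ratio / core_ratio
    perf_per_watt = perf_ratio / power_ratio
    return per_core, perf_per_watt


core_ratio = 128 / 80                              # 1.6x cores, i.e. 60% more
per_core, ppw = implied_scaling(core_ratio, 1.30)  # ~30% more performance (estimate)
print(f"per-core throughput: {per_core:.2f}x, perf/W: {ppw:.2f}x")
```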

2 hours ago, J-from-Nucleon said:

So, this is interesting: Ampere Altra showing up in the wild on the heels of Intel's Rocket Lake Xeon announcement. Speaking of Intel: Intel, you really need to make a move quickly. AMD and Ampere now have 128-thread processors; you really need to step up your game.

For the types of workload these are targeted at, you have to consider platform performance. A simplistic core or thread comparison is negligent at best. Also be careful about which segment products are targeted at. Rocket Lake Xeons are more for small servers and workstations where massive core counts are not appropriate. Ice Lake Xeons, which go up to 40 cores per socket, are more likely to go against them in this space, and that's not far off AMD's 64 cores. Sapphire Rapids next year will be the one to watch, as it will finally be on a competitive process, not only in performance but also in efficiency. Intel have had the architecture, just not the ability to make it in recent years. AMD will probably follow with Zen 4 offerings after that, and the battle continues.

Gaming system: R7 7800X3D, Asus ROG Strix B650E-F Gaming Wifi, Thermalright Phantom Spirit 120 SE ARGB, Corsair Vengeance 2x 32GB 6000C30, RTX 4070, MSI MPG A850G, Fractal Design North, Samsung 990 Pro 2TB, Acer Predator XB241YU 24" 1440p 144Hz G-Sync + HP LP2475w 24" 1200p 60Hz wide gamut
Productivity system: i9-7980XE, Asus X299 TUF mark 2, Noctua D15, 64GB ram (mixed), RTX 3070, NZXT E850, GameMax Abyss, Samsung 980 Pro 2TB, random 1080p + 720p displays.
Gaming laptop: Lenovo Legion 5, 5800H, RTX 3070, Kingston DDR4 3200C22 2x16GB 2Rx8, Kingston Fury Renegade 1TB + Crucial P1 1TB SSD, 165 Hz IPS 1080p G-Sync Compatible

@porina You'll probably find this interesting: Altra Max has half the L3 cache of Altra (16MB vs 32MB). The Q80-33 was already heavily L3 cache limited in a few workloads, so I find it interesting that it's been halved, presumably a sacrifice made to fit 128 cores, since it's the same fab node and core microarchitecture. CCIX links increased from 2 to 4 as well.
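To see how much the halved L3 matters once it is spread over more cores, here is the shared cache per core for both chips, assuming it were split evenly (in practice it is shared dynamically, so this is only a rough capacity-pressure indicator). The function name is illustrative.

```python
def slc_per_core_kb(slc_mb: int, cores: int) -> float:
    """Shared L3/system-level cache per core, in KB, if split evenly."""
    return slc_mb * 1024 / cores


q80 = slc_per_core_kb(32, 80)     # Altra Q80-33: 32MB shared by 80 cores
m128 = slc_per_core_kb(16, 128)   # Altra Max M128-30: 16MB shared by 128 cores
print(f"Q80-33: {q80:.1f} KB/core, M128-30: {m128:.1f} KB/core ({q80 / m128:.1f}x less)")
```

So each core's notional share drops by about 3.2x, not just 2x, because the core count went up at the same time the cache was halved.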

3 hours ago, leadeater said:

https://www.anandtech.com/show/16315/the-ampere-altra-review

 

This is really just a product refresh to offer more cores. The review above covers the 80-core 250W CPU with the same Arm microarchitecture, so it's a rather good indicator of performance. Since the reviewed 80-core CPU is also a 250W part, you should only expect a moderate improvement: 60% more cores for probably ~30% more performance.

The 80 core version has a significant amount of power headroom in most workloads, averaging around 200W. 128 cores would put it over the 250W TDP but it looks like aiming for a slightly less aggressive frequency would keep power in check.
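The power argument can be sketched with a crude per-core model, using the ~200W average for the 80-core part mentioned here. This deliberately treats all package power as if it scaled with core count; uncore and I/O power do not, so the 128-core number it produces is pessimistic. It only illustrates why a lower clock would be needed to stay within 250W.

```python
AVG_POWER_80C_W = 200   # rough average draw of the 80-core part, per the review
TDP_W = 250

# Crude model: treat all package power as per-core power (ignores uncore/IO power,
# which does not grow with core count, so this overstates the 128-core figure).
per_core_w = AVG_POWER_80C_W / 80
naive_128c_w = per_core_w * 128
budget_per_core_w = TDP_W / 128

print(f"{per_core_w:.2f} W/core -> {naive_128c_w:.0f} W for 128 cores at the same clocks")
print(f"to fit {TDP_W} W: {budget_per_core_w:.2f} W/core")
```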

Quote

In terms of power-efficiency, the Q80-33 really operates at the far end of the frequency/voltage curves at 3.3GHz. While the TDP of 250W really isn’t comparable to the figures AMD and Intel are publishing, as average power consumption of the Altra in many workloads is well below that figure – ranging from 180 to 220W – let’s say a 200W median across a variety of workloads, with few workloads actually hitting that peak 250W.

The main problem I foresee is the tiny 16MB L3. That's half of the L3 the 80 core version had and it already was becoming cache starved in certain scenarios.

Quote

There are still workloads in which the Altra doesn’t do as well – anything that puts higher cache pressure on the cores will heavily favour the EPYC as while 1MB per core L2 is nice to have, 32MB of L3 shared amongst 80 cores isn’t very much cache to go around.

 

2 minutes ago, ScratchCat said:

The 80 core version has a significant amount of power headroom in most workloads, averaging around 200W. 128 cores would put it over the 250W TDP but it looks like aiming for a slightly less aggressive frequency would keep power in check.

Yep, but that is also largely due to being limited by insufficient L3 cache, memory bandwidth, or both, and since these are socket/platform compatible that isn't going to change. There's also a 300MHz clock reduction (3.3GHz to 3.0GHz), which is roughly 10%.

 

A 30% performance increase still seems fair to me, because 128 cores will run into platform limitations more readily than 80 cores do. Some workloads will gain much more than 30% and some much less, so when the benchmarks do come out I'm confident I'll be in the ballpark on average.
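For reference, the ceiling for this comparison is what you would get if throughput scaled perfectly with cores times clock, ignoring any cache or memory-bandwidth limits. That is an idealized assumption, not a prediction, and `naive_speedup` is a hypothetical helper. The gap between this ceiling and the ~30% guess is the share attributed to platform limitations.

```python
def naive_speedup(cores_new: int, ghz_new: float, cores_old: int, ghz_old: float) -> float:
    """Speedup if throughput scaled perfectly with cores x clock (no cache/memory limits)."""
    return (cores_new * ghz_new) / (cores_old * ghz_old)


ceiling = naive_speedup(128, 3.0, 80, 3.3)   # M128-30 vs Q80-33
print(f"ceiling: {ceiling:.3f}x, i.e. about {(ceiling - 1) * 100:.0f}% faster")
```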

 

Do expect slightly better dual-socket scaling, though, from the doubled interconnect.

 

5 minutes ago, ScratchCat said:

The main problem I foresee is the tiny 16MB L3. That's half of the L3 the 80 core version had and it already was becoming cache starved in certain scenarios.

I know, check the above post 😉

2 hours ago, leadeater said:

@porina You'll probably find this interesting: Altra Max has half the L3 cache of Altra (16MB vs 32MB). The Q80-33 was already heavily L3 cache limited in a few workloads, so I find it interesting that it's been halved, presumably a sacrifice made to fit 128 cores, since it's the same fab node and core microarchitecture. CCIX links increased from 2 to 4 as well.

Thanks. I don't usually pay close attention to Arm designs since they're more detached from anything I get really hands-on with. That design decision will be for a reason, and it may be targeted at particular workloads where that isn't so much a problem. Without checking, will the lower-core/higher-cache version still be available in parallel? That would give buyers the choice of what suits their needs.

2 hours ago, porina said:

Without checking, will the lower-core/higher-cache version still be available in parallel? That would give buyers the choice of what suits their needs.

Yea, buy the Altra Q CPUs instead.

 

2 hours ago, porina said:

That design decision will be for a reason, and it may be targeted at particular workloads where that isn't so much a problem

I would say it has to be die area: they were already rather costly, and one of their attractions is the low cost. They also have to make sure it fits in the same package size and pin configuration.

7 minutes ago, leadeater said:

I would say it has to be die area: they were already rather costly, and one of their attractions is the low cost. They also have to make sure it fits in the same package size and pin configuration.

I was thinking from the other direction. Someone out there can make use of the increased cores even with the reduced cache. They will be happy to have that option, and others can find a more suitable product as needed.

  • 3 weeks later...

I was going to make a new post about this, until I saw this thread heh

 

Anyway, there are public tests available for it: https://www.phoronix.com/scan.php?page=article&item=ampere-altramax-benchmarks

It'd be nice to see how much performance is on the table due to the lack of cache for all those cores.

 

[Embedded benchmark chart]

 

It surely does look competitive while consuming less power.

FX6300 @ 4.2GHz | Gigabyte GA-78LMT-USB3 R2 | Hyper 212x | 3x 8GB + 1x 4GB @ 1600MHz | Gigabyte 2060 Super | Corsair CX650M | LG 43UK6520PSA
ASUS X550LN | i5 4210u | 12GB
Lenovo N23 Yoga

5 hours ago, igormp said:

Anyway, there are public tests available for it: https://www.phoronix.com/scan.php?page=article&item=ampere-altramax-benchmarks

It'd be nice to see how much performance is on the table due to the lack of cache for all those cores.

Not far off my guess of a 30% performance increase over the Q80; it looks to be about a 40% increase. I'd like to see the test run again with only 80 cores enabled to see how it compares to the Q80.

Sorry, stupid question time:

 

What do “1P” and “2P” mean in the graph?

5 minutes ago, Spindel said:

Sorry, stupid question time:

 

What do “1P” and “2P” mean in the graph?

1 Processor (1 socket) and 2 Processors (2 sockets). Just the number of CPUs in the system.
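In other words, the per-socket specs simply multiply by the socket count. A tiny sketch using the per-socket figures from the first post (128 cores, up to 4TB RAM); it only covers totals, and real 2P performance also depends on the socket-to-socket interconnect, so it won't be a clean 2x.

```python
CORES_PER_SOCKET = 128       # Altra Max M128-30
MAX_RAM_TB_PER_SOCKET = 4    # per-socket maximum quoted in the first post


def system_totals(sockets: int) -> tuple:
    """Total cores and maximum RAM (TB) for a 1P or 2P system."""
    return sockets * CORES_PER_SOCKET, sockets * MAX_RAM_TB_PER_SOCKET


for sockets in (1, 2):
    cores, ram_tb = system_totals(sockets)
    print(f"{sockets}P: {cores} cores, up to {ram_tb}TB RAM")
```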

1 hour ago, leadeater said:

1 Processor (1 socket) and 2 Processors (2 sockets). Just the number of CPUs in the system.

Ahhh 👍 
