
Madlad engineers at IBM have made a chip with 32MB of L2 cache per core! (IBM Telum processor)

AlexGoesHigh


Summary

IBM has a product line known as IBM Z. Some of you know what it is, and for those that don't, it's IBM's mainframe offering. Yep, those old school room-filling computers from the 70's and 80's, this is that, except they are built with modern computer design philosophy, and these hulking beasts can still run programs written over 40 years ago. As for why IBM is still building modern day iterations of mainframes? Well, unlike your commodity x86 or ARM chip, these are the true battle-tested tanks of the computer world that never stop: downtime on such systems is measured in mere seconds per year (roughly 99.9999% expected uptime). That's why they are used for mission critical applications such as transaction processing by the big banks of the world, and since they are optimized for such applications they are much faster at them; if the same task were made to run on current day x86 or ARM servers, these beasts would run circles around them.

 

With that introduction out of the way, at Hot Chips IBM unveiled their next gen Z mainframe processor, the IBM Telum, the successor to the z15, and it features a radical departure from its predecessor's design: the cache structure. As a bit of a primer, the previous gen z15 had 256MB of L3 cache shared by 12 cores (which is already an insane amount of cache). IBM paired four of these chips with a special processor that acted as a system-wide L4 cache with 960MB available; this pairing is what IBM calls a unit. They then put four of these units together to make a full mainframe system, with the L4 cache chips addressable across all four units, and this is what gets sold to a customer: if you wanted one of those chips, you had to buy the full system like this.

 

Now with the Telum they ditched both the special processor and the gigachad L3 cache, and replaced all of it with a mega gigachad 32MB of L2 cache per core. Each chip has 8 cores paired with all that cache, for a staggering total of 256MB of L2, the same amount as the L3 available on the previous gen. Each L2 is private to its core like in a normal CPU, but now when a cache line gets evicted, instead of being dropped or written back to RAM like it would on a conventional system with no L3, it gets tagged as L3 data and placed in available space in another core's L2, effectively creating a virtual L3 cache. Yep, the madlad engineers at IBM have created a virtual on-CPU caching system (cue the mind blown meme here). But it doesn't stop there, there's still the L4 replacement: it works the same way as the virtual L3, except instead of landing in another core on the same chip, the line goes to the cache of another chip in the system and gets tagged as L4 data. So now the madlads have created virtual on-chip L3 cache and virtual on-system L4 cache!
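If the eviction flow is hard to picture, here's a tiny Python sketch of how I understand it. This is purely illustrative: the sizes, names and placement policy are made up and are nothing like IBM's actual hardware, which does all of this in silicon.

```python
# Toy model of the "virtual L3" idea: an evicted L2 line gets parked in another
# core's L2 and tagged as an L3 line, instead of being dropped or written back
# to memory. Invented sizes and policy, not IBM's implementation.

L2_LINES_PER_CORE = 4   # stand-in for 32 MB worth of cache lines
NUM_CORES = 8

class Core:
    def __init__(self, cid):
        self.cid = cid
        self.l2 = []        # list of (address, tag) where tag is "L2" or "vL3"

    def has_space(self):
        return len(self.l2) < L2_LINES_PER_CORE

cores = [Core(i) for i in range(NUM_CORES)]

def evict(owner, addr):
    """Try to keep an evicted L2 line on-chip as a virtual L3 line."""
    for other in cores:
        if other is not owner and other.has_space():
            other.l2.append((addr, "vL3"))
            return f"line {addr:#x} kept on chip in core {other.cid}'s L2 as virtual L3"
    return f"line {addr:#x} had to go back to memory"

# fill core 0's private L2, then evict its oldest line and watch where it lands
for i in range(L2_LINES_PER_CORE):
    cores[0].l2.append((0x1000 + i * 0x40, "L2"))
addr, _ = cores[0].l2.pop(0)
print(evict(cores[0], addr))
```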

 

To sum up, an IBM Telum processor has 32MB of private L2 cache per core across 8 cores, which adds up to 256MB of virtual L3 cache on the processor and 8GB (!) of virtual L4 cache in a complete system. A complete Telum system is built like this: each chip is paired with another chip on the same package for a duo of processors, each package is grouped with 3 other packages to form a four-package unit, and finally a complete system is made out of four of these units. That gives a total of 32 chips in a system, hence the 8GB of virtual L4 cache. Now, the keen-eyed might have noticed this is less total cache than the previous gen, but IBM says that all of these changes (plus improvements to the cores themselves and a process shrink from 14nm to 7nm, which I haven't covered) amount to a more than 40% per-socket performance improvement over the previous gen z15. Oh, and a full Telum system also comes to 256 total cores, BTW.
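The totals are easy to check; here's the arithmetic as a quick Python snippet (my own check of the numbers above, nothing official):

```python
# Cache and core totals for a full Telum system, from the per-core figures above.
l2_per_core_mb    = 32
cores_per_chip    = 8
chips_per_package = 2
packages_per_unit = 4
units_per_system  = 4

virtual_l3_per_chip_mb   = l2_per_core_mb * cores_per_chip                           # 256 MB
chips_per_system         = chips_per_package * packages_per_unit * units_per_system  # 32
cores_per_system         = chips_per_system * cores_per_chip                         # 256
virtual_l4_per_system_gb = chips_per_system * virtual_l3_per_chip_mb / 1024          # 8.0 GB

print(virtual_l3_per_chip_mb, chips_per_system, cores_per_system, virtual_l4_per_system_gb)
```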

 

Also, as was mentioned in the article's comments when a reader asked, a package has a TDP of ~400W, so they are toasty as well. Honestly, if you want a much better explanation, go read the source or watch 🖥🖥🥔's video below.

 

Quotes

Quote

The new system does away with the separate System Controller with the L4 cache. Instead we have what looks like a normal processor with eight cores. Built on Samsung 7nm and at 530mm2, IBM packages two processors together into one, and then puts four packages (eight CPUs, 64 cores) into a single unit. Four units make a system, for a total of 32 CPUs / 256 cores.

 

[Image: Telum system configuration]

 

On a single chip, we have eight cores. Each core has 32 MB of private L2 cache, which has a 19-cycle access latency. This is a long latency for an L2 cache, but it’s also 64x bigger than Zen 3's L2 cache, which is a 12-cycle latency.

 

[Image: Telum core with its L2 cache]

 

Looking at the chip design, all that space in the middle is L2 cache. There is no L3 cache. No physical shared L3 for all cores to access. Without a centralized cache chip as with z15, this would mean that in order for code that has some amount of shared data to work, it would need a round trip out to main memory, which is slow. But IBM has thought of this.

The concept is that the L2 cache isn’t just an L2 cache. On the face of it, each L2 cache is indeed a private cache for each core, and 32 MB is stonkingly huge. But when it comes time for a cache line to be evicted from L2, either purposefully by the processor or due to needing to make room, rather than simply disappearing it tries to find space somewhere else on the chip. If it finds a space in a different core’s L2, it sits there, and gets tagged as an L3 cache line.

What IBM has implemented here is the concept of shared virtual caches that exist inside private physical caches. That means the L2 cache and the L3 cache become the same physical thing, and that the cache can contain a mix of L2 and L3 cache lines as needed from all the different cores depending on the workload. This becomes important for cloud services (yes, IBM offers IBM Z in its cloud) where tenants do not need a full CPU, or for workloads that don’t scale exactly across cores.

This means that the whole chip, with eight private 32 MB L2 caches, could also be considered as having a 256 MB shared ‘virtual’ L3 cache. In this instance, consider the equivalent for the consumer space: AMD’s Zen 3 chiplet has eight cores and 32 MB of L3 cache, and only 512 KB of private L2 cache per core. If it implemented a bigger L2/virtual L3 scheme like IBM, we would end up with 4.5 MB of private L2 cache per core, or 36 MB of shared virtual L3 per chiplet.

This IBM Z scheme has the lucky advantage that if a core just happens to need data that sits in virtual L3, and that virtual L3 line just happens to be in its private L2, then the latency of 19 cycles is much lower than what a shared physical L3 cache would be (~35-55 cycle). However what is more likely is that the virtual L3 cache line needed is in the L2 cache of a different core, which IBM says incurs an average 12 nanosecond latency across its dual direction ring interconnect, which has a 320 GB/s bandwidth. 12 nanoseconds at 5.2 GHz is ~62 cycles, which is going to be slower than a physical L3 cache, but the larger L2 should mean less pressure on L3 use. But also because the size of L2 and L3 is so flexible and large, depending on the workload, overall latency should be lower and workload scope increased.

But it doesn’t stop there. We have to go deeper.

For IBM Telum, we have two chips in a package, four packages in a unit, four units in a system, for a total of 32 chips and 256 cores. Rather than having that external L4 cache chip, IBM is going a stage further and enabling that each private L2 cache can also house the equivalent of a virtual L4.

This means that if a cache line is evicted from the virtual L3 on one chip, it will go find another chip in the system to live on, and be marked as a virtual L4 cache line.

This means that from a singular core perspective, in a 256 core system, it has access to:

  • 32 MB of private L2 cache (19-cycle latency)
  • 256 MB of on-chip shared virtual L3 cache (+12ns latency)
  • 8192 MB / 8 GB of off-chip shared virtual L4 cache (+? latency)

Technically from a single core perspective those numbers should probably be 32 MB / 224 MB / 7936 MB because a single core isn’t going to evict an L2 line into its own L2 and label it as L3, and so on.

IBM states that using this virtual cache system, there is the equivalent of 1.5x more cache per core than the IBM z15, but also improved average latencies for data access. Overall IBM claims a per-socket performance improvement of >40%. Other benchmarks are not available at this time.
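Quick sanity check of those quoted figures, done by hand rather than taken from the article:

```python
# 12 ns of ring latency expressed in core clocks at 5.2 GHz, plus the
# "visible from one core" cache figures derived from the quoted totals.
print(round(12e-9 * 5.2e9))    # ~62 cycles
print(256 - 32)                # 224 MB of virtual L3 outside a core's own L2
print(8192 - 256)              # 7936 MB of virtual L4 outside a core's own chip
```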

 

My thoughts

Just amazing. Albeit all of this tech is made to run specific workloads for huge organizations, it can be an indicator for the future. AMD has V-Cache coming soon to add two times more L3 cache to their CPUs, and they made Infinity Cache for their GPUs; Intel has Sapphire Rapids with HBM memory coming too, and Nvidia perhaps has something in their labs to boost cache sizes as well, but we don't know. It seems one part of the future of chip performance is boosting cache sizes in novel ways, which makes sense, since RAM is somewhat getting stale. While DDR5 is coming with much higher bandwidth, it does not reduce latency; there's another AnandTech article previewing DDR5, and just looking at the spec numbers, latency at the higher frequencies tends to stay about the same as current high-frequency DDR4. Oh, and there's also Samsung with PIM (processing-in-memory), which attempts to offload some processing to RAM in order to increase bandwidth too.
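To put rough numbers on that DDR5 latency point: first-word latency scales as CL divided by transfer rate, so a higher frequency with a proportionally higher CL lands in the same place. The example timings below are my own illustrative picks, not figures from the AnandTech preview.

```python
# First-word CAS latency in ns = 2000 * CL / (transfer rate in MT/s)
def cas_latency_ns(cl, mts):
    return 2000 * cl / mts

print(cas_latency_ns(16, 3200))   # DDR4-3200 CL16 -> 10.0 ns
print(cas_latency_ns(32, 6400))   # DDR5-6400 CL32 -> 10.0 ns, same latency, double the bandwidth
```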

 

All in all, this decade looks set up to increase bandwidth and reduce latency in more novel ways than just making cores go faster.

 

Sources

https://www.anandtech.com/show/16924/did-ibm-just-preview-the-future-of-caches

 


 


EPYC processors have 256 MB cache... yeah, fine, you have 64 cores, but you can disable cores and get more MB per core.

Not quite the same as having a big pool of L3 cache available to every core at the same latency, but what I'm saying is that big cache memory is nothing revolutionary.

 

L1$ 4 MiB
L1I$ 2 MiB 64x32 KiB 8-way set associative  
L1D$ 2 MiB 64x32 KiB 8-way set associative write-back

L2$ 32 MiB
    64x512 KiB 8-way set associative write-back

L3$ 256 MiB
    16x16 MiB

 


14 minutes ago, mariushm said:

EPYC processors have 256 MB cache... yeah, fine, you have 64 cores, but you can disable cores and get more MB per core.

Not quite the same as having a big pool of L3 cache available to every core at the same latency, but what I'm saying is that big cache memory is nothing revolutionary.

What IBM have done is unlike anything we've seen before. Massive L2 that can also function as virtual L3 (faster if local, slower if on another slice), and virtual L4 in multi-chip systems. As always it depends on the workload, but this would offer more flexibility than we've seen before. Use a bit less here, use a bit more there.

 

The direction AMD have taken Zen is great for some types of scaling, but not others. The biggest weakness with higher core count Zen currently is that it is not a very unified processor, and is best viewed as multiple clusters of CCXs. On Zen 3 Epyc, the 256 MB of L3 is not one well-shared pool, but rather 8x 32MB pools limited in connectivity by relatively slow Infinity Fabric. Only data in the local L3 is quickly accessible. If things haven't changed, you can't access L3 from another CCX directly, although you can talk core to core cross-CCX, obviously at higher latency than locally. Otherwise it'll be a trip back to ram to share data cross-CCX.

 

If you want core count equality with IBM's solution, you'd need 8 of those chips, so you'd still have chip to chip connectivity to consider. Still, that same L2 cache can act as virtual L4, presumably offering faster access of cached data than you would have if you were to access ram off node for example.

 

For fun, we can imagine a 64 core Epyc turned down to 8 core, with one core per CCX in order to have all the cache. Reduced to that level, Infinity Fabric would probably not be limiting in bandwidth any more, CCX to CCX at least. But core to core will be at elevated latency.

 

 

As a parallel observation, it is interesting how Intel and AMD are differing in their L2/L3 balance. For example, Intel's current Tiger Lake and upcoming Alder Lake have (or expected to have) 1.25MB L2 and 3MB L3 per core in the maximum configuration. Zen 3 offers 0.5MB L2 and 4MB L3 per core in its current maximum configuration (vcache isn't quite here yet). Similar total L2+L3 overall, but different balance between L2 and L3. The system design choice is to get best overall performance, so it may be influenced by the core design also. 
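Spelling out the per-core totals behind that observation (my arithmetic on the figures above):

```python
# Per-core L2 + L3 implied by the quoted per-core figures.
tiger_lake_per_core = 1.25 + 3.0   # MB
zen3_per_core       = 0.5 + 4.0    # MB
print(tiger_lake_per_core, zen3_per_core)   # 4.25 vs 4.5 MB per core: similar totals, different split
```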

 

Side question, does anyone know if AMD used different IF configuration in Zen 3 vs consumer Zen 2? Zen 2 IF was full speed read but half speed write. If talking CCX to CCX you'd be limited to half rate overall. I did wonder if server versions might have full speed writes. Assuming this is the configuration still, best case all CCX to all CCX bandwidth is about 400GB/s counting bidirectionally, plus a bit more if you read from ram at the same time. The IBM processor is stated as having a 320GB/s ring bus, although I'm unclear how to read that value. For example, if multiple slices are talking to adjacent slices in the best case, you might have multiple transfers happening at that rate.



2 minutes ago, leadeater said:

And here I am only wondering how many pins are on that, sadly not mentioned

Literally using my finger as a measuring device, I estimate that the pictured pad array is about 90 wide and 80 tall with a 50% fill rate, which is ballpark 3600. Someone with more time on their hands might do an actual count.
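For what it's worth, the quick multiply on those guessed dimensions:

```python
print(int(90 * 80 * 0.5))   # ~3600 pads, assuming a 90 x 80 grid at 50% fill
```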



1 hour ago, mariushm said:

EPYC processors have 256 MB cache... yeah, fine, you have 64 cores, but you can disable cores and get more MB per core.

Not quite the same as having a big pool of L3 cache available to every core at the same latency, but what I'm saying is that big cache memory is nothing revolutionary.

 


 

Like all the Z systems, the main advantage is really the number of CPU sockets per system and the interconnectivity between them. IBM Z systems win out performance-wise when the usage is tailored to the advantages of the system.

 

If you compare to an Intel 8-socket system, the connectivity between the sockets is far better on IBM Z, and the cache structures are better too.

 

Mostly I just love all the engineering that goes into the Z systems, damn those things are sweet. I'll never get hands-on with one, simply can't ever see that happening unless I do something stupid and buy an old one off ebay.


5 minutes ago, porina said:


As a parallel observation, it is interesting how Intel and AMD are differing in their L2/L3 balance. For example, Intel's current Tiger Lake and upcoming Alder Lake have (or expected to have) 1.25MB L2 and 3MB L3 per core in the maximum configuration. Zen 3 offers 0.5MB L2 and 4MB L3 per core in its current maximum configuration (vcache isn't quite here yet). Similar total L2+L3 overall, but different balance between L2 and L3. The system design choice is to get best overall performance, so it may be influenced by the core design also. 

 


I never made the comparison, so I always thought Intel lacking L3 cache vs AMD was one of Intel's weaknesses, but now that I read this and see that Intel has much higher L2 than AMD, the smaller L3 makes a lot of sense. Now I'm more intrigued by the P cores on Alder Lake to see how they stack up.


 


30 minutes ago, porina said:

Side question, does anyone know if AMD used different IF configuration in Zen 3 vs consumer Zen 2? Zen 2 IF was full speed read but half speed write. If talking CCX to CCX you'd be limited to half rate overall.

EPYC 7003 (Milan) uses Infinity Fabric 2 and xGMI-2 like 7002 (Rome) does, with a higher supported memory clock. So it's the same.

 

Quote

The die-to-die Infinity Fabric bandwidth is 32 bytes for read and 16 bytes for write per Infinity Fabric clock (which has a maximum speed of 1,467 MHz).

The above is for EPYC 7002/Zen 2, so just increase the maximum speed to EPYC 7003's maximum supported clock for its bandwidth.
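Running the quoted figures through gives a rough per-die number (my arithmetic, using only the values in the quote):

```python
# Per-die Infinity Fabric bandwidth at the quoted maximum IF clock for EPYC 7002.
fclk_mhz  = 1467
read_gbs  = 32 * fclk_mhz / 1000    # ~47 GB/s read per die
write_gbs = 16 * fclk_mhz / 1000    # ~23 GB/s write per die
print(round(read_gbs), round(write_gbs))
```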

 

But no it has not changed.


50 minutes ago, leadeater said:

unless I do something stupid and buy an old one off ebay.

If  when you do, do post a benchmark.


I don't see this taking off very well. These days banks and other financial organizations (the ones who mainly still use mainframes in this day and age) are moving their shit to the cloud (wow, who coulda guessed) and are trying to move away from Fortran, COBOL and mainframes. Java will still be used because it's multi-platform and doesn't take much to get working in the cloud, since there are cloud services designed to run Java apps as standalone apps, so banks will be able to do a slow transition from Java running on a mainframe to Java running in the cloud.

 

The end goal for many is to move to newer languages such as Python, which don't run on mainframes but, just like Java, have cloud services designed to run these apps as standalone apps. Some of the older stuff will keep being used (VBA scripts (yes, macros, they are very useful for quickly sorting data for analysis), R and MATLAB) due to their strengths in crunching numbers quickly and efficiently. Mainframes won't.

It's a sad end of an era, but shit like this happens daily in the world of tech.
I wouldn't be surprised to hear in a year that IBM will stop working on mainframes and will only support existing deployments until they are phased out (probably in 3-5 years).


 


20 minutes ago, Salv8 (sam) said:

I don't see this taking off very well. These days banks and other financial organizations (the ones who mainly still use mainframes in this day and age) are moving their shit to the cloud (wow, who coulda guessed) and are trying to move away from Fortran, COBOL and mainframes. Java will still be used because it's multi-platform and doesn't take much to get working in the cloud, since there are cloud services designed to run Java apps as standalone apps, so banks will be able to do a slow transition from Java running on a mainframe to Java running in the cloud.

 


Cloud providers, and customer application architectures, still have a really long way to go to offer the same uptime and resiliency as an IBM Z system though. Cloud providers aren't likely to improve much here, so then it's down to application design, and it's actually much harder to get to the level of perfection and consistency seen on these mainframes.

 

Odd load balancer issues, DB cache weirdness, and any number of other strange things that you do encounter in an HA distributed application do actually result in end users being impacted. The issue here is that we are talking about financial transactions: the level of scrutiny, auditing, and exacting correctness required is really high and simply cannot tolerate these types of problems. That is why incumbent, known-working solutions are attractive to these customers.

 

Banks are moving resources into the cloud, but so far most of that is front-end systems and online banking, not so much transaction processing.

 

Quote

“This goes to our updated thinking around cloud following technology and regulatory changes over the years, our focus being on both public and onsite environments.”

 

One thing Bright made clear in his interview with iTnews last Friday is that some established infrastructure, namely Westpac’s Hogan-based mainframe transactional rig that runs on IBM’s Z-series, still has some miles left in it.

 

Heritage or legacy, big, fat distributed architecture still cuts it in the same way that Detroit Diesel remains under the hood of most prime movers still ploughing our highways, despite dreams of an autonomous and electric fleet.

 

But the design and testing of a new model architecture, namely an equity investment in 10x, is underway, and Westpac wants a say in where it goes.

 

Bright says IBM’s “overarching contract” has just been renewed.

https://www.itnews.com.au/news/how-westpacs-new-cio-slashed-his-ibm-mainframe-bill-533848

 

Westpac, from memory, is one of the keener AUS/NZ bank cloud adopters, and even they aren't in a great hurry to completely ditch their mainframes.


Now if only they make this affordable (comparatively)



1 minute ago, williamcll said:

Now if only they make this affordable (comparatively)

just gotta wait 10-15 years, and stalk government auctions, and other smaller auction sites 


15 hours ago, leadeater said:

The issue here is we are talking financial transactions, the level of scrutiny, auditing, and exacting correctness requirements is really high and just really can not tolerate these types of problems. Such is why incumbent known working solutions for these customers is attractive.

 

And some of that is coming from outside the institution in the form of government regulations. There's also the fact that at the top end, the integrity, accuracy, and availability of their transaction services is itself a form of value to their customers, and any major SNAFU there risks seriously damaging them by eroding that trust. Worst case, it erodes it so much that financial trading done by other entities through them comes under scrutiny, crashing the bank's and its customers' share prices; with a big enough bank, that could crash the entire global economy pretty much overnight. And no institution wants to find itself in that situation: the kind of international scrutiny (and action) from world governments that would invoke isn't the kind of thing any company, or its executives, wants.


IBM, I've Been Making silicon pancakes for 200 years! Scrubs! (madlad engineers turn into anime season 21)

But how much cache do they need? until they cache me ousside, how about that. Straight to the IGPU the cache goes.

