
AMD Announces their new 3D stacking process

Juanitology

Summary

AMD announced their new 3D stacking process, which enables them to package up to 192MB of cache on a single CPU.

 


Quotes

Here are some significant quotes from Dr. Lisa Su:

Quote

The first application of this technology will be to enable a 3D vertical cache. [...] We're taking our leadership Ryzen 5000 series processor and stacking a 64MB 7nm SRAM directly on top of each core complex, effectively tripling the amount of high speed L3 cache feeding our Zen 3 cores.

We're using a hybrid bond approach with through silicon vias that provides over 200 times the interconnect density of 2D chiplets and more than 15 times the density compared to other 3D stacking solutions. 

The 3D chiplet prototype improves performance by 12% on average. In fact, when you look across many of today's games, they have intense demands on the PC memory subsystem. If you look across a number of popular game titles with the 3D V-cache technology, we're seeing an average improvement of 15% at 1080p. 

We'll be ready to start production on our highest end products with 3D chiplets by the end of this year.

 

My thoughts

In another genius stroke from AMD and TSMC engineers, they've used their new 3D stacking process to add a previously unthinkable amount of SRAM cache to their CPUs. It blows me away that Intel's latest CPUs come with just 16MB of cache, a number that has increased very gradually over the past decade, while AMD are now bringing a quantum leap with over an order of magnitude more cache available. It will take a lot of software optimization, but we can expect massive performance gains in data-intensive applications that take advantage of that 2TB/s bandwidth and huge cache. The numbers shown were just for a prototype 5900X with the new technology, but given the end-of-year manufacturing timeframe mentioned, I guess we can expect the top-end Ryzen 6000 series CPUs to include the new packaging technology and launch sometime in Q1 2022.
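As a quick sanity check on that 192MB headline figure, here's the arithmetic (a sketch; it assumes a two-CCD part like the 5900X/5950X and the 64MB-per-die figure from the keynote):

```python
# Arithmetic behind the 192 MB headline cache figure.
# Assumption: a two-CCD Ryzen 5000 part; 64 MB per stacked die per the keynote.
base_l3_per_ccd_mb = 32   # a Zen 3 CCD ships with 32 MB of L3
vcache_per_ccd_mb = 64    # the stacked V-Cache die adds 64 MB per CCD
ccds = 2                  # top-end Ryzen 5000 parts use two CCDs

per_ccd_mb = base_l3_per_ccd_mb + vcache_per_ccd_mb  # the "tripling" Su mentions
total_mb = per_ccd_mb * ccds
print(per_ccd_mb, total_mb)  # 96 192
```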

These CPUs will have more cache than my first PC had hard drive space. Consider me mindblown.

 

Source: AMD Keynote

 

Corsair 600T | Intel Core i7-4770K @ 4.5GHz | Samsung SSD Evo 970 1TB | MS Windows 10 | Samsung CF791 34" | 16GB 1600 MHz Kingston DDR3 HyperX | ASUS Formula VI | Corsair H110  Corsair AX1200i | ASUS Strix Vega 56 8GB Internet http://beta.speedtest.net/result/4365368180


Pretty awesome technology. It'll definitely benefit performance in a large range of tasks, but I wonder how much it costs to implement.


This is really important technology, but I feel bad for anyone trying to explain it. It's a big accomplishment and AMD has been working with TSMC for years on it, but you lose almost everyone the instant you start mentioning TSVs and micro-bumps. lol. (I've read a few of the patents on it. It's really very impressive.)

 

What we basically got was a demonstration of the rumored Zen3+ devices. They mentioned locking the chips to 4 GHz in the testing shown, which probably explains a little about why we aren't getting Zen3+ devices yet. The real power of massive L3 caches is in data center applications, so it would make more sense to bring those out first. Minus the fact that Genoa is nearly here.


When I hear AMD talk bigger caches, I also hear AMD saying they're not going to have enough RAM bandwidth, or even interconnect bandwidth. More cache is one solution to keeping the CPU cores fed. Infinity Fabric has been a choke point since it came in with Zen 2, and the ratio of core execution potential to RAM bandwidth is about as bad as I've known it.

 

I know some workloads that'll love the cache so it will be interesting to see exactly what configurations they will eventually offer to market.

 

On comparing AMD to Intel, you have to compare best to best. Zen 3 desktop cores in full configuration have 4MB L3 and 0.5MB L2 per core. Tiger Lake has 3MB L3 and 1.25MB L2. You can combine both since they are exclusive/non-inclusive. AMD does that more than Intel; I think since the Ryzen era they've listed combined L2+L3 on the retail box, whereas Intel tends to only display L3, probably as a carry-over from Skylake, which used inclusive caching so you couldn't add both. Anyway, the point is that for the latest CPUs you can buy from either side today, the cache per core is about the same.
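Summing those per-core figures makes the "about the same" point explicit (just arithmetic on the numbers above; the sum is meaningful because both designs are exclusive/non-inclusive):

```python
# Per-core cache budget, combining L2 + L3 as the post describes.
zen3_mb_per_core = 4.0 + 0.5          # Zen 3: L3 slice + private L2
tiger_lake_mb_per_core = 3.0 + 1.25   # Tiger Lake: L3 slice + private L2
print(zen3_mb_per_core, tiger_lake_mb_per_core)  # 4.5 4.25
```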

Main system: i9-7980XE, Asus X299 TUF mark 2, Noctua D15, Corsair Vengeance Pro 3200 3x 16GB 2R, RTX 3070, NZXT E850, GameMax Abyss, Samsung 980 Pro 2TB, Acer Predator XB241YU 24" 1440p 144Hz G-Sync + HP LP2475w 24" 1200p 60Hz wide gamut
Gaming laptop: Lenovo Legion 5, 5800H, RTX 3070, Kingston DDR4 3200C22 2x16GB 2Rx8, Kingston Fury Renegade 1TB + Crucial P1 1TB SSD, 165 Hz IPS 1080p G-Sync Compatible


When are they gonna strap HBM onto their CPU? 🤔

 

Just skip DDR5, give us 16GB of HBM on the CPU itself

-sigh- feeling like I'm being too negative lately


1 minute ago, Moonzy said:

Just skip ddr5, give us 16gb HBM on the CPU itself

Because HBM latency is garbage compared to DDR, so it's really only useful for computational workloads with no interactivity or responsiveness requirements.


Just now, leadeater said:

Because HBM latency is garbage compared to DDR so it's only useful for non interactive non responsive requirement computational workloads.

Ah, well just make something of sorts then

 

This just reminds me of HBM, that's why



1 minute ago, Moonzy said:

Ah, well just make something of sorts then

 

This just reminds me of HBM, that's why

Well, maybe with a large enough L3 cache HBM might not be as bad: these L3 cache dies stacked on top of the CCDs, a smaller 7nm IOD, and HBM2e or HBM3. Seems more of a server CPU product, cost-wise, though.


That take on cache is interesting, and it's neat to see a new way of building chips. Now we'll see how it behaves in the real world, latencies included.

| Ryzen 7 7800X3D | AM5 B650 Aorus Elite AX | G.Skill Trident Z5 Neo RGB DDR5 32GB 6000MHz C30 | Sapphire PULSE Radeon RX 7900 XTX | Samsung 990 PRO 1TB with heatsink | Arctic Liquid Freezer II 360 | Seasonic Focus GX-850 | Lian Li Lanccool III | Mousepad: Skypad 3.0 XL / Zowie GTF-X | Mouse: Zowie S1-C | Keyboard: Ducky One 3 TKL (Cherry MX-Speed-Silver)Beyerdynamic MMX 300 (2nd Gen) | Acer XV272U | OS: Windows 11 |


3 hours ago, leadeater said:

Because HBM latency is garbage compared to DDR so it's only useful for non interactive non responsive requirement computational workloads.

What is HBM's latency? I can't find any info on it.


50 minutes ago, cj09beira said:

what is hbm's latency i can not find any info on it

DDR first-word latency is 10ns or less; HBM first-word latency is ~100ns.
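To see why that first-word gap matters for interactive work, here's a rough average-memory-access-time sketch. Only the 10ns/100ns figures come from the post; the L3 hit time and miss rate are made-up illustrative numbers:

```python
# AMAT = hit_time + miss_rate * miss_penalty (classic back-of-envelope model).
def amat_ns(l3_hit_ns, llc_miss_rate, mem_latency_ns):
    return l3_hit_ns + llc_miss_rate * mem_latency_ns

# Assumed: 10 ns L3 hit, 5% last-level miss rate.
ddr = amat_ns(10, 0.05, 10)    # ~10 ns DDR first word
hbm = amat_ns(10, 0.05, 100)   # ~100 ns HBM first word
print(ddr, hbm)  # 10.5 15.0
```

Even at a modest 5% miss rate, the 10x worse first-word latency shows up directly in the average, which is why it hurts latency-sensitive interactive workloads more than throughput ones.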


4 hours ago, porina said:

When I hear AMD talk bigger caches, I also hear AMD saying they're not going to have enough ram bandwidth, or even interconnect bandwidth. More cache is one solution to feed the CPU cores. Infinity Fabric has been a choke point since it came with Zen 2, and the core execution potential to ram bandwidth ratio is about as bad as I've known it.

 

I know some workloads that'll love the cache so it will be interesting to see exactly what configurations they will eventually offer to market.

 

On comparing AMD to Intel, you have to compare best to best. Zen 3 desktop cores in full configuration are 4MB L3 and 0.5MB L2. Tiger Lake is 3MB L3 and 1.25MB L2. You can combine both since they are exclusive/non-inclusive. AMD does that more than Intel, think since Ryzen era they list combined L2+L3 on the retail box, where Intel tends to only display L3 probably as a carry over from Skylake which was inclusive cache so you can't add both. Anyway, point is for latest CPUs you can buy from either side today, the cache/core is about the same.

Odds are we'll only see the extra L3 cache on certain "high end" models, and probably rarely outside of EPYC. AMD could always slap a version onto some high-end consumer parts, but cooling is always its own issue with 3D stacking. So it makes a lot more sense in the server space.


5 hours ago, leadeater said:

Cache get bigger, performance go WEEEEeeee.

 

Wasn't that hard 😉

As I understand it, there are diminishing returns for most workloads as you increase the cache size. Running databases is the immediate task that comes to mind that would benefit, but what consumer apps would?


5 minutes ago, StDragon said:

As I understand it, there's diminishing returns for most workloads as you increase the cache size. Running databases is the immediate task that comes to mind that would benefit, but what consumer apps would? 

Surprisingly, most games do; applications in general less so. Personally I'd like to see high-density VM benchmarks, since cache thrashing can sometimes be a big issue there.


18 minutes ago, leadeater said:

Surprisingly most games do, less applications in general. Personally I would like to see high density VM benchmarks, sometimes cache trashing can be a big issue.

Oh, good point! I totally forgot about the VM aspect of this, which is exactly what host servers do: run lots of them on CPUs with large core counts.


3 hours ago, Taf the Ghost said:

Odds are we'll only see sort of "high end" models with the L3+ Cache. Probably rarely outside of Epycs. AMD could always slap on a version for some high-end parts, but cooling is always its own issue with 3D stacking. So it makes a lot more sense in the server space.

We'll have to see what they do, but I'd agree they might put this in a limited number of halo SKUs to help keep their leadership claims. I still need to watch that presentation, but I understood they demoed it with gaming, suggesting at least some consumer-level focus.


3 hours ago, StDragon said:

As I understand it, there's diminishing returns for most workloads as you increase the cache size. Running databases is the immediate task that comes to mind that would benefit, but what consumer apps would? 

Cache is huge in games. From a programming perspective, there are large sections of textbooks on game and game engine development that are purely about cache and how to optimize for it. Having a massive cache makes life a lot easier for a dev, since you're far less likely to have a cache miss, which absolutely trashes your frame rate because you have to wait for that data to be fetched from RAM. Ultimately, any application that is heavily data-dependent and requires frequent processing should benefit from a large cache, since you can store more frequently used data and more upcoming data.

 

Quote from an article on Gamasutra from 2007 (a bit dated, but the idea of optimizing for cache still applies):

 https://www.gamasutra.com/view/feature/130296/the_top_10_myths_of_video_game_.php?page=2

Quote

Myth 5: Reducing Instruction Processing Is Our Primary Goal In CPU Optimization


When comparing the growth rate of instructions retired in the past five years, the GPU is the winner. The CPU, by means of increased instruction level parallelism and multi-core is in second place. The slowest growth (of resources commonly utilized in game runtime) is the memory system.

 

The reason is simple, when used correctly, memory is very fast. The problem is that games, which are getting close to the 32 bit OS limit of 4 gigs, frequently abuse our fragile memory architecture.

 

Many traditional optimizations, made famous before the requirement of a tiered cache system, can be harmful to modern architectures. For example, a look-up table trades memory for instruction processing. If this increase in memory causes a cache miss that requires a fetch from system memory, you have done little to increase your performance. A cache miss that causes a fetch to system memory is many times slower than the slowest instruction. In attempting to save instructions, you have created latency and a data dependency.

 

When optimizing the CPU, we have a tendency to seek out the slowest instruction loops in our engine. The usual suspects are AI, culling, and physics. If you are not optimizing your engine for cache efficiency you are doing yourself a disservice. If you are reducing instructions and increasing cache misses, you are committing a sin.

 

Game Engine Architecture by Jason Gregory, a lead programmer at Naughty Dog (and part-time lecturer at USC), talks at length about the caching systems on modern consoles and computers because they are absolutely fundamental to performant games and engines. The textbook is on its 3rd edition, and every edition has expanded the content on memory systems because they keep becoming more important. https://www.gameenginebook.com/

 

Edit: Also, a big part of game development is having a suite of tools, both external and internal to the engine, for gathering information about cache misses, coherency, memory usage patterns, etc.
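The cache-miss point can be made concrete with a toy model. This is a sketch only (a simple direct-mapped cache simulator with made-up parameters, nothing like a real CPU's hierarchy), counting misses for a sequential walk versus one that jumps a full cache line on every access:

```python
# Toy direct-mapped cache: count misses for a stream of byte addresses.
def count_misses(addresses, line_bytes=64, lines=512):
    tags = [None] * lines              # which memory line each slot holds
    misses = 0
    for addr in addresses:
        line = addr // line_bytes      # which memory line this address is in
        idx = line % lines             # direct-mapped: one slot per line index
        if tags[idx] != line:          # wrong line cached here -> miss + fill
            tags[idx] = line
            misses += 1
    return misses

n = 65536
seq = count_misses(range(0, n * 4, 4))        # 4-byte elements, in order
jumpy = count_misses(range(0, n * 256, 256))  # one new 64-byte line per access
print(seq / n, jumpy / n)  # 0.0625 1.0
```

The sequential walk touches 16 elements per line and misses once per line (6.25% miss rate); the strided walk misses on every single access. That gap is exactly what the profiling tools mentioned above exist to catch.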

CPU: Intel i7 - 5820k @ 4.5GHz, Cooler: Corsair H80i, Motherboard: MSI X99S Gaming 7, RAM: Corsair Vengeance LPX 32GB DDR4 2666MHz CL16,

GPU: ASUS GTX 980 Strix, Case: Corsair 900D, PSU: Corsair AX860i 860W, Keyboard: Logitech G19, Mouse: Corsair M95, Storage: Intel 730 Series 480GB SSD, WD 1.5TB Black

Display: BenQ XL2730Z 2560x1440 144Hz


Hot news from Dr Wafer Eater:

 

 


And more news:

Quote
  • This technology will be productized with 7nm Zen 3-based Ryzen processors. Nothing was said about EPYC.
  • Those processors will start production at the end of the year. No comment on availability, although Q1 2022 would fit into AMD's regular cadence.
  • This V-Cache chiplet is 64 MB of additional L3, with no stepped penalty on latency. The V-Cache is address striped with the normal L3 and can be powered down when not in use. The V-Cache sits on the same power plane as the regular L3.
  • The processor with V-Cache is the same z-height as current Zen 3 products - both the core chiplet and the V-Cache are thinned to have an equal z-height as the IOD die for seamless integration
  • As the V-Cache is built over the L3 cache on the main CCX, it doesn't sit over any of the hotspots created by the cores and so thermal considerations are less of an issue. The support silicon above the cores is designed to be thermally efficient.
  • The V-Cache is a single 64 MB die, and is relatively denser than the normal L3 because it uses SRAM-optimized libraries of TSMC's 7nm process, AMD knows that TSMC can do multiple stacked dies, however AMD is only talking about a 1-High stack at this time which it will bring to market.

https://www.anandtech.com/show/16725/amd-demonstrates-stacked-vcache-technology-2-tbsec-for-15-gaming

 


1 hour ago, trag1c said:

Cache is huge in games. From a programming perspective there are large sections of text books that deal with game and game engine development that are purely all about cache and how to optimize for it. Having a massive cache makes it a lot easier on a dev since your far less likely to have a cache miss which absolutely trashes your frame rate because you have to wait for that data to be fetched from ram. Ultimately any application that is heavily data dependent and requires frequent processing should benefit from a large cache since you can ultimately store more frequently used data and store more upcoming data.

I don't disagree, but if the issue is fetching game assets with a dataset large enough that it can only come from RAM, I don't see how a larger cache would help unless it's of monstrous capacity.

 

Now if the cache is large enough to hold algorithms such as physics, then I could see a major improvement.

BTW, don't modern game engines such as Unreal Engine 5 use assembly code to reduce footprint in cache?


18 minutes ago, StDragon said:

I don't disagree, but if the issue is fetching game assets with a large enough dataset that it can only come from RAM, I don't see how a larger cache would help unless it's of monstrous capacity.

 

Now if the cache is large enough to hold algorithms such as physics, then I could see a major improvement.

BTW, don't modern game engines such as Unreal Engine 5 use assembly code to reduce footprint in cache?

Not assets; the CPU likely never sees the assets other than a decompression phase, if the GPU can't decompress them. Think code, like Lua.

 

Case in point: at least one MMO I play can have around 200 models on screen, and the bottleneck isn't the GPU, it's the CPU doing decompression and decryption on the netcode, as every single moving thing on screen sends a packet to update its position. In a recent patch announcement, they're removing further things from the game model to leave space and reduce "damage point inflation", because the damage numbers are 16-bit. In a previous patch they removed a few other variables being transmitted.

 

Ultimately, "cloud"/"live" services will just make this more of a thing. Even if the game is single-player, the engine constantly checks for things you have purchased, and if you can keep the entirety of the script code in L3 cache, those kinds of checks get a lot faster.

 


1 hour ago, StDragon said:

I don't disagree, but if the issue is fetching game assets with a large enough dataset that it can only come from RAM, I don't see how a larger cache would help unless it's of monstrous capacity.

 

Now if the cache is large enough to hold algorithms such as physics, then I could see a major improvement.

BTW, don't modern game engines such as Unreal Engine 5 use assembly code to reduce footprint in cache?

@Kisai is bang on. The CPU rarely sees assets, and when it does it's for an extremely short window (e.g. loading a level). Your cache is used when you're actually processing game updates. During the physics update you would probably see physics components and related data prefetched just before that code begins executing, so transform data (position, rotation, scale), forces/velocities, vectors, colliders, etc. would be in cache for a bunch of your game objects, letting the processor resolve collisions and update positions as fast as possible without waiting on RAM. Not every object will be in cache, but if your code and objects are laid out correctly, the processor can begin requesting that data early, so that as it's processing objects A and B it can begin fetching objects C and D. It works much the same for instructions: the CPU looks ahead, tries to predict what's coming next, and loads those instructions into cache.

 

Different levels of cache act differently too. L3 would have a significant portion of your game residing in it for low-latency access; as data moves closer to execution it moves into L2, and then, when it's actually being processed, into L1 and the CPU registers. Improperly structured and managed data can cause a miss at any cache level, and each type of miss gets worse as you move further from the cores: an L1 miss might cost you only 10 or 20 CPU cycles, whereas a full L3 miss could be well over 300 cycles just to fetch a single word. (Those numbers are ballpark; it really depends on a lot of factors.)
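Plugging rough per-level numbers like those into the usual additive model shows why a much larger L3 helps. Everything here is illustrative; the cycle counts and miss rates are assumptions, not measurements:

```python
# Expected cycles per memory access: base L1 hit cost, plus each level's
# chance of missing times the extra penalty of going one level further out.
def avg_access_cycles(l1_miss_rate, l2_miss_rate, l3_miss_rate,
                      l1_hit=4, l2_pen=15, l3_pen=40, ram_pen=300):
    return (l1_hit
            + l1_miss_rate * l2_pen
            + l1_miss_rate * l2_miss_rate * l3_pen
            + l1_miss_rate * l2_miss_rate * l3_miss_rate * ram_pen)

small_l3 = avg_access_cycles(0.05, 0.5, 0.5)  # half of L3 lookups miss
big_l3 = avg_access_cycles(0.05, 0.5, 0.1)    # bigger L3 cuts L3 misses
print(small_l3, big_l3)  # 9.5 6.5
```

Cutting only the L3 miss rate, while leaving L1 and L2 untouched, already shaves about a third off the average access cost in this sketch, because RAM's penalty dominates the sum.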

 

As far as assembly goes, I honestly don't know about Unreal; I'd have to look at the source, but it's probably a safe bet. Assembly isn't widely used, since 99 times out of 100 the compiler will produce better assembly than a human ever could. Hand-crafted assembly tends to be reserved for very tight looping sections of code, so you might translate 20 lines of C into finely optimized assembly. It's not uncommon for projects to have some hand-made assembly, but like I said, it's used very judiciously, since compilers have gotten to the point where they're much smarter than the programmer. Additionally, programmers will often feed hints to the compiler to influence the outcome, for instance so that the code it produces is extremely small. You're far more likely to read assembly than to write it, since reading the compiler's output gives you a big insight into what's going on operations-wise.


It's crazy to see processors having more cache than my family's first PC had RAM.
And not just by a little. It's significantly more than the motherboard could actually support.


9 hours ago, StDragon said:

As I understand it, there's diminishing returns for most workloads as you increase the cache size. Running databases is the immediate task that comes to mind that would benefit, but what consumer apps would? 

Figure anywhere from a couple of percent up to a 20-30 percent uplift, depending entirely on the game. Total War and Ashes of the Benchmark at lower resolutions would be interesting to see.

