
AMD Ryzen 7000X3D series coming February/April, 16-core Ryzen 9 7950X3D features 144MB cache: 21-30% higher gaming performance at 1080p (Update #3)

8 hours ago, lotus10101 said:

Me, I'm going to get one of these and a 4070ti for the ultimate 1080p experience

 

What's wrong with my 1080p 144Hz monitor? Some people still rock CRTs, ha!

I'll never knock a gamer for playing on the setup they like.
I am a bit curious whether the card can even be fully utilized at 1080p with a CPU that fast, and whether the frame latency would be so short as to cause scheduler slowdowns, increased latency, or input errors.

If your system were capable of rendering an insane number of frames and the game engine's code became the bottleneck, could you cause graphical or input errors as frames get discarded? The engine's movement instructions could become slower than the rendered frames, or could complete while a frame is being discarded, carrying the wrong input data into the next frame. This is obviously theoretical, since 99.9% of the time the game code held in the CPU cache isn't the slowdown, and GPU performance isn't so high that the card has to shuffle tasks between idle cores. I'm just thinking back to the LTT video on the editors' NVMe server being underutilized and causing latency issues because the OS's old code and the CPU architecture were fighting with an abnormally fast storage setup.

The best gaming PC is the PC you like to game on, how you like to game on it


12 hours ago, ewitte said:

AM6? It was almost 6 years between the release of AM4 and the release of AM5! It will be way outdated by then. The processors we have now can't even keep up with the 4090 at 4K, including the 7000 series!

*Looks at PC*

 


 

... I wish I had your problems.

CPU: i7 4790k, RAM: 16GB DDR3, GPU: GTX 1060 6GB


7 hours ago, BiG StroOnZ said:

In our older article, we explored two possibilities—one that the 3DV cache is available on both CCDs but halved in size for whatever reason; and the second more outlandish possibility that only one of the two CCDs has stacked 3DV cache, while the other is a normal planar CCD with just the on-die 32 MB L3 cache. As it turns out, the latter theory is right! 

Adding half the 3DV cache on each CCD doesn't make sense. Manufacturing costs would be pretty much identical to a 64 MB 3DV die for each CCD.

But I'm honestly surprised they didn't show a 7990X3D with 208 MB of cache (two 3DV cache dice). Not because it makes sense, but people would still buy it for over a thousand dollars.

3 hours ago, StDragon said:

How would the OS CPU scheduler know which thread is optimized for more cache vs higher clock rates?

The scheduler could measure how long a task waits for data from RAM compared to the execution time and shift a task with a long waiting time to the 3DV cache CCD.
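
Purely as a sketch of that idea (memory-stall counters do exist as hardware perf events, but the threshold and decision rule here are made up, not anything AMD or Microsoft have described):

```c
/* Hypothetical heuristic: treat a thread as cache-sensitive if it spends a
 * large fraction of its cycles stalled waiting on memory. Illustrative only. */
#include <stdbool.h>
#include <stdint.h>

#define STALL_RATIO_THRESHOLD 0.30   /* assumed cutoff, purely illustrative */

typedef struct {
    uint64_t cycles;             /* total core cycles sampled for the thread */
    uint64_t mem_stall_cycles;   /* cycles stalled waiting on DRAM           */
} thread_sample;

/* true  -> place on the 3DV cache CCD (cache-sensitive)
 * false -> place on the higher-clocked CCD (frequency-sensitive) */
bool prefers_vcache_ccd(const thread_sample *s)
{
    if (s->cycles == 0)
        return false;
    return (double)s->mem_stall_cycles / (double)s->cycles > STALL_RATIO_THRESHOLD;
}
```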


1 hour ago, Fendrick said:

Not really impressed in all honesty.

What would have impressed you? 

 

1 minute ago, HenrySalayne said:

Adding half the 3DV cache on each CCD doesn't make sense. Manufacturing costs would be pretty much identical to a 64 MB 3DV die for each CCD.

But I'm honestly surprised they didn't show a 7990X3D with 208 MB of cache (two 3DV cache dice). Not because it makes sense, but people would still buy it for over a thousand dollars.

The scheduler could measure how long a task waits for data from RAM compared to the execution time and shift a task with a long waiting time to the 3DV cache CCD.

7950X3D2 for 1600 USD, make like 25 for the whole world.

 

Idk, I think with all the NUMA core handling Windows has now, this might be better.
I feel like the Venn diagram overlap of games that can take advantage of both 32 threads and the additional cache is like one game. Games are more likely to fall into the box of either cache-dependent or frequency-dependent.

Perhaps they might bring Genoa-X down to the Threadripper line (Storm Peak-X?) for the people who play that single game that benefits from both high thread count and cache size. But as of now, I have heard zero rumors of such a product existing.

 

1 hour ago, tim0901 said:

*Looks at PC*

 


 

... I wish I had your problems.

Yeah, I went from an i7-2600K to an R7 7700X in November. I'll replace it with the X3D of Zen 5 or Zen 6 myself, since there's no platform change, and then run that for 8 years.


As far as I understand it L3 Cache is shared between CCDs (local L3 cache is checked first then the L3 cache on other CCDs), so theoretically the higher clocked cores on one CCD can still benefit from the huge cache on the other, though obviously with a latency penalty (but still a lot faster than going to system RAM). So you could kind of say the 3D v-cache is sort of like L4 cache for the higher clocked cores.

 

With that in mind, I'm very interested to see reviews and benchmarks to see if this is the case. If not, I can imagine it would be very frustrating to see Windows putting your game threads on the fastest-clocked cores only to get less FPS than if they were on the cores which have the 3D V-Cache. I wonder if the Windows scheduler is smart enough to put threads which are more frequency- vs. cache-sensitive on the appropriate cores.
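
If anyone wants to measure that once these are out, a crude pointer-chase like the sketch below shows the latency cliff as the working set outgrows the local L3 (rough illustration, not a rigorous benchmark): pin the process to one CCD, sweep the size, and watch ns/access jump.

```c
/* Rough pointer-chase latency sketch: dependent loads over a random cycle so
 * the prefetcher can't help. Pass the working-set size in bytes and pin the
 * process to specific cores first (Task Manager, taskset, Process Lasso...). */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(int argc, char **argv)
{
    size_t bytes = argc > 1 ? strtoull(argv[1], NULL, 10) : (64u << 20);
    size_t n = bytes / sizeof(size_t);
    size_t *chain = malloc(n * sizeof(size_t));
    if (!chain) return 1;

    /* Sattolo's algorithm: a random permutation forming one single cycle.
     * rand() is crude, but fine for a sketch. */
    for (size_t i = 0; i < n; i++) chain[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        size_t t = chain[i]; chain[i] = chain[j]; chain[j] = t;
    }

    size_t idx = 0;
    const size_t iters = 20 * 1000 * 1000;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < iters; i++) idx = chain[idx]; /* dependent loads */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
    printf("%zu MiB: %.1f ns/access (sink=%zu)\n", bytes >> 20, ns / iters, idx);
    free(chain);
    return 0;
}
```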


5 hours ago, StDragon said:
  • How would the OS CPU scheduler know which thread is optimized for more cache vs higher clock rates?
  • Will programs have to be updated with the developers intention to target a CCD based on their own testing / bench-marking?

In the previously linked video, an AMD spokesperson said it includes work from AMD, Microsoft, and game developers.

 

5 hours ago, StDragon said:
  • Will AMD provide a utility to manually pin specific processes to a CCD?

You can do it now, either manually through Task Manager or perhaps more easily with tools like Process Lasso.
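
Under the hood, those tools presumably end up at the same Win32 affinity call. A minimal sketch; which logical processors map to which CCD (and which CCD carries the V-Cache) is an assumption you would have to verify for your own chip:

```c
/* Minimal Win32 sketch: pin the current process to logical processors 0-15,
 * i.e. (with SMT) the eight cores of the first CCD on a typical dual-CCD
 * part. The core numbering here is an assumption; check your topology. */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    DWORD_PTR mask = 0xFFFF; /* logical processors 0-15 */
    if (!SetProcessAffinityMask(GetCurrentProcess(), mask)) {
        fprintf(stderr, "SetProcessAffinityMask failed: %lu\n",
                (unsigned long)GetLastError());
        return 1;
    }
    puts("Pinned to logical processors 0-15 (first CCD, assuming SMT pairs).");
    return 0;
}
```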

 

5 hours ago, StDragon said:
  • Is that V-Cache shared across the Infinity Fabric? If so, what's the performance penalty traversing through and would that negate higher clock speeds from the non-V-Cache CCD?
4 minutes ago, Shepanator said:

As far as I understand it L3 Cache is shared between CCDs (local L3 cache is checked first then the L3 cache on other CCDs), so theoretically the higher clocked cores on one CCD can still benefit from the huge cache on the other, though obviously with a latency penalty (but still a lot faster than going to system RAM). So you could kind of say the 3D v-cache is sort of like L4 cache for the higher clocked cores.

To my understanding, at least on Zen 2 and Zen 3, cores on one CCX can't directly access cache on another CCX (whether on the same or a different CCD). To access that data, you have two options: go back to system RAM, or do a core-to-core transfer, which can be done. Having a uniform cache really helps in a workload I have, and this has been a pain point with AMD's CCX/CCD implementation.

Gaming system: R7 7800X3D, Asus ROG Strix B650E-F Gaming Wifi, Thermalright Phantom Spirit 120 SE ARGB, Corsair Vengeance 2x 32GB 6000C30, RTX 4070, MSI MPG A850G, Fractal Design North, Samsung 990 Pro 2TB, Acer Predator XB241YU 24" 1440p 144Hz G-Sync + HP LP2475w 24" 1200p 60Hz wide gamut
Productivity system: i9-7980XE, Asus X299 TUF mark 2, Noctua D15, 64GB ram (mixed), RTX 3070, NZXT E850, GameMax Abyss, Samsung 980 Pro 2TB, random 1080p + 720p displays.
Gaming laptop: Lenovo Legion 5, 5800H, RTX 3070, Kingston DDR4 3200C22 2x16GB 2Rx8, Kingston Fury Renegade 1TB + Crucial P1 1TB SSD, 165 Hz IPS 1080p G-Sync Compatible


6 minutes ago, porina said:

To my understanding, at least on Zen 2 and Zen 3, cores on one CCX can't directly access cache on another CCX (whether on the same or a different CCD). To access that data, you have two options: go back to system RAM, or do a core-to-core transfer, which can be done. Having a uniform cache really helps in a workload I have, and this has been a pain point with AMD's CCX/CCD implementation.

Do you have a source for that?

AFAIK L1 cache is for a single core, L2 cache is shared between 2-4 cores, and L3 cache is shared across the whole CPU. In a chiplet-design CPU such as Zen 4, the L3 cache on the local CCD is checked first and then the L3 cache on other CCDs, with a latency penalty. There is another caveat: a core can only evict data from the L3 cache on its own CCD (but it can read the L3 cache anywhere).


27 minutes ago, Shepanator said:

Do you have a source for that?

I could ask you the same. I'll have to dig as I think I saw it around Zen 2 era, and I'm not aware that has changed since.

  

23 minutes ago, Shepanator said:

AFAIK L1 cache is for a single core, L2 cache is shared between 2-4 cores, and L3 cache is shared across the whole CPU.

L1 and L2 are local to a core. L3 is the first tier shared between multiple cores.

Gaming system: R7 7800X3D, Asus ROG Strix B650E-F Gaming Wifi, Thermalright Phantom Spirit 120 SE ARGB, Corsair Vengeance 2x 32GB 6000C30, RTX 4070, MSI MPG A850G, Fractal Design North, Samsung 990 Pro 2TB, Acer Predator XB241YU 24" 1440p 144Hz G-Sync + HP LP2475w 24" 1200p 60Hz wide gamut
Productivity system: i9-7980XE, Asus X299 TUF mark 2, Noctua D15, 64GB ram (mixed), RTX 3070, NZXT E850, GameMax Abyss, Samsung 980 Pro 2TB, random 1080p + 720p displays.
Gaming laptop: Lenovo Legion 5, 5800H, RTX 3070, Kingston DDR4 3200C22 2x16GB 2Rx8, Kingston Fury Renegade 1TB + Crucial P1 1TB SSD, 165 Hz IPS 1080p G-Sync Compatible


58 minutes ago, porina said:

I could ask you the same. I'll have to dig as I think I saw it around Zen 2 era, and I'm not aware that has changed since.

  

L1 and L2 are local to a core. L3 is the first tier shared between multiple cores.

I was being general about the cache hierarchy, not specific to Zen 4, where you're correct that L3 is shared by all cores in a CCD; however, cores can also access the L3 cache of other CCDs over the Infinity Fabric. We both agree, I just missed the word "directly" in your sentence "cores on one CCX can't directly access cache on another CCX" 😄

 

Since accessing data in the L3 cache of a foreign CCD is still a hell of a lot faster than going to system memory, it could theoretically mean that an application could run on the faster-clocked cores but still benefit from the huge cache on the other CCD, albeit with a bit of extra latency (hence why I said it would essentially be like an L4 cache). I could be wrong but I guess we will only see once it is in the hands of reviewers.


2 hours ago, porina said:

To my understanding, at least on Zen 2 and Zen 3, cores on one CCX can't directly access cache on another CCX (whether on the same or a different CCD).

They cannot; communication and data flow between functional blocks is done through the SCF/SDF (Infinity Fabric) and passes through the L3 cache, to be pulled down to the lower cache levels. So data must go up, out, in, down. Although typically the data will already exist at the upper layers, and potentially already in the target external L3 cache of the opposing CCD/CCX.

 

There are probably a bunch of cache hints you can use in code to ensure data is already loaded into a CCD's L3 cache when you know your code will work that way; there's likely far deeper knowledge and information on this than I could possibly hope to remember.
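
The simplest example is a software prefetch hint like GCC/Clang's __builtin_prefetch (just a sketch; it warms the caches of the core that executes it, it can't push data into another CCD's L3):

```c
/* Sketch of a software prefetch hint: request src[i + 16] a few iterations
 * ahead so the line is (hopefully) already in cache when the loop needs it.
 * The prefetch distance of 16 is an illustrative guess, not a tuned value. */
#include <stddef.h>

void scale(float *dst, const float *src, size_t n, float k)
{
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&src[i + 16], 0 /* read */, 3 /* high locality */);
        dst[i] = src[i] * k;
    }
}
```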


17 minutes ago, Shepanator said:

We both agree, I just missed the word "directly" in your sentence "cores on one CCX can't directly access cache on another CCX" 😄

We don't agree. If you can't directly access data in the L3 of another CCX, you indirectly access it. As mentioned, I see two scenarios for that. The most likely is you punt it back to system RAM and fetch it from there, since this is taken care of by hardware and doesn't need software consideration. The other is a direct core-to-core transfer, but this would be a programming nightmare and unlikely to be efficient outside of very specific compute use cases.

 

17 minutes ago, Shepanator said:

albeit with a bit of extra latency (hence why I said it would essentially be like an L4 cache). I could be wrong but I guess we will only see once it is in the hands of reviewers.

The extra latency is going to ram and back. This is why people have hoped for AMD to provide an actual L4 cache in IOD to save that trip back to ram.

Gaming system: R7 7800X3D, Asus ROG Strix B650E-F Gaming Wifi, Thermalright Phantom Spirit 120 SE ARGB, Corsair Vengeance 2x 32GB 6000C30, RTX 4070, MSI MPG A850G, Fractal Design North, Samsung 990 Pro 2TB, Acer Predator XB241YU 24" 1440p 144Hz G-Sync + HP LP2475w 24" 1200p 60Hz wide gamut
Productivity system: i9-7980XE, Asus X299 TUF mark 2, Noctua D15, 64GB ram (mixed), RTX 3070, NZXT E850, GameMax Abyss, Samsung 980 Pro 2TB, random 1080p + 720p displays.
Gaming laptop: Lenovo Legion 5, 5800H, RTX 3070, Kingston DDR4 3200C22 2x16GB 2Rx8, Kingston Fury Renegade 1TB + Crucial P1 1TB SSD, 165 Hz IPS 1080p G-Sync Compatible


10 minutes ago, porina said:

The extra latency is going to ram and back. This is why people have hoped for AMD to provide an actual L4 cache in IOD to save that trip back to ram.

Going back to RAM shouldn't be necessary; data transfer between CCXs and CCDs is supported without going through system memory. The potential issue I see is the imbalance in L3 cache capacity. You could end up with cache thrashing and a lot of cache evictions, as well as heavy utilization of the IF bandwidth links, which will affect the other cores if they are also trying to access system memory.

 

There is only so much bandwidth to go around; going to system memory might sometimes actually be better than going across to the adjacent V-Cache.

 

The SoC as a whole is cache coherent, so all CCDs/CCXs are aware of all cache entries and where they are, and can call on them. How that is done changes where the data actually ends up, but it's handled by the architecture and not the software, so system memory access isn't necessary. The architecture may decide to pull from system memory, however; it's smarter than I am at knowing what is best.

 

Making sure your data is where it is needed is a software thing, however.


7 hours ago, GhostRoadieBL said:

I'll never knock a gamer for playing on the setup they like.
I am a bit curious whether the card can even be fully utilized at 1080p with a CPU that fast, and whether the frame latency would be so short as to cause scheduler slowdowns, increased latency, or input errors.

If your system were capable of rendering an insane number of frames and the game engine's code became the bottleneck, could you cause graphical or input errors as frames get discarded? The engine's movement instructions could become slower than the rendered frames, or could complete while a frame is being discarded, carrying the wrong input data into the next frame. This is obviously theoretical, since 99.9% of the time the game code held in the CPU cache isn't the slowdown, and GPU performance isn't so high that the card has to shuffle tasks between idle cores. I'm just thinking back to the LTT video on the editors' NVMe server being underutilized and causing latency issues because the OS's old code and the CPU architecture were fighting with an abnormally fast storage setup.

It's all good, it was a joke anyway, because the 4070 Ti falls short in 4K gaming but shines at 1080p lol

I would have a 1440p or 4K monitor, but my 1080p 144Hz monitor looks just fine and I see no reason to change it atm. Plus my current rig config is a way-too-fast CPU and a puny GPU, so 1080p/1440p is ideal, although I have plugged it into my 4K TV and my card still pushes 100+ FPS in Doom Eternal on High/Ultra.

Ryzen 5800X3D (because who doesn't like a phat stack of cache?), GPU: 7700 XT

X470 Strix F Gaming, 32GB Corsair Vengeance, WD Blue 500GB NVMe + WD Blue 2TB HDD, 700W EVGA BR

~Extra L3 cache is exciting: every time you load up a new game or program you never know what you're going to get. Will it perform like a 5700X, or are we beating the 14900K today? 😅~


6 minutes ago, porina said:

We don't agree. If you can't directly access data in the L3 of another CCX, you indirectly access it. As mentioned, I see two scenarios for that. The most likely is you punt it back to system RAM and fetch it from there, since this is taken care of by hardware and doesn't need software consideration. The other is a direct core-to-core transfer, but this would be a programming nightmare and unlikely to be efficient outside of very specific compute use cases.

 

The extra latency is going to ram and back. This is why people have hoped for AMD to provide an actual L4 cache in IOD to save that trip back to ram.

You're incorrect here. If it worked how you just said, then cache coherence wouldn't be preserved, which is a big no-no. AMD CPUs use the MOESI protocol, a core tenet of which is direct cache-to-cache transfer of data.

Here is the key quote:

  • This protocol, a more elaborate version of the simpler MESI protocol (but not in extended MESI - see Cache coherency), avoids the need to write a dirty cache line back to main memory when another processor tries to read it. Instead, the Owned state allows a processor to supply the modified data directly to the other processor. This is beneficial when the communication latency and bandwidth between two CPUs is significantly better than to main memory.
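
A toy way to picture that Owned state (conceptual only, nothing like how a real coherence controller is actually implemented):

```c
/* Conceptual MOESI sketch: what happens to our cache line's state when a
 * remote core snoops a read. The point from the quote above: a Modified line
 * moves to Owned and supplies its dirty data directly, with no DRAM writeback. */
#include <stdio.h>

typedef enum { MODIFIED, OWNED, EXCLUSIVE, SHARED, INVALID } moesi_t;

moesi_t on_remote_read(moesi_t s)
{
    switch (s) {
    case MODIFIED:  return OWNED;   /* supply dirty data cache-to-cache */
    case OWNED:     return OWNED;   /* keep supplying it; still dirty   */
    case EXCLUSIVE: return SHARED;  /* clean line, now shared           */
    case SHARED:    return SHARED;
    default:        return INVALID; /* we don't hold the line           */
    }
}

int main(void)
{
    printf("Modified -> %s after a remote read\n",
           on_remote_read(MODIFIED) == OWNED ? "Owned (no writeback)" : "?");
    return 0;
}
```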

2 hours ago, leadeater said:

Going back to RAM shouldn't be necessary; data transfer between CCXs and CCDs is supported without going through system memory.

2 hours ago, Shepanator said:

You're incorrect here. If it worked how you just said, then cache coherence wouldn't be preserved, which is a big no-no.

I get that coherency is a necessary thing, but it doesn't necessarily follow that there is a path for moving data the same way the coherency is maintained.

 

I'll grant it may be a misunderstanding on my part: the limitations I see in AMD's architecture may be due to the limited bandwidth of their IF links, which per CCD is at best comparable to RAM access (for Zen 2/3 anyway; Zen 4 seems to be a little different). This is in part why, if I were to buy another AMD desktop CPU, I'd probably limit myself to 8-core (1 CCD) models.

 

The main observation I have is that scaling performance for bandwidth-sensitive workloads using the same data takes a dive on AMD CPUs when crossing multiple CCXs, which is not seen with unified L3 caches such as on Intel consumer CPUs and single-CCX AMD CPUs.

Gaming system: R7 7800X3D, Asus ROG Strix B650E-F Gaming Wifi, Thermalright Phantom Spirit 120 SE ARGB, Corsair Vengeance 2x 32GB 6000C30, RTX 4070, MSI MPG A850G, Fractal Design North, Samsung 990 Pro 2TB, Acer Predator XB241YU 24" 1440p 144Hz G-Sync + HP LP2475w 24" 1200p 60Hz wide gamut
Productivity system: i9-7980XE, Asus X299 TUF mark 2, Noctua D15, 64GB ram (mixed), RTX 3070, NZXT E850, GameMax Abyss, Samsung 980 Pro 2TB, random 1080p + 720p displays.
Gaming laptop: Lenovo Legion 5, 5800H, RTX 3070, Kingston DDR4 3200C22 2x16GB 2Rx8, Kingston Fury Renegade 1TB + Crucial P1 1TB SSD, 165 Hz IPS 1080p G-Sync Compatible


4 hours ago, porina said:

I'll grant it may be a misunderstanding on my part: the limitations I see in AMD's architecture may be due to the limited bandwidth of their IF links, which per CCD is at best comparable to RAM access (for Zen 2/3 anyway; Zen 4 seems to be a little different). This is in part why, if I were to buy another AMD desktop CPU, I'd probably limit myself to 8-core (1 CCD) models.

Bandwidth isn't the same thing as access latency, or really the number of required cycles. It's always going to be faster to access a different L3 cache than system memory; latency combined with data size is actual bandwidth, so at the instruction level, for what a core is doing, there is more bandwidth accessing L3 cache than going to system memory.

 

4 hours ago, porina said:

The main observation I have is that scaling performance for bandwidth-sensitive workloads using the same data takes a dive on AMD CPUs when crossing multiple CCXs, which is not seen with unified L3 caches such as on Intel consumer CPUs and single-CCX AMD CPUs.

It's seen on Intel CPUs too, it's just that the impact is less; L3 cache is made up of slices. For AMD it's greater when crossing those boundaries, but it's still better than going to system memory.

 

Quote

On the L3 side, we expect a large shift of the latency curve into deeper memory regions given that a single core now has access to the full 32MB, double that of the previous generation.

 

Quote

The fact that this test now behaves completely different throughout the L2 to L3 and DRAM compared to Zen2 means that AMD is now employing a very different cache line replacement policy on Zen3. The test’s curve in the L3 no longer actually matching the cache’s size means that AMD is now optimising the replacement policy to reorder/move around cache lines within the sets to reduce unneeded replacements within the cache hierarchies. In this case it’s a very interesting behaviour that we hadn’t seen to this degree in any microarchitecture and basically breaks our TLB+CLR test which we previously relied on for estimating the physical structural latencies of the designs.

 

It’s this new cache replacement policy which I think is cause for the more smoothed out curves when transitioning between the L2 and L3 caches as well as from the L3 to DRAM – the latter behaviour which now looks closer to what Intel and some other competing microarchitectures have recently exhibited.

 

Quote

Latencies past 8MB still go up even though the L3 is 32MB deep, and that’s simply because it exceeds the L2 TLB capacity of 2K pages with a 4K page size.

https://www.anandtech.com/show/16214/amd-zen-3-ryzen-deep-dive-review-5950x-5900x-5800x-and-5700x-tested/5

 

Also compare the graphs for Zen 2 and Zen 3 at 16 MB; you'll see the significant difference.


12 minutes ago, leadeater said:

Bandwidth isn't the same thing as access latency, or really the number of required cycles.

I'll accept that, but bandwidth tends towards being an enough-or-not-enough thing. Broadwell-C's L4 cache is rated at only 50 GB/s, which could easily be exceeded by modern RAM, but for those 4 cores it was enough and practically unlimited. The problem now is we have much faster cores, and many more of them. I never dug deep enough to see if latency was much of a consideration. At least at RAM-scale latencies (100 ns?) it didn't seem to make much difference compared to bandwidth.

 

12 minutes ago, leadeater said:

Also compare the graphs for Zen 2 and Zen 3 at 16 MB; you'll see the significant difference.

I was debating using this or similar earlier, but it doesn't necessarily represent the case I'm presenting, which is cores on a single CCX accessing cache on another CCX. I did find a "single core" bandwidth chart somewhere, but it was less than conclusive, since a single core doesn't generate enough load to stress memory bandwidth.

 

I could probably describe a test using Prime95 if anyone wants to try it, but I don't have access to a multi-CCX AMD system myself any more.

Gaming system: R7 7800X3D, Asus ROG Strix B650E-F Gaming Wifi, Thermalright Phantom Spirit 120 SE ARGB, Corsair Vengeance 2x 32GB 6000C30, RTX 4070, MSI MPG A850G, Fractal Design North, Samsung 990 Pro 2TB, Acer Predator XB241YU 24" 1440p 144Hz G-Sync + HP LP2475w 24" 1200p 60Hz wide gamut
Productivity system: i9-7980XE, Asus X299 TUF mark 2, Noctua D15, 64GB ram (mixed), RTX 3070, NZXT E850, GameMax Abyss, Samsung 980 Pro 2TB, random 1080p + 720p displays.
Gaming laptop: Lenovo Legion 5, 5800H, RTX 3070, Kingston DDR4 3200C22 2x16GB 2Rx8, Kingston Fury Renegade 1TB + Crucial P1 1TB SSD, 165 Hz IPS 1080p G-Sync Compatible


4 hours ago, porina said:

I'll accept that, but bandwidth tends towards being an enough-or-not-enough thing. Broadwell-C's L4 cache is rated at only 50 GB/s

That's only the maximum possible bandwidth the bus supports, not throughput. As I mentioned, if you are pulling data pages from cache for an instruction, then the actual throughput is data size over time.

 

Zen 3 can do 3 loads per cycle, though there are restrictions on that depending on quite a few things, but the reported maximum possible bandwidth is 2 256-byte loads from FP/SIMD, so a total of 512 bytes. If this is done to system memory at, say, 100 ns, then the throughput is 4.76 GB/s. If this is done to L3 cache at 10 ns, then it's 47.6 GB/s, unless the maximum bus bandwidth is hit.
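
Spelled out, those figures line up if you count in GiB; a quick sanity check of the arithmetic:

```c
/* Back-of-envelope check of the figures above: 512 bytes moved per access,
 * divided by the access latency, expressed in GiB/s. */
#include <stdio.h>

int main(void)
{
    const double bytes = 512.0;                  /* 2 x 256-byte loads */
    const double gib = 1024.0 * 1024.0 * 1024.0;
    printf("DRAM @ 100 ns: %.2f GiB/s\n", bytes / 100e-9 / gib); /* ~4.77  */
    printf("L3   @  10 ns: %.2f GiB/s\n", bytes / 10e-9 / gib);  /* ~47.68 */
    return 0;
}
```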

 

Just keep in mind aggregated possible bandwidth is not the same thing as throughput at the instruction level, core level or thread level at any given time.


5 hours ago, porina said:

I was debating using this or similar earlier, but it doesn't necessarily represent the case I'm presenting, which is cores on a single CCX accessing cache on another CCX. I did find a "single core" bandwidth chart somewhere, but it was less than conclusive, since a single core doesn't generate enough load to stress memory bandwidth.

The chart does show that: go to a data size past the bounds of the L3 cache in a single CCD, and that is the latency to access non-local L3 cache.

 

Within a CCD it's ~16ns and across a CCD it's ~80ns.


8 hours ago, leadeater said:

Just keep in mind aggregated possible bandwidth is not the same thing as throughput at the instruction level, core level or thread level at any given time.

Agreed, but if I consider ram bandwidth to be low, and IF bandwidth is less than that, it is still low. Some day I want to understand the access pattern of the workload of personal interest as it is mixed reads/writes and it doesn't correlate well to synthetic read/write/copy rates.

 

8 hours ago, leadeater said:

Within a CCD it's ~16ns and across a CCD it's ~80ns.

80ns is comparable to system ram latency.

Gaming system: R7 7800X3D, Asus ROG Strix B650E-F Gaming Wifi, Thermalright Phantom Spirit 120 SE ARGB, Corsair Vengeance 2x 32GB 6000C30, RTX 4070, MSI MPG A850G, Fractal Design North, Samsung 990 Pro 2TB, Acer Predator XB241YU 24" 1440p 144Hz G-Sync + HP LP2475w 24" 1200p 60Hz wide gamut
Productivity system: i9-7980XE, Asus X299 TUF mark 2, Noctua D15, 64GB ram (mixed), RTX 3070, NZXT E850, GameMax Abyss, Samsung 980 Pro 2TB, random 1080p + 720p displays.
Gaming laptop: Lenovo Legion 5, 5800H, RTX 3070, Kingston DDR4 3200C22 2x16GB 2Rx8, Kingston Fury Renegade 1TB + Crucial P1 1TB SSD, 165 Hz IPS 1080p G-Sync Compatible


6 minutes ago, porina said:

80ns is comparable to system ram latency.

It is and it isn't; that's the latency of something more meaningful. It's not comparable to AIDA64 figures, for example. I think memory latency for that is around 110 ns to 140 ns, so it's still a lot better.

 

It can be more, though, since access timing and required cycles are different within the IF compared to going out to system memory.

 

9 minutes ago, porina said:

Agreed, but if I consider ram bandwidth to be low, and IF bandwidth is less than that, it is still low. Some day I want to understand the access pattern of the workload of personal interest as it is mixed reads/writes and it doesn't correlate well to synthetic read/write/copy rates.

Actual memory bandwidth usage is a lot lower than spec. There are many instances where the AMD Zen architecture achieves higher real-world memory bandwidth, and that's through the IF, because it has to go through it.

 

Anyway, you'll probably find these interesting:

[Chart: Zen 4 multi-threaded memory bandwidth]

[Chart: Zen 4 single-threaded memory bandwidth]

 

https://chipsandcheese.com/2022/11/08/amds-zen-4-part-2-memory-subsystem-and-conclusion/

 

I'll give the source a proper read later, going to pick up pizza haha


For gaming, I feel like the 7800X3D will be the best one, as it has way more cache per CCD than the 7950X3D. Despite its lower clock speed, it will probably beat it in games where the V-Cache can be heavily used, and lose in games where the V-Cache is not as important and the clock speed pulls ahead, like CS:GO for example.

 

I also find it interesting that AMD compares the 5800X3D vs. the 7000X3D in some games where V-Cache does not really do much and most of the % improvement is from clock speed and IPC alone.

 

That said, I'm not even considering upgrading to AM5 now, but the 7950X3D will be an absolute beast for productivity. Just going from a 3900X to a 5800X3D makes Blender so much faster it's insane. I'm talking about viewport performance, where you actually spend most of your time creating stuff; rendering is done on the GPU anyway. I wish we actually got some benchmarks for this and not just tile rendering performance, which is not really that important but is very simple to measure.

