More Ryzen 3000 info - 4.5 GHz boost & +10-15% IPC

ouroesa
1 hour ago, CarlBar said:

and if they had, they could just fix the ring bus to work at bigger core counts by working on that

It's a ring. There's no working on it, it will always be more inefficient the bigger it is. Now, you could do multiple rings among smaller core sets, but that gets very complex very quickly and will have an effect on yields.
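To put a rough number on that scaling (a toy model of my own, counting only hop distance on an assumed bidirectional ring and ignoring arbitration, buffering and snoop traffic), the average stop-to-stop distance on a ring grows linearly with the stop count, while a 2D mesh grows roughly with the square root:

def ring_avg_hops(n):
    # Average shortest-path hop count between two stops on a bidirectional ring of n stops.
    return sum(min(d, n - d) for d in range(n)) / n

def mesh_avg_hops(rows, cols):
    # Average Manhattan distance between two stops on a rows x cols mesh.
    n = rows * cols
    total = sum(abs(a // cols - b // cols) + abs(a % cols - b % cols)
                for a in range(n) for b in range(n))
    return total / (n * n)

for stops in (4, 8, 10, 18, 28):
    print(f"{stops:2d}-stop ring: avg {ring_avg_hops(stops):.2f} hops")
print(f"28-stop mesh (4x7): avg {mesh_avg_hops(4, 7):.2f} hops")

At 8-10 stops the ring's average distance is still small, which lines up with where single rings were historically capped; by 28 stops the toy ring is roughly twice the average distance of the toy mesh.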


7 minutes ago, ravenshrike said:

It's a ring. There's no working on it, it will always be more inefficient the bigger it is. Now, you could do multiple rings among smaller core sets, but that gets very complex very quickly and will have an effect on yields.

Isn't Intel already doing dual rings for their 6-8 core processors? If that is the case, it sounds like Intel deemed it not feasible to add a third for higher core counts.


3 minutes ago, Trixanity said:

Isn't Intel already doing dual rings for their 6-8 core processors? If that is the case, it sounds like Intel deemed it not feasible to add a third for higher core counts.

That was just to allow bi-directional travel. I'm talking about a bunch of smaller rings for 16 core and up processors. But even there the complexities quickly become too large to keep up with the comparative simplicity of mesh.


1 hour ago, CarlBar said:

(though I couldn't find a good explanation of what the snoop bus was for)

Cache coherency. You don't want a situation where the data in one part of the CPU differs from what is supposed to be the same data in another part.
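As a toy illustration of why that coherency traffic (snooping/invalidation) exists at all, here's a minimal write-through, invalidate-on-write sketch I made up; it is not Intel's or AMD's actual protocol:

class Cache:
    def __init__(self, name):
        self.name = name
        self.lines = {}                       # address -> cached value

    def read(self, memory, addr):
        if addr not in self.lines:            # miss: fill from memory
            self.lines[addr] = memory[addr]
        return self.lines[addr]

    def write(self, memory, addr, value, peers):
        self.lines[addr] = value
        memory[addr] = value                  # write-through for simplicity
        for peer in peers:                    # "snoop": invalidate stale copies elsewhere
            peer.lines.pop(addr, None)

memory = {0x40: 1}
c0, c1 = Cache("core0"), Cache("core1")

print(c1.read(memory, 0x40))            # 1, both cores now cache the line
c0.write(memory, 0x40, 2, peers=[c1])
print(c1.read(memory, 0x40))            # 2 after a re-fill; without the invalidate it would still read the stale 1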

 

21 minutes ago, Trixanity said:

Isn't Intel already doing dual rings for their 6-8 core processors?

It's been a while since I looked it up, but from memory a single ring bus went up to 8 or 10 cores. Only above that did they start implementing multiple rings that had to be connected together. Historically not a problem consumers had to worry about.


34 minutes ago, Trixanity said:

Isn't Intel already doing dual rings for their 6-8 core processors? If that is the case, it sounds like Intel deemed it not feasible to add a third for higher core counts.

Intel limited a single ring up to 10 cores. Dual-ring Xeons were essentially two processors "glued" together on a single chip, since each ring had its own memory partition (Home Agent) and on those CPUs Intel provided an option to treat the memory access as NUMA-like, to reduce the cross-ring latency penalty for NUMA-aware applications at a cost of slightly lower L3 hit-rate.


6 minutes ago, porina said:

It's been a while since I looked it up, but from memory a single ring bus went up to 8 or 10 cores. Only above that did they start implementing multiple rings that had to be connected together. Historically not a problem consumers had to worry about.

@Trixanity

If you look at Broadwell-EP, it goes up to 10 cores on a single ring with the LCC die. It's almost identical to Haswell-EP, which was capped at 18 cores.

[Image: Broadwell-EP LCC die layout]

 

MCC die, 12-14 cores:

[Image: Broadwell-EP MCC die layout]

 

HCC die, 16+ cores:

[Image: Broadwell-EP HCC die layout]


On 4/29/2019 at 1:44 AM, Bananasplit_00 said:

This sounds quite literally hot. Hope it's not an Intel-level thermal disaster.

Unlike Intel, they are moving to 7 nm, so I don't think so.


1 hour ago, maartendc said:

Unlike Intel, they are moving to 7 nm, so I don't think so.

7 nm means higher density, so it could have thermal problems anyway.


7 hours ago, CarlBar said:

Each of those rings has (last time I did the math; I need to dig my articles up again as it's been a few weeks) 4 times the bandwidth, albeit unidirectional compared to AMD's bidirectional links. But that's kind of one of the things I was getting at: AMD has to put all the traffic that's distributed across 4 buses (though I couldn't find a good explanation of what the snoop bus was for) through a single, much lower-speed bus.

You can see the bandwidth figures in the tables I posted, and the difference between theoretical bandwidth and usable bandwidth in practice.

 

Here's what you're talking about:

Quote

Per core you get the same amount of L3 cache bandwidth as in high end Westmere parts - 96GB/s. Aggregate bandwidth is 4x that in a quad-core system since you get a ring stop per core (384GB/s).

Problem is it just doesn't work out this way.

 

7 hours ago, CarlBar said:

Also, saying we don't know what effect the IF links have on latency is like saying we don't know what effect making everyone use 30 mph mopeds will have on traffic congestion. You just have to look at side data and use a bit of common sense on said data.

The data is in the tables I posted, IF has higher bandwidth....

 

The ring bus, at smaller core counts, has better latency. Intel also has much better L3 cache than AMD, and it's a different L3 architecture.

[Image: cache and memory latency comparison table]

 

Because of this superior latency, small-data access bandwidth is higher on Intel.

[Image: EPYC vs Xeon memory latency graph (TinyMemBench)]

 

Quote

L3-cache sizes have increased steadily over the years. The Xeon E5 v1 had up to 20 MB, v3 came with 45 MB, and v4 "Broadwell EP" further increased this to 55 MB. But the fatter the cache, the higher the latency became. L3 latency doubled from Sandy Bridge-EP to Broadwell-EP.  So it is no wonder that Skylake went for a larger L2-cache and a smaller but faster L3. The L2-cache offers 4 times lower latency at 512 KB. 

 

AMD's unloaded latency is very competitive under 8 MB, and is a vast improvement over previous AMD server CPUs. Unfortunately, accessing more than 8 MB incurs worse latency than a Broadwell core accessing DRAM. Due to the slow L3-cache access, AMD's DRAM access is also the slowest. The importance of unloaded DRAM latency should of course not be exaggerated: in most applications most of the loads are done in the caches. Still, it is bad news for applications with pointer chasing or other latency-sensitive operations. 

 

7 hours ago, CarlBar said:

Obviously on HEDT and server parts, where Intel clocks are lower and they're using a mesh, that starts to fall off as Intel's bandwidth drops (because of the frequency point you mentioned), and their mesh just isn't quite as blazingly fast for a given bandwidth in the first place, so AMD's issues get smaller, helped by the fact that the lower clocks also mean each core is doing less work, reducing the needed communication bandwidth. Not to mention cross-die communication is, as you noted, different in Zen 1 and Zen+ from inter-CCX communication.

Intel moved to mesh because the ring bus doesn't scale to high core counts; it gets worse. That's why they had to go with dual rings, but that has some significant downsides, plus bandwidth and latency penalties when going between rings, i.e. between cores on different rings. Mesh is a superior design for large core counts, and it's not because of lower ring or core frequency: in some metrics the ring bus on those Xeons still outpaces mesh, just not where it actually matters.


I would dream about 5 GHz, but 4.5 or even 4 GHz would be fine with me, so long as ....

 

What I *REALLY* want to see is not just a 10-15% IPC improvement. I want to see AMD's Zen 2 jump ahead of Ice Lake (didn't AMD say that the Zen 2 Epyc samples were supposed to be competitive with Ice Lake?) by at least as big a factor as Bulldozer/Piledriver were behind Kaby Lake (the CPU Intel released before Zen 1 came out).


4 hours ago, leadeater said:

You can see the bandwidth figures in the tables I posted, and the difference between theoretical bandwidth and usable bandwidth in practice.

 

Here's what you're talking about:

Problem is it just doesn't work out this way.

 

The data is in the tables I posted, IF has higher bandwidth....

 

The ring bus, at smaller core counts, has better latency. Intel also has much better L3 cache than AMD, and it's a different L3 architecture.

[Image: cache and memory latency comparison table]

 

Because of this superior latency, small-data access bandwidth is higher on Intel.

[Image: EPYC vs Xeon memory latency graph (TinyMemBench)]

 

 

Intel moved to mesh because the ring bus doesn't scale to high core counts; it gets worse. That's why they had to go with dual rings, but that has some significant downsides, plus bandwidth and latency penalties when going between rings, i.e. between cores on different rings. Mesh is a superior design for large core counts, and it's not because of lower ring or core frequency: in some metrics the ring bus on those Xeons still outpaces mesh, just not where it actually matters.

 

Yes, you posted tables. Those tables are completely irrelevant to the point I'm making, hence why I've been ignoring them. As I tried to point out in my first response to you, you're arguing something completely different from what I am.

 

This isn't about the performance of the entire subsystem including the cache. It's about how the performance of one aspect of the system differs between the Intel and AMD approaches, the effects that has, and the potential benefits in that area that we know Zen 2 brings.

 

 

Let me try to go through this step by step, because I feel like something is getting lost in the communication between us.

 

The ring bus at 8 cores or fewer works well. Above 8 cores it starts to break down. Why, though? Well, the only thing that's different between 8 cores and 10 cores that's unique to the ring bus setup (obviously a bigger cache means more cache latency, but any 10-core part is going to experience that, so that's not unique to the ring bus) is the amount of data being transmitted. That means that 10 cores are overloading the data transmission rate of the ring bus and/or producing too many instances of one core having to wait because another core is doing something on the same clock cycle and blocking the route.

 

That in turn tells us a lot about how much bandwidth must actually be required in practice to keep 8 cores fed, and looking up the specs for the ring bus (32 bytes wide per ring, up to 5 GHz operating frequency on the fastest chips, and 4 rings to separate different data types out) gives us an idea of how much bandwidth (and what kind of operating frequency) any 8-core processor is going to need. Intel is getting up to 160 GB/s per ring; even if we assume the data ring is the bottleneck and the others are lightly loaded by comparison (which doesn't make a lot of sense, as they wouldn't make the other rings so wide if they didn't need the bandwidth, but OK, I'll bite), that still puts the value somewhere in the region of 160 GB/s.

 

Now, obviously, since AMD uses 4-core CCXs this gets more than a little messy, as all 8 cores aren't using the IF links to communicate with the other 7 (I think this is where the confusion is arising). But we can still expect it to be no worse than half what an 8-core ring bus system would require, as any given core has to communicate with at least 4 other cores this way.

 

The thing is, Ryzen's IF links don't even come close to that (46.667 GB/s bidirectional according to WikiChip). Thus we know that in workloads that heavily stress inter-core communication, the bandwidth and operating frequency must be producing congestion-based delays, and thus additional latency above and beyond any cache-access-induced latency. Zen 2 mitigates this by vastly raising the frequency and bandwidth of the available IF link.
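Rough back-of-the-envelope numbers for that comparison, treating both links as ideal width-times-clock products (my assumptions: a 32-byte-per-cycle data ring at core clock, and a 32-byte-per-fabric-clock path with the fabric clock tied to memory clock; real links never hit these peaks):

def link_bw_gbs(bytes_per_cycle, clock_ghz):
    # Ideal link bandwidth: bytes moved per cycle times cycles per second.
    return bytes_per_cycle * clock_ghz

ring_bw   = link_bw_gbs(32, 5.0)     # data ring on a high-clocked desktop part
fabric_bw = link_bw_gbs(32, 1.467)   # fabric clock assumed at DDR4-2933 memory clock

print(f"Ring data ring @ 5.0 GHz : ~{ring_bw:.0f} GB/s")    # ~160 GB/s
print(f"Zen fabric @ 1.467 GHz   : ~{fabric_bw:.0f} GB/s")  # ~47 GB/s, in the ballpark of the figure above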

 

This doesn't make the other issues unimportant (duh), but it's an additional factor above and beyond what you were talking about in the post that started this little side discussion.


2 hours ago, CarlBar said:

Yes, you posted tables. Those tables are completely irrelevant to the point I'm making, hence why I've been ignoring them. As I tried to point out in my first response to you, you're arguing something completely different from what I am.

It's not irrelevant; those are measures of the bandwidth through the buses you are talking about. It's showing you real-world, in-use bandwidth of the very thing you are talking about. How is this not relevant? The other aspects of what that bus is used for, like how it connects the iGPU, aren't relevant. We were talking about how the different buses affect core-to-core communication, and it's literally impossible to talk about that without talking about the L3 cache, hence those graphs showing in-use, what-you-actually-get bandwidth.

 

Having a bigger pipe doesn't make it faster; the data still flows at the same speed.

 

I won't leave the L3 cache aspect out of the discussion, because it's the thing that matters most unless the bus is the limiting factor, which in limited cases it is for Infinity Fabric, but only for bandwidth, not latency. What I originally posted and you responded to was about core-to-core bandwidth and latency, and I was explaining how that is caused not by Infinity Fabric but by the L3 design.


2 hours ago, CarlBar said:

even if we assume the data ring is the bottleneck and the others are lightly loaded by comparison (which doesn't make a lot of sense, as they wouldn't make the other rings so wide if they didn't need the bandwidth, but OK, I'll bite), that still puts the value somewhere in the region of 160 GB/s.

They are; the data ring is used for data, the others are not.

 

Quote

Per core you get the same amount of L3 cache bandwidth as in high end Westmere parts - 96GB/s. Aggregate bandwidth is 4x that in a quad-core system since you get a ring stop per core (384GB/s).

This is from a much older generation (Sandy Bridge), so it's faster now due to the clock increases, about what you said it is. Again, it does you no good if you are bandwidth-limited by the L3 cache.

 

2 hours ago, CarlBar said:

doesn't make a lot of sense, as they wouldn't make the other rings so wide if they didn't

This is documented; if you were looking up this information, you should have seen it.

[Image: Sandy Bridge ring bus diagram]

 

You can read more about each Ring here: https://www.theregister.co.uk/2010/09/16/sandy_bridge_ring_interconnect?page=2

 

Quote

The ring is an ingenious beast. For one thing, as Kahn explains: "The ring itself is really not a [single] ring: we have four different rings: a 32-byte data ring, so every cache-line transfer — because a cache line is 64 bytes — every cache line is two packets on the ring. We have a separate request ring, and acknowledge ring, and a [cache] snoop ring — they're used, each one of these, for separate phases of a transaction."

 

2 hours ago, CarlBar said:

The thing is, Ryzen's IF links don't even come close to that (46.667 GB/s bidirectional according to WikiChip). Thus we know that in workloads that heavily stress inter-core communication, the bandwidth and operating frequency must be producing congestion-based delays, and thus additional latency above and beyond any cache-access-induced latency. Zen 2 mitigates this by vastly raising the frequency and bandwidth of the available IF link.

I think you're greatly overestimating how much inter-core communication actually happens, and if you have 8 cores constantly cross-communicating then there is something seriously wrong. Real-world examples would show this up, like Blender, which performs exceedingly well on Ryzen.


On 4/30/2019 at 1:08 PM, CarlBar said:

I was just pointing out how the higher bandwidth and lower transmission time (as well as separate rings for each group of things) give Intel a big advantage in latency at the moment, above and beyond the write/read/write/read stuff you were talking about, and how Zen 2 looks set to really close up that aspect regardless of any other changes.

Just to clear up this point: this is due to the L3 cache, not the IF. Correct point, but incorrect identification of the cause. For a core to pass data to another core in a different CCX, it must write to L3 cache, then the receiving core must read it from L3 cache and bring it into its L2 and L1 caches. Intel's architecture doesn't have to do that. Direct core-to-core communication is only possible within a CCX, while on Intel's architecture (both ring and mesh) direct core-to-core communication is possible, excluding dies that use the dual ring bus.
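A toy way to picture that cost difference; the cycle counts below are placeholders I invented purely to show the shape of the path, not measured values:

# Hypothetical cycle costs, chosen only to illustrate the extra L3 round trip.
L3_WRITE   = 40   # producer core writes the line out to L3
L3_READ    = 40   # consumer core reads it back into its L2/L1
FABRIC_HOP = 60   # crossing the fabric between CCXs
DIRECT     = 70   # a direct transfer within one CCX / over a ring

same_ccx_cycles  = DIRECT
cross_ccx_cycles = L3_WRITE + FABRIC_HOP + L3_READ

print(f"same-CCX transfer : ~{same_ccx_cycles} cycles (toy numbers)")
print(f"cross-CCX transfer: ~{cross_ccx_cycles} cycles (toy numbers)")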

 

Edit:

I realize I'm being hyper-specific, but that's because this isn't changing in Zen 2, so it's critically important to know it's the L3 cache design.


5 hours ago, leadeater said:

Direct core-to-core communication is only possible within a CCX

...which is why I believe that AMD might eventually/possibly increase the CCX to 8 cores and have 1 die = 1 CCX. Also, the position of the dies on the 8-die Rome I find strange. Or rather: there might be two links in each CPU die, one to the I/O die and one to the closest other CPU die.

So that they then form a 2 CCX CPU just like we had before.

 

Anyway, the stuff you say about a CCX sounds a lot like "2 CPUs glued together in a die", so is it better to see it as a dual-CPU package?

 

 

What you said about writing to L3 cache, wasn't it similar back in the day with the good old K10 and Bulldozer CPUs? When one CPU communicated with another, it went through the L3?

"Hell is full of good meanings, but Heaven is full of good works"

Link to comment
Share on other sites

Link to post
Share on other sites

58 minutes ago, Stefan Payne said:

...which is why I believe that AMD might eventually/possibly increase the CCX to 8 cores and have 1 die = 1 CCX

Yeah, that would be nice. I'm not sure if there are specific reasons why keeping it at 4 is best or preferred, or whether increasing it is a coming thing. Maybe it has something to do with the L3 cache being in 2 MB slices per core, and connecting them up gets much harder beyond 4, i.e. the full mesh connection point count.
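The "full mesh connection point count" part is easy to put a number on: fully connecting n slices point-to-point needs n(n-1)/2 links, which grows quickly past 4 (a quick sketch, assuming every slice links directly to every other):

# Point-to-point links needed to fully connect n L3 slices: n*(n-1)/2.
for n in (4, 6, 8):
    print(f"{n} slices -> {n * (n - 1) // 2} links")
# 4 -> 6, 6 -> 15, 8 -> 28: one plausible reason a bigger CCX gets expensive to wire up.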

 

58 minutes ago, Stefan Payne said:

Also, the position of the dies on the 8-die Rome I find strange. Or rather: there might be two links in each CPU die, one to the I/O die and one to the closest other CPU die.

So that they then form a 2 CCX CPU just like we had before.

It makes a bit more sense when you look at the diagram components of the die, at least for current Zen/Zen+. It's easy enough to ignore/cut out the stuff that won't be in the chiplet.

 

[Image: Zen SoC block diagram]

From the IF down is the chiplet.

 

 

A slightly more detailed diagram, with multiple dies, but it shows what we need.

[Image: AMD Naples SoC block diagram]

 

So basically the chiplet should be the 2 CCXs, the SDF and the GMI interfaces. The GMI interfaces will connect to the I/O die SDF (or equivalent). 

 

Direct connections between chiplets I'd classify as unlikely, as then you'd be back to non-uniform memory access (NUMA) again. You don't really want 'close' and 'far' cores, otherwise you're in compiler-optimization hell, and people might not actually do that optimization, which is the hell part of it. I don't know, anything is possible.

 

58 minutes ago, Stefan Payne said:

What you said about writing to L3 cache, wasn't it similar back in the day with the good old K10 and Bulldozer CPUs? When one CPU communicated with another, it went through the L3?

That's a tough one. I never really looked at Bulldozer Opterons because they were DOA, and K10 is getting rather old. I think so? HyperTransport was used between sockets and ran directly between the CPUs/dies, but it's cache coherent, so data has to be written to a local cache of the remote CPU before any of its cores can use it. Both CPUs must have the same cache data. To me that sounds like the L3 cache would be used.


7 hours ago, leadeater said:

Direct core-to-core communication is only possible within a CCX, while on Intel's architecture (both ring and mesh) direct core-to-core communication is possible, excluding dies that use the dual ring bus.

 

Edit:

I realize I'm being hyper-specific, but that's because this isn't changing in Zen 2, so it's critically important to know it's the L3 cache design.

Why wouldn't AMD adopt a more 'mesh' like structure if it helps alleviate the latency issues? Is it because the current way with cross CCX communication going through the L3 helps facilitate the scalable nature (and the chiplet design)? Or is it a different reason like patents or time-to-market?


33 minutes ago, Trixanity said:

Why wouldn't AMD adopt a more 'mesh' like structure if it helps alleviate the latency issues? Is it because the current way with cross CCX communication going through the L3 helps facilitate the scalable nature (and the chiplet design)? Or is it a different reason like patents or time-to-market?

Mesh is fat on die area, but mainly it's to allow scaling well in both low-CCX and high-CCX configurations. To me it appears as if chiplets were always the design goal, and that is why the Zen architecture is the way it is. Mesh also implements internal groupings of cores and memory controllers, because going across the die has higher latency, and that means lower memory bandwidth per core even though total bandwidth is high. That's where Broadwell-EP is superior to Skylake-SP. The SDF in Zen allows every core full access to both memory channels on its own die.

 

[Image: memory bandwidth comparison table]

 

Quote

The new Skylake-SP offers mediocre bandwidth to a single thread: only 12 GB/s is available despite the use of fast DDR-4 2666. The Broadwell-EP delivers 50% more bandwidth with slower DDR4-2400. It is clear that Skylake-SP needs more threads to get the most of its available memory bandwidth.

 

Meanwhile a single thread on a Naples core can get 27.5 GB/s if necessary. This is very promising, as this means that a single-threaded phase in an HPC application will get abundant bandwidth and run as fast as possible. But the total bandwidth that one whole quad-core CCX can command is only 30 GB/s.

 

Overall, memory bandwidth on Intel's Skylake-SP Xeon behaves more linearly than on AMD's EPYC. All of the Xeon's cores have access to all the memory channels, so bandwidth more directly increases with the number of threads. 

 

So in the context of mesh vs ring vs SDF: for a single thread, ring is best, followed by SDF, then mesh. Mesh scales well for very high thread utilization without having to optimize and sub-group tasks; Infinity Fabric/SDF scales well for total memory bandwidth, but you need to take care with thread placement and communication.
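One way to picture that trade-off, using the article's measurements as rough anchors (the min-of-caps model and the Skylake-SP domain total are my simplifications, not measured scaling curves):

def per_thread_bw(threads, single_thread_cap, domain_total):
    # Each thread is limited by its own path and by its local domain's total bandwidth.
    return min(single_thread_cap, domain_total / threads)

# Skylake-SP: weak single thread (~12 GB/s) but every core sees all channels (domain total assumed ~100 GB/s).
# EPYC:       strong single thread (~27.5 GB/s) but one CCX tops out near ~30 GB/s.
for t in (1, 2, 4):
    skl = per_thread_bw(t, 12.0, 100.0)
    ccx = per_thread_bw(t, 27.5, 30.0)
    print(f"{t} thread(s): Skylake-SP ~{skl:.1f} GB/s/thread, one EPYC CCX ~{ccx:.1f} GB/s/thread")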


25 minutes ago, Trixanity said:

Why wouldn't AMD adopt a more 'mesh' like structure if it helps alleviate the latency issues? Is it because the current way with cross CCX communication going through the L3 helps facilitate the scalable nature (and the chiplet design)? Or is it a different reason like patents or time-to-market?

Cores actually don't communicate that much, and most of the CCX penalty went away when MS fixed the Windows scheduler. The memory latency issue is down to Intel having a more advanced IMC, but AMD might catch up on that. So, especially for servers, the CCX layout is just fine, as long as your schedulers are aware of what is going on.

 

It doesn't come up much, but those WX Threadripper parts with 4 dies don't really suffer much under Linux, even without direct memory access from two of the dies. So even that massive penalty isn't really that much of a worry for most things. It matters for desktop and gaming use far more than for professional tasks, so AMD has reasons to keep trying to improve it on those platforms.

 

From a mathematical standpoint, the 4-core CCX design is likely here to stay for a while. But what you do is stack 4 within a new "node" and repeat the process. Zen 3 might see 12-core chiplets (so 3 CCXs per chiplet), but that's just speculation. (Logical speculation, as the package has space and the minor shrink is enough for it.) But we could see a 16-core chiplet approach for Zen 4, which means you repeat the 4-core direct connections again, this time among 4 CCXs. (Zen 4 will be really wild again, as even TSMC has confirmed that much for 5 nm.)


@leadeater AMD was repping neural net/AI on Zen 1. They were never quite clear about exactly what they meant, but Zen has some amazing on-chip load balancing to get the most out of its memory, which is likely going to keep advancing, and we'll see that tech as the backbone for the eventual network-on-package they'll be doing at 5 nm.


11 hours ago, ONOTech said:

I'm taking everything with a grain of salt, but cache latency has been the biggest problem for Ryzen AFAIK. The cycles to L3 are like 2-3 times that of Intel.

 

Not sure what's going on architecturally, but if they improve that latency and the hit rate I'm sure Ryzen 3K will be a true winner in value and overall performance (excluding high FPU stuff...AVX and all)

 

Regardless, I'm looking forward to the APUs the most. Sadly, Q4 2019 is probably the earliest they'll arrive :(

Apart from highly synthetic benchmarks, not really.


8 minutes ago, Taf the Ghost said:

@leadeater AMD was repping neural net/AI on Zen 1. They were never quite clear about exactly what they meant, but Zen has some amazing on-chip load balancing to get the most out of its memory, which is likely going to keep advancing, and we'll see that tech as the backbone for the eventual network-on-package they'll be doing at 5 nm.

Most of that was marketing fluff. The design itself does well in specific areas like single-core access, but there is definitely some good, smart logic in there to handle multiple CCXs, the IMC, and the GMI interfaces, along with all that data flow and coherency. A day of open question time with the people who designed Zen would be so awesome; dreams, etc.


3 minutes ago, leadeater said:

Most of that was marketing fluff. The design itself does well in specific areas like single-core access, but there is definitely some good, smart logic in there to handle multiple CCXs, the IMC, and the GMI interfaces, along with all that data flow and coherency. A day of open question time with the people who designed Zen would be so awesome; dreams, etc.

With a much clearer picture of where AMD is heading with the design, a lot of the tech in Zen makes a lot more sense. The data fabric and other management systems seem "overbuilt" for the design, especially in Zen 1. But if you assume they were going to rapidly iterate into active interposers with a network-on-package system, the design approach makes a lot more sense. They got most of that worked out early, giving them a lot of flexibility in packaging for Zen phase 1, while at the same time carrying over a lot of the design throughout, which should limit issues with other complex design interactions later.


A 10-15% IPC uplift plus a 0.5 GHz higher clock means roughly Intel Coffee Lake performance, maybe?

 

What is the current Intel CPU refresh lineup? I completely lost track after Intel derailed from its tick-tock model and began just refreshing one CPU after another.

