
Ryzen on 12nm might get a 50% core increase

cj09beira
Solved by Nowak:

Fake news, this was posted on r/AyyMD (an AMD circlejerk sub) a month ago and the media just now fell for it.

 

 

42 minutes ago, leadeater said:

The other option is to do either a smaller core count increase or none, and add in extra instruction sets to give targeted performance uplift. Things like improving AVX, increasing cache sizes, doubling the memory channels (16-channel EPYC ftw), increasing the CCX core count and removing the on-die IF link to only use it inter-die, and decoupling the IF from the memory controller if doing the previously mentioned removal of the intra-die IF.

Decoupling the IF would pretty much just mean using a single CCX die, but it wouldn't be suitable for the MCM setups in TR or Epyc. They could very well do that with a singular Mainstream design, but I doubt that's going to happen. The power of the IF has as much to do with AMD's risk profile of being fabless as it does with the computational power of the designs, so I think it stays.

 

I do think Zen2 is getting full AVX2 ports and greatly improved 128-bit geometry processing units as well. My assumption is those are the types of things Jim Keller meant by AMD knowing the easy stuff to address. Even with an increased cache size and its larger share of the core area, they're looking at around a 55% area shrink going to the next node. (7nm is roughly a 1.5-node jump for AMD.)

 

But I also believe we're getting a Mainstream and a Server design. I would believe the 16c rumor for the Server part. I could also see 8c as the Mainstream part, and it ends up really, really tiny. Or they toss just enough of Vega on there so it can run a web browser. Don't know on that one. Maybe the Ryzen 7 3000 series parts will be 125 mm2 dies. You thought the yields were good before?

 

Oh, I don't think we're getting beyond 8-channel memory for a bit. Having seen some of the dual-EPYC boards, they're mostly RAM with a little space for some CPUs and some I/O. Though if DDR5 allows for running multiple channels per DIMM, that would change things. (This also reminds me that I really should read up on the changes coming with DDR5, as they're pretty big.)


8 minutes ago, Taf the Ghost said:

Decoupling the IF would pretty much just mean using a single CCX die, but it wouldn't be suitable for the MCM setups in TR or Epyc.

It would still work; the IF can happily still connect dies. What I mean by decoupling the IF from the memory controller is using an on-package (off-die) reference clock that all dies in an MCM package can use for timing of the IF links. A single-die product like Ryzen just won't have it; the on-die components will still be present but inactive. Once you go TR/EPYC with MCM, you put the clock chip on the package and link it up to all the dies.

 

This way you are putting less work on the memory subsystem and allowing the two to be developed independently; right now it's a bit hard to make the IF faster if there is a memory development stall.

 

8 minutes ago, Taf the Ghost said:

Oh, I don't think we're getting beyond 8-channel memory for a bit. Having seen some of the dual-EPYC boards, they're mostly RAM with a little space for some CPUs and some I/O. Though if DDR5 allows for running multiple channels per DIMM, that would change things. (This also reminds me that I really should read up on the changes coming with DDR5, as they're pretty big.)

Just run a single RAM module per channel so the number of slots stays the same. More channels doesn't have to mean more RAM slots on the motherboard; past a certain point it will, of course.
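For a rough sense of why you'd want more channels at all, here's a back-of-the-envelope peak-bandwidth calculation (a minimal sketch using nominal DDR4-2666 figures, purely illustrative):

#include <stdio.h>

/* Theoretical peak bandwidth = channels x transfer rate (MT/s) x 8 bytes
   per transfer (each DDR4 channel is 64 bits wide). Nominal numbers,
   illustration only; real-world efficiency is lower. */
int main(void) {
    const double mts = 2666.0;          /* DDR4-2666: mega-transfers per second */
    const double bytes_per_xfer = 8.0;  /* 64-bit channel */
    for (int channels = 2; channels <= 16; channels *= 2) {
        double gbs = channels * mts * bytes_per_xfer / 1000.0;
        printf("%2d channels: %6.1f GB/s peak\n", channels, gbs);
    }
    return 0;
}

Per channel that's about 21.3 GB/s, so going from 2 to 8 channels takes the theoretical ceiling from ~42.7 to ~170.6 GB/s, before any real-world efficiency losses.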


2 minutes ago, leadeater said:

It would still work; the IF can happily still connect dies. What I mean by decoupling the IF from the memory controller is using an on-package (off-die) reference clock that all dies in an MCM package can use for timing of the IF links. A single-die product like Ryzen just won't have it; the on-die components will still be present but inactive. Once you go TR/EPYC with MCM, you put the clock chip on the package and link it up to all the dies.

 

This way you are putting less work on the memory subsystem and allowing the two to be developed independently; right now it's a bit hard to make the IF faster if there is a memory development stall.

 

Just run a single RAM module per channel so the number of slots stays the same. More channels doesn't have to mean more RAM slots on the motherboard; past a certain point it will, of course.

We'll see where AMD's thinking is on segmenting Client & Server. The intra-CCX latency is actually insanely fast, but that's not normally something we're talking about. While the Zen design philosophy is great, I'm not sure AMD can always get away with using the same design for both Mainstream & Server parts. The design allows them to make some massive high-core-count Epycs, but some of the IP would likely scale better if designed independently.

 

Thinking about it a bit more, there's actually a way to do the Mainstream & Server designs but still use a lot of Mainstream parts in the Server. If Mainstream is still 8c and Server is 16c, you end up with 2 separate Epyc products: "high speed" and "high cores". If there are some other, extra features on the Server design (maybe a large AVX-512 implementation not on the Mainstream?), you end up with some interesting product segmentation.

 

Related: AMD really could have used a full Epyc rollout before the final design tape-out of Zen2. Having some actual feedback on what people are buying is pretty useful.


Despite the integrated memory controller (IMC), the memory channels are not part of the CCX. Every processor can be scaled differently in some components, even GPUs. This allows you to make all sorts of monstrosities that you desire in the processor world. Processors have been made in modular manners (except for Intel), and GPUs have been for many years. For example, a single Nvidia xx80 chip can be cut down into an xx60 or xx20 chip, which can be cut down further. The memory controllers can be different too, so the same GPU chip can sometimes have GDDR5 and HBM variants, for example.

 

So you won't see too many memory channels if you want the most optimal and cost-effective package. The more memory channels, the more expensive the motherboard will be too. This is what allows for a very cheap $30 microATX board: the CPU has everything integrated and not too many of anything for mainstream.

 

To put it simply, the processor itself consists of transistors. Each processor has an interconnect with a bus, so each CCX core or even Intel core has this same config. The core itself usually consists of:

- transistors

- controller (for busses)

- cache

- sensors (like thermometer)

 

On the package, you get

- core clusters

- memory controller

- sensors (like voltage and amps, more thermometers)

- capacitors and resistors

- higher level cache/RAM

- Northbridge (PCIe controller, for non-CPUs)

- bus interconnects (some processors)

 

That's the basics of CPU scaling. You can't decouple some things, and not everything can be coupled. You have to balance everything because the actual physical size of the package/die is crucial for power use, heat, max clocks and latency. Make a package too big and latency can become an issue (which is why you don't see many-core chips doing well in desktops), not to mention that clocks get less accurate with distance. This is why GPUs and many-core chips are good at specific tasks while our standard multicore CPUs are better at serial tasks. Big packages can be hard to make full use of :P .


1 minute ago, System Error Message said:

So you won't see too many memory channels if you want the most optimal and cost-effective package. The more memory channels, the more expensive the motherboard will be too. This is what allows for a very cheap $30 microATX board: the CPU has everything integrated and not too many of anything for mainstream.

That's actually one of the really smart things about Zen, TR/EPYC and MCM: being able to increase the memory channels without increasing the design and production cost of the die.


6 minutes ago, Taf the Ghost said:

The intra-CCX latency is actually insanely fast, but that's not normally something we're talking about.

That's why I'm wondering if, for Ryzen at least, it's better to go for a single CCX. Though from my understanding the inter-CCX latency is mostly impacted by the L3 cache, not the IF itself, so improving that could actually be the silver bullet. Maybe it's not even the L3 cache that is slow but the timing alignment between the cores, caches, IMC and IF, which would actually make my idea of using a dedicated IF clock much worse for that, much much worse.
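One way to put numbers on that is a pinned ping-pong test: two threads bounce a flag through a shared cache line and you time the round trip, which jumps sharply when the pair straddles a CCX boundary. A minimal Linux/GCC sketch; the choice of CPUs 0 and 4 is an assumption about topology, so check lstopo or /proc/cpuinfo for the actual CCX layout on a given chip:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

#define ROUNDS 1000000

static atomic_int ball;            /* shared cache line both threads bounce */

/* Pin the calling thread to one logical CPU. */
static void pin(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *pong(void *arg) {
    pin(*(int *)arg);
    for (int i = 0; i < ROUNDS; i++) {
        while (atomic_load(&ball) != 1) ;   /* wait for the ping */
        atomic_store(&ball, 0);             /* send it back */
    }
    return NULL;
}

int main(void) {
    int peer = 4;     /* ASSUMPTION: CPU 4 lives in the other CCX */
    pthread_t t;
    pin(0);           /* measuring thread on CPU 0 */
    pthread_create(&t, NULL, pong, &peer);

    struct timespec a, b;
    clock_gettime(CLOCK_MONOTONIC, &a);
    for (int i = 0; i < ROUNDS; i++) {
        atomic_store(&ball, 1);             /* ping */
        while (atomic_load(&ball) != 0) ;   /* wait for the pong */
    }
    clock_gettime(CLOCK_MONOTONIC, &b);
    pthread_join(t, NULL);

    double ns = (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
    printf("avg round trip: %.1f ns\n", ns / ROUNDS);
    return 0;
}

Run it once with the peer in the same CCX and once with it in the other CCX; the gap between the two averages is the cross-CCX penalty.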


12 minutes ago, leadeater said:

That's why I'm wondering if, for Ryzen at least, it's better to go for a single CCX. Though from my understanding the inter-CCX latency is mostly impacted by the L3 cache, not the IF itself, so improving that could actually be the silver bullet. Maybe it's not even the L3 cache that is slow but the timing alignment between the cores, caches, IMC and IF, which would actually make my idea of using a dedicated IF clock much worse for that, much much worse.

On LGA 1366 Intel i-series, the L3 cache can be overclocked for some amazing results. I find the limit of the CPU to be the speed of the L3 cache; when the L3 cache exceeds L2 cache bandwidth, the CPU resets itself (reboots). If only that could be achieved on Ryzen.


13 minutes ago, leadeater said:

That's why I'm wondering if, for Ryzen at least, it's better to go for a single CCX. Though from my understanding the inter-CCX latency is mostly impacted by the L3 cache, not the IF itself, so improving that could actually be the silver bullet. Maybe it's not even the L3 cache that is slow but the timing alignment between the cores, caches, IMC and IF, which would actually make my idea of using a dedicated IF clock much worse for that, much much worse.

I think the intra-CCX latency being so low has a lot to do with being able to call a core via direct attachment rather than going through any other bus. (Which is also why I don't think we're getting larger than a 4c CCX for a bit.) The CCXs actually communicate through the L3 cache, if I understand the situation properly. That's why CCX-to-CCX moves hit a latency penalty beyond 8 MB, since that's the L3 size.

 

Also, the latency, at least on-die CCX to CCX, isn't that bad at stock 2666 memory. If AMD can drop it by about 40% over the next 2 generations, it'll be as fast as the Ring Bus currently is, though I think keeping it as a "wide" connection is probably more important going forward than dropping the latency that much.


28 minutes ago, Taf the Ghost said:

I think the intra-CCX latency being so low has a lot to do with being able to call a core via direct attachment rather than going through any other bus. (Which is also why I don't think we're getting larger than a 4c CCX for a bit.) The CCXs actually communicate through the L3 cache, if I understand the situation properly. That's why CCX-to-CCX moves hit a latency penalty beyond 8 MB, since that's the L3 size.

 

Also, the latency, at least on-die CCX to CCX, isn't that bad at stock 2666 memory. If AMD can drop it by about 40% over the next 2 generations, it'll be as fast as the Ring Bus currently is, though I think keeping it as a "wide" connection is probably more important going forward than dropping the latency that much.

That's the main issue though: the average core-to-core latency across all cores on Zen is a fair amount higher than Intel's, and for applications that aren't Zen-aware it leads to those weird performance issues we saw at launch.

 

Anything that goes between CCXs causes an L3 cache read, store and read operation, as the data must travel through both L3 cache chunks. That's why inter-CCX core latency is double the L3 latency. This isn't an IF issue; as you can see below, the IF adds near-zero latency. It's a design limitation.

 

Quote

The local "inside the CCX" 8 MB L3-cache is accessed with very little latency. But once the core needs to access another L3-cache chunk – even on the same die – unloaded latency is pretty bad: it's only slightly better than the DRAM access latency.

 

Quote
Mem Hierarchy              | AMD EPYC 7601 (DDR4-2400) | Intel Skylake-SP (DDR4-2666) | Intel Broadwell Xeon E5-2699v4 (DDR4-2400)
L1 cache (cycles)          | 4                         | 4                            | 4
L2 cache (cycles)          | 12                        | 14-22                        | 12-15
L3 cache, 4-8 MB (cycles)  | 34-47                     | 54-56                        | 38-51
L3 cache, 16-32 MB (ns)    | 89-95                     | 25-27 (+/- 55 cycles?)       | 27-42 (+/- 47 cycles)
Memory, 384-512 MB (ns)    | 96-98                     | 89-91                        | 95

https://www.anandtech.com/show/11544/intel-skylake-ep-vs-amd-epyc-7000-cpu-battle-of-the-decade/13

 

Zen's L3 latency is actually very good, better than Intel's; however, Ryzen gets let down by the design due to the double cache access when compared to Intel architectures. Any improvement here will have a big impact on inter-CCX latency, be it simply lowering the L3 cache latency or improving prefetching so data is already in the local L3 chunk. My guess is this is why AMD made such a big deal of SenseMI (neural net my ass though).

 

Edit:

More interesting data: http://www.7-cpu.com/cpu/Zen.html
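For reference, latency tables like the one above are typically produced with a dependent-load ("pointer chasing") loop: each load's address comes from the previous load, so nothing can overlap and the time per step approximates the raw latency at that working-set size. A simplified sketch of the idea (a real tool randomizes the chain so the prefetchers can't hide the latency):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Walk a pointer chain spanning `bytes` of memory. Because each load
   depends on the previous one, time per hop ~= load latency at that
   working-set size (L1, L2, L3 or DRAM). Simplified: a real tool
   shuffles the chain randomly to defeat the hardware prefetchers. */
static double chase(size_t bytes) {
    size_t n = bytes / sizeof(void *);
    void **ring = malloc(n * sizeof(void *));
    /* stride of 8 pointers (64 B) = one new cache line per hop */
    for (size_t i = 0; i < n; i++)
        ring[i] = &ring[(i + 8) % n];

    void **p = ring;
    const long hops = 20 * 1000 * 1000;
    struct timespec a, b;
    clock_gettime(CLOCK_MONOTONIC, &a);
    for (long i = 0; i < hops; i++)
        p = *p;                           /* dependent load chain */
    clock_gettime(CLOCK_MONOTONIC, &b);

    double ns = (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
    free(ring);
    return p ? ns / hops : 0.0;           /* use p so the loop isn't elided */
}

int main(void) {
    size_t sizes[] = { 16 << 10, 256 << 10, 4 << 20, 256 << 20 };
    for (int i = 0; i < 4; i++)
        printf("%8zu KiB: %5.2f ns/load\n", sizes[i] >> 10, chase(sizes[i]));
    return 0;
}

As the working set crosses each cache level's capacity, the ns/load figure steps up, which is exactly where the cycle counts in the table come from.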


What about moving the communication over to the L2 cache? Bandwidth would have to be increased, but you would then have lower latency and a more unified cache. It would be less modular though, as L3 would become its own thing.


20 minutes ago, cj09beira said:

What about moving the communication over to the L2 cache? Bandwidth would have to be increased, but you would then have lower latency and a more unified cache. It would be less modular though, as L3 would become its own thing.

The problem is to do with size and localisation. L1 and L2 caches are localised to a core while L3 is shared. This means that communication via L2 would be slower, as one core would have to access another core's L2. This is why L3 exists.


6 hours ago, leadeater said:

Anything that goes between CCXs causes an L3 cache read, store and read operation, as the data must travel through both L3 cache chunks. That's why inter-CCX core latency is double the L3 latency.

6 hours ago, System Error Message said:

The problem is to do with size and localisation. L1 and L2 caches are localised to a core while L3 is shared. This means that communication via L2 would be slower, as one core would have to access another core's L2. This is why L3 exists.

I wonder if they could design a common L3 cache that could be shared between CCXs, or perhaps add an L4 cache before hitting system RAM.  Not sure of the technical limitations of doing so, though.


9 minutes ago, Jito463 said:

I wonder if they could design a common L3 cache that could be shared between CCXs, or perhaps add an L4 cache before hitting system RAM.  Not sure of the technical limitations of doing so, though.

A common L3 cache is probably the only one that would work; an L4 cache would have the same issue, as data would have to be pushed up to L4 then down to the local L3 cache. Data that passes between CCXs doesn't actually hit RAM at all; it's just that, due to the design, it's basically just as slow.


1 hour ago, leadeater said:

A common L3 cache is probably the only one that would work; an L4 cache would have the same issue, as data would have to be pushed up to L4 then down to the local L3 cache. Data that passes between CCXs doesn't actually hit RAM at all; it's just that, due to the design, it's basically just as slow.

Isn't it pretty much impossible to detach the L3 cache from the CCX design without breaking the modularity and therefore the scalability? The L3 cache is important to the coherency between the CCXs, but by design it has to be part of a CCX. They might want to change it from a victim cache to an inclusive cache so that it's more useful instead of being a data dump, but I can imagine it suddenly becoming very complicated and less predictable, especially if you want to share data between the CCXs.

 

Ultimately it seems like the only solution is to somehow make transferring data from CCX to CCX faster, or otherwise abandon it and go monolithic (which obviously isn't gonna happen). Whether that's changing the cache or relying on improvements to the interconnect, it seems to me that it's a very tough nut to crack, as it's a core limitation of MCM designs that you kinda have to work around as best as possible. It seems like the best you can do is to increase bandwidth to attempt to hide the latency.


On 12/10/2017 at 3:40 PM, Jon Jon said:

Who was happy with the 7700K? It performed clock-for-clock identical to the 6700K at launch. The i7 8700K was (and is) an exciting product because of the additional cores and overclockability. I think people who were "perfectly happy" with the 7700K and then magically weren't were probably just fanboys of some sort anyway. The 7700K is still a great chip, but now we have chips that are overall better, which is what we should want.

The i7 7700K was disappointing, we all know it. Sure, there was a lot of optimization and polishing, but it was still so close to the i7 6700K that it was hardly worth anyone's time if you had any of the past 3 generations of i7... We all know Intel wished to keep things locked to quad cores to milk us for money... :/

 

The i7 8700K is a game changer, but only thanks to AMD; that is indisputable.


25 minutes ago, Princess Cadence said:

The i7 7700K was disappointing, we all know it. Sure, there was a lot of optimization and polishing, but it was still so close to the i7 6700K that it was hardly worth anyone's time if you had any of the past 3 generations of i7... We all know Intel wished to keep things locked to quad cores to milk us for money... :/

You had very little reason to switch from an Intel Core i5 or i7 unless you were going to the enthusiast platforms and you needed the cores. I see the KBL chips as Intel's last go at grabbing more sales until CFL was scheduled to release (it had been scheduled for 2018 since 2015), and Ryzen only pushed it forward a few months. I can't recall if KBL was on the release schedules before or after CFL was added.

 



I doubt that they will produce any more 5 GHz processors in the near future. 5 GHz will burn out a CPU very quickly!


5 minutes ago, ARikozuM said:

Oh trust me, I know architectures are in research and such for years, and while direct competitors' releases are not straight answers to one another, that doesn't change the fact that Coffee Lake might not have been originally meant to be quad cores only just the same xD

 

Who knows... we're not in Intel to know. I just meant that from Sandy Bridge all the way to Kaby Lake, Intel milked their customers hard, selling more of the same over and over; mind, the advances were mostly node shrinks that for the most part just lower their manufacturing costs, and it is impossible to deny that'd have continued if AMD hadn't gotten their shit together and made actual good-value processors, unlike the FX lineup.


5 minutes ago, TheCherryKing said:

I doubt that they will produce any more 5 GHz processors in the near future. 5 GHz will burn out a CPU very quickly!

Thermal runaway is the bigger killer. Which raises the question: why don't they make the dies larger to help with heat dissipation?



2 hours ago, ARikozuM said:

Thermal runaway is the bigger killer. Which raises the question: why don't they make the dies larger to help with heat dissipation?

That same concept can be applied to the Intel Core X Series. The Intel Core X Processors should have used the LGA 3647 socket with larger dies. 


6 hours ago, Princess Cadence said:

Oh trust me, I know architectures are in research and such for years, and while direct competitors' releases are not straight answers to one another, that doesn't change the fact that Coffee Lake might not have been originally meant to be quad cores only just the same xD

 

Who knows... we're not in Intel to know. I just meant that from Sandy Bridge all the way to Kaby Lake, Intel milked their customers hard, selling more of the same over and over; mind, the advances were mostly node shrinks that for the most part just lower their manufacturing costs, and it is impossible to deny that'd have continued if AMD hadn't gotten their shit together and made actual good-value processors, unlike the FX lineup.

Coffee Lake was added somewhere in 2015, when it was clear that 10nm was going to run late. 14nm++ (ugh, this naming) was the first process that would allow for solid clocks on 6c with mainstream thermals. By then, Intel would have known almost completely the nature of the Zen design and how many cores it would have. While Intel would have had no way to know the gaming performance or other differentials as we understand them now, they can put some basic math together: 4c vs 8c at anywhere close to similar IPC and clocks is going to lose badly on anything that can leverage the cores.

 

The interesting thing is that if 14nm++ hadn't seen the performance uplift it ended up having, Intel would have been stuck with a performance degradation at the top SKUs, which is the result of pushing the clocks further and further with limited IPC uplift. For what consumers do with mainstream Intel parts, there's also a question of whether they can actually increase IPC by any significant amount. There's a reason they keep adding more instruction sets, and it appears that "Sapphire Rapids" will break certain backwards compatibilities with x86.
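Those instruction-set additions only pay off when software actually checks for and uses them, which is why optimized code gates its fast paths at runtime. A minimal sketch using the GCC/Clang CPU-detection builtins:

#include <stdio.h>

/* Runtime ISA detection with GCC/Clang builtins: optimized code normally
   checks CPUID like this before taking an AVX2 or AVX-512 fast path. */
int main(void) {
    __builtin_cpu_init();
    printf("SSE4.2  : %s\n", __builtin_cpu_supports("sse4.2")  ? "yes" : "no");
    printf("AVX2    : %s\n", __builtin_cpu_supports("avx2")    ? "yes" : "no");
    printf("AVX-512F: %s\n", __builtin_cpu_supports("avx512f") ? "yes" : "no");
    return 0;
}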


10 hours ago, Trixanity said:

Isn't it pretty much impossible to detach the L3 cache from the CCX design without breaking the modularity and therefore the scalability? The L3 cache is important to the coherency between the CCXs, but by design it has to be part of a CCX. They might want to change it from a victim cache to an inclusive cache so that it's more useful instead of being a data dump, but I can imagine it suddenly becoming very complicated and less predictable, especially if you want to share data between the CCXs.

 

Ultimately it seems like the only solution is to somehow make transferring data from CCX to CCX faster, or otherwise abandon it and go monolithic (which obviously isn't gonna happen). Whether that's changing the cache or relying on improvements to the interconnect, it seems to me that it's a very tough nut to crack, as it's a core limitation of MCM designs that you kinda have to work around as best as possible. It seems like the best you can do is to increase bandwidth to attempt to hide the latency.

A larger, slightly faster L3 cache with better prefetch is probably enough to give a performance uplift. Realistically it's a problem that doesn't need fixing in hardware though: why should data be passing between CCXs? Most things don't use 4+ threads, and the ones that do don't have single tasks distributed that wide; a lot of the things that do distribute across cores at scale don't actually require much inter-core data communication.

 

Compiler and code optimization is a lot easier than architecture changes, that's for sure.
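On the code side, one blunt but effective option is simply keeping communicating threads inside one CCX. A minimal Linux sketch; the 0-3 CPU mask is an assumption about which logical CPUs share a CCX, so verify the mapping with lstopo first:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Confine the whole process (and any threads it spawns later) to CPUs 0-3,
   ASSUMED here to be one CCX; verify the topology before relying on it.
   Shell equivalent, no code changes needed: taskset -c 0-3 ./myapp */
int main(void) {
    cpu_set_t one_ccx;
    CPU_ZERO(&one_ccx);
    for (int cpu = 0; cpu < 4; cpu++)
        CPU_SET(cpu, &one_ccx);

    if (sched_setaffinity(0, sizeof(one_ccx), &one_ccx) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    puts("pinned to CPUs 0-3; no cross-CCX traffic for this process");
    /* ... spawn worker threads here; they inherit the affinity mask ... */
    return 0;
}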


54 minutes ago, leadeater said:

A larger, slightly faster L3 cache with better prefetch is probably enough to give a performance uplift. Realistically it's a problem that doesn't need fixing in hardware though: why should data be passing between CCXs? Most things don't use 4+ threads, and the ones that do don't have single tasks distributed that wide; a lot of the things that do distribute across cores at scale don't actually require much inter-core data communication.

 

Compiler and code optimization is a lot easier than architecture changes, that's for sure.

Most of the issues with the CCX-to-CCX penalty only show up in gaming, as that's the only workload moving small enough data around, frequently enough, to run into issues. Well, and Adobe products, which somehow mostly run faster on desktop Intel than HEDT Intel.

 

I expect the 16 MB-per-CCX L3 cache rumor to be true for Zen2, along with it running at improved speed. Cache speed has actually been the one place Intel has seen a sizable improvement in the Skylake generation, so I expect it's an area of heavy development for both companies going forward.

