
Ryzen on 12nm might get a 50% core increase

cj09beira
Solved by Nowak:

Fake news, this was posted on r/AyyMD (an AMD circlejerk sub) a month ago and the media just now fell for it.

 

 

42 minutes ago, leadeater said:

The other option is to do either a smaller core count increase or none, and add in extra instruction sets to give targeted performance uplift. Things like improving AVX, increasing cache sizes, doubling the memory channels (16-channel EPYC ftw), increasing the CCX core count and removing the on-die IF link to only use it inter-die, and decoupling the IF from the memory controller if doing the previously mentioned removal of the intra-die IF.

Decoupling the IF would pretty much just mean using a single CCX die, but it wouldn't be suitable for the MCM setups in TR or Epyc. They could very well do that with a singular Mainstream design, but I doubt that's going to happen. The power of the IF has as much to do with AMD's risk profile of being fabless as it does with the computational power of the designs, so I think it stays.

 

I do think Zen2 is getting full AVX2 ports and greatly improved 128-bit geometry processing units as well. My assumption is those are the types of things Jim Keller meant by AMD knowing the easy stuff to address. Even with an increased cache size and its larger share of the core area, they're looking at around a 55% area shrink going to the next node. (7nm is roughly a 1.5-node jump for AMD.)

 

But I also believe we're getting a Mainstream and a Server design. I would believe the 16c rumor for the Server part. I could also see 8c as the Mainstream part, and it ends up really, really tiny. Or they toss just enough of Vega on there so it can run a web browser. Don't know on that one. Maybe the Ryzen 7 3000 series parts will be 125 mm2 dies. You thought the yields were good before?

 

Oh, I don't think we're getting beyond 8-channel memory for a bit. Having seen some of the dual-EPYC boards, they're mostly RAM with a little space for some CPUs and some I/O. Though if DDR5 allows for running multiple channels per DIMM, that would change things. (This also reminds me that I really should read up on the changes coming with DDR5, as they're pretty big.)


8 minutes ago, Taf the Ghost said:

Decoupling the IF would pretty much just mean using a single CCX die, but it wouldn't be suitable for the MCM setups in TR or Epyc.

It would still work; the IF can happily still connect dies. What I mean by decoupling the IF from the memory controller is using an on-package (off-die) reference clock that all dies in an MCM package can use for timing of the IF links. A single-die product like Ryzen just won't have it; the on-die components will still be present but inactive. Once you go TR/EPYC with MCM, you put the clock chip on the package and link it up to all the dies.

 

This way you are putting less work on the memory subsystem and allowing the two to be developed independently; right now it's a bit hard to make the IF faster if there is a memory development stall.

 

8 minutes ago, Taf the Ghost said:

Oh, I don't think we're getting beyond 8-channel memory for a bit. Having seen some of the dual-EPYC boards, they're mostly RAM with a little space for some CPUs and some I/O. Though if DDR5 allows for running multiple channels per DIMM, that would change things. (This also reminds me that I really should read up on the changes coming with DDR5, as they're pretty big.)

Just run a single RAM module per channel so the number of slots stays the same. More channels doesn't have to mean more RAM slots on the motherboard; past a certain point it will, of course.
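For a rough sense of why you'd want more channels at all, here's a back-of-the-envelope peak-bandwidth calculation (a minimal sketch using nominal DDR4-2666 figures, purely illustrative):

#include <stdio.h>

/* Theoretical peak bandwidth = channels x transfer rate (MT/s) x 8 bytes
   per transfer (each DDR4 channel is 64 bits wide). Nominal numbers,
   illustration only; real-world efficiency is lower. */
int main(void) {
    const double mts = 2666.0;          /* DDR4-2666: mega-transfers per second */
    const double bytes_per_xfer = 8.0;  /* 64-bit channel */
    for (int channels = 2; channels <= 16; channels *= 2) {
        double gbs = channels * mts * bytes_per_xfer / 1000.0;
        printf("%2d channels: %6.1f GB/s peak\n", channels, gbs);
    }
    return 0;
}

Per channel that's about 21.3 GB/s, so going from 2 to 8 channels takes the theoretical ceiling from ~42.7 to ~170.6 GB/s, before any real-world efficiency losses.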


2 minutes ago, leadeater said:

It would still work; the IF can happily still connect dies. What I mean by decoupling the IF from the memory controller is using an on-package (off-die) reference clock that all dies in an MCM package can use for timing of the IF links. A single-die product like Ryzen just won't have it; the on-die components will still be present but inactive. Once you go TR/EPYC with MCM, you put the clock chip on the package and link it up to all the dies.

 

This way you are putting less work on the memory subsystem and allowing the two to be developed independently; right now it's a bit hard to make the IF faster if there is a memory development stall.

 

Just run a single RAM module per channel so the number of slots stays the same. More channels doesn't have to mean more RAM slots on the motherboard; past a certain point it will, of course.

We'll see where AMD's thinking is on segmenting Client & Server. The intra-CCX latency is actually insanely fast, but that's not normally something we're talking about. While the Zen design philosophy is great, I'm not sure AMD can always get away with using the same design for both Mainstream & Server parts. The design allows them to make some massive high-core-count Epycs, but some of the IP would likely scale better if designed independently.

 

Thinking about it a bit more, there's actually a way to do the Mainstream & Server designs but still use a lot of Mainstream parts in the Server. If Mainstream is still 8c and Server is 16c, you end up with 2 separate Epyc products: "high speed" and "high cores". If there are some other, extra features on the Server design (maybe a large AVX-512 implementation not on the Mainstream?), you end up with some interesting product segmentation.

 

Related: AMD really could have used a full Epyc rollout before the final design tape-out of Zen2. Having some actual feedback on what people are buying is pretty useful.


Despite the integrated memory controller (IMC), the memory channels are not part of the CCX. Every processor can be scaled differently in some components, even GPUs. This allows you to make all sorts of monstrosities that you desire in the processor world. Processors have been made in modular manners (except for Intel), and GPUs have been for many years. For example, a single Nvidia xx80 chip can be cut down into an xx60 or xx20 chip, which can be cut down further. The memory controllers can be different too, so the same GPU chip can sometimes have GDDR5 and HBM variants, for example.

 

So you won't see too many memory channels if you want the most optimal and cost-effective package. The more memory channels, the more expensive the motherboard will be too. This is what allows for a very cheap $30 microATX board: the CPU has everything integrated and not too many of anything for mainstream.

 

To put it simply, the processor itself consists of transistors. Each processor has an interconnect with a bus, so each CCX core or even Intel core has this same config. The core itself usually consists of:

- transistors

- controller (for busses)

- cache

- sensors (like thermometer)

 

On the package, you get

- core clusters

- memory controller

- sensors (like voltage and amps, more thermometers)

- capacitors and resistors

- higher level cache/RAM

- Northbridge (PCIe controller, for non-CPUs)

- bus interconnects (some processors)

 

That's the basics of CPU scaling. You can't decouple some things, and not everything can be coupled. You have to balance everything because the actual physical size of the package/die is crucial for power use, heat, max clocks and latency. Make a package too big and latency can become an issue (which is why you don't see many-core chips doing well in desktops), not to mention that clocks get less accurate with distance. This is why GPUs and many-core chips are good at specific tasks while our standard multicore CPUs are better at serial tasks. Big packages can be hard to make full use of :P .


1 minute ago, System Error Message said:

So you won't see too many memory channels if you want the most optimal and cost-effective package. The more memory channels, the more expensive the motherboard will be too. This is what allows for a very cheap $30 microATX board: the CPU has everything integrated and not too many of anything for mainstream.

That's actually one of the really smart things about Zen, TR/EPYC and MCM: being able to increase the memory channels without increasing the design and production cost of the die.


6 minutes ago, Taf the Ghost said:

The intra-CCX latency is actually insanely fast, but that's not normally something we're talking about.

That's why I'm wondering if, for Ryzen at least, it's better to go for a single CCX. Though from my understanding the inter-CCX latency is mostly impacted by the L3 cache, not the IF itself, so improving that could actually be the silver bullet. Maybe it's not even the L3 cache that is slow but the timing alignment between the cores, caches, IMC and IF, which would actually make my idea of using a dedicated IF clock much worse for that, much much worse.
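One way to put numbers on that is a pinned ping-pong test: two threads bounce a flag through a shared cache line and you time the round trip, which jumps sharply when the pair straddles a CCX boundary. A minimal Linux/GCC sketch; the choice of CPUs 0 and 4 is an assumption about topology, so check lstopo or /proc/cpuinfo for the actual CCX layout on a given chip:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

#define ROUNDS 1000000

static atomic_int ball;            /* shared cache line both threads bounce */

/* Pin the calling thread to one logical CPU. */
static void pin(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *pong(void *arg) {
    pin(*(int *)arg);
    for (int i = 0; i < ROUNDS; i++) {
        while (atomic_load(&ball) != 1) ;   /* wait for the ping */
        atomic_store(&ball, 0);             /* send it back */
    }
    return NULL;
}

int main(void) {
    int peer = 4;     /* ASSUMPTION: CPU 4 lives in the other CCX */
    pthread_t t;
    pin(0);           /* measuring thread on CPU 0 */
    pthread_create(&t, NULL, pong, &peer);

    struct timespec a, b;
    clock_gettime(CLOCK_MONOTONIC, &a);
    for (int i = 0; i < ROUNDS; i++) {
        atomic_store(&ball, 1);             /* ping */
        while (atomic_load(&ball) != 0) ;   /* wait for the pong */
    }
    clock_gettime(CLOCK_MONOTONIC, &b);
    pthread_join(t, NULL);

    double ns = (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
    printf("avg round trip: %.1f ns\n", ns / ROUNDS);
    return 0;
}

Run it once with the peer in the same CCX and once with it in the other CCX; the gap between the two averages is the cross-CCX penalty.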


12 minutes ago, leadeater said:

That's why I'm wondering if, for Ryzen at least, it's better to go for a single CCX. Though from my understanding the inter-CCX latency is mostly impacted by the L3 cache, not the IF itself, so improving that could actually be the silver bullet. Maybe it's not even the L3 cache that is slow but the timing alignment between the cores, caches, IMC and IF, which would actually make my idea of using a dedicated IF clock much worse for that, much much worse.

On LGA 1366 Intel i-series, the L3 cache can be overclocked for some amazing results. I find the limit of the CPU to be the speed of the L3 cache; when the L3 cache exceeds L2 cache bandwidth, the CPU resets itself (reboots). If only that could be achieved on Ryzen.


13 minutes ago, leadeater said:

That's why I'm wondering if, for Ryzen at least, it's better to go for a single CCX. Though from my understanding the inter-CCX latency is mostly impacted by the L3 cache, not the IF itself, so improving that could actually be the silver bullet. Maybe it's not even the L3 cache that is slow but the timing alignment between the cores, caches, IMC and IF, which would actually make my idea of using a dedicated IF clock much worse for that, much much worse.

I think the intra-CCX latency being so low has a lot to do with being able to call a core via direct attachment rather than going through any other bus. (Which is also why I don't think we're getting larger than a 4c CCX for a bit.) The CCXs actually communicate through the L3 cache, if I understand the situation properly. That's why CCX-to-CCX moves hit a latency penalty beyond 8 MB, since that's the L3 size.

 

Also, the latency, at least on-die CCX to CCX, isn't that bad at stock 2666 memory. If AMD can drop it by about 40% over the next 2 generations, it'll be as fast as the Ring Bus currently is, though I think keeping it as a "wide" connection is probably more important going forward than dropping the latency that much.


28 minutes ago, Taf the Ghost said:

I think the intra-CCX latency being so low has a lot to do with being able to call a core via direct attachment rather than going through any other bus. (Which is also why I don't think we're getting larger than a 4c CCX for a bit.) The CCXs actually communicate through the L3 cache, if I understand the situation properly. That's why CCX-to-CCX moves hit a latency penalty beyond 8 MB, since that's the L3 size.

 

Also, the latency, at least on-die CCX to CCX, isn't that bad at stock 2666 memory. If AMD can drop it by about 40% over the next 2 generations, it'll be as fast as the Ring Bus currently is, though I think keeping it as a "wide" connection is probably more important going forward than dropping the latency that much.

That's the main issue though: the average core-to-core latency across all cores on Zen is a fair amount higher than Intel's, and for applications that aren't Zen-aware it leads to those weird performance issues we saw at launch.

 

Anything that goes between CCXs causes an L3 cache read, store and read operation, as the data must travel through both L3 cache chunks. That's why inter-CCX core latency is double the L3 latency. This isn't an IF issue; as you can see below, the IF adds near-zero latency. It's a design limitation.

 

Quote

The local "inside the CCX" 8 MB L3-cache is accessed with very little latency. But once the core needs to access another L3-cache chunk – even on the same die – unloaded latency is pretty bad: it's only slightly better than the DRAM access latency.

 

Quote
Mem Hierarchy              | AMD EPYC 7601 (DDR4-2400) | Intel Skylake-SP (DDR4-2666) | Intel Broadwell Xeon E5-2699v4 (DDR4-2400)
L1 cache (cycles)          | 4                         | 4                            | 4
L2 cache (cycles)          | 12                        | 14-22                        | 12-15
L3 cache, 4-8 MB (cycles)  | 34-47                     | 54-56                        | 38-51
L3 cache, 16-32 MB (ns)    | 89-95                     | 25-27 (+/- 55 cycles?)       | 27-42 (+/- 47 cycles)
Memory, 384-512 MB (ns)    | 96-98                     | 89-91                        | 95

https://www.anandtech.com/show/11544/intel-skylake-ep-vs-amd-epyc-7000-cpu-battle-of-the-decade/13

 

Zen's L3 latency is actually very good, better than Intel's; however, Ryzen gets let down by the design due to the double cache access when compared to Intel architectures. Any improvement here will have a big impact on inter-CCX latency, be it simply lowering the L3 cache latency or improving prefetching so data is already in the local L3 chunk. My guess is this is why AMD made such a big deal of SenseMI (neural net my ass though).

 

Edit:

More interesting data: http://www.7-cpu.com/cpu/Zen.html
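For reference, latency tables like the one above are typically produced with a dependent-load ("pointer chasing") loop: each load's address comes from the previous load, so nothing can overlap and the time per step approximates the raw latency at that working-set size. A simplified sketch of the idea (a real tool randomizes the chain so the prefetchers can't hide the latency):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Walk a pointer chain spanning `bytes` of memory. Because each load
   depends on the previous one, time per hop ~= load latency at that
   working-set size (L1, L2, L3 or DRAM). Simplified: a real tool
   shuffles the chain randomly to defeat the hardware prefetchers. */
static double chase(size_t bytes) {
    size_t n = bytes / sizeof(void *);
    void **ring = malloc(n * sizeof(void *));
    /* stride of 8 pointers (64 B) = one new cache line per hop */
    for (size_t i = 0; i < n; i++)
        ring[i] = &ring[(i + 8) % n];

    void **p = ring;
    const long hops = 20 * 1000 * 1000;
    struct timespec a, b;
    clock_gettime(CLOCK_MONOTONIC, &a);
    for (long i = 0; i < hops; i++)
        p = *p;                           /* dependent load chain */
    clock_gettime(CLOCK_MONOTONIC, &b);

    double ns = (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
    free(ring);
    return p ? ns / hops : 0.0;           /* use p so the loop isn't elided */
}

int main(void) {
    size_t sizes[] = { 16 << 10, 256 << 10, 4 << 20, 256 << 20 };
    for (int i = 0; i < 4; i++)
        printf("%8zu KiB: %5.2f ns/load\n", sizes[i] >> 10, chase(sizes[i]));
    return 0;
}

As the working set crosses each cache level's capacity, the ns/load figure steps up, which is exactly where the cycle counts in the table come from.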


What about moving the communication over to the L2 cache? Bandwidth would have to be increased, but you would then have lower latency and a more unified cache. It would be less modular though, as L3 would become its own thing.


20 minutes ago, cj09beira said:

What about moving the communication over to the L2 cache? Bandwidth would have to be increased, but you would then have lower latency and a more unified cache. It would be less modular though, as L3 would become its own thing.

The problem is to do with size and localisation. L1 and L2 caches are localised to a core while L3 is shared. This means that communication via L2 would be slower, as one core would have to access another core's L2. This is why L3 exists.


6 hours ago, leadeater said:

Anything that goes between CCXs causes an L3 cache read, store and read operation, as the data must travel through both L3 cache chunks. That's why inter-CCX core latency is double the L3 latency.

6 hours ago, System Error Message said:

The problem is to do with size and localisation. L1 and L2 caches are localised to a core while L3 is shared. This means that communication via L2 would be slower, as one core would have to access another core's L2. This is why L3 exists.

I wonder if they could design a common L3 cache that could be shared between CCXs, or perhaps add an L4 cache before hitting system RAM.  Not sure of the technical limitations of doing so, though.


9 minutes ago, Jito463 said:

I wonder if they could design a common L3 cache that could be shared between CCXs, or perhaps add an L4 cache before hitting system RAM.  Not sure of the technical limitations of doing so, though.

A common L3 cache is probably the only one that would work; an L4 cache would have the same issue, as data would have to be pushed up to L4 then down to the local L3 cache. Data that passes between CCXs doesn't actually hit RAM at all; it's just that, due to the design, it's basically just as slow.


1 hour ago, leadeater said:

A common L3 cache is probably the only one that would work; an L4 cache would have the same issue, as data would have to be pushed up to L4 then down to the local L3 cache. Data that passes between CCXs doesn't actually hit RAM at all; it's just that, due to the design, it's basically just as slow.

Isn't it pretty much impossible to detach the L3 cache from the CCX design without breaking the modularity and therefore the scalability? The L3 cache is important to the coherency between the CCXs, but by design it has to be part of a CCX. They might want to change it from a victim cache to an inclusive cache so that it's more useful instead of being a data dump, but I can imagine it suddenly becoming very complicated and less predictable, especially if you want to share data between the CCXs.

 

Ultimately it seems like the only solution is to somehow make transferring data from CCX to CCX faster, or otherwise abandon it and go monolithic (which obviously isn't gonna happen). Whether that's changing the cache or relying on improvements to the interconnect, it seems to me that it's a very tough nut to crack, as it's a core limitation of MCM designs that you kinda have to work around as best as possible. It seems like the best you can do is to increase bandwidth to attempt to hide the latency.


On 12/10/2017 at 3:40 PM, Jon Jon said:

Who was happy with the 7700K? It performed clock-for-clock identical to the 6700K at launch. The i7 8700K was (and is) an exciting product because of the additional cores and overclockability. I think people who were "perfectly happy" with the 7700K and then magically weren't were probably just fanboys of some sort anyway. The 7700K is still a great chip, but now we have chips that are overall better, which is what we should want.

The i7 7700K was disappointing, we all know it. Sure, there was a lot of optimization and polishing, but it was still so close to the i7 6700K that it was hardly worth anyone's time if you had any of the past 3 generations of i7... We all know Intel wished to keep things locked to quad cores to milk us for money... :/

 

The i7 8700K is a game changer, but only thanks to AMD; that is indisputable.


25 minutes ago, Princess Cadence said:

The i7 7700K was disappointing, we all know it. Sure, there was a lot of optimization and polishing, but it was still so close to the i7 6700K that it was hardly worth anyone's time if you had any of the past 3 generations of i7... We all know Intel wished to keep things locked to quad cores to milk us for money... :/

You had very little reason to switch from an Intel Core i5 or i7 unless you were going to the enthusiast platforms and you needed the cores. I see the KBL chips as Intel's last go at grabbing more sales until CFL was scheduled to release (it had been scheduled for 2018 since 2015), and Ryzen only pushed it forward a few months. I can't recall if KBL was on the release schedules before or after CFL was added.

 



I doubt that they will produce any more 5 GHz processors in the near future. 5 GHz will burn out a CPU very quickly!


5 minutes ago, ARikozuM said:

Oh trust me, I know architectures are in research and such for years, and while direct competitors' releases are not straight answers to one another, that doesn't change the fact that Coffee Lake might not have been originally meant to be quad cores only just the same xD

 

Who knows... we're not in Intel to know. I just meant that from Sandy Bridge all the way to Kaby Lake, Intel milked their customers hard, selling more of the same over and over; mind, the advances were mostly node shrinks that for the most part just lower their manufacturing costs, and it is impossible to deny that'd have continued if AMD hadn't gotten their shit together and made actual good-value processors, unlike the FX lineup.


5 minutes ago, TheCherryKing said:

I doubt that they will produce any more 5 GHz processors in the near future. 5 GHz will burn out a CPU very quickly!

Thermal runaway is the bigger killer. Which raises the question: why don't they make the dies larger to help with heat dissipation?



2 hours ago, ARikozuM said:

Thermal runaway is the bigger killer. Which raises the question: why don't they make the dies larger to help with heat dissipation?

That same concept can be applied to the Intel Core X Series. The Intel Core X Processors should have used the LGA 3647 socket with larger dies. 


6 hours ago, Princess Cadence said:

Oh trust me, I know architectures are in research and such for years, and while direct competitors' releases are not straight answers to one another, that doesn't change the fact that Coffee Lake might not have been originally meant to be quad cores only just the same xD

 

Who knows... we're not in Intel to know. I just meant that from Sandy Bridge all the way to Kaby Lake, Intel milked their customers hard, selling more of the same over and over; mind, the advances were mostly node shrinks that for the most part just lower their manufacturing costs, and it is impossible to deny that'd have continued if AMD hadn't gotten their shit together and made actual good-value processors, unlike the FX lineup.

Coffee Lake was added somewhere in 2015, when it was clear that 10nm was going to run late. 14nm++ (ugh, this naming) was the first process that would allow for solid clocks on 6c with mainstream thermals. By then, Intel would have known almost completely the nature of the Zen design and how many cores it would have. While Intel would have had no way to know the gaming performance or other differentials as we understand them now, they can put some basic math together: 4c vs 8c at anywhere close to similar IPC and clocks is going to lose badly on anything that can leverage the cores.

 

The interesting thing is that if 14nm++ hadn't seen the performance uplift it ended up having, Intel would have been stuck with a performance degradation at the top SKUs, which is the result of pushing the clocks further and further with limited IPC uplift. For what consumers do with mainstream Intel parts, there's also a question of whether they can actually increase IPC by any significant amount. There's a reason they keep adding more instruction sets, and it appears that "Sapphire Rapids" will break certain backwards compatibilities with x86.
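Those instruction-set additions only pay off when software actually checks for and uses them, which is why optimized code gates its fast paths at runtime. A minimal sketch using the GCC/Clang CPU-detection builtins:

#include <stdio.h>

/* Runtime ISA detection with GCC/Clang builtins: optimized code normally
   checks CPUID like this before taking an AVX2 or AVX-512 fast path. */
int main(void) {
    __builtin_cpu_init();
    printf("SSE4.2  : %s\n", __builtin_cpu_supports("sse4.2")  ? "yes" : "no");
    printf("AVX2    : %s\n", __builtin_cpu_supports("avx2")    ? "yes" : "no");
    printf("AVX-512F: %s\n", __builtin_cpu_supports("avx512f") ? "yes" : "no");
    return 0;
}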


10 hours ago, Trixanity said:

Isn't it pretty much impossible to detach the L3 cache from the CCX design without breaking the modularity and therefore the scalability? The L3 cache is important to the coherency between the CCXs, but by design it has to be part of a CCX. They might want to change it from a victim cache to an inclusive cache so that it's more useful instead of being a data dump, but I can imagine it suddenly becoming very complicated and less predictable, especially if you want to share data between the CCXs.

 

Ultimately it seems like the only solution is to somehow make transferring data from CCX to CCX faster, or otherwise abandon it and go monolithic (which obviously isn't gonna happen). Whether that's changing the cache or relying on improvements to the interconnect, it seems to me that it's a very tough nut to crack, as it's a core limitation of MCM designs that you kinda have to work around as best as possible. It seems like the best you can do is to increase bandwidth to attempt to hide the latency.

A larger, slightly faster L3 cache with better prefetch is probably enough to give a performance uplift. Realistically it's a problem that doesn't need fixing in hardware though: why should data be passing between CCXs? Most things don't use 4+ threads, and the ones that do don't have single tasks distributed that wide; a lot of the things that do distribute across cores at scale don't actually require much inter-core data communication.

 

Compiler and code optimization is a lot easier than architecture changes, that's for sure.
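On the code side, one blunt but effective option is simply keeping communicating threads inside one CCX. A minimal Linux sketch; the 0-3 CPU mask is an assumption about which logical CPUs share a CCX, so verify the mapping with lstopo first:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Confine the whole process (and any threads it spawns later) to CPUs 0-3,
   ASSUMED here to be one CCX; verify the topology before relying on it.
   Shell equivalent, no code changes needed: taskset -c 0-3 ./myapp */
int main(void) {
    cpu_set_t one_ccx;
    CPU_ZERO(&one_ccx);
    for (int cpu = 0; cpu < 4; cpu++)
        CPU_SET(cpu, &one_ccx);

    if (sched_setaffinity(0, sizeof(one_ccx), &one_ccx) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    puts("pinned to CPUs 0-3; no cross-CCX traffic for this process");
    /* ... spawn worker threads here; they inherit the affinity mask ... */
    return 0;
}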


54 minutes ago, leadeater said:

A larger, slightly faster L3 cache with better prefetch is probably enough to give a performance uplift. Realistically it's a problem that doesn't need fixing in hardware though: why should data be passing between CCXs? Most things don't use 4+ threads, and the ones that do don't have single tasks distributed that wide; a lot of the things that do distribute across cores at scale don't actually require much inter-core data communication.

 

Compiler and code optimization is a lot easier than architecture changes, that's for sure.

Most of the issues with the CCX-to-CCX penalty only show up in gaming, as that's the only workload moving small enough data around, frequently enough, to run into issues. Well, and Adobe products, which somehow mostly run faster on desktop Intel than HEDT Intel.

 

I expect the 16 MB-per-CCX L3 cache rumor to be true for Zen2, along with it running at improved speed. Cache speed has actually been the one place Intel has seen a sizable improvement in the Skylake generation, so I expect it's an area of heavy development for both companies going forward.

