
[Rumour] AMD's Zen To Have Ten Pipelines Per Core

HKZeroFive

As reported earlier this year, Zen will use SMT and an improved cache subsystem. It is being designed from scratch, combining new ideas with reused existing components (to reduce the effort). This might even include ideas that were already partly developed but not realized in previous designs. A lot of the new functionality has been filed for patenting. For example, there was a mention of checkpointing, which helps with quick recovery from mispredicted branches and from other events that require restarting the pipelines. Some patents suggest that Zen might use a slightly modified version of Excavator's branch prediction.

The new patch also suggests pleasantly low latencies for int/fp mul, fp add, int/fp div and fp square root. Some of these lower latencies (div/sqrt) were introduced with Excavator, as revealed by an AIDA64 instruction latency dump provided by AnandTech forum user monstercameron. Due to an AIDA64 problem with measured and reported clock frequencies (although the clock was fixed at 1.4 GHz), you have to multiply the measured times by 1.4 to get the real number of cycles. OK, back to Zen.
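The correction described above is a single multiplication. Here is a minimal sketch of it; the function name and the sample readings are made up for illustration and are not taken from the actual AIDA64 dump.

```python
# Correction factor stated in the post: reported values must be multiplied by 1.4.
AIDA_CORRECTION = 1.4


def corrected_cycles(reported, factor=AIDA_CORRECTION):
    """Scale a reported latency figure and round it to whole clock cycles."""
    return round(reported * factor)


# Hypothetical reported readings, chosen only to show the conversion.
for reported in (5.0, 10.0, 15.0):
    print(f"reported {reported:4.1f} -> {corrected_cycles(reported)} cycles")
```

Rounding to an integer assumes that real instruction latencies are whole cycles, so any fractional remainder after scaling is treated as measurement noise.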

Here are some quotes from the patch file:

+;; Decoders unit has 4 decoders and all of them can decode fast path

+;; and vector type instructions.

+;; Integer unit 4 ALU pipes.

+;; 2 AGU pipes.

+;; Floating point unit 4 FP pipes.

+ 32, /* size of l1 cache. */

+ 512, /* size of l2 cache. */

Excerpt:

4 wide decoders

4 integer ALUs

2 AGUs (for 2R 1W L1 cache according to a LinkedIn profile)

4 FP pipelines

That makes ten pipelines with a general four wide design.
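For anyone who wants to check the count, the tally can be written out as a few lines of Python. The unit counts come straight from the patch comments quoted above; the dictionary layout and names are just an illustration.

```python
# Execution pipes per core, as listed in the quoted patch comments.
PIPES = {"integer ALU": 4, "AGU": 2, "floating point": 4}

# The decoders feed the pipes; they are not counted as pipelines themselves.
DECODE_WIDTH = 4


def total_pipelines(pipes):
    """Sum the execution pipes across all unit types."""
    return sum(pipes.values())


print(f"{total_pipelines(PIPES)} pipelines behind a {DECODE_WIDTH}-wide decoder")
```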

[Image: Zen-Architektur+Core+V0.2.png]

There is a lot more information, which I will collect over the next few days. Some of it was copied from Excavator (bdver4) or Jaguar (btver2) and then modified. Careful comparison did show some clear differences, while in other places it is not clear whether there is new information or not (e.g. div latencies). But as btver2 has 2048 kB of L2 and the rest of the block is more similar to bdver4 or btver2 than to btver1 (Bobcat), which has 512 kB of L2, it looks like no btver1 files were used as a source. So I assume that this is a new entry for the L2 cache size, indicating fast per-core L2 caches. The L1 data cache still has the same size as that of Jaguar or Excavator. Some patents mention an 8-way 32 kB L1 D$.

Interestingly, as there are two 128b FP mul and two 128b FP add units (with only 3 cycles latency for these ops), the FMA instructions will be executed by combining one FP MUL and one FP ADD unit, resulting in 2 issues and 5 cycles latency (the same as in the Bulldozer family). This saves some register file ports, increases throughput and reduces the latencies of the more common FP ops. It even reminds me of the bridged FMA unit.
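To put the quoted latencies side by side, here is a small sketch comparing a dependent multiply-then-add against a single FMA. The cycle counts are the ones stated in the paragraph above and are still rumour; the function names are illustrative only.

```python
# Latencies in cycles, as stated in the post (unconfirmed figures).
FP_MUL_LATENCY = 3
FP_ADD_LATENCY = 3
FMA_LATENCY = 5


def separate_mul_add_latency():
    """Latency of a*b + c when the add has to wait for the multiply result."""
    return FP_MUL_LATENCY + FP_ADD_LATENCY


def fma_saving():
    """Cycles saved by issuing one FMA instead of a dependent mul and add."""
    return separate_mul_add_latency() - FMA_LATENCY


print(f"mul then add: {separate_mul_add_latency()} cycles")
print(f"FMA:          {FMA_LATENCY} cycles (saves {fma_saving()})")
```

On these numbers the FMA is only slightly faster than the two separate ops in a dependency chain, which fits the post's point that the design favours the plain mul and add latencies.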

These latencies also clearly suggest that this is not a high-clock-frequency design. But at 14nm (or 16nm from TSMC, as some rumours suggest), clocks of 3.5 to 4 GHz should be reachable without stretching the thermal limits too much.

So, some of you may be asking 'HKZeroFive, what does this mumbo-jumbo mean exactly?'.

Well, the article claims that Zen has four instruction decoders (or hyperthreads) per core, all of which support decoding fast path and vector instructions. It also claims that each core has four ALUs, meaning that a core can execute up to four integer operations simultaneously, one per ALU. The same applies to the FPUs and floating point operations.

TL;DR - Hyperthreading stuff. One core can execute multiple instructions/commands at once, such as integer and floating point operations.

Sauce: http://dresdenboy.blogspot.com/2015/10/amds-zen-core-family-17h-to-have-ten.html

'Fanboyism is stupid' - someone on this forum.

Be nice to each other boys and girls. And don't cheap out on a power supply.


CPU: Intel Core i7 4790K - 4.5 GHz | Motherboard: ASUS MAXIMUS VII HERO | RAM: 32GB Corsair Vengeance Pro DDR3 | SSD: Samsung 850 EVO - 500GB | GPU: MSI GTX 980 Ti Gaming 6GB | PSU: EVGA SuperNOVA 650 G2 | Case: NZXT Phantom 530 | Cooling: CRYORIG R1 Ultimate | Monitor: ASUS ROG Swift PG279Q | Peripherals: Corsair Vengeance K70 and Razer DeathAdder

 


Holy damn that is insane...

"Great minds discuss ideas; average minds discuss events; small minds discuss people."

Main rig:

i7-4790 - 24GB RAM - GTX 970 - Samsung 840 240GB Evo - 2x 2TB Seagate. - 4 monitors - G710+ - G600 - Zalman Z9U3

Other devices

Oneplus One 64GB Sandstone

Surface Pro 3 - i7 - 256Gb

Surface RT

Server:

SuperMicro something - Xeon e3 1220 V2 - 12GB RAM - 16TB of Seagates 


Why would AMD need that level of hyperthreading? They already give their CPUs 8 physical cores fairly regularly. Zen looks stranger every time I look at it. Hopefully it doesn't fall flat, so that it promotes competition.

https://linustechtips.com/main/topic/631048-psu-tier-list-updated/ Tier Breakdown (My understanding)--1 Godly, 2 Great, 3 Good, 4 Average, 5 Meh, 6 Bad, 7 Awful

 


Interesting. I hope their branch prediction is good and that they don't focus on high clocks and a long pipeline. Otherwise it's shaping up to be fast.

"Unofficially Official" Leading Scientific Research and Development Officer of the Official Star Citizen LTT Conglomerate | Reaper Squad, Idris Captain | 1x Aurora LN


Game developer, AI researcher, Developing the UOLTT mobile apps


G SIX [My Mac Pro G5 CaseMod Thread]


Why would AMD need that level of hyperthreading? They already give their CPUs 8 physical cores fairly regularly. Zen looks stranger every time I look at it. Hopefully it doesn't fall flat, so that it promotes competition.

Because they have the transistors to afford it. Intel does as well, but without any competition they don't need to add more decoders.

"Unofficially Official" Leading Scientific Research and Development Officer of the Official Star Citizen LTT Conglomerate | Reaper Squad, Idris Captain | 1x Aurora LN


Game developer, AI researcher, Developing the UOLTT mobile apps


G SIX [My Mac Pro G5 CaseMod Thread]


[Image: Zen-Architektur+Core+V0.2.png]

So, some of you may be asking 'HKZeroFive, what does this mumbo-jumbo mean exactly?'.

Well, the article claims Zen has four instruction decoders (or hyperthreads) per core, all of which support decoding fast path and vector instructions. It also claims that each core has four ALUs, meaning that a core can execute up to four integer operations simultaneously, one per ALU. The same applies to the FPUs and floating point operations.

TL;DR - Hyperthreading stuff. One core can execute multiple instructions/commands at once, such as integer and floating point operations.

Sauce: http://dresdenboy.blogspot.com/2015/10/amds-zen-core-family-17h-to-have-ten.html

*sigh* This sounds like CMT all over again.

"We also blind small animals with cosmetics.
We do not sell cosmetics. We just blind animals."

 

"Please don't mistake us for Equifax. Those fuckers are evil"

 

This PSA brought to you by Equifacks.
PMSL


AMD is falling behind, but I wish AMD could just pull their stuff together, because real competition is best for us :D


I hope this ends up being a good thing. The question is, will software for regular users be optimized to use this? (I could probably word this better, but I woke up 10 minutes ago and my brain isn't at full speed yet)

Ketchup is better than mustard.

GUI is better than Command Line Interface.

Dubs are better than subs


*sigh* This sounds like CMT all over again.

No, it does not. It sounds very PowerPC9 if anything, and that's good.

"Unofficially Official" Leading Scientific Research and Development Officer of the Official Star Citizen LTT Conglomerate | Reaper Squad, Idris Captain | 1x Aurora LN


Game developer, AI researcher, Developing the UOLTT mobile apps


G SIX [My Mac Pro G5 CaseMod Thread]


No, it does not. It sounds very PowerPC9 if anything, and that's good.

AMD. 2 ALUs per "core". Now it's 4. They still won't be "cores" in the same sense as Intel's current design or AMD's K10. Also, shared cache cripples CPU cores.

"We also blind small animals with cosmetics.
We do not sell cosmetics. We just blind animals."

 

"Please don't mistake us for Equifax. Those fuckers are evil"

 

This PSA brought to you by Equifacks.
PMSL


*sigh* This sounds like CMT all over again.

No it doesn't, at all. Seriously, don't open your mouth when you haven't a clue. There's no shared resources between 2 different cores (other than L3 cache, just like Intel) as was in Bulldozer and its derivatives (CMT). This is 1 core hosting (probably 2, not 4) threads with internally shared resources, but if you run 1 thread per core, you get access to all resources unfettered. That's the big difference between CMT and SMT. This is SMT.

 

 

No, it does not. It sounds very PowerPC9 if anything, and that's good.

It would be like POWER7 if it's 4 threads per core, or just like Intel's Hyperthreading if it's 2 per core. POWER8 is 8 threads per core, and POWER9 hasn't been released yet.

Software Engineer for Suncorp (Australia), Computer Tech Enthusiast, Miami University Graduate, Nerd


AMD. 2 ALUs per "core". Now it's 4. They still won't be "cores" in the same sense as Intel's current design or AMD's K10. Also, shared cache cripples CPU cores.

Shared L3 cache, just like Intel's design. Also, Intel has 4 ALUs per core in Haswell onward. Seriously, just quit while you're only so far behind. And shared cache can very much be a good thing if you have a staggered parallel workload on one data set where each core is doing different manipulations or calculations based on that data and flags are used to say which data pieces are ready for the next stage. It's all about the tasks the tool is used for.
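The staggered workload described above can be sketched with two worker threads and one "ready" flag per data item. This is only a toy: the doubling and incrementing steps are invented placeholders, and Python threads show the flag handoff rather than any real cache behaviour.

```python
import threading


def run_staged_pipeline(values):
    """Two workers share one buffer; per-item flags mark finished stage-one work."""
    shared = list(values)                          # the one shared data set
    ready = [threading.Event() for _ in shared]    # one "ready" flag per item
    results = [None] * len(shared)

    def stage_one():
        for i, value in enumerate(shared):
            shared[i] = value * 2                  # first manipulation (placeholder)
            ready[i].set()                         # item i may move to the next stage

    def stage_two():
        for i in range(len(shared)):
            ready[i].wait()                        # block until stage one finished item i
            results[i] = shared[i] + 1             # second manipulation (placeholder)

    workers = [threading.Thread(target=stage_two),
               threading.Thread(target=stage_one)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return results


print(run_staged_pipeline([1, 2, 3]))
```

Stage two starts first on purpose, to show that the flags alone enforce the ordering. On real hardware, the argument is that both workers touching the same buffer benefit from it staying in a cache they share.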

Software Engineer for Suncorp (Australia), Computer Tech Enthusiast, Miami University Graduate, Nerd


No it doesn't, at all. Seriously, don't open your mouth when you haven't a clue. There's no shared resources between 2 different cores as was in Bulldozer and its derivatives (CMT). This is 1 core hosting (probably 2, not 4) threads with internally shared resources, but if you run 1 thread per core, you get access to all resources unfettered. That's the big difference between CMT and SMT. This is SMT.

 

 

It would be like POWER7 if it's 4 threads per core, or just like Intel's Hyperthreading if it's 2 per core. POWER8 is 8 threads per core, and POWER9 hasn't been released yet.

Sez you. Considering you never even provide sources for anything.

 

Shared L3 cache, just like Intel's design. Also, Intel has 4 ALUs per core in Haswell onward. Seriously, just quit while you're only so far behind. And shared cache can very much be a good thing if you have a staggered parallel workload on one data set where each core is doing different manipulations or calculations based on that data and flags are used to say which data pieces are ready for the next stage. It's all about the tasks the tool is used for.

Shared cache is shared cache. And Zen doesn't appear to have much in the way of L2 cache at all, so it will still be crippled by it.

"We also blind small animals with cosmetics.
We do not sell cosmetics. We just blind animals."

 

"Please don't mistake us for Equifax. Those fuckers are evil"

 

This PSA brought to you by Equifacks.
PMSL


Sez you. Considering you never even provide sources for anything.

I provide sources for everything. You're the one who refuses to read the writing on the wall. It's out of my hands. Now, don't go derailing the thread. -snip-

Edited by Blade of Grass

Software Engineer for Suncorp (Australia), Computer Tech Enthusiast, Miami University Graduate, Nerd


For rendering this could easily add 50-100% performance out of each core...

Thats that. If you need to get in touch chances are you can find someone that knows me that can get in touch.


I provide sources for everything. You're the one who refuses to read the writing on the wall. It's out of my hands. Now, don't go derailing the thread. -snip-

No you don't. Plenty of people have seen you deflecting in every way possible to avoid providing a source. -snip-

Edited by Blade of Grass

"We also blind small animals with cosmetics.
We do not sell cosmetics. We just blind animals."

 

"Please don't mistake us for Equifax. Those fuckers are evil"

 

This PSA brought to you by Equifacks.
PMSL


Did I read that correctly? Basically one core will have 4 threads? So a quad core will have 16 threads, unlike a quad-core i7, which has 8? That's amazing!


Did I read that correctly? Basically one core will have 4 threads? So a quad core will have 16 threads, unlike a quad-core i7, which has 8? That's amazing!

I doubt it. Intel already has 4 decoders in Skylake and that's a 2-thread core. AMD could do it 4-way, but that brings up a whole slew of resource splitting problems I don't think x86 was ever designed to handle.
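The thread counts being compared are just cores multiplied by hardware threads per core. A quick sketch, where the 4-way configuration is the hypothetical one from the question and not a confirmed Zen spec:

```python
def logical_threads(cores, threads_per_core):
    """Hardware threads the operating system would see."""
    return cores * threads_per_core


# Quad-core configurations discussed in the thread.
configs = {
    "quad core, 2-way SMT (like a quad-core i7)": (4, 2),
    "quad core, 4-way SMT (hypothetical)": (4, 4),
}

for name, (cores, smt_ways) in configs.items():
    print(f"{name}: {logical_threads(cores, smt_ways)} threads")
```

The decoder count does not appear anywhere in this formula, which is the point being made above: a 4-wide decoder says nothing by itself about how many threads a core hosts.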

Software Engineer for Suncorp (Australia), Computer Tech Enthusiast, Miami University Graduate, Nerd


*sigh* This sounds like CMT all over again.

It sounds nothing like CMT. It sounds exactly like SMT to me.

My (incomplete) memory overclocking guide: 

 

Does memory speed impact gaming performance? Click here to find out!

On 1/2/2017 at 9:32 PM, MageTank said:

Sometimes, we all need a little inspiration.

 

 

 


[removed]

OK, this is simple to solve: put a bet on it. If this turns out to be the same style of CPU, you can rule over us as our mighty saviour, but if it's more like Intel/IBM, then you have to eat something. A sock, maybe: literally pull your work shoe off and digest your own sock. I mean, if you're that confident, it can't hurt you, right?

I doubt it. Intel already has 4 decoders in Skylake and that's a 2-thread core. AMD could do it 4-way, but that brings up a whole slew of resource splitting problems I don't think x86 was ever designed to handle.

 

 

I don't claim to know jack shit, but maybe they have solved that issue somehow. That sounds like it would be good, right?

Edited by Godlygamer23

Processor: Intel core i7 930 @3.6  Mobo: Asus P6TSE  GPU: EVGA GTX 680 SC  RAM:12 GB G-skill Ripjaws 2133@1333  SSD: Intel 335 240gb  HDD: Seagate 500gb


Monitors: 2x Samsung 245B  Keyboard: Blackwidow Ultimate   Mouse: Zowie EC1 Evo   Mousepad: Goliathus Alpha  Headphones: MMX300  Case: Antec DF-85


It sounds nothing like CMT. It sounds exactly like SMT to me.

The architecture is different, of course, but it's a similar design philosophy of more threads = better. And when it comes to marketing, bigger numbers seem to work well for a while.

 

OK, this is simple to solve: put a bet on it. If this turns out to be the same style of CPU, you can rule over us as our mighty saviour, but if it's more like Intel/IBM, then you have to eat something. A sock, maybe: literally pull your work shoe off and digest your own sock. I mean, if you're that confident, it can't hurt you, right?

 

 
 

 

I don't claim to know jack shit, but maybe they have solved that issue somehow. That sounds like it would be good, right?

See above.

"We also blind small animals with cosmetics.
We do not sell cosmetics. We just blind animals."

 

"Please don't mistake us for Equifax. Those fuckers are evil"

 

This PSA brought to you by Equifacks.
PMSL


The architecture is different, of course, but it's a similar design philosophy of more threads = better. And when it comes to marketing, bigger numbers seem to work well for a while.

 
 

See above.

I understand why you thought it was CMT. I am just saying, this is SMT. If this block diagram is an accurate representation of what we will see in the final product, you shouldn't have to worry. The #1 downfall of CMT was the way the resources were managed. Modularity: great in theory, not so great in application.

My (incomplete) memory overclocking guide: 

 

Does memory speed impact gaming performance? Click here to find out!

On 1/2/2017 at 9:32 PM, MageTank said:

Sometimes, we all need a little inspiration.

 

 

 


 

Ahaha, no you haven't. And that text book is:

  • expensive
  • not easily accessible

 

You can get it on Amazon, torrent it, or find it in a number of college libraries. It's not that hard to access. You people have impossible standards. The ACM journals aren't easily accessible either (you have to pay to access them too). Does that make them an invalid resource? You being unwilling to do any work is none of my concern, though it should greatly concern you and any potential employers.

Software Engineer for Suncorp (Australia), Computer Tech Enthusiast, Miami University Graduate, Nerd


No, people have cried foul when their long-held beliefs have been shattered by the real world. It's not my fault if the truth hurts and people can't accept it and choose to sling mud instead.

 

[removed] That said, enough of the argumentation. This isn't CMT, nothing like it. It's SMT by the book.

Hell, even I can see this is SMT... CMT is fundamentally different in terms of scheduler and decoder placement.

 

[Image: 3428291_45cdf95bfa_m.png]

 

this is CMT (in this case, this is AMD Bulldozer)

Edited by Godlygamer23