[Rumour] AMD's Zen To Have Ten Pipelines Per Core

HKZeroFive · October 3, 2015

As heard earlier this year, Zen will use SMT and an improved cache subsystem while being designed from scratch, combining new ideas with reused existing components (to reduce the effort). This might even include existing, partly developed ideas that were not realized in previous designs. Patents have been filed for a lot of the new functionality. For example, there was a mention of checkpointing, which allows quick recovery from mispredicted branches and from other events that require restarting the pipelines. Some patents suggest that Zen might use a slightly modified version of Excavator's branch prediction.

The new patch also suggests nicely low latencies for int/fp mul, fp add, int/fp div and fp square root. Some of these lower latencies (div/sqrt) were introduced with Excavator, as an Aida64 instruction latency dump provided by Anandtech forum user monstercameron revealed. Due to an Aida problem with measured and reported clock frequencies (although the clock was fixed at 1.4GHz), you have to multiply the measured times by 1.4 to get the real number of cycles. OK, back to Zen.
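As a rough sketch of that correction: the 1.4 factor is the one stated above, while the sample readings and names are made up for illustration and are not taken from the actual dump.

```python
# Scale factor from the post: multiply Aida64's reported latencies by 1.4
# to get the real number of cycles.
AIDA_CORRECTION = 1.4


def real_cycles(reported_latency: float) -> float:
    """Convert a latency as reported by Aida64 into actual clock cycles."""
    return reported_latency * AIDA_CORRECTION


# Hypothetical readings, for illustration only (not real measurements).
sample_readings = {"op_a": 5.0, "op_b": 10.0}
corrected = {op: round(real_cycles(v), 1) for op, v in sample_readings.items()}
print(corrected)
```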

Here are some quotes from the patch file:

+;; Decoders unit has 4 decoders and all of them can decode fast path

+;; and vector type instructions.

+;; Integer unit 4 ALU pipes.

+;; 2 AGU pipes.

+;; Floating point unit 4 FP pipes.

+ 32, /* size of l1 cache. */

+ 512, /* size of l2 cache. */

Excerpt:

4 wide decoders

4 integer ALUs

2 AGUs (for 2R 1W L1 cache according to a LinkedIn profile)

4 FP pipelines

That makes ten pipelines with a general four wide design.
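The count is easy to check. A minimal sketch, using only the pipe counts from the patch comments quoted above (the decoders are front-end width, so they are not counted as execution pipes here; the names are illustrative):

```python
# Per-core execution pipes as listed in the quoted patch comments.
ZEN_PIPES = {"ALU": 4, "AGU": 2, "FP": 4}


def total_pipes(pipes: dict[str, int]) -> int:
    """Sum the execution pipes of one core."""
    return sum(pipes.values())


print(total_pipes(ZEN_PIPES))  # 4 + 2 + 4 = 10
```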

There is a lot more information, which I will collect over the next few days. Some of it is copy-pasted from Excavator (bdver4) or Jaguar (btver2) and then modified. Careful comparison showed some clear differences, while in other places it's not clear whether there is new information or not (e.g. div latencies). But as btver2 has 2048 kB of L2 and the rest of the block is more similar to bdver4 or btver2 than to btver1 (Bobcat), which has 512 kB of L2, it looks like no btver1 files were used as a source. So I assume that this is a new L2 cache size entry, indicating fast per-core L2 caches. The L1 data cache still has the same size as that of Jaguar or Excavator. Some patents mention an 8-way 32 kB L1 D$.

Interestingly, as there are two 128b FP mul and two 128b FP add units (with only 3 cycles latency for these ops), the FMA instructions will be executed by combining one FP MUL and one FP ADD unit, resulting in 2 issues and 5 cycles latency (the same as in the Bulldozer family). This saves some register file ports, increases throughput and reduces the latencies of the more common FP ops. It even reminds me of the bridged FMA unit.
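To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes only what is stated above (two 128-bit mul units, two 128-bit add units, one FMA occupying one of each) and ignores issue limits, latencies and everything else, so treat it as a toy model and not as a statement about the real chip.

```python
VECTOR_BITS = 128  # width of each FP unit, per the post
MUL_UNITS = 2
ADD_UNITS = 2


def lanes(element_bits: int) -> int:
    """Number of elements that fit into one 128-bit vector."""
    return VECTOR_BITS // element_bits


def peak_flops_separate(element_bits: int) -> int:
    """Peak FLOPs per cycle when every mul and add unit runs its own op."""
    return (MUL_UNITS + ADD_UNITS) * lanes(element_bits)


def peak_flops_fma(element_bits: int) -> int:
    """Peak FLOPs per cycle when each FMA ties up one mul and one add unit."""
    fma_per_cycle = min(MUL_UNITS, ADD_UNITS)
    return fma_per_cycle * lanes(element_bits) * 2  # an FMA is 2 FLOPs per lane


for bits, name in ((32, "single"), (64, "double")):
    print(name, peak_flops_separate(bits), peak_flops_fma(bits))
```

In this simplified model the peak is the same either way, which fits the idea that splitting the units mainly benefits plain mul and add ops.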

These latencies also clearly suggest that this is not a high clock frequency design. But at 14nm (or 16nm from TSMC, as some rumours suggest), clocks of 3.5 to 4 GHz should be reachable without stretching the thermal limits too much.

So, some of you may be asking 'HKZeroFive, what does this mumbo-jumbo mean exactly?'.

Well, the article claims that Zen has four instruction decoders per core, all of which can decode fast-path and vector instructions. It also claims that each core has four ALUs, meaning that a core can execute up to four integer operations simultaneously, one per ALU. The same applies to the four FP pipes and floating point operations.

TL;DR - Hyperthreading stuff. One core can execute multiple instructions/commands at once, such as integer and floating point operations.

Sauce: http://dresdenboy.blogspot.com/2015/10/amds-zen-core-family-17h-to-have-ten.html

qwertywarrior · October 3, 2015

I'm hoping we can see it before AMD files for bankruptcy, or it will be one of those rumors flying around like the Voodoo 6000.

Sidiox · October 3, 2015

Holy damn that is insane...

AresKrieger · October 3, 2015

Why would AMD need that level of hyper threading? They already give their CPUs 8 physical cores fairly regularly. Zen looks stranger every time I look at it; hopefully it doesn't fall flat, so it promotes competition.

LukaP · October 3, 2015

Interesting. Hope their branch prediction is good and that they don't focus on high clocks/a long pipeline. Otherwise it's shaping up to be fast.

LukaP · October 3, 2015

Why would AMD need that level of hyper threading? They already give their CPUs 8 physical cores fairly regularly. Zen looks stranger every time I look at it; hopefully it doesn't fall flat, so it promotes competition.

Because they have the transistors to afford it. Intel has as well, but without any competition, they don't need to add more decoders.

Dabombinable · October 3, 2015

So, some of you may be asking 'HKZeroFive, what does this mumbo-jumbo mean exactly?'.

Well, the article claims Zen has four instruction decoders per core, all of which can decode fast-path and vector instructions. It also claims that each core has four ALUs, meaning that a core can execute up to four integer operations simultaneously, one per ALU. The same applies to the four FP pipes and floating point operations.

TL;DR - Hyperthreading stuff. One core can execute multiple instructions/commands at once, such as integer and floating point operations.

Sauce: http://dresdenboy.blogspot.com/2015/10/amds-zen-core-family-17h-to-have-ten.html

*sigh* This sounds like CMT all over again.

DannyRyu · October 3, 2015

AMD is falling behind but I wish AMD could just pull their stuff together cause real competition is best for us

Trik'Stari · October 3, 2015

I hope this ends up being a good thing. The question is, will software for regular users be optimized to use this? (I could probably word this better, but I woke up 10 minutes ago and my brain isn't at full speed yet)

LukaP · October 3, 2015

*sigh* This sounds like CMT all over again.

No it does not. It sounds very PowerPC9 if anything, and that's good.

Dabombinable · October 3, 2015

No it does not. It sounds very PowerPC9 if anything, and that's good.

AMD. 2 ALU per "core". Now it's 4. They still won't be "cores" in the same sense as Intel's current design or AMD's K10. Also, shared cache cripples CPU cores.

patrickjp93 · October 3, 2015

*sigh* This sounds like CMT all over again.

No it doesn't, at all. Seriously, don't open your mouth when you haven't a clue. There's no shared resources between 2 different cores (other than L3 cache, just like Intel) as was in Bulldozer and its derivatives (CMT). This is 1 core hosting (probably 2, not 4) threads with internally shared resources, but if you run 1 thread per core, you get access to all resources unfettered. That's the big difference between CMT and SMT. This is SMT.
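A toy model of that difference, as a sketch only. It assumes 4 ALUs per Zen core (from the patch comments) and 2 ALUs per Bulldozer integer cluster (as mentioned above), and it treats SMT sharing as an even split, which real hardware does not do. The function names are made up for illustration.

```python
def alus_per_thread_smt(alus_per_core: int, active_threads: int) -> float:
    """SMT: all threads on a core draw from one shared pool of ALUs.

    Simplification: the pool is split evenly between active threads.
    """
    return alus_per_core / active_threads


def alus_per_thread_cmt(alus_per_int_cluster: int) -> int:
    """CMT: a thread is tied to its own integer cluster and cannot borrow
    the ALUs of the neighbouring cluster, even when that cluster is idle."""
    return alus_per_int_cluster


print(alus_per_thread_smt(4, 1))  # one thread alone on an SMT core
print(alus_per_thread_smt(4, 2))  # two threads sharing an SMT core
print(alus_per_thread_cmt(2))     # CMT: the same with or without a neighbour
```

With one thread per core, the SMT thread sees all four ALUs, while a CMT thread stays at two no matter what the other half of the module is doing.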

No it does not. It sounds very PowerPC9 if anything, and thats good

It would be PPC 7 if it's 4 threads per core, or just like Intel's Hyperthreading if 2 per core. Power 8 is 8 threads per core, and Power 9 hasn't been released yet.

patrickjp93 · October 3, 2015

AMD. 2 ALU per "core". Now it's 4. They still won't be "cores" in the same sense as Intel's current design or AMD's K10. Also, shared cache cripples CPU cores.

Shared L3 cache, just like Intel's design. Also, Intel has 4 ALUs per core in Haswell onward. Seriously, just quit while you're only so far behind. And shared cache can very much be a good thing if you have a staggered parallel workload on one data set where each core is doing different manipulations or calculations based on that data and flags are used to say which data pieces are ready for the next stage. It's all about the tasks the tool is used for.
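A minimal sketch of that kind of staged workload, with per-element "ready" flags on one shared data set. Python threads only illustrate the flag pattern, not actual cache behaviour, and the stage names and operations are invented for the example.

```python
import threading

# One shared data set plus one "ready" flag per element.
data = [1, 2, 3, 4]
ready = [threading.Event() for _ in data]


def stage_one() -> None:
    """First stage: transform each element, then flag it as ready."""
    for i in range(len(data)):
        data[i] = data[i] * 2
        ready[i].set()


def stage_two(out: list) -> None:
    """Second stage: wait for each flag, then consume that element."""
    for i in range(len(data)):
        ready[i].wait()
        out.append(data[i] + 1)


results: list = []
producer = threading.Thread(target=stage_one)
consumer = threading.Thread(target=stage_two, args=(results,))
consumer.start()
producer.start()
producer.join()
consumer.join()
print(results)
```

The second stage can start on element 0 while the first stage is still working on the later elements, which is where a cache shared between the two workers would help.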

Dabombinable · October 3, 2015

No it doesn't, at all. Seriously, don't open your mouth when you haven't a clue. There's no shared resources between 2 different cores as was in Bulldozer and its derivatives (CMT). This is 1 core hosting (probably 2, not 4) threads with internally shared resources, but if you run 1 thread per core, you get access to all resources unfettered. That's the big difference between CMT and SMT. This is SMT.

It would be PPC 7 if it's 4 threads per core, or just like Intel's Hyperthreading if 2 per core. Power 8 is 8 threads per core, and Power 9 hasn't been released yet.

Sez you. Considering you never even provide sources for anything.

Shared L3 cache, just like Intel's design. Also, Intel has 4 ALUs per core in Haswell onward. Seriously, just quit while you're only so far behind. And shared cache can very much be a good thing if you have a staggered parallel workload on one data set where each core is doing different manipulations or calculations based on that data and flags are used to say which data pieces are ready for the next stage. It's all about the tasks the tool is used for.

Shared cache is shared cache. And Zen doesn't appear to have much in the way of L2 cache at all, so it will still be crippled by it.

patrickjp93 · October 3, 2015

Sez you. Considering you never even provide sources for anything.

I provide sources for everything. You're the one who refuses to read the writing on the wall. It's out of my hands. Now, don't go derailing the thread. -snip-

Edited October 3, 2015 by Blade of Grass

thekeemo · October 3, 2015

For rendering this could easily add 50-100% performance out of each core...

Dabombinable · October 3, 2015

I provide sources for everything. You're the one who refuses to read the writing on the wall. It's out of my hands. Now, don't go derailing the thread. -snip-

No you don't. Plenty of people have seen you deflecting in every way possible to avoid providing a source. -snip-

Edited October 3, 2015 by Blade of Grass

VanayadGaming · October 3, 2015

Did I read that correctly ? Basically one core will have 4 threads? So a Quad core will have 16 Threads? Unlike an i7(quad) which has 8 ? That's amazing!

patrickjp93 · October 3, 2015

Did I read that correctly ? Basically one core will have 4 threads? So a Quad core will have 16 Threads? Unlike an i7(quad) which has 8 ? That's amazing!

I doubt it. Intel already has 4 decoders in Skylake and that's a 2-thread core. AMD could do it 4-way, but that brings up a whole slew of resource splitting problems I don't think x86 was ever designed to handle.
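For reference, the thread-count arithmetic in question, as a quick sketch. Note that the decoder count does not appear in it: threads per core is a separate design parameter, and the 4-way case is purely hypothetical.

```python
def logical_threads(cores: int, threads_per_core: int) -> int:
    """Logical (hardware) threads exposed by a CPU."""
    return cores * threads_per_core


print(logical_threads(4, 2))  # quad core with 2-way SMT, like a quad-core i7
print(logical_threads(4, 4))  # hypothetical quad core with 4-way SMT
```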

MageTank · October 3, 2015

*sigh* This sounds like CMT all over again.

It sounds nothing like CMT. It sounds exactly like SMT to me.

Eroda · October 3, 2015

[removed]

OK, this is simple to solve: put a bet on it. If this turns out to be the same style of CPU, you can rule over us as our mighty saviour, but if it's more like Intel/IBM, then you have to eat something. A sock, maybe: literally pull your work shoe off and digest your own sock. I mean, if you're that confident, it can't hurt you, right?

I doubt it. Intel already has 4 decoders in Skylake and that's a 2-thread core. AMD could do it 4-way, but that brings up a whole slew of resource splitting problems I don't think x86 was ever designed to handle.

I don't claim to know jack shit, but maybe they have solved that issue somehow. That sounds like it would be good, right?

Edited October 3, 2015 by Godlygamer23

Dabombinable · October 3, 2015

It sounds nothing like CMT. It sounds exactly like SMT to me.

The architecture is different of course, but it's a similar design philosophy of more threads = better. And when it comes to marketing, bigger numbers seem to work well for a while.

OK, this is simple to solve: put a bet on it. If this turns out to be the same style of CPU, you can rule over us as our mighty saviour, but if it's more like Intel/IBM, then you have to eat something. A sock, maybe: literally pull your work shoe off and digest your own sock. I mean, if you're that confident, it can't hurt you, right?

I don't claim to know jack shit, but maybe they have solved that issue somehow. That sounds like it would be good, right?

See above.

MageTank · October 3, 2015

The architecture is different of course, but it's a similar design philosophy of more threads = better. And when it comes to marketing, bigger numbers seem to work well for a while.

See above.

I understand why you thought it was CMT. I am just saying, this is SMT. If this block is an accurate representation of what we will see in the final product, you shouldn't have to worry. The #1 downfall of CMT was the way the resources were managed. Modularity. Great in theory, not so great in application.

patrickjp93 · October 3, 2015

Ahaha, no you haven't. And that text book is:

expensive

not easily accessible

You can get it on Amazon, torrent it, or find it in a number of college libraries. It's not that hard to access. You people have impossible standards. The ACM journals aren't easily accessible (you have to pay to access them too). Does that make them an invalid resource? You being unwilling to do any work is none of my concern, though it should greatly concern you and any potential employers.

Prysin · October 3, 2015

No, people have cried foul when their long-held beliefs have been shattered by the real world. It's not my fault if the truth hurts and people can't accept it and choose to sling mud instead.

[removed] That said, enough of the argumentation. This isn't CMT, nothing like it. It's SMT by the book.

hell, even i can see this is SMT... CMT is fundamentally different in terms of schedulers and decode placement

this is CMT (in this case, this is AMD Bulldozer)

Edited October 3, 2015 by Godlygamer23
