
Intel & AMD, Architectural Discussion, How Far Ahead Is Intel ?

yeah lol 14nm has been a bitch for intel. Which isn't a bad thing necessarily. Gives others more chance to compete.

Doesn't matter. Samsung has hit a brick wall with 10nm whereas Intel has already cracked it. TSMC and Global Foundries are asleep at the wheel on that front too as they focus on the 12nm GPU core process for Nvidia and AMD graphics processors, and that struggle is shaping up to be a repeat of the 20nm double scoop of delay.

Software Engineer for Suncorp (Australia), Computer Tech Enthusiast, Miami University Graduate, Nerd


Doesn't matter. Samsung has hit a brick wall with 10nm whereas Intel has already cracked it. TSMC and Global Foundries are asleep at the wheel on that front too as they focus on the 12nm GPU core process for Nvidia and AMD graphics processors, and that struggle is shaping up to be a repeat of the 20nm double scoop of delay.

Nope, TSMC has 10nm ready for production next year. Same for Samsung and GlobalFoundries. BTW, Samsung and GlobalFoundries are now in an alliance for the 14nm node and below, so all R&D is shared. They're even sharing fabs at this point. Any progress Samsung makes, GlobalFoundries can leverage, and vice versa.

 


I agree with the majority of what you've said, I don't have any problem with it but you seem to misinterpret what the specifications actually mean for performance.

We already know that Bulldozer has two FMAC units which can process 256-bit AVX instructions together as a single unit. It's very similar to Intel's Sandy Bridge approach, which borrows a 128-bit SIMD unit from the integer side rather than combining two 128-bit float units, so they sacrificed integer execution to gain more throughput out of the FPU.

This approach saves die area & power, but this mutually exclusive sharing also significantly diminishes the performance of mixed integer/float workloads.

From anandtech

http://www.anandtech.com/show/4955/the-bulldozer-review-amd-fx8150-tested/2

You seem to think that because a Bulldozer FPU can process FMA & AVX instructions that it's faster than Phenom which isn't true, it's more capable yes but not faster.

It can process different types of instructions but the processing itself isn't significantly faster than Phenom.

AMD still leads in integer even compared to Haswell no matter how you slice it. Each AMD module has significantly more integer throughput than a Haswell core, and both have the same die size, so from an architectural standpoint that's how the comparison should be drawn. You can't compare a Bulldozer integer core/cluster to a Haswell core because it's less than half the size. Similarly, you can't compare one ALU to another; not all ALUs are created equal.

You can clearly see that in both of these instances, comparing the similarly clocked 7850K (dual-module Steamroller) and i3 4330 (dual-core Haswell), the 7850K is ahead in integer.

[Image: KAVERI-APU-41.jpg]

[Image: KAVERI-APU-40.jpg]

http://www.hardwarecanucks.com/forum/hardware-canucks-reviews/65031-amd-kaveri-a10-7850k-a8-7600-review-9.html

Against Sandy Bridge, Bulldozer was faster in both integer and float, but since the introduction of Haswell's more advanced FPU AMD has lost its float lead. However, AMD's upcoming Excavator will introduce the same float functionality as Haswell. It should also be mentioned that Sandy and Ivy are fundamentally the same: Ivy is a die-shrunk Sandy with very fine improvements and tweaks. Likewise, Broadwell will be a die-shrunk Haswell.

http://www.anandtech.com/show/4955/the-bulldozer-review-amd-fx8150-tested/2

 

Also, what is a multi-threaded workload? It's any workload that can utilize the multiple threads the CPU has to offer. Any CPU-intensive application is smart enough to take advantage of all of the available performance. Exclusively single-threaded workloads exist, but they're far less common, because as soon as you decide or are forced to create a single-threaded workload you instantly realize that you're throwing away the majority of the performance inside any processor today, since all processors sold on the market today are multi-core CPUs. In power-constrained environments this also means throwing away a significant amount of efficiency, meaning you're eating away battery life, which is a huge concern.
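To make "a workload that can utilize multiple threads" concrete, here is a minimal Python sketch of one job cut into independent chunks and handed to a pool of workers. The function names (`count_primes`, `split_range`, `count_primes_parallel`) and the prime-counting task are made up for illustration and are not from any benchmark discussed in this thread.

```python
from concurrent.futures import ThreadPoolExecutor


def count_primes(lo, hi):
    """Count primes in [lo, hi) by trial division (deliberately simple)."""
    count = 0
    for n in range(max(lo, 2), hi):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def split_range(lo, hi, parts):
    """Cut [lo, hi) into `parts` contiguous chunks of near-equal size."""
    step, extra = divmod(hi - lo, parts)
    chunks, start = [], lo
    for i in range(parts):
        end = start + step + (1 if i < extra else 0)
        chunks.append((start, end))
        start = end
    return chunks


def count_primes_parallel(lo, hi, workers=4):
    """Same result as count_primes, but each chunk goes to its own worker."""
    chunks = split_range(lo, hi, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(lambda c: count_primes(*c), chunks))
```

The chunks share no state, which is what lets them run side by side. Note that in standard CPython the interpreter lock keeps pure-Python threads from speeding up CPU-bound work, so this only shows the structure of the split; a process pool (with a named function instead of the lambda) would be the usual way to actually occupy several cores.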

"You seem to think that because a Bulldozer FPU can process FMA & AVX instructions that it's faster than Phenom which isn't true"

When exactly did I say that? I haven't even mentioned the Phenom.

"AMD still leads in integer even compared to Haswell no matter how you slice it, each AMD module has significantly more integer throughput than a Haswell core both of which have the same die size so from an architectural standpoint that's how a comparison should be drawn."

AMD isn't LEADING in integer. That is a VERY broad statement to make. They would need to be significantly better overall in integer to make such a statement.

In some integer workloads Piledriver WILL have the advantage, in some it WON'T. Likewise, some SIMD code is optimized for AMD's SIMD cluster, and that code will obviously fit AMD's SIMD cluster better.

Let's go through an example. The software starts to issue some more complex integer instructions. The fetch stage gets overloaded and starts to queue them. Once the complex instructions reach the IBB, things slow down. They go through slowly, and the decoders also work like mad to decode the "extra" complex instructions. At this point Haswell should already have started the LOAD instructions (because of its shorter pipeline). After the instructions have been decoded on Piledriver, they reach the retirement queue (which could be compared to Haswell's ROB). They then need to be directed to the right PRF. Then we see a LOAD instruction being issued, followed by whatever instruction, and ended with a STORE instruction.

Note: Piledriver's SIMD uses its integer clusters' AGUs (not that it has a HUGE impact on things, but it's worth noting).

My point is that Bulldozer and Piledriver have some flaws. The major ones are:

1) Branch predictor not good enough for a long pipeline

2) Too many stops throughout the pipeline

3) Not effective at more complex instructions (one huge reason why it is good in parallel)

4) Bad cache management

And the biggest flaw: it depends too much on software to fully utilize itself.

EDIT: And yes, you can compare ALUs to each other. All ALUs used in today's processors are the same. There aren't different versions of ALUs, if that is what you are thinking about.


Sandy only has one 128-bit unit per FPU, which combines with a 128-bit unit in the integer pipe to process 256-bit instructions.
The above slide talks about Haswell, not Sandy. Not sure if you're just clueless or intentionally spreading misinformation.

Lol, open your eyes: Sandy Bridge/Nehalem was included in that picture. Clearly you only want to believe what you want to believe. AVX was introduced with Sandy Bridge. Sandy Bridge/AVX can't even process 256-bit integers; that's what AVX2 brought.
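For anyone following the 128-bit vs 256-bit back-and-forth, the width arithmetic itself is simple: a vector register holds (register width / element width) values, and one instruction operates on all of them. A small Python sketch; the helper name `lanes` is made up for illustration, and it says nothing about how many execution units any particular core has.

```python
def lanes(vector_bits, element_bits):
    """Number of elements of `element_bits` that fit in one vector register."""
    if vector_bits % element_bits:
        raise ValueError("element width must divide the vector width")
    return vector_bits // element_bits


# 128-bit SSE register: 4 single-precision or 2 double-precision floats.
sse_single = lanes(128, 32)
sse_double = lanes(128, 64)

# 256-bit AVX register: 8 single-precision or 4 double-precision floats.
avx_single = lanes(256, 32)
avx_double = lanes(256, 64)
```

So going from 128-bit to 256-bit doubles the number of elements handled per instruction. Whether that doubles real throughput depends on how wide the execution units behind the instruction actually are, which is exactly what this thread is arguing about.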

You were wrong on:

- 8350 has better float performance than the 2600K -> a 2600K at 4.8GHz -> a 4770K at 4.2GHz, so basically you said the 8350 is better than the 4770K.
- 8350 outperforms the 4770K clock for clock, which I proved very wrong.
- Sandy Bridge can't process 256-bit floats -> "In order to double floating point throughput, the “Sandy Bridge” microarchitecture introduces Intel Advanced Vector Extensions (Intel AVX) instructions. Intel AVX extends the Intel® SSE floating point instruction set to 256 bits operand size and as a result, the execution units are able to handle 256 bit floating point operands."

Just because you found 2 benchmarks out of 50000 doesn't mean the 8350 has better integer performance, and certainly not clock for clock. A 4770K at 5GHz wipes the floor with the 8350 at 5GHz in terms of float and integer.
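Since "clock for clock" keeps coming up, here is a rough Python sketch of what that normalization means: divide a benchmark score by the clock it was measured at, then compare the per-GHz figures or rescale both chips to a common clock. The function names are made up, and the linear-scaling assumption is only an approximation, since real scores rarely scale perfectly with frequency.

```python
def per_clock(score, clock_ghz):
    """Benchmark score per GHz, i.e. the 'clock for clock' figure."""
    if clock_ghz <= 0:
        raise ValueError("clock must be positive")
    return score / clock_ghz


def scale_to_clock(score, measured_ghz, target_ghz):
    """Estimate the score at another clock, ASSUMING linear scaling."""
    return per_clock(score, measured_ghz) * target_ghz


# Hypothetical numbers only, not taken from any review linked in this thread:
# a chip scoring 100.0 at 4.0 GHz delivers 25.0 points per GHz and would be
# estimated at 125.0 when both chips are compared at 5.0 GHz.
example_per_ghz = per_clock(100.0, 4.0)
example_at_5ghz = scale_to_clock(100.0, 4.0, 5.0)
```

Two chips are only being compared clock for clock when both scores have gone through the same normalization; quoting one chip at stock and the other overclocked mixes the two effects together.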

So asking my question again; why do you keep trying to say that AMD is leading the market in terms of performance with their 8350?


Nope, TSMC has 10nm ready for production next year. Same for Samsung and GlobalFoundries. BTW, Samsung and GlobalFoundries are now in an alliance for the 14nm node and below, so all R&D is shared. They're even sharing fabs at this point. Any progress Samsung makes, GlobalFoundries can leverage, and vice versa.

 

Did you read that article? That's the first avowed schedule. There'll be delays, as usual. In other words, there's no proof of yield yet.



No it isn't. Sandy only has one dedicated 128bit SIMD per floating point unit. Bulldozer has two.

In your benchmark you're also showing the 8350 beating the 3960X, which is 250mm² larger and has 4 more threads and 2 more clusters.

First of all, you have 3 ALUs, three SIMD integer units at 128-bit, and three SIMDs at 256-bit. Haswell bumped the SIMD integer units up from 128-bit to 256-bit, which gives us 6x 256-bit SIMDs, and they gave us another ALU with Haswell. Not every stack here is doing the same thing at all.

[Image: sandy-bridge-5.png]

[Image: haswell-3.png]

One dedicated 128bit SIMD = false.

The 3930K is an octa-core with 2 cores laser-cut; the 8350 is a fully enabled CPU, so you just can't compare the two like this. The 4930K has a die size of 260mm², while AMD needed 320mm² to bring out a quad-core with CMT like the 8150/8350 that's nowhere close to the 4930K in terms of performance, and those 4770Ks are half the size of an 8350 that's getting owned like fuck. The 8350 does not beat the 3930K in integer operations; CPU hash is clearly a fucked-up benchmark, given that they're showing a stupid APU outperforming a 2600K and the 4770K doing worse than the 4670K.

 

 

Sandy only has one 128-bit unit per FPU, which combines with a 128-bit unit in the integer pipe to process 256-bit instructions.

Now the FPU has a unit? A unit has a unit? Seems like you fapped too much on the AMD sales-gimmick diagram. SB has 4 128-bit FPUs in total and the 8350 has 8x 128-bit FPs; that's what made you officially declare that the 8350 outperforms the 2600K in floating point performance, when there's a shitload of evidence pointing against the 8350. You just can't tell a simple story like this when you have jack understanding of Intel's architecture. They haven't ganged the integer with the SIMD FP at all. Instead of widening the datapaths to 256-bit, it's just the SIMD integer with the SIMD FP; that technique saves a bunch of area & power by re-using the execution resources that were already there. One thing does the lower part and the other one does the higher part.

Compare both quotes: you completely changed your statement in the 2nd quote.
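The "one does the lower part, the other does the higher part" idea can be shown with a toy model in Python: an 8-lane (256-bit) add carried out as two 4-lane (128-bit) adds whose results are stitched back together. This only illustrates the split described in the post; the function names are made up, and it is not a claim about how any real core is wired.

```python
def add_128(a, b):
    """Add two 4-lane vectors (4 x 32-bit values) lane by lane."""
    assert len(a) == len(b) == 4
    return [x + y for x, y in zip(a, b)]


def add_256_split(a, b):
    """8-lane add done as two independent 4-lane halves."""
    assert len(a) == len(b) == 8
    low = add_128(a[:4], b[:4])    # lower 128 bits, handled by one unit
    high = add_128(a[4:], b[4:])   # upper 128 bits, handled by the other
    return low + high              # list concatenation: the full 256-bit result


result = add_256_split([1, 2, 3, 4, 5, 6, 7, 8], [10] * 8)
```

Because lane-wise operations never mix data between lanes, the two halves don't depend on each other, which is why they can be handled by separate pieces of hardware without changing the result.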

 

 

TechFan@tic posted this which should've been pretty much enough to end this pointless argument.

clickity click

Doesn't rule out that SB has 256bit FMUL (Multiply) and FADD (ADD) except the Fusion thing. What does bulldozer have? Shitty shared 128bit fmacs?

A 2600K is still twice as fast as an 8150 in floating point performance here: http://www.tomshardware.com/reviews/fx-8150-zambezi-bulldozer-990fx,3043-5.html

The performance gain from HT can be extremely beneficial. It lacks integer performance, but that's what Haswell fixed with AVX2 and FMA3 support.

Haswell rockets forward in integer code, outstripping Ivy Bridge by a full 78%

 

He's both... actually.

You like people who wetlick AMD, right?


This topic is now closed to further replies.

