
Skylake vs Zen vs Zen+, HT/SMT

porina

Apologies for the large images; if I made them smaller they might not be legible. The charts are best viewed at 100% on a screen 1080 pixels wide or more.

 

This is a test I've wanted to do for a while, and it is really time consuming, so I've only tested three examples. The goal is to look at how each architecture behaves, somewhat independently of clock speed and core count. I chose the three systems below and in all cases limited them to 2 cores at 3 GHz. The only thing I changed beyond that was turning HT/SMT on or off, so I could see its effect. Two cores was chosen as it is the practical minimum I can set on these CPUs. By reducing the cores and clocks, any limiting factors elsewhere in the system should matter less. While I haven't proven it is the case, ram bandwidth should be practically unlimited for the purposes of this testing, although latency may still have some impact.

 

Test systems:

  • 6700k, Asus Maximus IX Apex bios 1301
  • 2600, Asrock B450 gaming ITX/ac bios P1.20
  • 1800, Asus Prime X370-Pro bios 4012

Skylake should still be adequate to represent current Intel CPUs, as outside of some specific new instructions the architecture essentially hasn't changed. The two AMD CPUs represent Zen and Zen+; the latter is supposed to have had some optimisations applied, leading to reported small gains.

 

I would note the 2600 system offered me the choice of running the CCX cores as 1+1 or 2+0. I chose 1+1 as kinda more representative of the typically offered configuration, although I later saw that all cores on one CCX performed better for gaming. This testing wasn't for gaming though. The CCX configuration may be something to follow up on later.

 

Cooling shouldn't matter: at the reduced cores/clocks none of the systems were anywhere near throttling, and there are no variable turbo clocks to worry about. All systems had the same G.Skill TridentZ 3000C14 (B-die) 2x8GB ram fitted for dual channel operation. Ram was set via XMP (or whatever it is called on AMD boards), with the minor observation that the 1800 system would pick 2933 by itself, and I had to manually select 3000 to match the other systems. Timings were 14-14-14-34, 2T on the Intel system, 1T on the AMD systems. Operating system used was Windows 7.

 

Tests used:

  • 3DPM v2.1 - written by Ian Cutress, whose main job is as an editor at Anandtech. I have no idea what it does or represents, but it gives some interesting HT/SMT scaling numbers. I only use the subscores here.
  • Cinebench - because everyone loves it as a benchmark, regardless of whether they use whatever it represents. I used both R11.5 and R15.
  • Y-cruncher 0.76.9487 - I find this an interesting benchmark, as it is optimised to make use of CPU features in doing its Pi calculations. Sizes tested were 25m and 1b, which are the ones used by hwbot.
  • Prime95 29.4 build 8 - this is the current release version, and again is well optimised to use whatever performance a CPU has to offer. Tested at two FFT sizes, 64k and 2048k, with HT used or not used in software separately from the system setting.
  • Aida64 5.98.4800 - this has a whole bunch of tests, so why not run them too? Past experience suggests PhotoWorxx is a ram bandwidth intensive test; I don't know if my configuration here is enough to separate its impact from CPU performance itself.

Each test was run a minimum of 3 times, and the best result obtained is the one used. My thinking is: there may be things that slow a test down, making a run less than ideal, but there isn't anything that would make it run better than it can. By repeating the runs and choosing the best one, we should converge towards the best case.
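The "best of repeated runs" idea can be sketched as a tiny helper (the scores below are hypothetical, just to show the shape of it):

```python
# "Best of N" methodology: interference can only slow a run down, never
# speed it up, so taking the maximum over repeats converges toward the
# true best-case score. Scores here are made up for illustration.
def best_of(runs):
    """Return the best (highest) score from repeated benchmark runs."""
    if not runs:
        raise ValueError("need at least one run")
    return max(runs)

print(best_of([832, 829, 833]))  # -> 833
```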

 

[Chart: zenrelskl.png — Zen and Zen+ performance relative to Skylake, HT/SMT on and off]

This is a relative performance chart showing how Zen/Zen+ compare against Skylake, with and without HT/SMT, comparing like for like. Below 1.0 is worse than Skylake, exactly 1.0 is the same, and over 1.0 is faster than Skylake. It has been widely discussed that Zen in general has lower single thread performance, and the HT/SMT off results show it generally sitting slightly below 1.0. AMD's SMT is also said to be generally better than Intel's, and again, those results tend to be higher.
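The chart values are simple like-for-like ratios against the Skylake score at the same HT/SMT setting; a sketch with made-up numbers:

```python
def relative_to_skylake(score, skylake_score):
    """Performance ratio: below 1.0 is slower than Skylake, above is faster."""
    return score / skylake_score

# Hypothetical like-for-like scores at the same HT/SMT setting:
print(relative_to_skylake(95.0, 100.0))   # -> 0.95 (slightly behind Skylake)
print(relative_to_skylake(160.0, 100.0))  # -> 1.6 (an AES/Hash-style outlier)
```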

 

There are two groups that differ significantly. Y-cruncher, Prime95, and some of the Aida64 tests are down in the 0.6 ballpark. These tests probably feature AVX heavily, and Zen in general has only about half the AVX potential of a recent Intel core. Not all code will be AVX, so it won't necessarily be the full drop to 0.5. On the other side, AES and Hash are much higher than on Skylake. I have heard, but not verified, that AMD put in specific hardware to enhance performance in those areas.

 

For clarification, there are three states tested for Prime95:

  1. System HT/SMT off
  2. System HT/SMT on, not used in software (real cores only)
  3. System HT/SMT on, used in software

 

[Chart: htsmt.png — improvement from enabling HT/SMT, per test]

On to the main purpose of this test: this is the improvement from turning HT/SMT on compared to off. I used to think that on Intel CPUs the improvement ranged from 0 to 50%. That is not the case, as 3DPM BiPy goes slightly over that. These results do show that AMD's SMT generally gives a bigger boost than Intel's HT, with some exceptions.
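The scaling figure here is just the ratio of HT/SMT-on to HT/SMT-off scores, minus one; a sketch with hypothetical scores:

```python
def ht_gain(score_on, score_off):
    """Fractional improvement from enabling HT/SMT (0.5 means +50%)."""
    return score_on / score_off - 1.0

# Hypothetical scores: Intel HT near the classic +50% ceiling,
# AMD SMT a little above it.
print(f"{ht_gain(150.0, 100.0):+.0%}")  # -> +50%
print(f"{ht_gain(156.0, 100.0):+.0%}")  # -> +56%
```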

 

Prime95 may be considered a special case, especially at the smaller FFT size, where the overhead of running extra threads takes away performance; it needs a bigger FFT to split the work efficiently. I don't know why there would be a drop from SMT for AMD in that scenario.

 

[Chart: zpz.png — Zen+ performance relative to Zen]

And finally, this is a comparison of Zen+ against Zen, so again, above 1.0 means Zen+ is better. It helps in some areas, like the three Aida64 subtests where it gains 4-5%, and 2-3% in Cinebench. You might also see the two big spikes for Prime95. I would caution that, while it is a big improvement, the absolute performance means those scenarios are not ones likely to be used in practice; other configurations ran faster. It only means that configuration is less bad than it was before.

 

This is also a good example of why the 13%/16% numbers thrown around for Zen 2 should be taken with caution. Results can and do vary with the task at hand. Maybe there will be specific tasks that see that benefit, but we can't assume it will be universal.

 

-------- 8< --------

Update 16 Nov. 2018

 

I've now got a set of data for Skylake-X, which I'll concentrate on comparing with Skylake. The CPU used was a 7800X, reduced to 2 cores at 3.0 GHz, on an Asus X299 TUF Mark 2 mobo, bios 1503. I debated with myself about the ram situation: would it matter at all, given one of my goals was not to be dependent on system factors? Then again, for non-AVX512 workloads Skylake-X shouldn't differ from Skylake unless you count the L3 cache and ram differences. The ram I used for the other testing is only a dual channel pair, and it felt wrong not to use quad channel where supported, so I used a Corsair Vengeance RGB 3000C15 4x8GB kit, actually half of an 8x8GB kit. So, same nominal speed but slightly slacker timings. Operating system was Windows 10 1803 with October patches.

 

So let's get straight to it, is Skylake-X faster than Skylake when run in the same core configuration? Well...

 

[Chart: sx-s.png — Skylake-X performance relative to Skylake]

... it depends. For most things in this test it is within a few % of Skylake, and it is hard to tell if that is a significant difference or just measurement variation, even with my "best of 3" methodology. There are, however, several tests with significant differences, over 50% faster: y-cruncher, and some of the Aida64 FPU tests. For y-cruncher I can confidently say it is largely down to AVX-512 support, though ram bandwidth could have some impact also. I would assume similar for the Aida64 FPU tests. Aida64 PhotoWorxx I'm less clear about; past experience has shown it to be ram bandwidth sensitive, so that may play more of a factor here than CPU instruction support.

 

Not on this chart, but I also tried Prime95 29.5 build 3, a test version with AVX-512 support. I only ran this with HT off, as it seems to have a problem when HT is on. This resulted in a 76% throughput improvement at 64k FFT and 59% at 2048k FFT. A nice uplift, but somewhat short of the theoretical maximum of around 100%. I don't know enough to say why; in theory there should be enough ram bandwidth, unless the execution units are now so fast that even cache speed is limiting.
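Since AVX-512 doubles the vector width, the theoretical ceiling is a 2x speedup; the measured gains can be expressed as a fraction of that ceiling (a quick sanity-check calculation, nothing more):

```python
# AVX-512 doubles the vector width over AVX2, so the theoretical ceiling
# is a 2x speedup. Express the measured gains as a fraction of that.
def fraction_of_ceiling(gain_pct, ceiling=2.0):
    speedup = 1.0 + gain_pct / 100.0
    return speedup / ceiling

print(round(fraction_of_ceiling(76.0), 2))  # 64k FFT   -> 0.88
print(round(fraction_of_ceiling(59.0), 2))  # 2048k FFT -> 0.79
```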

 

On cache speed, I should note the system picked a speed of 2700, which was interesting to me. When I first got the system it defaulted to 2000, and I usually ran a manual overclock to 3000. The slow cache was blamed for Skylake-X's shortcomings early in its life, so it is nice to see the default speed has gone up with more recent bios updates.

 

[Chart: s-ht.png — HT improvement, Skylake-X alongside Skylake]

For completeness, here are the HT improvement results, with Skylake-X shown alongside Skylake for comparison. It is interesting that in most cases Skylake-X seems to lag behind Skylake, although in some it is at about parity or a little ahead.


TV Gaming system: Asus B560M-A, i7-11700k, Scythe Fuma 2, Corsair Vengeance Pro RGB 3200@2133 4x16GB, MSI 3070 Gaming Trio X, EVGA Supernova G2L 850W, Anidees Ai Crystal, Samsung 980 Pro 2TB, LG OLED55B9PLA 4k120 G-Sync Compatible
Streaming system: Asus X299 TUF mark 2, i9-7920X, Noctua D15, Corsair Vengeance LPX RGB 3000 8x8GB, Gigabyte 2070, Corsair HX1000i, GameMax Abyss, Samsung 970 Evo 500GB, Crucial BX500 1TB, BenQ XL2411 1080p144 + HP LP2475w 1200p60
Gaming laptop: Lenovo Legion, 5800H, DDR4 3200C22 2x8GB, RTX 3070, SK Hynix 512 GB + Crucial P1 TB SSD, 165 Hz IPS 1080p G-Sync Compatible


So Zen has a better SMT implementation than Skylake, right?

CPU: i7-2600K 4751MHz 1.44V (software) --> 1.47V at the back of the socket Motherboard: Asrock Z77 Extreme4 (BCLK: 103.3MHz) CPU Cooler: Noctua NH-D15 RAM: Adata XPG 2x8GB DDR3 (XMP: 2133MHz 10-11-11-30 CR2, custom: 2203MHz 10-11-10-26 CR1 tRFC:230 tREFI:14000) GPU: Asus GTX 1070 Dual (Super Jetstream vbios, +70(2025-2088MHz)/+400(8.8Gbps)) SSD: Samsung 840 Pro 256GB (main boot drive), Transcend SSD370 128GB PSU: Seasonic X-660 80+ Gold Case: Antec P110 Silent, 5 intakes 1 exhaust Monitor: AOC G2460PF 1080p 144Hz (150Hz max w/ DP, 121Hz max w/ HDMI) TN panel Keyboard: Logitech G610 Orion (Cherry MX Blue) with SteelSeries Apex M260 keycaps Mouse: BenQ Zowie FK1

 

Model: HP Omen 17 17-an110ca CPU: i7-8750H (0.125V core & cache, 50mV SA undervolt) GPU: GTX 1060 6GB Mobile (+80/+450, 1650MHz~1750MHz 0.78V~0.85V) RAM: 8+8GB DDR4-2400 18-17-17-39 2T Storage: 1TB HP EX920 PCIe x4 M.2 SSD + 1TB Seagate 7200RPM 2.5" HDD (ST1000LM049-2GH172), 128GB Toshiba PCIe x2 M.2 SSD (KBG30ZMV128G) gone cooking externally Monitor: 1080p 126Hz IPS G-sync

 

Desktop benching:

Cinebench R15 Single thread:168 Multi-thread: 833 

SuperPi (v1.5 from Techpowerup, PI value output) 16K: 0.100s 1M: 8.255s 32M: 7m 45.93s


25 minutes ago, Jurrunio said:

So Zen has a better SMT implementation than Skylake, right?

SMT is kinda a weird thing to compare that way. Depending on your definition (of what constitutes a logically/physically distinct core), Bulldozer had amazing SMT scaling; in some circumstances it offered up to 200% more performance. But the base core was so much weaker.

 

See, one aspect of SMT is just using the other logical thread to fill in the gaps in the pipeline. But if your CPU's decoding is already near perfect, there might not be many gaps to fill as is. (An analogous comparison is why DX12 gives a lower relative boost on Nvidia GPUs, even those with full async compute, simply because Nvidia has better load optimization.) In this fashion, it is almost certainly true that Intel has better pipeline optimizations that result in fewer unused cycles, and thus less for SMT to fill.

 

The other aspect is additional resources that the independent threads can use (à la Bulldozer sharing some compute units while having other independent ones). Here, more duplicated resources generally means more performance if the code can fill all the threads efficiently; but since every transistor has a power (and die area) cost, higher throughput in very parallel loads generally correlates inversely with throughput in more serial loads. The amount of extra processor resource is a deliberate decision by the manufacturer, and reflects their belief about the best tradeoff between different loading situations.

 

It isn't generally accurate to say that a larger performance delta between SMT on and off means "better", because it depends on the type of workload. For example, AVX gives a massive performance increase in anything that can use it. But AVX is also so incredibly power expensive that double loading it would not only not be worth it, it would probably be actively insane. Thus current implementations from both AMD and Intel ignore SMT for AVX loads (and always see negative scaling there, as a loss from load balancing).

 

TL;DR: it isn't necessarily accurate to claim it as better. One should merely say that for most workloads, SMT makes more of a difference on Zen than on Skylake.

 

 

Also, really interesting work @porina! I know it's tons of work, but I really would love to see the same data at 4 cores, since all four chips are optimized most for 4 core scaling anyway.

LINK-> Kurald Galain:  The Night Eternal 

Top 5820k, 980ti SLI Build in the World*

CPU: i7-5820k // GPU: SLI MSI 980ti Gaming 6G // Cooling: Full Custom WC //  Mobo: ASUS X99 Sabertooth // Ram: 32GB Crucial Ballistic Sport // Boot SSD: Samsung 850 EVO 500GB

Mass SSD: Crucial M500 960GB  // PSU: EVGA Supernova 850G2 // Case: Fractal Design Define S Windowed // OS: Windows 10 // Mouse: Razer Naga Chroma // Keyboard: Corsair k70 Cherry MX Reds

Headset: Senn RS185 // Monitor: ASUS PG348Q // Devices: Note 10+ - Surface Book 2 15"

LINK-> Ainulindale: Music of the Ainur 

Prosumer DYI FreeNAS

CPU: Xeon E3-1231v3  // Cooling: Noctua L9x65 //  Mobo: AsRock E3C224D2I // Ram: 16GB Kingston ECC DDR3-1333

HDDs: 4x HGST Deskstar NAS 3TB  // PSU: EVGA 650GQ // Case: Fractal Design Node 304 // OS: FreeNAS

 

 

 


40 minutes ago, Curufinwe_wins said:

Also, really interesting work @porina! I know it's tons of work, but I really would love to see the same data at 4 cores, since all four chips are optimized most for 4 core scaling anyway.

This testing wasn't meant to show the running performance of actual CPU models, as that's what reviewers already do. My aim was to eliminate or reduce other factors, so we get a possible look at the performance of the architecture itself. It can be assumed to scale until it hits a limit, but finding that limit is not my intention here.

 

For example, I know large Prime95 FFT sizes with quad core Intel CPUs and dual channel ram will be limited by ram bandwidth; going to 6 or 8 cores isn't going to help there. Aida64 PhotoWorxx also seems to like ram bandwidth, although I haven't looked at it in detail.

 

Possible further testing might include a similar run in the 2+0 core configuration on Zen+, as my Zen system did not offer me that option, so I'm not sure how it is running. I also want to do Skylake-X, but I'm wondering if it is worth it, as it would be more a test of the new cache structure than of the core itself.



33 minutes ago, porina said:

This testing wasn't meant to show the running performance of actual CPU models, as that's what reviewers already do. My aim was to eliminate or reduce other factors, so we get a possible look at the performance of the architecture itself. It can be assumed to scale until it hits a limit, but finding that limit is not my intention here.

 

For example, I know large Prime95 FFT sizes with quad core Intel CPUs and dual channel ram will be limited by ram bandwidth; going to 6 or 8 cores isn't going to help there. Aida64 PhotoWorxx also seems to like ram bandwidth, although I haven't looked at it in detail.

 

Possible further testing might include a similar run in the 2+0 core configuration on Zen+, as my Zen system did not offer me that option, so I'm not sure how it is running. I also want to do Skylake-X, but I'm wondering if it is worth it, as it would be more a test of the new cache structure than of the core itself.

I understand this, but you can't disable cache, and 4x the cache on a system is not an insignificant performance difference in many of these workloads, particularly when you start using SMT.

 

The issue with using only one CCX is the memory latency penalty when accessing ram on the wrong channel.

 

I know you have limited hardware options, but if you could find one of the Ryzen APUs, that would probably be a much more similar comparison, which would let you scale down even more, and scale up as well.



7 hours ago, Curufinwe_wins said:

I understand this, but you can't disable cache, and 4x the cache on a system is not an insignificant performance difference in many of these workloads, particularly when you start using SMT.

Why would I want to disable cache? I want to look at the cores, so the more cache in proportion, the less likely it is to hold back the cores. This is partly why I'm questioning the point of testing Skylake-X: I believe the cores are the same even if the cache is different, so does the cache make enough difference?

 

7 hours ago, Curufinwe_wins said:

The issue with using only one CCX is the memory latency penalty when accessing ram on the wrong channel.

I tried to look this up previously, but was unable to confirm whether each CCX has a memory channel directly attached. From what I could find, it looked like the two CCXes are connected over Infinity Fabric to the memory controller, so it wouldn't matter. If you know different, I'd appreciate a reference.

 

7 hours ago, Curufinwe_wins said:

I know you have limited hardware options, but if you could find one of the Ryzen APUs, that would probably be a much more similar comparison, which would let you scale down even more, and scale up as well.

That wouldn't add value to my goals. If anything, to expand on this testing I'd like to include more tests, specifically Blender, as that seemed rather HT/SMT friendly, at least in the Ryzen demo. Software may have changed since then, so it may not necessarily still apply; something to test later.



6 minutes ago, porina said:

Why would I want to disable cache? I want to look at the cores, so the more cache in proportion, the less likely it is to hold back the cores. This is partly why I'm questioning the point of testing Skylake-X: I believe the cores are the same even if the cache is different, so does the cache make enough difference?

 

I tried to look this up previously, but was unable to confirm whether each CCX has a memory channel directly attached. From what I could find, it looked like the two CCXes are connected over Infinity Fabric to the memory controller, so it wouldn't matter. If you know different, I'd appreciate a reference.

Infinite cache would theoretically mean the computer does no work except drawing from the cache, and thus performance becomes limited only by the access latency of the cache.

 

Of course we aren't anywhere close to infinite cache, but increasing cache isn't decreasing the performance difference that results from cache hits/misses. It's increasing it, as larger and larger proportions of instruction results (though more commonly loads/stores) are drawn from cache instead of recalculated.

 

I don't know of many modern examples of this being tested, but back in the P2/P3 era there were a lot of such tests, and doubling L2 cache (with the same L1) increased performance by about 5% on average.

 

Each CCX has direct wired access to one memory channel, which means that whenever you try to access memory on the other channel, it has to go through the fabric and hit the other CCX (this incurs the CCX latency penalty). With a single ring bus architecture, latency from one side to the other isn't nothing, but it isn't high either. With the Skylake-X mesh, however, we see notable differences between near-side and far-side core latencies, and it wouldn't make sense to suggest that the cache latency difference could be smaller than the ram difference. With ram minimum latencies on the order of 50-80 ns and the CCX penalty being around 80 ns, it should make a difference (basically doubling the ram latency at worst).
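The "basically doubling at worst" claim follows from simple addition of the figures quoted above (a sketch; these are the rough numbers from the discussion, not measurements):

```python
# Worst-case latency for a core accessing ram attached to the other CCX:
# local DRAM latency plus the cross-CCX fabric penalty. The figures are
# the approximate ones quoted in the discussion, not measured values.
DRAM_NS = (50.0, 80.0)   # rough min DRAM latency range
CCX_PENALTY_NS = 80.0    # approximate cross-CCX hop

for dram in DRAM_NS:
    worst = dram + CCX_PENALTY_NS
    print(f"{dram:.0f} ns local -> {worst:.0f} ns cross-CCX "
          f"({worst / dram:.1f}x)")
```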

 

Though again: faster ram helps, and a bigger cache helps a LOT to alleviate that issue if the program can fit in cache instead.



14 minutes ago, Curufinwe_wins said:

Each CCX has direct wired access to one memory channel

Got a reference showing that?



6 hours ago, porina said:

Got a reference showing that?

I'm fairly sure that's not correct. From the die shots, diagrams and labeling I have seen, the memory controller is its own part on the die, like the CCXes are. Each die has a dual channel memory controller, and the CCXes access it to get to memory, so each core should have dual channel memory bandwidth.

 

The Anandtech article on EPYC as far as I can tell confirms this.

[Image: EPYC memory bandwidth chart from the Anandtech article]

 

Single thread memory bandwidth is a little shy of two threads on different cores, and well above what a single channel can do.

 

P.S. Thanks for leading me here from that news thread!


4 weeks later...

Since I had the day off, I thought I'd have some fun and do the Skylake-X testing. The results were... interesting. I've edited the new data into the OP.

 

Do I go back and do Haswell and Broadwell next?


