Reputation from Mysticial - Linus Tech Tips

Mysticial got a reaction from vanished in Prime95 29.5 now supports AVX-512 (Windows ver available) October 24, 2018

In the post-Kaby Lake era, I'd argue that it's pointless to say that AVX and AVX512 are "more stressful" than normal code.

The reason I say that is simply that AVX(512) code isn't supposed to run at the same speed as normal code anyway. For the same reason, nobody runs their GPU at the same speed as the CPU. AVX and AVX512 fall somewhere in between.

To oversimplify a bit, a Skylake X chip has 5 different sets of silicon that run at different speeds:
1. Main core
2. AVX units
3. AVX512 units
4. Uncore (L3/mesh)
5. DRAM

Most people are familiar with 1, 4, and 5. But very few people are aware of 2 and 3.

Most of the people who get "burned" (pun intended) by AVX or AVX512 either don't know that 2 and 3 exist, and are therefore likely to set them wrong, or they are aware of them but don't realize their importance and thus ignore them. And from what I've been reading online, it seems to be a good mix of both.

Back in the Haswell/Skylake (non-X) days, 3 didn't exist and 1 was tied to 2. So you were forced to drop down to the lowest common denominator. This is what gave AVX its bad rep among the OC community in the first place. But this bad rep seems to have stuck even after Intel fixed AVX by separating 1 and 2 into different classes (which it really should've done from the very beginning in Sandy Bridge).
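The difference between tied and separate clock domains can be sketched with a bit of arithmetic. The ratios below are made-up illustration values, not measured limits for any real chip, and the function names are hypothetical.

```python
# Hypothetical stable multipliers for one chip (assumed 100 MHz base clock).
# These numbers are illustrative assumptions, not measurements.
BCLK_MHZ = 100
STABLE_RATIO = {"core": 47, "avx": 44, "avx512": 40}


def tied_clock(core_ratio, avx_ratio):
    """Core tied to AVX: one ratio for everything, so the weaker one wins."""
    return min(core_ratio, avx_ratio) * BCLK_MHZ


def separate_clocks(stable):
    """Separate classes: each kind of code runs at its own stable ratio."""
    return {domain: ratio * BCLK_MHZ for domain, ratio in stable.items()}


tied = tied_clock(STABLE_RATIO["core"], STABLE_RATIO["avx"])
print(f"tied (core = AVX): everything runs at {tied} MHz")

for domain, mhz in separate_clocks(STABLE_RATIO).items():
    print(f"separate: {domain:7s} runs at {mhz} MHz")
```

With the tied setup, normal code loses clock speed just to keep AVX stable. With separate classes, each one can sit at its own limit, but there are now more settings to get wrong.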

The problem now is that the chip has become so complicated that it's increasingly difficult to stabilize it against all workloads. So it's easy to screw up. And each time someone crashes on AVX or AVX512, it just adds fuel to the FUD that "AVX is bad".

So I'm interested to see what the rest of the OC community does with the new Prime95 with AVX512.

Mysticial got a reaction from porina in Prime95 29.5 now supports AVX-512 (Windows ver available) October 23, 2018


Mysticial got a reaction from porina in Intel HT vs AMD SMT scaling July 12, 2018

Ah, so you're mackerel from mersenneforum!

Not sure if you're a programmer, but if you want to artificially construct a benchmark that would benefit as much as possible from HT, try something that iterates over a very large linked-list such that every hop is a cache miss.
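A minimal sketch of the pointer-chasing idea, assuming the "linked list" is stored as an array of next-indices shuffled into one big random cycle. The function names and sizes are hypothetical. Python's interpreter overhead hides most of the cache effect, so treat this as an outline of the access pattern. A real benchmark would do the same thing in C or C++ with a table far larger than the last-level cache, and run one chain per hardware thread.

```python
import random
import time
from array import array


def build_random_cycle(n, seed=0):
    """Return next_index so that following it visits all n slots in one cycle.

    Uses Sattolo's algorithm, which produces a permutation with a single cycle.
    """
    rng = random.Random(seed)
    next_index = array("q", range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # 0 <= j < i, so an element never stays in place
        next_index[i], next_index[j] = next_index[j], next_index[i]
    return next_index


def chase(next_index, hops, start=0):
    """Follow the chain. Each load depends on the result of the previous one."""
    pos = start
    for _ in range(hops):
        pos = next_index[pos]
    return pos


n = 1 << 16  # tiny demo size; a real test needs a table much larger than cache
table = build_random_cycle(n)
t0 = time.perf_counter()
end = chase(table, n)
elapsed = time.perf_counter() - t0
print(f"{n} dependent hops in {elapsed:.4f} s, ended at slot {end}")
```

Because every hop depends on the previous one, a single thread cannot overlap the misses. That idle time is what a second hardware thread on the same core could fill.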

I don't know of any particular real-life application that does this. But I can't imagine it being too uncommon. Something like this could possibly get 4x speed-up on Knights Landing with 4-way SMT.

For y-cruncher, even AVX2 is somewhat memory-bound if you have enough cores. So you may have to drop all the way down to SSE4. (the "08-NHM" binary). But you'll have to experiment to see for sure.

I'll note that y-cruncher's memory-bandwidth usage is very different from Prime95.
Prime95's is very smooth and steady. So it's either 0% memory-bound or 100% memory-bound. In y-cruncher, the usage is bursty - so there's more of a distribution. Memory and CPU speeds will always have an effect and neither can completely bottleneck the other even at the extremes like scalar code on one end and AVX512 on the other.

Mysticial got a reaction from porina in Do Dual CPU Sockets Matter in 2018? June 26, 2018

Correct. The 25m is just a, "Does the benchmark work? And can you submit?"
1b and 10b are the "real" benchmarks. But even 1b is becoming too small as those are going under 20 seconds on the bigger Skylake Server machines.

*waves*

It's on the order of 45GB, but will vary a bit depending on which binary you're running and how many cores there are.

Another reason why large computations scale better is that they use a lot more memory. To oversimplify things a bit, because there is so much more data, the probability that the two sockets "fight" over the same data at any given time is much smaller.

So Linus' explanation for why y-cruncher scales poorly on NUMA is correct. But it's not the dominant factor for a computation that takes only 1 second. Recent versions of the program (>= v0.7.3) are NUMA-aware and will (theoretically) scale onto 2 or 4 sockets - if the computation is large enough. (It does on my 4-socket Barcelona Opteron.)

It'll be far from perfect linear-scaling, but there should still be a noticeable speedup. But once you go beyond that where there are multiple physical motherboards, then it all goes downhill. I've had someone try this on a 32-socket/8-motherboard Broadwell system with 576 cores/1152 hyperthreads. It was hilariously bad.

Yeah, the AVX512 complicates things a lot more.
1. It destabilizes everyone's overclocks since nobody stress-tests for it.
2. It's why AMD is getting killed so badly in this benchmark.
3. It makes Skylake X so fast that the bottleneck is memory bandwidth.

This last reason is why a lot of the reviews show little difference between the 7960X and the 7980XE, with little to gain from a CPU overclock. Unless the memory is running at like 4500 MT/s or something, the cores are just gonna be sitting there waiting on memory for much of the computation.
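The memory-bandwidth point can be illustrated with a simple roofline-style estimate: throughput is capped by whichever is lower, the compute peak or what the memory can feed. This is a generic model, not how y-cruncher measures anything, and every number below is a made-up assumption rather than a spec of the 7960X or 7980XE.

```python
def attainable_rate(peak_ops_per_s, bandwidth_bytes_per_s, ops_per_byte):
    """Roofline-style cap: the lower of the compute limit and the memory limit."""
    return min(peak_ops_per_s, bandwidth_bytes_per_s * ops_per_byte)


# Hypothetical chips: same memory system, different compute peaks.
SMALLER_CHIP_PEAK = 1600e9  # ops/s, assumed
BIGGER_CHIP_PEAK = 1800e9   # ops/s, assumed
OPS_PER_BYTE = 10           # assumed work done per byte fetched from DRAM

for bandwidth in (80e9, 120e9, 200e9):  # bytes/s, assumed
    small = attainable_rate(SMALLER_CHIP_PEAK, bandwidth, OPS_PER_BYTE)
    big = attainable_rate(BIGGER_CHIP_PEAK, bandwidth, OPS_PER_BYTE)
    print(f"bandwidth {bandwidth / 1e9:5.0f} GB/s: "
          f"smaller {small / 1e9:6.0f} Gops/s, bigger {big / 1e9:6.0f} Gops/s")
```

In this toy model the two chips tie until the memory is fast enough to lift the cap above the smaller chip's compute peak. Only then do the extra cores (or a core overclock) start to show.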

Mysticial got a reaction from Mira Yurizaki in Do Dual CPU Sockets Matter in 2018? June 26, 2018

Author of y-cruncher here. Been a long-time viewer of LTT videos!

The reason why the scaling from the 18-core to the dual-platinum is so bad in the multi-threaded test is probably because the computation is too small.

A 1 second computation has too little work to be effectively parallelized. On top of that, the overhead of spinning up and synchronizing that many threads is significant.

Try a computation of 1 billion or 10 billion digits and you should see a larger difference. This applies to most of the other hardware reviews as well.

I'd say any computation that takes less than 30 seconds is too small to fully utilize the system - regardless of the # of cores in the system.
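A toy model of why small computations scale so badly: assume the work divides evenly across threads, and that spinning up and synchronizing threads costs a fixed amount per thread. Both the linear overhead model and the 5 ms figure are assumptions picked for illustration, not measurements from y-cruncher.

```python
OVERHEAD_PER_THREAD_S = 0.005  # assumed spin-up + sync cost per thread


def parallel_time(single_thread_seconds, threads):
    """Ideal split of the work plus an overhead that grows with thread count."""
    return single_thread_seconds / threads + OVERHEAD_PER_THREAD_S * threads


def speedup(single_thread_seconds, threads):
    """Speedup over running the same work on one thread with no overhead."""
    return single_thread_seconds / parallel_time(single_thread_seconds, threads)


for work in (20.0, 2000.0):    # single-threaded seconds of work, assumed
    for threads in (36, 112):  # thread counts chosen for illustration
        print(f"work {work:6.0f} s on {threads:3d} threads: "
              f"{parallel_time(work, threads):7.2f} s wall, "
              f"speedup {speedup(work, threads):6.1f}x")
```

In this model the small job gains nothing from the extra threads because the overhead eats the difference, while the large job still scales close to the thread count.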

Also, the y-cruncher benchmark is memory-bound on Skylake X. So bandwidth is a big deal. I'd expect the Platinums to benefit from having 3x the memory channels.

Mysticial got a reaction from porina in Do Dual CPU Sockets Matter in 2018? June 26, 2018


Mysticial got a reaction from scottyseng in Do Dual CPU Sockets Matter in 2018? June 26, 2018


Mysticial got a reaction from AlTech in Do Dual CPU Sockets Matter in 2018? June 25, 2018


Mysticial got a reaction from Hunter259 in Do Dual CPU Sockets Matter in 2018? June 25, 2018

