Intel CPU ram bandwidth thoughts

porina · March 20, 2016

A while back I was looking at the impact ram bandwidth has on Prime95-like applications. To do that, I used the Prime95 built-in benchmark. I participate in various prime number finding projects, and other software I use, such as LLR and PFGW, uses the same underlying number crunching library as Prime95, so it would be expected to behave similarly. It has long been known that this is a memory bandwidth intensive task once sizes get beyond a point, but I'm not aware of much work to quantify it. People just knew that by the time you get to a quad core, you get relatively poor scaling compared to running 1 or 2 tasks, and sometimes running 4 tasks gives no better throughput than running 3.

The results could be summarised in the following chart:

This chart might take some explaining... the legend shows the tested CPU and clock, ram speed, and if the ram was single or dual rank. All were running in dual channel ram mode as single channel is not worth thinking about here.

You might have seen when running a Prime95 stress test, it uses various size FFT tasks. The FFT size is related to the number being tested. Bigger number = bigger FFT. What we need to know here is that the FFT size multiplied by 8 gives the amount of ram that task uses. Multiply again by the number of tasks run for the total. This assumes one task per core. Prime95 supports multiple cores on a single task, but LLR and PFGW don't, so I will only be looking at the one task per core case. For the i3+i7 CPUs, I've turned off hyper-threading as it doesn't help in this case. As FFT sizes increase, the workload increases also, so this chart has been normalised for size so that a constant computing performance would be a horizontal line.
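As a rough sketch of the arithmetic above: the only rule taken from the post is "FFT size x 8 = ram used per task, times the number of tasks". The function names are made up for illustration, and I'm assuming FFT sizes are quoted in K with 1K = 1024.

```python
# Rule from the post: FFT size multiplied by 8 gives the ram a task uses.
BYTES_PER_FFT_ELEMENT = 8


def task_footprint_mb(fft_size_k: float) -> float:
    """Ram used by one task in MB. fft_size_k is the FFT size in K (assumed 1K = 1024)."""
    return fft_size_k * 1024 * BYTES_PER_FFT_ELEMENT / (1024 * 1024)


def total_footprint_mb(fft_size_k: float, tasks: int) -> float:
    """Total ram used when running one task per core on `tasks` cores."""
    return task_footprint_mb(fft_size_k) * tasks


# Example sizes chosen for illustration only.
for fft_k in (128, 1024, 4096):
    per_task = task_footprint_mb(fft_k)
    total = total_footprint_mb(fft_k, tasks=4)
    print(f"{fft_k}K FFT: {per_task:.1f} MB per task, {total:.1f} MB for 4 tasks")
```

So a 1024K FFT needs 8MB per task, which is already well beyond the per-core L3 on these CPUs.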

Broadwell aside, i5 CPUs have 1.5MB L3 cache per core, and non-extreme i7 CPUs get 2MB/core. We can see this in the chart. To the left, below around 1.5MB/core the tasks are small enough to sit inside the L3 cache, and you get good scaling with CPU clock. Skylake enjoys about 14% IPC advantage over Haswell, and Broadwell seems to actually drop about 6% relative to Haswell. I don't have an explanation for that. To the right of the chart the tasks are too big to fit in the L3 cache, and it is ram bandwidth limited. The relative performance on the right is mostly dominated by the relative available ram bandwidth compared to the CPU demand. The two will be related, so faster CPU will demand more ram bandwidth.

Basically for this type of application, you should aim for as much ram bandwidth as you can. Two results stand out though. The i3 result is relatively flat. It has dual channel ram like the rest, but it only needs to feed two cores not 4. So the performance there remains strong with bigger task sizes.

Broadwell is the other interesting one. Any Broadwell users here at all? This has 128MB of eDRAM acting as L4 cache. This is stated as having 50GB/s bandwidth, which is comparable to dual channel 3200 ram. Note how it is fairly flat up to 32MB/core where it starts dropping in performance. 4x32MB = 128MB, the size of the L4 cache. Above that we're hitting the ram again. I do hope to see big L4 caches in more desktop processors. Possibly because I didn't aggressively overclock it yet, I don't seem to see a difference between operating out of L3 and L4 cache. I did try overclocking the eDRAM from default 1800 to 2200, which made no significant difference in benchmarks. Then again, it looks like this wasn't limiting anyway. Sure enough the performance is comparable to the quad core Skylake with dual channel 3200 ram.
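To put numbers on the two comparisons above, here is a small sketch. It assumes the standard 64-bit (8 byte) wide DDR channel and computes theoretical peak figures only; real measured bandwidth will be lower. The function names are hypothetical.

```python
def ddr_peak_bandwidth_gbs(transfer_rate_mts: int, channels: int = 2,
                           bus_width_bits: int = 64) -> float:
    """Theoretical peak bandwidth in GB/s for a given transfer rate (MT/s)."""
    bytes_per_transfer = bus_width_bits // 8
    return transfer_rate_mts * bytes_per_transfer * channels / 1000


def cache_share_per_core_mb(cache_size_mb: float, tasks: int) -> float:
    """Per-task share of a shared cache when running one task per core."""
    return cache_size_mb / tasks


# Dual channel 3200 ram versus the quoted 50GB/s for the eDRAM.
print(ddr_peak_bandwidth_gbs(3200))      # 51.2
# 128MB of L4 shared by 4 tasks.
print(cache_share_per_core_mb(128, 4))   # 32.0
```

The 32MB per core figure matches where the Broadwell line starts to drop in the chart.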

The more observant might have noticed I haven't mentioned timings at all. This made a tiny difference but the effect is really minor compared to bandwidth and rank.

Congratulations to anyone still reading at this point. Now for some questions!

In this testing I found a massive performance difference between running ram that is dual rank (better) and ram that is single rank (not so good). Testing people have done elsewhere shows a difference of single digit % magnitude in AIDA ram bandwidth tests, but in this application it is more like 20% ball park, although that might depend on the balance between CPU and ram. I've tried reading around but still don't understand how rank affects ram performance. All I know is, dual rank helps a lot here. Anyone have a good description of how rank affects performance, particularly bandwidth?

Anyone know how Broadwell connects to the L4 cache? I think it might be asynchronous to the eDRAM as the clock doesn't relate nicely to the quoted bandwidth. So when overclocking the eDRAM, does it also affect the connection interface, or is that set elsewhere?

porina · September 19, 2016

9 hours ago, Curufinwe_wins said:

I saw it was clock normalized, but you have -S and -T sku's in there, and they have substantial thermal/power constraints that I don't remember how they work with turbo boost (for example the -U cpu's consume WAY over 15W during the 30 turbo and the HQ line tends to drop about 600 MHz or more once the turbo timer expires.)

They're TDP lowered compared to standard desktop parts, mainly due to lower clocks and presumably lower voltages that come with it. They do not have any particular or special throttling. I can confirm in my normal use, which is 24/7 compute tasks, they do not throttle.

Quote

Even the 6600k with the 2R was showing huge regressions from the 6700k at higher FFT sizes, which was disconcerting. Taking a look at disabling HT might be a good idea for that one.

In fact I note that all of the large FFT regressions were non-HT processors (with the Broadwell-C not really showing it, but the huge L4 difference makes comparing them to the others kinda difficult).

HT was disabled on the CPUs that support it, as for the tasks I run it does not provide any benefit, and can cause additional problems.

I'm still not sure what you are referring to when you talk about regressions. In short, the chart can be viewed as two areas. On the left, say below 1MB per core, we're operating out of the CPU caches and are for practical purposes not limited by ram. Beyond 2MB/core the tasks no longer fit in cache, and we see the memory subsystem start to dominate. The transition between them can vary depending on the amount of cache on the CPU. Broadwell-C with its 128MB L4 cache means you can go up to 32MB/core before you see a drop in performance.
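A minimal sketch of how I read the chart, for anyone who prefers it spelled out. The 1MB and 2MB/core thresholds are the approximate ones quoted above, not exact boundaries, and the function name is made up.

```python
def memory_regime(per_core_mb: float, cache_limit_mb: float = 1.0,
                  ram_limit_mb: float = 2.0) -> str:
    """Classify a task's per-core footprint into the regions of the chart.

    The default thresholds are the rough figures from the post and will
    shift with the amount of cache on the CPU.
    """
    if per_core_mb < cache_limit_mb:
        return "cache"       # fits in cache, scales with CPU clock
    if per_core_mb > ram_limit_mb:
        return "ram"         # memory subsystem dominates
    return "transition"


print(memory_regime(0.5))            # cache
print(memory_regime(8))              # ram
# Broadwell-C: the 128MB L4 pushes the limit out to about 32MB/core.
print(memory_regime(16, 32, 32))     # cache
```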

Quote

2x increase in channels, 50% increase in core count. Seems pretty reasonable to me. I haven't seen any application yet to show benefits of two dimms per channel (of same total capacity) on HW-E and in fact it seems outside of the higher end xeons it doesn't seem like they touch the memory bandwidth available to them.

To cut and paste myself from elsewhere:

Test system:
CPU: i7-6700k at 4.2 GHz, ring at 4.1 GHz, HT off
Mobo: MSI Gaming Pro, bios 1.7
GPU: 9500 GT (just to make sure no ram bandwidth is stolen by integrated graphics)
RAM: for the results presented I use two types
G.Skill F4-3333C16-4GRRD Ripjaws 4, 4x4GB kit
G.Skill F4-3200C16-8GVK Ripjaws V, 2x8GB kit

Testing was performed using Prime95 28.7 built in benchmark in Windows 7 64-bit. Each setting was run once, after the PC had been given time to settle down after rebooting. All test configurations had the ram in dual channel mode. Timing values listed are ordered CAS-RCD-RP-RAS as commonly shown in most software.

(note: multiply FFT size by 8 to get the amount of ram required to hold it. This chart starts at 8MB and goes up to 64MB for comparison to the previous chart. I did not normalise for increasing FFT size, hence the drop with bigger sizes here)

And finally, this is the cause of some unexpected behaviour I saw. I had two comparable systems, but I saw a massive performance difference between them which I struggled to explain. I tried various things and even wrongly blamed the mobo for being rubbish, but it would seem module rank has a major influence. This isn't so commonly discussed or even specified. I found Thaiphoon Burner, free software that can read this. The Ripjaws 4 modules are single rank, and the Ripjaws V modules are dual rank (caution: other parts in the series may vary!). General consensus seems to be that having more ranks can slightly increase bandwidth, at the cost of slightly higher latency.

This chart is going to take some explaining. The chart again shows the 4 worker throughput. The grey line is the Ripjaws V kit, and the light blue line is the Ripjaws 4 kit with 4 modules fitted, both at 3200. So each memory channel has a total of 2 ranks, and performance is so close you can't see the light blue line under the grey line! So far so good? Let's take two of the Ripjaws 4 modules out, leaving it running in dual channel mode. Logically, this shouldn't make a difference. It is still 2 channels, running at the same clock and timings. Nope. We see a 19% drop in performance (orange line). This is massive! How massive? The yellow and blue lines are 4 modules running at 2666 and 2400 respectively, and they go neatly either side of the orange line. That is quite a performance drop!
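The rank counting above can be written out as a quick sketch. It assumes modules are spread evenly across the two channels. The second function makes a stronger assumption, clearly not measured here: that throughput in the ram limited region scales linearly with ram clock. Both function names are hypothetical.

```python
def ranks_per_channel(modules: int, ranks_per_module: int, channels: int = 2) -> float:
    """Ranks on each channel, assuming modules are split evenly across channels."""
    return modules * ranks_per_module / channels


def expected_drop_vs(base_mts: int, slower_mts: int) -> float:
    """Fractional throughput drop IF throughput scaled linearly with ram clock."""
    return 1 - slower_mts / base_mts


configs = {
    "Ripjaws V, 2 modules (dual rank)": (2, 2),
    "Ripjaws 4, 4 modules (single rank)": (4, 1),
    "Ripjaws 4, 2 modules (single rank)": (2, 1),
}
for name, (modules, ranks) in configs.items():
    print(f"{name}: {ranks_per_channel(modules, ranks):.0f} rank(s) per channel")

for mts in (2666, 2400):
    print(f"3200 -> {mts}: {expected_drop_vs(3200, mts):.1%} drop under linear scaling")
```

Under that linear assumption, 2666 would be about a 17% drop and 2400 a 25% drop, so the measured 19% from losing a rank per channel sits between them, which is consistent with the orange line falling between the yellow and blue ones.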

The tentative conclusion is that it is worth having the higher rank modules, or running more modules to get the same number of ranks per channel; otherwise you will reduce your potential significantly. Unfortunately it doesn't seem that easy to find out what rank a module is before buying it.