SIMD in Context: The Bandwidth Problem Part I
In the previous entry, I showed how AVX could produce a whopping 10x performance improvement for one specific workload in a game engine, along with a mathematical proof of correctness for the algorithm. However, I did not show how that solution compares to accelerating the task by multithreading the scalar code, and I only briefly mentioned why the SIMD code would run into memory bandwidth limitations; I never actually fleshed either issue out. This entry starts on the latter. Once again, critique and questions are welcome.
As a reminder, I am limiting myself to processors with the AVX1 instruction set available. AVX2 brought the same 256-bit vector functionality to integers that AVX1 provided for floats, but it also brought with it more than 80 new instructions (a near-even count across both data types, plus some random bit generation and character swizzling). AVX2 is also native only to Haswell and later for Intel, and Excavator and later for AMD. AVX1 is available from Sandy Bridge onward, Bulldozer onward, and Jaguar onward, an important distinction because current console hardware is built on Jaguar.
On the subject of the limits of SIMD scaling in any system: if you don't have the memory bandwidth to feed your operation, you will be leaving cycles on the table unless you have very specialized, speculative code and a very good out-of-order engine to back it.
For Sandy Bridge, the standard memory configuration would be dual-channel DDR3-1600. At 1600 MT/s over a 64-bit bus, peak bandwidth per channel is 12,800 MB/s, or 12.8GB/s. Thus, we can assume 25.6GB/s is a hard limit, and we should really cut that down by 5% to account for the noise of bank switching. But this does not tell the whole story. The situation is actually much worse when we consider that a SIMD workload works best when it can just zip down the cache lines and stream continuously through memory. The prefetchers even on modern CPUs are not tuned to jump from channel to channel to string together, in parallel, data that could sit in the cache linearly. The best we can assume with regard to my previous post is 12.8GB/s of bandwidth for our task, barring an enormous effort to lay data out in chunks throughout memory that are cache friendly without being contiguous. If you want to do the math and mapping of memory representations for 4- and 8-way set-associative caches, be my guest. And if you're that good, there are companies that will pay you a hell of a lot more to do that than the games industry will, without demanding nearly such grueling working hours.
So, we have 12.8 * 0.95 = 12.16GB/s in bandwidth to work with for any single mesh we wish to manipulate if we're constantly streaming from memory.
An AVX instruction can process 8 single-precision floats at once, and multiple copies of an instruction can be in flight simultaneously, even if results can only be committed by one instruction per clock.
A Sandy Bridge 2600K at stock clocks boosts to 3.8GHz.
Even on one core, assuming we get 1 op per clock committed, we are looking at a total throughput of:
(256 bits / 32 bits per float) * 3.8*10^9 clocks/s = 30.4*10^9 floating-point operations per second. That number is already 2.5x our usable single-channel bandwidth figure, but it's even worse than that. Every float is 4 bytes. This means the bandwidth requirement in bytes per second to keep this fed must equal the byte count per datum (4), divided by the number of compute ops performed on that datum for the given task (1), multiplied by the total op count per second (30.4*10^9). We need at least (4/1)*(30.4*10^9), or 121.6GB/s of bandwidth, out of a single channel, solely to get the data to the CPU optimally. If you have quad-channel DDR4-4266, that's just over what is needed to feed this beast!
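To make that ratio concrete, here is a minimal sketch of the kind of one-op-per-element kernel under discussion (hypothetical names; this is not the previous entry's exact code). Each iteration performs 8 adds while pulling 32 bytes in and pushing 32 bytes out, so compute per byte is as thin as it gets:

```c
#include <immintrin.h>
#include <stddef.h>

/* Translate a packed array of vertex components by a constant offset.
 * A sketch only: assumes count is a multiple of 8 and positions is
 * 32-byte aligned, with no remainder handling. */
static void translate_x(float *positions, size_t count, float dx)
{
    const __m256 offset = _mm256_set1_ps(dx);
    for (size_t i = 0; i < count; i += 8) {
        __m256 v = _mm256_load_ps(positions + i); /* 32 bytes in       */
        v = _mm256_add_ps(v, offset);             /* 1 op per float    */
        _mm256_store_ps(positions + i, v);        /* 32 bytes back out */
    }
}
```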
And we're still not done getting kicked while we're down. Data and instructions that sit in cache too long without use can be kicked out (limited space, lots of data to churn, power efficiency demands, memory location associativity constraints...). If a cache line is old enough and a core needs to fill it with data from another memory location, the old cache line is evicted, and a write may or may not occur at the time of eviction (write buffering). Every cache line which has been modified, when evicted from L2 on AMD Jaguar, AMD APUs and Athlons, or from L3 on Intel or AMD FX, has to be written back to RAM. In short, our bandwidth requirement just doubled to a whopping 243.2GB/s to keep the pipeline fed in one direction and draining freely in the other. That is more memory bandwidth than an uncut high-end GPU from 2010 had, required to feed 1 core of a mainstream CPU from 2011 at stock clocks. 1 CPU core needs all of this just to handle vertex translation using AVX. Notice also that this is exactly 8x our FLOP/s figure: 4 bytes read plus 4 bytes written back per float op means it takes exactly 8x the bandwidth in bytes/second to optimally feed our throughput in floating-point ops/second. Clearly this solution cannot scale up or out.
If you think this requirement doesn't apply because meshes are only a megabyte or two in size each (300,000 vertices/triangles for a single figure is quite a lot), consider that the rate of demand does not change for small and medium-sized data, and you can't dynamically change your RAM bandwidth mid-stream to fetch that smaller amount of data at a rate faster than 12.16GB/s. You also have multiple meshes to stream through every frame. A megabyte each with 32 enemies = 32MB, or 8 million floats, to process with roughly 5% of the bandwidth needed to keep this task fed optimally.
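As a back-of-envelope check on where those percentages come from, here is the arithmetic using only the figures derived above (a sanity-check sketch, nothing more):

```c
#include <stdio.h>

int main(void)
{
    const double bytes      = 32e6 * 2; /* 32MB read + 32MB written back */
    const double bandwidth  = 12.16e9;  /* usable single-channel B/s     */
    const double flops      = 8e6;      /* one op each for 8M floats     */
    const double throughput = 30.4e9;   /* peak AVX FLOP/s on one core   */

    double stream_time  = bytes / bandwidth;  /* ~5.26 ms */
    double compute_time = flops / throughput; /* ~0.26 ms */
    printf("streaming: %.2f ms, compute: %.2f ms, stalled: %.0f%%\n",
           stream_time * 1e3, compute_time * 1e3,
           100.0 * (1.0 - compute_time / stream_time));
    return 0;
}
```

The core finishes its share of the math in roughly a quarter of a millisecond and then sits idle for the remaining ~95% of the streaming time.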
For just 32MB of data, the ~95% time overhead required to run this one small task may not seem like much in raw time, but what if we want to optimize everything in a game engine with SIMD where appropriate? Where does the bandwidth come from to optimize everything else using SIMD instructions where possible? I haven't even brought up the fact that instructions have to be stored and fetched somehow too, chewing up yet more bandwidth whenever we call into code that isn't in cache, whether due to a misprediction by the prefetcher or a forced spill from a hotly contested L2 or L3.
~95% of this one core's cycles are being left on the table purely waiting for data to arrive, for one case of one relatively minor task in the context of a game engine. 100% of the cycles on the other cores go unused too if they're waiting on this task to finish. Calling this an issue is an understatement. It's a performance tragedy in the making.
<informative_rant>
As a side note, this is why there has been such a huge push to improve memory bandwidth for CPUs in the enterprise space. Vectorization is an incredibly important optimization technique, and enterprise workloads generally have the luxury of performing more operations on each datum. However, if all your program needs to do is run a reduction over the data with a single operation per element, bandwidth is the choke point, and throwing more cores at the problem solves absolutely nothing. You have to fill RAM from disk or from another server. Splitting the data up among nodes to increase total bandwidth requires both an interconnect that can support it and a way to feed that interconnect from disk and RAM faster than the CPU can be fed from RAM.
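A sketch of the kind of single-operation reduction I mean (AVX1 intrinsics, hypothetical names, alignment and remainder handling omitted): every float is loaded once, touched by exactly one add, and never revisited, so the loop runs exactly as fast as memory can feed it and not one cycle faster.

```c
#include <immintrin.h>
#include <stddef.h>

/* Sum an array of floats. Assumes count is a multiple of 8 and data
 * is 32-byte aligned. */
static float sum(const float *data, size_t count)
{
    __m256 acc = _mm256_setzero_ps();
    for (size_t i = 0; i < count; i += 8)
        acc = _mm256_add_ps(acc, _mm256_load_ps(data + i));

    /* Horizontal reduction of the 8 partial sums. */
    __m128 lo = _mm256_castps256_ps128(acc);
    __m128 hi = _mm256_extractf128_ps(acc, 1);
    lo = _mm_add_ps(lo, hi);
    lo = _mm_hadd_ps(lo, lo);
    lo = _mm_hadd_ps(lo, lo);
    return _mm_cvtss_f32(lo);
}
```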
Amdahl was right, and Gustafson was right. Amdahl's Law essentially states that making a fixed-size task faster becomes impossible at some point: scaling is bounded by the serial portion of the workload as a theoretical best case, before memory gets in the way, let alone local disk, and Heaven forbid the network itself if you are so unfortunate. Gustafson's Law essentially states that you can get better overall scaling than Amdahl's Law would lead you to believe if you increase your problem size and complexity to the point where further distribution of tasks and data is worthwhile. In the end, executing Gustafson's larger task, even with better distribution, will take more time than Amdahl's original problem, because you still have to tackle the original problem (which could scale no further) on top of all the new work you added.
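For reference, the textbook forms of both laws, sketched as plain functions (p is the parallelizable fraction of the work, n the number of processors; these are the standard formulations, not anything specific to this series):

```c
/* Amdahl: fixed problem size; speedup is capped at 1/(1-p). */
double amdahl_speedup(double p, double n)
{
    return 1.0 / ((1.0 - p) + p / n);
}

/* Gustafson: problem size grows with n; speedup keeps climbing. */
double gustafson_speedup(double p, double n)
{
    return n - (1.0 - p) * (n - 1.0);
}
```

Plug in p = 0.95 and amdahl_speedup flattens out near 20x no matter how large n gets, while gustafson_speedup keeps growing, precisely because the task being measured keeps growing with the machine.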
</informative_rant>
In Part II, I will outline and demonstrate two of the primary techniques used to mitigate this bandwidth problem. In Part III, I intend to demonstrate why multithreading the scalar code cannot possibly produce better results, and should always produce worse ones, for reasons which go far beyond the simple overhead of launching and merging threads.