
SIMD in Context: The Bandwidth Problem Part I

patrickjp93


In the previous entry, I showed how AVX could produce a whopping 10x performance improvement for one specific workload in a game engine, and I gave a mathematical proof of correctness for the algorithm. However, I did not show how that solution compares to accelerating the task by multithreading the scalar code, and I only briefly mentioned why the SIMD code would run into memory bandwidth limitations; I haven't actually fleshed out either issue. This entry starts to do that for the latter. Once again, critique and questions are welcome.

 

As a reminder, I am limiting myself to processors with the AVX1 instruction set available. AVX2 brought the same 256-bit vector functionality to integers that AVX1 provided for floats, but it also brought more than 80 new instructions (a near-even count for both data types, plus some random bit generation and byte swizzling). AVX2 is also native only to Haswell and later for Intel and Excavator and later for AMD. AVX1 is available from Sandy Bridge onward, Bulldozer onward, and Jaguar onward, an important distinction because current console hardware is built on Jaguar.
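
As an aside, if you want a single binary to survive on pre-AVX hardware, a minimal runtime-dispatch sketch looks like the following. The transform_vertices_* functions are hypothetical placeholders, not code from my engine; __builtin_cpu_supports is a GCC/Clang builtin.

```cpp
#include <cstddef>

// Hypothetical AVX1 and scalar implementations, defined elsewhere.
void transform_vertices_avx(float* v, std::size_t n);
void transform_vertices_scalar(float* v, std::size_t n);

// Pick the AVX path only on hardware that reports it (Sandy Bridge, Bulldozer,
// and Jaguar onward); fall back to scalar code everywhere else.
void transform_vertices(float* v, std::size_t n)
{
    if (__builtin_cpu_supports("avx"))
        transform_vertices_avx(v, n);
    else
        transform_vertices_scalar(v, n);
}
```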

 

 

 

On the subject of the limits of SIMD scaling in any system: if you don't have the memory bandwidth to feed your operation, you will be leaving cycles on the table unless you have very specialized, speculative code and a very good out-of-order engine to back it.

 

For Sandy Bridge, the standard memory configuration would be dual-channel DDR3-1600. The peak bandwidth per channel is 12,800 MB/s, or 12.8GB/s. Thus we can take 25.6GB/s as a hard ceiling, and we should really cut that down by 5% to account for the noise of bank switching. But this does not tell the whole story. The situation is actually much worse when we consider that a SIMD workload works best when it can just zip down the cache lines and stream continuously through memory. The prefetchers, even on modern CPUs, are not tuned to jump from channel to channel and string together, in parallel, data that could sit in the cache linearly. The best we can assume with regard to my previous post is 12.8GB/s of bandwidth for our task, unless an enormous effort goes into laying the data out in chunks throughout memory that are cache-friendly without being contiguous. If you want to do the math and mapping of memory representations for 4- and 8-way set-associative caches, be my guest. And if you're that good, there are companies which will pay you a hell of a lot more to do that than the games industry will, without demanding nearly such grueling working hours.

 

So, we have 12.8 * 0.95 = 12.16GB/s in bandwidth to work with for any single mesh we wish to manipulate if we're constantly streaming from memory.

An AVX instruction processes 8 single-precision floats at once, and multiple copies of an instruction can be in flight, even if only one instruction's results can be committed per clock.

A Sandy Bridge i7-2600K at stock clocks boosts to 3.8GHz.

 

Even with just one core, we are looking at a total throughput (assuming 1 op per clock committed) of:

256/32 * 3.8*10^9 = 30.4*10^9 floating-point operations per second. Even if each op needed only a single byte of input, that would already be 2.5x our effective single-channel bandwidth, and the reality is worse. Every float is 4 bytes. This means the bandwidth in bytes per second required to keep this fed equals the byte count per operand (4), divided by the number of compute ops performed on each datum for the given task (1), multiplied by the total op count per second (30.4*10^9). We need at least (4/1)*(30.4*10^9), or 121.6GB/s of bandwidth from what is effectively a single channel of RAM, solely to get the data to the CPU optimally. If you have quad-channel DDR4-4266, that's just over what is needed to feed this beast!
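
To make the arithmetic intensity concrete, here is a minimal sketch of the kind of streaming loop this math assumes: a structure-of-arrays translation of one vertex component using AVX intrinsics. It is not the exact code from the previous entry, just an illustration of why there is only one compute op for every 4 bytes read (and, once the cache line is eventually evicted, 4 bytes written back).

```cpp
#include <immintrin.h>
#include <cstddef>

// Translate one contiguous component array (e.g. all x coordinates of a mesh) by a
// constant offset. One vector add per 8 floats: 32 bytes streamed in and 32 bytes that
// must eventually drain back out, for 8 floating-point ops of useful work.
void translate_component(float* xs, std::size_t count, float offset)
{
    const __m256 off = _mm256_set1_ps(offset);
    std::size_t i = 0;
    for (; i + 8 <= count; i += 8) {
        __m256 v = _mm256_loadu_ps(xs + i);  // load 8 floats
        v = _mm256_add_ps(v, off);           // the single compute op on that data
        _mm256_storeu_ps(xs + i, v);         // store 8 floats back to the same lines
    }
    for (; i < count; ++i)                   // scalar tail for counts not divisible by 8
        xs[i] += offset;
}
```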

 

 

And we're still not done getting kicked while we're down. Data and instructions that sit in cache too long can be kicked out (limited space, lots of data to churn, power efficiency demands, memory location associativity constraints...). If a cache line is old enough and a core needs to fill its slot with data from another memory location, the old cache line is evicted, and a write may or may not occur at the time of eviction (write buffering). Every cache line which has been modified, when evicted from L2 on AMD Jaguar, AMD APUs, and Athlons, or from L3 on Intel or AMD FX, has to be written back to RAM. In short, our bandwidth requirement just doubled to a whopping 243.2GB/s to keep the pipeline fed in one direction and draining freely in the other. That is more bandwidth than an uncut high-end GPU from 2010 had, required to feed 1 core of a mainstream CPU from 2011 at stock clocks. 1 CPU core needs all of this just to handle vertex translation using AVX. Notice also that this is exactly 8x our FLOPS figure: it takes exactly 8 bytes per second of bandwidth for every floating-point op per second of throughput. Clearly this solution cannot scale up or out.
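
Restating the arithmetic of this section as a quick sanity check (my own back-of-the-envelope figures in code, not anything from the engine):

```cpp
// Back-of-the-envelope check of the figures above.
constexpr double clock_hz        = 3.8e9;                       // i7-2600K single-core boost
constexpr double floats_per_inst = 256.0 / 32.0;                // 8 single-precision lanes per AVX register
constexpr double flops           = floats_per_inst * clock_hz;  // 30.4e9 ops per second
constexpr double read_bw         = 4.0 * flops;                 // 4 bytes in per op   -> 121.6 GB/s
constexpr double total_bw        = 2.0 * read_bw;               // plus write-back     -> 243.2 GB/s

static_assert(total_bw / flops == 8.0, "8 bytes of memory traffic per floating-point op");
```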

 

If you think this requirement doesn't apply because meshes are only a megabyte or two in size each (300,000 vertices/triangles for a single figure is quite a lot), consider that the rate of demand does not change for small and medium-sized data, and you can't dynamically change your RAM bandwidth mid-stream to deliver that smaller amount of data at a rate faster than 12.16GB/s. You also have multiple meshes to stream through every frame. A megabyte each across 32 enemies = 32MB, or 8 million floats, to process with roughly 5% of the bandwidth needed to keep this task fed optimally.
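
A rough sanity check on that figure, using only the numbers from this entry (the 32MB-per-frame framing is my own assumption for the example):

```cpp
#include <cstdio>

int main()
{
    const double traffic_bytes = 32e6 * 2.0;  // 32 MB streamed in, 32 MB written back
    const double effective_bw  = 12.16e9;     // single-channel estimate from above, bytes/s
    const double flops         = 30.4e9;      // one core, one AVX op committed per clock
    const double ops           = 8e6;         // 8 million floats, one op each

    const double stream_ms  = traffic_bytes / effective_bw * 1e3;  // ~5.3 ms waiting on RAM
    const double compute_ms = ops / flops * 1e3;                   // ~0.26 ms of actual math
    std::printf("streaming: %.2f ms, compute: %.2f ms, stalled: %.0f%%\n",
                stream_ms, compute_ms, 100.0 * (1.0 - compute_ms / stream_ms));
    return 0;
}
```

That works out to roughly 5.3 ms of streaming against roughly 0.26 ms of compute, which is where the ~95% figure in the next paragraphs comes from.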

 

For just 32MB of data, the ~95% time overhead on this one small task may not seem like much in raw time, but where does the bandwidth come from if we want to optimize everything else in a game engine with SIMD where appropriate? I haven't even brought up the fact that instructions themselves have to be stored and fetched somehow, also chewing up bandwidth whenever we call into code that isn't in cache because the prefetcher mispredicted or a hotly contested L2 or L3 forced a spill.

 

~95% of just this one core's cycles are being left on the table purely waiting for data to arrive for one case of one relatively minor task in the context of a game engine. 100% of the cycles on the other cores are going without utilization too if they're waiting on this task to be finished. Calling this an issue is an understatement. It's a performance tragedy in the making.

 

<informative_rant>


As a side note, this is why there has been such a huge push to improve memory bandwidth for CPUs in the enterprise space. Vectorization is an incredibly important optimization technique, and enterprise workloads generally have the luxury of more operations to perform on each datum. However, if all your program needs to do is run a reduction over the data with a single operation per element, bandwidth is the choke point, and throwing more cores at the problem solves absolutely nothing. You have to fill RAM from disk or from another server. Splitting the data up among nodes to increase total bandwidth requires both an interconnect that can support it and a way to feed the interconnect from disk and RAM faster than the CPU can be fed from RAM.

 

Amdahl was right, and Gustafson was right. Amdahl's Law basically states that speeding up a fixed-size task becomes impossible at some point: scaling is bounded by the serial portion of the workload in the theoretical best case, before memory even gets in the way, let alone local disk, and Heaven forbid the network itself if you are so unfortunate. Gustafson's Law basically states that you can get better average scaling than Amdahl's Law would lead you to believe if you increase your problem size and complexity to the point where further distribution of tasks and data becomes worthwhile. In the end, executing Gustafson's larger task, even with better distribution, will take more time than Amdahl's original problem, because you still have to tackle the original problem (which could scale no further) on top of all the new work you added.
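
For reference, the usual textbook statements of the two laws, with p the parallelizable fraction of the work and N the number of processors (my notation, not something quoted from either paper):

```latex
% Amdahl's Law: speedup of a fixed-size problem on N processors
S_{\mathrm{Amdahl}}(N) = \frac{1}{(1 - p) + p/N} \;\le\; \frac{1}{1 - p}

% Gustafson's Law: scaled speedup when the problem grows with N
S_{\mathrm{Gustafson}}(N) = (1 - p) + pN
```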

</informative_rant>

 

In Part II I will outline and demonstrate two of the primary techniques used to mitigate this bandwidth problem. In Part III I intend to demonstrate why multithreading the scalar code cannot possibly produce better results, and should always produce worse ones, for reasons which go far beyond the simple overhead of launching and joining threads.

24 Comments

I appreciate being asked for feedback but I don't think I am the right person to ask.

I'll read through this and your previous post when I get some more spare time though. Seems interesting.


you state L2 contested for AMD, yet many of the AMD CPUs do have some L3 (except Steamroller and Excavator).

 

So is it because the cache is Read only - write through (one of the major flaws in the OG Bulldozer cache), not RW?

21 minutes ago, Prysin said:

you state L2 contested for AMD, yet many of the AMD CPUs do have some L3 (except Steamroller and Excavator).

 

So is it because the cache is Read only - write through (one of the major flaws in the OG Bulldozer cache), not RW?

Jaguar and Puma also only have L2. I can still clarify for FX though. Good catch.

 

Cache is r/w. In fact writes to memory are buffered and done as batch submissions when possible.

 

And by contested, I mean other cores are using data and instructions in the LLC, so you don't get all the bandwidth of shared caches to yourself, and sometimes data gets kicked out because it's been there just a hair too long.

 

Although, your r/w comment reminded me why the situation gets even worse. I have to revise my bandwidth figures again.


No, for Bulldozer designs, one of the caches, L1 or L2, is write-through, not write-back. It's one of the major issues that causes cache misses. It's one of the things they fixed in Excavator that caused single-core perf to increase. The cache issues severely hindered performance.

Edit: I mean Bulldozer, Piledriver, and Steamroller, with the equivalent low-power versions.

Also, Jaguar is a slightly modified Piledriver model. It's basically Piledriver V2 for low power.

5 minutes ago, Prysin said:

No, for Bulldozer designs, one of the caches, L1 or L2, is write-through, not write-back. It's one of the major issues that causes cache misses. It's one of the things they fixed in Excavator that caused single-core perf to increase. The cache issues severely hindered performance.

Edit: I mean Bulldozer, Piledriver, and Steamroller, with the equivalent low-power versions.

Also, Jaguar is a slightly modified Piledriver model. It's basically Piledriver V2 for low power.

write-through doesn't cause a cache miss, but it does add cycles to stores. I believe it's L1 that's write-through to minimize the damage of cache line snooping by removing the uppermost layer when possible.

 

Jaguar is closer to Phenom than it is to the construction cores, afaik.

 

2 minutes ago, Prysin said:

All Intel caches in consumer products are write-back. Just FYI.

Yeah I know.

7 minutes ago, patrickjp93 said:

write-through doesn't cause a cache miss, but it does add cycles to stores. I believe it's L1 that's write-through to minimize the damage of cache line snooping by removing the uppermost layer when possible.

 

Jaguar is closer to Phenom than it is to the construction cores, afaik.

 

Yeah I know.

I know WT doesn't cause more misses, but the shared nature of the L2/Int/Float, having to go through a WT L1, causes bottlenecks when/if it misses at all.


Also, going by your current "development" of this theory, the "solution" to vertex compute in games would be to run it as 128-bit AVX (which will help lower-TDP units as well) and reserve one or two threads for non-parallelizable, low-dependency scalar workloads.

 

EDIT: the following applies more to AMD Bulldozer.

 

If you could address the two 128-bit AVX units individually and "lock" each to its own memory channel, each AMD Bulldozer core could run TWO AVX threads with their own "bandwidth restraint". And as long as the data has low dependency, the bandwidth restraint wouldn't hurt the other cores too much. You would still be at the mercy of AMD's shitty memory controller though.

 

2 minutes ago, Prysin said:

Also, going by your current "development" of this theory, the "solution" to vertex compute in games would be to run it as 128-bit AVX (which will help lower-TDP units as well) and reserve one or two threads for non-parallelizable, low-dependency scalar workloads.

 

 

Wait for it patiently, young grasshopper. This will go from being a horrifying but cute toy example to incredibly convoluted just staying within mesh transforms and matrix multiplication. It's also not a theory. You have to start at a high level and go down into the dark gloom a bit at a time. By the time I'm done explaining how I optimize my code around 1 thread on 1 core (we'll be at 12 entries before that), most won't be eager to see how I use pthreads (not std::thread, way too much overhead) with what I call task frames and working groups to avoid function calls altogether and split up the work and task order to asynchronously build a frame and max out bandwidth without dropping a cycle. It's a harebrained work of art that performs 11% better than native threads and function calls using std::thread on just 2 cores.

 

I'm going to try to post 1 page a day until entry 7. Then it's gonna get rough and will make me have to step through the logic again.

1 minute ago, patrickjp93 said:

Wait for it patiently, young grasshopper. This will go from being a horrifying but cute toy example to incredibly convoluted just staying within mesh transforms and matrix multiplication. It's also not a theory. You have to start at a high level and go down into the dark gloom a bit at a time. By the time I'm done explaining how I optimize my code around 1 thread on 1 core (we'll be at 12 entries before that), most won't be eager to see how I use pthreads (not std::thread, way too much overhead) with what I call task frames and working groups to avoid function calls altogether and split up the work and task order to asynchronously build a frame and max out bandwidth without dropping a cycle. It's a harebrained work of art that performs 11% better than native threads and function calls using std::thread on just 2 cores.

 

I'm going to try to post 1 page a day until entry 7. Then it's gonna get rough and will make me have to step through the logic again.

You're boring. I cannot wait a day for each explanation.

I learnt the basics of graphics design and shadow models in a day (well, roughly 1.5 days to be precise, as I replayed the recording a few times to grasp the details better), all while actually working... and you tell me I have to wait half a month in order to get all the info on how to multithread a game?

 

 

1 minute ago, Prysin said:

You're boring. I cannot wait a day for each explanation.

I learnt the basics of graphics design and shadow models in a day (well, roughly 1.5 days to be precise, as I replayed the recording a few times to grasp the details better), all while actually working... and you tell me I have to wait half a month in order to get all the info on how to multithread a game?

You're not gonna get all of it from me, and creating educational materials is more difficult than perusing them. And yes, teaching graphics is easier than teaching graphics with performance in mind using the best tools at our disposal. It's easy to spout off "the order of your matrix multiplications must be X,Y,Z b/c otherwise weird shit happens!" or "Here's how you make linear, quadratic, and logarithmic fog!"

 

Graphics boils down to a few very simple mathematical equations, matrix algebra, picking good textures, lighting, and buffering data correctly into the GPU driver. High Performance Graphics boils down to a self-balancing K-D Tree or Oct-Tree for space partitioning, skip fields for dying creatures, ray tracing in a cache-friendly way, using multiple threads to track events and networking in addition to AI decisions and on and on and on...

1 minute ago, patrickjp93 said:

You're not gonna get all of it from me, and creating educational materials is more difficult than perusing them. And yes, teaching graphics is easier than teaching graphics with performance in mind using the best tools at our disposal. It's easy to spout off "the order of your matrix multiplications must be X,Y,Z b/c otherwise weird shit happens!" or "Here's how you make linear, quadratic, and logarithmic fog!"

 

Graphics boils down to a few very simple mathematical equations, matrix algebra, picking good textures, lighting, and buffering data correctly into the GPU driver. High Performance Graphics boils down to a self-balancing K-D Tree or Oct-Tree for space partitioning, skip fields for dying creatures, ray tracing in a cache-friendly way, using multiple threads to track events and networking in addition to AI decisions and on and on and on...

nothing is simple if your goal is to ray-trace everything in real time:P

24 minutes ago, Prysin said:

nothing is simple if your goal is to ray-trace everything in real time:P

Uh... 

 

And Intel is shipping a chip 3x as powerful as this by the end of December.

17 minutes ago, patrickjp93 said:

Uh... 

 

And Intel is shipping a chip 3x as powerful as this by the end of December.

But that chip cannot be slotted into an ASUS Maximus Hero VII nor an ASUS SABERTOOTH 990FX R2.0.


This was a lot of information to digest. I would also like to point out that, even if you run DDR3-1600, you will never achieve your peak theoretical bandwidth of 12.8GB/s, or even that 12.1GB/s, without some manual tweaking. I have yet to come across a kit of memory that had properly trained tertiary timings in relation to everything else (RTL/IO-L being the most difficult for a board to train alongside these timings). Not only that, but even as we move to the enthusiast platforms, additional channels tend to reduce efficiency. Peak theoretical bandwidth doubles from dual to quad channel, but you get further away from achieving that number. With my tweaking, I can achieve 98% efficiency in writes and 96% efficiency in reads in dual channel. In quad channel using DDR4, I am lucky to hit 85% of my peak theoretical bandwidth.

 

As much as I would like to see AVX catch on in gaming applications, memory bandwidth is not the only problem. AVX is extremely efficient, and people with their pseudo-stable overclocks are going to immediately feel the effects of it the moment they attempt to game with absurd clocks/voltages. Granted, it won't be nearly as bad as the AVX2 loads you see in Linpack or P95, but it is still far more than what current games offer.

 

On a side note, didn't a few titles try AVX before? I think GRID did, but I don't know the exact implementation of it. 


@MageTank I think you're right, GRID did use AVX. I think it was for the physics. Not sure if it was AVX, but I think it was called "vector physics"...

Also, if you want to have someone more competent check this stuff out, @patrickjp93, then why didn't you add @Tomsen???

2 hours ago, Prysin said:

@MageTank I think you're right, GRID did use AVX. I think it was for the physics. Not sure if it was AVX, but I think it was called "vector physics"...

Also, if you want to have someone more competent check this stuff out, @patrickjp93, then why didn't you add @Tomsen???

I have very little knowledge of who around here is best to call. Feel free to loop in whomever you want.

7 hours ago, patrickjp93 said:

@Prysin, @MageTank, and @Tomsen: I updated the bandwidth requirements and did some tidying of the article. 243.2GB/s just for 1 CPU core seems a bit excessive, don't you think? ;)

A bit excessive? Maybe closer to "currently impossible", lol. Let's assume that Broadwell-E's IMC is magically strong enough to handle DDR4 at its highest rated speed of 4266MHz. In order to achieve the bandwidth requirement you mathed out, you would need AT LEAST 90% efficiency on that RAM (90% of 272.8GB/s = 245.52GB/s). To put that into perspective, the highest quad-channel efficiency I've seen thus far is around 80%. Granted, not many people overclock their RAM on enthusiast platforms because quad channel is already more than enough bandwidth for their tasks, but I still don't think you'd get too far with what we've seen from their IMCs.

 

That isn't even counting the fact that you said you need that kind of bandwidth on a single core. I currently hit roughly 54GB/s bandwidth in writes (53GB/s in reads), but if I were to shut off my cores, that number drops drastically. Just by going from my Pentium G4400 to a Skylake i5, I gained 20% higher bandwidth at the exact same memory clock speeds. In fact, I can test this for you right now; give me a few minutes to boot with a single core.

 

EDIT: Okay, now ASRock has my attention. It turns out that if you disable cores on this specific board, it requires a hard CMOS clear to get them back. Changing the active core configuration back to "all" changes absolutely nothing. I was able to get 3 back, but in order to get 4, I had to completely clear CMOS. Not only that, but my memory overclock will no longer load. Something tells me it broke one of the subtle timing adjustments, so I will have to manually input that again. Luckily, I still have the results of my overclock saved from before, just without the PhotoWorxx result.

 

1 Core/1 Thread: [screenshot of memory bandwidth and latency results]

 

All Cores/Threads:

 

[screenshot of memory bandwidth and latency results]

 

Write bandwidth doesn't change much at all and can probably be considered margin of error (background processes can impact the results of this test, and the all-thread result is older). However, we see a 40% difference in read bandwidth and a 15% difference in copy speed. Latency remains unchanged. It's safe to say that, regardless of platform, achieving your proposed memory bandwidth requirement with a single core is impossible with the technology we currently have.

 

EDIT 2: I did manage to find a PhotoWorxx result from when you asked me to run my CPU at 4.5GHz. Note: PhotoWorxx does not scale with CPU clock speed; it is strictly a memory bandwidth test.

 

[screenshot of the PhotoWorxx result]

 
