SIMD in Context: The Bandwidth Problem Part I
In the previous entry, I showed how AVX could produce a whopping 10x performance improvement for one specific workload in a game engine, along with a mathematical proof of correctness for the algorithm. However, I did not show how that solution compares to accelerating the task by multithreading the scalar code, and I only briefly mentioned why the SIMD code would run into memory bandwidth limitations; I never actually fleshed either issue out. This entry starts on the latter. Once again, critique and questions are welcome.
As a reminder, I am limiting myself to processors with the AVX1 instruction set available. AVX2 brought the same 256-bit vector functionality to integers that AVX1 provided for floats, but it also brought with it more than 80 new instructions (a near-even count across both data types, plus some random bit generation and character swizzling). AVX2 is also native only to Haswell and later for Intel, and Excavator and later for AMD. AVX1 is available from Sandy Bridge onward, Bulldozer onward, and Jaguar onward, an important distinction because current console hardware is built on Jaguar.
On the subject of the limits of SIMD scaling in any system: if you don't have the memory bandwidth to feed your operation, you will be leaving cycles on the table unless you have very specialized, speculative code and a very good out-of-order engine to back it.
For Sandy Bridge, the standard memory configuration would be dual-channel DDR3-1600. At 1600 MT/s over a 64-bit bus, peak bandwidth per channel is 12,800 MB/s, or 12.8GB/s. Thus, we can assume 25.6GB/s is a hard limit, and we should really cut that down by 5% to account for the noise of bank switching. But this does not tell the whole story. The situation is actually much worse when we consider that a SIMD workload works best when it can just zip down the cache lines and stream continuously through memory. The prefetchers even on modern CPUs are not tuned to jump from channel to channel to string together, in parallel, data that could sit in the cache linearly. The best we can assume with regard to my previous post is 12.8GB/s of bandwidth for our task, barring an enormous effort to lay data out in chunks throughout memory that are cache friendly without being contiguous. If you want to do the math and mapping of memory representations for 4- and 8-way set-associative caches, be my guest. And if you're that good, there are companies that will pay you a hell of a lot more to do that than the games industry will, without demanding nearly such grueling working hours.
So, we have 12.8 * 0.95 = 12.16GB/s in bandwidth to work with for any single mesh we wish to manipulate if we're constantly streaming from memory.
An AVX instruction can process 8 single-precision floats at once, and multiple copies of an instruction can be in flight simultaneously, even if results can only be committed by one instruction per clock.
A Sandy Bridge 2600K at stock clocks boosts to 3.8GHz.
Even on one core, assuming we get 1 op per clock committed, we are looking at a total throughput of:
(256 bits / 32 bits per float) * 3.8*10^9 clocks/s = 30.4*10^9 floating-point operations per second. That number is already 2.5x our usable single-channel bandwidth figure, but it's even worse than that. Every float is 4 bytes. This means the bandwidth requirement in bytes per second to keep this fed must equal the byte count per datum (4), divided by the number of compute ops performed on that datum for the given task (1), multiplied by the total op count per second (30.4*10^9). We need at least (4/1)*(30.4*10^9), or 121.6GB/s of bandwidth, out of a single channel, solely to get the data to the CPU optimally. If you have quad-channel DDR4-4266, that's just over what is needed to feed this beast!
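To make that ratio concrete, here is a minimal sketch of the kind of one-op-per-element kernel under discussion (hypothetical names; this is not the previous entry's exact code). Each iteration performs 8 adds while pulling 32 bytes in and pushing 32 bytes out, so compute per byte is as thin as it gets:

```c
#include <immintrin.h>
#include <stddef.h>

/* Translate a packed array of vertex components by a constant offset.
 * A sketch only: assumes count is a multiple of 8 and positions is
 * 32-byte aligned, with no remainder handling. */
static void translate_x(float *positions, size_t count, float dx)
{
    const __m256 offset = _mm256_set1_ps(dx);
    for (size_t i = 0; i < count; i += 8) {
        __m256 v = _mm256_load_ps(positions + i); /* 32 bytes in       */
        v = _mm256_add_ps(v, offset);             /* 1 op per float    */
        _mm256_store_ps(positions + i, v);        /* 32 bytes back out */
    }
}
```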
And we're still not done getting kicked while we're down. Data and instructions that sit in cache too long without use can be kicked out (limited space, lots of data to churn, power efficiency demands, memory location associativity constraints...). If a cache line is old enough and a core needs to fill it with data from another memory location, the old cache line is evicted, and a write may or may not occur at the time of eviction (write buffering). Every cache line which has been modified, when evicted from L2 on AMD Jaguar, AMD APUs and Athlons, or from L3 on Intel or AMD FX, has to be written back to RAM. In short, our bandwidth requirement just doubled to a whopping 243.2GB/s to keep the pipeline fed in one direction and draining freely in the other. That is more memory bandwidth than an uncut high-end GPU from 2010 had, required to feed 1 core of a mainstream CPU from 2011 at stock clocks. 1 CPU core needs all of this just to handle vertex translation using AVX. Notice also that this is exactly 8x our FLOP/s figure: 4 bytes read plus 4 bytes written back per float op means it takes exactly 8x the bandwidth in bytes/second to optimally feed our throughput in floating-point ops/second. Clearly this solution cannot scale up or out.
If you think this requirement doesn't apply because meshes are only a megabyte or two in size each (300,000 vertices/triangles for a single figure is quite a lot), consider that the rate of demand does not change for small and medium-sized data, and you can't dynamically change your RAM bandwidth mid-stream to fetch that smaller amount of data at a rate faster than 12.16GB/s. You also have multiple meshes to stream through every frame. A megabyte each with 32 enemies = 32MB, or 8 million floats, to process with roughly 5% of the bandwidth needed to keep this task fed optimally.
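As a back-of-envelope check on where those percentages come from, here is the arithmetic using only the figures derived above (a sanity-check sketch, nothing more):

```c
#include <stdio.h>

int main(void)
{
    const double bytes      = 32e6 * 2; /* 32MB read + 32MB written back */
    const double bandwidth  = 12.16e9;  /* usable single-channel B/s     */
    const double flops      = 8e6;      /* one op each for 8M floats     */
    const double throughput = 30.4e9;   /* peak AVX FLOP/s on one core   */

    double stream_time  = bytes / bandwidth;  /* ~5.26 ms */
    double compute_time = flops / throughput; /* ~0.26 ms */
    printf("streaming: %.2f ms, compute: %.2f ms, stalled: %.0f%%\n",
           stream_time * 1e3, compute_time * 1e3,
           100.0 * (1.0 - compute_time / stream_time));
    return 0;
}
```

The core finishes its share of the math in roughly a quarter of a millisecond and then sits idle for the remaining ~95% of the streaming time.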
For just 32MB of data, the ~95% time overhead required to run this one small task may not seem like much in raw time, but what if we want to optimize everything in a game engine with SIMD where appropriate? Where does the bandwidth come from to optimize everything else using SIMD instructions where possible? I haven't even brought up the fact that instructions have to be stored and fetched somehow too, chewing up yet more bandwidth whenever we call into code that isn't in cache, whether due to a misprediction by the prefetcher or a forced spill from a hotly contested L2 or L3.
~95% of this one core's cycles are being left on the table purely waiting for data to arrive, for one case of one relatively minor task in the context of a game engine. 100% of the cycles on the other cores go unused too if they're waiting on this task to finish. Calling this an issue is an understatement. It's a performance tragedy in the making.
<informative_rant>
As a side note, this is why there has been such a huge push to improve memory bandwidth for CPUs in the enterprise space. Vectorization is an incredibly important optimization technique, and enterprise workloads generally have the luxury of performing more operations on each datum. However, if all your program needs to do is run a reduction over the data with a single operation per element, bandwidth is the choke point, and throwing more cores at the problem solves absolutely nothing. You have to fill RAM from disk or from another server. Splitting the data up among nodes to increase total bandwidth requires both an interconnect that can support it and a way to feed that interconnect from disk and RAM faster than the CPU can be fed from RAM.
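A sketch of the kind of single-operation reduction I mean (AVX1 intrinsics, hypothetical names, alignment and remainder handling omitted): every float is loaded once, touched by exactly one add, and never revisited, so the loop runs exactly as fast as memory can feed it and not one cycle faster.

```c
#include <immintrin.h>
#include <stddef.h>

/* Sum an array of floats. Assumes count is a multiple of 8 and data
 * is 32-byte aligned. */
static float sum(const float *data, size_t count)
{
    __m256 acc = _mm256_setzero_ps();
    for (size_t i = 0; i < count; i += 8)
        acc = _mm256_add_ps(acc, _mm256_load_ps(data + i));

    /* Horizontal reduction of the 8 partial sums. */
    __m128 lo = _mm256_castps256_ps128(acc);
    __m128 hi = _mm256_extractf128_ps(acc, 1);
    lo = _mm_add_ps(lo, hi);
    lo = _mm_hadd_ps(lo, lo);
    lo = _mm_hadd_ps(lo, lo);
    return _mm_cvtss_f32(lo);
}
```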
Amdahl was right, and Gustafson was right. Amdahl's Law essentially states that making a fixed-size task faster becomes impossible at some point: scaling is bounded by the serial portion of the workload as a theoretical best case, before memory gets in the way, let alone local disk, and Heaven forbid the network itself if you are so unfortunate. Gustafson's Law essentially states that you can get better overall scaling than Amdahl's Law would lead you to believe if you increase your problem size and complexity to the point where further distribution of tasks and data is worthwhile. In the end, executing Gustafson's larger task, even with better distribution, will take more time than Amdahl's original problem, because you still have to tackle the original problem (which could scale no further) on top of all the new work you added.
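For reference, the textbook forms of both laws, sketched as plain functions (p is the parallelizable fraction of the work, n the number of processors; these are the standard formulations, not anything specific to this series):

```c
/* Amdahl: fixed problem size; speedup is capped at 1/(1-p). */
double amdahl_speedup(double p, double n)
{
    return 1.0 / ((1.0 - p) + p / n);
}

/* Gustafson: problem size grows with n; speedup keeps climbing. */
double gustafson_speedup(double p, double n)
{
    return n - (1.0 - p) * (n - 1.0);
}
```

Plug in p = 0.95 and amdahl_speedup flattens out near 20x no matter how large n gets, while gustafson_speedup keeps growing, precisely because the task being measured keeps growing with the machine.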
</informative_rant>
In Part II, I will outline and demonstrate two of the primary techniques used to mitigate this bandwidth problem. In Part III, I intend to demonstrate why multithreading the scalar code cannot possibly produce better results, and should always produce worse ones, for reasons which go far beyond the simple overhead of launching and merging threads.