Real Game Engine Code Optimization

patrickjp93 · October 11, 2016

@LAwLz @Prysin I thought I'd get some feedback from you two in particular if possible.

Prysin · October 11, 2016

i got a x4 845 Excavator CPU if you need to run some code tests, seeing as you have no measurements for Excavator

patrickjp93 · October 11, 2016

2 minutes ago, Prysin said:

i got a x4 845 Excavator CPU if you need to run some code tests, seeing as you have no measurements for Excavator

I'm actually very surprised Agner hasn't gotten around to it yet. He's usually very punctual.

You need a hardware profiler to measure both latency and throughput. Unless you have one of those $50,000 machines lying around, your measurements wouldn't be worth anything, but thanks for the offer.

Prysin · October 11, 2016

while i understand your desire to use AVX it would wreck performance on laptops and other TDP constrained devices unless switching between AVX and "normal" coding is done often enough (intentionally) to keep the heat buildup in check. This will again tank the "potential" of said iteration.

Also struggling to understand some parts, need to read more up on coding.

Prysin · October 11, 2016

Just now, patrickjp93 said:

I'm actually very surprised Agner hasn't gotten around to it yet. He's usually very punctual.

You need a hardware profiler to measure both latency and throughput. Unless you have one of those $50,000 machines lying around, your measurements wouldn't be worth anything, but thanks for the offer.

lets build one.

to quote Jeremy Clarkson

"How hard can it be?"

Prysin · October 11, 2016

technically, couldn't you build such a test using a barebones Linux kernel (one that barely gets you past POST, runs the test, prints the result on screen and doesn't terminate until you type exit)?

at the kernel level you shouldn't have too much interference from anything else to ruin the results.

patrickjp93 · October 11, 2016

Just now, Prysin said:

while i understand your desire to use AVX it would wreck performance on laptops and other TDP constrained devices unless switching between AVX and "normal" coding is done often enough (intentionally) to keep the heat buildup in check. This will again tank the "potential" of said iteration.

Also struggling to understand some parts, need to read more up on coding.

AVX only gets hot when you abuse the crap out of it like in IBT. When it's just a couple of the same ops with every pass (not like the FFTs where more than 2/3 of the AVX instruction set gets used), my MacBook Pro Retina doesn't even spin up the fan.

https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=AVX

That's the shitty documentation you have to use to get started. It makes more sense to hardware engineers than software devs for sure.

Prysin · October 11, 2016

does your Macbook have a fan?

also that intel document makes no sense. The explanations are so barebones you cannot even follow them with basic logic.

needs more AVX for dummies

patrickjp93 · October 11, 2016

1 minute ago, Prysin said:

technically, couldn't you build such a test using a barebones Linux kernel (one that barely gets you past POST, runs the test, prints the result on screen and doesn't terminate until you type exit)?

at the kernel level you shouldn't have too much interference from anything else to ruin the results.

Maybe. I haven't done benchmarking at such a low level. However, I'm pretty sure that to test individual instruction latencies and throughputs you're going to need direct hardware connection profiling tools that can sniff cache and registers just as fast as they refresh. They cost a ton of money to buy, and you have to get the pinout documentation for the pads if you want to build one.
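
Short of probing the hardware, a much coarser software-only estimate is at least possible. The block below is a minimal sketch of that idea, not the method discussed in this thread: it times a chain of float additions in which every add depends on the previous result, so the loop is limited by add latency rather than throughput. The function name `ns_per_dependent_add` is hypothetical, and the result is wall-clock nanoseconds from `std::chrono::steady_clock`, not cycles, so it will be affected by clock speed changes, loop overhead and whatever else the OS is doing.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Hypothetical helper (illustration only): average nanoseconds per float add
// in a dependent chain. Each add needs the previous result, so the chain is
// bound by add latency, not throughput. Loop overhead is included in the figure.
double ns_per_dependent_add(std::size_t chain_length) {
    assert(chain_length > 0);

    // Volatile reads keep the compiler from folding the chain at compile time.
    volatile float seed = 1.0f;
    volatile float step = 1e-7f;
    float x = seed;
    const float c = step;
    volatile float sink = 0.0f;

    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < chain_length; ++i) {
        x += c;  // depends on the value produced by the previous iteration
    }
    sink = x;    // volatile write keeps the loop alive and ahead of the stop time
    const auto stop = std::chrono::steady_clock::now();
    (void)sink;

    const std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(chain_length);
}
```

Calling `ns_per_dependent_add(100000000)` a few times and taking the lowest value gives a rough per-add figure; multiplying by the core clock in GHz turns it into an approximate cycle count, assuming the clock stays fixed during the run.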

patrickjp93 · October 11, 2016

1 minute ago, Prysin said:

does your Macbook have a fan?

also that intel document makes no sense. The explanations are so barebones you cannot even follow them with basic logic.

needs more AVX for dummies

LOL!

Well, this is the best starter for the visual learner, but it was abandoned at SSE3 because Intel started releasing 50 then 100 then gargantuanly more new instructions every year.

http://www.tommesani.com/index.php/simd/46-sse-arithmetic.html

This is also pretty good to learn from, but it doesn't cover more than 5% of the AVX extensions.

http://www.codeproject.com/Articles/874396/Crunching-Numbers-with-AVX-and-AVX

patrickjp93 · October 11, 2016

@MageTank Care to get in on this?

Prysin · October 11, 2016

18 minutes ago, patrickjp93 said:

Maybe. I haven't done benchmarking at such a low level. However, I'm pretty sure that to test individual instruction latencies and throughputs you're going to need direct hardware connection profiling tools that can sniff cache and registers just as fast as they refresh. They cost a ton of money to buy, and you have to get the pinout documentation for the pads if you want to build one.

meh, that's not hard to do. you can do that simply by using a PLC, voltage booster chips (you need to boost the signal up to around 10 V) and some tiny wires. gonna take a week to build the shit. In the end, it would cost like $5,000-7,000 to jury-rig it. Would involve a lot of manual labor.

But hardware wise, it's not hard to do. The majority of the cost would be the PLC and ofc, the shitload of time you would need to solder tiny wires to the pins. you wouldn't be able to measure cache issues, BUT you could home in on other signals, like memory fetch signals, and look at how long it takes to execute a piece of code on hardware A, then extrapolate the performance based on known numbers and the differential.

EDIT:
luckily we have both Piledriver and Steamroller chips that work with FM2+... so it's totally possible to test.

I just dont have 5-7k USD on hand atm

patrickjp93 · October 11, 2016

1 minute ago, Prysin said:

meh, that's not hard to do. you can do that simply by using a PLC, voltage booster chips (you need to boost the signal up to around 10 V) and some tiny wires. gonna take a week to build the shit. In the end, it would cost like $5,000-7,000 to jury-rig it. Would involve a lot of manual labor.

But hardware wise, it's not hard to do. The majority of the cost would be the PLC and ofc, the shitload of time you would need to solder tiny wires to the pins.

I'd give you a thumbs up, but the lazy bums at @LinusTech haven't implemented it yet for blogs!

Prysin · October 11, 2016

48 minutes ago, patrickjp93 said:

I'd give you a thumbs up, but the lazy bums at @LinusTech haven't implemented it yet for blogs!

at least they aren't much worse than your average game dev

patrickjp93 · October 11, 2016

3 minutes ago, Prysin said:

at least they aren't much worse than your average game dev

Meta reply is meta.

Prysin · October 11, 2016

2 minutes ago, patrickjp93 said:

Meta reply is meta.

get on skype

the longer you procrastinate, the worse the skypelag

patrickjp93 · October 11, 2016

4 minutes ago, Prysin said:

get on skype

the longer you procrastinate, the worse the skypelag

I have to call it a night. It's 1:30 AM my time now, and I get up at 6:45 for work.

Prysin · October 11, 2016

oh, BTW patrick

your code. Does it take into consideration that CMT-based AMD CPUs use 2x 128-bit vector units to handle 256-bit AVX?

Prysin · October 11, 2016

Just now, patrickjp93 said:

I have to call it a night. It's 1:30 AM my time now, and I get up at 6:45 for work.

weaksauce... i make fun of Rasmus all day long, go to bed around midnight and get up at 4:55 every damn fucking day

MageTank · October 11, 2016

3 hours ago, patrickjp93 said:

@MageTank Care to get in on this?

Sorry, I am confused as to how these blog entries work. What exactly am i getting in on?

patrickjp93 · October 11, 2016

This current one. Just reveal the spoiler up top.

Prysin · October 12, 2016

9 hours ago, MageTank said:

Sorry, I am confused as to how these blog entries work. What exactly am i getting in on?

You're getting on the hype train

patrickjp93 · October 12, 2016

16 hours ago, MageTank said:

Sorry, I am confused as to how these blog entries work. What exactly am i getting in on?

The current one. Just open the hidden section up top. I'm looking for critique, (dis) agreement, suggestions, etc....

patrickjp93 · October 16, 2016

@Prysin I updated the code and provided a trimmed-down version for performance testing. Would you mind collecting some hard numbers for me? Pick whatever compiler. I've been using Clang 3.8 with -O3 enabled. Find the optimization level that works best if you use GCC or ICC. If you use MSVC, you'll have to do the research yourself on what combo of flags you'll need.

My average timings and variance for a 4960HQ on my Macbook Pro Retina under Fedora 24, latest kernel as of 10/15/2016:

Compiler: Clang++ 3.8.0

Flags: -std=c++14 -O3

Mesh size in floats: 90000
Scalar translation took 6.08489e-04s +- 0.11032e-04s
Vector translation took 5.82480e-05s +- 0.14391e-05s

Basically, even with the bandwidth bottleneck, that's roughly a 10x performance improvement over scalar code (about 10.4x going by the averages above). And even if I loop 3 billion times over the same mesh, I don't thermal throttle under AVX. Mind you, it's only addition, but still.
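
For readers who can't see the code being timed, here is a minimal sketch of what a scalar versus AVX mesh translation can look like. It is not the code from the blog entry: the function names `translate_scalar` and `translate_vector`, the packed x,y,z float layout and the use of unaligned loads are all assumptions made for illustration. The AVX path is only compiled when the compiler defines `__AVX__` (for example with `-mavx` on Clang or GCC); otherwise the same function falls back to the scalar loop.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
#if defined(__AVX__)
#include <immintrin.h>
#endif

// Scalar baseline: add the offset (dx, dy, dz) to every vertex.
// Assumed layout: packed floats x0,y0,z0,x1,y1,z1,...
void translate_scalar(std::vector<float>& mesh, float dx, float dy, float dz) {
    assert(mesh.size() % 3 == 0);
    for (std::size_t i = 0; i < mesh.size(); i += 3) {
        mesh[i]     += dx;
        mesh[i + 1] += dy;
        mesh[i + 2] += dz;
    }
}

// AVX version: 24 floats (8 vertices) fill exactly three 8-wide registers,
// so three precomputed offset patterns cover the repeating x,y,z sequence.
void translate_vector(std::vector<float>& mesh, float dx, float dy, float dz) {
    assert(mesh.size() % 3 == 0);
    const std::size_t n = mesh.size();
    std::size_t i = 0;
#if defined(__AVX__)
    const __m256 off0 = _mm256_setr_ps(dx, dy, dz, dx, dy, dz, dx, dy);
    const __m256 off1 = _mm256_setr_ps(dz, dx, dy, dz, dx, dy, dz, dx);
    const __m256 off2 = _mm256_setr_ps(dy, dz, dx, dy, dz, dx, dy, dz);
    float* p = mesh.data();
    for (; i + 24 <= n; i += 24) {
        // Unaligned loads/stores: std::vector gives no 32-byte alignment guarantee.
        _mm256_storeu_ps(p + i,      _mm256_add_ps(_mm256_loadu_ps(p + i),      off0));
        _mm256_storeu_ps(p + i + 8,  _mm256_add_ps(_mm256_loadu_ps(p + i + 8),  off1));
        _mm256_storeu_ps(p + i + 16, _mm256_add_ps(_mm256_loadu_ps(p + i + 16), off2));
    }
#endif
    // Scalar tail (or the whole mesh when AVX is not enabled at compile time).
    for (; i < n; i += 3) {
        mesh[i]     += dx;
        mesh[i + 1] += dy;
        mesh[i + 2] += dz;
    }
}
```

Note that at `-O3` the compiler may auto-vectorize the scalar loop with SSE on its own, so the gap between the two versions depends on the compiler, the flags and the target CPU.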

Prysin · October 16, 2016

@patrickjp93 i think you are overestimating my coding ability grossly. Issue is, i have started at a low level, understanding what it means and does (when i look at it long enough) but i have near zero of the basics, so i cannot do much about it when it comes to using it. That is what i am working on atm, however it's going slow, very very slow, as i am in the middle of buying my own house and moving.

Model Matrix and Vector Transforms Optimized By SIMD