
Model Matrix and Vector Transforms Optimized By SIMD

patrickjp93


I'm sick to death of people telling me "if it was so easy, the game devs would have done it by now. They know better than you do."

 

Here is visible, incontrovertible proof that the games industry can get a huge boost from taking advantage of SIMD today, especially when games require Sandy Bridge or later hardware (meaning AVX is available, but not AVX2 for our purposes).

 

First Example: Mesh Transform By Translation Using AVX Intrinsics

 

Example updated and trimmed for readability.


#include <cstdalign>
#include <iostream>
#include <chrono>
#include <ctime>
#include <x86intrin.h>

using uint = unsigned int; //not guaranteed by the standard headers

//Size chosen because 90,000 floats = 30,000 x/y/z vertices, a medium-high
//budget for modern prominent characters
const uint size = 90000;
alignas(32) const float Mat3T[8]    = {1.0f, 2.0f, 3.0f, 1.0f,
                                       2.0f, 3.0f, 1.0f, 2.0f};
alignas(32)       float Mesh[size]  = {};

void translate_scalar(float *Mesh, const float *translation, const uint length)
{
  for(uint i = 0; i < length; i+=3)
  {
    Mesh[i]   += translation[0];
    Mesh[i+1] += translation[1];
    Mesh[i+2] += translation[2];
  }
}

void translate_vector(float *Mesh, const float *translation, const uint length)
{
  __m256 trans = _mm256_load_ps(translation);
  
  //i indexes the last float of the current 8-wide chunk, so i < length
  //guarantees the whole chunk stays in bounds
  uint i = 7;
  for(; i < length; i += 8, Mesh += 8)
  {
    __m256 verts = _mm256_load_ps(Mesh);
    verts        = _mm256_add_ps(verts, trans);
    _mm256_store_ps(Mesh, verts);
    
    trans = _mm256_permute_ps(trans, _MM_SHUFFLE(2, 1, 0, 2));
  }

  

  //Cleanup loop for cases where length is not a multiple of 8.
  //The main loop handled i - 7 floats, so length - (i - 7) remain, and
  //trans already holds the right component pattern for them.
  uint diff = 7 - (i - length);
  if(diff != 0)
  {
    alignas(32) float temp[8] = {};
    _mm256_store_ps(temp, trans);
    for(uint j = 0; j < diff; ++j) { Mesh[j] += temp[j]; }
  }
}

int main()
{
  using namespace std::chrono;
  std::cout << "Mesh size in floats: " << size << "\n";
  high_resolution_clock::time_point start, end;
  
  start = high_resolution_clock::now();
  translate_scalar(Mesh, Mat3T, size);
  end = high_resolution_clock::now();

  duration<double> time_span = duration_cast<duration<double>>(end - start);
  std::cout << "Scalar translation took " << time_span.count() << "s\n";


  
  start = high_resolution_clock::now();
  translate_vector(Mesh, Mat3T, size);
  end = high_resolution_clock::now();

  duration<double> time_span2 = duration_cast<duration<double>>(end - start);
  std::cout << "Vector translation took " << time_span2.count() << "s\n";

  /*//This will double-check your work.
  for(uint i = 0; i < size; i += 3)
  {
    std::cout << Mesh[i] << ", " << Mesh[i+1] << ", " << Mesh[i+2] << "\n";
  }
  */

}

 

 

My average timings and variance for a 4960HQ on my Macbook Pro Retina under Fedora 24, latest kernel as of 10/15/2016:

Compiler: Clang++ 3.8.0

Flags:      -std=c++14 -O3 -march=native

Mesh size in floats: 90000
Scalar translation took 6.08489e-04s +- 0.11032e-04s
Vector translation took 5.82480e-05s +- 0.14391e-05s

 

The short of it is that you can write tighter, denser loops with a little bit of effort. While the latency of each vector add is 3 cycles (and each multiply is 5), multiple iterations can be in flight at once on a single thread, and the vectorized version delivers 8x the throughput of the scalar version without any unrolling. The loop is also small enough to fit in the loop stream detector, which shaves off a few cycles because fetch and decode are bypassed and results are forwarded between iterations.

Assuming you don't run out of memory bandwidth, you can actually do other work on the same core without Hyper-Threading, as long as it doesn't depend on the result of the mesh manipulation. Looking at the Sandy Bridge block diagram, with each clock sustaining both an 8-wide vector multiply and an 8-wide vector add, you can exceed 50 GFLOPS per core on a 2600K, but memory bandwidth will not let you load and store at the roughly 50 GB/s that rate demands without high-end dual-channel DDR3 or a quad-channel configuration. It would be best to wrap this in a C++17-style stackless resumable function and do short bursts of another task whenever more than 3 L3 cache misses happen in a row (track this with a hardware profiler to determine optimal burst lengths).

 

If there is interest, I can go into nuances of leveraging vectorization techniques in conjunction with other data transforms relevant to gaming (though I'm not giving away my AVX ray tracer). I can also look into benchmarking multicore use of this and balancing it out against other tasks to achieve best performance for a given configuration.

42 Comments

51 minutes ago, Prysin said:

@patrickjp93 i think you are overestimating my coding ability grossly. Issue is, i have started at a low level, understanding what it means and does (when i look at it long enough) but i have near zero of the basics, so i cannot do much about it when it comes to using it. That is what i am working on atm, however it's going slow, very very slow, as i am in the middle of buying my own house and moving.

Ah, no pressure. Are you on Windows or Linux (or both)? If Linux, you should have GCC built right in. Just open a text editor, plop this code in, save it as xyz.cpp, go to the save location in the command line, and type

g++ -std=c++14 -march=native -O3 xyz.cpp -o abc

 

Then, assuming no errors:

./abc

 

If you don't have Sandy Bridge, Bulldozer, Jaguar, or later, the program will crash. I can rewrite the intrinsics for SSE if you need.
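In case it helps, here is roughly what the SSE rewrite of the hot loop would look like, as a sketch (mine, assuming a 16-byte-aligned mesh; the remainder cleanup is omitted for brevity):

```cpp
#include <xmmintrin.h> // SSE1 is enough for packed single-precision adds

// SSE sketch of the translate loop: 4 floats per step instead of 8.
// translation holds {x, y, z, x}; lengths that are not a multiple of 4
// would need a cleanup pass like the AVX version's.
void translate_vector_sse(float *mesh, const float *translation, unsigned length)
{
    __m128 trans = _mm_load_ps(translation);
    for (unsigned i = 0; i + 4 <= length; i += 4, mesh += 4)
    {
        _mm_store_ps(mesh, _mm_add_ps(_mm_load_ps(mesh), trans));
        // 4 mod 3 == 1: rotate the pattern by one component for the next chunk
        trans = _mm_shuffle_ps(trans, trans, _MM_SHUFFLE(1, 0, 2, 1));
    }
}
```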

2 minutes ago, Prysin said:

W10, i got TAILS on a memory stick for snoopin around places i shouldnt be

Well, in that case your options are:

 

1) Cygwin-w64 (lightweight Linux emulation layer with GCC 5.3 right now)

2) Download Clang 3.9.0 (pre-built binary) http://llvm.org/releases/download.html (run as clang++ <same flags and files as before>)

3) Visual Studio 2015 community edition.

 

Clang is the least hassle to set up imho.

7 hours ago, patrickjp93 said:

Well, in that case your options are:

 

1) Cygwin-w64 (lightweight Linux emulation layer with GCC 5.3 right now)

2) Download Clang 3.9.0 (pre-built binary) http://llvm.org/releases/download.html (run as clang++ <same flags and files as before>)

3) Visual Studio 2015 community edition.

 

Clang is the least hassle to set up imho.

ill read up on the C and C++ manuals i have for a lil while before testing anything. No offense, but while i know you dont hate me, i dont trust you enough to execute code i dont know what will do....

And atm, i dont have a PSU to power my FX setup.... ill probably get ahold of a CX600M soon enough, and that should let me run my FX all the way up to 4.77GHz just fine. TBH, i dont care if my FX CPU blows up, the mobo should survive anyway (990FX Sabertooth R2.0), and the memory is rock stable (1600MHz Crucial DDR3)... getting a new FX CPU is well, inexpensive compared to a new PC or windows key.

5 hours ago, Prysin said:

ill read up on the C and C++ manuals i have for a lil while before testing anything. No offense, but while i know you dont hate me, i dont trust you enough to execute code i dont know what will do....

And atm, i dont have a PSU to power my FX setup.... ill probably get ahold of a CX600M soon enough, and that should let me run my FX all the way up to 4.77GHz just fine. TBH, i dont care if my FX CPU blows up, the mobo should survive anyway (990FX Sabertooth R2.0), and the memory is rock stable (1600MHz Crucial DDR3)... getting a new FX CPU is well, inexpensive compared to a new PC or windows key.

Just Google the intrinsic functions. You can see what the scalar code does just fine.

 

_mm256_add_ps(a, b) adds two vectors together using packed single-precision floating-point math. Load and store move 256 bits, in the form of 8 floats, between memory and a register. Other than that, you see me iterating the pointer forward by 8 (the compiler knows I'm using floats, so it advances 32 bytes, i.e. 256 bits). You can do the math and see that it stays in bounds of the array I declared. The only thing that should confuse you is the permutation function; if you look it up, all it does is reorder elements in a vector register.

 

Would you like a video demo proving it isn't evil?

Just now, patrickjp93 said:

Just Google the intrinsic functions. You can see what the scalar code does just fine.

 

_mm256_add_ps(a, b) adds two vectors together using packed single-precision floating-point math. Load and store move 256 bits, in the form of 8 floats, between memory and a register. Other than that, you see me iterating the pointer forward by 8 (the compiler knows I'm using floats, so it advances 32 bytes, i.e. 256 bits). You can do the math and see that it stays in bounds of the array I declared. The only thing that should confuse you is the permutation function; if you look it up, all it does is reorder elements in a vector register.

 

Would you like a video demo proving it isn't evil?

no i would like to get my FX running, get a new chassis for my main PC, move out, get shit sorted , get internet in new home, and yadda yadda.

 

first, ima go to sleep. see you in 5 hours.

2 minutes ago, Prysin said:

no i would like to get my FX running, get a new chassis for my main PC, move out, get shit sorted , get internet in new home, and yadda yadda.

 

first, ima go to sleep. see you in 5 hours.

Haha, fair enough. Sleep well Prysin.


todays question. what is better.

thermal throttling on unstable/damaged mobo 4770k

lucky chip FX 8320 on rock solid TUF board hitting 4.77GHz on air.....

13 hours ago, Prysin said:

todays question. what is better.

thermal throttling on unstable/damaged mobo 4770k

lucky chip FX 8320 on rock solid TUF board hitting 4.77GHz on air.....

Uh, for this? The 4770K most likely.

1 hour ago, patrickjp93 said:

Uh, for this? The 4770K most likely.

we will never know i guess. Because the guy who owns the 4770k is a pleb that can breaks all his shit.

 

That being said, the code is float dependent, so yes should be faster as the FX is only a quad core under those workloads.

1 minute ago, Prysin said:

we will never know i guess. Because the guy who owns the 4770k is a pleb that can breaks all his shit.

 

That being said, the code is float dependent, so yes should be faster as the FX is only a quad core under those workloads.

I mean, it would be interesting to see how the FX handles it, and since there's no second core contending for the module's shared FPU, it should run without impediment.

4 minutes ago, patrickjp93 said:

I mean, it would be interesting to see how the FX handles it, and since there's no second core contending for the module's shared FPU, it should run without impediment.

i can have my FX up and running on a 280mm watercooler later today or tomorrow i guess. Aslong as the code isnt as taxing as Prime95 FFT, it should be totally fine.

1 minute ago, Prysin said:

i can have my FX up and running on a 280mm watercooler later today or tomorrow i guess. Aslong as the code isnt as taxing as Prime95 FFT, it should be totally fine.

It's only on one core, and it's only doing a load, an add, a store, and a shuffle. If my MacBook Pro Retina 4960HQ can handle it with no thermal throttling, the FX should be able to handle it.

7 minutes ago, patrickjp93 said:

It's only on one core, and it's only doing a load, an add, a store, and a shuffle. If my MacBook Pro Retina 4960HQ can handle it with no thermal throttling, the FX should be able to handle it.

oh it will handle it, the exciting part is "how fast"....  Could also be interesting to test vs steamroller (CBA to disassemble my mini PC to test excavator atm)

 

could you add a "stopwatch" function to the equation you made in order to see how fast it executes.

 

it would be interesting in order to see how well does this code scale with MHz vs IPC (we know IPC always matters, but which matters most here? Overclockers would prob love to know)

29 minutes ago, Prysin said:

oh it will handle it, the exciting part is "how fast"....  Could also be interesting to test vs steamroller (CBA to disassemble my mini PC to test excavator atm)

 

could you add a "stopwatch" function to the equation you made in order to see how fast it executes.

 

it would be interesting in order to see how well does this code scale with MHz vs IPC (we know IPC always matters, but which matters most here? Overclockers would prob love to know)

The second example has the function calls bound by timers. Do you mean put a stopwatch in each iteration of the loops? That's not going to tell you much, since the scalar loop runs 8x as many iterations as the vector one.

7 hours ago, Prysin said:

no from start to stop of loop.

 

Say 25 loops? End result: how long you take to run 25 AVX loops.

 

The second spoiler in the entry above has what you want. Change the size of the mesh to 200 floats and you'll get exactly 25 loop iterations. There may be a minimum size below which the two solutions cross back over, but bear in mind a game has multiple meshes that are usually kept in one contiguous group in memory, so my workload size is the more representative test: the transform function gets inlined and all the meshes are transformed one after another.
