
Model Matrix and Vector Transforms Optimized By SIMD

patrickjp93


I'm sick to death of people telling me "if it was so easy, the game devs would have done it by now. They know better than you do."

 

Here is visible, incontrovertible proof that the games industry can get a huge boost from taking advantage of SIMD today, especially when games require Sandy Bridge or later hardware (meaning AVX is available, but not AVX2 for our purposes).

 

First Example: Mesh Transform By Translation Using AVX Intrinsics

 

Example updated and trimmed for readability.


#include <cstdalign>
#include <iostream>
#include <chrono>
#include <ctime>
#include <x86intrin.h>

using uint = unsigned int; //not guaranteed by the standard headers

//Size chosen because 90,000 floats = 30,000 x/y/z vertices, a medium-high
//budget for modern prominent characters
const uint size = 90000;
alignas(32) const float Mat3T[8]    = {1.0f, 2.0f, 3.0f, 1.0f,
                                       2.0f, 3.0f, 1.0f, 2.0f};
alignas(32)       float Mesh[size]  = {};

void translate_scalar(float *Mesh, const float *translation, const uint length)
{
  for(uint i = 0; i < length; i+=3)
  {
    Mesh[i]   += translation[0];
    Mesh[i+1] += translation[1];
    Mesh[i+2] += translation[2];
  }
}

void translate_vector(float *Mesh, const float *translation, const uint length)
{
  __m256 trans = _mm256_load_ps(translation);
  
  //i indexes the last float of the current 8-wide chunk, so i < length
  //guarantees the whole chunk stays in bounds
  uint i = 7;
  for(; i < length; i += 8, Mesh += 8)
  {
    __m256 verts = _mm256_load_ps(Mesh);
    verts        = _mm256_add_ps(verts, trans);
    _mm256_store_ps(Mesh, verts);
    
    trans = _mm256_permute_ps(trans, _MM_SHUFFLE(2, 1, 0, 2));
  }

  

  //Cleanup loop for cases where length is not a multiple of 8.
  //The main loop handled i - 7 floats, so length - (i - 7) remain, and
  //trans already holds the right component pattern for them.
  uint diff = 7 - (i - length);
  if(diff != 0)
  {
    alignas(32) float temp[8] = {};
    _mm256_store_ps(temp, trans);
    for(uint j = 0; j < diff; ++j) { Mesh[j] += temp[j]; }
  }
}

int main()
{
  using namespace std::chrono;
  std::cout << "Mesh size in floats: " << size << "\n";
  high_resolution_clock::time_point start, end;
  
  start = high_resolution_clock::now();
  translate_scalar(Mesh, Mat3T, size);
  end = high_resolution_clock::now();

  duration<double> time_span = duration_cast<duration<double>>(end - start);
  std::cout << "Scalar translation took " << time_span.count() << "s\n";


  
  start = high_resolution_clock::now();
  translate_vector(Mesh, Mat3T, size);
  end = high_resolution_clock::now();

  duration<double> time_span2 = duration_cast<duration<double>>(end - start);
  std::cout << "Vector translation took " << time_span2.count() << "s\n";

  /*//This will double-check your work.
  for(uint i = 0; i < size; i += 3)
  {
    std::cout << Mesh[i] << ", " << Mesh[i+1] << ", " << Mesh[i+2] << "\n";
  }
  */

}

 

 

My average timings and variance for a 4960HQ on my Macbook Pro Retina under Fedora 24, latest kernel as of 10/15/2016:

Compiler: Clang++ 3.8.0

Flags:      -std=c++14 -O3 -march=native

Mesh size in floats: 90000
Scalar translation took 6.08489e-04s +- 0.11032e-04s
Vector translation took 5.82480e-05s +- 0.14391e-05s

 

The short of it is that you can write tighter, denser loops with a little bit of effort. While the latency of each vector add is 3 cycles (and each multiply is 5), multiple iterations can be in flight at once on a single thread, and the vectorized version delivers 8x the throughput of the scalar version without any unrolling. The loop is also small enough to fit in the loop stream detector, which shaves off a few cycles because fetch and decode are bypassed and results are forwarded between iterations.

Assuming you don't run out of memory bandwidth, you can actually do other work on the same core without Hyper-Threading, as long as it doesn't depend on the result of the mesh manipulation. Looking at the Sandy Bridge block diagram, with each clock sustaining both an 8-wide vector multiply and an 8-wide vector add, you can exceed 50 GFLOPS per core on a 2600K, but memory bandwidth will not let you load and store at the roughly 50 GB/s that rate demands without high-end dual-channel DDR3 or a quad-channel configuration. It would be best to wrap this in a C++17-style stackless resumable function and do short bursts of another task whenever more than 3 L3 cache misses happen in a row (track this with a hardware profiler to determine optimal burst lengths).

 

If there is interest, I can go into nuances of leveraging vectorization techniques in conjunction with other data transforms relevant to gaming (though I'm not giving away my AVX ray tracer). I can also look into benchmarking multicore use of this and balancing it out against other tasks to achieve best performance for a given configuration.

42 Comments

51 minutes ago, Prysin said:

@patrickjp93 i think you are overestimating my coding ability grossly. Issue is, i have started at a low level, understanding what it means and does (when i look at it long enough) but i have near zero of the basics, so i cannot do much about it when it comes to using it. That is what i am working on atm, however it's going slow, very very slow, as i am in the middle of buying my own house and moving.

Ah, no pressure. Are you on Windows or Linux (or both)? If Linux, you should have GCC built right in. Just open a text editor, plop this code in, save it as xyz.cpp, go to the save location in the command line, and type

g++ -std=c++14 -march=native -O3 xyz.cpp -o abc

 

Then, assuming no errors:

./abc

 

If you don't have Sandy Bridge, Bulldozer, Jaguar, or later, the program will crash. I can rewrite the intrinsics for SSE if you need.
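In case it helps, here is roughly what the SSE rewrite of the hot loop would look like, as a sketch (mine, assuming a 16-byte-aligned mesh; the remainder cleanup is omitted for brevity):

```cpp
#include <xmmintrin.h> // SSE1 is enough for packed single-precision adds

// SSE sketch of the translate loop: 4 floats per step instead of 8.
// translation holds {x, y, z, x}; lengths that are not a multiple of 4
// would need a cleanup pass like the AVX version's.
void translate_vector_sse(float *mesh, const float *translation, unsigned length)
{
    __m128 trans = _mm_load_ps(translation);
    for (unsigned i = 0; i + 4 <= length; i += 4, mesh += 4)
    {
        _mm_store_ps(mesh, _mm_add_ps(_mm_load_ps(mesh), trans));
        // 4 mod 3 == 1: rotate the pattern by one component for the next chunk
        trans = _mm_shuffle_ps(trans, trans, _MM_SHUFFLE(1, 0, 2, 1));
    }
}
```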

2 minutes ago, Prysin said:

W10, i got TAILS on a memory stick for snoopin around places i shouldnt be

Well, in that case your options are:

 

1) Cygwin-w64 (lightweight Linux emulation layer with GCC 5.3 right now)

2) Download Clang 3.9.0 (pre-built binary) http://llvm.org/releases/download.html (run as clang++ <same flags and files as before>)

3) Visual Studio 2015 community edition.

 

Clang is the least hassle to set up imho.

7 hours ago, patrickjp93 said:

Well, in that case your options are:

 

1) Cygwin-w64 (lightweight Linux emulation layer with GCC 5.3 right now)

2) Download Clang 3.9.0 (pre-built binary) http://llvm.org/releases/download.html (run as clang++ <same flags and files as before>)

3) Visual Studio 2015 community edition.

 

Clang is the least hassle to set up imho.

ill read up on the C and C++ manuals i have for a lil while before testing anything. No offense, but while i know you dont hate me, i dont trust you enough to execute code i dont know what will do....

And atm, i dont have a PSU to power my FX setup.... ill probably get ahold of a CX600M soon enough, and that should let me run my FX all the way up to 4.77GHz just fine. TBH, i dont care if my FX CPU blows up, the mobo should survive anyway (990FX Sabertooth R2.0), and the memory is rock stable (1600MHz Crucial DDR3)... getting a new FX CPU is well, inexpensive compared to a new PC or windows key.

5 hours ago, Prysin said:

ill read up on the C and C++ manuals i have for a lil while before testing anything. No offense, but while i know you dont hate me, i dont trust you enough to execute code i dont know what will do....

And atm, i dont have a PSU to power my FX setup.... ill probably get ahold of a CX600M soon enough, and that should let me run my FX all the way up to 4.77GHz just fine. TBH, i dont care if my FX CPU blows up, the mobo should survive anyway (990FX Sabertooth R2.0), and the memory is rock stable (1600MHz Crucial DDR3)... getting a new FX CPU is well, inexpensive compared to a new PC or windows key.

Just Google the intrinsic functions. You can see what the scalar code does just fine.

 

_mm256_add_ps(a, b) adds two vectors together using packed single-precision floating-point math. Load and store move 256 bits, in the form of 8 floats, between memory and a register. Other than that, you see me iterating the pointer forward by 8 (the compiler knows I'm using floats, so it advances 32 bytes, i.e. 256 bits). You can do the math and see that it stays in bounds of the array I declared. The only thing that should confuse you is the permutation function; if you look it up, all it does is reorder elements in a vector register.

 

Would you like a video demo proving it isn't evil?

Just now, patrickjp93 said:

Just Google the intrinsic functions. You can see what the scalar code does just fine.

 

_mm256_add_ps(a, b) adds two vectors together using packed single-precision floating-point math. Load and store move 256 bits, in the form of 8 floats, between memory and a register. Other than that, you see me iterating the pointer forward by 8 (the compiler knows I'm using floats, so it advances 32 bytes, i.e. 256 bits). You can do the math and see that it stays in bounds of the array I declared. The only thing that should confuse you is the permutation function; if you look it up, all it does is reorder elements in a vector register.

 

Would you like a video demo proving it isn't evil?

no i would like to get my FX running, get a new chassis for my main PC, move out, get shit sorted , get internet in new home, and yadda yadda.

 

first, ima go to sleep. see you in 5 hours.

2 minutes ago, Prysin said:

no i would like to get my FX running, get a new chassis for my main PC, move out, get shit sorted , get internet in new home, and yadda yadda.

 

first, ima go to sleep. see you in 5 hours.

Haha, fair enough. Sleep well Prysin.


todays question. what is better.

thermal throttling on unstable/damaged mobo 4770k

lucky chip FX 8320 on rock solid TUF board hitting 4.77GHz on air.....

13 hours ago, Prysin said:

todays question. what is better.

thermal throttling on unstable/damaged mobo 4770k

lucky chip FX 8320 on rock solid TUF board hitting 4.77GHz on air.....

Uh, for this? The 4770K most likely.

1 hour ago, patrickjp93 said:

Uh, for this? The 4770K most likely.

we will never know i guess. Because the guy who owns the 4770k is a pleb that can breaks all his shit.

 

That being said, the code is float dependent, so yes should be faster as the FX is only a quad core under those workloads.

1 minute ago, Prysin said:

we will never know i guess. Because the guy who owns the 4770k is a pleb that can breaks all his shit.

 

That being said, the code is float dependent, so yes should be faster as the FX is only a quad core under those workloads.

I mean, it would be interesting to see how the FX handles it, and since there's no second core contending for the module's shared FPU, it should run without impediment.

4 minutes ago, patrickjp93 said:

I mean, it would be interesting to see how the FX handles it, and since there's no second core contending for the module's shared FPU, it should run without impediment.

i can have my FX up and running on a 280mm watercooler later today or tomorrow i guess. Aslong as the code isnt as taxing as Prime95 FFT, it should be totally fine.

1 minute ago, Prysin said:

i can have my FX up and running on a 280mm watercooler later today or tomorrow i guess. Aslong as the code isnt as taxing as Prime95 FFT, it should be totally fine.

It's only on one core, and it's only doing a load, an add, a store, and a shuffle. If my MacBook Pro Retina 4960HQ can handle it with no thermal throttling, the FX should be able to handle it.

7 minutes ago, patrickjp93 said:

It's only on one core, and it's only doing a load, an add, a store, and a shuffle. If my MacBook Pro Retina 4960HQ can handle it with no thermal throttling, the FX should be able to handle it.

oh it will handle it, the exciting part is "how fast"....  Could also be interesting to test vs steamroller (CBA to disassemble my mini PC to test excavator atm)

 

could you add a "stopwatch" function to the equation you made in order to see how fast it executes.

 

it would be interesting in order to see how well does this code scale with MHz vs IPC (we know IPC always matters, but which matters most here? Overclockers would prob love to know)

29 minutes ago, Prysin said:

oh it will handle it, the exciting part is "how fast"....  Could also be interesting to test vs steamroller (CBA to disassemble my mini PC to test excavator atm)

 

could you add a "stopwatch" function to the equation you made in order to see how fast it executes.

 

it would be interesting in order to see how well does this code scale with MHz vs IPC (we know IPC always matters, but which matters most here? Overclockers would prob love to know)

The second example has the function calls bound by timers. Do you mean put a stopwatch in each iteration of the loops? That's not going to tell you much, since the scalar loop runs 8x as many iterations as the vector one.

7 hours ago, Prysin said:

no from start to stop of loop.

 

Say 25 loops? End result: how long you take to run 25 AVX loops.

 

The second spoiler in the entry above has what you want. Change the size of the mesh to 200 floats and you'll get exactly 25 loop iterations. There may be a minimum size below which the two solutions cross back over, but bear in mind a game has multiple meshes that are usually kept in one contiguous group in memory, so my workload size is the more representative test: the transform function gets inlined and all the meshes are transformed one after another.
