
Model Matrix and Vector Transforms Optimized By SIMD

patrickjp93


I'm sick to death of people telling me "if it was so easy, the game devs would have done it by now. They know better than you do."

 

Here is visible, incontrovertible proof that the games industry can get a huge boost by taking advantage of SIMD today, especially when games already require Sandy Bridge or later hardware (meaning AVX is available, but not AVX2, for our purposes).

 

First Example: Mesh Transform By Translation Using AVX Intrinsics

 

Example updated and trimmed for readability.


#include <cstdalign>
#include <iostream>
#include <chrono>
#include <x86intrin.h>

using uint = unsigned int; //not a standard type, so define it explicitly

//Size chosen because 30,000 triangles is considered medium-high for modern prominent characters
const uint size = 90000;
alignas(32) const float Mat3T[8]    = {1.0f, 2.0f, 3.0f, 1.0f,
                                       2.0f, 3.0f, 1.0f, 2.0f};
alignas(32)       float Mesh[size]  = {};

void translate_scalar(float *Mesh, const float *translation, const uint length)
{
  for(uint i = 0; i < length; i+=3)
  {
    Mesh[i]   += translation[0];
    Mesh[i+1] += translation[1];
    Mesh[i+2] += translation[2];
  }
}

void translate_vector(float *Mesh, const float *translation, const uint length)
{
  __m256 trans = _mm256_load_ps(translation);

  //Process whole 8-float blocks; stop before we would run past the end
  uint i = 0;
  for(; i + 8 <= length; i += 8, Mesh += 8)
  {
    __m256 verts = _mm256_load_ps(Mesh);
    verts        = _mm256_add_ps(verts, trans);
    _mm256_store_ps(Mesh, verts);

    //Rotate the period-3 translation pattern by 8 mod 3 = 2 lanes
    trans = _mm256_permute_ps(trans, _MM_SHUFFLE(2, 1, 0, 2));
  }

  //Cleanup loop for cases where length is not a multiple of 8
  uint diff = length - i;
  if(diff != 0)
  {
    alignas(32) float temp[8] = {};
    _mm256_store_ps(temp, trans);
    for(uint j = 0; j < diff; ++j) { Mesh[j] += temp[j]; }
  }
}

int main()
{
  using namespace std::chrono;
  std::cout << "Mesh size in floats: " << size << "\n";
  high_resolution_clock::time_point start, end;
  
  start = high_resolution_clock::now();
  translate_scalar(Mesh, Mat3T, size);
  end = high_resolution_clock::now();

  duration<double> time_span = duration_cast<duration<double>>(end - start);
  std::cout << "Scalar translation took " << time_span.count() << "s\n";


  
  start = high_resolution_clock::now();
  translate_vector(Mesh, Mat3T, size);
  end = high_resolution_clock::now();

  duration<double> time_span2 = duration_cast<duration<double>>(end - start);
  std::cout << "Vector translation took " << time_span2.count() << "s\n";

  /*//This will double-check your work.
  for(uint i = 0; i < size; i += 3)
  {
    std::cout << Mesh[i] << ", " << Mesh[i+1] << ", " << Mesh[i+2] << "\n";
  }
  */

}

 

 

My average timings and variance for a 4960HQ on my MacBook Pro Retina under Fedora 24, latest kernel as of 10/15/2016:

Compiler: Clang++ 3.8.0

Flags:      -std=c++14 -O3 -march=native

Mesh size in floats: 90000
Scalar translation took 6.08489e-04s +- 0.11032e-04s
Vector translation took 5.82480e-05s +- 0.14391e-05s

 

The short of it is that you can write tighter, denser loops with a little effort. Each vector add has a latency of 3 cycles and each multiply 5, but multiple iterations can be in flight at once on a single thread, so the vectorized version achieves 8x the throughput of the scalar version without any unrolling. The loop is also small enough to fit in the loop stream detector, which shaves off a few more cycles because instruction fetch and decode are bypassed and results are forwarded between iterations.

Assuming you don't run out of memory bandwidth, you can do other work on the same core without Hyper-Threading, as long as it does not depend on the result of the mesh manipulation. Looking at the Sandy Bridge block diagram, with each clock sustaining both an 8-wide vector multiply and an 8-wide vector add, you can exceed 50 GFLOPS per core on a 2600K, but the memory system cannot load and store operands as fast as you can consume and produce them at that rate (roughly 50 GB/s) without high-end dual-channel DDR3 or a quad-channel configuration. One way to hide the stalls would be a C++17 stackless resumable function that encapsulates this loop and runs short bursts of another task whenever more than three L3 cache misses happen in a row (a hardware profiler can be used to determine optimal burst lengths).

 

If there is interest, I can go into nuances of leveraging vectorization techniques in conjunction with other data transforms relevant to gaming (though I'm not giving away my AVX ray tracer). I can also look into benchmarking multicore use of this and balancing it out against other tasks to achieve best performance for a given configuration.

42 Comments

2 minutes ago, Prysin said:

i got a x4 845 Excavator CPU if you need to run some code test seeing as you ahve no measurements for Excavator

I'm actually very surprised Agner hasn't gotten around to it yet. He's usually very punctual.

 

You need a hardware profiler to measure both latency and throughput. Unless you have one of those $50,000 machines laying around, your measurements wouldn't be worth anything, but thanks for the offer.


while i understand your desire to use AVX it would wreck performance on laptops and other TDP constrained devices unless switching between AVX and "normal" coding is done often enough (intentionally) to keep the heat buildup in check. This will again tank the "potential" of said iteration.

 

Also struggling to understand some parts, need to read more up on coding.

Just now, patrickjp93 said:

I'm actually very surprised Agner hasn't gotten around to it yet. He's usually very punctual.

 

You need a hardware profiler to measure both latency and throughput. Unless you have one of those $50,000 machines laying around, your measurements wouldn't be worth anything, but thanks for the offer.

lets build one.

 

to quote Jeremy Clarkson

"How hard can it be?"


technically, couldnt you build such a test using a barebones Linux kernel (that barely get you past post, run test, print result on screen and do not terminate unless you type in exit) ???

 

at a kernel level you shouldnt have too much interference from anything else to ruin results.

Just now, Prysin said:

while i understand your desire to use AVX it would wreck performance on laptops and other TDP constrained devices unless switching between AVX and "normal" coding is done often enough (intentionally) to keep the heat buildup in check. This will again tank the "potential" of said iteration.

 

Also struggling to understand some parts, need to read more up on coding.

AVX only gets hot when you abuse the crap out of it like in IBT. When it's just a couple of the same ops with every pass (not like the FFTs where more than 2/3 of the AVX instruction set gets used), my MacBook Pro Retina doesn't even spin up the fan.

 

https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=AVX

 

That's the shitty documentation you have to use to get started. It makes more sense to hardware engineers than software devs for sure.


does your Macbook have a fan?

 

also that intel document makes no sense. The explanations is so barebones you cannot even follow it with basic logic.
 

needs more AVX for dummies

1 minute ago, Prysin said:

technically, couldnt you build such a test using a barebones Linux kernel (that barely get you past post, run test, print result on screen and do not terminate unless you type in exit) ???

 

at a kernel level you shouldnt have too much interference from anything else to ruin results.

Maybe. I haven't done benchmarking at such a low level. However, I'm pretty sure that to test individual instruction latencies and throughputs you're going to need direct hardware connection profiling tools that can sniff cache and registers just as fast as they refresh. They cost a ton of money to buy, and you have to get the pinout documentation for the pads if you want to build one.

1 minute ago, Prysin said:

does your Macbook have a fan?

 

also that intel document makes no sense. The explanations is so barebones you cannot even follow it with basic logic.
 

needs more AVX for dummies

LOL!

 

Well, this is the best starter for the visual learner, but it was abandoned at SSE3 because Intel started releasing 50 then 100 then gargantuanly more new instructions every year.

http://www.tommesani.com/index.php/simd/46-sse-arithmetic.html

 

This is also pretty good to learn from, but it doesn't cover more than 5% of the AVX extensions.

http://www.codeproject.com/Articles/874396/Crunching-Numbers-with-AVX-and-AVX

18 minutes ago, patrickjp93 said:

Maybe. I haven't done benchmarking at such a low level. However, I'm pretty sure to test individual instruction latencies and throughput's you're going to need direct hardware connection profiling tools that can sniff cache and registers just as fast as they refresh. They cost a ton of money to buy, and you have to get the pinout documentation for the pads if you want to build one.

meh, thats not hard to do. you can do that simply by using a PLC, voltage booster chips (you need to boost the signal up to around 10v) and some tiny wires. gonna take a week to build the shit. In the end, it would cost like 5-7000$ to jerry rig it. Would invite a lot of manual labor.

But hardware wise, its not hard to do. The majority of the cost would be the PLC and ofc, the shitload of time you would need to solder tiny wires to the pins. you wouldnt be able to measure cache issues, BUT you could home in on other signals. Like memory fetch signals. and look for how long it takes to execute a code on hardware A, then extrapolate the performance based on known numbers and the differential

 

EDIT:
luckily we have both Piledriver and Steamroller chips that work with FM2+... so its totally possible to test.

 

I just dont have 5-7k USD on hand atm

1 minute ago, Prysin said:

meh, thats not hard to do. you can do that simply by using a PLC, voltage booster chips (you need to boost the signal up to around 10v) and some tiny wires. gonna take a week to build the shit. In the end, it would cost like 5-7000$ to jerry rig it. Would invite a lot of manual labor.

But hardware wise, its not hard to do. The majority of the cost would be the PLC and ofc, the shitload of time you would need to solder tiny wires to the pins.

I'd give you a thumbs up, but the lazy bums at @LinusTech haven't implemented it yet for blogs! ;)

48 minutes ago, patrickjp93 said:

I'd give you a thumbs up, but the lazy bums at @LinusTech haven't implemented it yet for blogs! ;)

atleast they arent much worse then your average game dev

2 minutes ago, patrickjp93 said:

Meta reply is meta. :D

get on skype :P

 

the longer you procrastrinate, the worse the skypelag

4 minutes ago, Prysin said:

get on skype :P

 

the longer you procrastrinate, the worse the skypelag

I have to call it a night. It's 1:30 AM my time now, and I get up at 6:45 for work.


oh, BTW patrick

your code. Does it take into consideration that CMT based AMD CPUs uses 2x 128bit vector units to get 256 AVX?

Just now, patrickjp93 said:

I have to call it a night. It's 1:30 AM my time now, and I get up at 6:45 for work.

weaksauce... i make fun of Rasmus all day long, go to bed around midnight and get up at 4:55 every damn fucking day

9 hours ago, MageTank said:

Sorry, I am confused as to how these blog entries work. What exactly am i getting in on? 

You're getting on the hype train

16 hours ago, MageTank said:

Sorry, I am confused as to how these blog entries work. What exactly am i getting in on? 

The current one. Just open the hidden section up top. I'm looking for critique, (dis)agreement, suggestions, etc.


@Prysin I updated the code and provided a trimmed-down version for performance testing. Would you mind collecting some hard numbers for me? Pick whatever compiler. I've been using Clang 3.8 with -O3 enabled. Find the optimization level that works best if you use GCC or ICC. If you use MSVC, you'll have to do the research yourself on what combo of flags you'll need.

 

My average timings and variance for a 4960HQ on my Macbook Pro Retina under Fedora 24, latest kernel as of 10/15/2016:

Compiler: Clang++ 3.8.0

Flags:      -std=c++14 -O3

Mesh size in floats: 90000
Scalar translation took 6.08489e-04s +- 0.11032e-04s
Vector translation took 5.82480e-05s +- 0.14391e-05s

 

Basically, even with the bandwidth bottleneck, 10x performance improvement over scalar code. And even if I loop 3 billion times over the same mesh, I don't thermal throttle under AVX. Mind you, it's only addition, but still.


@patrickjp93 i think you are overestimating my coding ability grossly. Issue is, i have started at a low level, understanding what it means and does (when i look at it long enough) but i have near zero of the basics, so i cannot do much about it when it comes to using it. That is what i am working on atm, however it's going slow, very very slow, as i am in the middle of buying my own house and moving.
