Jump to content
  • entries
    4
  • comments
    66
  • views
    1,906

Model Matrix and Vector Transforms Optimized By SIMD

I'm sick to death of people telling me "if it was so easy, the game devs would have done it by now. They know better than you do."

 

Here is visible, incontrovertible proof that the games industry can get a huge boost from taking advantage of SIMD today, especially when games require Sandy Bridge or later hardware (meaning AVX is available, but not AVX2 for our purposes).

 

First Example: Mesh Transform By Translation Using AVX Intrinsics

 

Example updated and trimmed for readability.

Spoiler

#include <cstdalign>
#include <iostream>
#include <chrono>
#include <ctime>
#include <x86intrin.h>

//Size chosen because 30,000 triangles is considered medium-high for modern prominent characters
const uint size = 90000;
alignas(32) const float Mat3T[8]    = {1.0f, 2.0f, 3.0f, 1.0f,
                                       2.0f, 3.0f, 1.0f, 2.0f};
alignas(32)       float Mesh[size]  = {};

void translate_scalar(float *Mesh, const float *translation, const int length)
{
  for(uint i = 0; i < length; i+=3)
  {
    Mesh[i]   += translation[0];
    Mesh[i+1] += translation[1];
    Mesh[i+2] += translation[2];
  }
}

void translate_vector(float *Mesh, const float *translation, const uint length)
{
  __m256 trans = _mm256_load_ps(translation);
  
  //we stay 8 ahead in count so we don't go out of bounds
  uint i = 7;
  for(; i < length; i += 8, Mesh += 8)
  {
    __m256 verts = _mm256_load_ps(Mesh);
    verts        = _mm256_add_ps(verts, trans);
    _mm256_store_ps(Mesh, verts);
    
    trans = _mm256_permute_ps(trans, _MM_SHUFFLE(2, 1, 0, 2));
  }

  

  //Cleanup loop for cases where length is not a multiple of 8
  uint diff = 8 - (i - length);
  if( diff != 0)
  {
    float temp[8] = {};
    _mm256_store_ps(temp, trans);
    //for(uint j = 0; j < diff; ++j) { Mesh[j] += temp[j]; }
    while(diff != 0)
    {
      *Mesh += temp[7-diff];
      //temp++;
      Mesh++;
      diff--;
    }
  } 
}

int main()
{
  using namespace std::chrono;
  std::cout << "Mesh size in floats: " << size << "\n";
  high_resolution_clock::time_point start, end;
  
  start = high_resolution_clock::now();
  translate_scalar(Mesh, Mat3T, size);
  end = high_resolution_clock::now();

  duration<double> time_span = duration_cast<duration<double>>(end - start);
  std::cout << "Scalar translation took " << time_span.count() << "s\n";


  
  start = high_resolution_clock::now();
  translate_vector(Mesh, Mat3T, size);
  end = high_resolution_clock::now();

  duration<double> time_span2 = duration_cast<duration<double>>(end - start);
  std::cout << "Vector translation took " << time_span2.count() << "s\n";

  /*//This will double-check your work.
  for(uint i = 0; i < size; i += 3)
  {
    std::cout << Mesh[i] << ", " << Mesh[i+1] << ", " << Mesh[i+2] << "\n";
  }
  */

}

 

 

My average timings and variance for a 4960HQ on my Macbook Pro Retina under Fedora 24, latest kernel as of 10/15/2016:

Compiler: Clang++ 3.8.0

Flags:      -std=c++14 -O3 -march=native

Mesh size in floats: 90000
Scalar translation took 6.08489e-04s +- 0.11032e-04s
Vector translation took 5.82480e-05s +- 0.14391e-05s

 

The short of it is you can write tighter, denser loops with a little bit of effort. While the latency for each vector add is 3 cycles and each multiplication is 5, multiple iterations can be in flight at once on a single thread. The throughput for the vectorized version is 8x the scalar version without any unrolling. Thus, the loop can also easily fit into the small loop detector which can shave off some cycles due to prefetch removal and result forwarding between iterations. Assuming you don't run out of memory bandwidth, you can actually do other tasks on this same core without using hyper threading as long as they do not depend on the result of the mesh manipulation. Looking at the SB block diagram, with each clock achieving both an 8-wide vector multiplication and 8-wide vector addition, you can achieve more than 50GFlops per core on a 2600K, but the memory bandwidth will not allow you to load and store the results as quickly as you can request and produce them at a rate of 50GB/s without high-end dual-channel DDR3 or a quad-channel configuration. It would be best to use a C++ 17 stack-less resumable function to encapsulate this and do short bursts of another task when more than 3 L3 cache misses happen in a row (this can be tracked with a hardware profiler to determine optimal burst lengths).

 

If there is interest, I can go into nuances of leveraging vectorization techniques in conjunction with other data transforms relevant to gaming (though I'm not giving away my AVX ray tracer). I can also look into benchmarking multicore use of this and balancing it out against other tasks to achieve best performance for a given configuration.

42 Comments

Prysin

Posted

i got a x4 845 Excavator CPU if you need to run some code test seeing as you ahve no measurements for Excavator

patrickjp93
· Banned

Posted

2 minutes ago, Prysin said:

i got a x4 845 Excavator CPU if you need to run some code test seeing as you ahve no measurements for Excavator

I'm actually very surprised Agner hasn't gotten around to it yet. He's usually very punctual.

 

You need a hardware profiler to measure both latency and throughput. Unless you have one of those $50,000 machines laying around, your measurements wouldn't be worth anything, but thanks for the offer.

Prysin

Posted

while i understand your desire to use AVX it would wreck performance on laptops and other TDP constrained devices unless switching between AVX and "normal" coding is done often enough (intentionally) to keep the heat buildup in check. This will again tank the "potential" of said iteration.

 

Also struggling to understand some parts, need to read more up on coding.

Prysin

Posted

Just now, patrickjp93 said:

I'm actually very surprised Agner hasn't gotten around to it yet. He's usually very punctual.

 

You need a hardware profiler to measure both latency and throughput. Unless you have one of those $50,000 machines laying around, your measurements wouldn't be worth anything, but thanks for the offer.

lets build one.

 

to quote Jeremy Clarkson

"How hard can it be?"

Prysin

Posted

technically, couldnt you build such a test using a barebones Linux kernel (that barely get you past post, run test, print result on screen and do not terminate unless you type in exit) ???

 

at a kernel level you shouldnt have too much interference from anything else to ruin results.

patrickjp93
· Banned

Posted

Just now, Prysin said:

while i understand your desire to use AVX it would wreck performance on laptops and other TDP constrained devices unless switching between AVX and "normal" coding is done often enough (intentionally) to keep the heat buildup in check. This will again tank the "potential" of said iteration.

 

Also struggling to understand some parts, need to read more up on coding.

AVX only gets hot when you abuse the crap out of it like in IBT. When it's just a couple of the same ops with every pass (not like the FFTs where more than 2/3 of the AVX instruction set gets used), my MacBook Pro Retina doesn't even spin up the fan.

 

https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=AVX

 

That's the shitty documentation you have to use to get started. It makes more sense to hardware engineers than software devs for sure.

Prysin

Posted

does your Macbook have a fan?

 

also that intel document makes no sense. The explanations is so barebones you cannot even follow it with basic logic.
 

needs more AVX for dummies

patrickjp93
· Banned

Posted

1 minute ago, Prysin said:

technically, couldnt you build such a test using a barebones Linux kernel (that barely get you past post, run test, print result on screen and do not terminate unless you type in exit) ???

 

at a kernel level you shouldnt have too much interference from anything else to ruin results.

Maybe. I haven't done benchmarking at such a low level. However, I'm pretty sure to test individual instruction latencies and throughput's you're going to need direct hardware connection profiling tools that can sniff cache and registers just as fast as they refresh. They cost a ton of money to buy, and you have to get the pinout documentation for the pads if you want to build one.

patrickjp93
· Banned

Posted

1 minute ago, Prysin said:

does your Macbook have a fan?

 

also that intel document makes no sense. The explanations is so barebones you cannot even follow it with basic logic.
 

needs more AVX for dummies

LOL!

 

Well, this is the best starter for the visual learner, but it was abandoned at SSE3 because Intel started releasing 50 then 100 then gargantuanly more new instructions every year.

http://www.tommesani.com/index.php/simd/46-sse-arithmetic.html

 

This is also pretty good to learn from, but it doesn't cover more than 5% of the AVX extensions.

http://www.codeproject.com/Articles/874396/Crunching-Numbers-with-AVX-and-AVX

Prysin

Posted

18 minutes ago, patrickjp93 said:

Maybe. I haven't done benchmarking at such a low level. However, I'm pretty sure to test individual instruction latencies and throughput's you're going to need direct hardware connection profiling tools that can sniff cache and registers just as fast as they refresh. They cost a ton of money to buy, and you have to get the pinout documentation for the pads if you want to build one.

meh, thats not hard to do. you can do that simply by using a PLC, voltage booster chips (you need to boost the signal up to around 10v) and some tiny wires. gonna take a week to build the shit. In the end, it would cost like 5-7000$ to jerry rig it. Would invite a lot of manual labor.

But hardware wise, its not hard to do. The majority of the cost would be the PLC and ofc, the shitload of time you would need to solder tiny wires to the pins. you wouldnt be able to measure cache issues, BUT you could home in on other signals. Like memory fetch signals. and look for how long it takes to execute a code on hardware A, then extrapolate the performance based on known numbers and the differential

 

EDIT:
luckily we have both Piledriver and Steamroller chips that work with FM2+... so its totally possible to test.

 

I just dont have 5-7k USD on hand atm

patrickjp93
· Banned

Posted

1 minute ago, Prysin said:

meh, thats not hard to do. you can do that simply by using a PLC, voltage booster chips (you need to boost the signal up to around 10v) and some tiny wires. gonna take a week to build the shit. In the end, it would cost like 5-7000$ to jerry rig it. Would invite a lot of manual labor.

But hardware wise, its not hard to do. The majority of the cost would be the PLC and ofc, the shitload of time you would need to solder tiny wires to the pins.

I'd give you a thumbs up, but the lazy bums at @LinusTech haven't implemented it yet for blogs! ;)

Prysin

Posted

48 minutes ago, patrickjp93 said:

I'd give you a thumbs up, but the lazy bums at @LinusTech haven't implemented it yet for blogs! ;)

atleast they arent much worse then your average game dev

patrickjp93
· Banned

Posted

3 minutes ago, Prysin said:

atleast they arent much worse then your average game dev

Meta reply is meta. :D

Prysin

Posted

2 minutes ago, patrickjp93 said:

Meta reply is meta. :D

get on skype :P

 

the longer you procrastrinate, the worse the skypelag

patrickjp93
· Banned

Posted

4 minutes ago, Prysin said:

get on skype :P

 

the longer you procrastrinate, the worse the skypelag

I have to call it a night. It's 1:30 AM my time now, and I get up at 6:45 for work.

Prysin

Posted

oh, BTW patrick

your code. Does it take into consideration that CMT based AMD CPUs uses 2x 128bit vector units to get 256 AVX?

Prysin

Posted

Just now, patrickjp93 said:

I have to call it a night. It's 1:30 AM my time now, and I get up at 6:45 for work.

weaksauce... i make fun of Rasmus all day long, go to bed around midnight and get up at 4:55 every damn fucking day

MageTank

Posted

3 hours ago, patrickjp93 said:

@MageTank Care to get in on this?

Sorry, I am confused as to how these blog entries work. What exactly am i getting in on? 

patrickjp93
· Banned

Posted

This current one. Just reveal the spoiler up top.

Prysin

Posted

9 hours ago, MageTank said:

Sorry, I am confused as to how these blog entries work. What exactly am i getting in on? 

You're getting on the hype train

patrickjp93
· Banned

Posted

16 hours ago, MageTank said:

Sorry, I am confused as to how these blog entries work. What exactly am i getting in on? 

The current one. Just open the hidden section up top. I'm looking for critique, (dis) agreement, suggestions, etc....

patrickjp93
· Banned

Posted

@Prysin I updated the code and provided a trimmed-down version for performance testing. Would you mind collecting some hard numbers for me? Pick whatever compiler. I've been using Clang 3.8 with -O3 enabled. Find the optimization level that works best if you use GCC or ICC. If you use MSVC, you'll have to do the research yourself on what combo of flags you'll need.

 

My average timings and variance for a 4960HQ on my Macbook Pro Retina under Fedora 24, latest kernel as of 10/15/2016:

Compiler: Clang++ 3.8.0

Flags:      -std=c++14 -O3

Mesh size in floats: 90000
Scalar translation took 6.08489e-04s +- 0.11032e-04s
Vector translation took 5.82480e-05s +- 0.14391e-05s

 

Basically, even with the bandwidth bottleneck, 10x performance improvement over scalar code. And even if I loop 3 billion times over the same mesh, I don't thermal throttle under AVX. Mind you, it's only addition, but still.

Prysin

Posted

@patrickjp93 i think you are overestimating my coding ability grossly. Issue is, i have started at a low level, understanding what it means and does (when i look at it long enough) but i have near zero of the basics, so i cannot do much about it when it comes to using it. That is what i am working on atm, however it's going slow, very very slow, as i am in the middle of buying my own house and moving.

×