Model Matrix and Vector Transforms Optimized By SIMD
I'm sick to death of people telling me "if it was so easy, the game devs would have done it by now. They know better than you do."
Here is visible, incontrovertible proof that the games industry can get a huge boost from taking advantage of SIMD today, especially when games require Sandy Bridge or later hardware (meaning AVX is available, but not AVX2 for our purposes).
First Example: Mesh Transform By Translation Using AVX Intrinsics
Example updated and trimmed for readability.
#include <iostream>
#include <chrono>
#include <x86intrin.h>

using uint = unsigned int;

//Size chosen because 30,000 triangles is considered medium-high for modern prominent characters
const uint size = 90000;

//The xyz translation, pre-repeated across all 8 lanes
alignas(32) const float Mat3T[8] = {1.0f, 2.0f, 3.0f, 1.0f, 2.0f, 3.0f, 1.0f, 2.0f};
alignas(32) float Mesh[size] = {};

void translate_scalar(float *Mesh, const float *translation, const uint length)
{
    for(uint i = 0; i < length; i += 3) {
        Mesh[i]   += translation[0];
        Mesh[i+1] += translation[1];
        Mesh[i+2] += translation[2];
    }
}

void translate_vector(float *Mesh, const float *translation, const uint length)
{
    __m256 trans = _mm256_load_ps(translation);
    //we stay 8 ahead in count so we don't go out of bounds
    uint i = 7;
    for(; i < length; i += 8, Mesh += 8) {
        __m256 verts = _mm256_load_ps(Mesh);
        verts = _mm256_add_ps(verts, trans);
        _mm256_store_ps(Mesh, verts);
        //Rotate the translation pattern to match the xyz phase of the
        //next 8 components (8 mod 3 == 2)
        trans = _mm256_permute_ps(trans, _MM_SHUFFLE(2, 1, 0, 2));
    }
    //Cleanup loop for cases where length is not a multiple of 8
    uint diff = length + 7 - i; //number of components still untouched
    if(diff != 0) {
        alignas(32) float temp[8] = {};
        _mm256_store_ps(temp, trans);
        for(uint j = 0; j < diff; ++j) { Mesh[j] += temp[j]; }
    }
}

int main()
{
    using namespace std::chrono;
    std::cout << "Mesh size in floats: " << size << "\n";

    high_resolution_clock::time_point start, end;

    start = high_resolution_clock::now();
    translate_scalar(Mesh, Mat3T, size);
    end = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(end - start);
    std::cout << "Scalar translation took " << time_span.count() << "s\n";

    start = high_resolution_clock::now();
    translate_vector(Mesh, Mat3T, size);
    end = high_resolution_clock::now();
    duration<double> time_span2 = duration_cast<duration<double>>(end - start);
    std::cout << "Vector translation took " << time_span2.count() << "s\n";

    /*//This will double-check your work.
    for(uint i = 0; i < size; i += 3) {
        std::cout << Mesh[i] << ", " << Mesh[i+1] << ", " << Mesh[i+2] << "\n";
    } */
}

(Two bugs fixed from the earlier version: the cleanup count was off by one, so it wrote one float past the end of the mesh whenever length was a multiple of 8, and it added the wrong lanes of the rotated translation; temp also needed 32-byte alignment for the aligned store.)
My average timings and variance for a 4960HQ on my MacBook Pro Retina under Fedora 24, latest kernel as of 10/15/2016:
Compiler: Clang++ 3.8.0
Flags: -std=c++14 -O3 -march=native
Mesh size in floats: 90000
Scalar translation took 6.08489e-04s ± 0.11032e-04s
Vector translation took 5.82480e-05s ± 0.14391e-05s
The short of it is that you can write tighter, denser loops with a little bit of effort. While the latency of each vector add is 3 cycles (and each vector multiply 5 on Sandy Bridge), multiple iterations can be in flight at once on a single thread, and each vector instruction processes 8 floats where the scalar version processes one, so the vectorized loop gets roughly 8x the per-instruction throughput without any unrolling. The loop body is also small enough to fit in the loop stream detector, which removes instruction fetch and decode from the pipeline on every iteration and forwards results between iterations. Assuming you don't run out of memory bandwidth, you can actually run other work on this same core without hyper-threading, as long as it does not depend on the result of the mesh manipulation. Looking at the Sandy Bridge block diagram, with each clock issuing both an 8-wide vector multiply and an 8-wide vector add, you can exceed 50 GFLOPS per core on a 2600K; but memory bandwidth will not let you load and store operands as fast as you can consume and produce them without high-end dual-channel DDR3 or a quad-channel configuration. It would be best to encapsulate this in a stackless resumable function (the coroutines TS proposed for C++17) and do short bursts of another task when more than 3 L3 cache misses happen in a row (this can be tracked with a hardware profiler to determine optimal burst lengths).
If there is interest, I can go into nuances of leveraging vectorization techniques in conjunction with other data transforms relevant to gaming (though I'm not giving away my AVX ray tracer). I can also look into benchmarking multicore use of this and balancing it out against other tasks to achieve best performance for a given configuration.