We often hear developers and gamers complaining that games aren't optimized for AMD or Nvidia graphics hardware, or that they choke on the CPU, whether through a terrible single-threaded implementation or through bad multithreaded code. In reality, optimizing C++ code is not very difficult, and game developers should frankly be ashamed of the excuses they hide behind. Here I have supplied a rudimentary program which may not appear very useful or worthwhile to the average reader, but routines like these can serve as part of a CPU-based physics engine or artificial-intelligence system in the hands of someone who can manage data well.
Here are a few ways to implement summing the columns of a matrix. You may not find this particularly interesting, but for any mathematical or statistical reduction, it's best to get all of your data into matrix form, be it a 2D array or a flat buffer you traverse using multiplicative offsets. Furthermore, it's best to know how your programming language lays data out in memory: traversing in the wrong order can have disastrous consequences for your runtime (yay, foreshadowing).
Trials run per code base: 6
CPU: Core i7-2600K at stock clocks and settings, watercooled (no thermal throttling possible)
RAM: dual-channel G.Skill Trident X 1866 (3x4GB)
OS: Ubuntu 15.04 Vivid Vervet fully updated as of 5/14/15
Compiler: icpc (ICC) 15.0.3 20150407 Copyright © 1985-2015 Intel Corporation. All rights reserved.
Compiler Flags: -Wall -Wextra -std=c++11 -g -Ofast dummy.cpp -o dummy
Timing Specifications: /usr/bin/time -v ./dummy 115000 20000
Major page faults (requiring disk I/O): 0 (very important that we're never swapping to the hard drives here)
CPU Usage: 99% for all valid runs
Be aware that Sandy Bridge's SIMD integer operations are limited to 128 bits (256-bit integer AVX2 and FMA arrived with Haswell), so you will see better per-core throughput on Haswell, but you will need a lot of RAM to see how it scales. I use up most of my available 10 GB, and the system keeps the last 2.
Naive/First-Instinct Serial Solution // some have noted I originally forgot to pass the vector by reference, but adding the reference (as shown below) shaves only a single second off the runtime, insignificant until later
Elapsed time average: 53.46 seconds
Standard deviation: 0.27 seconds
#include <iostream>
#include <sstream>
#include <vector>

// Sum each column by walking down the columns (column-major traversal
// over a row-major structure -- the cache-hostile order).
std::vector<int> col_sums(const std::vector<std::vector<short>> &data) {
    unsigned int height = data.size(), width = data[0].size();
    std::vector<int> sums(width, 0);
    for (unsigned int i = 0; i < width; i++) {
        for (unsigned int j = 0; j < height; j++) {
            sums[i] += data[j][i];
        }
    }
    return sums;
}

int main(int argc, char** argv) {
    if (argc < 3) {
        std::cout << "Run program as: executable <rows> <columns>\n";
        return 1;
    }
    std::stringstream args;
    args << argv[1] << " " << argv[2];
    int rows, columns;
    args >> rows >> columns;
    std::vector<std::vector<short>> data(rows, std::vector<short>(columns));
    std::vector<int> columnSums = col_sums(data);
}
C++ Favorable Serial Solution
Elapsed time average: 4.90 seconds (that's right, a 10.91x speedup with one tiny change)
Standard deviation: 0.003 seconds
std::vector<int> col_sums(const std::vector<std::vector<short>> &data) {
    unsigned int height = data.size(), width = data[0].size();
    std::vector<int> sums(width, 0);
    // Row-major traversal: the inner loop walks each row contiguously.
    for (unsigned int i = 0; i < height; i++) {
        for (unsigned int j = 0; j < width; j++) {
            sums[j] += data[i][j];
        }
    }
    return sums;
}
And that's just from understanding how C++ stores multi-dimensional arrays (row-major order) and how the caches, hardware prefetcher, and translation lookaside buffer (TLB) work together to keep spatially local data close to the CPU. Traversing by column causes a huge number of cache misses, which, as you can see, severely diminishes performance. Feel free to experiment by adding columns, narrowing the rows, widening them, etc. This second algorithm may look slightly obfuscated to a novice or to someone who doesn't know the nitty-gritty details of a language's implementation (*cough* game programmers *cough*). Believe it or not, I found an almost exact copy of this sort of naivete in the physics code of the Unity engine. You wouldn't believe how hard I had to fight to get it changed, including providing the extensive evidence you can reproduce here for yourselves. I could not believe how difficult it was to convince three programmers that some upstart nobody had found something they could vastly improve in their code. If I ever come across as elitist, it's because I'm jaded by industry people who assume stuck-up college kids know nothing.
Now, using Intel's CilkPlus extensions (which run on AMD processors all the same if you compile with g++ or clang++, though g++ 4.9.2 has some CilkPlus bugs right now), we can do even better before going multi-threaded. To my knowledge, Microsoft's Visual C/C++ compiler does not support CilkPlus at all.
Good CilkPlus Solution (for g++ or clang++, must use -fcilkplus flag)
Elapsed time average: 3.00 seconds (almost 40% speedup over good C++ algorithm)
Standard deviation: 0.000 seconds
std::vector<int> col_sums(const std::vector<std::vector<short>> &data) {
    unsigned int height = data.size(), width = data[0].size();
    std::vector<int> sums(width, 0);
    for (unsigned int i = 0; i < height; i++) {
        // Array notation: elementwise add of the whole row in one statement.
        sums.data()[0:width] += data[i].data()[0:width];
    }
    return sums;
}
Bad CilkPlus Solution (for g++ or clang++, must use -fcilkplus flag)
Elapsed time average: 50.80 seconds (better than bad algorithm in raw C++, but terrible nonetheless)
Standard deviation: 0.29 seconds
std::vector<int> col_sums(const std::vector<std::vector<short>> &data) {
    unsigned int height = data.size(), width = data[0].size();
    std::vector<int> sums(width, 0);
    for (unsigned int i = 0; i < width; i++) {
        // Reduce one column at a time -- still strides across the rows.
        sums[i] = __sec_reduce_add(data.data()[0:height].data()[i]);
    }
    return sums;
}
CilkPlus even claws back a little when your algorithm doesn't use your hardware resources optimally, but it doesn't get you much.
OpenMP and CilkPlus Combined Solution (export OMP_NUM_THREADS=4; compile using -fopenmp, and -fcilkplus if using g++ or clang++)
CPU Usage: 391%
Elapsed time average: 1.19 seconds (more than 60% faster than the plain CilkPlus solution; there is a barrier to scaling here, set by the critical section, which must be resolved)
Standard deviation: 0.0058 seconds
***Note: using 2 threads gives 0.92 seconds elapsed time and lower (174%) CPU usage. The lesson is that parallelization can yield diminishing returns and can even lose you ground if you aren't careful***
Proposals: chunk-wise reduction from partials to total. Edit: no gains. Still confused.
#include <iostream>
#include <omp.h>
#include <sstream>
#include <vector>

// Parallelize the row allocations: many smaller allocations across threads
// beat one enormous allocation on the main thread.
void fill_vector(std::vector<std::vector<short>>& data, unsigned int rows, unsigned int cols) {
    data.resize(rows);
    #pragma omp parallel for
    for (unsigned int i = 0; i < rows; i++) {
        data[i] = std::vector<short>(cols);
    }
}

std::vector<int> col_sums(const std::vector<std::vector<short>> &data) {
    unsigned int height = data.size(), width = data[0].size();
    std::vector<int> totalSums(width, 0), threadSums(width, 0);
    #pragma omp parallel firstprivate(threadSums)
    {
        // Each thread accumulates its share of the rows into a private copy.
        #pragma omp for nowait
        for (unsigned int i = 0; i < height; i++) {
            threadSums.data()[0:width] += data[i].data()[0:width];
        }
        // Merge the partials; this must be critical (not atomic), since it
        // updates a whole array section rather than a single scalar.
        #pragma omp critical
        totalSums.data()[0:width] += threadSums.data()[0:width];
    }
    return totalSums;
}

int main(int argc, char** argv) {
    if (argc < 3) {
        std::cout << "Run program as: executable <rows> <columns>\n";
        return 1;
    }
    std::stringstream args;
    args << argv[1] << " " << argv[2];
    unsigned int rows, columns;
    args >> rows >> columns;
    std::vector<std::vector<short>> data;
    fill_vector(data, rows, columns);
    std::vector<int> columnSums = col_sums(data);
}
Sometimes the most obvious things escape our sight when we're looking for ways to improve. Memory allocation takes a long time, and trying to do it in one single call was not appropriate for such a large structure. Reserving a single block that does nothing but point to the individual rows, and then making many smaller allocation calls, takes far less time. I still believe I'm hitting a bandwidth bottleneck, considering that increasing the thread count here increases the runtime, especially given that the threshold is 2 threads and not 3. I suspect the increased runtime results from contention on the DDR3 bus, but I lack a hardware profiler to confirm this (there are various levels of profilers that can analyze software and hardware performance, and pinpointing exactly where this delay is generated would most likely require a hardware one).
EDIT: OpenMP Solution was given further optimization, and scaling is now closer to expectation. However, there is more I have to think about. There's definitely still room for improvement.