High Performance Computing: Basic Optimization Part 1

patrickjp93


We often hear developers and gamers complaining that games aren't optimized for AMD or Nvidia graphics hardware, or that they choke on the CPU, whether through a terrible single-threaded implementation or bad multithreaded code. In reality, optimizing C++ code is not very difficult, and game developers should frankly be ashamed of the excuses they hide behind. Here I have supplied a rudimentary program which may not appear very useful or worthwhile to the average reader, but in the hands of someone who can manage data well, these exact routines can serve as part of a CPU-based physics engine or artificial intelligence routine.

Here are a few ways to implement summing the columns of a matrix. You may not find this particularly interesting, but for any mathematical or statistical reduction, it's best to get all of your data into matrix form, be it a 2D array or a flat buffer you traverse using multiplicative offsets. Furthermore, it's best to know the order in which your programming language stores data: traversing in the wrong order can have disastrous consequences for your runtime (yay, foreshadowing).
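As a quick illustration of the flat-buffer idea (this sketch is mine, not part of the benchmark code below, and the helper name is made up), element (i, j) of a row-major matrix stored in one contiguous buffer lives at index i * width + j:

#include <vector>

// Row-major flat buffer: row i occupies indices [i * width, (i + 1) * width),
// so walking j for fixed i touches contiguous memory.
short element_at(const std::vector<short> &buf, unsigned int width,
                 unsigned int i, unsigned int j) {
  return buf[i * width + j];
}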

Trials run per code base: 6

CPU: Core i7-2600K, stock clocks & settings, watercooled (no thermal throttling possible)

RAM: dual-channel G.Skill Trident X 1866 (3x4GB)

OS: Ubuntu 15.04 Vivid Vervet fully updated as of 5/14/15

Compiler: icpc (ICC) 15.0.3 20150407 Copyright © 1985-2015 Intel Corporation. All rights reserved.

Compiler Flags: -Wall -Wextra -std=c++11 -g -Ofast dummy.cpp -o dummy

Timing Specifications: /usr/bin/time -v ./dummy 115000 20000

Major (requiring disk I/O) page faults: 0 (very important that we're not hitting virtual memory on the hard drives here)

CPU Usage: 99% for all valid runs

Be aware that because Sandy Bridge's AVX covers only 256-bit floating point (its integer SIMD is still 128-bit) and it lacks FMA entirely, you will see better performance on Haswell (AVX2 plus FMA3), but you will need a lot of RAM to see how it scales: at 115,000 x 20,000 the shorts alone come to roughly 4.3 GiB before any per-row vector overhead. I use up most of my available 10 GB, and the system has the last 2.

Naive/First-Instinct Serial Solution (some have noted I forgot to use a reference for my function's parameter; adding it shaves only a single second off the runtime, insignificant until later)

Elapsed time average: 53.46 seconds

Standard deviation: 0.27 seconds

 

#include <iostream>
#include <vector>
#include <stdexcept>
#include <sstream>

// Naive version: the outer loop walks columns, so each inner-loop access
// jumps to a different row and almost every load misses the cache.
std::vector<int> col_sums(const std::vector<std::vector<short>> &data) {
  unsigned int height = data.size(), width = data[0].size();
  std::vector<int> sums(width, 0);
  for (unsigned int i = 0; i < width; i++) {
    for (unsigned int j = 0; j < height; j++) {
      sums[i] += data[j][i];
    }
  }
  return sums;
}

int main(int argc, char** argv) {
  if (argc < 3) {
    std::cout << "Run program as \"executable <rows> <columns>\"\n";
  } else {
    std::stringstream args;
    args << argv[1] << " " << argv[2];
    int rows, columns;
    args >> rows >> columns;
    std::vector<std::vector<short>> data(rows, std::vector<short>(columns));
    std::vector<int> columnSums = col_sums(data);
  }
}

 

C++ Favorable Serial Solution

Elapsed time average: 4.90 seconds (that's right, a 10.91× speedup from one tiny change)

Standard deviation: 0.003 seconds

 

// The tiny change: swap the loop order so the inner loop walks each row
// contiguously (row-major order), letting the prefetcher and caches work.
std::vector<int> col_sums(const std::vector<std::vector<short>> &data) {
  unsigned int height = data.size(), width = data[0].size();
  std::vector<int> sums(width, 0);
  for (unsigned int i = 0; i < height; i++) {
    for (unsigned int j = 0; j < width; j++) {
      sums[j] += data[i][j];
    }
  }
  return sums;
}

 

And that's just from understanding how C++ stores multi-dimensional arrays (row-major order) and how the translation lookaside buffer and caches keep spatially local data close to the CPU. Traversing by column causes a huge number of cache misses which, as you can see, severely diminishes your performance. Feel free to experiment by adding columns, narrowing the rows, widening them, etc. This second algorithm may look slightly obfuscated to a novice or to someone who doesn't know the nitty-gritty details of a language's implementation (*cough* game programmers *cough*). Believe it or not, I found an almost exact copy of this naivete in the physics engine of the Unity Engine. You wouldn't believe how hard I had to fight to get it changed, including providing extensive evidence you can reproduce here for yourselves. I could not believe how impossible it was to convince three programmers that some upstart nobody had found something they could vastly improve in their code. If I ever come across as an elitist, it's because I'm jaded by industry people who assume stuck-up college kids know nothing.

Now, using Intel's CilkPlus extensions (which run just as well on AMD processors if you compile with g++ or clang++, though g++ 4.9.2 has some CilkPlus bugs right now), we can do even better before going multi-threaded. To my knowledge, Microsoft's Visual C/C++ compiler does not yet support CilkPlus at all.

Good CilkPlus Solution (for g++ or clang++, must use -fcilkplus flag)

Elapsed time average: 3.00 seconds (almost a 40% speedup over the good C++ algorithm)

Standard deviation: 0.000 seconds

 

// CilkPlus array notation: each iteration adds an entire row to the running
// sums in one vectorizable array-section operation.
std::vector<int> col_sums(const std::vector<std::vector<short>> &data) {
  unsigned int height = data.size(), width = data[0].size();
  std::vector<int> sums(width, 0);
  for (unsigned int i = 0; i < height; i++) {
    sums.data()[0:width] += data[i].data()[0:width];
  }
  return sums;
}

 

Bad CilkPlus Solution (for g++ or clang++, must use -fcilkplus flag)

Elapsed time average: 50.80 seconds (better than the bad algorithm in raw C++, but terrible nonetheless)

Standard deviation: 0.29 seconds

 

// Column-by-column again: __sec_reduce_add() folds one column across all
// rows, but the strided accesses still thrash the caches.
std::vector<int> col_sums(const std::vector<std::vector<short>> &data) {
  unsigned int height = data.size(), width = data[0].size();
  std::vector<int> sums(width, 0);
  for (unsigned int i = 0; i < width; i++) {
    sums[i] = __sec_reduce_add(data.data()[0:height].data()[i]);
  }
  return sums;
}

 

CilkPlus even manages to claw something back when your algorithm doesn't use your hardware resources optimally, but it doesn't get you much.

OpenMP and CilkPlus Combined Solution (export OMP_NUM_THREADS=4; compile with -fopenmp, plus -fcilkplus if using g++ or clang++)

CPU Usage: 391%

Elapsed time average: 1.19 seconds (more than 60% faster than the plain CilkPlus solution; there is a barrier to scaling here, set by the critical section, which must be resolved)

Standard deviation: 0.0058 seconds

***Note: using 2 threads gives 0.92 seconds elapsed time and lower (174%) CPU usage. This is a lesson that parallelization can yield diminishing returns and even lose you ground if you aren't careful.***

Proposals: chunk-wise reduction from partials to total (sketched after the code below). Edit: no gains. Still confused.

 

#include <iostream>
#include <vector>
#include <stdexcept>
#include <sstream>
#include <omp.h>

// Allocate the rows in parallel: many small allocations spread across
// threads beat one giant serial allocation for a structure this large.
void fill_vector(std::vector<std::vector<short>>& data, unsigned int rows, unsigned int cols) {
  data.resize(rows);
  #pragma omp parallel for
  for (unsigned int i = 0; i < rows; i++) {
    data[i] = std::vector<short>(cols);
  }
}

std::vector<int> col_sums(const std::vector<std::vector<short>> &data) {
  unsigned int height = data.size(), width = data[0].size();
  std::vector<int> totalSums(width, 0), threadSums(width, 0);
  // Each thread accumulates into its own private copy of threadSums...
  #pragma omp parallel firstprivate(threadSums)
  {
    #pragma omp for nowait
    for (unsigned int i = 0; i < height; i++) {
      threadSums.data()[0:width] += data[i].data()[0:width];
    }
    // ...then merges into the shared total inside the critical section
    // (omp atomic only accepts scalar updates, not array sections).
    #pragma omp critical
    totalSums.data()[0:width] += threadSums.data()[0:width];
  }
  return totalSums;
}

int main(int argc, char** argv) {
  if (argc < 3) {
    std::cout << "Run program as \"executable <rows> <columns>\"\n";
  } else {
    std::stringstream args;
    args << argv[1] << " " << argv[2];
    unsigned int rows, columns;
    args >> rows >> columns;
    std::vector<std::vector<short>> data(rows);
    fill_vector(data, rows, columns);
    std::vector<int> columnSums = col_sums(data);
  }
}
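For the record, here is roughly what the proposed chunk-wise reduction looks like (my reconstruction in plain OpenMP; the partials name is illustrative). Each thread builds a private partial sum, and the merge then assigns each thread a disjoint chunk of columns, so no critical section is needed:

std::vector<int> col_sums_chunked(const std::vector<std::vector<short>> &data) {
  unsigned int height = data.size(), width = data[0].size();
  int nthreads = omp_get_max_threads();
  std::vector<std::vector<int>> partials(nthreads, std::vector<int>(width, 0));
  std::vector<int> totalSums(width, 0);
  #pragma omp parallel
  {
    int t = omp_get_thread_num();
    // Phase 1: each thread accumulates its share of rows into its own partial.
    #pragma omp for
    for (unsigned int i = 0; i < height; i++) {
      for (unsigned int j = 0; j < width; j++) {
        partials[t][j] += data[i][j];
      }
    }
    // Implicit barrier above; phase 2 merges column chunks with no locking,
    // since each thread writes a disjoint range of totalSums.
    #pragma omp for
    for (unsigned int j = 0; j < width; j++) {
      int sum = 0;
      for (int k = 0; k < nthreads; k++) {
        sum += partials[k][j];
      }
      totalSums[j] = sum;
    }
  }
  return totalSums;
}

As noted above, this measured no faster than the critical-section version on this machine.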

 

Sometimes it's the most obvious things that escape our sight when we're looking for ways to improve. Memory allocation takes a long time, and trying to do it in one single call was not appropriate for such a large structure. Reserving a single block that does nothing but point to the individual rows, and then making many calls to allocate much smaller pieces, takes far less time. I still believe I'm hitting a bandwidth bottleneck, since increasing the thread count increases the runtime, especially given that the threshold is 2 threads and not 3. I believe the increased runtime comes from contention on the DDR3 bus, but I lack a hardware profiler to confirm it (there are various levels of profilers for analyzing software and hardware performance, and pinpointing exactly where this delay is generated would most likely require a hardware one).

EDIT: The OpenMP solution was given further optimization, and scaling is now closer to expectation. However, there is more for me to think about; there's definitely still room for improvement.

11 Comments

That spatial coherence part on the CPU is something my mate doesn't get at all. He always writes his code to go down Y before across X, making the prefetched data almost useless. I didn't expect a 10× difference though; that's insane. It really shows how powerful that is.

The way OpenMP works is odd, but it makes sense. I recently found out that if you have an OpenMP parallel for of the form "for (int x = 0; x < 10000; x++)", it starts one thread at 0 and the other (assuming two threads) at 5000. This means that in a loop whose iterations vary in cost, the workload is not evenly distributed. The slow thread might also stop the other one from progressing; I'm not sure about this, and I don't have time to test at the moment.

PS: That mate also measures "efficiency" by the number of lines of code he has...


-snip-

 

In my HPC class, if you write more than 25 lines in any function, you get a 0 for that assignment. In fact, optimizing compilers generally start to royally screw up once functions get longer than about 20 lines, unless every line is a function call they can individually optimize, inline, and then re-optimize. Six nested loops do worse than three nested loops calling a function that contains another three nested loops (e.g. block matrix multiplication, sketched below).
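For illustration, here's a minimal sketch of that blocked (tiled) structure; the block size and function names are mine, purely illustrative:

// Multiply n x n matrices C += A * B one B_SZ x B_SZ tile at a time, so each
// tile's working set stays cache-resident. Assumes n is a multiple of B_SZ.
const int B_SZ = 64;

void matmul_tile(const float *A, const float *B, float *C, int n,
                 int i0, int j0, int k0) {
  // Inner 3-nested loop over a single tile.
  for (int i = i0; i < i0 + B_SZ; i++)
    for (int k = k0; k < k0 + B_SZ; k++)
      for (int j = j0; j < j0 + B_SZ; j++)
        C[i * n + j] += A[i * n + k] * B[k * n + j];
}

void matmul(const float *A, const float *B, float *C, int n) {
  // Outer 3-nested loop over tiles, calling the small function above.
  for (int i = 0; i < n; i += B_SZ)
    for (int j = 0; j < n; j += B_SZ)
      for (int k = 0; k < n; k += B_SZ)
        matmul_tile(A, B, C, n, i, j, k);
}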

 

If you print the totals vector, it comes out as all 0s because I obviously haven't given it any real data, so using dynamic or guided scheduling would get you nothing. All the rows are the same size for right now.

 

I've never seen a parallel for-loop do that to the threads. I've seen plenty of instances of parallel for that evenly distribute the load among 4 threads. Now, what you may be seeing is the main program thread spawning 4 threads and then idling until all 4 are ready to die, which is the fork-join model. The reason it's done this way is so you can use the nowait clause and move on to another task when synchronization doesn't have to be immediately tight.
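To make the scheduling point concrete, here's a tiny sketch (mine, not benchmark code) contrasting the default static schedule, which gives each thread one contiguous slice of the range, with dynamic scheduling, which hands out small chunks on demand and only pays off when iteration costs vary:

#include <cstdio>
#include <omp.h>

int main() {
  // static (the default): with two threads, thread 0 covers roughly
  // [0, 5000) and thread 1 covers [5000, 10000).
  #pragma omp parallel for schedule(static)
  for (int x = 0; x < 10000; x++) {
    if (x % 2500 == 0)
      std::printf("static:  thread %d ran iteration %d\n", omp_get_thread_num(), x);
  }
  // dynamic: idle threads grab the next 64-iteration chunk, balancing
  // loops whose iterations differ in cost.
  #pragma omp parallel for schedule(dynamic, 64)
  for (int x = 0; x < 10000; x++) {
    if (x % 2500 == 0)
      std::printf("dynamic: thread %d ran iteration %d\n", omp_get_thread_num(), x);
  }
}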


-snip-

The next entry is up if you're interested. I explore a whole bunch of different combinations of optimization techniques on a very similar problem.


-snip-

I get that; I've yet to do a paper involving optimization for the compiler (next year, I think), but what I meant by him calling the number of lines "efficiency" is total lines, not lines per function. He just kind of expects the compiler to understand everything he does.

It's interesting that you're seeing something different in OpenMP. I've only noticed it once, when I made a Mersenne prime calculator and had the threads print their respective values of x, where x was the iteration number (quite possibly because my code was mainly brute force).

I wonder how these optimizations carry over to GPU compute, as that tends to be more what I do.


-snip-

Well, in this instance I have a strange scaling problem I can't seem to get around. With 4 threads I can't break 200% utilization no matter what size data set I use (1000x1000 should be small enough to avoid a huge allocation time and any bandwidth issues), and frankly that makes no sense. I'm working on Part 3, which combines task parallelism and data parallelism; my example for that is a homework from my high performance computing class where I easily get 390%+ CPU usage on 4 threads. I'm literally stumped, and it's the first time all semester I couldn't just zero in on the problem and fix it.


-snip-

I greatly improved the OpenMP solution if you'd like to view it. Sometimes it's the obvious stuff that goes right over your head when you've been thinking about a problem too long. I tried to do each of these pages in a single day, and I think that's why I got stuck. There's still more room for improvement, but I'm glad I made the breakthrough of setting up the data structure using multiple threads. I feel dumb for not thinking of it sooner.

Just now, DXMember said:

Would you be willing to format these as well?

And thank you very much; your effort is appreciated.

I will tomorrow. For now, sleep.

18 hours ago, DXMember said:

Would you be willing to format these as well?

And thank you very much; your effort is appreciated.

Done.
