High Performance Computing: Basic Optimization Part 2

patrickjp93

In the previous section, I decided to forgo terminology and verbiage for the sake of expediency, but to continue the discussion one needs to know about the parallelism hierarchy in programming. At the highest level is process parallelism, usually the calling card of distributed computing, where multiple processes run on multiple nodes in a distributed system, such as CentOS on a cluster supercomputer. This can take the form of SPMD (Single Program, Multiple Data), MPMD (Multiple Program, Multiple Data), or MPSD (Multiple Program, Single Data). The fourth form, SPSD, is generally reserved for a single machine running a single program on a single data set, i.e. consumer computing. These acronyms and names derive from Flynn's Taxonomy, a somewhat technical resource but very good reading for anyone wanting to enter the HPC space.

The next layer down is very closely related, and in some circles entirely synonymous: task parallelism, where multiple tasks of a (usually) single program run in parallel on multiple cores. In the process parallelism layer, the multiple processes are sometimes not working toward the same end goal. In a distributed simulation, for example, it's possible all the processes are gathering statistics about different results based on probabilities; the end product is a cohesive unit of statistical research, but the processes do not work together in any way. They are fully independent apart from (usually) the starting data set. In task parallelism, by contrast, we assert that all portions of the program work together to achieve a total system, such as a game, and we try to build a system of interconnected subprograms which work cohesively and efficiently with minimum stalls due to waiting on any single task. If your game stutters without new frames because CPU-based physics calculations take too long, that's a bad thing. If you're trying to encrypt and save a file to disk, it's best to save as we encrypt instead of waiting for the entire file to be encoded before starting the save process. That is the essence of task parallelism.
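
To make that encrypt-while-saving idea concrete, here's a minimal sketch using standard C++ threads and a work queue. The encrypt_chunk and write_chunk helpers are hypothetical stand-ins for a real cipher and file writer, not anything from the original post:

    #include <condition_variable>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <vector>

    using Chunk = std::vector<char>;

    // Stand-ins for a real cipher and file writer (hypothetical helpers).
    Chunk encrypt_chunk(const Chunk& plain) { return plain; }
    void write_chunk(const Chunk&) { /* pretend this hits the disk */ }

    void encrypt_and_save(const std::vector<Chunk>& file) {
        std::queue<Chunk> ready;            // encrypted chunks awaiting disk I/O
        std::mutex m;
        std::condition_variable cv;
        bool done = false;

        // Writer task: saves each chunk as soon as it has been encrypted.
        std::thread writer([&] {
            for (;;) {
                std::unique_lock<std::mutex> lock(m);
                cv.wait(lock, [&] { return !ready.empty() || done; });
                if (ready.empty() && done) break;
                Chunk c = std::move(ready.front());
                ready.pop();
                lock.unlock();
                write_chunk(c);             // disk I/O overlaps with encryption
            }
        });

        // Main task encrypts; the writer drains the queue concurrently.
        for (const Chunk& plain : file) {
            Chunk enc = encrypt_chunk(plain);
            {
                std::lock_guard<std::mutex> lock(m);
                ready.push(std::move(enc));
            }
            cv.notify_one();
        }
        {
            std::lock_guard<std::mutex> lock(m);
            done = true;
        }
        cv.notify_one();
        writer.join();
    }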

Moving down we next find data parallelism, where (optimally) data transforms can occur simultaneously without any risk of producing an incorrect result. If you ever hear the term "race condition", it refers to a situation where the result depends on unmanaged, concurrent reads and writes to a critical item of data; such data must be treated extremely carefully in a parallel program, with accesses managed tightly to ensure program states stay consistent. If you recall, in the OpenMP implementation of col_sums, I told each thread to create a partial sum of its "chunk" of the data, and then, when it was done, add its chunk to the total sum, but that second part was headed by a special "atomic" directive. Atomics are transactional commands which must be completed from start to end with no interference; each thread must wait its turn before adding its partial sum. If you want multiple actions to take place in a transactional manner to preserve data integrity and system state, you would use '#pragma omp critical <newline> { //code here }'. Yet I didn't do this for the partial sums, because each thread has its own private copy of a suitably sized vector, and each thread only sees its copy and is only given one row of the base data to read at a time. That is the essence of data parallelism. Each partial sum can be made independently without us having to worry about corrupted results, assuming there isn't a bug in the OpenMP or CilkPlus libraries themselves or in the compiler you use. These do appear from time to time, and I've found my fair share, though 99% of the time it's the lexical analyzer which has an issue (operator multiversioning under icpc 15 v2 is broken, fixed under v3), not a correctness-of-code-production issue. Lastly, notice each column of each partial sum can be added to independently of the others; there's no race condition between columns. The more data parallelism you can find in a program, the more you can speed it up, with high gains, using some of the techniques displayed in these blog entries.
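
To make the two constructs concrete, here's a minimal sketch (illustrative names and shapes, not the Part 1 code) showing 'atomic' guarding a single scalar update and 'critical' guarding a multi-statement transaction:

    #include <omp.h>
    #include <vector>

    // Illustrative only: each thread folds its private partial results into
    // shared state, once with 'atomic' (single operation) and once with
    // 'critical' (multi-statement transaction). Assumes perThreadSums has one
    // entry per thread and each entry matches totalSums in length.
    void merge_partials(std::vector<int>& totalSums, long& grandTotal,
                        const std::vector<std::vector<int>>& perThreadSums) {
        #pragma omp parallel
        {
            const std::vector<int>& mine = perThreadSums[omp_get_thread_num()];
            long psum = 0;
            for (int v : mine) psum += v;

            // One scalar update: 'atomic' is cheap but covers one operation.
            #pragma omp atomic
            grandTotal += psum;

            // Several statements that must not interleave: use 'critical'.
            #pragma omp critical
            {
                for (std::size_t j = 0; j < totalSums.size(); j++)
                    totalSums[j] += mine[j];
            }
        }
    }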

Lastly we have instruction parallelism. There is some overlap with data parallelism, but the focus here is maximizing how many instructions you can run concurrently on your CPU without causing a data or control hazard. Data hazards are race conditions, in the simplest terms. You can't subtract from and add to the same value simultaneously and come up with a correct answer, and in fact, due to the lack of safety rails built into the control circuitry for the MMX and AVX units, if you try you may well crash the program with a segmentation fault. The CPU covers the safety of most if not all single data transforms in its out-of-order processing engine, but those protections do not extend to the vector extensions. Modern CPUs will reorder the instructions they are fed to extract more parallelism from them, keeping the pipeline full and increasing throughput. However, poorly written code can hinder this severely. Also, if you have branches nested more than about three deep, suddenly you have enough control hazards and enough guesses that the branch predictor will start to fail noticeably more often, causing pipeline stalls/flushes and losing you performance. Instruction-level parallelism cannot be fully controlled by a programmer, but the programmer can write code in a suggestive manner to extract it. This is called implicit or programmer-guided, compiler-assisted parallelism, a coding technique I'll cover in an example below.
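
For example (my illustration, not from the original post), a data-dependent branch in a hot loop can often be rewritten as arithmetic the compiler can lower to a conditional move or vectorize, removing the control hazard entirely:

    #include <vector>

    // Branchy version: the predictor must guess the comparison per element and
    // pays a pipeline flush every time it guesses wrong on irregular data.
    int count_branchy(const std::vector<int>& v, int threshold) {
        int count = 0;
        for (int x : v)
            if (x > threshold) count++;
        return count;
    }

    // Suggestive version: the comparison yields 0 or 1, so the loop body is
    // straight-line arithmetic the compiler can turn into a conditional move
    // or vectorize; the control hazard in the hot loop is gone.
    int count_branchless(const std::vector<int>& v, int threshold) {
        int count = 0;
        for (int x : v)
            count += (x > threshold);
        return count;
    }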

I apologize for the definitions and textual lecture, but to play in the big leagues it is about what you know once you get your foot in the door.

Row Total Sums

While this may seem trivial after the first example, it's important to catch the subtleties. Instead of being fully data-independent when creating partial sums for the columns, as in the previous example, this time we have to condense entire rows to single values, a challenge which on its face presents serious data hazards, the pipeline stalls they cause, and the performance they cost your CPU cores (and GPU cores too, if you're not careful). And what if we could organize our data neatly into a single flat buffer or 2D array, where we know every run of x values corresponds to some particular item of interest? How do you efficiently traverse your data while keeping your code short and readable? This page addresses such concerns with a simple example, but the solutions are extensible and scale very well (in most cases) to larger data and multiple cores.
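
As a preview of the flat-buffer idea, here's a minimal sketch (my illustration, under the usual row-major assumption that element (i, j) lives at index i * width + j):

    #include <vector>

    // One flat allocation; a row is a contiguous run of 'width' values, which
    // is exactly the layout the cache prefetcher likes best.
    std::vector<int> row_sums_flat(const std::vector<short>& flat,
                                   unsigned int height, unsigned int width) {
        std::vector<int> totalSums(height, 0);
        for (unsigned int i = 0; i < height; i++) {
            const short* row = flat.data() + static_cast<std::size_t>(i) * width;
            for (unsigned int j = 0; j < width; j++)
                totalSums[i] += row[j];
        }
        return totalSums;
    }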

This time around, we're going to disable and re-enable automatic compiler optimizations to compare how the following hand-optimized algorithms fare against the compiler-optimized defaults, and how compiler optimizations can (sometimes) add to what we have done by hand.

Runtime Environment (same as in High Performance Computing: Basic Optimization Part 1):

OMP_NUM_THREADS: export OMP_NUM_THREADS=4

CPU: Core i7 2600K, stock clocks & settings, watercooled (no thermal throttling possible)

RAM: dual-channel G.Skill Trident X 1866 (3x4GB)

OS: Ubuntu 15.04 Vivid Vervet fully updated as of 5/14/15

Compiler: icpc (ICC) 15.0.3 20150407 Copyright © 1985-2015 Intel Corporation. All rights reserved.

Compiler Flags: -Wall -Wextra -std=c++11 -O0 dummy2.cpp -o dummy2

Timing Specifications: /usr/bin/time -v ./dummy2 116000 20000

Major (requiring disk I/O) page faults: 0 (very important that we're not hitting virtual memory on the hard drives here)

# of runs per code base: 6

Total Row Sum Basic Serial Solution

Average Elapsed Time (-O0) : 26.72 seconds (-Ofast) 5.62 seconds

Standard Deviation (-O0) : 0.115 seconds (-Ofast) 0.007 seconds

#include <iostream>
#include <vector>
#include <stdexcept>
#include <sstream>
#include <omp.h>

// basic serial solution
std::vector<int> row_sums(const std::vector<std::vector<short>>& data) {
    unsigned int height = data.size(), width = data[0].size();
    std::vector<int> totalSums(height, 0);
    for(unsigned int i = 0; i < height; i++) {
        for(unsigned int j = 0; j < width; j++) {
            totalSums[i] += data[i][j];
        }
    }
    return totalSums;
}

int main(int argc, char** argv) {
    if (argc < 3) {
        std::cout << "Run program as \"executable <rows> <columns>\"\n";
    } else {
        std::stringstream args;
        args << argv[1] << " " << argv[2];
        unsigned int rows, columns;
        args >> rows >> columns;
        std::vector<std::vector<short>> data(rows, std::vector<short>(columns));
        std::vector<int> rowSums = row_sums(data);
    }
}

Notice this runtime is approximately 1/2 that of the Naive/First Instinct Column Sum Algorithm discussed in Part 1. It's for the same reason the optimized solution ran 10.91x faster: we're getting free benefits from the way a modern CPU caches data (C++ stores a 2D array in row-major order, and the caches fetch data according to spatial locality relative to other recently used data) and looks ahead using the TLB (Translation Lookaside Buffer) to keep the hungry chip fed. However, we're not getting the full 10.91x speedup. That's because we've disabled automatic optimizations (-O0 vs. -Ofast), exposing our first nuance: data hazards. As you can see, with each iteration of the inner loop, we're adding a single data element to our sum for that row, and we do this for each row. According to the basic rules of digital logic, you can't add 2 data items to a 3rd in a single cycle, or simultaneously in general. You would have to stagger the adds at the level of bit shifting, something which can be done at great electric/thermal expense, but generally is not, except in some very exotic architectures.
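
If you want to see that caching effect in isolation, compare these two traversals (my sketch, not the benchmarked code); the arithmetic is identical, and only the access pattern changes:

    #include <vector>

    // Walks consecutive addresses within each row: cache-friendly.
    long sum_row_major(const std::vector<std::vector<short>>& data) {
        long sum = 0;
        for (std::size_t i = 0; i < data.size(); i++)
            for (std::size_t j = 0; j < data[0].size(); j++)
                sum += data[i][j];
        return sum;
    }

    // Jumps a full row's worth of memory per access: cache-hostile.
    long sum_column_major(const std::vector<std::vector<short>>& data) {
        long sum = 0;
        for (std::size_t j = 0; j < data[0].size(); j++)
            for (std::size_t i = 0; i < data.size(); i++)
                sum += data[i][j];
        return sum;
    }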

With -Ofast, we see times closely centered around 5.62 seconds, a near 10x improvement in time over the naive column sum algorithm, but not quite 10.91x. There's still room to squeeze out 3x better performance.

In our instance, for the given run size, we know our data will always come in sizes divisible by 16 (an AVX register is 256 bits wide, which can be split into 16 16-bit shorts), useful to know for the rationale behind the next implementation. It can (and will later) be generalized to handle all cases in a more optimal manner than the basic solution above.
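
That generalization comes later, but here's a sketch of where it's headed, assuming the 16-wide unrolled body introduced in the next section plus a scalar remainder loop for the last width % 16 elements:

    std::vector<int> row_sums_general(const std::vector<std::vector<short>>& data) {
        unsigned int height = data.size(), width = data[0].size();
        unsigned int main_end = width - (width % 16); // last index the 16-wide body may touch
        std::vector<int> totalSums(height, 0);
        for (unsigned int i = 0; i < height; i++) {
            unsigned int j = 0;
            for (; j < main_end; j += 16)             // unrolled main loop, as below
                totalSums[i] += data[i][j]    + data[i][j+1]  + data[i][j+2]  + data[i][j+3]
                              + data[i][j+4]  + data[i][j+5]  + data[i][j+6]  + data[i][j+7]
                              + data[i][j+8]  + data[i][j+9]  + data[i][j+10] + data[i][j+11]
                              + data[i][j+12] + data[i][j+13] + data[i][j+14] + data[i][j+15];
            for (; j < width; j++)                    // remainder: at most 15 iterations
                totalSums[i] += data[i][j];
        }
        return totalSums;
    }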

Total Row Sum Loop Unrolling

Average Elapsed Time (-O0) : 17.41 seconds (-Ofast) 2.00 seconds

Standard Deviation (-O0) : 0.290 seconds (-Ofast) 0.000 seconds

std::vector<int> row_sums(const std::vector<std::vector<short>>& data) {
    unsigned int height = data.size(), width = data[0].size();
    std::vector<int> totalSums(height, 0);
    for(unsigned int i = 0; i < height; i++) {
        for(unsigned int j = 0; j < width; j += 16) {
            totalSums[i] += data[i][j] + data[i][j+1] + data[i][j+2] + data[i][j+3]
                + data[i][j+4] + data[i][j+5] + data[i][j+6] + data[i][j+7]
                + data[i][j+8] + data[i][j+9] + data[i][j+10] + data[i][j+11]
                + data[i][j+12] + data[i][j+13] + data[i][j+14] + data[i][j+15];
        }
    }
    return totalSums;
}

Loop unrolling is common in any massively data-parallel application. This is because, in this era, most CPU and GPU architectures are superscalar in some way, where a single core may execute multiple instructions in parallel in certain proportions. By disabling automatic optimization (done a lot in modern software for legacy-support reasons) we can observe the asymptotic limit of SISD (Single Instruction, Single Data) computing performance. If you compare the 2-unrolled, 8-unrolled, and 16-unrolled versions, you will notice rapidly diminishing returns with every doubling. This is because the out-of-order processing engine is only being fed single-data instructions which are highly independent and can be slightly refactored at runtime into pair-wise reductions, using extra registers beyond the few general-purpose ones x86 affords. It's important to note that with -Ofast we even get a 2.8x speedup over the basic algorithm at -Ofast, and frankly that surprised even me. Usually icpc is extremely good at vectorizing this sort of operation without having to unroll the loop.
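
Here's the pair-wise refactoring made explicit (my sketch of what the out-of-order engine and compiler effectively do for you): four independent accumulators break the single serial dependency chain, so four adds can be in flight at once. The width % 4 == 0 assumption is mine:

    // Four independent accumulators; no add depends on the previous iteration's
    // result, so the out-of-order engine can keep all four chains in flight.
    int row_sum_4acc(const short* row, unsigned int width) { // width % 4 == 0 assumed
        int s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        for (unsigned int j = 0; j < width; j += 4) {
            s0 += row[j];
            s1 += row[j + 1];
            s2 += row[j + 2];
            s3 += row[j + 3];
        }
        return (s0 + s1) + (s2 + s3);  // pair-wise final reduction
    }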

People who complain about Intel's inability to push performance much each generation under the SISD model:

1) Don't understand that twits are building their software, making choices which produce less-than-optimal algorithms and, on top of that, less-than-optimal assembly instructions

2) Have never looked at what SIMD gets them to see the fallout of those decisions

3) Have no clue how difficult it is to widen the pipeline very far beyond what it already is.

Total Row Sum CilkPlus Serial

Average Elapsed Time (-O0) : 7.67 seconds (-Ofast) 1.664 seconds

Standard Deviation (-O0) : 0.009 seconds (-Ofast) 0.002 seconds

std::vector<int> row_sums(const std::vector<std::vector<short>>& data) {
    unsigned int height = data.size(), width = data[0].size();
    std::vector<int> totalSums(height, 0);
    for(unsigned int i = 0; i < height; i++) {
        totalSums[i] = __sec_reduce_add(data[i].data()[0:width]);
    }
    return totalSums;
}

Even under -O0, CilkPlus provides a 3.48x speedup over the -O0 basic version. With -Ofast, it's a 3.38x speedup over the basic version.

Against the unrolled serial version, it's a 2.27x speedup with -O0. With -Ofast it's a 1.2x (20%) speedup.

Again we see the power of some of the most carefully crafted data reduction libraries in the world. If you want a detailed view of CilkPlus, check this out. Remember this cardinal rule: if you can reduce your problem to array manipulations, do it, and keep this library in mind.
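
For readers without icpc: the same row reduction can be written in portable C++ with std::accumulate, though you then lean on your compiler's vectorizer rather than CilkPlus's hand-tuned reducers. A minimal sketch:

    #include <numeric>
    #include <vector>

    // Portable equivalent of the CilkPlus call above: __sec_reduce_add over a
    // row behaves like std::accumulate over the same contiguous range.
    std::vector<int> row_sums_std(const std::vector<std::vector<short>>& data) {
        std::vector<int> totalSums(data.size(), 0);
        for (std::size_t i = 0; i < data.size(); i++)
            totalSums[i] = std::accumulate(data[i].begin(), data[i].end(), 0);
        return totalSums;
    }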

Total Row Sum CilkPlus With Unrolled Loops

Average Elapsed Time (-O0) : 7.73 seconds (-Ofast) 1.645 seconds

Standard Deviation (-O0) : 0.021 seconds (-Ofast) 0.006 seconds

std::vector<int> row_sums(const std::vector<std::vector<short>>& data) {
    unsigned int height = data.size(), width = data[0].size();
    std::vector<int> totalSums(height, 0);
    for(unsigned int i = 0; i < height; i += 16) {
        totalSums[i]    = __sec_reduce_add(data[i].data()[0:width]);
        totalSums[i+1]  = __sec_reduce_add(data[i+1].data()[0:width]);
        totalSums[i+2]  = __sec_reduce_add(data[i+2].data()[0:width]);
        totalSums[i+3]  = __sec_reduce_add(data[i+3].data()[0:width]);
        totalSums[i+4]  = __sec_reduce_add(data[i+4].data()[0:width]);
        totalSums[i+5]  = __sec_reduce_add(data[i+5].data()[0:width]);
        totalSums[i+6]  = __sec_reduce_add(data[i+6].data()[0:width]);
        totalSums[i+7]  = __sec_reduce_add(data[i+7].data()[0:width]);
        totalSums[i+8]  = __sec_reduce_add(data[i+8].data()[0:width]);
        totalSums[i+9]  = __sec_reduce_add(data[i+9].data()[0:width]);
        totalSums[i+10] = __sec_reduce_add(data[i+10].data()[0:width]);
        totalSums[i+11] = __sec_reduce_add(data[i+11].data()[0:width]);
        totalSums[i+12] = __sec_reduce_add(data[i+12].data()[0:width]);
        totalSums[i+13] = __sec_reduce_add(data[i+13].data()[0:width]);
        totalSums[i+14] = __sec_reduce_add(data[i+14].data()[0:width]);
        totalSums[i+15] = __sec_reduce_add(data[i+15].data()[0:width]);
    }
    return totalSums;
}

As with all good things, eventually we hit diminishing returns, and sometimes neither you nor the compiler can do anything more without diving into the assembly code itself. The differences between the 1-, 2-, 4-, 8-, and 16-unrolled versions are tiny; they all perform within milliseconds of each other, and in fact performance gets perceptibly worse at each step from 2 to 4 to 8 to 16.

Total Row Sum Parallel Loop V1

Average Elapsed Time (-O0) : 8.780 seconds (319% CPU usage) (-Ofast) 2.380 seconds (260% CPU usage)

Standard Deviation (-O0) : 0.011 seconds (001% CPU usage) (-Ofast) 0.000 seconds (001% CPU usage)

std::vector<int> row_sums(const std::vector<std::vector<short>>& data) {
    unsigned int height = data.size(), width = data[0].size();
    std::vector<int> totalSums(height, 0);
    #pragma omp parallel for
    for(unsigned int i = 0; i < height; i++) {
        for(unsigned int j = 0; j < width; j++) {
            totalSums[i] += data[i][j];
        }
    }
    return totalSums;
}

The bottleneck in this version is in fact the same one which appeared in my OpenMP program for Part 1, except there I was stressing the cache even more: the misses forced by changing rows across 4 threads, while trying to stream 20,000 shorts at a time for parallel independent adds into partial sums, which were themselves also in the cache and in heap memory (vectors create their arrays on the heap). With a data set this size, I'm exceeding what the cache and TLB can keep up with on a consumer chip using dual-channel RAM. With lower-latency, higher-bandwidth RAM and an overclock of the cache, this bottleneck would mostly evaporate. Having to pass back a huge data set also requires it be allocated to the stack frame. I'm being a bit lazy here; if I tidied this up to pass back only a pointer or reference to the vector, I could push usage back into the 300+% range for 4 threads.
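
A sketch of that tidying-up (not the measured code): let main own the result vector and have row_sums fill it through a reference, so nothing is copied on return:

    #include <omp.h>
    #include <vector>

    // Out-parameter variant: the caller owns totalSums and row_sums_into
    // fills it in place, so no result vector travels back by value.
    void row_sums_into(const std::vector<std::vector<short>>& data,
                       std::vector<int>& totalSums) {
        unsigned int height = data.size(), width = data[0].size();
        totalSums.assign(height, 0);
        #pragma omp parallel for
        for (unsigned int i = 0; i < height; i++)
            for (unsigned int j = 0; j < width; j++)
                totalSums[i] += data[i][j];
    }

    // Usage: std::vector<int> result; row_sums_into(data, result);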

For reference, even with 8 threads (which does perform better than 4, centered around 2.16 seconds), we only achieve a 3.48/2.16 = 1.61x speedup, not remotely acceptable for the amount of extra heat and electricity used, not to mention locking down all the cores for the task and impacting the whole system. For this implementation, with dual-channel RAM anyway, the optimal number of threads is 2. It's very important to know how your program scales under multithreading when your data is absolutely huge. This is also why GPGPUs with dedicated onboard memory became far better suited to this task: the CPU is fighting a bandwidth bottleneck and noisy neighbors in the cache when dealing with data this vast and spatially local in memory.
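
If you want to pin the thread count in code rather than through OMP_NUM_THREADS, a minimal sketch (the function and its body are illustrative, not the benchmarked code):

    #include <omp.h>
    #include <vector>

    // Capping the team size when 2 threads already saturate memory bandwidth:
    // either export OMP_NUM_THREADS=2 in the shell, or do it in code.
    void scaled_copy(std::vector<int>& out, const std::vector<int>& in) {
        omp_set_num_threads(2);                   // process-wide default from here on

        #pragma omp parallel for num_threads(2)   // or a per-region override
        for (std::size_t i = 0; i < in.size(); i++)
            out[i] = in[i] * 2;                   // stand-in for bandwidth-bound work
    }

For sanity's sake, let's see what loop unrolling gets us.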

Total Row Sum Parallel Loop V1 Unrolled

Average Elapsed Time (-O0) : 6.400 seconds (289% CPU usage) (-Ofast) 1.630 seconds (201% CPU usage)

Standard Deviation (-O0) : 0.000 seconds (001% CPU usage) (-Ofast) 0.007 seconds (001% CPU usage)

std::vector<int> row_sums(const std::vector<std::vector<short>>& data) {
    unsigned int height = data.size(), width = data[0].size();
    std::vector<int> totalSums(height, 0);
    #pragma omp parallel for
    for(unsigned int i = 0; i < height; i++) {
        for(unsigned int j = 0; j < width; j += 16) {
            totalSums[i] += data[i][j] + data[i][j+1] + data[i][j+2] + data[i][j+3]
                + data[i][j+4] + data[i][j+5] + data[i][j+6] + data[i][j+7]
                + data[i][j+8] + data[i][j+9] + data[i][j+10] + data[i][j+11]
                + data[i][j+12] + data[i][j+13] + data[i][j+14] + data[i][j+15];
        }
    }
    return totalSums;
}

About a 1.25x speedup, though again it should be noted that with 2 threads the average time is 1.68 seconds at 135% CPU usage. Using more than 2 threads is wasteful with this model. And again, unrolling beyond 4 yields near-negligible gains.

Total Row Sum Parallel Loop V2

Average Elapsed Time (-O0) : 6.980 seconds (296% CPU usage) (-Ofast) 1.690 seconds (205% CPU usage)

Standard Deviation (-O0) : 0.017 seconds (002% CPU usage) (-Ofast) 0.014 seconds (001% CPU usage)

std::vector<int> row_sums(const std::vector<std::vector<short>>& data) {
    unsigned int height = data.size(), width = data[0].size();
    std::vector<int> totalSums(height, 0);
    for(unsigned int i = 0; i < height; i++) {
        int psum = 0;
        #pragma omp parallel for reduction(+ : psum)
        for(unsigned int j = 0; j < width; j++) {
            psum += data[i][j];
        }
        totalSums[i] = psum;
    }
    return totalSums;
}

Hilariously, I actually get slightly (~15 milliseconds) better times using 2 threads rather than 4, and CPU usage then averages 134%. Be aware you cannot use arrays, vectors, or array/vector elements as the target of an OpenMP reduction clause. You used to be able to in OpenMP 3.0, and it's a shame the maintainers removed it, especially when the compiler just optimizes around the restriction anyway and effectively doesn't use the psum variable (if you look at the generated assembly).
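
For reference, here is that scalar-reduction workaround in isolation, a minimal sketch assuming omp.h is included:

    // The reduction list item is a plain named scalar, which every OpenMP
    // version accepts; the result is stored to the array element afterward.
    int sum_row(const short* row, int width) {
        int psum = 0;
        #pragma omp parallel for reduction(+ : psum)
        for (int j = 0; j < width; j++)
            psum += row[j];
        return psum;
    }

Now let's unroll it and see what we get.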

Total Row Sum Parallel Loop V2 Unrolled

Average Elapsed Time (-O0) : 6.380 seconds (286% CPU usage) (-Ofast) 1.740 seconds (207% CPU usage)

Standard Deviation (-O0) : 0.017 seconds (002% CPU usage) (-Ofast) 0.000 seconds (001% CPU usage)

std::vector<int> row_sums(const std::vector<std::vector<short>>& data) {
    unsigned int height = data.size(), width = data[0].size();
    std::vector<int> totalSums(height, 0);
    for(unsigned int i = 0; i < height; i++) {
        int psum = 0;
        #pragma omp parallel for reduction(+ : psum)
        for(unsigned int j = 0; j < width; j += 8) {
            psum += (((data[i][j] + data[i][j+1]) + (data[i][j+2] + data[i][j+3]))
                     + ((data[i][j+4] + data[i][j+5]) + (data[i][j+6] + data[i][j+7])));
        }
        totalSums[i] = psum;
    }
    return totalSums;
}

As it turns out, unrolling to 16 increases the running time. Unrolling to 2, 4, or 8 gets us better times at -O0 but worse times at -Ofast (there are intricacies you have to be aware of in how different optimization flags, and even different compilers, behave), with 4 and 8 performing best and equally well. It's not entirely clear why, though chances are the cause is in the OpenMP runtime. Results may improve with higher-bandwidth, lower-latency memory, so I will leave the 8-unrolled version as the reference. The nesting of the individual sums can improve or worsen the runtime; in my case it improved it by about 20 milliseconds.

Already we have seen one CilkPlus implementation, but that was across multiple rows of data rather than a single one, causing noisy neighbor issues on the cache. Next we will see how CilkPlus may be able to further improve on our reduction.

Total Row Sum Parallel Loop CilkPlus Row Full Addition

Average Elapsed Time (-O0) : 3.783 seconds (215% CPU usage) (-Ofast) 1.600 seconds (196% CPU usage)

Standard Deviation (-O0) : 0.016 seconds (002% CPU usage) (-Ofast) 0.001 seconds (001% CPU usage)

std::vector<int> row_sums(const std::vector<std::vector<short>>& data) {
    unsigned int height = data.size(), width = data[0].size();
    std::vector<int> totalSums(height, 0);
    #pragma omp parallel for
    for(unsigned int i = 0; i < height; i++) {
        totalSums[i] = __sec_reduce_add(data[i].data()[0:width]);
    }
    return totalSums;
}

With multiple threads calling for long chains of data, I believe we're especially running into bandwidth issues here. With only 2 threads, the CPU usage drops to 134%, but the elapsed time remains exactly the same; clearly 2 threads is optimal for this algorithm on my system. Oddly enough, dropping to 1 thread puts CPU usage at 99%, and the average time only increases to 1.66 seconds, which seems odd until you remember I'm running on a hyperthreaded machine, which can let both the main and team threads occupy a single core and possibly achieve better results. Without assembly analysis this is conjecture, but it's all the more proof that thread-level parallelism is only sometimes useful.

Total Row Sum Parallel Loop CilkPlus Row Chunk Addition

Average Elapsed Time (-O0) : 4.110 seconds (226% CPU usage) (-Ofast) 1.745 seconds (207% CPU usage)

Standard Deviation (-O0) : 0.023 seconds (002% CPU usage) (-Ofast) 0.016 seconds (002% CPU usage)

std::vector<int> row_sums(const std::vector<std::vector<short>>& data) {
    unsigned int height = data.size(), width = data[0].size(),
        blockSize = width / omp_get_max_threads(), offset = width % blockSize,
        endPoint = width + offset;
    std::vector<int> totalSums(height, 0);
    for(unsigned int i = 0; i < height; i++) {
        int psum = 0;
        #pragma omp parallel for reduction(+ : psum) schedule(guided)
        for(unsigned int j = 0; j < endPoint; j += blockSize) {
            if((j + blockSize) > width)
                psum += __sec_reduce_add(data[i].data()[j:(width-j)]);
            else
                psum += __sec_reduce_add(data[i].data()[j:blockSize]);
        }
        totalSums[i] = psum;
    }
    return totalSums;
}

With 2 threads CPU usage drops to 134%, but performance differs by at most 40 milliseconds.

If you were unsure of any bandwidth bottlenecks before, you should now be convinced they exist and are choking this program's scaling and runtime potential. In a later blog post I'll talk about using profilers to reach this conclusion scientifically. The Intel CilkPlus libraries make extensive use of SSE 4.2 and AVX instructions, meaning they stream in data from long vectors and process it using the most efficient SIMD and vector instructions at the CPU's disposal. Due to the absence of memory reuse (I don't go back and re-iterate over older sections of the matrix), I'm falling victim to RAM not keeping my cache fed. If you decrease the data size for any of these programs, you should see CPU usage go up drastically, but copying the result vector is still a non-parallelizable operation, so you must find a way to pass back a pointer or reference if you want to alleviate that bottleneck as well.

That is a lot of ground covered, using a variety of techniques to attack and optimize a single problem. The CilkPlus implementations have universally done better than their competition, but the competition isn't off by a huge percentage either. And sometimes, even without more than 1 OMP thread available, the OMP versions perform better than the non-OMP versions, which is somewhat mind-boggling given the overhead OpenMP (generally) imposes. The best overall solution for a data set this large is Total Row Sum Parallel Loop CilkPlus Row Full Addition. The total speedup over the plain serial solution is 26.72/1.600 = 16.7x. Not bad at all.

These are not the most flattering examples of OpenMP's performance potential and scaling ability, so for now you'll have to take my word for it that scaling to 4+ cores is actually rather simple with this framework. I simply chose a data set too large and rediscovered a bottleneck I'd learned to avoid in my HPC class, which tended to feature a fair amount of memory reuse (such as finding a sub-image in a larger image, performing block matrix multiplication, or calculating/verifying various properties of numbers such as primality, factorability, palindrome form, or prime factorization). In the next post I'll show how CilkPlus may be used to produce staggered sums, perform more advanced data traversals, and become an extremely useful tool for building a task-parallel program.

EDIT: It's the simplest things that evade us when we try to dive deep for long periods of time. It finally dawned on me I was making our matrix in a very stupid way. Here's the revised version including all runtime and usage statistics. I'm still seeing bad scaling, but not nearly on the level I was previously. I suspect a bandwidth bottleneck may be the culprit given the optimal performance is with 2 threads.

Optimal Parallel Solution Revision 2

***With 4 Threads***

Average Elapsed Time (-O0) : 1.80 seconds (373% CPU usage) (-Ofast) 1.001 seconds (279% CPU usage)

Standard Deviation (-O0) : 0.21 seconds (009% CPU usage) (-Ofast) 0.019 seconds (002% CPU usage)

***Note that without optimization flags the performance was rather erratic, with lots of deviation in timing. Also note the CPU usage is far lower in the version compiled with optimization enabled.***

***With 2 threads***

Average Elapsed Time (-O0) : 2.86 seconds (194% CPU usage) (-Ofast) 0.921 seconds (174% CPU usage)

Standard Deviation (-O0) : 0.01 seconds (002% CPU usage) (-Ofast) 0.004 seconds (001% CPU usage)

***Note the less erratic behavior when using 2 threads and the lower timings***

#include <iostream>
#include <vector>
#include <sstream>
#include <omp.h>

void fill_vector(std::vector<std::vector<short>>& data, unsigned int rows,
                 unsigned int cols) {
    // resize (not reserve) so the row slots exist before the parallel writes;
    // reserve alone leaves size() == 0 and indexing would be undefined behavior.
    data.resize(rows);
    #pragma omp parallel for
    for (unsigned int i = 0; i < rows; i++) {
        data[i] = std::vector<short>(cols);
    }
}

std::vector<int> row_sums(const std::vector<std::vector<short>>& data) {
    unsigned int height = data.size(), width = data[0].size();
    std::vector<int> totalSums(height, 0);
    #pragma omp parallel for
    for(unsigned int i = 0; i < height; i++) {
        totalSums[i] = __sec_reduce_add(data[i].data()[0:width]);
    }
    return totalSums;
}

int main(int argc, char** argv) {
    if (argc < 3) {
        std::cout << "Run program as \"executable <rows> <columns>\"\n";
    } else {
        std::stringstream args;
        args << argv[1] << " " << argv[2];
        unsigned int rows, columns;
        args >> rows >> columns;
        std::vector<std::vector<short>> data;
        fill_vector(data, rows, columns);
        std::vector<int> rowSums = row_sums(data);
    }
}
