High Performance Computing: Basic Optimization Part 1

patrickjp93


We often hear developers and gamers complaining that games aren't optimized for AMD or Nvidia graphics hardware, or that they choke on the CPU, whether through a terrible single-threaded implementation or bad multithreaded code. In reality, optimizing C++ code is not very difficult, and game developers should frankly be ashamed of the excuses they hide behind. Here I have supplied a rudimentary program which may not appear very useful or worthwhile to the average reader, but in the hands of someone who can manage data well, these exact routines can serve as part of a CPU-based physics engine or artificial intelligence routine.

Here are a few ways to implement summing the columns of a matrix. You may not find this particularly interesting, but for any mathematical or statistical reduction, it's best to get all of your data into matrix form, be it a 2D array or a flat buffer you traverse using multiplicative offsets. Furthermore, it's best to know the order in which your programming language stores data: traversing in the wrong order can have disastrous consequences for your runtime (yay, foreshadowing).
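As a quick illustration of the flat-buffer idea (this sketch is mine, not part of the benchmark code below, and the helper name is made up), element (i, j) of a row-major matrix stored in one contiguous buffer lives at index i * width + j:

#include <vector>

// Row-major flat buffer: row i occupies indices [i * width, (i + 1) * width),
// so walking j for fixed i touches contiguous memory.
short element_at(const std::vector<short> &buf, unsigned int width,
                 unsigned int i, unsigned int j) {
  return buf[i * width + j];
}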

Trials run per code base: 6

CPU: Core i7-2600K, stock clocks & settings, watercooled (no thermal throttling possible)

RAM: dual-channel G.Skill Trident X 1866 (3x4GB)

OS: Ubuntu 15.04 Vivid Vervet fully updated as of 5/14/15

Compiler: icpc (ICC) 15.0.3 20150407 Copyright © 1985-2015 Intel Corporation. All rights reserved.

Compiler Flags: -Wall -Wextra -std=c++11 -g -Ofast dummy.cpp -o dummy

Timing Specifications: /usr/bin/time -v ./dummy 115000 20000

Major (requiring disk I/O) page faults: 0 (very important that we're not hitting virtual memory on the hard drives here)

CPU Usage: 99% for all valid runs

Be aware that because Sandy Bridge's AVX covers only 256-bit floating point (its integer SIMD is still 128-bit) and it lacks FMA entirely, you will see better performance on Haswell (AVX2 plus FMA3), but you will need a lot of RAM to see how it scales: at 115,000 x 20,000 the shorts alone come to roughly 4.3 GiB before any per-row vector overhead. I use up most of my available 10 GB, and the system has the last 2.

Naive/First-Instinct Serial Solution (some have noted I forgot to use a reference for my function's parameter; adding it shaves only a single second off the runtime, insignificant until later)

Elapsed time average: 53.46 seconds

Standard deviation: 0.27 seconds

 

#include <iostream>
#include <vector>
#include <stdexcept>
#include <sstream>

// Naive version: the outer loop walks columns, so each inner-loop access
// jumps to a different row and almost every load misses the cache.
std::vector<int> col_sums(const std::vector<std::vector<short>> &data) {
  unsigned int height = data.size(), width = data[0].size();
  std::vector<int> sums(width, 0);
  for (unsigned int i = 0; i < width; i++) {
    for (unsigned int j = 0; j < height; j++) {
      sums[i] += data[j][i];
    }
  }
  return sums;
}

int main(int argc, char** argv) {
  if (argc < 3) {
    std::cout << "Run program as \"executable <rows> <columns>\"\n";
  } else {
    std::stringstream args;
    args << argv[1] << " " << argv[2];
    int rows, columns;
    args >> rows >> columns;
    std::vector<std::vector<short>> data(rows, std::vector<short>(columns));
    std::vector<int> columnSums = col_sums(data);
  }
}

 

C++ Favorable Serial Solution

Elapsed time average: 4.90 seconds (that's right, a 10.91× speedup from one tiny change)

Standard deviation: 0.003 seconds

 

// The tiny change: swap the loop order so the inner loop walks each row
// contiguously (row-major order), letting the prefetcher and caches work.
std::vector<int> col_sums(const std::vector<std::vector<short>> &data) {
  unsigned int height = data.size(), width = data[0].size();
  std::vector<int> sums(width, 0);
  for (unsigned int i = 0; i < height; i++) {
    for (unsigned int j = 0; j < width; j++) {
      sums[j] += data[i][j];
    }
  }
  return sums;
}

 

And that's just from understanding how C++ stores multi-dimensional arrays (row-major order) and how the translation lookaside buffer and caches keep spatially local data close to the CPU. Traversing by column causes a huge number of cache misses which, as you can see, severely diminishes your performance. Feel free to experiment by adding columns, narrowing the rows, widening them, etc. This second algorithm may look slightly obfuscated to a novice or to someone who doesn't know the nitty-gritty details of a language's implementation (*cough* game programmers *cough*). Believe it or not, I found an almost exact copy of this naivete in the physics engine of the Unity Engine. You wouldn't believe how hard I had to fight to get it changed, including providing extensive evidence you can reproduce here for yourselves. I could not believe how impossible it was to convince three programmers that some upstart nobody had found something they could vastly improve in their code. If I ever come across as an elitist, it's because I'm jaded by industry people who assume stuck-up college kids know nothing.

Now, using Intel's CilkPlus extensions (which run just as well on AMD processors if you compile with g++ or clang++, though g++ 4.9.2 has some CilkPlus bugs right now), we can do even better before going multi-threaded. To my knowledge, Microsoft's Visual C/C++ compiler does not yet support CilkPlus at all.

Good CilkPlus Solution (for g++ or clang++, must use -fcilkplus flag)

Elapsed time average: 3.00 seconds (almost a 40% speedup over the good C++ algorithm)

Standard deviation: 0.000 seconds

 

// CilkPlus array notation: each iteration adds an entire row to the running
// sums in one vectorizable array-section operation.
std::vector<int> col_sums(const std::vector<std::vector<short>> &data) {
  unsigned int height = data.size(), width = data[0].size();
  std::vector<int> sums(width, 0);
  for (unsigned int i = 0; i < height; i++) {
    sums.data()[0:width] += data[i].data()[0:width];
  }
  return sums;
}

 

Bad CilkPlus Solution (for g++ or clang++, must use -fcilkplus flag)

Elapsed time average: 50.80 seconds (better than the bad algorithm in raw C++, but terrible nonetheless)

Standard deviation: 0.29 seconds

 

// Column-by-column again: __sec_reduce_add() folds one column across all
// rows, but the strided accesses still thrash the caches.
std::vector<int> col_sums(const std::vector<std::vector<short>> &data) {
  unsigned int height = data.size(), width = data[0].size();
  std::vector<int> sums(width, 0);
  for (unsigned int i = 0; i < width; i++) {
    sums[i] = __sec_reduce_add(data.data()[0:height].data()[i]);
  }
  return sums;
}

 

CilkPlus even manages to claw something back when your algorithm doesn't use your hardware resources optimally, but it doesn't get you much.

OpenMP and CilkPlus Combined Solution (export OMP_NUM_THREADS=4; compile with -fopenmp, plus -fcilkplus if using g++ or clang++)

CPU Usage: 391%

Elapsed time average: 1.19 seconds (more than 60% faster than the plain CilkPlus solution; there is a barrier to scaling here, set by the critical section, which must be resolved)

Standard deviation: 0.0058 seconds

***Note: using 2 threads gives 0.92 seconds elapsed time and lower (174%) CPU usage. This is a lesson that parallelization can yield diminishing returns and even lose you ground if you aren't careful.***

Proposals: chunk-wise reduction from partials to total (sketched after the code below). Edit: no gains. Still confused.

 

#include <iostream>
#include <vector>
#include <stdexcept>
#include <sstream>
#include <omp.h>

// Allocate the rows in parallel: many small allocations spread across
// threads beat one giant serial allocation for a structure this large.
void fill_vector(std::vector<std::vector<short>>& data, unsigned int rows, unsigned int cols) {
  data.resize(rows);
  #pragma omp parallel for
  for (unsigned int i = 0; i < rows; i++) {
    data[i] = std::vector<short>(cols);
  }
}

std::vector<int> col_sums(const std::vector<std::vector<short>> &data) {
  unsigned int height = data.size(), width = data[0].size();
  std::vector<int> totalSums(width, 0), threadSums(width, 0);
  // Each thread accumulates into its own private copy of threadSums...
  #pragma omp parallel firstprivate(threadSums)
  {
    #pragma omp for nowait
    for (unsigned int i = 0; i < height; i++) {
      threadSums.data()[0:width] += data[i].data()[0:width];
    }
    // ...then merges into the shared total inside the critical section
    // (omp atomic only accepts scalar updates, not array sections).
    #pragma omp critical
    totalSums.data()[0:width] += threadSums.data()[0:width];
  }
  return totalSums;
}

int main(int argc, char** argv) {
  if (argc < 3) {
    std::cout << "Run program as \"executable <rows> <columns>\"\n";
  } else {
    std::stringstream args;
    args << argv[1] << " " << argv[2];
    unsigned int rows, columns;
    args >> rows >> columns;
    std::vector<std::vector<short>> data(rows);
    fill_vector(data, rows, columns);
    std::vector<int> columnSums = col_sums(data);
  }
}
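For the record, here is roughly what the proposed chunk-wise reduction looks like (my reconstruction in plain OpenMP; the partials name is illustrative). Each thread builds a private partial sum, and the merge then assigns each thread a disjoint chunk of columns, so no critical section is needed:

std::vector<int> col_sums_chunked(const std::vector<std::vector<short>> &data) {
  unsigned int height = data.size(), width = data[0].size();
  int nthreads = omp_get_max_threads();
  std::vector<std::vector<int>> partials(nthreads, std::vector<int>(width, 0));
  std::vector<int> totalSums(width, 0);
  #pragma omp parallel
  {
    int t = omp_get_thread_num();
    // Phase 1: each thread accumulates its share of rows into its own partial.
    #pragma omp for
    for (unsigned int i = 0; i < height; i++) {
      for (unsigned int j = 0; j < width; j++) {
        partials[t][j] += data[i][j];
      }
    }
    // Implicit barrier above; phase 2 merges column chunks with no locking,
    // since each thread writes a disjoint range of totalSums.
    #pragma omp for
    for (unsigned int j = 0; j < width; j++) {
      int sum = 0;
      for (int k = 0; k < nthreads; k++) {
        sum += partials[k][j];
      }
      totalSums[j] = sum;
    }
  }
  return totalSums;
}

As noted above, this measured no faster than the critical-section version on this machine.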

 

Sometimes it's the most obvious things that escape our sight when we're looking for ways to improve. Memory allocation takes a long time, and trying to do it in one single call was not appropriate for such a large structure. Reserving a single block that does nothing but point to the individual rows, and then making many calls to allocate much smaller pieces, takes far less time. I still believe I'm hitting a bandwidth bottleneck, since increasing the thread count increases the runtime, especially given that the threshold is 2 threads and not 3. I believe the increased runtime comes from contention on the DDR3 bus, but I lack a hardware profiler to confirm it (there are various levels of profilers for analyzing software and hardware performance, and pinpointing exactly where this delay is generated would most likely require a hardware one).

EDIT: The OpenMP solution was given further optimization, and scaling is now closer to expectation. However, there is more for me to think about; there's definitely still room for improvement.

11 Comments

That spatial coherence part on the CPU is something my mate doesn't get at all. He always writes his code to go down Y before across X, making the prefetched data almost useless. I didn't expect a 10× difference though; that's insane. It really shows how powerful that is.

The way OpenMP works is odd, but it makes sense. I recently found out that if you have an OpenMP parallel for of the form "for (int x = 0; x < 10000; x++)", it starts one thread at 0 and the other (assuming two threads) at 5000. This means that in a loop whose iterations vary in cost, the workload is not evenly distributed. The slow thread might also stop the other one from progressing; I'm not sure about this, and I don't have time to test at the moment.

PS: That mate also measures "efficiency" by the number of lines of code he has...


-snip-

 

In my HPC class, if you write more than 25 lines in any function, you get a 0 for that assignment. In fact, optimizing compilers generally start to royally screw up once functions get longer than about 20 lines, unless every line is a function call they can individually optimize, inline, and then re-optimize. Six nested loops do worse than three nested loops calling a function that contains another three nested loops (e.g. block matrix multiplication, sketched below).
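For illustration, here's a minimal sketch of that blocked (tiled) structure; the block size and function names are mine, purely illustrative:

// Multiply n x n matrices C += A * B one B_SZ x B_SZ tile at a time, so each
// tile's working set stays cache-resident. Assumes n is a multiple of B_SZ.
const int B_SZ = 64;

void matmul_tile(const float *A, const float *B, float *C, int n,
                 int i0, int j0, int k0) {
  // Inner 3-nested loop over a single tile.
  for (int i = i0; i < i0 + B_SZ; i++)
    for (int k = k0; k < k0 + B_SZ; k++)
      for (int j = j0; j < j0 + B_SZ; j++)
        C[i * n + j] += A[i * n + k] * B[k * n + j];
}

void matmul(const float *A, const float *B, float *C, int n) {
  // Outer 3-nested loop over tiles, calling the small function above.
  for (int i = 0; i < n; i += B_SZ)
    for (int j = 0; j < n; j += B_SZ)
      for (int k = 0; k < n; k += B_SZ)
        matmul_tile(A, B, C, n, i, j, k);
}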

 

If you print the totals vector, it comes out as all 0s because I obviously haven't given it any real data, so using dynamic or guided scheduling would get you nothing. All the rows are the same size for right now.

 

I've never seen a parallel for-loop do that to the threads. I've seen plenty of instances of parallel for that evenly distribute the load among 4 threads. Now, what you may be seeing is the main program thread spawning 4 threads and then idling until all 4 are ready to die, which is the fork-join model. The reason it's done this way is so you can use the nowait clause and move on to another task when synchronization doesn't have to be immediately tight.
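To make the scheduling point concrete, here's a tiny sketch (mine, not benchmark code) contrasting the default static schedule, which gives each thread one contiguous slice of the range, with dynamic scheduling, which hands out small chunks on demand and only pays off when iteration costs vary:

#include <cstdio>
#include <omp.h>

int main() {
  // static (the default): with two threads, thread 0 covers roughly
  // [0, 5000) and thread 1 covers [5000, 10000).
  #pragma omp parallel for schedule(static)
  for (int x = 0; x < 10000; x++) {
    if (x % 2500 == 0)
      std::printf("static:  thread %d ran iteration %d\n", omp_get_thread_num(), x);
  }
  // dynamic: idle threads grab the next 64-iteration chunk, balancing
  // loops whose iterations differ in cost.
  #pragma omp parallel for schedule(dynamic, 64)
  for (int x = 0; x < 10000; x++) {
    if (x % 2500 == 0)
      std::printf("dynamic: thread %d ran iteration %d\n", omp_get_thread_num(), x);
  }
}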


-snip-

The next entry is up if you're interested. I explore a whole bunch of different combinations of optimization techniques on a very similar problem.


-snip-

I get that; I've yet to do a paper involving optimization for the compiler (next year, I think), but what I meant by him calling the number of lines "efficiency" is total lines, not lines per function. He just kind of expects the compiler to understand everything he does.

It's interesting that you're seeing something different in OpenMP. I've only noticed it once, when I made a Mersenne prime calculator and had the threads print their respective values of x, where x was the iteration number (quite possibly because my code was mainly brute force).

I wonder how these optimizations carry over to GPU compute, as that tends to be more what I do.


-snip-

Well, in this instance I have a strange scaling problem I can't seem to get around. With 4 threads I can't break 200% utilization no matter what size data set I use (1000x1000 should be small enough to avoid a huge allocation time and any bandwidth issues), and frankly that makes no sense. I'm working on Part 3, which combines task parallelism and data parallelism; my example for that is a homework from my high performance computing class where I easily get 390%+ CPU usage on 4 threads. I'm literally stumped, and it's the first time all semester I couldn't just zero in on the problem and fix it.


-snip-

I greatly improved the OpenMP solution if you'd like to view it. Sometimes it's the obvious stuff that goes right over your head when you've been thinking about a problem too long. I tried to do each of these pages in a single day, and I think that's why I got stuck. There's still more room for improvement, but I'm glad I made the breakthrough of setting up the data structure using multiple threads. I feel dumb for not thinking of it sooner.

Just now, DXMember said:

Would you be willing to format these as well?

And thank you very much; your effort is appreciated.

I will tomorrow. For now, sleep.

18 hours ago, DXMember said:

Would you be willing to format these as well?

And thank you very much; your effort is appreciated.

Done.
