Ideas for more balanced multithreading in C++?

Without going into too many details, let's say I have about 5000 5x5 matrices that are continuously being updated. I'm calculating the eigenvalues of these matrices in C++ and sometimes performing other operations depending on the values (the number of operations performed depends on the returned value).  To save time, I set up multithreading with the STL via std::thread.  Currently, the method of balancing this calculation is very naive: if the user selects 8 threads, I process 625 of these matrices on each thread (8 x 625 = 5000).  With 8 threads, it is obvious that at least 2 threads are taking more than 100% longer than the other 6 (this machine has 12 CPU cores [not threads, cores], so that shouldn't be the issue).  These matrices are not randomly generated and evolve through time, so the complexity of one group shouldn't change significantly from one frame to the next.

 

Currently I'm thinking I can define a 'manager' class which times the execution of each group of matrices and moves matrices out of the slow groups and into faster groups to balance the load over time; however, I'm interested in what other methods exist, especially ones which can be easily implemented.

 

Suggestions?

 

edit: I suppose I could always just process 1 matrix per thread and continually re-join processor threads as I go along, eh?
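
A cheaper variant of that idea would be to keep a fixed set of worker threads alive and have each one pull the next matrix index from a shared std::atomic counter, so threads that land on cheap matrices simply claim more of them. A minimal sketch, where process_matrix is a hypothetical stand-in for the real per-matrix work:

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for the eigenvalue calculation and follow-up work.
void process_matrix(std::size_t i);

void process_all(std::size_t count, unsigned num_threads)
{
    std::atomic<std::size_t> next{0};   // index of the next unclaimed matrix

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&] {
            // Each worker grabs the next unprocessed matrix until none remain,
            // so faster threads naturally take on more matrices.
            for (std::size_t i = next.fetch_add(1); i < count;
                 i = next.fetch_add(1)) {
                process_matrix(i);
            }
        });
    }
    for (auto& w : workers) w.join();   // one join per thread per frame
}
```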

Use a producer/consumer pattern, where each thread is a consumer. Each consumer also has a bucket. The producer submits work to a consumer until its bucket is full, then moves on to the next one and does the same. Every so often, have the producer refill everyone's buckets.
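
A minimal sketch of that scheme, assuming the work items are matrix indices; Bucket, consumer, producer, and process_matrix are all illustrative names:

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <vector>

// Hypothetical stand-in for the per-matrix work.
void process_matrix(std::size_t i);

struct Bucket {
    std::mutex m;
    std::condition_variable cv;
    std::queue<std::size_t> work;   // indices of matrices to process
    bool done = false;
};

// Each consumer thread runs this against its own bucket.
void consumer(Bucket& b)
{
    for (;;) {
        std::unique_lock<std::mutex> lock(b.m);
        b.cv.wait(lock, [&] { return !b.work.empty() || b.done; });
        if (b.work.empty()) return;     // producer finished, bucket drained
        std::size_t i = b.work.front();
        b.work.pop();
        lock.unlock();                  // never hold the lock while working
        process_matrix(i);
    }
}

// The producer tops up each bucket in turn until all work is handed out.
void producer(std::vector<Bucket>& buckets, std::size_t count,
              std::size_t bucket_size)
{
    std::size_t next = 0;
    while (next < count) {              // a real producer would block here
        for (auto& b : buckets) {       // rather than spin when all are full
            std::lock_guard<std::mutex> lock(b.m);
            while (b.work.size() < bucket_size && next < count)
                b.work.push(next++);
            b.cv.notify_one();
        }
    }
    for (auto& b : buckets) {           // signal the consumers to finish up
        { std::lock_guard<std::mutex> lock(b.m); b.done = true; }
        b.cv.notify_one();
    }
}
```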


11 hours ago, 7he404guy said:

if the user selects 8 threads, I process 625 of these matrices on each thread (8 x 625 = 5000).  With 8 threads, it is obvious that at least 2 threads are taking more than 100% longer than the other 6 [...]

I'm assuming you have your 5000 matrices in a contiguous memory block, like an array or a std::vector, all next to each other?

If so, it is possible that the performance discrepancy you're seeing from those 2 cores is due to false sharing at the "edges", where 2 blocks of 625 matrices meet. (Assuming there isn't something else in your code causing this. No races anywhere? For sure?)

 

Cache lines on modern systems tend to be 64 bytes wide, and a cache line is the smallest unit the cache controller can work with. That means even when you change only a single byte, a whole 64-byte cache line has to be brought in, modified, and written back. Everywhere a block of 625 matrices ends and the next one begins, there will be a cache line that contains a bit of the end of the first block and a bit of the beginning of the next block. Each of the 2 cores working on those 2 blocks has its own copy of that cache line in its L1/L2 cache. Whenever one core makes a change to its bit of the cache line, it invalidates the entire cache line for the other core, causing a reload from higher up the memory hierarchy. Thus, both cores will ping-pong back and forth, clobbering each other's cache line.

 

You might think: "But it only happens on a tiny section near the edges of 2 blocks". Yes, but cache misses really are *that* expensive; they could cause exactly this kind of slowdown.

"Why only on 2 cores?" It depends on the size of your matrix elements and the alignment of how everything ends up in memory. Some blocks might only share a few bytes in a cache line while other blocks might share a cache line 50/50.

 

To test, I'd create 8 separate std::vectors of, say, 650 matrices instead of 625, leave the last 25 of each unused as padding, and retest.
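
For example, something like this; Matrix5x5 and the constants are assumptions, the point is only the padding at the tail of each per-thread vector:

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Illustrative layout only; Matrix5x5 and the constants are assumptions.
struct Matrix5x5 { double m[5][5]; };

constexpr std::size_t kThreads   = 8;
constexpr std::size_t kPerThread = 625;
constexpr std::size_t kPadding   = 25;   // unused tail entries, padding only

// One separate vector per thread; the 25 unused matrices at the end of each
// block keep one thread's data from sharing a cache line with the next block.
std::array<std::vector<Matrix5x5>, kThreads> blocks;

void init_blocks()
{
    for (auto& b : blocks)
        b.resize(kPerThread + kPadding);  // threads only ever touch [0, 625)
}
```

Each std::vector also gets its own heap allocation, which by itself tends to separate the blocks in memory; the padding makes the test conclusive even if the allocator happens to place them back to back.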

 

Also look out for possibilities of false sharing in the rest of your code. Try to move as much per-thread state as possible onto the stack of the actual thread, as each thread has its own separate stack space, typically far enough away from the others.

4 hours ago, Unimportant said:

Yes, but cache misses really are *that* expensive; they could cause exactly this kind of slowdown.

^^ This. Cache misses can be so expensive that a relatively small number of misses can completely negate the overall performance gain of having a cache, or even give worse performance than having no cache at all.

 

16 hours ago, 7he404guy said:

Currently I'm thinking I can define a 'manager' class which times the execution of each group of matrices and moves matrices out of the slow groups and into faster groups to balance the load over time

One of the expensive parts here is creating and destroying the threads. You have to set everything up, ask the OS to allocate resources, wait for the OS to schedule the thread, then wait for the OS to destroy the thread and release its resources... All of that can take a significant amount of time.

One method that you can use is called thread pooling. In thread pooling, you have a queue of processing requests, a pool of threads, and a queue of completed requests. The thread pool keeps track of which threads are busy and which are idle. When you get a processing request, you hand it to an idle thread in the pool. This amortizes the cost of creating and destroying threads.
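
A minimal sketch of such a pool, not a production implementation:

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// A minimal fixed-size thread pool: threads are created once and re-used,
// amortizing the cost of thread creation/destruction across all requests.
class ThreadPool {
public:
    explicit ThreadPool(unsigned n)
    {
        for (unsigned i = 0; i < n; ++i)
            workers_.emplace_back([this] { run(); });
    }

    ~ThreadPool()
    {
        { std::lock_guard<std::mutex> lock(m_); stop_ = true; }
        cv_.notify_all();
        for (auto& w : workers_) w.join();
    }

    void submit(std::function<void()> task)
    {
        { std::lock_guard<std::mutex> lock(m_); tasks_.push(std::move(task)); }
        cv_.notify_one();
    }

private:
    void run()
    {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(m_);
                cv_.wait(lock, [this] { return stop_ || !tasks_.empty(); });
                if (stop_ && tasks_.empty()) return;
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();   // execute outside the lock
        }
    }

    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> tasks_;
    std::vector<std::thread> workers_;
    bool stop_ = false;
};
```

Per frame you would then submit the matrix chunks as tasks to a long-lived pool and wait for them to complete (e.g. with an atomic counter or futures), instead of spawning fresh threads.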

 

One thing that can possibly be done with this method is to dynamically change the number of threads in the pool to find the point that yields the highest throughput. This will be easiest when your thread pool rarely has an idle thread, that is, when the request queue is rarely empty. Your mileage may vary with this type of optimization; I haven't thoroughly experimented with it.

 


13 hours ago, Unimportant said:

when you change only a single byte, a whole 64-byte cache line has to be brought in, modified, and written back

The way everything is handled, the original matrices are never modified.  They are only read.  A result vector is modified but is never read from within the thread and is processed after all threads have joined (nothing that would significantly slow it though).  In each instance, the memory is already allocated for the matrix and result vector.

 

Tried implementing the producer/consumer pattern as well, but that doesn't seem to have helped anything.  Usage still sits at about 100% on one or two cores and occasionally jumps higher.  Also tried a few other arrangements, but that didn't change anything either.

 

I'm worried that the operations I have to perform after getting all the eigenvalues could be slowing things down, because they also involve allocating some memory during intermediate steps.  Might just need to comb through and pre-allocate some intermediate variables to see if that helps.
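
For example, something along these lines, where Workspace is a hypothetical stand-in for whatever the intermediate operations currently allocate on the fly:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical workspace; stands in for whatever the intermediate
// operations currently allocate on the fly.
struct Workspace {
    std::vector<double> scratch;
};

void worker(std::size_t begin, std::size_t end)
{
    Workspace ws;
    ws.scratch.reserve(1024);       // allocate once, before the hot loop
    for (std::size_t i = begin; i != end; ++i) {
        ws.scratch.clear();         // re-uses capacity, no new allocation
        // ... eigenvalue calculation and follow-up work using ws.scratch ...
    }
}
```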


13 hours ago, 7he404guy said:

A result vector is modified but is never read from within the thread

Because of how cache lines work, writing involves reading, so that could be causing the cache lines to jump around. You could fix that by making each thread have its own result vector, then combining them at the end. (Also, I would be surprised if std::vector is thread safe).
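
A minimal sketch of the per-thread result vector idea, where Result is a hypothetical stand-in for whatever each matrix produces:

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for whatever each matrix produces.
struct Result { double eigenvalues[5]; };

std::vector<Result> process_all(std::size_t count, unsigned num_threads)
{
    std::vector<std::vector<Result>> partial(num_threads);
    std::vector<std::thread> workers;
    const std::size_t chunk = count / num_threads;

    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            const std::size_t begin = t * chunk;
            const std::size_t end   = (t + 1 == num_threads) ? count
                                                             : begin + chunk;
            std::vector<Result> local;       // lives entirely on this thread
            local.reserve(end - begin);
            for (std::size_t i = begin; i != end; ++i)
                local.push_back(Result{});   // stand-in for the work on matrix i
            partial[t] = std::move(local);   // single hand-off at the end
        });
    }
    for (auto& w : workers) w.join();

    // Concatenating in thread order preserves the original matrix order.
    std::vector<Result> all;
    all.reserve(count);
    for (auto& p : partial)
        all.insert(all.end(), p.begin(), p.end());
    return all;
}
```

Since the per-thread chunks are concatenated in thread order, the original matrix order is preserved.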

On 3/27/2019 at 10:02 PM, 7he404guy said:

edit: I suppose I could always just process 1 matrix per thread and continually re-join processor threads as I go along, eh?

A join is a fairly expensive operation, so that will slow the whole process down.

13 hours ago, 7he404guy said:

I'm worried that perhaps the operations that I have to perform after getting all eigenvalues could be slowing things down because it involves allocating some memory also during intermediate operations.  Might just need to comb through and pre-allocate some intermediate variables to see if that helps.

Memory allocation could well be a bottleneck, though I'm not sure what would cause it to be much slower on two threads than the rest.

 

A couple of things you could try:

  • Is it always the threads that are allocated a particular pair of tasks that are slower, or is it random?
  • If you comment out sections individually (eigenvalue calculation, writing results, ...), can you narrow it down to one bit of code causing the discrepancy? (A simple timing sketch follows below.)
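
For the timing, even something as crude as this scope timer works (illustrative; plug in whatever logging you already use):

```cpp
#include <chrono>
#include <cstdio>

// Crude scope timer; drop one into each suspect section and compare
// the printed numbers across threads.
struct ScopedTimer {
    const char* label;
    std::chrono::steady_clock::time_point start =
        std::chrono::steady_clock::now();
    ~ScopedTimer() {
        auto us = std::chrono::duration_cast<std::chrono::microseconds>(
                      std::chrono::steady_clock::now() - start).count();
        std::printf("%s: %lld us\n", label, static_cast<long long>(us));
    }
};

// Usage inside a worker:
// { ScopedTimer t{"eigenvalues"};   /* eigenvalue calculation */ }
// { ScopedTimer t{"write results"}; /* write to result vector */ }
```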


12 minutes ago, colonel_mortis said:

(Also, I would be surprised if std::vector is thread safe).

It's not; there are some very basic guarantees, but nothing worthwhile.

Without seeing some code, we can merely speculate, though.


10 hours ago, colonel_mortis said:

I would be surprised if std::vector is thread safe

In this specific application it should be safe: it's pre-allocated and constructed, then re-used (no push_back calls; passed by reference; only writes to array indices). I suppose instead of passing the vector by reference I could just pass in the .data() pointer.

 

I'll look at separate vectors (although the order in which results are added is important) and maybe write in some timers to check for bottlenecks.  All in all, I think I've gotten some valuable information from this.

