Multi-Threading C++ & OpenGL

Kamjam66xx

Hello, I can't find any good source on multithreading with OpenGL. Can anyone help?

I'm trying to increase the speed of my main "while" loop, which continuously gets user input, moves the camera, calls the shadow functions, does the final render, and swaps buffers.

I'm using C++, OpenGL, GLFW, GLEW, GLM, ASSIMP, and STB_IMAGE. Anything that touches OpenGL directly seems to make my program crash when I give it its own thread.

I know the benefit is worth the effort.

Thanks guys!

Edit: GLFW & GLEW are what I want to move onto multiple threads.


1 hour ago, Kamjam66xx said:

Hello, I can't find any good source on multithreading with OpenGL. Can anyone help?

I'm trying to increase the speed of my main "while" loop, which continuously gets user input, moves the camera, calls the shadow functions, does the final render, and swaps buffers.

I'm using C++, OpenGL, GLFW, GLEW, GLM, ASSIMP, and STB_IMAGE. Anything that touches OpenGL directly seems to make my program crash when I give it its own thread.

I know the benefit is worth the effort.

Thanks guys!

Edit: GLFW & GLEW are what I want to move onto multiple threads.

AFAIK OGL is not thread-friendly. With how threading works, you would do polling in one thread to fill a buffer and have the GL thread collect inputs from that buffer. Not sure that would be any better than doing the polling in the GL thread, though.
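A minimal sketch of that pattern in C++ (the InputEvent type and function names here are just illustrative, not from GLFW or any library): the polling thread appends events under a mutex, and the GL thread swaps the whole buffer out once per frame.

#include <mutex>
#include <vector>

// Hypothetical event type produced by the polling thread.
struct InputEvent { int key; bool pressed; };

std::mutex gInputMutex;
std::vector<InputEvent> gPendingInput;   // filled by the input thread

// Input thread: append events to the shared buffer.
void PushInput(const InputEvent& e)
{
    std::lock_guard<std::mutex> lock(gInputMutex);
    gPendingInput.push_back(e);
}

// GL/render thread: take everything collected since the last frame.
std::vector<InputEvent> DrainInput()
{
    std::lock_guard<std::mutex> lock(gInputMutex);
    std::vector<InputEvent> events;
    events.swap(gPendingInput);          // swap keeps the lock held only briefly
    return events;
}

One caveat: with GLFW specifically, glfwPollEvents() has to stay on the thread that created the window, so this pattern is mostly useful for input sources you control yourself.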


7 minutes ago, KarathKasun said:

AFAIK OGL is not thread-friendly. With how threading works, you would do polling in one thread to fill a buffer and have the GL thread collect inputs from that buffer. Not sure that would be any better than doing the polling in the GL thread, though.

I'm kind of new to actually using threads in C++, and OpenGL is an added complication. It's a minefield to work around, but I still got user input and a few other things into separate threads, and I figured out a pretty clever way to throw my buffer swapping into its own thread.

What you just said went over my head. GLFW is not thread-friendly at all, I've come to find.

Edit: I don't know the technical terms for threading stuff yet.


I'm not familiar with C++ terminology or variable naming conventions, to be honest. But in general you would have an input thread write to something like a global variable, and have the rendering thread read that variable to determine how to update the output.

You still need a way to sync the threads, or to deal with any unused input data when the render thread slows down.
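For a single small piece of shared state (say, a camera movement direction), that "global variable" can be a std::atomic so the reads and writes don't race; a rough sketch with made-up variable names:

#include <atomic>

// Written by the input thread, read by the render thread once per frame.
std::atomic<float> gMoveX{0.0f};
std::atomic<float> gMoveZ{0.0f};
std::atomic<bool>  gQuitRequested{false};

// Input thread:
void OnKeyUpdate(float x, float z)
{
    gMoveX.store(x, std::memory_order_relaxed);
    gMoveZ.store(z, std::memory_order_relaxed);
}

// Render thread, once per frame:
void UpdateCamera(float dt)
{
    float x = gMoveX.load(std::memory_order_relaxed);
    float z = gMoveZ.load(std::memory_order_relaxed);
    // ... move the camera by (x, z) * dt ...
}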

 

http://discourse.glfw.org/t/multithreading-glfw/573/2

Some good info there.

 

I actually had this problem in Python while writing an extremely fast threaded implementation of the basic turtle vector library. It's a major PITA to work around.


Use threads to divvy up the workload and compose the data that should be sent to OpenGL, but only have a single thread (the one that opened the OpenGL context) make the actual OpenGL calls. Also consider that the error might be in your own code rather than anything to do with OpenGL: are you sure you don't have any data races, atomicity issues, or out-of-order-execution problems that break naive mutex attempts?
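To illustrate that split, here is a rough sketch (the DrawItem type and function names are invented for the example, not taken from the OP's code): worker threads only build plain data, and the thread that owns the context is the only one that touches GL.

#include <future>
#include <vector>
#include <GL/glew.h>
#include <glm/glm.hpp>

struct DrawItem {                 // plain data; safe to build on any thread
    GLuint    vao;
    GLsizei   indexCount;
    glm::mat4 model;
};

// Worker-side: pure CPU work, no GL calls at all.
std::vector<DrawItem> BuildDrawList(const std::vector<glm::mat4>& transforms,
                                    GLuint vao, GLsizei indexCount)
{
    std::vector<DrawItem> items;
    items.reserve(transforms.size());
    for (const glm::mat4& m : transforms)
        items.push_back({vao, indexCount, m});
    return items;
}

// GL-thread-side: the only place that issues OpenGL calls.
void SubmitDrawList(const std::vector<DrawItem>& items, GLint modelLoc)
{
    for (const DrawItem& d : items) {
        glUniformMatrix4fv(modelLoc, 1, GL_FALSE, &d.model[0][0]);
        glBindVertexArray(d.vao);
        glDrawElements(GL_TRIANGLES, d.indexCount, GL_UNSIGNED_INT, nullptr);
    }
}

// Usage: build on a worker, submit on the context-owning thread, e.g.
//   auto fut = std::async(std::launch::async, BuildDrawList, transforms, vao, count);
//   SubmitDrawList(fut.get(), modelLoc);   // called from the GL thread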


Unsure about OGL, but in DX you would create a command list with a deferred context in each thread and then submit them to the immediate context to be rendered. OGL should essentially be the same as DX.

However, I guarantee you can drastically improve performance with far less complexity added to the program. Have you even profiled your application yet?



From what I understand, OpenGL itself is threaded, i.e. the GPU does all the drawing and little is needed from the CPU.

You could use std::thread for threading, though.

A thread just executes a separate function.

What you could do is put your input logic into a while loop in a separate function, and then your input is all executed on a separate thread.

Some common oversights are threads using data needed by other parts of the program, or the main thread needing data that another thread hasn't finished producing yet.
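A bare-bones sketch of spawning such an input loop with std::thread and an atomic stop flag (the loop body is left as a comment; note that GLFW's own event polling still has to stay on the main thread):

#include <atomic>
#include <thread>

std::atomic<bool> gRunning{true};

void InputLoop()
{
    while (gRunning.load()) {
        // Read input from a thread-safe source and store it somewhere shared
        // (see the buffer/atomic examples earlier in the thread).
    }
}

int main()
{
    std::thread inputThread(InputLoop);   // input now runs on its own thread

    // ... main/render loop ...

    gRunning = false;                     // tell the input thread to stop
    inputThread.join();                   // wait for it before exiting
}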


12 hours ago, KarathKasun said:

Thanks, that's some good info to look through. A lot of what I've been doing is trial and error.

 

8 hours ago, Unimportant said:

Use threads to divvy up the workload and compose the data that should be sent to OpenGL, but only have a single thread (the one that opened the OpenGL context) make the actual OpenGL calls. Also consider that the error might be in your own code rather than anything to do with OpenGL: are you sure you don't have any data races, atomicity issues, or out-of-order-execution problems that break naive mutex attempts?

My application starts a lot faster, and I've managed to get a few improvements: mouse input is smoother and less buggy, etc.

 

1 hour ago, trag1c said:

Have you even profiled your application yet?

What do you mean by profiled? Sorry, I'm self-taught, and this is really only my 4th or 5th month programming.

 

1 hour ago, fpo said:

Some common oversights are threads using data needed by other parts of the program, or the main thread needing data that another thread hasn't finished producing yet.

Right. I had to figure out nifty tricks to get it working, but only a few GLFW functions let me just throw them into a lambda on another thread or something.

 

--

 

Sorry, I'm new to both threading and to OpenGL.


2 hours ago, Kamjam66xx said:

What do you mean by profiled? Sorry, I'm self-taught, and this is really only my 4th or 5th month programming.

Sorry, I'm new to both threading and to OpenGL.

Profiling is measuring performance. It can be as simple as measuring frame time (the inverse of FPS, and a far more valuable metric than FPS) or as low-level as measuring function execution times and the number of calls to each function.

I highly suggest looking into profiling before you try any optimizations, because it tells you where you need to improve performance.

There's also the 80-20 rule (and derivatives such as 90-10), which says you get 80% of the performance improvement from optimizing 20% of your code. Profiling helps you find that magical 20%.
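A bare-bones way to start measuring frame time with std::chrono (just a sketch, not a real profiler):

#include <chrono>
#include <cstdio>

int main()
{
    using clock = std::chrono::steady_clock;
    bool running = true;

    while (running) {
        auto frameStart = clock::now();

        // ... poll input, update, render, swap buffers ...

        double frameMs = std::chrono::duration<double, std::milli>(
                             clock::now() - frameStart).count();
        std::printf("frame time: %.2f ms\n", frameMs);
        running = false;   // remove this; it's only here so the sketch terminates
    }
}

Once you have frame times, a dedicated CPU profiler can tell you which functions inside the frame are actually eating the time.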



Quote

Anything that touches OpenGL directly seems to make my program crash when I give it its own thread.

That just sounds like you're calling OpenGL from a thread with no current OpenGL context.

 

However, in general it is barely possible to get an appreciable speedup from an OpenGL renderer by using multithreading. Don't hope that you can, for example, issue the shadow-map rendering commands on one thread and the scene rendering commands on another - that will not work: the OpenGL driver will just synchronize those threads and everything will effectively become serialized, losing any benefit you may have had from parallelism. This isn't a question of driver quality; it's a fundamental constraint caused by the design of OpenGL. So you're better off calling OpenGL on just one thread.

 

One exception to this is loading textures/meshes etc. from disk. Since most of the time is spent waiting on file reads, it makes sense to split resource loading (texture and buffer creation) into a separate thread or threads - create a shared context on the resource-loading thread and load your textures/models on it while you do other stuff. This could improve your loading times.
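With GLFW, that shared context is usually made from a second, hidden window whose context shares objects with the main one; a rough sketch (error handling and glewInit() omitted):

#include <GL/glew.h>
#include <GLFW/glfw3.h>
#include <thread>

int main()
{
    glfwInit();
    GLFWwindow* mainWindow = glfwCreateWindow(1280, 720, "App", nullptr, nullptr);

    // Invisible helper window; its context shares objects with the main one.
    glfwWindowHint(GLFW_VISIBLE, GLFW_FALSE);
    GLFWwindow* loaderWindow = glfwCreateWindow(1, 1, "", nullptr, mainWindow);

    std::thread loader([loaderWindow]() {
        glfwMakeContextCurrent(loaderWindow);  // this thread gets its own context
        // ... read files, create textures/buffers here ...
        glFinish();                            // make sure the uploads are done
    });

    glfwMakeContextCurrent(mainWindow);        // render on the main thread
    // ... render loop ...

    loader.join();
    glfwTerminate();
}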

 

If you are interested in building a multithreaded renderer, the best path forward is the new APIs - DX12 or Vulkan. They allow you to split the driver overhead of recording command buffers across multiple threads, thus making better use of your CPU's many cores. This comes at the price of needing to handle GPU-side synchronization and memory management yourself, though - it is a very daunting task and I don't think someone who is just beginning graphics should bother with it. I promise you it's way more fun to play with lights and materials than to look for synchronization bugs in your Vulkan code :)

 


1 hour ago, nicebyte said:

This comes at the price of needing to handle GPU-side synchronization and memory management yourself, though - it is a very daunting task and I don't think someone who is just beginning graphics should bother with it. I promise you it's way more fun to play with lights and materials than to look for synchronization bugs in your Vulkan code.

I want to +1 this. From what I've read in developer circles, DX12 is really "DX11 Expert Mode." I can't imagine Vulkan being any different.

Multithreading is something that shouldn't be taken lightly. While the concept looks deceptively simple, executing it well in practice is horribly difficult. You have to understand your design and your code well in order to make sure multithreading is working as intended.


1 hour ago, Mira Yurizaki said:

I want to +1 this. From what I've read in developer circles, DX12 is really "DX11 Expert Mode." I can't imagine Vulkan being any different.

Multithreading is something that shouldn't be taken lightly. While the concept looks deceptively simple, executing it well in practice is horribly difficult. You have to understand your design and your code well in order to make sure multithreading is working as intended.

To drive the point home: the OpenGL backend of my homegrown gfx lib is ~2,000 lines of code.

The Vulkan one is about the same size now, but it's nowhere near feature-complete AND I've "outsourced" GPU memory management to AMD's VMA library (which could easily add another 1K lines if I did it myself).

I definitely don't want to discourage anyone from learning Vulkan, but those who are considering it need to understand that graphics APIs are not about graphics; they are about abstracting the GPU. Learning DX12 or Vk will take a nontrivial amount of time during which you will not be dealing with actual "graphics", i.e. making pretty images. Instead, you'll be figuring out how to efficiently feed data into a massively parallel computer attached to your regular computer :) This can be interesting in and of itself, but make sure you understand what you're getting into!


19 minutes ago, nicebyte said:

To drive the point home: the OpenGL backend of my homegrown gfx lib is ~2,000 lines of code.

The Vulkan one is about the same size now, but it's nowhere near feature-complete AND I've "outsourced" GPU memory management to AMD's VMA library (which could easily add another 1K lines if I did it myself).

I definitely don't want to discourage anyone from learning Vulkan, but those who are considering it need to understand that graphics APIs are not about graphics; they are about abstracting the GPU. Learning DX12 or Vk will take a nontrivial amount of time during which you will not be dealing with actual "graphics", i.e. making pretty images. Instead, you'll be figuring out how to efficiently feed data into a massively parallel computer attached to your regular computer :) This can be interesting in and of itself, but make sure you understand what you're getting into!

This tells me that DX11/OGL is more like... proving your thing works. DX12/Vulkan is more like: you've proven your thing works, but now you want to make it work better and you're positive you've tried everything else.


4 hours ago, nicebyte said:

 

The end game is to learn Vulkan for a job that a family member can get me into for a chance to prove myself, since I don't even have a GED. But he said to stick with OpenGL for now and practice shaders, multivariate calculus, and a few other things.

My little renderer thing is about 2,500 lines including my shaders; it's all intertwined, so idk. I guess I'll save any serious attempts at multi-threading for Vulkan then - is that the conclusion I should be leaving with?

Glad the forum has knowledgeable people on this. It feels like a lonely, unsupported niche a lot of the time, haha.


>  I guess I'll save any serious attempts at multi-threading for Vulkan then - is that the conclusion I should be leaving with?

 

Yeah, I would recommend that. Focus on the fundamentals and try not to get too bogged down in the API details yet.


1 hour ago, nicebyte said:

Yeah, I would recommend that. Focus on the fundamentals and try not to get too bogged down in the API details yet.

Thanks for steering me clear. Idk how, but graphics stuff just seems to get more and more fun to program.


7 hours ago, trag1c said:

I highly suggest looking into profiling before you try any optimizations, because it tells you where you need to improve performance.

I'll have to look into that. I've just been making educated guesses based on what I know of modern compilers, testing the execution time of tiny programs, and the specifics I know of C++. However, I did at least cut my application's loading time in half. Thanks!


On 2/4/2019 at 12:11 AM, Kamjam66xx said:

I'll have to look into that. I've just been making educated guesses based on what I know of modern compilers, testing the execution time of tiny programs, and the specifics I know of C++. However, I did at least cut my application's loading time in half. Thanks!

I know this doesn't directly answer your question, but I've been taking a different approach to learning about graphics programming.

Instead of jumping directly to DirectX or OpenGL, I'm building up a 2D engine that draws to the console, without the use of any graphics APIs (except those necessary to actually draw bitmaps to a Win32 console).

I think that approach has two major benefits: it can teach you the underlying concepts of how "graphics" works in general, and it gives you basic opportunities to work on multithreading in a simpler application (or at least in an application that you fully understand).



To add to what @straight_stewie said: if you still want to tackle multithreaded programming, I think the best approach is to start from a point of view where there are distinct "workers" doing completely independent things. You don't even have to build a game engine to do this. You can dabble in basic multithreading using Python and a GUI library like Tcl and try to keep the GUI responsive while doing work in the background.

 

But trying to use multithreading to speed up a single task is not something that should be jumped to right away. When it comes to a single task, you run into the biggest issue you have to solve with multithreading: resource contention.


@straight_stewie Regardless, people have given me great input and things to think about. I'm hung up on getting my multivariate calculus down for now, and I'll put multi-threading on the back burner until I learn Vulkan (except for non-OpenGL stuff). I personally always seem to choose the hardest route I can; this graphics stuff sure is rewarding when you make it do what you want, though.

@Mira Yurizaki I should probably dig into a book specifically on multi-threading. My solutions are probably put to shame by the ones I'd learn from a few quick reads.


2 hours ago, Kamjam66xx said:

@straight_stewie Regardless, people have given me great input and things to think about. I'm hung up on getting my multivariate calculus down for now, and I'll put multi-threading on the back burner until I learn Vulkan (except for non-OpenGL stuff). I personally always seem to choose the hardest route I can; this graphics stuff sure is rewarding when you make it do what you want, though.

@Mira Yurizaki I should probably dig into a book specifically on multi-threading. My solutions are probably put to shame by the ones I'd learn from a few quick reads.

Like others have said, OpenGL is not thread-safe.

However, for any toy application that you are building I would not expect OpenGL command submission to be the bottleneck.

Calls to OpenGL functions are deferred to the driver, so there is no waiting involved.

When you submit a draw call, the driver will check that what you're doing is legal and will then forward the result to a work queue for the kernel-mode part of the driver to process, which might do some more error checking, schedule requests between different programs, convert the commands to a GPU-compatible format, and upload those commands to the GPU's internal command queue.

Note that your program does not wait for the kernel-mode driver (and thus also won't wait for triangles to be drawn by the GPU).

 

With all due respect, if draw calls are indeed a bottleneck (in a hobby OpenGL project that does not have a 100-square-km game map filled with high-quality assets), then you are probably doing something wrong.

Make sure that you are not using the legacy fixed-function pipeline (submitting triangles with glVertex calls) and instead use "modern" OpenGL (the fixed-function pipeline was deprecated in OpenGL 3.0 (2008) and removed starting from OpenGL 3.1 (2009!)):

https://www.khronos.org/opengl/wiki/Fixed_Function_Pipeline
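For reference, the "modern" path means uploading vertex data into buffer objects up front and drawing from them; a minimal sketch (no error checking, shader creation not shown):

#include <GL/glew.h>

// One-time setup: upload a triangle into a VBO and describe its layout with a VAO.
GLuint CreateTriangle()
{
    const float verts[] = {
        -0.5f, -0.5f, 0.0f,
         0.5f, -0.5f, 0.0f,
         0.0f,  0.5f, 0.0f,
    };

    GLuint vao = 0, vbo = 0;
    glGenVertexArrays(1, &vao);
    glGenBuffers(1, &vbo);

    glBindVertexArray(vao);
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferData(GL_ARRAY_BUFFER, sizeof(verts), verts, GL_STATIC_DRAW);
    glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 3 * sizeof(float), nullptr);
    glEnableVertexAttribArray(0);
    glBindVertexArray(0);
    return vao;
}

// Per frame: bind a shader program and the VAO, then draw.
void DrawTriangle(GLuint shaderProgram, GLuint vao)
{
    glUseProgram(shaderProgram);
    glBindVertexArray(vao);
    glDrawArrays(GL_TRIANGLES, 0, 3);
}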

 

Another way to reduce driver overhead is to use the functions added in recent OpenGL versions (>4.3 IIRC).

This collection of new features is often referred to as AZDO ("Approaching Zero Driver Overhead"), which was presented at GDC (the Game Developers Conference):

https://gdcvault.com/play/1020791/Approaching-Zero-Driver-Overhead-in

https://gdcvault.com/play/1023516/High-performance-Low-Overhead-Rendering      (2016 presentation with some new stuff).

 

Also, be sure to check out GDC Vault, the video-on-demand service of GDC; it contains a ton of very interesting and useful presentations. Note that some presentations are behind a paywall (mostly the videos; the slide decks are usually available), which usually gets removed after a year or two.

 

A good way to greatly improve GPU performance is by applying frustum and/or occlusion culling.

With frustum culling we try to check whether an object (a collection of primitives) might possibly be visible with respect to the camera frustum (whether it's inside the field of view).

Frustum culling is an easy optimisation that only requires you to know the bounding volumes of the objects (which you can compute ahead of time).

You simply check for each object whether its bounding volume overlaps with the camera's view frustum (google "frustum culling" for info on how to implement that test).
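As a sketch, the sphere-versus-frustum variant of that test only needs the six frustum planes (stored here as glm::vec4 with a normalized, inward-facing normal in xyz and the distance in w; extracting the planes from the view-projection matrix is the part to look up):

#include <array>
#include <glm/glm.hpp>

struct BoundingSphere { glm::vec3 center; float radius; };

// planes[i] = (a, b, c, d), with (a, b, c) normalized and pointing into the frustum,
// so a point p is inside that half-space when dot((a, b, c), p) + d >= 0.
bool SphereInFrustum(const BoundingSphere& s,
                     const std::array<glm::vec4, 6>& planes)
{
    for (const glm::vec4& p : planes) {
        float dist = glm::dot(glm::vec3(p), s.center) + p.w;
        if (dist < -s.radius)
            return false;   // completely behind one plane -> cull it
    }
    return true;            // possibly visible (the test is conservative)
}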

Note that this type of frustum culling is easily parallelizable both with multi-threading and SIMD (or even on the GPU with indirect draw commands).

If you have a very complex scene then you could also experiment with hierarchical culling where you store the objects in a tree structure (like a bounding volume hierarchy) and traverse the tree, only visiting child nodes when their bounding volume overlaps with the view frustum.

Note that this does make multi-threading and SIMD optimizations somewhat harder (an easy way to properly utilise SIMD in this case is to use a wider tree, i.e. 4 or 8 children per node).

Although this might result in fewer overlap tests (when most of the objects are not visible), it does not map that well to modern hardware (many cache misses will mean a lot of stalls on memory == lower performance).

Frostbite, for example, switched from a fully hierarchical approach to a hybrid one for BF3:

https://www.gamedevs.org/uploads/culling-the-battlefield-battlefield3.pdf

https://www.gdcvault.com/play/1014491/Culling-the-Battlefield-Data-Oriented

 

Occlusion culling is a lot more complicated than frustum culling and there are many different solutions.

The most popular solutions right now are based on screen-space techniques (like hierarchical z-buffer, HOM and IOM) because they map well to modern hardware (especially GPU) and can handle any arbitrary fully dynamic scenes.

Like I mentioned, this topic is a lot more complex than frustum culling and requires complex scenes (high depth complexity) to perform well.

So I would recommend not looking into this too much until you've built a decently sized engine and the performance is GPU-bottlenecked with no other obvious optimisations left (like backface culling).

Anyway here is some reading on occlusion culling in games:

https://www.google.com/search?q=umbra+master+thesis    (first link: the master's thesis by Timo Aila, currently a researcher at Nvidia Research with an impressive list of publications to his name. Umbra is now developed by the company of the same name, and the technology is used in games like The Witcher 3.)

https://www.gdcvault.com/play/1014491/Culling-the-Battlefield-Data-Oriented

https://frostbite-wp-prd.s3.amazonaws.com/wp-content/uploads/2016/03/29204330/GDC_2016_Compute.pdf

https://gdcvault.com/play/1017837/Why-Render-Hidden-Objects-Cull

http://advances.realtimerendering.com/s2015/aaltonenhaar_siggraph2015_combined_final_footer_220dpi.pdf

And an interesting note: GPUs already implement hierarchical z-buffer culling to cull individual triangles (but not whole objects).

 

With regard to multi-threading, what most game engines do is create their own command lists.

Recording into these command lists can be multi-threaded; only execution of the command lists (looping over the commands and calling the corresponding OpenGL functions) has to be sequential.
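A very stripped-down illustration of such a command list (the struct and function names are invented for the example, not taken from any engine): workers append plain structs, and only the GL thread replays them.

#include <vector>
#include <GL/glew.h>

// Plain-data command; recording only writes structs, so it is safe off the GL thread.
struct DrawCommand {
    GLuint  program;
    GLuint  vao;
    GLsizei indexCount;
};

using CommandList = std::vector<DrawCommand>;

// Recorded on worker threads (one CommandList per thread, so no locking is needed).
void RecordDraw(CommandList& out, GLuint program, GLuint vao, GLsizei indexCount)
{
    out.push_back({program, vao, indexCount});
}

// Executed only on the thread that owns the OpenGL context.
void ExecuteCommandLists(const std::vector<CommandList>& lists)
{
    for (const CommandList& list : lists)
        for (const DrawCommand& cmd : list) {
            glUseProgram(cmd.program);
            glBindVertexArray(cmd.vao);
            glDrawElements(GL_TRIANGLES, cmd.indexCount, GL_UNSIGNED_INT, nullptr);
        }
}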

Furthermore, you could also apply multithreading to any other processing (like physics simulations) that you would like to do between the input phase (polling the keyboard/mouse, which does not take any significant amount of time) and the rendering phase.

The best way to handle this in terms of throughput is to overlap the rendering of frame N with the input+physics of frame N+1.

Although this adds a frame of latency, it helps with filling compute resources (e.g. fork/join means waiting until the last task has finished, and maybe not everything can be multi-threaded (Amdahl's law)).

A good way to get the most parallelism out of the system is to describe your program as a directed acyclic graph (DAG) of tasks.

This allows the scheduler to figure out which tasks do not depend on each other such that they can be executed in parallel.

If you're keen to work with Vulkan/DX12 then you might also want to apply the same concept to scheduling GPU commands.

Some examples of task/frame graphs in practice:

https://gdcvault.com/play/1021926/Destiny-s-Multithreaded-Rendering

https://www.ea.com/frostbite/news/framegraph-extensible-rendering-architecture-in-frostbite

https://www.gdcvault.com/play/1022186/Parallelizing-the-Naughty-Dog-Engine

 

Also, I would recommend ignoring some of the earlier advice in this thread about using raw std::thread for multi-threading.

Spawning an OS thread is relatively costly, and in a game engine you want all the performance you can get.

Furthermore, hitting a mutex means that the operating system may allow another thread to run, which might belong to a completely different application.

Instead, I would recommend taking a look at multi-threaded tasking libraries, which spawn a bunch of threads at start-up (usually as many threads as you have cores) and then schedule tasks onto them themselves (using a (work-stealing) task queue).

Examples of these are Intel Threaded Building Blocks (TBB), cpp-taskflow, HPX (distributed computing focused), FiberTaskingLib and Boost Fiber.

Note that the last 3 all use fibers (AKA user-land threads, AKA green threads) which are like operating system threads but where the programmer is in control of scheduling them.

A well known example of using fibers for a tasking system in video games is the GDC presentation by Naughty Dog on porting The Last of Us to the PS4 (and running it at 60fps):

https://www.gdcvault.com/play/1022186/Parallelizing-the-Naughty-Dog-Engine
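To make the task-DAG idea concrete, here is a small sketch using cpp-taskflow (the task bodies are placeholders, and the exact API may differ slightly between versions):

#include <taskflow/taskflow.hpp>

int main()
{
    tf::Executor executor;   // owns the worker threads (roughly one per core)
    tf::Taskflow taskflow;

    auto input   = taskflow.emplace([] { /* poll input        */ });
    auto physics = taskflow.emplace([] { /* step physics      */ });
    auto anim    = taskflow.emplace([] { /* update animations */ });
    auto record  = taskflow.emplace([] { /* build draw lists  */ });

    input.precede(physics, anim);   // physics and anim both wait on input...
    physics.precede(record);        // ...and can run in parallel with each other;
    anim.precede(record);           // recording waits for both to finish

    executor.run(taskflow).wait();  // the scheduler exploits the DAG structure
}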

 

Finally, if you care about performance try to read up on modern computer architecture (the memory system) and SIMD.

Most game engine developers now try to apply "Data Oriented Design" which is a way of structuring your program in such a way that it makes it easy for the processor to process the data.

This usually comes down to storing your data as a structure of arrays (SOA) which is better for cache coherency and makes SIMD optimisations easier (although DOD does cover more than just SOA).
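The AoS-versus-SoA difference in a nutshell (a toy particle example, not from any particular engine):

#include <cstddef>
#include <vector>

// Array of structures (AoS): all fields of one particle sit together,
// so a loop that only needs positions still drags the rest through the cache.
struct ParticleAoS { float x, y, z, vx, vy, vz, lifetime; };
std::vector<ParticleAoS> particlesAoS;

// Structure of arrays (SoA): each field is contiguous, which is friendlier
// to the cache and to SIMD when a loop only touches some of the fields.
struct ParticlesSoA {
    std::vector<float> x, y, z;
    std::vector<float> vx, vy, vz;
    std::vector<float> lifetime;
};

// Example: integrating positions only streams through six tightly packed arrays.
void Integrate(ParticlesSoA& p, float dt)
{
    for (std::size_t i = 0; i < p.x.size(); ++i) {
        p.x[i] += p.vx[i] * dt;
        p.y[i] += p.vy[i] * dt;
        p.z[i] += p.vz[i] * dt;
    }
}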

 

To learn more about the graphics pipeline, a lot of resources are available online describing how the GPU's programmable cores work (covering terms like warps/wavefronts, register pressure, shared memory vs. global memory, etc.).

If you are interested in learning more about the actual graphics pipeline itself (which contains fixed-function parts) then I would definitely recommend this read:

https://fgiesen.wordpress.com/2011/07/09/a-trip-through-the-graphics-pipeline-2011-index/

 

Also, writing a software rasterizer is a great way to get to learn the graphics pipeline and it is also a really good toy project to practice performance optimisations (maybe read up on project Larrabee by Intel).

 

Sorry for the wall of text.

Hopefully this will help you, and anyone else trying to develop their first game/graphics engine who doesn't know where to start (in terms of performance optimizations).

 


