
CPU vs GPU instruction set difference

BotDamian

Hello dear Community,

I'm thinking about a project and I have no idea how to implement it because I don't understand CPU and GPU differences well.

So, for example, anything someone programs in C#, C or JavaScript runs on the CPU because the CPU's instruction set makes that possible, right?

So what is the difference exactly? Why can program XY run on the CPU but not on the GPU? Where is the difference, what instructions are missing, etc.?

 

I also know that you don't talk to the GPU directly but use APIs like OpenGL, Vulkan, CUDA, etc.

 

I'm very interested in this topic, and I'm sure any good and constructive reply will also help future generations who are interested in it.

I've found a few threads on Google that also point to the LTT forum.
 

 

Intel NUC 13 | i3 1315U | 2x 8GB 3600/CL16

 

 

AMD 7950x3d | Sapphire 7800XT Nitro | 2x 16GB Corsair Vengeance 5600Mhz CL36 1R | MSI B650-P Pro Wifi | Custom Loop


AUNE X8 Magic DAC + SS3602

AUNE X7s PRO Class-A | Sennheiser HD58X

Schiit Rekkr ELAC BS243.3


24 minutes ago, BotDamian said:

Hello dear Community,

I'm thinking about a project and I have no idea how to implement it because I don't understand CPU and GPU differences well.

So, for example, anything someone programs in C#, C or JavaScript runs on the CPU because the CPU's instruction set makes that possible, right?

So what is the difference exactly? Why can program XY run on the CPU but not on the GPU? Where is the difference, what instructions are missing, etc.?

 

I also know that you don't talk to the GPU directly but use APIs like OpenGL, Vulkan, CUDA, etc.

 

I'm very interested in this topic, and I'm sure any good and constructive reply will also help future generations who are interested in it.

I've found a few threads on Google that also point to the LTT forum.
 

 

Sadly, the first post may test that. I am not a programmer, in much the same way an ostrich is not an ice skater, but my understanding is that while both are Turing-complete processors, they are optimized in such radically different ways that what one can do well the other is often abysmally terrible at. LTT did a video on some über-powerful CPU where they ran Crysis with no GPU, a game that's almost old enough to vote (or drink, or something, I don't even remember when it came out), at abysmally low fps. It was a demonstration of how incredibly powerful the chip was, because such a thing normally couldn't even function.

Programs are compiled into other languages, sometimes several times, on the way down to something a computer can understand, and if a compiler doesn't exist for a particular target you can't get there from where you are, even if it's close by. There are probably a lot of people here who can explain that better and more accurately. Hopefully my hairy-knuckled explanation is useful at its low level, though.

Not a pro, not even very good.  I’m just old and have time currently.  Assuming I know a lot about computers can be a mistake.

 

Life is like a bowl of chocolates: there are all these little crinkly paper cups everywhere.


15 minutes ago, Bombastinator said:

A game that's almost old enough to vote (or drink, or something, I don't even remember when it came out)

Drive. It came out in 2007.

I could use some help with this!

Please PM me if you would like to contribute to my GPU BIOS database (it includes overclocking BIOSes, stock BIOSes, and upgrades to GPUs via modding).

Bios database



1 hour ago, BotDamian said:

So what is the difference exactly? Why can program XY run on the CPU but not on the GPU? Where is the difference, what instructions are missing, etc.?

As a high-level answer: a CPU is general purpose. Anything you can think of, it can do. That flexibility comes at a price: a CPU core is huge compared to a GPU "core" (CPUs currently top out at around 64 cores, while GPUs have several thousand).

 

That's because a GPU is far more specialized. A GPU "core" is not comparable to a CPU core in any way. For example, a tensor core can do matrix multiplications; that's it, it can't do anything else. However, you have hundreds or even thousands of them, so you can do thousands of these multiplications each clock cycle, which is great for 3D graphics where you have to do a ton of these operations each frame.

 

Likewise, GPUs are good at stuff like AI or even physics, because those workloads consist of a lot of mathematical operations (like matrix multiplications) that a GPU is good at.

 

This is also sometimes referred to as SIMD: Single Instruction, Multiple Data. The idea is that you have very few operations you can do, but you can do hundreds or thousands of them in parallel, because (for 3D graphics) you have to do the same operations over and over again (e.g. to determine the color of each pixel).
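To make that concrete, here is a minimal CUDA sketch (the names are made up for illustration) of the SIMD/SIMT idea: every thread executes the same instruction, just on its own element of the data.

// One instruction stream, many data elements: each thread scales
// its own array element by the same factor.
__global__ void scaleAll(float* data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x; // this thread's element index
    if (i < n)
        data[i] *= factor; // same operation, different data
}

// Host side: launch enough threads to cover all n elements, e.g.
// scaleAll<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);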

Remember to either quote or @mention others, so they are notified of your reply


17 hours ago, BotDamian said:

Why can program XY run on the CPU but not on the GPU? Where is the difference, what instructions are missing, etc.?

The instruction set is basically the collection of operations the hardware knows how to perform; each instruction activates a particular combination of electrical circuits to achieve what was asked. Some of those circuits are similar in both kinds of core, but some only exist in CPUs and others only in GPUs, so the instruction sets cannot be the same for both. Each also works with a different amount of data per call, a different number of circuits used simultaneously, and so on.

 

The closest the two come to each other would be CUDA. CUDA is the closest to a CPU in terms of function: nearly anything you can normally write for a CPU can also be done with CUDA. It does not use the same code, but it can pretty much do the same things, just better or worse depending on the task.


Thanks for the comments but I still have questions.

So the part I don't understand is that, for example, with CPUs you're able to add, subtract, divide and multiply, but with GPUs you can only do matrix multiplications?

But why is it only matrix multiplications? Is it something similar to ASICs, which can do one thing extremely fast but are useless for anything else?

I'm asking because GPUs have so many cores, so why can't we run sorting algorithms on GPUs and let thousands of cores handle it?

It seems pretty easy to me; it's just a check of "is this higher or lower than that".



1 hour ago, BotDamian said:

Is it something similar to ASICs, which can do one thing extremely fast but are useless for anything else?

Yep basically. 

The reason there can be thousands of cores is that they're small ones that are designed to be good at a small set of specific things. 

F@H
Desktop: i9-13900K, ASUS Z790-E, 64GB DDR5-6000 CL36, RTX3080, 2TB MP600 Pro XT, 2TB SX8200Pro, 2x16TB Ironwolf RAID0, Corsair HX1200, Antec Vortex 360 AIO, Thermaltake Versa H25 TG, Samsung 4K curved 49" TV, 23" secondary, Mountain Everest Max

Mobile SFF rig: i9-9900K, Noctua NH-L9i, Asrock Z390 Phantom ITX-AC, 32GB, GTX1070, 2x1TB SX8200Pro RAID0, 2x5TB 2.5" HDD RAID0, Athena 500W Flex (Noctua fan), Custom 4.7l 3D printed case

 

Asus Zenbook UM325UA, Ryzen 7 5700u, 16GB, 1TB, OLED

 

GPD Win 2


4 hours ago, Kilrah said:

Yep basically. 

The reason there can be thousands of cores is that they're small ones that are designed to be good at a small set of specific things. 

But what calculations can I do on a GPU exactly? 



27 minutes ago, BotDamian said:

But what calculations can I do on a GPU exactly? 

I suspect that would depend heavily on the GPU. My memory from long ago is that all a 3D GPU does is make triangles. Lots and lots and lots of triangles, very very fast. It's likely changed a good bit since then.
 

A GPU could, I think, be called a coprocessor. Way back in the day you would have CPUs and coprocessors; math coprocessors that could do floating-point math (instead of just integer math) were a thing for early PC chips. Then CPUs gained the ability to do floating-point math too, and math coprocessors vanished.

Edited by Bombastinator



On 7/11/2021 at 11:03 AM, BotDamian said:

I'm thinking about a project and I have no idea how to implement it because I don't understand CPU and GPU differences well.

Thinking about concepts and what suits what is, I think, a good mentality when starting things. Ultimately, though, CPU vs GPU comes down to what kind of data you are working with.

 

Starting with a very high-level overview:

CPUs are like workhorses and GPUs are like stallions. A CPU can do pretty much any task you give it and does many different things (thus a workhorse). A GPU, on the other hand, is designed to do specific tasks but do them blazingly fast, by breaking the work into many smaller pieces that can be done at the same time (thus a stallion).

 

So typically, applications with tasks that involve manipulating a lot of data (where it can be broken into chunks that can run at the same time) will benefit from GPUs, and for everything else there are CPUs. CPUs can run GPU-style tasks as well, just typically slower.

 

More detailed:

CPUs were designed for executing general instructions, whereas GPUs were created purely with the idea of higher-quality graphics (in video games) in mind.

 

To render graphics to the screen, with depth/perspective, you need to solve a bunch of matrix equations. Luckily, for the most part each calculation doesn't rely on waiting for the results of other parts... so GPUs were designed to essentially crunch those numbers as quickly as possible, by dividing up the task between many "cores". To put it in perspective: on a CPU you might say multiply a1 * b1 = r1, a2 * b2 = r2, a3 * b3 = r3 ... a10000 * b10000 = r10000, each one a separate task. On a GPU you can effectively say multiply a[] * b[] = r[] and it will try doing it all at once (since it's designed to), so it's like having tons of CPU cores... but each one is a lot more limited.
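As a purely illustrative sketch (hypothetical names), the "a[] * b[] = r[]" idea written out in CUDA might look roughly like this, with each of the 10,000 multiplications handled by its own thread:

// Elementwise multiply: conceptually "a[] * b[] = r[]" done all at once.
// Thread i computes r[i] = a[i] * b[i]; thousands of threads run in parallel.
__global__ void multiplyArrays(const float* a, const float* b, float* r, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        r[i] = a[i] * b[i];
}

// Launch for n = 10000, with the grid sized to cover every index:
// multiplyArrays<<<(10000 + 255) / 256, 256>>>(d_a, d_b, d_r, 10000);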

 

If it's okay to ask, what kind of project were you thinking of? It would help for explaining which parts might lend themselves to a GPU task (if any).

3735928559 - Beware of the dead beef


8 hours ago, BotDamian said:

So the part I don't understand is that, for example, with CPUs you're able to add, subtract, divide and multiply, but with GPUs you can only do matrix multiplications?

GPUs are (or used to be) purpose-built for 3D acceleration, and a lot of that requires matrices. However, a GPU can't only do matrix multiplications; that example was specifically about tensor cores. GPUs also have CUDA cores/streaming processors (SP), texture mapping units (TMU), render output units (ROP), etc.

 

Essentially, it contains everything that is needed for 3D graphics: rotating triangles (matrix multiplications), texturing triangles, performing 2D projection (matrices again) and some effects like AA.

 

In the past, GPUs were mostly fixed-function pipelines. You'd dump data in one end, it would do whatever it was designed to do, and it would spit out a 2D image at the other end. Modern GPUs are much more flexible and programmable (shaders), so they are usable for more than graphics (think AI, physics and other scientific things). This is often termed GPGPU (general-purpose computing on graphics processing units).

 

The real strength of a GPU is that it can do a lot of things in parallel. It would be wasted on the normal things your CPU handles (and also slower, since a GPU usually runs at a much lower clock rate). If you have a billion triangles that you need to rotate, the GPU is what you want to use, because it will be able to process thousands of those triangles in parallel.

 

If you have complex code that can only run on a few threads and contains branches and loops, that's where the CPU shines, while the GPU would be wasted since all that parallel processing power would go unused.

 

~edit: The threads on your GPU are also not necessarily independent. The idea is that you give it one command ("rotate by 3°") and e.g. hand it a few billion triangles. It will then perform that same operation on all of them, processing thousands in parallel each clock cycle (-> SIMD).

 

CPU cores on the other hand are totally independent. Each core is a fully fledged CPU in its own right and can work on tasks independent from any other core. They usually just share some resources like caches.

 

GPU "cores" on the other hand are (for the most part) specialized function blocks that the GPU can use depending on which operation a particular group of cores is most suited for. For best performance you want to keep as many of them occupied at the same time as possible.
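To illustrate the "one command, billions of data items" point in (hypothetical) code, a rough CUDA sketch of "rotate everything by some angle" could look like this; every thread performs the identical rotation, just on its own point:

// Rotate a large batch of 2D points by the same angle (the SIMD/SIMT pattern).
// cosf/sinf are CUDA's built-in device math functions.
__global__ void rotatePoints(float* x, float* y, float angleRad, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float c = cosf(angleRad);
        float s = sinf(angleRad);
        float xr = c * x[i] - s * y[i];
        float yr = s * x[i] + c * y[i];
        x[i] = xr;
        y[i] = yr;
    }
}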



So sorting an array wouldn't be possible with a GPU?



You probably could write a shader program that sorts an array, but that is not a task that benefits much from heavy parallelization, so there's not much point...



3 hours ago, BotDamian said:

So sorting an array wouldn't be possible with a GPU?

The key thing is that you (probably) wouldn't use a GPU to sort a single huge array. You'd use it to quickly sort many arrays. It's possible and people do it: https://github.com/pcdslab/GPU-ArraySort-2.0 -> "this algorithm is able to sort large number of variable-sized arrays on a GPU."

 

As said, GPUs are specialised, so the first question is whether your task even suits the nature of a GPU at all (for example by needing lots of matrix operations). Then they are useful for highly parallelisable workloads, because they have thousands of cores as mentioned above, so if you want to leverage GPU power it's useful to think about whether the problem can be broken up into one specific operation performed many, many times instead of many operations performed a few times.

Crystal: CPU: i7 7700K | Motherboard: Asus ROG Strix Z270F | RAM: GSkill 16 GB@3200MHz | GPU: Nvidia GTX 1080 Ti FE | Case: Corsair Crystal 570X (black) | PSU: EVGA Supernova G2 1000W | Monitor: Asus VG248QE 24"

Laptop: Dell XPS 13 9370 | CPU: i5 10510U | RAM: 16 GB

Server: CPU: i5 4690k | RAM: 16 GB | Case: Corsair Graphite 760T White | Storage: 19 TB


4 hours ago, BotDamian said:

So sorting an array wouldn't be possible with a GPU?

Sorting an array works very well on a GPU, although to get better performance than a CPU this way you need a card with CUDA cores. Mathematical formulas and collection/array manipulation are extremely fast with CUDA compared to a CPU, because these specific kinds of expressions are exactly what the hardware is optimized for and there are tons of cores to process them at the same time. You can use either a straight CUDA C++ implementation, which is easy enough, or the simplified AleaGPU library for C#. With both you can throw pretty much any instruction you know at it, but again, math formulas and collection/array/vector iteration work best.
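For what it's worth, the CUDA toolkit ships with the Thrust library, so a basic GPU sort of a single large array can be sketched roughly like this (a minimal, untuned example rather than production code):

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <cstdlib>
#include <vector>

int main()
{
    // Fill a host array with random values, copy it to the GPU,
    // sort it there, then copy the result back.
    std::vector<int> h(1 << 20);
    for (size_t i = 0; i < h.size(); ++i) h[i] = std::rand();

    thrust::device_vector<int> d(h.begin(), h.end()); // host -> device copy
    thrust::sort(d.begin(), d.end());                 // sort runs on the GPU
    thrust::copy(d.begin(), d.end(), h.begin());      // device -> host copy
    return 0;
}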


 Here is my two cents' worth...

From a slightly different viewpoint than others before me.

 

CPUs contain lots of complicated circuitry and caches to minimize latency and to synchronize memory between cores. This takes space and energy away from execution units, but the comparatively small number of execution units they do have will be super quick.

 

GPU architectures contain hardware schedulers to quickly switch warps/work items when the required data is not yet available, so memory latency doesn't hurt them as much, but because of the huge number of cores/execution units they need much more memory throughput. Memory synchronization is highly limited, which makes programming them more complicated in some ways.

 

On 7/11/2021 at 8:03 PM, BotDamian said:

Why can program XY run on the CPU but not on the GPU? Where is the difference, what instructions are missing, etc.?

You can think of a CUDA core / shader core as a really simple CPU. It can do floating-point and integer arithmetic much like a CPU's arithmetic units, but these cores are not used on their own: 8, 16 or 32 of them are grouped together inside a compute unit (or streaming multiprocessor) and the same instruction is given to most of them (it depends on the architecture).

This is the key difference: the same instruction is given to multiple execution units. If an algorithm is highly parallelizable it will work great on a GPU; if not, it will be much slower, because most of the hardware won't be utilized.

 

To be more concrete: if a problem can't be broken down into tens of thousands of threads, it's probably not a good candidate for a GPU implementation.
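A small, purely illustrative CUDA fragment of why branchy code hurts on that kind of hardware: threads in the same warp share one instruction stream, so when they disagree at a branch the hardware has to run both paths one after the other.

// Even and odd threads sit in the same warp, so the two branches below are
// executed one after the other (with non-participating threads masked off)
// instead of in parallel. Divergent code wastes throughput this way.
__global__ void divergentExample(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    if (i % 2 == 0)
        data[i] = data[i] * 2.0f;
    else
        data[i] = data[i] + 1.0f;
}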

 

On 7/11/2021 at 8:03 PM, BotDamian said:

I also know that you don't talk to the GPU directly but use APIs like OpenGL, Vulkan, CUDA, etc.

GPU architectures and their instruction sets can be vastly different. There are no standardized instruction sets like ARMv8 on mobile or AMD64 on PC. If the GPU supports (let's say) OpenGL, then the GPU driver will be able to compile the GLSL shader code for the card on the fly. Newer APIs support an intermediate representation (SPIR-V in Vulkan's case, similar in spirit to Java bytecode), but the main concept is the same: the driver compiles the intermediate code to the specific instruction set of the given architecture.

 

 

 

ಠ_ಠ


On 7/13/2021 at 8:44 AM, BotDamian said:

So sorting an array wouldn't be possible with a GPU?

Would it be possible? Yes. Would it make sense? Probably not. The strength of a GPU is to be able to process huge amounts of data in parallel. For anything else the CPU is usually better suited.

 

Imagine you have an array that contains a million numbers (N1, N2, N3, …). You want to add pairs of numbers like this:

add N1, N2
add N3, N4
add N5, N6
…

To add all of these numbers you have to perform 500,000 additions. Those additions can be done in parallel, since there is no interdependence.

 

If each addition takes one clock cycle, that would be 500,000 clock cycles on a single core CPU to complete. For a 5 GHz CPU (0.2 ns per clock cycle) this would take roughly 0.1 milliseconds to complete.

 

Now take a GPU with 1000 cores able to perform these additions in parallel, 1000 at a time. Instead of 500,000 clock cycles it will only take 500 clock cycles to complete. Even if your GPU is running at just 1 GHz (1 ns per clock cycle), it would be able to complete that in ~500 nanoseconds or 0.0005 ms.
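A rough CUDA sketch of that parallel case (the names are hypothetical): thread k adds one independent pair, so all 500,000 additions can be in flight at once, limited only by how many cores the GPU has.

// Pairwise addition: thread k computes r[k] = n[2k] + n[2k + 1].
// Every pair is independent, so the additions can run in parallel.
__global__ void addPairs(const float* n, float* r, int pairs)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k < pairs)
        r[k] = n[2 * k] + n[2 * k + 1];
}

// For one million input numbers (500,000 pairs):
// addPairs<<<(500000 + 255) / 256, 256>>>(d_numbers, d_results, 500000);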

 

Now do the same thing again, but make the additions depend on one another like this:

add N1, N2 (= R1)
add R1, N3 (= R2)
add R2, N4 (= R3)
…

This can no longer be run in parallel, because the result of each operation depends on the result of the previous operation. So you need to complete the first addition before you can start the second one and so on. If I'm not mistaken this would require a total of 999,999 additions.

 

Using the same CPU as above, it would take ~0.2 ms to complete.

 

What about the GPU? You no longer benefit from having 1000 cores that could theoretically do stuff in parallel, because the operations can't be parallelized. So it will now also need 999,999 clock cycles while the majority of its cores are idle/wasted. At 1 GHz it would take ~1 ms to complete. So you just went from 200 times faster to 5 times slower.

 

To get the most out of a GPU you need massive amounts of data that can be processed in parallel. Most applications (say running a browser) are not that. Sorting an array is not that (unless you need to sort a million different arrays in parallel). Running these on a GPU would be a waste of resources and slower than doing it on a CPU in pretty much all cases.

 

Processing millions or billions of triangles, all of which need to be rotated, textured and projected, is exactly that (3D games). Simulating a huge number of variations of possible protein folds is exactly that (Folding at Home). Testing a huge number of random values to see whether their hash matches a specific target is exactly that (mining). All of these work because you can run the operations on thousands of cores in parallel: they are independent from one another while requiring the exact same operation to be done over and over, just with different input values.



1 hour ago, Eigenvektor said:

To get the most out of a GPU you need massive amounts of data that can be processed in parallel. Most applications (say running a browser) are not that. Sorting an array is not that (unless you need to sort a million different arrays in parallel). Running these on a GPU would be a waste of resources and slower than doing it on a CPU in pretty much all cases.

Actually, depending on the algorithm, sorting can be quite parallel. An example being merge sort: divide the list in half, sort each half (which can be done in parallel), and then merge the two halves (since both halves are sorted, the merge is quick). I'd imagine that for most applications, though, sorting doesn't really cause enough of a slowdown to justify using a GPU... and with that said, the overhead likely means you would need a large array.

 

To be clear though, sorting a single array can be done on a GPU and sorting algorithms can be made highly parallel. An example is quicksort: it uses divide and conquer, so it could easily be split across thousands of cores... but then again there is overhead, so you likely need a large array (one that takes seconds to sort instead of milliseconds) for it to pay off.



6 hours ago, wanderingfool2 said:

Actually, depending on the algorithm, sorting can be quite parallel. An example being merge sort: divide the list in half, sort each half (which can be done in parallel), and then merge the two halves (since both halves are sorted, the merge is quick). I'd imagine that for most applications, though, sorting doesn't really cause enough of a slowdown to justify using a GPU... and with that said, the overhead likely means you would need a large array.

 

To be clear though, sorting a single array can be done on a GPU and sorting algorithms can be made highly parallel. An example is quicksort: it uses divide and conquer, so it could easily be split across thousands of cores... but then again there is overhead, so you likely need a large array (one that takes seconds to sort instead of milliseconds) for it to pay off.

Yeah, I know sorting can be parallelized, but as you said, the overhead of sending the data to the GPU, getting it sorted and then sending it back to the CPU isn't necessarily worth it unless we're talking about a fairly large list. https://forums.developer.nvidia.com/t/fastest-sorting-algorithm-on-gpu-currently/43958



On 7/12/2021 at 5:54 PM, Kilrah said:

Yep basically. 

The reason there can be thousands of cores is that they're small ones that are designed to be good at a small set of specific things. 

There is "can't" and there is "can't do it well". One could use a freight train to crack walnuts, for example. Sometimes it truly can't, sometimes it's just an absurd tool that requires vast ingenuity and complication to do a simple thing. One uses what one has, though. There could be a situation where it's the least bad option.

Edited by Bombastinator



12 hours ago, Eigenvektor said:

overhead of sending the data to the GPU, getting it sorted and then sending it back to the CPU

Integrated graphics are in a better position when it comes to data transfer. If the data is 4K-aligned (and in a format the GPU can access well), then there is no need to copy it; it can be mapped into the GPU's memory space quite quickly. On a mobile SoC it can be beneficial to do calculations on the GPU for a better performance-per-watt ratio. And nowadays lots of desktop chips are equipped with integrated GPUs as well.
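In CUDA terms, the zero-copy idea looks roughly like this (a sketch only; integrated GPUs and mobile SoCs expose the same concept through their own APIs, and alignment/caching details matter in practice):

#include <cuda_runtime.h>

int main()
{
    // Allow host allocations to be mapped into the GPU's address space.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    // Page-locked host buffer the GPU can address directly,
    // instead of copying it into separate device memory first.
    float* hostPtr = nullptr;
    cudaHostAlloc((void**)&hostPtr, 1 << 20, cudaHostAllocMapped);

    float* devicePtr = nullptr;
    cudaHostGetDevicePointer((void**)&devicePtr, hostPtr, 0); // GPU-visible alias

    // A kernel could now read/write devicePtr without an explicit cudaMemcpy;
    // on an integrated GPU this really is the same physical memory.

    cudaFreeHost(hostPtr);
    return 0;
}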



So when would it make sense to use a GPU for filtering an array (if that's even possible on a GPU)?

 

1,000 items? 100k? I assume that even a mobile SoC GPU can be faster than the 8 ARM cores?

Like the RK3399



22 minutes ago, BotDamian said:

So when would it make sense to use a GPU for filtering an array (if that's even possible on a GPU)?

 

1,000 items? 100k? I assume that even a mobile SoC GPU can be faster than the 8 ARM cores?

Like the RK3399

It likely really depends on the data. Honestly though, it's better to have a real use case first and then figure out whether it even makes sense to use a CPU or a GPU.

 

You could optimize sorting to death, but if it only takes 0.1 seconds to sort and only gets called a few times in an application that takes 5 minutes to run, it wouldn't make sense. If sorting is the bottleneck, sure, then it might make sense... but the best thing would be to run some tests on the hardware you expect it to run on (as all hardware is different).
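As a hedged illustration of "just measure it": timing one call to the filtering step on the actual target hardware (the filterItems function below is hypothetical) already tells you whether it's anywhere near the time budget.

#include <chrono>
#include <cstdio>

// Time one invocation of a workload, in milliseconds.
template <typename Work>
double measureMs(Work&& work)
{
    auto start = std::chrono::steady_clock::now();
    work(); // run the real filtering/sorting step once
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - start).count();
}

// Usage, where filterItems(data) stands in for the real filter:
// std::printf("filter took %.3f ms\n", measureMs([&] { filterItems(data); }));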



1 hour ago, wanderingfool2 said:

It likely really depends on the data. Honestly though, it's better to have a real use case first and then figure out whether it even makes sense to use a CPU or a GPU.

 

You could optimize sorting to death, but if it only takes 0.1 seconds to sort and only gets called a few times in an application that takes 5 minutes to run, it wouldn't make sense. If sorting is the bottleneck, sure, then it might make sense... but the best thing would be to run some tests on the hardware you expect it to run on (as all hardware is different).

So the thing is that it will filter (not sort) every 3 seconds. It can be an array of 100 items, then the next second it can be an array with 10k items, then 4k items.
In short, we will never know how many items there will be; if god wants it, it could even be 5 million items. Similar to AI, you don't know exactly how much data comes out; it depends.

 

That's why I'm asking: if the Raspberry Pi 4 can filter 1 million items in a second, then this shouldn't be a problem. If it takes a second longer than 3 s, that's not bad; it just shouldn't become 20 s though.



13 hours ago, BotDamian said:

So the thing is that it will filter (not sort) every 3 seconds. It can be an array of 100 items, then the next second it can be an array with 10k items, then 4k items.
In short, we will never know how many items there will be; if god wants it, it could even be 5 million items. Similar to AI, you don't know exactly how much data comes out; it depends.

 

That's why I'm asking: if the Raspberry Pi 4 can filter 1 million items in a second, then this shouldn't be a problem. If it takes a second longer than 3 s, that's not bad; it just shouldn't become 20 s though.

The problem you describe here is almost antithetical to optimisation. CPUs are "slow", because they don't know what to expect and need to be able to handle practically anything you throw at them. This comes at the cost of speed or efficiency. GPUs are fast at specific things, because they are purpose-built to do those specific things, but that comes at the cost of flexibility.  This is a general problem: the more specific and constrained something is, the easier it is to optimise for. The more general and free it is, the harder it is to optimise for, if at all.

 

In your case I would first determine what the worst case scenario for the filtering step is. Is there an absolute maximum it can reach? If yes, figure out the absolute longest it can become and go from there. If not (or as an alternative to yes), estimate what the longest the array will realistically become is (100 items? 10k items? 5M items?), and accept that on the rare occasion a longer one pops up it will simply take longer to process. This ties in to @wanderingfool2's point. Is this filtering step mission critical, in that every array absolutely has to be filtered in, say, under 20 s, or are you simply looking to optimise the program? If it's the former, then go ahead; if it's the latter, your time may be better spent profiling the application to identify what the biggest bottlenecks are and which are best to tackle from an effort-spent vs time-saved point of view (e.g. spend a month optimising the way data is handled to gain a 10% speedup vs spending a year writing the ultimate filtering algorithm to gain 15%).

 

Once you know what the problem you are trying to optimise is, you can start looking into the most efficient filtering algorithms for your data. This is where you start thinking about the nature of the problem. Is it highly parallel, working on lots of independent bits of data? That can be worth investigating a GPU solution for. The second question becomes whether it is worth it from an efficiency standpoint to run it on the GPU. For example, if it takes longer to copy the data to and from the GPU than it takes to filter it, and a CPU, although technically slower, can do it in the same time or less, it may not be worth going through the hassle of a GPU implementation.
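If the filtering does turn out to be worth moving to the GPU, "stream compaction" is the usual pattern. Below is a minimal sketch using CUDA's Thrust library, with a keep-items-above-a-threshold predicate standing in for the real filter condition (CUDA itself needs an Nvidia GPU, so on an ARM SoC the same pattern would be expressed in OpenCL or Vulkan compute instead):

#include <thrust/device_vector.h>
#include <thrust/copy.h>

// Hypothetical predicate standing in for the real filter condition.
struct AboveThreshold
{
    float threshold;
    __host__ __device__ bool operator()(float x) const { return x > threshold; }
};

// d_in already lives in GPU memory; keep only the items that pass the predicate.
thrust::device_vector<float> filterOnGpu(const thrust::device_vector<float>& d_in,
                                         float threshold)
{
    thrust::device_vector<float> d_out(d_in.size());
    auto end = thrust::copy_if(d_in.begin(), d_in.end(),
                               d_out.begin(), AboveThreshold{threshold});
    d_out.resize(end - d_out.begin()); // shrink to the number of kept items
    return d_out;
}

Whether this beats the CPU cores depends entirely on the array sizes and on the cost of getting the data to the GPU, which is exactly why measuring on the target hardware comes first.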


