Question in C about CPU performance.

Gat Pelsinger

As I like to keep poking around with C and its usage of CPU and memory, I created a program that uses multithreading (pthreads), where each thread runs a function in an infinite for loop (no loop condition). The function contains a few register variables and very long lines of code which just do random mathematical operations.

 

My laptop CPU power throttles, so I can use the clock speeds to gauge how much load the program puts on the CPU. With just an infinite but empty for loop I get 3.9 GHz turbo (the CPU's max is 4.2) and it settles to 3.5 GHz on PL1. Note that we are stressing all 4 cores. If I now add a lot of really long lines of random math code (without compiler optimizations, obviously), I get some turbo clock (I don't remember the value) and it settles to 2.6 GHz on PL1, which clearly shows we have more load. As a matter of fact, here I tried replacing the register variables with normal variables and the clocks only went up to around 2.7 GHz. I thought I would see a much bigger difference, but I guess we are completely compute bound so it doesn't matter, or perhaps C is caching those variables in faster storage areas.

Anyway, what I wanted to show is that (after reverting back to register variables) I copied my random math lines and spammed them so much that VS Code couldn't even parse the file. I then compiled the code, expecting either a further clock drop because of more load, or a minimal drop because we had reached the limit. But no, the clocks actually increased to over 3.0 GHz. I didn't expect that. Does that mean that, because there is so much code now, it used to cache the code or something but now it can't? How does C actually allocate variables and instructions? Even with optimizations off, does it load the instructions in cache and perhaps variables in registers?

Microsoft owns my soul.

 

Also, Dell is evil, but HP kinda nice.


20 minutes ago, Gat Pelsinger said:

Even with optimizations off, does it load the instructions in cache and perhaps variables in registers?

The way you talk makes it seem like you still think of C like some kind of runtime. It isn't. Your code gets compiled to native machine code, so you can look at the compiled code to figure out what exactly is going on and where the difference is.
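As a rough illustration of looking at the compiled code, here is a minimal sketch. The function `mix` and the file name `mix.c` are made up for this example and are not taken from the original program. The commands in the comment assume GCC, where `-S` stops after generating assembly.

```c
#include <stdint.h>

/*
 * Hypothetical stand-in for one line of "random math".
 * Assuming GCC and that this file is saved as mix.c, the generated
 * assembly can be written out at two optimization levels with:
 *
 *   gcc -O0 -S mix.c -o mix_O0.s
 *   gcc -O2 -S mix.c -o mix_O2.s
 *
 * Comparing the two files shows whether a, b and acc are kept in
 * registers or are loaded from and stored to the stack.
 */
uint32_t mix(uint32_t a, uint32_t b)
{
    uint32_t acc = a;
    acc = acc * 31u + b;   /* multiply-add */
    acc ^= acc >> 7;       /* shift and xor */
    acc += a ^ b;          /* mix the inputs back in */
    return acc;
}
```

Even without reading every instruction, the number of loads and stores around each arithmetic operation in the two `.s` files is usually easy to spot.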

 

Data needs to be loaded into registers for the CPU to perform mathematical operations on it. The only question is whether the result stays there or is moved back and forth between memory and registers.

 

Maybe share your code? Otherwise we can't do anything other than speculate.

Remember to either quote or @mention others, so they are notified of your reply


@Eigenvektor Sharing the code is not that easy because it is too large. I spammed so many lines so that there is very little loop overhead. I can't take a look at the assembly because I don't understand it. I just want to get some idea of what is going on from your experience. Why did the load suddenly get lighter when I dramatically increased the number of instructions in the code?

 

I don't know if C by default caches stuff or not or if the CPU does something, but I think there was something going on before when code was small, that cannot be done now with so much code. Caching?


8 hours ago, Gat Pelsinger said:

I don't know if C by default caches stuff or not or if the CPU does something, but I think there was something going on before when code was small, that cannot be done now with so much code. Caching?

This is a bit like saying "Maybe English by default orders a beer when I enter a bar". C is a language. It doesn't do anything on its own. You use it to explicitly tell the computer what to do.

 

The compiler may optimize the code you've written, if possible, but it will only do so if that can be done without changing the semantics of your code. So if you haven't programmed any form of caching, the compiled program will not cache things.

 

That said, a CPU has various caches (L1/L2/L3), so if the amount of data you're working with is small, it is certainly possible your program never has to touch main memory at all.

 

However, I have nothing but a vague idea of what your code even does (some calculations on multiple threads), so it's hard to guess what could've caused the load to increase or decrease.

 

My general advice would be: If you benchmark code and the result seems dubious, double check that your code actually does what you think it does. And double check that you are measuring what you think you're measuring.
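One way to check what is actually being measured is to time a fixed amount of work directly instead of inferring load from clock speeds. The following is only a sketch: `do_work` and `time_work` are hypothetical names, the workload is an arbitrary xorshift-style loop, and `clock()` reports CPU time used by the process, which is an assumption that suits a single-threaded test like this one.

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical workload: a fixed number of dependent integer operations. */
uint64_t do_work(uint64_t iterations)
{
    uint64_t x = 88172645463325252ull;   /* arbitrary non-zero seed */
    for (uint64_t i = 0; i < iterations; i++) {
        x ^= x << 13;                    /* xorshift-style scrambling */
        x ^= x >> 7;
        x ^= x << 17;
    }
    /* Returning x makes the loop's result observable, so the compiler
     * cannot drop the loop as long as the caller uses the value. */
    return x;
}

/* Times do_work() with clock(); returns CPU seconds, stores the result in *out. */
double time_work(uint64_t iterations, uint64_t *out)
{
    clock_t start = clock();
    *out = do_work(iterations);
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling `time_work` with the same iteration count for each variant of the code gives a work-per-second figure that can be compared directly, and printing the value stored in `*out` confirms the work was really done.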


Tight loops can do weird things. Maybe when you had less math in the loop, it was going so fast that it was stumbling over memory allocation, and once you slowed it down with more math, that bottleneck was alleviated.

 

Or maybe your code just got long enough that it filled the instruction cache more effectively. When it was short, perhaps each time you stepped into your loop, it had to load the loop's instructions into cache, which meant it was doing a lot of memory work for just a little bit of computation. And then when your code was longer, it still had to load the loop's instructions into cache each time, but it was doing the same amount of memory work and a lot more computation.

 

As an exaggerated example of this, you can look into the N64's architecture. It infamously has a very small 16 KB instruction cache, so things that optimize code can actually make things slower if it happens to split your code at the wrong line. A YouTuber called Kaze Emanuar has been doing an amazingly interesting deep dive into optimizing Mario 64, including a video talking about instruction caches.

 


I just want to pop in here and make a quick "drive-by" comment about something minor that you mentioned in your question,

21 hours ago, Gat Pelsinger said:

[...] The function contains a few register variables [..]

 

[...] I tried replace the register variables with normal variables [...] (I reverted back to register variables) [...]

The register storage class in C is only a hint to the compiler. Variables that are not declared with this class can be placed in registers under the correct circumstances, and variables which are in the register storage class may still end up in memory. Unless you examine the assembly directly, I'd caution you against drawing too many conclusions about variable placement based solely on a register storage class.

For example, it is very likely that if you're running a tight loop like,

for (size_t i=0; i<n; i++) {
	// do some quick math
}

that will never hit memory, regardless of whether it is declared using the register or auto storage class.

 

Generally speaking, the only hard-and-fast rule is that a variable must have an assigned memory region if its address is ever taken with &, for obvious reasons (you can't have a pointer to a register). Otherwise, the compiler is free to place the variable in a register or in memory as it feels is best. Also, remember that there is a hard cap on the number of architectural registers available to the compiler, and so with too many variables something will have to spill, regardless of how much you spam a register storage class.
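To make this concrete, here is a small sketch with made-up function names. The first two functions differ only in the storage class, so any difference in where `acc` and `i` live is up to the compiler. The third takes the address of `acc`, which is why it could not be declared `register` there. An optimizing compiler may still inline the helper and keep the value in a register, so the assembly is the only real evidence.

```c
#include <stddef.h>
#include <stdint.h>

/* Counter and accumulator declared with the register storage class. */
uint64_t sum_squares_register(size_t n)
{
    register uint64_t acc = 0;
    for (register size_t i = 0; i < n; i++) {
        acc += (uint64_t)i * i;
    }
    return acc;
}

/* Same loop with ordinary (auto) variables. */
uint64_t sum_squares_auto(size_t n)
{
    uint64_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc += (uint64_t)i * i;
    }
    return acc;
}

/* Helper that updates the accumulator through a pointer. */
static void add_square(uint64_t *dst, size_t i)
{
    *dst += (uint64_t)i * i;
}

/* acc has its address taken, so it cannot use the register storage class. */
uint64_t sum_squares_addressed(size_t n)
{
    uint64_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        add_square(&acc, i);
    }
    return acc;
}
```

All three return the same value for the same `n`. Whether they also compile to the same instructions depends on the compiler and the optimization level.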


@Flavius Heraclius I tested this by writing another program that benchmarks register against non-register variables. When I tell C to use registers, C obeys. And when I don't, it doesn't. I don't know whether it puts variables in registers when compiler optimizations are enabled.


11 minutes ago, Gat Pelsinger said:

I tested this by writing another program that benchmarks register against non-register variables. When I tell C to use registers, C obeys. And when I don't, it doesn't.

The only way for us to know if this is truly the case is by looking at the assembly code generated by the compiler.

 

Please post your code.

Computer engineering grad student, cybersecurity researcher, and hobbyist embedded systems developer

 

Daily Driver:

CPU: Ryzen 7 4800H | GPU: RTX 2060 | RAM: 16GB DDR4 3200MHz C16

 

Gaming PC:

CPU: Ryzen 5 5600X | GPU: EVGA RTX 2080Ti | RAM: 32GB DDR4 3200MHz C16


3 minutes ago, Gat Pelsinger said:

@Flavius Heraclius I tested this by writing another program that benchmarks register against non-register variables. When I tell C to use registers, C obeys. And when I don't, it doesn't. I don't know whether it puts variables in registers when compiler optimizations are enabled.

Make sure to use an up-to-date compiler and compile your code with full optimizations enabled when you want to compare performance.

 

I think it was already mentioned in a previous topic: the register keyword is only a hint, and the compiler is free to ignore it. In fact, it has to, if you declare more variables than there are registers available. The compiler will generally do a better job of figuring out what should be kept in a register, when it should be moved there, and for how long it should be kept around.

 

As I said above, to perform arithmetic operations on data, it will need to be loaded into a register beforehand. The only question is how long it stays there before its value is copied back into memory (unless it is no longer needed and removed entirely).
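A sketch of that difference, using function names invented for this example: with a plain local the compiler decides how long the value stays in a register, while `volatile` requires every read and write of the variable to actually be performed, which in practice means a load and a store on each iteration. How the plain version is compiled is an assumption that depends on the compiler and flags.

```c
#include <stdint.h>

/* Plain local: with optimizations enabled, the compiler is free to
 * keep acc in a register for the whole loop. */
uint32_t accumulate_plain(uint32_t n)
{
    uint32_t acc = 0;
    for (uint32_t i = 0; i < n; i++) {
        acc += i ^ (i >> 3);
    }
    return acc;
}

/* volatile local: each access to acc must really happen, so the value
 * is read from and written back to its storage on every iteration. */
uint32_t accumulate_volatile(uint32_t n)
{
    volatile uint32_t acc = 0;
    for (uint32_t i = 0; i < n; i++) {
        acc += i ^ (i >> 3);
    }
    return acc;
}
```

Both functions compute the same result. Timing them against each other with optimizations on gives a rough feel for the cost of the extra memory traffic, which is a different experiment from toggling the register keyword.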


On 1/20/2024 at 3:34 AM, Gat Pelsinger said:

@Eigenvektor Sharing the code is not that easily possible because it is too large.

One option for sharing it would be to just attach the source file to a post (file attachment, I mean), or perhaps upload it to a public git repository and share the link. I agree that spamming long code blocks into a forum isn't the best approach for something like this--but we really do need to see the code itself if we're going to be able to give anything except vague and general guidance.

When it comes to variable and register allocations, there are a lot of minute details about how you constructed your wall of arithmetic that are very relevant to determining how the compiler is behaving, how the CPU cache is being leveraged, etc.
