How is pointer incrementation faster than addition in C?

Gat Pelsinger
Go to solution Solved by wanderingfool2,


#include <stdio.h>
#include <time.h>

const char* str = "Hello";

int func1()
{
    register int i = 0;
    for (; str[i]; i++);
    return i;
}

int func2()
{
    register const char* i = str;
    for (; *i; i++);
    return (int)(i - str);
}

int main() {
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    int a = func1();
    clock_gettime(CLOCK_MONOTONIC, &end);
    /* Subtract whole seconds too, or the result is wrong whenever the
       two readings straddle a second boundary. */
    printf("%d %ld\n", a, (end.tv_sec - start.tv_sec) * 1000000000L
                          + (end.tv_nsec - start.tv_nsec));

    clock_gettime(CLOCK_MONOTONIC, &start);
    int b = func2();
    clock_gettime(CLOCK_MONOTONIC, &end);
    printf("%d %ld\n", b, (end.tv_sec - start.tv_sec) * 1000000000L
                          + (end.tv_nsec - start.tv_nsec));

    return 0;
}

 

Both func1 and func2 return the length of the string. func1 uses an addition-based approach: it adds an offset to the address of 'str' and effectively increments it that way, whereas func2 uses a pointer-incrementation approach: it creates a separate pointer instead of an index variable, sets it to the address of 'str', and increments it directly. Upon running the code, func2 appears to be much faster than func1. Why is that? Isn't the time complexity of addition O(1)? On top of that, in func2 we are physically incrementing a pointer, which I would expect to be a memory-bound task (well, technically, because our code is so small it might reside wholly in the cache and never hit system memory) rather than a compute-bound one, and compute-bound is supposed to beat memory-bound.

 

Microsoft owns my soul.

 

Also, Dell is evil, but HP kinda nice.


Once again, the answer will be in the assembly code the compiler put out for them, so look at that.

 

16 minutes ago, Gat Pelsinger said:

Isn't the time complexity of addition O(1)? And upon that, in func2 we are physically incrementing a pointer

Not necessarily, depending on how the compiler implemented it.

 

This, plus the same caveat you were previously given about timing things that way.

F@H
Desktop: i9-13900K, ASUS Z790-E, 64GB DDR5-6000 CL36, RTX3080, 2TB MP600 Pro XT, 2TB SX8200Pro, 2x16TB Ironwolf RAID0, Corsair HX1200, Antec Vortex 360 AIO, Thermaltake Versa H25 TG, Samsung 4K curved 49" TV, 23" secondary, Mountain Everest Max

Mobile SFF rig: i9-9900K, Noctua NH-L9i, Asrock Z390 Phantom ITX-AC, 32GB, GTX1070, 2x1TB SX8200Pro RAID0, 2x5TB 2.5" HDD RAID0, Athena 500W Flex (Noctua fan), Custom 4.7l 3D printed case

 

Asus Zenbook UM325UA, Ryzen 7 5700u, 16GB, 1TB, OLED

 

GPD Win 2


@Kilrah

 

func1:
	pushq	%rbp
	.seh_pushreg	%rbp
	pushq	%rbx
	.seh_pushreg	%rbx
	leaq	(%rsp), %rbp
	.seh_setframe	%rbp, 0
	.seh_endprologue
	movl	$0, %ebx
	jmp	.L4
.L5:
	addl	$1, %ebx
.L4:
	movq	str(%rip), %rdx
	movslq	%ebx, %rax
	addq	%rdx, %rax
	movzbl	(%rax), %eax
	testb	%al, %al
	jne	.L5
	movl	%ebx, %eax
	popq	%rbx
	popq	%rbp
	ret
	.seh_endproc
	.globl	func2
	.def	func2;	.scl	2;	.type	32;	.endef
	.seh_proc	func2
func2:
	pushq	%rbp
	.seh_pushreg	%rbp
	pushq	%rbx
	.seh_pushreg	%rbx
	leaq	(%rsp), %rbp
	.seh_setframe	%rbp, 0
	.seh_endprologue
	movq	str(%rip), %rbx
	jmp	.L8
.L9:
	addq	$1, %rbx
.L8:
	movzbl	(%rbx), %eax
	testb	%al, %al
	jne	.L9
	movq	str(%rip), %rax
	movq	%rbx, %rdx
	subq	%rax, %rdx
	movl	%edx, %eax
	popq	%rbx
	popq	%rbp
	ret
	.seh_endproc
	.def	__main;	.scl	2;	.type	32;	.endef
	.section .rdata,"dr"

 

 

oof, am gonna need to learn assembly soon.


56 minutes ago, Gat Pelsinger said:

And upon that, in func2 we are physically incrementing a pointer, which is a memory bounded task

Well, already you can see that's wrong: in both cases the counters being incremented live in registers.

 

But from a quick look you can see func1 transfers back and forth between registers in the loop while func2 doesn't.

In func2 you increment a pointer; that's just held as a 64-bit value in rbx, and that's what's incremented and used for indirectly addressing the string.

In func1 the counter is an int held in ebx, and every iteration it is copied into rax before the string's base address is added, and only then is the result used for indirect addressing.

 

Changing the type of i might change that; so could leaving out the register keyword. As you were told, the compiler normally knows what to do, and adding register can make things worse: it's possible that "register int" forces the compiler to put the counter in a 32-bit register, while a plain int would have been put in a 64-bit one that doesn't need the transfers.

 

 


@Kilrah

 

Well, even after removing the register keyword, I am getting similar results. Note that I had to wrap the calls in a 10,000-iteration loop to scale the time up, or I wouldn't get precise readings at all. With register, I got around 52K and 38K nanoseconds for func1 and func2 respectively; after removing the register keyword, I got 65K and 58K. So the gap got smaller once register was removed (from 37% to 12%).

 

My assumption would be that addition is O(1), but the addition itself is "slow" in this context, whereas incrementing the pointer barely takes any time, since our code probably resides mostly in the cache and never touches main memory. That could explain why the elapsed time of the pointer-based approach rose more (52%) than that of the addition-based approach (25%) when we dropped the register keyword. Addition is fast, but here the data set is really small, and I guess there is a lot of overhead in invoking and operating the ALU. Do you agree?


Doing tests like this isn't exactly perfect; there are just way too many factors to consider... and unless you want to get into the super nitty-gritty details of registers, cache sizes, operand sizes, etc., the general approach I would suggest is to just write good, readable code, because little micro-optimizations aren't typically needed (and can often be affected by functions and code that ran just prior). Also, I mentioned it before but I'll mention it again: you shouldn't rely on metrics from running a quick function like this once. There are too many variables that will throw off the measurement.

 

 

 

Now, to get to why func1 and func2 would differ, here is what I think (I didn't look at the assembly you posted, because if you need to look at assembly to optimize code, the first question should be whether it's even worth the time).

 

func1: you have the memory location str and the integer i. Both of these have to be stored; your for loop is incrementing i, a nice and simple task.

Now your stop condition, str[i]; what is it doing? It's taking str and adding i*sizeof(char) to it, so you are effectively adding an extra addition AND multiplication to the loop. After that it dereferences the result and checks it to decide whether to continue (btw, maybe it's the way I was taught, but you should really write the comparison explicitly, != NULL or rather != '\0').

 

func2's loop, on the other hand, only adds sizeof(char) to i each iteration.

 

So for myself I would inherently have assumed func2 without optimizations would run quicker, because it boils down to having 1 variable in cache that is being added, while func1 requires 2 variables, a multiplication and 2 additions.

 

3735928559 - Beware of the dead beef


5 hours ago, Gat Pelsinger said:

And upon that, in func2 we are physically incrementing a pointer, which is a memory bounded task

A pointer is just a number. That number may represent a memory address, but it does not have to reside in memory itself. Incrementing a pointer is exactly the same as incrementing an integer. There's no fundamental reason why it should be slower, nor is it memory bound any more than incrementing a plain integer is.

 

for (; str[i]; i++);

One thing to be aware of is that "str[i]" is simply a shorthand notation for "*(str + i)", which makes this loop equivalent to

for (; *(str + i); i++);

In other words, you're already doing pointer arithmetic.

 

This means your first loop performs four operations per iteration (some of which are "hidden"): compute the address (str + i), dereference it, compare the value at that address to zero, and finally increment the integer.

 

This loop, on the other hand, performs only three operations per iteration:

for (; *i; i++);

dereference the pointer, compare the value at that address to zero, and finally increment the pointer.

 

Which means that without any compiler optimizations the first loop should actually be slower: its cost is 4 × n, while the second loop's is 3 × n (with n being the length of the string).

Remember to either quote or @mention others, so they are notified of your reply


Let's only look at where the loops appear in the assembly code, since that's what will take up the majority of the execution time. I'll even annotate the assembly to make it easier to understand.

 

func1 (index-based):

.L5:
	addl	$1, %ebx           ; Increment the loop counter (ebx is a 32-bit register, we're using 32-bit ints)
.L4:
	movq	str(%rip), %rdx    ; Copy the base address of the string into rdx
	movslq	%ebx, %rax         ; Extend the loop counter from 32 to 64 bits via sign extension      
	addq	%rdx, %rax         ; Add the loop counter to the string's base address. 
	                           ; Now, rax contains the address of the next character to check
	movzbl	(%rax), %eax       ; Dereference that address and place the resulting character in eax
	testb	%al, %al           ; Check if the lowest 8 bits (the char) of eax is 0, null
	jne	.L5                ; If the char wasn't the null byte, continue the loop

 

func2 (iterator-based):

.L9:
	addq	$1, %rbx           ; Increment the pointer stored in rbx (64 bits)
.L8:	                           ; Now, rbx contains the address of the next character to test
	movzbl	(%rbx), %eax       ; Copy the char at that address into eax
	testb	%al, %al           ; Check if the lowest 8 bits is 0 (null)
	jne	.L9                ; If it wasn't the null byte, continue the loop

 

The only meaningful difference between these two bits of code is how the index-based approach calculates the address of the next character to read. func1 must get the address of the string, sign extend the loop counter (because ints are 32 bit while memory addresses are 64 bit), then add the loop counter to the base address. In contrast, func2 already has the address of the previous character stored in a register. All it needs to do to get the next address is add 1 to that register.

 

In short, the pointer approach is faster because it doesn't need to recalculate the address from scratch on every iteration. That only saves 3 instructions per iteration, which isn't much.

 

One more thing I want to note: have a look at the instruction after the labels .L5 and .L9 in the code above. These correspond to the increment step of the for loops. In both approaches, the incrementation is performed by a single addition instruction; adding 1 to a 32-bit number and to a 64-bit number takes the same amount of time. It also confirms @Eigenvektor's note that, at this level of code, pointers are just numbers.

Computer engineering grad student, cybersecurity researcher, and hobbyist embedded systems developer

 

Daily Driver:

CPU: Ryzen 7 4800H | GPU: RTX 2060 | RAM: 16GB DDR4 3200MHz C16

 

Gaming PC:

CPU: Ryzen 5 5600X | GPU: EVGA RTX 2080Ti | RAM: 32GB DDR4 3200MHz C16

