Another C quirk 🥴

Gat Pelsinger · February 18

Today I have another "why is this faster than this?" because I have nothing else to do in my life. Seriously, you all might get tired of replying to my posts because nothing I ask really matters, but once a question like this gets in my head, it bothers me until I figure it out.

You know how "ptr[i]" is the same as "*(ptr + i)"? Upon testing, it looks like it is not. And the difference is actually something I can talk about.

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < iteration; i++){
        /* array indexing; j avoids shadowing the outer i */
        for (int j = 0; str[j]; j++);
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    /* nanosecond field only; whole seconds (tv_sec) are ignored */
    printf("%ld\n", end.tv_nsec - start.tv_nsec);

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < iteration; i++){
        /* pointer arithmetic */
        for (int j = 0; *(str + j); j++);
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    printf("%ld\n", end.tv_nsec - start.tv_nsec);

This is only a snippet of my main because the string is so BIG it almost crashes VSCode. Anyways, the second loop is quite a bit faster than the first. The first loop uses array indexing, whereas the second loop uses pointer arithmetic.

Ah yes, the assembly code. If only I could read it!

main:
	pushq	%rbp
	.seh_pushreg	%rbp
	movq	%rsp, %rbp
	.seh_setframe	%rbp, 0
	subq	$96, %rsp
	.seh_stackalloc	96
	.seh_endprologue
	call	__main
	leaq	.LC0(%rip), %rax
	movq	%rax, -24(%rbp)
	leaq	-48(%rbp), %rax
	movq	%rax, %rdx
	movl	$1, %ecx
	call	clock_gettime
	movl	$0, -4(%rbp)
	jmp	.L4
.L7:
	movl	$0, -8(%rbp)
	jmp	.L5
.L6:
	addl	$1, -8(%rbp)
.L5:
	movl	-8(%rbp), %eax
	cltq
	movq	-24(%rbp), %rdx
	addq	%rdx, %rax
	movzbl	(%rax), %eax
	testb	%al, %al
	jne	.L6
	addl	$1, -4(%rbp)
.L4:
	cmpl	$0, -4(%rbp)
	jle	.L7
	leaq	-64(%rbp), %rax
	movq	%rax, %rdx
	movl	$1, %ecx
	call	clock_gettime
	movl	-56(%rbp), %eax
	movl	-40(%rbp), %edx
	subl	%edx, %eax
	movl	%eax, %edx
	leaq	.LC1(%rip), %rax
	movq	%rax, %rcx
	call	printf
	leaq	-48(%rbp), %rax
	movq	%rax, %rdx
	movl	$1, %ecx
	call	clock_gettime
	movl	$0, -12(%rbp)
	jmp	.L8
.L11:
	movl	$0, -16(%rbp)
	jmp	.L9
.L10:
	addl	$1, -16(%rbp)
.L9:
	movl	-16(%rbp), %eax
	cltq
	movq	-24(%rbp), %rdx
	addq	%rdx, %rax
	movzbl	(%rax), %eax
	testb	%al, %al
	jne	.L10
	addl	$1, -12(%rbp)
.L8:
	cmpl	$0, -12(%rbp)
	jle	.L11
	leaq	-64(%rbp), %rax
	movq	%rax, %rdx
	movl	$1, %ecx
	call	clock_gettime
	movl	-56(%rbp), %eax
	movl	-40(%rbp), %edx
	subl	%edx, %eax
	movl	%eax, %edx
	leaq	.LC1(%rip), %rax
	movq	%rax, %rcx
	call	printf
	movl	$0, %eax
	addq	$96, %rsp
	popq	%rbp
	ret

This is the only part which seems to be relevant.

wanderingfool2 · February 18

Again, I go back to what I said before: you can't just time these kinds of things against each other and conclude that they must not be equal. There are a bunch of things that can swamp what you are actually trying to measure in this case.

Let's look at the inner loops' assembly:

.L6:
	addl	$1, -8(%rbp)
.L5:
	movl	-8(%rbp), %eax
	cltq
	movq	-24(%rbp), %rdx
	addq	%rdx, %rax
	movzbl	(%rax), %eax
	testb	%al, %al
	jne	.L6
	addl	$1, -4(%rbp)

vs

.L10:
	addl	$1, -16(%rbp)
.L9:
	movl	-16(%rbp), %eax
	cltq
	movq	-24(%rbp), %rdx
	addq	%rdx, %rax
	movzbl	(%rax), %eax
	testb	%al, %al
	jne	.L10
	addl	$1, -12(%rbp)

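Notice how the assembly is pretty much the same? The only differences are the label names and the stack offsets of the loop counters. At that point, any timing difference comes from things like how the code that ran earlier affects the code that runs after it.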
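For what it's worth, identical assembly is what you would expect here: the C standard defines the subscript operator so that `E1[E2]` is identical to `(*((E1)+(E2)))`. Here is a minimal sketch of that equivalence. The function names `len_index`, `len_pointer` and `check_same` are made up for illustration and are not from the original program.

```c
#include <assert.h>
#include <stddef.h>

/* Counts characters using array indexing, like the first loop in the post. */
size_t len_index(const char *str)
{
    size_t i = 0;
    while (str[i])
        i++;
    return i;
}

/* Counts characters using pointer arithmetic, like the second loop. */
size_t len_pointer(const char *str)
{
    size_t i = 0;
    while (*(str + i))
        i++;
    return i;
}

/* Hypothetical helper: aborts if the two spellings ever disagree. */
void check_same(const char *str)
{
    /* Both loops stop at the same terminating '\0'. */
    assert(len_index(str) == len_pointer(str));
    /* str[0] and *(str + 0) name the same object, so the addresses match. */
    assert(&str[0] == str + 0);
}
```

Calling `check_same` on any NUL-terminated string should pass silently, since the two forms are only different spellings of the same expression.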

Gat Pelsinger · February 18

@wanderingfool2

Hmm. I changed the order of the loops, and whichever loop runs second always seems to be faster. Looks like there could be some caching or prefetching going on.
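One way to test that guess is to take the warm-up effect out of the measurement. The sketch below does an untimed pass over the string first, repeats the timed pass several times, and keeps the best result. It also includes `tv_sec` in the elapsed time, since subtracting `tv_nsec` alone goes wrong whenever the interval crosses a second boundary. The names `elapsed_ns`, `scan_index`, `scan_pointer` and `best_time_ns` are assumptions for illustration, not part of the original program, and this assumes a platform that provides POSIX `clock_gettime`.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Elapsed nanoseconds between two timestamps, whole seconds included. */
int64_t elapsed_ns(struct timespec start, struct timespec end)
{
    return (int64_t)(end.tv_sec - start.tv_sec) * 1000000000LL
         + (int64_t)(end.tv_nsec - start.tv_nsec);
}

/* The two loop styles from the post, wrapped as functions. */
size_t scan_index(const char *str)
{
    size_t i = 0;
    while (str[i])
        i++;
    return i;
}

size_t scan_pointer(const char *str)
{
    size_t i = 0;
    while (*(str + i))
        i++;
    return i;
}

/* Best time over `rounds` timed passes, after one untimed warm-up pass. */
int64_t best_time_ns(size_t (*scan)(const char *), const char *str, int rounds)
{
    assert(rounds > 0);

    /* Warm-up: touch the whole string once before any timing starts.
       The volatile sink keeps the result in use so the call is not dropped. */
    volatile size_t sink = scan(str);

    int64_t best = INT64_MAX;
    for (int r = 0; r < rounds; r++) {
        struct timespec start, end;
        clock_gettime(CLOCK_MONOTONIC, &start);
        sink = scan(str);
        clock_gettime(CLOCK_MONOTONIC, &end);

        int64_t ns = elapsed_ns(start, end);
        if (ns < best)
            best = ns;
    }
    (void)sink;
    return best;
}
```

If the "second loop is faster" effect really comes from warm-up, then `best_time_ns(scan_index, str, 10)` and `best_time_ns(scan_pointer, str, 10)` should land close to each other no matter which one is called first. That is only a sanity check, though, not a rigorous benchmark.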