Jump to content

Why are inbuilt C (and perhaps other languages as well) libraries have so complicated and SLOWER code?

Gat Pelsinger

To prove it, this is my code which replaces strlen with my own strlen.-

 

#include <stdio.h>
#include <string.h>
#include <time.h>

int mystrlen(const char* str){
    int i = 0;
    while (str[i] != '\0'){
        i++;
    }
    return i;
}
int main(void){
    char* str = "Hello World";
    clock_t start;
    clock_t end;
    long double elapsed;
    start = clock();
    printf("%d\n", strlen(str));
    end = clock();
    elapsed = ((long double)(end - start)) / CLOCKS_PER_SEC;
    printf("strlen elapsed: %Lf\n", elapsed);
    start = clock();
    printf("%d\n", mystrlen(str));
    end = clock();
    elapsed = ((long double)(end - start)) / CLOCKS_PER_SEC;
    printf("mystrlen elapsed: %Lf\n", elapsed);
    return 0;
}

 

If you run this, you will see that my code is faster(if you cannot see any difference, try to lower your CPU clocks to a ridiculously low amount). Of course, milliseconds we are talking about, but this scales up in actual helpful and important functions which are really complex. If you see the strlen implementation in string.c source code (https://github.com/torvalds/linux/blob/master/lib/string.c), it shows - 

 

size_t strlen(const char *s)
{
	const char *sc;

	for (sc = s; *sc != '\0'; ++sc)
		/* nothing */;
	return sc - s;
}

 

Why use a new pointer, wasting CPU clocks and memory, using a for loop instead of a while (although I won't talk about it as I don't quite know which one is faster, but literally microseconds we are talking about and this does not scale), and then needing to do a whole subtraction. A WHOLE SUBTRACTION. That's like around 2 more CPU cycles and around a whopping 2 to 4 BYTES of memory. Sheesh, that's a lot of unoptimization I smelling.

 

Ok sarcasm aside, why though? Like whenever I open any inbuilt C header file, my mind is boggled seeing half of the words I don't understand, and all that complexity is for less performance? Like no seriously, I have to be missing on something, or else why would they ever use a new pointer, loop through the string, and then return the difference between the old and new pointer location instead of doing the more easy obvious way as I have done? 

Microsoft owns my soul.

 

Also, Dell is evil, but HP kinda nice.

Link to comment
Share on other sites

Link to post
Share on other sites

My guess: Your code is less portable. The length of "int" in C is architecture/compiler dependent, i.e. it could be 8, 16, 32 or 64 bit. While a pointer is also architecture dependent, I assume it is always large enough to count the length of the longest possible string a pointer can point to.

 

image.png.8b00458bde779ac890ce926a8e611dc2.png

 

~edit: This is a warning shown in VS Code. So I assume your code might fail for longer strings…

 

When you micro-benchmark something you should also generally repeat it thousands or even millions of times, to rule out measuring errors from the CPU doing other things in between (modern operating systems are not single threaded, after all). So you don't know if your program's thread got parked in the middle of your loop. Plus, you're also measuring the speed of printf, since it's between start and end.

 

You might also want to try that again with a much, much longer string. Yours might be faster for short strings, but maybe the other variant is actually faster for longer strings.

 

Last but not least, you might want to look at the compiled code (i.e. the resulting assembler/machine code), to see which one actually has less instructions. Both for and while might very well translate to the same machine code.

 

And, how do you know that the strlen implementation on your platform (I assume you're still on Windows?) matches the implementation used by the Linux kernel?

 

~edit #2

Spoiler
#include <stdio.h>
#include <time.h>

int mystrlen(const char* str){
    int i = 0;
    while (str[i] != '\0'){
        i++;
    }
    return i;
}

size_t strlen(const char *s)
{
	const char *sc;

	for (sc = s; *sc != '\0'; ++sc)
		/* nothing */;
	return sc - s;
}

int main(void){
    char* str = "eiquoh2ohga0xo0seeshahgu4chop6yee0uu7saal5la2aequ8xeichai0Oowee2jieNgai3eighaiy6Uu7Guohingiefoo4Iagoo6rai2awieGhexiacuesh3eekah4aphah9weiniH4ioBah7aingaY7ahcheethai2obaech9eeshohluD4shoipaeTu5oqu5reed8ega9yohtheefohghaeti8thieGh5guj7chieghao4quooduz7Phahz9Ibe7eeghoopei7theuCoi7thohw8aevohee9ojahWu2yohn9cheTeWahke8ceH7ahPuuhohjojo2phooquaighohshe4Gohr9Iedu5HeiFaev6waeYahmahreiho2Pheochoon0eb5echaLo1Hobooraim5Xah7Pheileizaem0JiiLeidaiChi3kuQueiph2cuixoh4ait9looZee0iethooGech3ye1yooghaepexeew5loat2Ahyaetai4eovee5aiphooseeceishae2gahPuiQuahmie7aequ8ahdieshugee9la5kooReiz6Iequ1uli9wuMie3Quay7chaixe7oojoh7ju5Thee1air2gowugu0we4gohx1aebi9ozita0juh6zedui4zaeHaeh2ea2cohgiz6Air8ieXi0lidohxoCahvi9maesheeshaihoo1aethus1deeg0uJaimaezoigaexiePohasux4FiLee6iithew5ue3woh8lo9chae0vei9Fae1eer3Pimohxe0chohn5aiku4sahgahfahduu6eexaeVoothi5zuPoophahy5feghoothim3wi6eetooSadeSh4Tha0Shei7iephub5Thei6HaiseiFee1iniePooshoru1ohl8sehi2quooxief2Eeci9phachiephuequeeGho5quo5de2raewau0ietaili1ahh2aikeeraghiethoThiTie3thihai2phaiwue1noeMain4faeGaiR1AeZaecee3ireesa8luuL0so8AQuu6Aiyumie1quonu0Shaezeojaefee0ungeeXee2EeTheir4yaekekeele6Oov2bue4ahk5voyetahMeen2bievoh2Ka3joe7tai1shee6gaRie5eesaifae5Meeleewae0ieghooh7LeeJ3Udeiquie0ahwolaeyeefohsh4Leequahsahchavanaleeb0NoofohF5aeCh6Wiep7Aiquoo2vuy6cae4eech6udah4feWiepu4xalaequ5Ashotohth9ie6za3coocei5siquolo9fie5xaiY3eiweeGhoghoer0leeB5ubucheish4saem9aezie7phu7tahboh7mieyooPh4ievu8ux8ahho4Se2biChiij3phe6ohciev9Vei6veithongahnahfohch3iuquahKiexeivoo2ieLou1chies5eiWah3eezah3ic0rai3xo9zi4oz8needoo9iapheihaech9Ue9IeY9ieri0lei9iya0aiTahmai1oiwoopuekeecaidoothoo2ii4WaithaeD6saeg4AhPie7gae8Eibeech6Iechaeghir7xezei2Oogai7lai9uu6keep9AG9yiju5Faiy2zebai7axo6EteeniereNg7pu2ajae2eighiec8phahy0teeb2ieCh7quodah1wie0xohS1aerahj8aghahh3soofaeJ7ijied1meezul1EeYo2vu5aengey0funooSookaipeing6auc9XiJe6anoZ0Ca8VuhaexeeTh3ohyohWohdu3eevohLie2eiYoo1oeGeequaeth4aecughoquiWis1aisaiphoz3uDahn1wah9Ohphoobi7to0iChowahCo4jo8gieR3goequ7jo2aeteob6eeph7hi1zoh7aithieB9tiiHodai9phao5eogi1te5va1kaeg0aegh0bae3za6hooroh7tuchahng7uwoon3ciy9aiD";

    clock_t start;
    clock_t end;

    long double elapsed;

    start = clock();
    for(int n = 0; n < 1000000; ++n) {
        strlen(str);
        asm("");
    }
    end = clock();

    elapsed = ((long double)(end - start)) / CLOCKS_PER_SEC;
    printf("strlen elapsed: %Lf\n", elapsed);

    start = clock();
    for(int n = 0; n < 1000000; ++n) {
        mystrlen(str);
        asm("");
    }
    end = clock();

    elapsed = ((long double)(end - start)) / CLOCKS_PER_SEC;
    printf("mystrlen elapsed: %Lf\n", elapsed);

    return 0;
}

 

gcc strlen_compare.c -o strlen_compare

strlen elapsed: 0.000824
mystrlen elapsed: 0.976289

 

gcc -O1 strlen_compare.c -o strlen_compare

strlen elapsed: 0.000457
mystrlen elapsed: 0.000459

Remember to either quote or @mention others, so they are notified of your reply

Link to comment
Share on other sites

Link to post
Share on other sites

43 minutes ago, Gat Pelsinger said:

Like whenever I open any inbuilt C header file, my mind is boggled seeing half of the words I don't understand, and all that complexity is for less performance?

Not a C programmer, but if for large projects (which a kernel certainly is) I feel "my mind is boggled seeing half of the words I don't understand" then it is all but certain that I do not have the full knowledge or picture to conclude

48 minutes ago, Gat Pelsinger said:

why would they ever <their method> instead of doing the more easy obvious way as I have done?

 

A quick search on datatype sizes seems to support at least one of @Eigenvektor's points regarding the use of int and portability: https://en.cppreference.com/w/c/types/size_t

Quote

Notes

size_t can store the maximum size of a theoretically possible object of any type (including array).

size_t is commonly used for array indexing and loop counting. Programs that use other types, such as unsigned int, for array indexing may fail on, e.g. 64-bit systems when the index exceeds UINT_MAX or if it relies on 32-bit modular arithmetic.

 

2 hours ago, Gat Pelsinger said:

If you run this, you will see that my code is faster

I tried to modify it slightly so it loops over them a couple million times and actually typing out the kernel function and compiling it along is slower, but your version only seems as fast as the native strlen when adding -O1 or higher. Any higher levels put it back in the very hard to measure regime.

image.thumb.png.bab9904e3d1fbd8a2240d35c56032a66.png

 

1 hour ago, Eigenvektor said:

And, how do you know that the strlen implementation on your platform (I assume you're still on Windows?) matches the implementation used by the Linux kernel?

Now I'm actually a bit curious about it as strangely if I copy paste that kernel version it doesn't get close to the performance that strlen seems to get on either godbolt or my laptop running Ubuntu. That, or my logic of timing a loop is flawed. I can't say I have ever done much C.

Crystal: CPU: i7 7700K | Motherboard: Asus ROG Strix Z270F | RAM: GSkill 16 GB@3200MHz | GPU: Nvidia GTX 1080 Ti FE | Case: Corsair Crystal 570X (black) | PSU: EVGA Supernova G2 1000W | Monitor: Asus VG248QE 24"

Laptop: Dell XPS 13 9370 | CPU: i5 10510U | RAM: 16 GB

Server: CPU: i5 4690k | RAM: 16 GB | Case: Corsair Graphite 760T White | Storage: 19 TB

Link to comment
Share on other sites

Link to post
Share on other sites

@Eigenvektor @tikker just wanted to let you know that the call to (built-in) strlen in your testing code gets optimized away, even when compiling without optimizations. When compiling with -O1, all functions get optimized out except for the kernelstrlen function, strangely.

No optimizations:

image.png.b5f834f74bc18764bea29ac59377e520.png

 

-O1:

image.png.13ce737619401be08b125322f2e2c717.png

image.png.b5d3bb1c61ddb80d4108fed5b6e62c67.png

 

The same thing happens on GCC 13.2, which is the latest version on Godbolt.

 

My testing code:

Spoiler
// Type your code here, or load an example.
#include <stdio.h>
#include <string.h>
#include <time.h>

int mystrlen(const char* str){
    int i = 0;
    while (str[i] != '\0'){
        i++;
    }
    return i;
}

size_t kernelstrlen(const char *s)
{
	const char *sc;

	for (sc = s; *sc != '\0'; ++sc)
		/* nothing */;
	return sc - s;
}

int main(void){
    char* str = "eiquoh2ohga0xo0seeshahgu4chop6yee0uu7saal5la2aequ8xeichai0Oowee2jieNgai3eighaiy6Uu7Guohingiefoo4Iagoo6rai2awieGhexiacuesh3eekah4aphah9weiniH4ioBah7aingaY7ahcheethai2obaech9eeshohluD4shoipaeTu5oqu5reed8ega9yohtheefohghaeti8thieGh5guj7chieghao4quooduz7Phahz9Ibe7eeghoopei7theuCoi7thohw8aevohee9ojahWu2yohn9cheTeWahke8ceH7ahPuuhohjojo2phooquaighohshe4Gohr9Iedu5HeiFaev6waeYahmahreiho2Pheochoon0eb5echaLo1Hobooraim5Xah7Pheileizaem0JiiLeidaiChi3kuQueiph2cuixoh4ait9looZee0iethooGech3ye1yooghaepexeew5loat2Ahyaetai4eovee5aiphooseeceishae2gahPuiQuahmie7aequ8ahdieshugee9la5kooReiz6Iequ1uli9wuMie3Quay7chaixe7oojoh7ju5Thee1air2gowugu0we4gohx1aebi9ozita0juh6zedui4zaeHaeh2ea2cohgiz6Air8ieXi0lidohxoCahvi9maesheeshaihoo1aethus1deeg0uJaimaezoigaexiePohasux4FiLee6iithew5ue3woh8lo9chae0vei9Fae1eer3Pimohxe0chohn5aiku4sahgahfahduu6eexaeVoothi5zuPoophahy5feghoothim3wi6eetooSadeSh4Tha0Shei7iephub5Thei6HaiseiFee1iniePooshoru1ohl8sehi2quooxief2Eeci9phachiephuequeeGho5quo5de2raewau0ietaili1ahh2aikeeraghiethoThiTie3thihai2phaiwue1noeMain4faeGaiR1AeZaecee3ireesa8luuL0so8AQuu6Aiyumie1quonu0Shaezeojaefee0ungeeXee2EeTheir4yaekekeele6Oov2bue4ahk5voyetahMeen2bievoh2Ka3joe7tai1shee6gaRie5eesaifae5Meeleewae0ieghooh7LeeJ3Udeiquie0ahwolaeyeefohsh4Leequahsahchavanaleeb0NoofohF5aeCh6Wiep7Aiquoo2vuy6cae4eech6udah4feWiepu4xalaequ5Ashotohth9ie6za3coocei5siquolo9fie5xaiY3eiweeGhoghoer0leeB5ubucheish4saem9aezie7phu7tahboh7mieyooPh4ievu8ux8ahho4Se2biChiij3phe6ohciev9Vei6veithongahnahfohch3iuquahKiexeivoo2ieLou1chies5eiWah3eezah3ic0rai3xo9zi4oz8needoo9iapheihaech9Ue9IeY9ieri0lei9iya0aiTahmai1oiwoopuekeecaidoothoo2ii4WaithaeD6saeg4AhPie7gae8Eibeech6Iechaeghir7xezei2Oogai7lai9uu6keep9AG9yiju5Faiy2zebai7axo6EteeniereNg7pu2ajae2eighiec8phahy0teeb2ieCh7quodah1wie0xohS1aerahj8aghahh3soofaeJ7ijied1meezul1EeYo2vu5aengey0funooSookaipeing6auc9XiJe6anoZ0Ca8VuhaexeeTh3ohyohWohdu3eevohLie2eiYoo1oeGeequaeth4aecughoquiWis1aisaiphoz3uDahn1wah9Ohphoobi7to0iChowahCo4jo8gieR3goequ7jo2aeteob6eeph7hi1zoh7aithieB9tiiHodai9phao5eogi1te5va1kaeg0aegh0bae3za6hooroh7tuchahng7uwoon3ciy9aiD";
    

    clock_t start;
    clock_t end;

    long double elapsed;

    start = clock();
    for(int n = 0; n < 1000000; ++n) {
        strlen(str);
        asm("");
    }
    end = clock();

    elapsed = ((long double)(end - start)) / CLOCKS_PER_SEC;
    printf("builtin strlen elapsed: %Lf\n", elapsed);

    start = clock();
    for(int n = 0; n < 1000000; ++n) {
        kernelstrlen(str);
        asm("");
    }
    end = clock();

    elapsed = ((long double)(end - start)) / CLOCKS_PER_SEC;
    printf("kernelstrlen elapsed: %Lf\n", elapsed);

    start = clock();
    for(int n = 0; n < 1000000; ++n) {
        mystrlen(str);
        asm("");
    }
    end = clock();

    elapsed = ((long double)(end - start)) / CLOCKS_PER_SEC;
    printf("mystrlen elapsed: %Lf\n", elapsed);

    start = clock();
    for(int n = 0; n < 1000000; ++n) {
        asm("");
    }
    end = clock();
    
    elapsed = ((long double)(end - start)) / CLOCKS_PER_SEC;
    printf("empty loop elapsed: %Lf\n", elapsed);

    return 0;
}

 

 

Computer engineering grad student, cybersecurity researcher, and hobbyist embedded systems developer

 

Daily Driver:

CPU: Ryzen 7 4800H | GPU: RTX 2060 | RAM: 16GB DDR4 3200MHz C16

 

Gaming PC:

CPU: Ryzen 5 5600X | GPU: EVGA RTX 2080Ti | RAM: 32GB DDR4 3200MHz C16

Link to comment
Share on other sites

Link to post
Share on other sites

Just to build on some of dcgreen2k's comments regarding inlining.

When using library function, there are two significant things that you need to be aware of that can make your code "slower" than if you had implemented it yourself,

  1. Function inlining
  2. Function pointers

For the first point, one thing that a compiler will do for you is "remove" function calls by copying the code of the function directly into where you called it. This is probably the most powerful optimization in practice, because function calls have a lot of overhead.

 

The problem is that the compiler can only do this in particular circumstances. Things are improving, but generally you can only rely on inlining to work if the function code itself, and the call being inlined, are in the same translation unit. When calling libraries, if the function you are using exists as precompiled object code (either static or shared) the compiler's ability to inline it is severely restricted. LTO (link-time optimization) is being worked on, but it isn't nearly as reliable as when the compiler can "see" all the code at once.

 

The second point, function pointers, is more of an issue with complex library functions like qsort or bsearch. To support operation against general data types, these require you to specify a comparison function using a function pointer. This means that every single comparison turns into a function call, which is much slower than using direct comparison operators. Function pointers are also very hard to inline, and additionally hurt branch prediction too--it's generally a mess. You run into the same problems with virtual functions in OOP language, incidentally. Ironically, I've found that C++'s std::sort typically outperforms qsort by a large margin because of this (templates are a form of compile-time polymorphism so no function pointers are used, and inlining is easier for the compiler).

 

The bottom line is that all of these issues are going to swamp any micro-optimizations to the code, like those you are pointing out. Frankly, the compiler is likely to fix all of your issues with the library code (it's very good at that sort of thing) anyway.

 

The reason to use the standard library is not necessarily performance (though some of the functions can be very performant). It is to ensure that you are using a solid, well tested, and mature code-base. The more code that you write, the more likely you are to introduce bugs and errors. Unless you are writing something that is very performance sensitive, there's a strong case to live with the problems of the standard library.

Kernighan and Pike even suggest always using the library bsearch in their book, The Practice of Programming, for this reason. That's a library function that is a) slower (function pointers), and b) requires boilerplate on the order of a binary search implementation to interface with. But it's a lot easier to mess up a binary search than it is the interface boilerplate--so it results in code that is easier to maintain. Personally, I don't follow that advice. But it is still perfectly sound.

EDIT: Also, because function inlining increases the size of the binary, the compiler won't necessarily inline everything that it can at lower optimization levels. If you compile at -O3, that gives the compiler carte-blanche to do whatever it wants, ignoring any effect on the size of the binary. Inlining tends to bloat the binary size (because of code duplication). So you'll probably want to use -O3 for testing inlining.

Link to comment
Share on other sites

Link to post
Share on other sites

@Eigenvektor How are you getting slower readings in my code?

Microsoft owns my soul.

 

Also, Dell is evil, but HP kinda nice.

Link to comment
Share on other sites

Link to post
Share on other sites

38 minutes ago, Flavius Heraclius said:

Just to build on some of dcgreen2k's comments regarding inlining.

I had a look at the generated assembly, and the compiler didn't inline any of the functions. The relevant function calls just aren't there, because the compiler knows we aren't doing anything with the return values.

 

Take a look at the instructions immediately after labels .L12, .L13, .L14, and .L15 here: https://godbolt.org/z/4f156drWK

.L15 corresponds to the empty loop and only subtracts 1 from the loop counter before jumping back up. .L12 and .L14 correspond to the builtin strlen and mystrlen tests and show the same instructions as the empty loop. .L13 is the loop for the kernelstrlen test and is the only one to keep the function call. I'm not sure why this is the case.

Computer engineering grad student, cybersecurity researcher, and hobbyist embedded systems developer

 

Daily Driver:

CPU: Ryzen 7 4800H | GPU: RTX 2060 | RAM: 16GB DDR4 3200MHz C16

 

Gaming PC:

CPU: Ryzen 5 5600X | GPU: EVGA RTX 2080Ti | RAM: 32GB DDR4 3200MHz C16

Link to comment
Share on other sites

Link to post
Share on other sites

6 minutes ago, dcgreen2k said:

I had a look at the generated assembly, and the compiler didn't inline any of the functions. The relevant function calls just aren't there, because the compiler knows we aren't doing anything with the return values.

 

Take a look at the instructions immediately after labels .L12, .L13, .L14, and .L15 here: https://godbolt.org/z/4f156drWK

.L15 corresponds to the empty loop and only subtracts 1 from the loop counter before jumping back up. .L12 and .L14 correspond to the builtin strlen and mystrlen tests and show the same instructions as the empty loop. .L13 is the loop for the kernelstrlen test and is the only one to keep the function call. I'm not sure why this is the case.

Ah! Yes, that old problem. Sorry, I missed that.

Technically, in order for the compiler to eliminate the code, it first has to inline it. Then it knows that it doesn't do anything and can be dropped.

EDIT: The reason it needs to inline it is because, if the compiler is just looking at a function symbol to be later filled in by the linker, it cannot know if the function has side effects or not. Once the function has been inlined, the compiler can know this and will destroy the unneeded code without mercy.

 

The kernelstrlen function was probably not dropped because it uses pointers. The compiler cannot reason well about pointers, so it usually avoids touching code like that with a ten foot pole.

 

Link to comment
Share on other sites

Link to post
Share on other sites

I modified the testing code with some inline assembly so that the builtin strlen function would actually get called. The assembly is designed to match the compiler output for the other loops, and the code should be compiled with optimizations off. Otherwise, it'll segfault.

Spoiler
// Type your code here, or load an example.
#include <stdio.h>
#include <string.h>
#include <time.h>

int mystrlen(const char* str){
    int i = 0;
    while (str[i] != '\0'){
        i++;
    }
    return i;
}

size_t kernelstrlen(const char *s)
{
	const char *sc;

	for (sc = s; *sc != '\0'; ++sc)
		/* nothing */;
	return sc - s;
}

int main(void){
    char* str = "eiquoh2ohga0xo0seeshahgu4chop6yee0uu7saal5la2aequ8xeichai0Oowee2jieNgai3eighaiy6Uu7Guohingiefoo4Iagoo6rai2awieGhexiacuesh3eekah4aphah9weiniH4ioBah7aingaY7ahcheethai2obaech9eeshohluD4shoipaeTu5oqu5reed8ega9yohtheefohghaeti8thieGh5guj7chieghao4quooduz7Phahz9Ibe7eeghoopei7theuCoi7thohw8aevohee9ojahWu2yohn9cheTeWahke8ceH7ahPuuhohjojo2phooquaighohshe4Gohr9Iedu5HeiFaev6waeYahmahreiho2Pheochoon0eb5echaLo1Hobooraim5Xah7Pheileizaem0JiiLeidaiChi3kuQueiph2cuixoh4ait9looZee0iethooGech3ye1yooghaepexeew5loat2Ahyaetai4eovee5aiphooseeceishae2gahPuiQuahmie7aequ8ahdieshugee9la5kooReiz6Iequ1uli9wuMie3Quay7chaixe7oojoh7ju5Thee1air2gowugu0we4gohx1aebi9ozita0juh6zedui4zaeHaeh2ea2cohgiz6Air8ieXi0lidohxoCahvi9maesheeshaihoo1aethus1deeg0uJaimaezoigaexiePohasux4FiLee6iithew5ue3woh8lo9chae0vei9Fae1eer3Pimohxe0chohn5aiku4sahgahfahduu6eexaeVoothi5zuPoophahy5feghoothim3wi6eetooSadeSh4Tha0Shei7iephub5Thei6HaiseiFee1iniePooshoru1ohl8sehi2quooxief2Eeci9phachiephuequeeGho5quo5de2raewau0ietaili1ahh2aikeeraghiethoThiTie3thihai2phaiwue1noeMain4faeGaiR1AeZaecee3ireesa8luuL0so8AQuu6Aiyumie1quonu0Shaezeojaefee0ungeeXee2EeTheir4yaekekeele6Oov2bue4ahk5voyetahMeen2bievoh2Ka3joe7tai1shee6gaRie5eesaifae5Meeleewae0ieghooh7LeeJ3Udeiquie0ahwolaeyeefohsh4Leequahsahchavanaleeb0NoofohF5aeCh6Wiep7Aiquoo2vuy6cae4eech6udah4feWiepu4xalaequ5Ashotohth9ie6za3coocei5siquolo9fie5xaiY3eiweeGhoghoer0leeB5ubucheish4saem9aezie7phu7tahboh7mieyooPh4ievu8ux8ahho4Se2biChiij3phe6ohciev9Vei6veithongahnahfohch3iuquahKiexeivoo2ieLou1chies5eiWah3eezah3ic0rai3xo9zi4oz8needoo9iapheihaech9Ue9IeY9ieri0lei9iya0aiTahmai1oiwoopuekeecaidoothoo2ii4WaithaeD6saeg4AhPie7gae8Eibeech6Iechaeghir7xezei2Oogai7lai9uu6keep9AG9yiju5Faiy2zebai7axo6EteeniereNg7pu2ajae2eighiec8phahy0teeb2ieCh7quodah1wie0xohS1aerahj8aghahh3soofaeJ7ijied1meezul1EeYo2vu5aengey0funooSookaipeing6auc9XiJe6anoZ0Ca8VuhaexeeTh3ohyohWohdu3eevohLie2eiYoo1oeGeequaeth4aecughoquiWis1aisaiphoz3uDahn1wah9Ohphoobi7to0iChowahCo4jo8gieR3goequ7jo2aeteob6eeph7hi1zoh7aithieB9tiiHodai9phao5eogi1te5va1kaeg0aegh0bae3za6hooroh7tuchahng7uwoon3ciy9aiD";

    clock_t start;
    clock_t end;

    long double elapsed;

    start = clock();
    for(int n = 0; n < 1000000; ++n) {
        asm("movq %0,%%rax;" : : "m" (str));
        asm("movq %rax,%rdi");
        asm("call strlen");
    }
    end = clock();

    elapsed = ((long double)(end - start)) / CLOCKS_PER_SEC;
    printf("builtin strlen elapsed: %Lf\n", elapsed);

    start = clock();
    for(int n = 0; n < 1000000; ++n) {
        kernelstrlen(str);
        asm("");
    }
    end = clock();

    elapsed = ((long double)(end - start)) / CLOCKS_PER_SEC;
    printf("kernelstrlen elapsed: %Lf\n", elapsed);

    start = clock();
    for(int n = 0; n < 1000000; ++n) {
        mystrlen(str);
        asm("");
    }
    end = clock();

    elapsed = ((long double)(end - start)) / CLOCKS_PER_SEC;
    printf("mystrlen elapsed: %Lf\n", elapsed);

    start = clock();
    for(int n = 0; n < 1000000; ++n) {
        asm("");
    }
    end = clock();
    
    elapsed = ((long double)(end - start)) / CLOCKS_PER_SEC;
    printf("empty loop elapsed: %Lf\n", elapsed);

    return 0;
}

 

Its assembly can be seen here: https://godbolt.org/z/zzTE4o6Ez

 

Running the new tests, we get this. Again, these results are with optimizations turned off.

image.png.3508f484ffbadd1a4ed846b950b22565.png

 

Even after ensuring we call the function directly, the builtin strlen test runs incredibly quickly. What's going on?

To figure this out, I stepped through the testing code in GDB until I got to the strlen call. Stepping into strlen, GDB reported this:

image.png.ca4a74b8d48983c902df4555244b4122.png

 

Looking online, the source is available here: https://github.com/bminor/glibc/blob/master/sysdeps/x86_64/multiarch/strlen-avx2.S

So the strlen function that's getting called isn't the simple version we found earlier, it's all handwritten AVX2 assembly.

 

At this point, I'm looking at the wall of assembly code and the elapsed time for the builtin strlen test and I don't really believe it. So I decided to step into __strlen_avx2 in GDB and manually count how many times I have to hit enter before it returns. For a 2048-byte string, only 200 instructions were executed with 16 iterations of its loop.

 

Comparing the AVX2 version's 0.023636s with 0.519933s (with -O1) for the kernel strlen we found earlier, there's a 22x speedup. Knowing that the strlen we found earlier iterates over every character and executes 3 instructions per loop, these seem like reasonable results.

 

-----------------

 

Back to the original question, the "regular" builtin version of strlen is in fact slightly faster than your own version despite it appearing more complex. This is due to it being iterator-based compared to your index-based approach. With the index-based approach, the program needs to add the index to the array's base address, then dereference the resulting memory address to get the current character. In contrast, the iterator-based approach only needs to dereference the pointer it hangs onto.

 

I will stress that this difference is a serious micro-optimization and saves only a single assembly instruction when optimizations are turned on. There are better ways to make your program faster.

Computer engineering grad student, cybersecurity researcher, and hobbyist embedded systems developer

 

Daily Driver:

CPU: Ryzen 7 4800H | GPU: RTX 2060 | RAM: 16GB DDR4 3200MHz C16

 

Gaming PC:

CPU: Ryzen 5 5600X | GPU: EVGA RTX 2080Ti | RAM: 32GB DDR4 3200MHz C16

Link to comment
Share on other sites

Link to post
Share on other sites

3 hours ago, Gat Pelsinger said:

@Eigenvektor How are you getting slower readings in my code?

The spoiler below edit #2 contains the code I used for testing. I increased the string's length to 2048 characters, I loop over the functions 1M times and I included the strlen implementation, to make sure that's what I'm actually testing and I excluded printl from the loop.

 

5 hours ago, dcgreen2k said:

@Eigenvektor @tikker just wanted to let you know that the call to (built-in) strlen in your testing code gets optimized away, even when compiling without optimizations.

Hm, dang. I was hoping the asm("") I added into the loop would prevent this. I modified the code slightly to make sure the return value is used.

 

I also included strlen from string.h in the comparison this time, to see how the native version available on Manjaro compares

Spoiler
#include <stdio.h>
#include <time.h>
#include <string.h>

int mystrlen(const char* str){
    int i = 0;
    while (str[i] != '\0'){
        i++;
    }
    return i;
}

size_t nstrlen(const char *s)
{
    const char *sc;

    for (sc = s; *sc != '\0'; ++sc)
        /* nothing */;
    return sc - s;
}

int main(void){
    char* str = "eiquoh2ohga0xo0seeshahgu4chop6yee0uu7saal5la2aequ8xeichai0Oowee2jieNgai3eighaiy6Uu7Guohingiefoo4Iagoo6rai2awieGhexiacuesh3eekah4aphah9weiniH4ioBah7aingaY7ahcheethai2obaech9eeshohluD4shoipaeTu5oqu5reed8ega9yohtheefohghaeti8thieGh5guj7chieghao4quooduz7Phahz9Ibe7eeghoopei7theuCoi7thohw8aevohee9ojahWu2yohn9cheTeWahke8ceH7ahPuuhohjojo2phooquaighohshe4Gohr9Iedu5HeiFaev6waeYahmahreiho2Pheochoon0eb5echaLo1Hobooraim5Xah7Pheileizaem0JiiLeidaiChi3kuQueiph2cuixoh4ait9looZee0iethooGech3ye1yooghaepexeew5loat2Ahyaetai4eovee5aiphooseeceishae2gahPuiQuahmie7aequ8ahdieshugee9la5kooReiz6Iequ1uli9wuMie3Quay7chaixe7oojoh7ju5Thee1air2gowugu0we4gohx1aebi9ozita0juh6zedui4zaeHaeh2ea2cohgiz6Air8ieXi0lidohxoCahvi9maesheeshaihoo1aethus1deeg0uJaimaezoigaexiePohasux4FiLee6iithew5ue3woh8lo9chae0vei9Fae1eer3Pimohxe0chohn5aiku4sahgahfahduu6eexaeVoothi5zuPoophahy5feghoothim3wi6eetooSadeSh4Tha0Shei7iephub5Thei6HaiseiFee1iniePooshoru1ohl8sehi2quooxief2Eeci9phachiephuequeeGho5quo5de2raewau0ietaili1ahh2aikeeraghiethoThiTie3thihai2phaiwue1noeMain4faeGaiR1AeZaecee3ireesa8luuL0so8AQuu6Aiyumie1quonu0Shaezeojaefee0ungeeXee2EeTheir4yaekekeele6Oov2bue4ahk5voyetahMeen2bievoh2Ka3joe7tai1shee6gaRie5eesaifae5Meeleewae0ieghooh7LeeJ3Udeiquie0ahwolaeyeefohsh4Leequahsahchavanaleeb0NoofohF5aeCh6Wiep7Aiquoo2vuy6cae4eech6udah4feWiepu4xalaequ5Ashotohth9ie6za3coocei5siquolo9fie5xaiY3eiweeGhoghoer0leeB5ubucheish4saem9aezie7phu7tahboh7mieyooPh4ievu8ux8ahho4Se2biChiij3phe6ohciev9Vei6veithongahnahfohch3iuquahKiexeivoo2ieLou1chies5eiWah3eezah3ic0rai3xo9zi4oz8needoo9iapheihaech9Ue9IeY9ieri0lei9iya0aiTahmai1oiwoopuekeecaidoothoo2ii4WaithaeD6saeg4AhPie7gae8Eibeech6Iechaeghir7xezei2Oogai7lai9uu6keep9AG9yiju5Faiy2zebai7axo6EteeniereNg7pu2ajae2eighiec8phahy0teeb2ieCh7quodah1wie0xohS1aerahj8aghahh3soofaeJ7ijied1meezul1EeYo2vu5aengey0funooSookaipeing6auc9XiJe6anoZ0Ca8VuhaexeeTh3ohyohWohdu3eevohLie2eiYoo1oeGeequaeth4aecughoquiWis1aisaiphoz3uDahn1wah9Ohphoobi7to0iChowahCo4jo8gieR3goequ7jo2aeteob6eeph7hi1zoh7aithieB9tiiHodai9phao5eogi1te5va1kaeg0aegh0bae3za6hooroh7tuchahng7uwoon3ciy9aiD";
    unsigned long total;

    clock_t start;
    clock_t end;

    long double elapsed;

    total = 0;
    start = clock();

    for(int n = 0; n < 1000000; ++n) {
        total += strlen(str);
    }
    end = clock();

    elapsed = ((long double)(end - start)) / CLOCKS_PER_SEC;
    printf("strlen elapsed: %Lf %lu\n", elapsed, total);

    total = 0;
    start = clock();

    for(int n = 0; n < 1000000; ++n) {
        total += nstrlen(str);
    }
    end = clock();

    elapsed = ((long double)(end - start)) / CLOCKS_PER_SEC;
    printf("nstrlen elapsed: %Lf %lu\n", elapsed, total);

    total = 0;
    start = clock();

    for(int n = 0; n < 1000000; ++n) {
        total += mystrlen(str);
    }
    end = clock();

    elapsed = ((long double)(end - start)) / CLOCKS_PER_SEC;
    printf("mystrlen elapsed: %Lf %lu\n", elapsed, total);

    return 0;
}

gcc strlen_compare.c -o strlen_compare

strlen elapsed: 0.023253 2048000000
nstrlen elapsed: 0.844130 2048000000
mystrlen elapsed: 0.972135 2048000000

 

-O1

strlen elapsed: 0.000457 2048000000
nstrlen elapsed: 0.457448 2048000000
mystrlen elapsed: 0.000207 2048000000

 

-O2

strlen elapsed: 0.000002 2048000000
nstrlen elapsed: 0.457091 2048000000
mystrlen elapsed: 0.000102 2048000000

 

-O3

strlen elapsed: 0.000002 2048000000
nstrlen elapsed: 0.450693 2048000000
mystrlen elapsed: 0.000101 2048000000

 

Not sure why, but something odd seems to be happening with the non-native version of strlen as I increase the optimization level. The native version is clearly a lot faster than mystrlen though.

Remember to either quote or @mention others, so they are notified of your reply

Link to comment
Share on other sites

Link to post
Share on other sites

@Eigenvektor @dcgreen2k

 

Damn you guys went right down to assembly which I didn't expect. If you say my strlen is not faster, why am I getting it as faster on my machine?

Microsoft owns my soul.

 

Also, Dell is evil, but HP kinda nice.

Link to comment
Share on other sites

Link to post
Share on other sites

38 minutes ago, Eigenvektor said:

Not sure why, but something odd seems to be happening with the non-native version of strlen as I increase the optimization level. The native version is clearly a lot faster than mystrlen though.

It's fixed when compiling without optimizations, but the function calls are still getting optimized out with -O1 and up. Take a look at labels .L12 (strlen), .L13 (nstrlen), and .L14 (mystrlen) here: https://godbolt.org/z/bPhbK74Tc

 

The builtin strlen test gets precomputed by the compiler and the result of mystrlen just gets multiplied by 1000000. It's kind of funny how hard the compiler tries to optimize these kinds of things.

 

38 minutes ago, Eigenvektor said:

gcc strlen_compare.c -o strlen_compare

strlen elapsed: 0.023253 2048000000

nstrlen elapsed: 0.844130 2048000000
mystrlen elapsed: 0.972135 2048000000

This is closer to what I expected from the tests. The builtin strlen is super fast due to using AVX2, nstrlen is slower with its simple iterator-based approach, and mystrlen uses a simple index-based approach.

 

30 minutes ago, Gat Pelsinger said:

Damn you guys went right down to assembly which I didn't expect. If you say my strlen is not faster, why am I getting it as faster on my machine?

Are you running our versions of the testing code, where the calls to strlen are inside a loop? You won't get accurate results if you just run the functions once.

Computer engineering grad student, cybersecurity researcher, and hobbyist embedded systems developer

 

Daily Driver:

CPU: Ryzen 7 4800H | GPU: RTX 2060 | RAM: 16GB DDR4 3200MHz C16

 

Gaming PC:

CPU: Ryzen 5 5600X | GPU: EVGA RTX 2080Ti | RAM: 32GB DDR4 3200MHz C16

Link to comment
Share on other sites

Link to post
Share on other sites

31 minutes ago, dcgreen2k said:

The builtin strlen test gets precomputed by the compiler and the result of mystrlen just gets multiplied by 1000000. It's kind of funny how hard the compiler tries to optimize these kinds of things.

Thanks for confirming. I was thinking that this might be what is happening. So effectively the compiler sees that we're calling the same method with the same argument multiple times, so it knows the result doesn't change and just runs the addition instead.

 

It would probably make sense to dynamically generate a new random string in memory each iteration, rather than counting the same string on the stack.

Spoiler
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <string.h>

void rand_str(char *dest, size_t length) {
    char charset[] = "0123456789"
                     "abcdefghijklmnopqrstuvwxyz"
                     "ABCDEFGHIJKLMNOPQRSTUVWXYZ";

    while (length-- > 0) {
        size_t index = (double) rand() / RAND_MAX * (sizeof charset - 1);
        *dest++ = charset[index];
    }
    *dest = '\0';
}

int mystrlen(const char* str){
    int i = 0;
    while (str[i] != '\0'){
        i++;
    }
    return i;
}

size_t kstrlen(const char *s)
{
    const char *sc;

    for (sc = s; *sc != '\0'; ++sc)
        /* nothing */;
    return sc - s;
}

int main(void){
    char str[] = { [2049] = '\1' };

    unsigned long total;
    unsigned long count;

    clock_t start;
    clock_t end;
    clock_t time;

    long double elapsed;

    // "MINE"
    total = 0;
    time = 0;

    for(int n = 0; n < 1000; ++n) {
        rand_str(str, 2048);
    
        start = clock();
        count = mystrlen(str);
        end = clock();

        total += count;
        time += (end - start);
    }
    elapsed = ((long double)time) / CLOCKS_PER_SEC;
    printf("mystrlen elapsed: %Lf %lu\n", elapsed, total);

    // KERNEL
    total = 0;
    time = 0;

    for(int n = 0; n < 1000; ++n) {
        rand_str(str, 2048);
    
        start = clock();
        count = kstrlen(str);
        end = clock();

        total += count;
        time += (end - start);
    }
    elapsed = ((long double)time) / CLOCKS_PER_SEC;
    printf("kernel strlen elapsed: %Lf %lu\n", elapsed, total);

    // NATIVE
    total = 0;
    time = 0;

    for(int n = 0; n < 1000; ++n) {
        rand_str(str, 2048);
    
        start = clock();
        count = strlen(str);
        end = clock();

        total += count;
        time += (end - start);
    }
    elapsed = ((long double)time) / CLOCKS_PER_SEC;
    printf("native strlen elapsed: %Lf %lu\n", elapsed, total);

    return 0;
}

 

I haven't written C in 3 decades, so please double-check 😅

(I get that random probably makes no sense, since the algorithm really only cares about the 0 at the end)

 

gcc strlen_compare.c -o strlen_compare

mystrlen elapsed: 0.003031 2048000
kernel strlen elapsed: 0.002454 2048000
native strlen elapsed: 0.000744 2048000

 

gcc -O3 strlen_compare.c -o strlen_compare

mystrlen elapsed: 0.000940 2048000
kernel strlen elapsed: 0.001558 2048000
native strlen elapsed: 0.000712 2048000

Remember to either quote or @mention others, so they are notified of your reply

Link to comment
Share on other sites

Link to post
Share on other sites

37 minutes ago, Eigenvektor said:

I haven't written C in 3 decades, so please double-check 😅

Yes, the code you wrote works now. The results for the unoptimized code are what I expected, but it's interesting how mystrlen performs closer to the builtin implementation at -O3. Poking around in the generated assembly, it turns out the compiler inserts a call to the builtin strlen instead of using the simple while-loop.

 

I guess the code in mystrlen is a common enough pattern that the compiler knows it can replace that with the builtin implementation. I've heard of this being possible, but never seen it happen before.

Computer engineering grad student, cybersecurity researcher, and hobbyist embedded systems developer

 

Daily Driver:

CPU: Ryzen 7 4800H | GPU: RTX 2060 | RAM: 16GB DDR4 3200MHz C16

 

Gaming PC:

CPU: Ryzen 5 5600X | GPU: EVGA RTX 2080Ti | RAM: 32GB DDR4 3200MHz C16

Link to comment
Share on other sites

Link to post
Share on other sites

19 minutes ago, dcgreen2k said:

I guess the code in mystrlen is a common enough pattern that the compiler knows it can replace that with the builtin implementation. I've heard of this being possible, but never seen it happen before.

Interesting. So I suppose you'd either have to "mask" the code somehow to prevent the compiler from doing it, or disable that particular optimization if possible, to see its actual performance in this case.

Remember to either quote or @mention others, so they are notified of your reply

Link to comment
Share on other sites

Link to post
Share on other sites

What works for one doesn't work for all.

 

You have things like memory stuff that makes a difference and it can be hard to tell if what you are doing if quicker in specific situations vs other situations.

 

As a note as well when it comes to trying to measure the speed of a function, never run it just once (unless it's a super slow function).  I normally try targeting things that will take 5 second plus (looping) so that it hopefully balances itself out.

 

Another thing to note, as @Eigenvektor mentioned it's not right to use int for this (in terms of overall compatibility).

 

Currently if you create a 2.5 GiB char* object [yes I've dealt with this kind of thing before when dealing with a binary file].  If you were to run strlen on it, you would overflow the int type.

 

Even if you used uint, you would still have an issue on 64 bit systems when uint is defined as uint32 and someone creates a 4.7 GiB array.

 

 

 

Overall as well, you would want to also test the function in different orders...as running lets say strlen might cache some things then mystrlen might use part of the cache etc.

 

You should run the test with size_t though, to see whether or not your function starts slowing down when you introduce effectively an UINT64

3735928559 - Beware of the dead beef

Link to comment
Share on other sites

Link to post
Share on other sites

11 hours ago, dcgreen2k said:

@Eigenvektor @tikker just wanted to let you know that the call to (built-in) strlen in your testing code gets optimized away, even when compiling without optimizations. When compiling with -O1, all functions get optimized out except for the kernelstrlen function, strangely.

Well I'll be damned. strlen indeed doesn't even show up in my assembly output. That feels like something I should have noticed. I enjoy seeing the breakdown of it and trying to comprehend what magic the compliler does behind the scenes to your code. Even more interesting (or scary?) to see that the behaviour varies enough between compilers and optimisation levels that just for strlen it's not hard to find a bunch of questions about why it is fast or slow for people's code. Crazy how different the compiled code mayu end up looking from what you'd naievely expect from just reading the source.

Crystal: CPU: i7 7700K | Motherboard: Asus ROG Strix Z270F | RAM: GSkill 16 GB@3200MHz | GPU: Nvidia GTX 1080 Ti FE | Case: Corsair Crystal 570X (black) | PSU: EVGA Supernova G2 1000W | Monitor: Asus VG248QE 24"

Laptop: Dell XPS 13 9370 | CPU: i5 10510U | RAM: 16 GB

Server: CPU: i5 4690k | RAM: 16 GB | Case: Corsair Graphite 760T White | Storage: 19 TB

Link to comment
Share on other sites

Link to post
Share on other sites

Okay and here is why things don't always work the way you expect (someone can run this on their local system, I'm just using godbolt.org because I'm writing this on a system with a compiler)

 

strlen elapsed: 0.434618 9999000000000
mystrlen elapsed: 0.493594 9999000000000
strlen again elapsed: 0.596780 9999000000000
mystrlen again elapsed: 1.033423 9999000000000

 

It calls strlen, then mystrl, but then key is it recalls them again.  Notice how there can be wild fluctuations...although on a normal system it would be less so.

 

Play around with this a bit changing the CHARLEN and LOOPCOUNT macros...you will find that for some strlen is faster and for others mystrlen was faster.

 

Spoiler

#include <stdio.h>
#include <time.h>
#include <string.h>
#include <stdlib.h>

int mystrlen(const char* str){
    int i = 0;
    while (str[i] != '\0'){
        i++;
    }
    return i;
}

size_t nstrlen(const char *s)
{
    const char *sc;

    for (sc = s; *sc != '\0'; ++sc)
        /* nothing */;
    return sc - s;
}

#define CHARLEN 10000
#define LOOPCOUNT 1000000000
int main(void){
    char* str = (char*) malloc(CHARLEN);

    for(int i = 0; i < CHARLEN-1; i++) {
        str[i] = (char)('a' + i%26);
    }
    str[CHARLEN - 1] = 0;
    unsigned long total;

    clock_t start;
    clock_t end;

    long double elapsed;

    total = 0;
    start = clock();

    for(unsigned long n = 0; n < LOOPCOUNT; ++n) {
        total += strlen(str);
    }
    end = clock();

    elapsed = ((long double)(end - start)) / CLOCKS_PER_SEC;
    printf("strlen elapsed: %Lf %lu\n", elapsed, total);


    total = 0;
    start = clock();

    for(unsigned long n = 0; n < LOOPCOUNT; ++n) {
        total += mystrlen(str);
    }
    end = clock();

    elapsed = ((long double)(end - start)) / CLOCKS_PER_SEC;
    printf("mystrlen elapsed: %Lf %lu\n", elapsed, total);

    

    total = 0;
    start = clock();

    for(unsigned long n = 0; n < LOOPCOUNT; ++n) {
        total += strlen(str);
    }
    end = clock();

    elapsed = ((long double)(end - start)) / CLOCKS_PER_SEC;
    printf("strlen again elapsed: %Lf %lu\n", elapsed, total);

    total = 0;
    start = clock();

    for(unsigned long n = 0; n < LOOPCOUNT; ++n) {
        total += mystrlen(str);
    }
    end = clock();

    elapsed = ((long double)(end - start)) / CLOCKS_PER_SEC;
    printf("mystrlen again elapsed: %Lf %lu\n", elapsed, total);

    return 0;
}

 

 

3735928559 - Beware of the dead beef

Link to comment
Share on other sites

Link to post
Share on other sites

47 minutes ago, wanderingfool2 said:

someone can run this on their local system

This is the result of gcc -O3:

strlen elapsed: 0.000003 9999000000000
mystrlen elapsed: 38.421469 9999000000000
strlen again elapsed: 0.000000 9999000000000
mystrlen again elapsed: 38.380904 9999000000000

My guess would be that the call to strlen is optimized away again, because the compiler sees it's measuring the same constant expression multiple times, so it does the measure-once-and-multiply optimization again.

 

Did some further tweaking to make it easier to repeat runs and hopefully avoid this kind of optimization:

Spoiler
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <string.h>

#define CHARLEN 10000
#define LOOPCOUNT 100000

size_t mystrlen(const char* str)
{
    int i = 0;
    while (str[i] != '\0'){
        i++;
    }
    return i;
}

size_t kstrlen(const char *s)
{
    const char *sc;

    for (sc = s; *sc != '\0'; ++sc)
        /* nothing */;
    return sc - s;
}

void str_gen(char *dest, size_t length, int offset)
{
    char charset[] = "0123456789"
                     "abcdefghijklmnopqrstuvwxyz"
                     "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    size_t len = sizeof charset - 1;
    size_t index = offset % len;

    while (length-- > 0) {
        *dest++ = charset[index++];
        index %= len;
    }
    *dest = '\0';
}

void measure(char* name, int run, size_t (*funPtr)(const char*)) {
    clock_t start;
    clock_t end;
    clock_t time = 0;

    unsigned long total = 0;
    unsigned long count;

    char str[] = { [CHARLEN] = '\1' };

    for(int n = 0; n < LOOPCOUNT; ++n) {
        str_gen(str, CHARLEN, total);

        start = clock();
        count = (*funPtr)(str);
        end = clock();

        total += count;
        time += (end - start);
    }

    long double elapsed = ((long double)time) / CLOCKS_PER_SEC;
    printf("%s #%d count %lu took: %Lf\n", name, run, total, elapsed);
}

int main(void)
{
    for(int run = 0; run < 3; ++run)
    {
        measure("strlen", run, &strlen);
        measure("kstrlen", run, &kstrlen);
        measure("mystrlen", run, &mystrlen);
    }

    return 0;
}

strlen #0 count 1000000000 took: 0.045912
kstrlen #0 count 1000000000 took: 0.460489
mystrlen #0 count 1000000000 took: 0.045166
strlen #1 count 1000000000 took: 0.045310
kstrlen #1 count 1000000000 took: 0.460203
mystrlen #1 count 1000000000 took: 0.045258
strlen #2 count 1000000000 took: 0.045375
kstrlen #2 count 1000000000 took: 0.460215
mystrlen #2 count 1000000000 took: 0.045255

 

Remember to either quote or @mention others, so they are notified of your reply

Link to comment
Share on other sites

Link to post
Share on other sites

1 hour ago, Eigenvektor said:

My guess would be that the call to strlen is optimized away again, because the compiler sees it's measuring the same constant expression multiple times, so it does the measure-once-and-multiply optimization again.

 

Did some further tweaking to make it easier to repeat runs and hopefully avoid this kind of optimization:

Using function pointers to do the testing indeed prevents the compiler from optimizing away the function calls. Nice catch! Again, the compiler replaces the code inside mystrlen (but not kstrlen) with a call to the builtin strlen after -O2.

 

image.png.a946a593fc0717f1c4edef4b0da7d28f.png

Computer engineering grad student, cybersecurity researcher, and hobbyist embedded systems developer

 

Daily Driver:

CPU: Ryzen 7 4800H | GPU: RTX 2060 | RAM: 16GB DDR4 3200MHz C16

 

Gaming PC:

CPU: Ryzen 5 5600X | GPU: EVGA RTX 2080Ti | RAM: 32GB DDR4 3200MHz C16

Link to comment
Share on other sites

Link to post
Share on other sites

1 hour ago, dcgreen2k said:

Again, the compiler replaces the code inside mystrlen (but not kstrlen) with a call to the builtin strlen after -O2.

Right. Dang. I toyed around a little and adding an empty asm("") into the while-loop appears to prevent it from doing that. This is the result:

strlen   #0 count 1000000000 took: 0.046596
kstrlen  #0 count 1000000000 took: 0.255741
mystrlen #0 count 1000000000 took: 0.291593
strlen   #1 count 1000000000 took: 0.046283
kstrlen  #1 count 1000000000 took: 0.255478
mystrlen #1 count 1000000000 took: 0.287263
strlen   #2 count 1000000000 took: 0.045174
kstrlen  #2 count 1000000000 took: 0.250946
mystrlen #2 count 1000000000 took: 0.285842

 

Remember to either quote or @mention others, so they are notified of your reply

Link to comment
Share on other sites

Link to post
Share on other sites

21 hours ago, wanderingfool2 said:

Okay and here is why things don't always work the way you expect (someone can run this on their local system, I'm just using godbolt.org because I'm writing this on a system with a compiler)

 

strlen elapsed: 0.434618 9999000000000
mystrlen elapsed: 0.493594 9999000000000
strlen again elapsed: 0.596780 9999000000000
mystrlen again elapsed: 1.033423 9999000000000

 

It calls strlen, then mystrl, but then key is it recalls them again.  Notice how there can be wild fluctuations...although on a normal system it would be less so.

 

Play around with this a bit changing the CHARLEN and LOOPCOUNT macros...you will find that for some strlen is faster and for others mystrlen was faster.

 

  Reveal hidden contents
#include <stdio.h>
#include <time.h>
#include <string.h>
#include <stdlib.h>

int mystrlen(const char* str){
    int i = 0;
    while (str[i] != '\0'){
        i++;
    }
    return i;
}

size_t nstrlen(const char *s)
{
    const char *sc;

    for (sc = s; *sc != '\0'; ++sc)
        /* nothing */;
    return sc - s;
}

#define CHARLEN 10000
#define LOOPCOUNT 1000000000
int main(void){
    char* str = (char*) malloc(CHARLEN);

    for(int i = 0; i < CHARLEN-1; i++) {
        str[i] = (char)('a' + i%26);
    }
    str[CHARLEN - 1] = 0;
    unsigned long total;

    clock_t start;
    clock_t end;

    long double elapsed;

    total = 0;
    start = clock();

    for(unsigned long n = 0; n < LOOPCOUNT; ++n) {
        total += strlen(str);
    }
    end = clock();

    elapsed = ((long double)(end - start)) / CLOCKS_PER_SEC;
    printf("strlen elapsed: %Lf %lu\n", elapsed, total);


    total = 0;
    start = clock();

    for(unsigned long n = 0; n < LOOPCOUNT; ++n) {
        total += mystrlen(str);
    }
    end = clock();

    elapsed = ((long double)(end - start)) / CLOCKS_PER_SEC;
    printf("mystrlen elapsed: %Lf %lu\n", elapsed, total);

    

    total = 0;
    start = clock();

    for(unsigned long n = 0; n < LOOPCOUNT; ++n) {
        total += strlen(str);
    }
    end = clock();

    elapsed = ((long double)(end - start)) / CLOCKS_PER_SEC;
    printf("strlen again elapsed: %Lf %lu\n", elapsed, total);

    total = 0;
    start = clock();

    for(unsigned long n = 0; n < LOOPCOUNT; ++n) {
        total += mystrlen(str);
    }
    end = clock();

    elapsed = ((long double)(end - start)) / CLOCKS_PER_SEC;
    printf("mystrlen again elapsed: %Lf %lu\n", elapsed, total);

    return 0;
}

 

 

To elaborate, the likely reason this happens is that the operating system may schedule execution of other processes in the middle of your strlen call, varying the time required. Generally tests like this are more relevant over thousands or millions of executions where you might see a relevant change in the average time.

 

As pointed out by others @Gat Pelsinger, never assume you're smarter than the compiler unless you've analyzed the resulting binary.

 

Also, "int" is not the same size on all systems. using size_t makes it so that the function will work on any system, on strings as long as the system allows.

On 1/19/2024 at 9:33 PM, Gat Pelsinger said:

Why use a new pointer, wasting CPU clocks and memory

Your code also declares a new variable (i) for that matter.

On 1/19/2024 at 9:33 PM, Gat Pelsinger said:

using a for loop instead of a while (although I won't talk about it as I don't quite know which one is faster, but literally microseconds we are talking about and this does not scale)

On any decent compiler, a for loop and a while loop doing the same thing will result in the same exact binary. It's just syntactic sugar.

On 1/19/2024 at 9:33 PM, Gat Pelsinger said:

A WHOLE SUBTRACTION. That's like around 2 more CPU cycles and around a whopping 2 to 4 BYTES of memory. Sheesh, that's a lot of unoptimization I smelling.

As mentioned this all gets optimized away by the compiler but, if it weren't, consider that every time you use

str[i]

you're doing an addition.

Don't ask to ask, just ask... please 🤨

sudo chmod -R 000 /*

Link to comment
Share on other sites

Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

×