Everything posted by dcgreen2k

  1. I can't say too much about the Windows API, but to answer the question in your title: Yes, libraries written in C++ generally will not work with C code (unless you do things like writing wrappers for the C++ code). In contrast, most "plain" C code can be used in C++ just fine. There are some C features/syntax that aren't valid in C++ though: https://en.wikipedia.org/wiki/Compatibility_of_C_and_C%2B%2B If you want to learn C++, go ahead. I certainly find C++ nicer to use than C for building larger programs, although there's a lot of added complexity for someone new-ish to programming. It also has a much larger standard library compared to C (hooray, generic data structures). As for the Rust vs. C++ argument, it's really up to you to decide. Rust is newer, popular, and has builtin memory safety while C++ is a lot more mature. C++ is my choice because of how common it is in my fields of interest, but I'll get around to learning Rust too someday.
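One concrete example from that compatibility list: C allows an implicit conversion from void * to other pointer types, while C++ requires an explicit cast. The snippet below (my own illustration, not from the thread) compiles as C but is rejected by a C++ compiler:

```c
#include <stdlib.h>
#include <string.h>

/* Valid C, invalid C++: malloc returns void*, and C silently converts
 * it to int*. A C++ compiler would demand (int *)malloc(...). */
int *make_buffer(size_t n)
{
    int *p = malloc(n * sizeof *p); /* implicit void* -> int* is C-only */
    if (p)
        memset(p, 0, n * sizeof *p);
    return p;
}
```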
  2. The actual code that gets generated appears to be mostly the same, aside from some differences in how they name things and manage the stack. The lines with INCLUDELIB in the MSVC version are likely directives for the linker. LIBCMT is just the system's C library and OLDNAMES seems to be a compatibility layer for Windows: https://devblogs.microsoft.com/oldnewthing/20200730-00/?p=104021
  3. Memory error checkers are used very often, and UBSan seems to be the most popular currently. I believe the usefulness of those tools is somewhat tied to how good the project's architecture and testing practices are. If you thoroughly test a module as soon as you write it, it's easy to pinpoint errors and fix them before they propagate to later parts of the project. On the other hand, if you wait to write any tests, have sparse tests, or write spaghetti code, then going through a list of errors, fixing old code, and making sure nothing breaks is going to be a painful process.
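To illustrate the "test a module as soon as you write it" idea, here's a minimal sketch (the clamp function and its tests are hypothetical, not from the thread):

```c
#include <assert.h>

/* Hypothetical module under test: clamp v into the range [lo, hi]. */
int clamp(int v, int lo, int hi)
{
    if (v < lo) return lo;
    if (v > hi) return hi;
    return v;
}

/* Testing the module right after writing it keeps errors local:
 * if one of these asserts fires, the bug can only be in clamp(). */
void test_clamp(void)
{
    assert(clamp(5, 0, 10) == 5);   /* in range: unchanged */
    assert(clamp(-3, 0, 10) == 0);  /* below range: clamped to lo */
    assert(clamp(42, 0, 10) == 10); /* above range: clamped to hi */
}
```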
  4. I agree - there are tools to help find and debug these kinds of errors like valgrind and ubsan, but it's better to use a language that prevents them entirely for important applications. I remember one of my cybersecurity professors mentioning how he switched most of his department's new developments to Go for this reason. That certainly seems like what's going to happen. I know embedded systems development has an incredibly high concentration of C and C++ code due to speed and memory constraints, but there's been growing interest in Rust toolchains for the popular MCUs.
  5. Can confirm. When you get down to the level of C and C++, all accesses to heap memory and things like passing variables by reference require pointers. If you go down a level further to assembly, ALL memory accesses require pointers, both to the stack and the heap. Languages like Java and Python still use pointers, they just do a good job of hiding it from you.
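A quick C illustration of both cases mentioned above (function names are mine):

```c
#include <stdlib.h>

/* "Pass by reference" in C is explicit: you pass a pointer and
 * dereference it to modify the caller's variable. */
void increment(int *value)
{
    *value += 1;
}

/* Heap access also always goes through a pointer: malloc hands back
 * an address, and every use of the object dereferences it. */
int *make_counter(void)
{
    int *counter = malloc(sizeof *counter);
    if (counter)
        *counter = 0;
    return counter;
}
```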
  6. I've been using a Logitech G303 for the past 8 years, and it started double-clicking and having trouble registering a held-down button about 3 years ago. The solution is to replace the switches or take them apart and clean them.
  7. That comma between x and y doesn't get interpreted as a dictionary entry separator because it's part of the lambda expression. The interpreter is smarter than just looking for the first comma when parsing something like this. Instead, the lambda is a sub-expression, and any special syntax inside it, like those commas, is treated as part of the lambda's syntax rather than the dictionary's. Only when the parser finishes the lambda will a comma be treated as part of the dictionary's syntax again.
  8. I always use virtual environments for Python projects. It's really just a question of whether you want to install libraries for your current project only or for the entire system. In my experience, they're most useful for managing different versions of libraries. For example, I recently started working on a project that required an older version of a library that I was already using for some other programs. Instead of installing the older version system-wide and potentially having version conflicts, you can just install it in a virtual environment for the one project. In cases like this, the developers will include a requirements.txt file which contains a list of libraries to use and their version numbers. This makes it super easy to get a new virtual environment up and running with just a few commands.
  9. I completely agree with this. I can count on one hand the number of times I've had to use these Windows-specific typedefs on a project, and all of them were times I was writing a program meant to run only on Windows. If you're familiar with the important aspects of programming, picking up these kinds of things whenever you need them is easy. The same goes for optimizing code. Nobody cares how fast the code runs at first, only that it's correct. If at some point you figure out that it runs too slowly, then you can optimize it; otherwise it's wasted effort most of the time. Reminds me of when I implemented bubble sort in a final project for my data structures and algorithms class. Was it fast? No, but it didn't matter because I was only sorting 10 elements. Was it quick to write? Yes, and that was great considering the time crunch.
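For reference, the whole of bubble sort is only a handful of lines, which is exactly why it's so quick to write when performance doesn't matter:

```c
/* Bubble sort: O(n^2) comparisons, but trivial to write and verify.
 * Perfectly fine for ~10 elements. */
void bubble_sort(int *a, int n)
{
    for (int i = 0; i < n - 1; i++) {
        /* After pass i, the largest i+1 elements are in place,
         * so each pass can stop one element earlier. */
        for (int j = 0; j < n - 1 - i; j++) {
            if (a[j] > a[j + 1]) {
                int tmp = a[j];
                a[j] = a[j + 1];
                a[j + 1] = tmp;
            }
        }
    }
}
```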
  10. Let's only look at where the loops appear in the assembly code, since that's what takes up the majority of the execution time. I'll even annotate the assembly to make it easier to understand.
      func1 (index-based):
      .L5:
          addl $1, %ebx        ; Increment the loop counter (ebx is a 32-bit register, we're using 32-bit ints)
      .L4:
          movq str(%rip), %rdx ; Copy the base address of the string into rdx
          movslq %ebx, %rax    ; Extend the loop counter from 32 to 64 bits via sign extension
          addq %rdx, %rax      ; Add the loop counter to the string's base address.
                               ; Now rax contains the address of the next character to check
          movzbl (%rax), %eax  ; Dereference that address and place the resulting character in eax
          testb %al, %al       ; Check if the lowest 8 bits (the char) of eax are 0 (null)
          jne .L5              ; If the char wasn't the null byte, continue the loop
      func2 (iterator-based):
      .L9:
          addq $1, %rbx        ; Increment the pointer stored in rbx (64 bits)
      .L8:                     ; Now rbx contains the address of the next character to test
          movzbl (%rbx), %eax  ; Copy the char at that address into eax
          testb %al, %al       ; Check if the lowest 8 bits are 0 (null)
          jne .L9              ; If it wasn't the null byte, continue the loop
      The only meaningful difference between these two bits of code is how the index-based approach calculates the address of the next character to read. func1 must fetch the address of the string, sign-extend the loop counter (because ints are 32-bit while memory addresses are 64-bit), then add the loop counter to the base address. In contrast, func2 already has the address of the previous character stored in a register; all it needs to do to get the next address is add 1 to that register. In short, the pointer approach is faster because it doesn't need to recalculate the address from scratch every iteration. That only saves 3 instructions, which isn't much.
      One more thing I want to note: have a look at the instruction after labels .L5 and .L9 in the code above. These correspond to the increment step of the for loops. For both approaches, the increment is performed by a single addition instruction, and adding to a 32-bit number takes the same amount of time as adding to a 64-bit number. Lastly, it confirms @Eigenvektor's note that pointers are just numbers when you get down to this level of code.
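The C source for func1 and func2 isn't quoted above, but a minimal reconstruction consistent with that assembly (a global str, a 32-bit int index in func1, a roving pointer in func2) would look something like this:

```c
/* Global string, matching the str(%rip) access in the assembly.
 * The contents here are just a stand-in. */
const char *str = "hello";

/* Index-based: the address str + i is recomputed every iteration. */
int func1(void)
{
    int i = 0;
    while (str[i] != '\0')
        i++;
    return i;
}

/* Pointer (iterator)-based: only one pointer gets bumped each time. */
int func2(void)
{
    const char *p = str;
    while (*p != '\0')
        p++;
    return (int)(p - str);
}
```

Both compute the same length; only the addressing pattern differs, which is exactly what shows up in the generated loops.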
  11. Yes. I was able to figure out what code the builtin strlen was using with GDB and some searching on the internet. GDB is a debugger that lets you view the program's assembly code while you step through it. When I stepped into the call to strlen, GDB reported this: This means the code actually called a function named __strlen_avx2. Searching online, I found the source code here: https://github.com/bminor/glibc/blob/master/sysdeps/x86_64/multiarch/strlen-avx2.S It's an implementation of strlen optimized with AVX2 instructions, and it's all handwritten assembly code. When I tested it out, this function only executed 200 instructions to find the length of a 2048-character string. The "regular" versions of strlen would have taken many thousands of executed instructions to do the same. Also, the code for strlen isn't placed into your program because it's dynamically linked. This means the operating system stores its own copy of the function and your program simply jumps to it whenever it's needed. This also means the operating system can replace a call to "plain" strlen with an optimized version of it - in this case one using AVX2.
  12. The test for the builtin strlen function runs so fast because the function call gets optimized away. The compiler knows you aren't doing anything useful with strlen's return value, so it's free to do whatever it likes in this case. This code in the C source file:
      clock_gettime(CLOCK_MONOTONIC, &start);
      for (int i = 0; i < iteration; i++)
          strlen(str);
      clock_gettime(CLOCK_MONOTONIC, &end);
      corresponds to this code in the assembly output, the first loop after main begins:
      call clock_gettime
      movl $0, -4(%rbp)
      jmp .L24
      .L25:
      incl -4(%rbp)
      .L24:
      cmpl $99, -4(%rbp)
      jle .L25
      leaq -64(%rbp), %rax
      movq %rax, %rdx
      movl $1, %ecx
      call clock_gettime
      The for loop corresponds to the instruction after .L25 and the two after .L24. As you can see, there's no call to strlen in there. To fix this, you need to prevent the compiler from optimizing away any function calls. One way to do this, as we've seen in previous posts, is to take a pointer to the function you want to test and pass it to another function that runs the benchmark. Here's an example I adapted from your code above as well as @Eigenvektor's previous benchmarking code.
  13. Handbrake is my personal favorite software for video transcoding and compression https://handbrake.fr/
  14. It's normal for new electronics to smell like chemicals, especially once they heat up. I bought a Corsair PSU a while back and it smelled terrible while gaming, but it went away after about a week.
  15. My apologies for not being as active this week, I've been very busy with university work lately. Quite the opposite, actually. My comment that it was a large program referred to the amount of source code, of which there are over 5000 lines. The code for this simulator isn't just "Read a line of assembly code, then do something based on what it says". It's more like:
      • Read the assembly source file and tokenize it (move away from a string representation as soon as possible)
      • Parse the token representation to see if it's syntactically correct
      • Turn the tokens into an easily interpretable list of instructions
      • Associate source code labels with instructions and variables
      • Populate the simulator with the instructions and variables in data memory
      • Take user input to either single-step through the code or run continuously
      In short, there's a boatload of setup to do turning source code into something easily interpretable. This is so that we can check whether the program is valid and minimize the amount of additional processing the CPU simulator has to do. It also means there aren't too many data structures that live until that last step. I suspect that a significant portion of the 113MB of RAM being used in my previous screenshot is due to the GUI library I'm using, Qt. Thankfully, I also made a command-line-only version of the program, so we can check that pretty easily. The command line version uses just 3328 bytes of RAM for its data, which I find quite surprising. To check this, I went through the code and looked for all the data structures that live to that last step.
      The UI controller takes up 768 bytes on the stack:
      • Lots of small variables to control the program and processing thread
      • 1 std::queue to pass messages to the processing thread
      • Some input/output stream pointers to direct text output wherever the user wants
      The CPU simulator takes a minimum of 1620 bytes, of which 596 are on the stack:
      • 2 std::arrays, 1 for the register file and another for a jump table
      • 2 std::vectors, 1 for instruction memory and the other for 1024 bytes of data memory
      • 2 std::unordered_maps for associating labels with instructions and the addresses of variables in data memory
      • A couple of small variables for bookkeeping
      As you can see, the program is surprisingly lightweight, which shows that the high RAM usage was indeed due to the GUI library I was using. Only 5 of the data structures are allocated on the heap, those being the queue, the vectors, and the unordered_maps. Out of these, only the queue may grow in size after the setup is finished. The size of the compiled code is around 160kB, so now there's much less data compared to the instructions.
      ------------------
      Yes, stack-allocated variables are "freed" automatically when they go out of scope. This is why they're sometimes also called automatic variables. I put "freed" in quotes here because there's no call to free() like there is for heap-allocated memory; rather, the space they take up is left as-is and overwritten by later code. The way this works is that, when you enter a function, there's an instruction that grows the program stack by an amount dependent on how much space that function needs for its variables. The function is then able to use this new area of the stack (called a stack frame) however it likes. When the function exits, it shrinks the stack by the same amount, leaving it in the same state it was in before the function was called.
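A small C example of automatic variables in action:

```c
/* Automatic (stack) variables are reclaimed when their scope exits:
 * there's no free() call, the stack pointer is simply moved back. */
int sum_of_squares(int n)
{
    int total = 0;              /* lives in this call's stack frame */
    for (int i = 1; i <= n; i++) {
        int square = i * i;     /* created fresh each iteration */
        total += square;
    }                           /* square's slot is free for reuse here */
    return total;               /* the whole frame is released on return */
}
```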
  16. One simple way to switch turns is to keep a turn counter - if it's even then it's player 1's turn, and if it's odd then it's player 2's turn. It kind of looks like that's what you were going for. What's the purpose of first_turn_counter? You declare it as an array of 100 ints but access one element past the end, which is out of bounds. Remember that arrays start at index 0 in C, so your valid indices are 0 through 99. I think it would be easiest to have your loop variable be the turn counter. Something like this:
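The snippet attached to the original post isn't included in this dump; a minimal sketch of the even/odd idea (function and variable names are hypothetical) would be:

```c
/* Hypothetical sketch: use the loop variable itself as the turn counter. */
int current_player(int turn)
{
    /* Even turns belong to player 1, odd turns to player 2. */
    return (turn % 2 == 0) ? 1 : 2;
}

/* Typical usage inside the game loop:
 *     for (int turn = 0; !game_over; turn++)
 *         take_turn(current_player(turn));
 */
```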
  17. At O2 and O3, the compiler optimizes those function calls away, like you saw before. If we instead pass those functions into a benchmarking function via their pointers, like so:
      measure("avx2", array1, array2, &avx2);   // &avx2 is a pointer to the avx2 function
      measure("plain", array1, array2, &plain); // Same with &plain
      we can prevent the compiler from removing the function calls. This lets us get useful benchmarking results.
  18. Also, here's some example code for baremetal x86 programs: https://github.com/cirosantilli/x86-bare-metal-examples
  19. The kind of programming you're talking about is called baremetal programming, since there's no OS providing resources, builtin functions, etc. for your code to use. There's a fair bit of setup you need to do before you can get even simple code running. I'll note that I don't have much experience in this area, but you will have to define interrupt handlers + an interrupt descriptor table and implement functions for each syscall yourself. For the syscalls, you would write code that directly interfaces with hardware. (Example below) Here are some useful links: https://wiki.osdev.org/Interrupts https://wiki.osdev.org/Interrupts_tutorial https://filippo.io/linux-syscall-table/ Most of the tutorials on the OSDev Wiki require using both assembly and C. You will also need to create a simple kernel to get your code running, if you decide to go this route. Here's the tutorial for that: https://wiki.osdev.org/Bare_Bones ----------- (Example) Back when I was learning this kind of thing, I remember writing an implementation of puts() for my basic kernel's C library. All it did was print a string to the screen while in VGA text mode. To do this, you write characters directly into memory at address 0xB8000 which is the VGA text buffer. This is the lowest level of code needed to write characters to the screen, but after this you have a nice C function to use wherever you'd like.
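A sketch of that puts() implementation: on real hardware the VGA text buffer sits at physical address 0xB8000, but here the buffer address is taken as a parameter so the code can run (and be tested) as an ordinary process. The attribute byte 0x07 means light grey on black:

```c
#include <stdint.h>
#include <stddef.h>

/* VGA text mode: each screen cell is 2 bytes, the ASCII character in
 * the low byte and an attribute byte in the high byte. In a real
 * kernel you'd pass (volatile uint16_t *)0xB8000 as vga_buffer. */
size_t vga_puts(volatile uint16_t *vga_buffer, const char *s)
{
    size_t i = 0;
    while (s[i] != '\0') {
        vga_buffer[i] = (uint16_t)((0x07 << 8) | (uint8_t)s[i]);
        i++;
    }
    return i; /* number of characters written */
}
```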
  20. The only way for us to know if this is truly the case is by looking at the assembly code generated by the compiler. Please post your code.
  21. Using function pointers to do the testing indeed prevents the compiler from optimizing away the function calls. Nice catch! Again, the compiler replaces the code inside mystrlen (but not kstrlen) with a call to the builtin strlen after -O2.
  22. Yes, the code you wrote works now. The results for the unoptimized code are what I expected, but it's interesting how mystrlen performs closer to the builtin implementation at -O3. Poking around in the generated assembly, it turns out the compiler inserts a call to the builtin strlen instead of using the simple while-loop. I guess the code in mystrlen is a common enough pattern that the compiler knows it can replace that with the builtin implementation. I've heard of this being possible, but never seen it happen before.
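For reference, mystrlen itself isn't quoted in this dump, but the kind of index-based loop the post describes, simple enough for the compiler's idiom recognition to swap in the builtin strlen at -O2/-O3, looks like this:

```c
#include <stddef.h>

/* A plain index-based strlen. GCC can recognize this whole loop as
 * the strlen idiom and replace it with a call to the library version. */
size_t mystrlen(const char *s)
{
    size_t i = 0;
    while (s[i] != '\0')
        i++;
    return i;
}
```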
  23. It's fixed when compiling without optimizations, but the function calls are still getting optimized out with -O1 and up. Take a look at labels .L12 (strlen), .L13 (nstrlen), and .L14 (mystrlen) here: https://godbolt.org/z/bPhbK74Tc The builtin strlen test gets precomputed by the compiler and the result of mystrlen just gets multiplied by 1000000. It's kind of funny how hard the compiler tries to optimize these kinds of things. This is closer to what I expected from the tests. The builtin strlen is super fast due to using AVX2, nstrlen is slower with its simple iterator-based approach, and mystrlen uses a simple index-based approach. Are you running our versions of the testing code, where the calls to strlen are inside a loop? You won't get accurate results if you just run the functions once.
  24. I modified the testing code with some inline assembly so that the builtin strlen function would actually get called. The assembly is designed to match the compiler output for the other loops, and the code should be compiled with optimizations off. Otherwise, it'll segfault. Its assembly can be seen here: https://godbolt.org/z/zzTE4o6Ez Running the new tests, we get this. Again, these results are with optimizations turned off. Even after ensuring we call the function directly, the builtin strlen test runs incredibly quickly. What's going on? To figure this out, I stepped through the testing code in GDB until I got to the strlen call. Stepping into strlen, GDB reported this: Looking online, the source is available here: https://github.com/bminor/glibc/blob/master/sysdeps/x86_64/multiarch/strlen-avx2.S So the strlen function that's getting called isn't the simple version we found earlier, it's all handwritten AVX2 assembly. At this point, I'm looking at the wall of assembly code and the elapsed time for the builtin strlen test and I don't really believe it. So I decided to step into __strlen_avx2 in GDB and manually count how many times I have to hit enter before it returns. For a 2048-byte string, only 200 instructions were executed with 16 iterations of its loop. Comparing the AVX2 version's 0.023636s with 0.519933s (with -O1) for the kernel strlen we found earlier, there's a 22x speedup. Knowing that the strlen we found earlier iterates over every character and executes 3 instructions per loop, these seem like reasonable results. ----------------- Back to the original question, the "regular" builtin version of strlen is in fact slightly faster than your own version despite it appearing more complex. This is due to it being iterator-based compared to your index-based approach. With the index-based approach, the program needs to add the index to the array's base address, then dereference the resulting memory address to get the current character. 
In contrast, the iterator-based approach only needs to dereference the pointer it hangs onto. I will stress that this difference is a serious micro-optimization and saves only a single assembly instruction when optimizations are turned on. There are better ways to make your program faster.
  25. I had a look at the generated assembly, and the compiler didn't inline any of the functions. The relevant function calls just aren't there, because the compiler knows we aren't doing anything with the return values. Take a look at the instructions immediately after labels .L12, .L13, .L14, and .L15 here: https://godbolt.org/z/4f156drWK .L15 corresponds to the empty loop and only subtracts 1 from the loop counter before jumping back up. .L12 and .L14 correspond to the builtin strlen and mystrlen tests and show the same instructions as the empty loop. .L13 is the loop for the kernelstrlen test and is the only one to keep the function call. I'm not sure why this is the case.