Trying to learn SIMD programming in C.

Gat Pelsinger

There are so many SIMD instruction sets. Which one should I use, and why are there so many? I was reading this article, which used VMX, which I had never heard of before: http://ftp.cvut.cz/kernel/people/geoff/cell/ps3-linux-docs/CellProgrammingTutorial/BasicsOfSIMDProgramming.html.


7 minutes ago, Gat Pelsinger said:

Which one to use

The one your target supports.

 

7 minutes ago, Gat Pelsinger said:

I was reading this article, which used VMX, which I had never heard of before

If you look, this doc is specifically about programming the PS3, which has a pretty unique architecture.
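If you need to check at runtime what the machine you're on actually supports, GCC and Clang have builtins for that. A minimal sketch (assuming GCC or Clang on x86; not from the article):

#include <stdio.h>

int main(void)
{
    // __builtin_cpu_supports() queries the running CPU's feature flags.
    if (__builtin_cpu_supports("avx2"))
        printf("use the AVX2 path\n");
    else if (__builtin_cpu_supports("sse2"))
        printf("use the SSE2 path\n");
    else
        printf("scalar fallback\n");
    return 0;
}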



29 minutes ago, Gat Pelsinger said:

@Kilrah

 

My CPU supports a lot of SIMD instructions, as it should. Which one to use? This is a good guide - https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html. Heck, it's even built into C libraries!

What do you mean by this? You use the instruction that does what you want to do.

¯\_(ツ)_/¯

 

 



On 1/20/2024 at 10:43 AM, Gat Pelsinger said:

My CPU supports a lot of SIMD instructions, as it should. Which one to use? This is a good guide - https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html. Heck, it's even built into C libraries!

Your question is a bit like "there are so many fields of mathematics, which one should I learn?", so as the others said, you use the one that addresses your needs. The number one question you need to answer first is what problem you are trying to solve. Then you can figure out which, or what kind of, SIMD operation you are after.

 

On 1/20/2024 at 10:02 AM, Gat Pelsinger said:

why are there so many?

As you can see from that page, each intrinsic exists for a specific task. Why so many? I would guess it is a combination of 1) it being nice to have a convenience function for a complex operation and 2) the engineers behind it knowing an optimal (or the optimal) way to do that operation on their CPU. Instead of having the compiler or programmer try to figure something out, you can leverage that instruction directly if you know that it does exactly what you are after.
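To make that concrete, here is a little sketch (my own, not from the Intel guide) of what "one instruction, multiple data" buys you: a single AVX add does the work of four scalar double additions.

#include <immintrin.h>

// Compile with e.g. -mavx. One vector add replaces four scalar adds.
void add4(const double *x, const double *y, double *out)
{
    __m256d vx = _mm256_loadu_pd(x);              // load 4 doubles
    __m256d vy = _mm256_loadu_pd(y);
    _mm256_storeu_pd(out, _mm256_add_pd(vx, vy)); // 4 additions at once
}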



@Kilrah @BobVonBob @tikker

 

So I spent some time, but I am completely overwhelmed by how much complexity there is and how MANY functions there are in the Intel intrinsics. I don't really have the time to learn all of this right now, I just want to create a fun little program that speeds up a computation. I just want some brief information on how to allocate my variables (using aligned_malloc, I guess) and which functions I need to use to simply add 2 arrays.

 

Edit - btw, I am using AVX2.



3 hours ago, Gat Pelsinger said:

@Kilrah @BobVonBob @tikker

 

So I spent some time, but I am completely overwhelmed by how much complexity there is and how MANY functions there are in the Intel intrinsics. I don't really have the time to learn all of this right now, I just want to create a fun little program that speeds up a computation. I just want some brief information on how to allocate my variables (using aligned_malloc, I guess) and which functions I need to use to simply add 2 arrays.

 

Edit - btw, I am using AVX2.

I've never done such low-level programming myself, so I can't give a direct example, but I guess you can look at examples like:

https://stackoverflow.com/questions/10930595/sse-instructions-to-add-all-elements-of-an-array

https://stackoverflow.blog/2020/07/08/improving-performance-with-simd-intrinsics-in-three-use-cases/

https://stackoverflow.com/questions/39759936/the-correct-way-to-sum-two-arrays-with-sse2-simd-in-c

 

I think you are finding out that "simply" and advanced subjects like this don't really go hand in hand. I like this video about optimisations (embedded in the original post):

I think it highlights well that this stuff is hard, and why big math libraries like BLAS (Basic Linear Algebra Subprograms), MKL (Intel's Math Kernel Library), AOCL (AMD Optimizing CPU Libraries) etc. exist for you to write C code with and link against. Even with all the clever optimisations covered in that video, MKL still wipes the floor with them all. You are way past the point of "simply add 2 arrays" and are instead attempting pretty hardcore optimisation with intrinsics, which requires detailed knowledge of your algorithm.

 

3 hours ago, Gat Pelsinger said:

I am completely overwhelmed by how much complexity there is and how MANY functions there are in the Intel intrinsics. I don't really have the time to learn all of this right now, I just want to create a fun little program that speeds up a computation.

I don't know what your level of experience is, but I think you are going way too fast in that case. Your topics seem to fluctuate wildly from in-depth expert questions like "I'm writing my own heap memory and garbage collector" to the comparatively very basic "how do I build/install a library", and then right back to "I'm going to write assembly". I like jumping into the deep end as well, but you can shoot yourself in the foot if you skip too much of the groundwork at once.

 

If you just need fast array/vector math, use a BLAS library. They are made so you don't have to deal with the hassle of assembly or micro-optimisations.
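For the "add 2 arrays" case, that would be daxpy (it computes y = a*x + y, so with a = 1.0 it is exactly an array addition). A sketch, assuming a CBLAS implementation such as OpenBLAS linked with -lopenblas:

#include <cblas.h>

void add_arrays(int n, const double *x, double *y)
{
    // y[i] += 1.0 * x[i] for all i, with unit stride through both arrays
    cblas_daxpy(n, 1.0, x, 1, y, 1);
}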

 

If you are trying to learn optimisation at this level, I would maybe try to break this program down into smaller, easier (relatively speaking) chunks:

  1. Write a normal C program adding two arrays.
  2. Learn about optimisations for that operation. There is no way around having to learn these things.
  3. Learn about/from the optimisations the compiler already does or can do for you, by looking at the generated assembly code.
  4. Start learning how the appropriate SIMD instructions work. (On the allocation question from earlier, see the sketch below.)
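On allocation: there is no standard aligned_malloc, but C11 has aligned_alloc, and the intrinsics headers provide _mm_malloc/_mm_free. A sketch using C11 (the rounding is there because C11 wants the size to be a multiple of the alignment):

#include <stdlib.h>

double *alloc_avx_doubles(size_t n)
{
    size_t bytes = ((n * sizeof(double) + 31) / 32) * 32; // round up to 32
    return aligned_alloc(32, bytes); // release with plain free()
}

With 32-byte aligned pointers you can then use the aligned _mm256_load_pd instead of the unaligned _mm256_loadu_pd.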



@tikker @Kilrah @Eigenvektor

 

Well, I did make a program that adds 2 arrays both with and without AVX2 and compares the time required.

But I got the AVX2 elapsed time as 8.6 seconds and non-AVX2 as 12.6 seconds, which is not even double the performance from using AVX2, even though the AVX2 version executes roughly 4 times fewer operations. Here is my code; remember to compile with the -mavx2 and -mconsole flags.

 

EDIT - using O1 optimization does give me pretty much a 4x boost. But I don't know why I can't replicate that without optimizations. No changes to the code here.

#include <stdio.h>
#include <immintrin.h>
#include <time.h>

int main(void)
{
    {
        double elapsed1 = 0.0;
        double elapsed2 = 0.0;
        double array1[8] __attribute__((aligned(32))) = {433166.809964274230879, 100685.267207268989296, 259778.714776604552753,
                                                         986307.925920814042911, 781291.718875396531075, 454397.978855219436809,
                                                         675246.924393196823075, 828451.099642499582842};
        double array2[8] __attribute__((aligned(32))) = {352745.423174547438975, 434129.107862508215476, 1094696.210192577913404,
                                                         168401.119294500269461, 530600.455753521993756, 452582.745474771712907,
                                                         527037.630485591012985, 920686.536348481662571};

        {
            {
                __m256d a, b, result_vec;
                double result[8] = {0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0};
                clock_t start = clock();
                for (unsigned long long z = 0ULL; z < 1000000000ULL; z++) //1 billion
                {
                    for (unsigned char i = 0; i < 8; i += 4)
                    {
                        a = _mm256_loadu_pd(&array1[i]);
                        b = _mm256_loadu_pd(&array2[i]);
                        result_vec = _mm256_add_pd(a, b);
                        _mm256_storeu_pd(&result[i], result_vec);
                    }
                }
                clock_t end = clock();
                elapsed1 = ((double)(end - start)) / CLOCKS_PER_SEC;
                printf("AVX2 result:\n\n");
                for (unsigned char i = 0; i < 8; i++)
                {
                    printf("%lf\n", result[i]);
                }
            }

            double result[8] = {0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0};
            clock_t start = clock();
            for (unsigned long long z = 0ULL; z < 1000000000ULL; z++) //1 billion
            {
                for (unsigned char i = 0; i < 8; i++)
                {
                    result[i] = array1[i] + array2[i];
                }
            }
            clock_t end = clock();
            elapsed2 = ((double)(end - start)) / CLOCKS_PER_SEC;
            printf("Non-AVX2 result:\n\n");
            for (unsigned char i = 0; i < 8; i++)
            {
                printf("%lf\n", result[i]);
            }
            printf("AVX2 elapsed time: %lf\n", elapsed1);
            printf("Non-AVX2 elapsed time: %lf\n", elapsed2);
        }
    }
    return 0;
}

 



Let me clean up the code a bit. I'll have to toy around a bit more when I return from work.

#include <immintrin.h>
#include <stdio.h>
#include <time.h>

void avx2(double *array1, double *array2, double *result) {
  __m256d a, b, result_vec;

  for (unsigned char i = 0; i < 8; i += 4) {
    a = _mm256_loadu_pd(&array1[i]);
    b = _mm256_loadu_pd(&array2[i]);

    result_vec = _mm256_add_pd(a, b);
    _mm256_storeu_pd(&result[i], result_vec);
  }
}

void plain(double *array1, double *array2, double *result) {
  for (unsigned char i = 0; i < 8; i++) {
    result[i] = array1[i] + array2[i];
  }
}

void measure(char *name, double *array1, double *array2,
             void (*funPtr)(double *, double *, double *)) {
  double elapsed = 0.0;
  double result[8];

  clock_t start = clock();
  for (unsigned long long z = 0ULL; z < 1000000000ULL; z++) // 1 billion
  {
    funPtr(array1, array2, result);
  }
  clock_t end = clock();

  elapsed = ((double)(end - start)) / CLOCKS_PER_SEC;

  printf("Result for %s:\n", name);

  for (unsigned char i = 0; i < 8; i++) {
    printf("%lf\n", result[i]);
  }

  printf("Elapsed time for %s: %lf\n\n", name, elapsed);
}

int main(void) {
  double array1[8] __attribute__((aligned(32))) = {
      433166.809964274230879, 100685.267207268989296, 259778.714776604552753,
      986307.925920814042911, 781291.718875396531075, 454397.978855219436809,
      675246.924393196823075, 828451.099642499582842};

  double array2[8] __attribute__((aligned(32))) = {
      352745.423174547438975, 434129.107862508215476, 1094696.210192577913404,
      168401.119294500269461, 530600.455753521993756, 452582.745474771712907,
      527037.630485591012985, 920686.536348481662571};

  measure("avx2", array1, array2, &avx2);
  measure("plain", array1, array2, &plain);

  return 0;
}

 

 

Without the function pointers I start to get unrealistic results with O2. Presumably because the compiler once again recognizes that the same computation is performed over and over with the same inputs, so it effectively executes it once rather than a billion times.
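Another trick for this (assuming GCC/Clang; it's the idiom benchmarking libraries use, sketched here, untested by me) is an empty inline-asm statement that makes the optimizer treat the result as used:

static inline void do_not_optimize(void *p)
{
    // Empty asm with a "memory" clobber: tells the compiler that p's
    // pointee may be read or written here, so the computation that
    // produced it cannot be deleted or hoisted out of the timing loop.
    __asm__ volatile("" : : "g"(p) : "memory");
}

// In the timing loop:
//   funPtr(array1, array2, result);
//   do_not_optimize(result);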

 

With O3, I get the same result for both. I'd have to look at the compiled code, but most likely the compiler recognizes what the plain loop is doing and vectorizes it as an optimization.

gcc -O3 -mavx2 simd.c -o simd
./simd



@Eigenvektor Do not try with O2 and O3. The compiler just sees that unnecessary computation is being done over and over again, and it just skips it.



3 hours ago, Gat Pelsinger said:

@Eigenvektor Do not try with O2 and O3. The compiler just sees that unnecessary computation is being done over and over again, and it just skips it.

The function pointer should avoid that, going by the execution time I saw.



12 minutes ago, Eigenvektor said:

The function pointer should avoid that, going by the execution time I saw.

?



17 minutes ago, Gat Pelsinger said:

?

At O2 and O3, the compiler optimizes those function calls away, like you saw before. If we instead pass those functions into a benchmarking function via their pointers like so:

measure("avx2", array1, array2, &avx2); // &avx2 is a pointer to the avx2 function
measure("plain", array1, array2, &plain); // Same with &plain

we can prevent the compiler from removing the function calls. This lets us get useful benchmarking results.



7 hours ago, Gat Pelsinger said:

?

Look at the code I posted in the spoiler. I'm using a function pointer to invoke the function to measure. This prevents the compiler from removing it, and also reduces duplicate code and makes it easier to repeat tests multiple times or in different order.

 

At O3 execution time was still 1.x, where before it was something very close to zero (0.00...4). So I'm fairly certain it did do something.

 

~edit:

Here's the output of the original code. As you said, O2 and higher must be removing the methods; the execution time is suspiciously short. However, I think even O1 is already no longer measuring the AVX2 results correctly.

None
AVX2 elapsed time: 8.767082
Non-AVX2 elapsed time: 12.622161

O1
AVX2 elapsed time: 0.240121
Non-AVX2 elapsed time: 2.106372

O2
AVX2 elapsed time: 0.000002
Non-AVX2 elapsed time: 0.000001

O3
AVX2 elapsed time: 0.000002
Non-AVX2 elapsed time: 0.000001

 

 

Here's the output of the code using function pointers. This introduces a little bit of overhead, but it produces sensible results at higher optimization levels.

None
Elapsed time for avx2: 9.941513
Elapsed time for plain: 15.848516

O1
Elapsed time for avx2: 1.134473
Elapsed time for plain: 4.320873

O2
Elapsed time for avx2: 1.310947
Elapsed time for plain: 2.779258

O3
Elapsed time for avx2: 1.082800
Elapsed time for plain: 1.089202

 

Interestingly, at optimization level O3 the result for both functions seems to be virtually identical. Looking at the assembler output of "gcc -O3 -mavx2 -S simd.c" seems to confirm what I expected.

 

The compiler appears to recognize what is going on, so it replaces the loop with SIMD instructions. The resulting code is not identical, but appears to perform the same. In some runs it is ever so slightly faster.

avx2:
.LFB6439:
	.cfi_startproc
	vmovupd	(%rdi), %ymm1
	vaddpd	(%rsi), %ymm1, %ymm0
	vmovupd	%ymm0, (%rdx)
	vmovupd	32(%rsi), %ymm0
	vaddpd	32(%rdi), %ymm0, %ymm0
	vmovupd	%ymm0, 32(%rdx)
	vzeroupper
	ret
	.cfi_endproc
.LFE6439:
	.size	avx2, .-avx2
	.p2align 4
	.globl	plain
	.type	plain, @function
plain:
.LFB6440:
	.cfi_startproc
	leaq	8(%rdi), %rcx
	movq	%rdx, %rax
	subq	%rcx, %rdx
	cmpq	$16, %rdx
	jbe	.L4
	leaq	8(%rsi), %rcx
	movq	%rax, %rdx
	subq	%rcx, %rdx
	cmpq	$16, %rdx
	jbe	.L4
	vmovupd	(%rdi), %ymm1
	vaddpd	(%rsi), %ymm1, %ymm0
	vmovupd	%ymm0, (%rax)
	vmovupd	32(%rsi), %ymm0
	vaddpd	32(%rdi), %ymm0, %ymm0
	vmovupd	%ymm0, 32(%rax)
	vzeroupper
	ret
	.p2align 4,,10
	.p2align 3
.L4:
	vmovsd	(%rdi), %xmm0
	vaddsd	(%rsi), %xmm0, %xmm0
	vmovsd	%xmm0, (%rax)
	vmovsd	8(%rdi), %xmm0
	vaddsd	8(%rsi), %xmm0, %xmm0
	vmovsd	%xmm0, 8(%rax)
	vmovsd	16(%rdi), %xmm0
	vaddsd	16(%rsi), %xmm0, %xmm0
	vmovsd	%xmm0, 16(%rax)
	vmovsd	24(%rdi), %xmm0
	vaddsd	24(%rsi), %xmm0, %xmm0
	vmovsd	%xmm0, 24(%rax)
	vmovsd	32(%rdi), %xmm0
	vaddsd	32(%rsi), %xmm0, %xmm0
	vmovsd	%xmm0, 32(%rax)
	vmovsd	40(%rdi), %xmm0
	vaddsd	40(%rsi), %xmm0, %xmm0
	vmovsd	%xmm0, 40(%rax)
	vmovsd	48(%rdi), %xmm0
	vaddsd	48(%rsi), %xmm0, %xmm0
	vmovsd	%xmm0, 48(%rax)
	vmovsd	56(%rdi), %xmm0
	vaddsd	56(%rsi), %xmm0, %xmm0
	vmovsd	%xmm0, 56(%rax)
	ret
	.cfi_endproc
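If I remember correctly, GCC can also tell you about this directly: the -fopt-info-vec flag makes it print a note for each loop it managed to vectorize, which saves digging through the assembly:

gcc -O3 -mavx2 -fopt-info-vec simd.c -o simd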

 

This is the result at O3 with 10B iterations:

Elapsed time for avx2: 10.518030

Elapsed time for plain: 10.515053



@Eigenvektor ok whatever, but what I was asking is: why can I not get the 4x performance increase I see at O1 when compiling with no optimization? It's just straightforward code.



7 hours ago, Gat Pelsinger said:

@Eigenvektor ok whatever, but what I was asking is: why can I not get the 4x performance increase I see at O1 when compiling with no optimization? It's just straightforward code.

Your code being straightforward does not mean the resulting machine code has to be. Just because you're calling a single method doesn't mean the resulting binary contains a single instruction. I don't know what "_mm256_loadu_pd" looks like internally, but presumably it contains things that can be optimized by the compiler.

 

My guess would be the code behind "_mm256_loadu_pd" is written in a way that makes it more readable, in the knowledge that the optimizer will take care of it once the code is optimized. There's generally no reason to ship a binary that isn't.

 

I'm not really an assembler programmer, but looking at the output of "gcc -mavx2 -S simd.c" vs "gcc -O1 -mavx2 -S simd.c", I can see that the instructions in the "avx2" section are very different.

 

I can even see some NOPs in the unoptimized code (NOP = no-operation), which is effectively a pause: the CPU takes time to fetch and decode an instruction that otherwise does nothing.

 

If I'm not mistaken most CPUs also have a lower turbo speed when running AVX2 code, compared to regular instructions. So it may be able to do more per clock, but at the same time clocks are lower, negating some of that advantage.

 

gcc -mavx2 -S simd.c

avx2:
.LFB4865:
	.cfi_startproc
	pushq	%rbp
	.cfi_def_cfa_offset 16
	.cfi_offset 6, -16
	movq	%rsp, %rbp
	.cfi_def_cfa_register 6
	andq	$-32, %rsp
	subq	$136, %rsp
	movq	%rdi, -96(%rsp)
	movq	%rsi, -104(%rsp)
	movq	%rdx, -112(%rsp)
	movb	$0, -81(%rsp)
	jmp	.L2
.L6:
	movzbl	-81(%rsp), %eax
	leaq	0(,%rax,8), %rdx
	movq	-96(%rsp), %rax
	addq	%rdx, %rax
	movq	%rax, -64(%rsp)
	movq	-64(%rsp), %rax
	vmovupd	(%rax), %ymm0
	vmovapd	%ymm0, -56(%rsp)
	movzbl	-81(%rsp), %eax
	leaq	0(,%rax,8), %rdx
	movq	-104(%rsp), %rax
	addq	%rdx, %rax
	movq	%rax, -72(%rsp)
	movq	-72(%rsp), %rax
	vmovupd	(%rax), %ymm0
	vmovapd	%ymm0, -24(%rsp)
	vmovapd	-56(%rsp), %ymm0
	vmovapd	%ymm0, 72(%rsp)
	vmovapd	-24(%rsp), %ymm0
	vmovapd	%ymm0, 104(%rsp)
	vmovapd	72(%rsp), %ymm0
	vaddpd	104(%rsp), %ymm0, %ymm0
	vmovapd	%ymm0, 8(%rsp)
	movzbl	-81(%rsp), %eax
	leaq	0(,%rax,8), %rdx
	movq	-112(%rsp), %rax
	addq	%rdx, %rax
	movq	%rax, -80(%rsp)
	vmovapd	8(%rsp), %ymm0
	vmovapd	%ymm0, 40(%rsp)
	vmovapd	40(%rsp), %ymm0
	movq	-80(%rsp), %rax
	vmovupd	%ymm0, (%rax)
	nop
	addb	$4, -81(%rsp)
.L2:
	cmpb	$7, -81(%rsp)
	jbe	.L6
	nop
	nop
	leave
	.cfi_def_cfa 7, 8
	ret
	.cfi_endproc

 

gcc -O1 -mavx2 -S simd.c

avx2:
.LFB6439:
	.cfi_startproc
	vmovupd	(%rdi), %ymm1
	vaddpd	(%rsi), %ymm1, %ymm0
	vmovupd	%ymm0, (%rdx)
	vmovupd	32(%rsi), %ymm0
	vaddpd	32(%rdi), %ymm0, %ymm0
	vmovupd	%ymm0, 32(%rdx)
	ret
	.cfi_endproc

 

(I only copied the avx2 section and the sections it jumps to, not all of the assembler output - you can see that O1 produces far fewer lines of machine code)

 

The unoptimized code keeps the loop (with an 8-bit counter) and bounces every intermediate value through the stack, while the optimized variant unrolls it into just the vector loads, adds and stores. At least that's my guess. Maybe someone with more assembler/AVX2 knowledge can chip in.

 

This might also be relevant:

https://stackoverflow.com/a/52628753/7252334



@Eigenvektor @Kilrah

 

Okay, I was working on something else. I wanted to know how many AVX2 registers are available and how I can use all of them, plus use multithreading to use other cores' AVX2 registers. I do need an answer on this, but before that I was writing a program that allocates a lot of __m256i variables in an array, so that I could use all the AVX2 registers, spill the rest to memory, and benchmark all of them.

And, uh... wait, I just realized this is not like the earlier program I made, the one that declared more register variables than could fit in actual registers and benchmarked iterating over them. There, the variables with lower times would be the ones that lived in registers the whole time, and the ones with higher times would not be (the compiler could in principle shuffle unused register variables around to make room for others, but it didn't do that there, and I think everything was allocated at once). What I am trying to say is that this program is not like that one, because the data has to be loaded into registers for processing anyway, so I think the compiler has to move the data between the registers and memory regardless. I don't know, I am getting very confused. All I wanted to say is that the following code doesn't have much reason to exist, because I was going to write it the way I wrote that register program, but it runs into a problem that you should look at.

So the problem: in the inner loop, which I use to scale the execution time up to get readable profiling numbers, there is a magic number somewhere between 100 and 150. If the loop runs up to that number, the program executes in no time and I get the execution time as all zeroes; if the loop is written to run more times than that magic number, it just keeps running forever and there is no output. It is so weird. I tried all different compiler options and optimization levels.

 

#include <immintrin.h>
#include <stdio.h>
#include <stdlib.h> /* srand(), rand() */
#include <time.h>

int main(void)
{
    __m256i array[50];
    __m256i res;
    time_t t;
    clock_t start, end;
    double elapsed;

    srand((unsigned) time(&t));

    /* Fill each 256-bit vector with 32 random byte values. */
    for (char i = 0; i < 50; i++)
    {
        array[i] = _mm256_set_epi8(
            (char)rand() % 100, (char)rand() % 100, (char)rand() % 100, (char)rand() % 100,
            (char)rand() % 100, (char)rand() % 100, (char)rand() % 100, (char)rand() % 100,
            (char)rand() % 100, (char)rand() % 100, (char)rand() % 100, (char)rand() % 100,
            (char)rand() % 100, (char)rand() % 100, (char)rand() % 100, (char)rand() % 100,
            (char)rand() % 100, (char)rand() % 100, (char)rand() % 100, (char)rand() % 100,
            (char)rand() % 100, (char)rand() % 100, (char)rand() % 100, (char)rand() % 100,
            (char)rand() % 100, (char)rand() % 100, (char)rand() % 100, (char)rand() % 100,
            (char)rand() % 100, (char)rand() % 100, (char)rand() % 100, (char)rand() % 100);
    }

    for (char i = 0; i < 49; i++)
    {
        start = clock();
        for (char j = 0; j < 200; j++) //problem?
        {
            res = _mm256_add_epi8(array[i], array[i + 1]);
        }
        end = clock();
        elapsed = ((double)(end - start)) / CLOCKS_PER_SEC;
        printf("%lf\n", elapsed);
    }

    return 0;
}

 



Quote
for (char j = 0; j < 200; j++) //problem?

char is signed, so the "magic number" will be 127...

I prefer using the "uint8_t" like types for clarity.
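A sketch of the fix:

#include <stdint.h>

void repeat_200(void)
{
    // Incrementing a signed char past 127 is undefined behaviour, hence
    // the endless loop above. uint8_t holds 0..255, so 200 is fine; a
    // plain int is simpler still.
    for (uint8_t j = 0; j < 200; j++)
    {
        /* ... work to measure ... */
    }
}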



7 hours ago, Gat Pelsinger said:

Okay, I was working on something else. I wanted to know how many AVX2 registers are available and how I can use all of them

I don't really know the answer to that, since I've never had the need to work with SIMD/AVX2. I would assume you generally don't need to worry about the number of registers. Write code that solves the problem you have and let the compiler deal with these low level details.

 

Naive approach: Write code as you normally would, enable the appropriate compiler flags and let the compiler deal with generating AVX2 instructions. Based on the benchmark results I got, the code generated by the compiler at O3 performs on par with the manually written AVX2 instructions.

 

7 hours ago, Gat Pelsinger said:

multithreading to use other cores' AVX2 registers.

There's no simple answer to that. It depends a lot on what you're doing and how much data there is. Let's assume you have two arrays with a few billion entries and you want to add them together, as before.

 

Possible approach: Depending on the number of entries in each array, split them into multiple (virtual) chunks of appropriate size (at most $num_cores chunks). Spawn a matching number of worker threads. Hand the arrays, an offset and a length to each thread and let them work on adding their subsection of the arrays.

 

This is only really worth it if you're dealing with a lot of data and the overhead introduced by splitting it (and possibly combining results) doesn't eat up all of the performance you gain from using multiple threads.
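I haven't benchmarked this myself, but as a sketch of the idea: OpenMP does exactly this kind of chunking for you (assuming GCC, compile with -fopenmp; combined with -O3 each thread's chunk can also be auto-vectorized):

#include <stddef.h>

void add_arrays_mt(size_t n, const double *a, const double *b, double *out)
{
    // OpenMP splits the iteration range into per-thread chunks.
    #pragma omp parallel for
    for (size_t i = 0; i < n; i++)
    {
        out[i] = a[i] + b[i];
    }
}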

 

7 hours ago, Gat Pelsinger said:

I was writing a program that allocates a lot of __m256i variables in an array, so that I could use all the AVX2 registers, spill the rest to memory, and benchmark all of them.

A more realistic approach would likely be: Load the data you need to work on into memory. Then load n values into registers, call the appropriate AVX2 instruction, copy the results into memory. Then continue to iterate over the rest of the data in similar fashion.

 

Though unless you're writing low level assembler, I don't think you should be thinking in terms of registers at all. Allocate memory, load the data you need into it. Iterate over that data, calling the appropriate C library function with the supported number of arguments each time.
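As a sketch of that streaming pattern (my own, untested; the second loop handles lengths that aren't a multiple of 4):

#include <immintrin.h>
#include <stddef.h>

void add_stream(size_t n, const double *a, const double *b, double *out)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) // 4 doubles per 256-bit register
    {
        __m256d va = _mm256_loadu_pd(a + i);
        __m256d vb = _mm256_loadu_pd(b + i);
        _mm256_storeu_pd(out + i, _mm256_add_pd(va, vb));
    }
    for (; i < n; i++) // scalar tail
    {
        out[i] = a[i] + b[i];
    }
}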



On 1/24/2024 at 11:34 PM, Gat Pelsinger said:

why can I not get the 4x performance increase I see at O1 when compiling with no optimization? It's just straightforward code.

This tells you that it's not just straightforward code and that what you implemented is not the most optimal way to do it (within the constraints/freedom the compiler has). Others have elaborated on things in the meantime, but the straightforward interpretation is simply that the code you write is not as optimal as the compiler can make it. As came up in another topic: (generally speaking) if you think you are smarter than the compiler, think again. The video I linked earlier mentions at the end that even the order in which you load data into the SIMD registers can matter for performance. The compiler can (try to) take all of that into account; humans typically can't. They probably don't even know half the tricks it leverages exist.

 

I think the curiosity is useful, and this type of experimentation is great for obtaining a deeper understanding of how code works and can be optimised, but do accept that the optimisation settings exist exactly because you will not write code that is faster than what the compiler can make it, unless maybe you are a compiler engineer. The other day I was watching a video about why Rust was slower than C on some problem despite both being compiled languages with the same underlying compiler backend, and it came down to something as obscure as an odd vs. even number of fields in a struct. The odd number of fields was "problematic" when compiling the Rust, and changing the code to be more complex was what actually led to more optimal assembly being generated.



13 hours ago, Kilrah said:

char is signed, so the "magic number" will be 127...

I prefer using the "uint8_t" like types for clarity.

omg I am dumb, how did I miss such a thing 😐



@Eigenvektor @tikker @Kilrah

 

So I ran my program, and I am getting the same execution time for all the __m256i variables. I think I expected this, because unlike the register-variables program I mentioned, data is cleared out of the registers once it is no longer needed, I think. But anyway, all I wanted to know is: how can I tell how many free AVX2 registers there are for me to utilize? Or should I just unroll my loop using as many __m256i variables as I like, because it doesn't matter since the data is going to be moved in and out anyway?



@Eigenvektor @tikker @Kilrah

 

Well, I have another question. I want to drag-race traditional register computation vs AVX2 computation, and I want the gap to be as huge as possible, so my eyes fell upon FMA (fused multiply-add), as it does a multiplication and an addition at the same time, which would be even faster. But on the Intel intrinsics website, the only FMA functions for AVX2 take either single or double precision floats (and only multiply 2 floats and add a value per function), whereas for maximum throughput I intend to use 8-bit integers so more data can be processed at once.



On 1/26/2024 at 2:32 AM, Gat Pelsinger said:

But anyway, all I wanted to know is: how can I tell how many free AVX2 registers there are for me to utilize? Or should I just unroll my loop using as many __m256i variables as I like, because it doesn't matter since the data is going to be moved in and out anyway?

You can find that information in Wikipedia: https://en.wikipedia.org/wiki/Advanced_Vector_Extensions

 

Quote

AVX uses sixteen YMM registers to perform a single instruction on multiple pieces of data (see SIMD). Each YMM register can hold and do simultaneous operations (math) on:

  • eight 32-bit single-precision floating point numbers or
  • four 64-bit double-precision floating point numbers.

 

Advanced Vector Extensions 2 (AVX2) [...] makes the following additions:

  • expansion of most vector integer SSE and AVX instructions to 256 bits
  • Gather support, enabling vector elements to be loaded from non-contiguous memory locations
  • DWORD- and QWORD-granularity any-to-any permutes
  • vector shifts.

 

tl;dr: There are 16 registers available. It's up to you to keep track of how much data you've loaded into them (i.e. it's your job to know how many are still free).

 

The point of using higher level languages is that you should no longer have to worry about hardware details such as this. Unless you're writing assembler code, let the compiler worry about these hardware implementation details and concentrate on writing code that solves your problem instead.
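As a sketch of what "concentrate on your code" means in practice: a plain loop with restrict-qualified pointers (promising the compiler the arrays don't overlap, which is what the aliasing checks in the earlier "plain" assembly were guarding against) is something the compiler can typically vectorize on its own at O3:

// gcc -O3 -mavx2 should turn this into AVX adds without any intrinsics.
void add_arrays(int n, const double *restrict a,
                const double *restrict b, double *restrict out)
{
    for (int i = 0; i < n; i++)
    {
        out[i] = a[i] + b[i];
    }
}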

 

Maybe also have a look at https://stackoverflow.com/questions/48892733/can-avx2-compiled-program-still-use-32-registers-of-an-avx-512-capable-cpu

 

On 1/26/2024 at 7:01 AM, Gat Pelsinger said:

But on the Intel intrinsics website, the only FMA functions for AVX2 take either single or double precision floats (and only multiply 2 floats and add a value per function), whereas for maximum throughput I intend to use 8-bit integers so more data can be processed at once.

There's a hardware cost associated with adding more operations, so it makes sense to only offer functionality that people actually need. If 32/64-bit floating point operations are all that's available, you'll have to design your benchmark around that.
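For reference, a sketch of the float version (FMA is technically its own extension next to AVX2, so compile with -mfma; _mm256_fmadd_pd computes a*b + c per lane):

#include <immintrin.h>

__m256d fma4(__m256d a, __m256d b, __m256d c)
{
    return _mm256_fmadd_pd(a, b, c); // 4 fused multiply-adds at once
}

For 8-bit integers the closest thing AVX2 itself offers, as far as I know, is _mm256_maddubs_epi16, which multiplies pairs of bytes and adds adjacent products into 16-bit lanes, not a true FMA.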


