Segfault in C when using FMA AVX2.

Gat Pelsinger

This code tries to benchmark matrix multiplication on normal registers against fused multiply-add with AVX2 registers. I know my AVX2 function isn't the best optimized, and I want it that way for now, so that I can add more optimized functions later and measure the performance of all of them.

 

#include <stdio.h>
#include <time.h>
#include <immintrin.h>
#include <stdbool.h>

#define PIE 3.14159
#define ROWS 64
#define COLS 64
#define N 64

static inline void mul(float m1[ROWS][COLS], float m2[ROWS][COLS], float res[ROWS][COLS])
{
    for (size_t i = 0; i < ROWS; i++)
    {
        for (size_t j = 0; j < COLS; j++)
        {
            for (size_t k = 0; k < N; k++)
            {
                res[i][j] += m1[i][k] * m2[k][j];
            }
        }
    }
    return;
}

static inline void mul_AVX2_FMA_1(const __m256 m1[N], const __m256 m2[N], __m256 res[N])
{
    for (size_t i = 0ULL; i < ROWS; i++)
    {
        for (size_t j = 0ULL; j < COLS; j++)
        {
            for (size_t k = 0ULL; k < N; k++)
            {
                res[i * COLS + k] = _mm256_fmadd_ps(m1[i * COLS + j], m2[j * COLS + k], res[i * COLS + k]); //problem
            }
        }
    }
    return;
}

static inline const unsigned long long bench(void (*funcPtr)(void *, void *, void *), void *A, void *B, void *C)
{
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    funcPtr(A, B, C);
    clock_gettime(CLOCK_MONOTONIC, &end);
    return end.tv_nsec - start.tv_nsec;
}

static inline const void print(float C[ROWS][COLS])
{
    for (size_t i = 0ULL; i < N; i++)
    {
        for (size_t j = 0ULL; j < N; j++)
        {
            printf("%f ", C[i][j]);
        }
        printf("\n");
    }
    return;
}

static inline const bool check(const float A[ROWS][COLS], const __m256 A_[N]) //just checking if both the functions' results match
{
    size_t i = 0ULL;
    size_t j = 0ULL;
    do
    { //some branchless programming attempt
        while (j < N && A[i][j] == A_[i][j])
        {
            j++;
        }
        if (j == N)
        {
            i++;
            j = 0ULL;
        }
        else{
            return false;
        }
    }
    while(i < ROWS);
    return true;
}

void main(void)
{
    float A[ROWS][COLS] __attribute__((aligned(32)));
    float B[ROWS][COLS] __attribute__((aligned(32)));
    float C[ROWS][COLS] __attribute__((aligned(32)));
    {
        for (size_t i = 0ULL; i < N; i++)
        {
            for (size_t j = 0ULL; j < N; j++)
            {
                A[i][j] = (rand() % 100) / PIE;
                B[i][j] = (rand() % 100) / PIE;
            }
        }
    }
    {
        unsigned long long elapsed = bench(&mul, A, B, C);
        print(C);
        printf("mul elapsed: %llu nanoseconds\n\n\n", elapsed);
    }
    {
        __m256 A_[N], B_[N], C_[N];
        for (size_t i = 0ULL; i < N; i++)
        {
            A_[i] = _mm256_load_ps(&A[i]);
            B_[i] = _mm256_load_ps(&B[i]);
            C_[i] = _mm256_setzero_ps();
        }
        unsigned long long elapsed = bench(&mul_AVX2_FMA_1, A_, B_, C_);
        print(C_);
        printf("mul_AVX2_FMA_1 elapsed: %llu nanoseconds\n", elapsed);
        printf("%s", check(C, C_) ? "Results match" : "Results DO NOT match!");
    }
    return;
}

 

The code prints everything up to "mul elapsed: " and its time, and then nothing else. GDB says -

 

Thread 1 received signal SIGSEGV, Segmentation fault.
mul_AVX2_FMA_1 (m1=0xd4407ff300, m2=0xd4407feb00, res=0xd4407fe300) at main.c:34
34                      res[i * COLS + k] = _mm256_fmadd_ps(m1[i * COLS + j], m2[j * COLS + k], res[i * COLS + k]);

 

Compiled with -

gcc -o main main.c -lpthread -mavx2 -march=native -mconsole -g

 

Is there a problem in how I am accessing my arrays or their alignment?


Quote
res[i * COLS + k] = _mm256_fmadd_ps(m1[i * COLS + j], m2[j * COLS + k], res[i * COLS + k]); //problem

 

I'm not exactly sure what you're trying to do here, but I think you want

res[i * j + k] = _mm256_fmadd_ps(m1[j], m2[k], res[i * j + k]);

At the very least, you were not accessing m1 and m2 properly. They're both size 64, but you immediately overrun them the moment i = 1 (m1[1 * 64 + 0] is m1[64], one past the end).

 

Edit: Oh, and res is also size N, which is probably wrong. Maybe you want it to be size ROWS * COLS + N?
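
To make the overrun concrete, here's a tiny standalone sketch (not a fix for your function, just the index math), showing how far past the end that indexing reaches when the arrays only have N = 64 elements:

#include <stdio.h>

#define ROWS 64
#define COLS 64
#define N 64

int main(void)
{
    /* The __m256 arrays passed to mul_AVX2_FMA_1 have only N (= 64) elements,
       so the largest legal index is N - 1 = 63. */
    size_t largest_index_used = (ROWS - 1) * COLS + (N - 1);
    printf("array size: %d, largest index the loops touch: %zu\n", N, largest_index_used);
    /* prints "array size: 64, largest index the loops touch: 4095" -> far out of bounds */
    return 0;
}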


@QuantumRand Exactly, I suck at math. I don't really know what's going on; I've been following some code that ChatGPT, our AI godfather, generates. One thing that I think might be wrong is that I am only giving the __m256 variables a 1D array in the function, because I thought the __m256 variables are an array (vector) by themselves, so 1+1 = 2. Except we are doing vectorized computation on vectors, which I think cancels the vectors out, so each vector just becomes a group of scalar values.

 

So, if I visualize it that way, my __m256 matrix has multiple rows but only 1 column. So I think I need to add another array dimension and then just do the straightforward m1[i][j]?
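
Something like this is what I am imagining now (one row of 64 floats would take 8 __m256 vectors, so the matrix becomes 2D again), but I have no idea if this layout is right:

#include <stdio.h>
#include <immintrin.h>

#define ROWS 64
#define COLS 64
#define LANES 8                      /* floats per __m256 */
#define VECS_PER_ROW (COLS / LANES)  /* 64 / 8 = 8 vectors per row */

/* Pack a 64x64 float matrix into ROWS x VECS_PER_ROW __m256 vectors. */
static void pack(float A[ROWS][COLS], __m256 out[ROWS][VECS_PER_ROW])
{
    for (size_t i = 0; i < ROWS; i++)
        for (size_t v = 0; v < VECS_PER_ROW; v++)
            out[i][v] = _mm256_loadu_ps(&A[i][v * LANES]); /* loadu: no alignment requirement */
}

int main(void)
{
    static float A[ROWS][COLS];
    static __m256 A_[ROWS][VECS_PER_ROW];
    A[0][0] = 1.0f;
    pack(A, A_);
    printf("packed %dx%d floats into %dx%d vectors\n", ROWS, COLS, ROWS, VECS_PER_ROW);
    return 0;
}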


@QuantumRand btw it still segfaults if I replace my code with yours.


2 minutes ago, Gat Pelsinger said:

@QuantumRand btw it still segfaults if I replace my code with yours.

I'm not sure what you're trying to do with your mul_AVX2_FMA_1 function. It's strange that you have nested loops as if you're trying to do some matrix operations, but _mm256_fmadd_ps is doing all that for you. 


@QuantumRand

 

huuuh? Ok, I seriously don't get anything. If you know how to multiply 2 matrices using fmadd, why don't you take like 2 mins and write the function for me? Also, I know there are optimizations I can do, but I just wanted to make this function a complete AVX2 replacement for the non-AVX2 multiplication function. I want to add more functions which are more optimized later.


Rule #1: never copy and paste code you don't understand. Applies even to you (and all of us), Stack Overflow copy-pasters.


10 minutes ago, Gat Pelsinger said:

@QuantumRand

 

huuuh? Ok, I seriously don't get anything. If you know how to multiply 2 matrices using fmadd, why don't you take like 2 mins and write the function for me? Also, I know there are optimizations I can do, but I just wanted to make this function a complete AVX2 replacement for the non-AVX2 multiplication function. I want to add more functions which are more optimized later.

I've never used _mm256_fmadd_ps, and I don't have a system that supports AVX2 instructions, so I can't even run code to check that I'm implementing it correctly for you. All I have is the documentation, and that's what I'm going off of.

 

Per https://www.smcm.iqfr.csic.es/docs/intel/compiler_c/main_cls/intref_cls/common/intref_avx_fmadd_ps.htm, that function multiplies a and b and then adds the result to c. So I don't know why you have a nested loop in your mul_AVX2_FMA_1 function. Are you trying to iterate over different combinations of m1*m2+res? That doesn't really make sense since you're assigning that result to res and then continuing the loop. So then was your intention for res to be N*N in size? Or were you just trying to iterate through the N m1/m2 arrays?

 

I can't re-write your code for you if I don't understand what you're trying to do. Best I can do is point out things that strike me as odd and hope that it helps.


@QuantumRand As I said, I am completely confused, but I will try my best to express my understanding. First of all, there is no difference between N and ROWS and COLS. That is why they are all 64. Second, I have tried replicating what the normal mul function contains with AVX. 3 loops. The code is right in front of you.


26 minutes ago, Gat Pelsinger said:

@QuantumRand As I said, I am completely confused, but I will try my best to express my understanding. First of all, there is no difference between N and ROWS and COLS. That is why they are all 64. Second, I have tried replicating what the normal mul function contains with AVX. 3 loops. The code is right in front of you.

The difficulty is that your mul function and mul_AVX2_FMA_1 function are not doing the same thing, so I do not know what you're intending to do with your code.

 

Perhaps there's a misunderstanding of what AVX intrinsics are for? They're not for arbitrary matrix math. What they do is allow for wider math and/or simultaneous processing. So instead of being limited to a 64-bit long, for example, you have a __m256, which represents 256 bits of data as a vector of four 64-bit numbers.

 

Here is the defined operation for _mm256_fmadd_ps for example:

FOR j := 0 to 7
	i := j*32
	dst[i+31:i] := (a[i+31:i] * b[i+31:i]) + c[i+31:i]
ENDFOR
dst[MAX:256] := 0

 


@QuantumRand What do you mean you do not understand what I am doing with my code? I just want to multiply 2 matrices, that's it. The way I see it, my AVX2 function is doing the same thing as the normal mul function, just with 8 times fewer operations.

If your machine does not support AVX2 (wait, how old is your PC?), you can use 128-bit AVX, and I will just convert the code to 256-bit.


9 hours ago, Gat Pelsinger said:

@QuantumRand What do you mean you do not understand what I am doing with my code? I just want to multiply 2 matrices, that's it. The way I see it, my AVX2 function is doing the same thing as the normal mul function, just with 8 times fewer operations.

If your machine does not support AVX2 (wait, how old is your PC?), you can use 128-bit AVX, and I will just convert the code to 256-bit.

I'm on Apple silicon. Their chips are Arm based and don't have the same instruction sets as x86.

 

Here's what I think you're trying to do. I have not run this code, and I have no idea if it even compiles, but maybe it'll point you in the right direction.

#include <stdio.h>
#include <stdlib.h>

#ifdef __AVX__
  #include <immintrin.h>
#else
  // warning AVX is not available. Code will not compile!
#endif

#define ITERATIONS 1000

// perform a multiply and add on 3 groups of 8 floats, and then pack them into a 256bit array
float* software_fmadd(float a[8], float b[8], float c[8]) {
    float* dst = malloc(8 * sizeof(float));
    for (int i = 0; i < 8; i++) {
        dst[i] = a[i]*b[i]+c[i];
    }
    return dst;
}

int main() {
    
    float data1[ITERATIONS][8];
    float data2[ITERATIONS][8];
    float data3[ITERATIONS][8];
    __m256* avxData1 = malloc(ITERATIONS * sizeof(__m256));
    __m256* avxData2 = malloc(ITERATIONS * sizeof(__m256));
    __m256* avxData3 = malloc(ITERATIONS * sizeof(__m256));
    
    for (int i = 0; i < ITERATIONS; i++) {
        for (int j = 0; j < 8; j++) {
            data1[i][j] = (float) (rand() / 3.14159);
            data2[i][j] = (float) (rand() / 3.14159);
            data3[i][j] = (float) (rand() / 3.14159);
        }
        // pack the values from data into __m256 vectors
        // you may want to benchmark this as well since the documentation suggests that this is slower than the actual fmadd operation
        avxData1[i] = _mm256_load_ps(data1[i]);
        avxData2[i] = _mm256_load_ps(data2[i]);
        avxData3[i] = _mm256_load_ps(data3[i]);
    }
    
    // Time how long this takes for software performance
    for (int i = 0; i < ITERATIONS; i++) {
        software_fmadd(data1[i], data2[i], data3[i]);
    }
    
    // Time how long this takes for AVX performance
    for (int i = 0; i < ITERATIONS; i++) {
        _mm256_fmadd_ps(avxData1[i], avxData2[i], avxData3[i]);
    }
    return 0;
}

 


@Gat Pelsinger I threw a gcc docker up on my storage server and got some code working:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#ifdef __AVX__
  #include <immintrin.h>
#else
  // warning AVX is not available. Code will not compile!
#endif

#define ITERATIONS 1000

// perform a multiply and add on 3 groups of 8 floats, and then pack them into a 256bit array
float* software_fmadd(float a[8], float b[8], float c[8]) {
    float* dst = malloc(8 * sizeof(float));
    for (int i = 0; i < 8; i++) {
        dst[i] = a[i]*b[i]+c[i];
    }
    return dst;
}

int main() {
    
    float* data1[ITERATIONS];
    float* data2[ITERATIONS];
    float* data3[ITERATIONS];
    __m256 avxData1[ITERATIONS]; 
    __m256 avxData2[ITERATIONS]; 
    __m256 avxData3[ITERATIONS];

    for (int i = 0; i < ITERATIONS; i++) {

        data1[i] = (float*)malloc(sizeof(float) * 8);
        data2[i] = (float*)malloc(sizeof(float) * 8);
        data3[i] = (float*)malloc(sizeof(float) * 8);

        for (int j = 0; j < 8; j++) {
            data1[i][j] = (float) (rand() / 3.14159);
            data2[i][j] = (float) (rand() / 3.14159);
            data3[i][j] = (float) (rand() / 3.14159);
        }
        // pack the values from data into __m256 vectors
        // you may want to benchmark this as well since the documentation suggests that this is slower than the actual fmadd operation
        avxData1[i] = _mm256_loadu_ps(data1[i]);
        avxData2[i] = _mm256_loadu_ps(data2[i]);
        avxData3[i] = _mm256_loadu_ps(data3[i]);
    }
    
    // Time how long this takes for software performance

    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < ITERATIONS; i++) {
        software_fmadd(data1[i], data2[i], data3[i]);
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    printf("Software took %d ns\n", end.tv_nsec - start.tv_nsec);
    
    // Time how long this takes for AVX performance
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < ITERATIONS; i++) {
        __m256 r = _mm256_fmadd_ps(avxData1[i], avxData2[i], avxData3[i]);
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    printf("AVX2 took %d ns\n", end.tv_nsec - start.tv_nsec);
    return 0;
}

 

A couple of things. _mm256_load_ps() expects its source pointer to be 32-byte aligned. That could have been why you were running into seg faults. Using _mm256_loadu_ps(), which has no alignment requirement, helped mitigate this. It's probably a little slower than the aligned loader, but you're not measuring that part anyway.
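
If you do want the aligned loader, the source memory has to actually be 32-byte aligned. Roughly like this (just a sketch using C11 aligned_alloc; I haven't run it since I'm on Arm):

#include <stdlib.h>
#include <immintrin.h>

int main(void)
{
    /* _mm256_load_ps requires a 32-byte-aligned pointer; _mm256_loadu_ps does not. */
    float *aligned   = aligned_alloc(32, 8 * sizeof(float)); /* C11; size is a multiple of the alignment */
    float *unaligned = malloc(8 * sizeof(float));            /* no 32-byte guarantee */
    if (!aligned || !unaligned) return 1;

    for (int i = 0; i < 8; i++) aligned[i] = unaligned[i] = (float)i;

    __m256 a = _mm256_load_ps(aligned);    /* fine: pointer is 32-byte aligned */
    __m256 b = _mm256_loadu_ps(unaligned); /* fine regardless of alignment */
    /* _mm256_load_ps(unaligned) may fault if that pointer happens to be misaligned */

    (void)a; (void)b;
    free(aligned);
    free(unaligned);
    return 0;
}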

 

Also, software_fmadd does a memory allocation on every call. It would probably be ~2x faster if that allocation was done separately ahead of time.
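
Roughly what I mean (a hypothetical variant, not something I've benchmarked): give software_fmadd a destination buffer that is allocated once, outside the timed loop.

/* Hypothetical variant of software_fmadd that writes into a caller-provided buffer
   instead of calling malloc on every call. */
void software_fmadd_into(float a[8], float b[8], float c[8], float dst[8]) {
    for (int i = 0; i < 8; i++) {
        dst[i] = a[i] * b[i] + c[i];
    }
}

/* In the timed section, allocate once and reuse:

    float *dst = malloc(8 * sizeof(float));
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < ITERATIONS; i++) {
        software_fmadd_into(data1[i], data2[i], data3[i], dst);
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    free(dst);
*/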

 

On my system, using AVX intrinsics was about 14x faster than the software implementation:

Software took 140633 ns
AVX2 took 9530 ns

 

Obviously, this is going to be system dependent and you may want to up the ITERATIONS to get a more accurate comparison. Alternatively, you could set up a loop to run for N seconds and count how many times it completes a set of ITERATIONS for each implementation.
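
If you go the run-for-N-seconds route, the shape would be roughly this (just a sketch; benchmark_pass is a hypothetical stand-in for one full set of ITERATIONS of whichever implementation you're measuring):

#include <stdio.h>
#include <time.h>

/* Hypothetical stand-in: run one full set of ITERATIONS of the implementation under test. */
static void benchmark_pass(void)
{
    /* ... ITERATIONS fmadd operations go here ... */
}

int main(void)
{
    const double run_seconds = 5.0; /* how long to keep looping */
    struct timespec start, now;
    unsigned long long passes = 0;

    clock_gettime(CLOCK_MONOTONIC, &start);
    do {
        benchmark_pass();
        passes++;
        clock_gettime(CLOCK_MONOTONIC, &now);
    } while ((double)(now.tv_sec - start.tv_sec) +
             (double)(now.tv_nsec - start.tv_nsec) / 1e9 < run_seconds);

    printf("completed %llu passes in about %.1f seconds\n", passes, run_seconds);
    return 0;
}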

