GCC up to 50% SLOWER vs LLVM (Clang) and Intel compilers

Camofelix

TL;DR: Programs compiled with GCC (versions 7-12) take up to 50% longer to complete than those compiled with the open-source, LLVM-based Clang (11-13) and Intel ICX compilers, or with the closed-source Intel ICC compiler.

Hi Ladies and Gents,

I've been profiling the ins and outs of how Alder Lake handles various workloads in various environments, as a way of previewing how Sapphire Rapids, which uses the same Golden Cove core, will perform in HPC tasks.

 

To that end, I've already published a few hundred results on Twitter covering different scenarios, with different kernels, compilers, memory sub-timings, etc. Those can be found here:
External Link

 

Of interest for today however is this test of Binary trees:

[Image: compiler comparison chart for the binary trees test]

Time taken, GCC versions 7 through 12:
GCC 7: 379.943716
GCC 8: 395.665537
GCC 9: 373.488119
GCC 10: 392.596422
GCC 11: 382.825910
GCC 12: 390.466340

Time taken, Clang versions 11 through 13:
Clang 11: 256.381165
Clang 12: 290.616438
Clang 13: 284.877824

Time taken, Intel compilers:
ICC: 249.630150
ICX: 250.511041

All of the tests above were run with a tree size of 26.

Each test was run 20 times, and run-to-run variation was within +/- 0.2%.
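
To put the gap in numbers, here is a minimal sketch that computes how much longer the GCC builds took relative to the Clang and Intel builds. The timings are copied from the output above; pairing them with compiler versions in listing order is an assumption, and the helper names `mean` and `extra_time_pct` are made up for this illustration. It's in Python purely to keep the arithmetic readable.

```python
# Timings copied from the post, in the units the test program reported.
# Assumption: each time belongs to the compiler version listed in the same position.
gcc = {7: 379.943716, 8: 395.665537, 9: 373.488119,
       10: 392.596422, 11: 382.825910, 12: 390.466340}
clang = {11: 256.381165, 12: 290.616438, 13: 284.877824}
intel = {"icc": 249.630150, "icx": 250.511041}


def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


def extra_time_pct(slow, fast):
    """Percentage by which the time `slow` exceeds the time `fast`."""
    return (slow / fast - 1.0) * 100.0


gcc_mean = mean(gcc.values())
clang_mean = mean(clang.values())
intel_mean = mean(intel.values())

print(f"GCC mean vs Clang mean: {extra_time_pct(gcc_mean, clang_mean):.1f}% longer")
print(f"GCC mean vs Intel mean: {extra_time_pct(gcc_mean, intel_mean):.1f}% longer")

# Narrowest and widest gap between any GCC build and any Intel build.
narrowest = extra_time_pct(min(gcc.values()), max(intel.values()))
widest = extra_time_pct(max(gcc.values()), min(intel.values()))
print(f"GCC vs Intel range: {narrowest:.1f}% to {widest:.1f}% longer")
```

On these figures the GCC average comes out a little over 50% behind the Intel average and a bit under 40% behind the Clang average.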

The Git repository with the test can be found here: https://github.com/FCLC/Choosing-a-compiler-performance-testing-GCC_ICC_ICPX_NVCC_CLANG_HIP/tree/main/Binary_tree


I'd love to see results from anyone else, along with your thoughts.

Interesting, I wonder how much of this is due to Alder Lake being new and possibly not fully optimized for in GCC... might try this on Sandy Bridge later.

Don't ask to ask, just ask... please 🤨

sudo chmod -R 000 /*

On 12/20/2021 at 2:35 AM, Sauron said:

Interesting, I wonder how much of this is due to Alder Lake being new and possibly not fully optimized for in GCC... might try this on Sandy Bridge later.

Turns out it isn't Alder Lake at all. It's pervasive as far back as Nehalem, on all GCC versions.

There didn't seem to be much interest on the LTT forums, so I stopped updating this thread. I've been tracking this on the Level1Techs forums instead, and the main thread there has much more info: https://forum.level1techs.com/t/wip-testing-update-its-not-just-alder-lake-it-goes-back-to-nehalem-gcc-50-performance-regressions-vs-clang-and-intel-compilers-in-specific-workloads-across-all-opt-settings/179712/10


I've dug through a lot of the assembly, but haven't gone *all the way down* the rabbit hole as it were. (If you count 100+ different runs as not going all the way down I guess 😂)

It seems GCC is trying to pre-cache instructions a lot, almost N64 instruction-cache style, using way more registers at times and wasting cycles.

 

On 12/31/2021 at 10:01 PM, ahmad13610 said:

how about bionic chips compiler. can it come into competitive?

Not quite sure what you mean 🤔

On 12/20/2021 at 2:35 AM, Sauron said:

Interesting, I wonder how much of this is due to Alder Lake being new and possibly not fully optimized for in GCC... might try this on Sandy Bridge later.

Went further down the rabbit hole, and it's a bug in how Glibc (the GNU C library) and GCC do malloc. 

Replacing the memory allocation routines with TCMalloc, jemalloc, or Hoard all yielded *massive* uplifts in performance, leading to GCC 12 surpassing ICC and Clang 11 (when those two are using glibc malloc).

I haven't had the time to integrate TCMalloc etc. with oneAPI yet, but hope to do so soon.

2 minutes ago, Camofelix said:

it's a bug in how Glibc (the GNU C library) and GCC do malloc. 

well that's not great 😆 hopefully it's fixed soon

1 minute ago, Sauron said:

well that's not great 😆 hopefully it's fixed soon

Yeah, I'm hoping to have time to look into it after the kernel 5.17 merge window.

It's a *somewhat* niche case, but malloc mixed with bit-shifting for exponential trees isn't completely uncommon in HPC, so there could be some problems sitting there sucking up cycles in supercomputers as I type this 😔

Thankfully, those environments tend to use Cray, Intel, or custom compilers, which are immune to this.
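
For a sense of scale on why the allocator matters so much in this kind of workload, here is a rough sketch of how many nodes a complete binary tree contains, using the same bit-shifting trick. It rests on two assumptions not confirmed in this thread: that a "tree size of 26" means a depth of 26, and that each node costs one allocator call. The function names are hypothetical, and the benchmark itself is not written in Python; this only illustrates the arithmetic.

```python
def nodes_in_complete_tree(depth: int) -> int:
    """Node count of a complete binary tree with the root at depth 0.

    Level k holds 2**k nodes, so the total is 2**(depth + 1) - 1,
    which a single left shift computes directly.
    """
    return (1 << (depth + 1)) - 1


def build(depth: int):
    """Build a complete tree as nested tuples.

    Each tuple stands in for one node; in a compiled benchmark this
    would typically be one allocator call (assumption).
    """
    if depth == 0:
        return (None, None)
    return (build(depth - 1), build(depth - 1))


def count(node) -> int:
    """Count nodes by walking the tree, as a cross-check on the formula."""
    left, right = node
    if left is None:
        return 1
    return 1 + count(left) + count(right)


# Cross-check the closed form against an actual small tree.
small_depth = 10
assert count(build(small_depth)) == nodes_in_complete_tree(small_depth)

# Assumption: "tree size of 26" refers to depth 26.
print(f"nodes at depth 26: {nodes_in_complete_tree(26):,}")
```

Under those assumptions a single depth-26 tree is already on the order of a hundred million small allocations, so any per-call overhead in the allocator gets multiplied accordingly.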
