https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100745
--- Comment #1 from Nicolas F. <ajidala at gmail dot com> ---
I'll attach a second version of profile.c, with the vector extension code that's actually going to be used in mpv (some cleanup has been done). Performance is unchanged.

Some absolute numbers from GCC 11.1.0:

$ ./profile
old: 811703
nicolas: 262007 (3.10x as fast)
niklas: 679524 (1.19x as fast)

Some absolute numbers from Clang -O3:

$ ./profile
old: 1547552
nicolas: 269081 (5.75x as fast)
niklas: 246508 (6.28x as fast)

As you can see, Clang does significantly worse on the C version (yay GCC!), but, most importantly, significantly better in absolute terms on the vector version: more than twice as fast as GCC's code.

Looking at GCC's assembly output, I can see some odd choices, such as shuffling vectors around on the stack instead of using the other scratch registers (v21-v30), whereas Clang does use those scratch registers.
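For anyone not familiar with the term: assuming "vector extension code" here means GCC's generic vector extensions (which Clang also accepts), a minimal sketch of that style looks roughly like the following. This is not taken from profile.c; the type `v4i` and the helpers `madd4` and `hsum4` are hypothetical names made up for illustration.

```c
#include <stdint.h>

/* Hypothetical 128-bit vector type holding four 32-bit integers.
 * vector_size is given in bytes. Not taken from profile.c. */
typedef int32_t v4i __attribute__((vector_size(16)));

/* Element-wise a * b + c across all four lanes. The compiler is free
 * to lower this to SIMD instructions for the target. */
static v4i madd4(v4i a, v4i b, v4i c)
{
    return a * b + c;
}

/* Horizontal sum of the four lanes, using vector subscripting. */
static int32_t hsum4(v4i v)
{
    return v[0] + v[1] + v[2] + v[3];
}
```

Because code like this leaves instruction selection and register allocation to the compiler, the same source can produce quite different assembly under GCC and Clang, which is the kind of difference described above.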