https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100745

--- Comment #1 from Nicolas F. <ajidala at gmail dot com> ---
I'll attach a second version of profile.c, with the vector extension code
that's actually going to be used in mpv (some cleanup has been done).
Performance is unchanged. Some absolute numbers from gcc 11.1.0:

$ ./profile 
old: 811703
nicolas: 262007 (3.10x as fast)
niklas: 679524 (1.19x as fast)

Some absolute numbers from Clang -O3:

$ ./profile 
old: 1547552
nicolas: 269081 (5.75x as fast)
niklas: 246508 (6.28x as fast)

As you can see, Clang does significantly worse on the C version (yay GCC!), but
it does significantly better on the vector version, and, most importantly, it
is better in absolute terms: more than twice as fast as GCC's code.

Looking at GCC's assembly output, I can see some odd choices, such as shuffling
vectors around on the stack instead of using the other scratch registers
(v21-v30), whereas Clang does use them.
