https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111166

--- Comment #3 from gnu_bugzilla_gcc at catelyn dot tech ---
(In reply to Richard Biener from comment #1)
> Unless you can come up with an actual benchmark showing the vector code is
> slower I'd say it's not.  Given it's smaller it should win on the icache
> side if not executed frequently as well.

I'm not an expert in benchmarking C, so my benchmark may be incorrect. I
compiled the same (attached preprocessed) file with -O2, -O3, and -Os into an
object file, and compiled a benchmarking file into a separate object (to avoid
variance caused by the benchmarking file being compiled with different
optimization levels). I added a very simple implementation for
`do_smth_with_4_u32` and ran the `turn_into_struct` function in a hot loop over
varying (pre-generated) input data, storing each result in an array. I timed
this hot loop using `(float)clock()/CLOCKS_PER_SEC;` at the start and end, then
added up the calculated results to ensure all three programs get the same
result.
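
The setup described above can be sketched roughly as follows. This is a
minimal illustration, not the attached benchmark: the `struct quad_u32`
layout, the four-argument signature of `turn_into_struct`, the `next_u32`
generator and `run_benchmark` itself are all assumptions made up for the
sketch, and the placeholder body stands in for the function that would really
come from the separately compiled object file.

```c
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Assumed layout; the real type is defined in the attached file. */
struct quad_u32 { uint32_t a, b, c, d; };

/* Placeholder for the function under test. In the real setup this lives in
   the object file built at -O2, -O3 or -Os and is only declared here. */
struct quad_u32 turn_into_struct(uint32_t a, uint32_t b,
                                 uint32_t c, uint32_t d)
{
    struct quad_u32 q = { a, b, c, d };
    return q;
}

/* Hypothetical deterministic generator (a plain LCG) so that every build
   is fed exactly the same input data. */
static uint32_t next_u32(uint32_t *state)
{
    *state = *state * 1664525u + 1013904223u;
    return *state;
}

/* Runs the hot loop n times. Returns the elapsed CPU time in seconds, or
   -1.0 if allocation fails, and writes the sum of all results to *checksum. */
double run_benchmark(size_t n, uint64_t *checksum)
{
    uint32_t *in = malloc(4 * n * sizeof *in);
    struct quad_u32 *out = malloc(n * sizeof *out);
    if (!in || !out) {
        free(in);
        free(out);
        return -1.0;
    }

    /* Pre-generate the inputs outside the timed region. */
    uint32_t seed = 12345u;
    for (size_t i = 0; i < 4 * n; i++)
        in[i] = next_u32(&seed);

    /* Timed region: only the calls and the stores into the result array. */
    clock_t start = clock();
    for (size_t i = 0; i < n; i++)
        out[i] = turn_into_struct(in[4 * i], in[4 * i + 1],
                                  in[4 * i + 2], in[4 * i + 3]);
    clock_t end = clock();

    /* Add up the results so the different builds can be compared. */
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += (uint64_t)out[i].a + out[i].b + out[i].c + out[i].d;
    *checksum = sum;

    free(in);
    free(out);
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling `run_benchmark(n, &sum)` once per build and comparing both the
returned time and `sum` gives the check described above; the iteration count
`n` is left to the caller.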

On my machine (Ryzen 9 5900X) the -Os version takes ~0.36 s, while the -O2 and
-O3 versions take ~0.43 s and ~0.42 s.

I tried both -O2 and -O3 to get a slightly better view of the typical variance
between program runs. Their times are very similar, but the -Os version is a
decent amount faster (it takes around 16% less time than the -O2 version, which
I'd assume is significant).

I've attached the preprocessed benchmark file as well, which I compiled with
-mtune=generic and -march=x86-64 to match the system-under-test.
