https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81616

--- Comment #14 from Andrew Roberts <andrewm.roberts at sky dot com> ---
It would be nice if znver1 for -march and -mtune could be improved before the
gcc 8 release. At present -march=znver1 -mtune=znver1 looks to be about the
worst thing you could do, and not just on this vectorizable code. And given we
tell people to use -march=native, which gives this, it would be nice to improve.

With the attached example, switching to larger vectors still only gets down to
200000 clocks, whereas other combinations get down to 116045:

mult took 116045 clocks -march=corei7-avx -mtune=skylake

So there is more going on here than just the vector length.

If there is any testing I can do to isolate other options, I would be happy to
help; just point me in the right direction. If there are good (open) benchmarks
I can routinely run on a range of targets, I would be happy to do that. I have
Ryzen, Haswell, Skylake, ARM, AArch64, etc.
