https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81616
--- Comment #14 from Andrew Roberts <andrewm.roberts at sky dot com> ---
It would be nice if znver1 for -march and -mtune could be improved before the gcc 8 release. At present -march=znver1 -mtune=znver1 looks to be about the worst thing you could do, and not just on this vectorizable code. And given that we tell people to use -march=native, which gives this, it would be nice to improve.

With the attached example, switching to larger vectors still only gets to 200000 clocks, whereas other combinations get down to 116045:

mult took 116045 clocks -march=corei7-avx -mtune=skylake

So there is more going on here than just the vector length.

If there is any testing to isolate other options I would be happy to help, just point me in the right direction. If there are good (open) benchmarks that I can routinely run on a range of targets, I would be happy to do so. I have ryzen, haswell, skylake, arm, aarch64, etc.