https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713
--- Comment #20 from Chris Elrod <elrodc at gmail dot com> ---
To add a little more: I used inline asm in Julia for direct access to the
rsqrt instruction "vrsqrt14ps". Without adding a Newton step, the answers are
wrong beyond just a couple of significant digits. With the Newton step, the
answers are correct.

My point is that the LLVM-compiled versions (Clang/Flang/ispc) are definitely
adding the Newton step; they get the correct answer. That leaves the masked
"vrsqrt14ps" that gcc is using (g++ does this too) as my best guess for the
performance difference:

vcmpps $4, %zmm0, %zmm5, %k1
vrsqrt14ps %zmm0, %zmm1{%k1}{z}

Is there any way for me to test that idea? Edit the asm to remove the vcmpps
and the mask, compile the asm with gcc, and benchmark it?

Okay, I just tried playing around with flags and looking at the asm. I
compiled with:

g++ -O3 -ffinite-math-only -fexcess-precision=fast -fno-math-errno
-fno-signed-zeros -fno-trapping-math -fassociative-math -freciprocal-math
-fno-rounding-math -fno-signaling-nans -march=native -shared -fPIC
-mprefer-vector-width=512 -fno-semantic-interposition
-o libgppvectorization_test.so vectorization_test.cpp

which is basically every flag implied by "-ffast-math" except
"-funsafe-math-optimizations" itself. The set does include the flags implied
by "-funsafe-math-optimizations", just not that flag. The list can be
simplified (only "-fno-math-errno" is needed):

g++ -O3 -fno-math-errno -march=native -shared -fPIC -mprefer-vector-width=512
-fno-semantic-interposition -o libgppvectorization_test.so
vectorization_test.cpp

or

gfortran -O3 -fno-math-errno -march=native -shared -fPIC
-mprefer-vector-width=512 -fno-semantic-interposition
-o libgfortvectorization_test.so vectorization_test.f90

This results in the following:

vsqrtps (%r8,%rax), %zmm0
vdivps %zmm0, %zmm7, %zmm0

i.e., vsqrtps and a division, rather than the masked reciprocal square root.
With N = 2827, that speeds gfortran and g++ up from about 4.3 microseconds to
about 3.5 microseconds.

For comparison, Clang takes about 2 microseconds, and Flang, ispc, and
awful-looking unsafe Rust take 2.3-2.4 microseconds, using vrsqrt14ps (without
a mask) plus a Newton step instead of vsqrtps followed by a division.

So "-funsafe-math-optimizations" results in a regression here.
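
For anyone who wants to reproduce the comparison without reading the generated
asm, here is a minimal intrinsics sketch of the two sequences being contrasted.
These are illustrative helpers only, not the code from the attached
vectorization_test.cpp, and they need AVX-512F (e.g. -march=skylake-avx512):

#include <immintrin.h>

/* Roughly what the LLVM-based compilers emit under fast math:
   unmasked vrsqrt14ps plus one Newton-Raphson refinement step. */
static inline __m512 approx_rsqrt(__m512 x) {
    const __m512 half         = _mm512_set1_ps(0.5f);
    const __m512 three_halves = _mm512_set1_ps(1.5f);
    __m512 y   = _mm512_rsqrt14_ps(x);                  /* ~14-bit 1/sqrt(x) estimate */
    __m512 xyy = _mm512_mul_ps(_mm512_mul_ps(x, y), y); /* x * y * y */
    /* Newton step: y * (1.5 - 0.5 * x * y * y) recovers full single precision */
    return _mm512_mul_ps(y, _mm512_fnmadd_ps(half, xyy, three_halves));
}

/* Roughly what gcc/gfortran emit with plain -O3 -fno-math-errno:
   vsqrtps followed by vdivps. */
static inline __m512 exact_rsqrt(__m512 x) {
    return _mm512_div_ps(_mm512_set1_ps(1.0f), _mm512_sqrt_ps(x));
}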