https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713
--- Comment #19 from Chris Elrod <elrodc at gmail dot com> ---
To add a little more:
I used inline asm for direct access to the rsqrt instruction "vrsqrt14ps" in Julia. Without adding a Newton step, the answers are wrong beyond just a couple of significant digits. With the Newton step, the answers are correct.

My point is that LLVM-compiled code (Clang/Flang/ispc) is definitely adding the Newton step, since it gets the correct answer.

That leaves my best guess for the performance difference as owing to the masked "vrsqrt14ps" that gcc is using:

    vcmpps $4, %zmm0, %zmm5, %k1
    vrsqrt14ps %zmm0, %zmm1{%k1}{z}

Is there any way for me to test that idea? Could I edit the asm to remove the vcmpps and the mask, compile the asm with gcc, and benchmark it?