https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713
--- Comment #28 from Chris Elrod <elrodc at gmail dot com> ---
Created attachment 45501
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45501&action=edit
Minimum working example of the rsqrt problem.

I attached a minimum working example, rsqrt.c, demonstrating the problem of
excessive code generation for reciprocal square root.

You can compile it with:

    gcc -Ofast -S -march=skylake-avx512 -mprefer-vector-width=512 -shared -fPIC rsqrt.c -o rsqrt.s
    clang -Ofast -S -march=skylake-avx512 -mprefer-vector-width=512 -shared -fPIC rsqrt.c -o rsqrt.s

Or compare the asm of both on Godbolt:
https://godbolt.org/z/c7Z0En

For gcc:

    vmovups (%rsi), %zmm0
    vxorps %xmm1, %xmm1, %xmm1
    vcmpps $4, %zmm0, %zmm1, %k1
    vrsqrt14ps %zmm0, %zmm1{%k1}{z}
    vmulps %zmm0, %zmm1, %zmm2
    vmulps %zmm1, %zmm2, %zmm0
    vmulps .LC1(%rip), %zmm2, %zmm2
    vaddps .LC0(%rip), %zmm0, %zmm0
    vmulps %zmm2, %zmm0, %zmm0
    vrcp14ps %zmm0, %zmm1
    vmulps %zmm0, %zmm1, %zmm0
    vmulps %zmm0, %zmm1, %zmm0
    vaddps %zmm1, %zmm1, %zmm1
    vsubps %zmm0, %zmm1, %zmm0
    vmovups %zmm0, (%rdi)

For Clang:

    vmovups (%rsi), %zmm0
    vrsqrt14ps %zmm0, %zmm1
    vmulps %zmm1, %zmm0, %zmm0
    vfmadd213ps .LCPI0_0(%rip){1to16}, %zmm1, %zmm0 # zmm0 = (zmm1 * zmm0) + mem
    vmulps .LCPI0_1(%rip){1to16}, %zmm1, %zmm1
    vmulps %zmm0, %zmm1, %zmm0
    vmovups %zmm0, (%rdi)

Clang looks like it is doing

    /* rsqrt(a) = -0.5 * rsqrtss(a) * (a * rsqrtss(a) * rsqrtss(a) - 3.0) */

where .LCPI0_0(%rip) = -3.0 and .LCPI0_1(%rip) = -0.5.

gcc is doing much more (15 instructions versus 7 for Clang), and its sequence
is structured quite differently.
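
For reference, the formula in the comment above is one Newton-Raphson step
applied to a rough reciprocal square root estimate. Below is a minimal scalar
sketch in C of that step. The names rsqrt_refine and rsqrt_refine_all are
hypothetical and are not taken from rsqrt.c or from either compiler; the
caller supplies the estimate that vrsqrt14ps would produce in the vectorized
code.

```c
#include <stddef.h>

/* One Newton-Raphson step for r ~= 1/sqrt(a), written the same way as the
   formula quoted above:  -0.5 * r * (a * r * r - 3.0).
   Hypothetical helper for illustration only. */
float rsqrt_refine(float a, float r)
{
    return -0.5f * r * (a * r * r - 3.0f);
}

/* Refine n estimates in place: est[i] holds a rough 1/sqrt(a[i]) on entry
   and the refined value on return. A compiler is free to vectorize this
   loop; the sketch makes no claim about the code it will emit. */
void rsqrt_refine_all(float *est, const float *a, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        est[i] = rsqrt_refine(a[i], est[i]);
}
```

With an estimate whose relative error is e, the refined value has a relative
error of about 1.5 * e^2, so a single step is enough to take an estimate good
to roughly 14 bits close to full single precision.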