[Bug tree-optimization/88713] Vectorized code slow vs. flang

elrodc at gmail dot com Mon, 21 Jan 2019 21:05:20 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713


--- Comment #19 from Chris Elrod <elrodc at gmail dot com> ---
To add a little more:
I used inline asm for direct access to the rsqrt instruction "vrsqrt14ps" in
Julia. Without a Newton step, the answers are correct to only a couple of
significant digits.
With the Newton step, the answers are correct.
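
For reference, here is a rough scalar sketch of what one Newton step does to a
low-precision rsqrt estimate. It is not the Julia code I ran: the helper names
are made up, and the estimate is faked by truncating the float32 mantissa of
1/sqrt(x) to 14 bits, which only roughly mimics the 2^-14 relative error bound
of "vrsqrt14ps". Python floats are doubles, so the refined error shown here is
smaller than what single-precision hardware would deliver.

```python
import math
import struct


def truncate_mantissa(x: float, bits: int) -> float:
    """Round x to float32, then keep only the top `bits` mantissa bits."""
    (raw,) = struct.unpack("<I", struct.pack("<f", x))
    drop = 23 - bits  # float32 stores 23 explicit mantissa bits
    raw &= ~((1 << drop) - 1) & 0xFFFFFFFF
    (y,) = struct.unpack("<f", struct.pack("<I", raw))
    return y


def approx_rsqrt(x: float, bits: int = 14) -> float:
    """Hypothetical stand-in for a hardware estimate with ~2^-bits error."""
    return truncate_mantissa(1.0 / math.sqrt(x), bits)


def newton_step(x: float, y: float) -> float:
    """One Newton-Raphson iteration for f(y) = 1/y^2 - x."""
    return y * (1.5 - 0.5 * x * y * y)


def rel_err(x: float, y: float) -> float:
    exact = 1.0 / math.sqrt(x)
    return abs(y - exact) / exact


if __name__ == "__main__":
    for x in (0.3, 2.0, 3.0, 10.0):
        y0 = approx_rsqrt(x)
        y1 = newton_step(x, y0)
        print(f"x={x:5.1f}  estimate err={rel_err(x, y0):.2e}  "
              f"refined err={rel_err(x, y1):.2e}")
```

The step roughly squares the relative error (times 1.5), which is why a single
iteration is enough to go from about 14 good bits to nearly full single
precision.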

My point is that the LLVM-based compilers (Clang/Flang/ispc) are definitely
adding the Newton step, since their code gets the correct answer.

That leaves my best guess for the performance difference as owing to the masked
"vrsqrt14ps" that gcc is using:

        vcmpps  $4, %zmm0, %zmm5, %k1
        vrsqrt14ps      %zmm0, %zmm1{%k1}{z}
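
To spell out what I understand those two instructions to compute per lane,
here is a small scalar model. The function names are made up, and the contents
of %zmm5 are an assumption: I am guessing it holds zeros, so that the mask is
"x != 0". Compare predicate 4 is NEQ_UQ (not equal, or unordered), and {z}
means masked-off lanes are zeroed rather than left unchanged. The model uses an
exact reciprocal square root, not the 14-bit estimate.

```python
import math


def rsqrt(x: float) -> float:
    """Exact 1/sqrt(x), with inf/NaN results for zero/negative inputs."""
    if math.isnan(x) or x < 0.0:
        return math.nan
    if x == 0.0:
        return math.copysign(math.inf, x)
    return 1.0 / math.sqrt(x)


def vcmpps_neq_uq(a, b):
    """Predicate 4 (NEQ_UQ): lane is true if a != b or either is NaN."""
    return [not (x == y) for x, y in zip(a, b)]


def masked_rsqrt_zeroing(src, mask):
    """rsqrt on lanes where the mask is set; 0.0 elsewhere ({z})."""
    return [rsqrt(x) if m else 0.0 for x, m in zip(src, mask)]


if __name__ == "__main__":
    zmm0 = [4.0, 0.0, 0.25, 1.0]   # inputs
    zmm5 = [0.0, 0.0, 0.0, 0.0]    # ASSUMPTION: register of zeros
    k1 = vcmpps_neq_uq(zmm5, zmm0)
    zmm1 = masked_rsqrt_zeroing(zmm0, k1)
    print(k1)
    print(zmm1)
```

If that reading is right, the mask would keep a zero input from turning into
inf, which would otherwise give 0 * inf = NaN when the estimate is multiplied
by x. That motivation is my guess, not something confirmed here.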

Is there any way for me to test that idea?
Could I edit the asm to remove the vcmpps and the mask, assemble it with gcc,
and benchmark the result?
