https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82459

Peter Cordes <peter at cordes dot ca> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           See Also|                            |https://gcc.gnu.org/bugzill
                   |                            |a/show_bug.cgi?id=89346
            Summary|AVX512F instruction costs:  |AVX512BW instruction costs:
                   |vmovdqu8 stores may be an   |vpmovwb is 2 uops on
                   |extra uop, and vpmovwb is 2 |Skylake and not always
                   |uops on Skylake and not     |worth using vs. vpack +
                   |always worth using          |vpermq lane-crossing fixup

--- Comment #5 from Peter Cordes <peter at cordes dot ca> ---
Turns out vmovdqu8 with no masking doesn't cost an extra uop.  IACA was wrong,
and Agner Fog's results were *only* for the masked case.  The only remaining
downside is the code-size cost of using EVEX load/store instructions instead of
AVX2 VEX; that's bug 89346.
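
(Rough size sketch, my own example rather than anything from this report: a
plain unmasked 256-bit copy.  The AVX2 VEX forms vmovdqu load/store can use a
2-byte VEX prefix, about 4 bytes per instruction with a simple addressing mode,
while any EVEX form like vmovdqu8 / vmovdqu64 always needs a 4-byte EVEX
prefix, about 6 bytes.  The uop count is the same either way; only the
encoding grows.)

#include <immintrin.h>

/* Unmasked 256-bit copy: same single load uop + single (fused) store uop
 * whether the compiler picks the VEX vmovdqu or an EVEX vmovdqu8/64 encoding;
 * the EVEX choice only costs ~2 extra bytes per instruction. */
void copy256(unsigned char *dst, const unsigned char *src) {
    __m256i v = _mm256_loadu_si256((const __m256i *)src);
    _mm256_storeu_si256((__m256i *)dst, v);
}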


https://www.uops.info/table.html confirms that SKX non-masked vmovdqu8 load and
store are both single uop.  (Or the usual micro-fused store-address +
store-data).
 https://www.uops.info/html-tp/SKX/VMOVDQU8_ZMM_M512-Measurements.html
 https://www.uops.info/html-tp/SKX/VMOVDQU8_M512_ZMM-Measurements.html

And between registers it can be eliminated if there's no masking.

But *with* masking, a vmovdqu8 load is a micro-fused load+ALU uop, and a masked
store is just a normal store uop for xmm and ymm.  A zmm masked byte-store,
though, is 5 uops (micro-fused to 4 front-end uops)!  (Unlike vmovdqu16 or
vmovdqu32 masked stores, which are efficient even for zmm.)

https://www.uops.info/html-tp/SKX/VMOVDQU8_M512_K_ZMM-Measurements.html
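
To make that concrete, here's a sketch of the intrinsic forms that map to
those stores (my example, assuming -mavx512bw, not code from this report):
the byte-masked zmm store is the 5-uop case on SKX, while the word-masked
store stays an ordinary micro-fused store.

#include <immintrin.h>

void masked_byte_store(char *dst, __mmask64 k, __m512i v) {
    _mm512_mask_storeu_epi8(dst, k, v);    /* vmovdqu8 [rdi]{k1}, zmm: 5 uops on SKX */
}

void masked_word_store(short *dst, __mmask32 k, __m512i v) {
    _mm512_mask_storeu_epi16(dst, k, v);   /* vmovdqu16 [rdi]{k1}, zmm: normal store */
}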

uops.info's table also shows us that IACA3.0 is wrong about vmovdqu8 as an
*unmasked* ZMM store: IACA thinks that's also 5 uops.

Retitling this bug report since that part was based on Intel's bogus data, not
real testing.

vpmovwb is still 2 uops, and current trunk gcc still uses 2x vpmovwb +
vinserti64x4 for ZMM auto-vectorization.  -mprefer-vector-width=512 is not the
default, but people may enable it in code that heavily uses 512-bit vectors.
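
For reference, a sketch of that ZMM strategy as intrinsics (my reconstruction
of the pattern, not GCC's actual output): each vpmovwb zmm->ymm is 2 uops on
SKX, plus the vinserti64x4 to recombine the halves.

#include <immintrin.h>

/* Truncate 64 words (two zmm inputs) to 64 bytes (one zmm result). */
__m512i trunc_w_to_b_512(__m512i lo, __m512i hi) {
    __m256i lo8 = _mm512_cvtepi16_epi8(lo);   /* vpmovwb ymm, zmm: 2 uops */
    __m256i hi8 = _mm512_cvtepi16_epi8(hi);   /* vpmovwb ymm, zmm: 2 uops */
    return _mm512_inserti64x4(_mm512_castsi256_si512(lo8), hi8, 1); /* vinserti64x4 */
}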

YMM auto-vec is unchanged since previous comments: we do get vpackuswb +
vpermq, but an indexed addressing mode defeats micro-fusion.  And we still have
a redundant VPAND after shifting.
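
A sketch of that YMM strategy (my reconstruction, assuming the high-byte
extraction loop discussed in earlier comments): after the logical right shift
by 8, every word is already <= 0xFF, so a vpand with 0x00FF before the pack is
the redundant instruction.

#include <immintrin.h>

/* Extract the high byte of 32 words (two ymm inputs) into one ymm of bytes. */
__m256i pack_high8_256(__m256i a, __m256i b) {
    a = _mm256_srli_epi16(a, 8);                 /* vpsrlw: high byte of each word */
    b = _mm256_srli_epi16(b, 8);                 /* (no vpand needed after this) */
    __m256i packed = _mm256_packus_epi16(a, b);  /* vpackuswb: in-lane pack */
    /* vpermq lane-crossing fixup for vpackuswb's in-lane result order */
    return _mm256_permute4x64_epi64(packed, _MM_SHUFFLE(3, 1, 2, 0));
}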

---

For icelake-client/server (AVX512VBMI), GCC uses vpermt2b, but it doesn't fold
the shifts into the 2-source byte shuffle.  (vpermt2b has 5c latency and 2c
throughput on ICL, so its uop count is probably the same as what uops.info
measured for CannonLake: 1*p05 + 2*p5.  Possibly 2x 1-uop vpermb, with
merge-masking to blend the 2nd result into the first, would work better.)
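
A sketch of what that vpermb alternative might look like (my interpretation of
the suggestion, assuming AVX512VBMI; the function and the index vectors are
hypothetical): two 1-uop vpermb shuffles, with merge-masking blending the
second result into the upper half of the first.

#include <immintrin.h>

/* idx_lo/idx_hi are assumed to place the wanted byte of each source word
 * into the low/high 32 bytes of the result, respectively. */
__m512i merge_two_vpermb(__m512i src_lo, __m512i src_hi,
                         __m512i idx_lo, __m512i idx_hi) {
    __m512i lo = _mm512_permutexvar_epi8(idx_lo, src_lo);   /* vpermb: 1 uop */
    /* vpermb with merge-masking: only write the upper 32 bytes */
    return _mm512_mask_permutexvar_epi8(lo, 0xFFFFFFFF00000000ULL,
                                        idx_hi, src_hi);
}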

IceLake vpmovwb ymm,zmm is still 2-cycle throughput, 4-cycle latency, so
probably still 2 uops.
