https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82459
Peter Cordes <peter at cordes dot ca> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           See Also|                            |https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89346
            Summary|AVX512F instruction costs:  |AVX512BW instruction costs:
                   |vmovdqu8 stores may be an   |vpmovwb is 2 uops on
                   |extra uop, and vpmovwb is 2 |Skylake and not always
                   |uops on Skylake and not     |worth using vs. vpack +
                   |always worth using          |vpermq lane-crossing fixup

--- Comment #5 from Peter Cordes <peter at cordes dot ca> ---
Turns out vmovdqu8 with no masking doesn't cost an extra uop.  IACA was
wrong, and Agner Fog's results were *only* for the masked case.  The only
downside of unmasked vmovdqu8 is the code-size cost of using EVEX load/store
instructions instead of AVX2 VEX.  That's bug 89346.

https://www.uops.info/table.html confirms that SKX non-masked vmovdqu8 load
and store are both a single uop (or the usual micro-fused store-address +
store-data pair for the store):

https://www.uops.info/html-tp/SKX/VMOVDQU8_ZMM_M512-Measurements.html
https://www.uops.info/html-tp/SKX/VMOVDQU8_M512_ZMM-Measurements.html

Between registers it can even be eliminated if there's no masking.

*With* masking, a vmovdqu8 load is a micro-fused load+ALU uop, and a masked
store is still just a normal store uop for xmm and ymm.  A masked zmm store,
however, is 5 uops (micro-fused to 4 front-end uops)!  That's unlike
vmovdqu16 and vmovdqu32 masked stores, which are efficient even for zmm.

https://www.uops.info/html-tp/SKX/VMOVDQU8_M512_K_ZMM-Measurements.html

uops.info's table also shows us that IACA 3.0 is wrong about vmovdqu8 as an
*unmasked* ZMM store: IACA thinks that's also 5 uops.  Retitling this bug
report, since that part was based on Intel's bogus data, not real testing.

vpmovwb is still 2 uops, and current trunk gcc still uses 2x vpmovwb +
vinserti64x4 for ZMM auto-vectorization (see the intrinsics sketch below).
-mprefer-vector-width=512 is not the default, but people may enable it in
code that heavily uses 512-bit vectors.

YMM auto-vectorization is unchanged since previous comments: we do get
vpackuswb + vpermq, but an indexed addressing mode defeats micro-fusion,
and we have a redundant VPAND after shifting.

---

For icelake-client/server (AVX512VBMI), GCC is using vpermt2b, but it
doesn't fold the shifts into the 2-source byte shuffle.  (vpermt2b has 5c
latency and 2c throughput on ICL, so its uop count is probably the same as
uops.info measured for Cannon Lake: 1*p05 + 2*p5.  Possibly 2x 1-uop
vpermb, with merge-masking to blend the 2nd into the first, would work
better; see the sketch at the end.)

IceLake vpmovwb ymm,zmm is still 2-cycle throughput, 4-cycle latency, so
probably still 2 uops.
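---

For reference, the masked-store finding above in source-level terms, using
the standard AVX512BW store intrinsics (a minimal sketch; the function and
its parameters are made up for illustration):

#include <immintrin.h>

/* On SKX, a byte-masked ZMM store is 5 uops (4 front-end uops),
   while a word-masked ZMM store stays a normal cheap store. */
void masked_stores(char *dst, __m512i v, __mmask64 kb, __mmask32 kw) {
    _mm512_mask_storeu_epi8(dst, kb, v);    /* vmovdqu8 [dst]{kb}, zmm: 5 uops */
    _mm512_mask_storeu_epi16(dst, kw, v);   /* vmovdqu16 [dst]{kw}, zmm: efficient */
}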
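To make the tradeoff in the new title concrete, here's a minimal intrinsics
sketch of the two ZMM u16->u8 truncation strategies (function names are
mine, for illustration only).  It assumes the inputs are already masked to
0..255 (e.g. by the VPAND gcc emits), so the saturating pack gives the same
result as plain truncation:

/* What trunk gcc emits: 2x vpmovwb (2 uops each on SKX, both port 5)
   + vinserti64x4 to recombine the two YMM halves. */
static inline __m512i trunc16to8_pmovwb(__m512i a, __m512i b) {
    __m256i lo = _mm512_cvtepi16_epi8(a);               /* vpmovwb */
    __m256i hi = _mm512_cvtepi16_epi8(b);               /* vpmovwb */
    return _mm512_inserti64x4(_mm512_castsi256_si512(lo), hi, 1);
}

/* The alternative from the new title: vpackuswb (1 uop, in-lane)
   + vpermq with a vector control as the lane-crossing fixup. */
static inline __m512i trunc16to8_pack_permq(__m512i a, __m512i b) {
    __m512i packed = _mm512_packus_epi16(a, b);         /* vpackuswb */
    /* The pack interleaves 64-bit chunks of a and b within each
       128-bit lane; pull them back into linear order. */
    const __m512i fixup = _mm512_setr_epi64(0, 2, 4, 6, 1, 3, 5, 7);
    return _mm512_permutexvar_epi64(fixup, packed);     /* vpermq */
}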
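And a sketch of the 2x-vpermb idea for AVX512VBMI targets, as a possible
alternative to 3-uop vpermt2b: a one-source vpermb, then a second vpermb
merge-masked into the high half of the first result.  This is my reading of
the suggestion, not tested codegen; vpermb uses only the low 6 bits of each
control byte, so a single 0,2,4,...,126 control vector serves both sources.

/* Hypothetical 2x-vpermb u16->u8 truncate for icelake:
   each vpermb is 1 uop, and the merge-masking is free. */
static inline __m512i trunc16to8_2x_vpermb(__m512i a, __m512i b) {
    /* Byte i selects source byte (2*i) & 63: indices 64..126 wrap,
       so both halves of the result pick the even (low) bytes. */
    static const unsigned char ctrl[64] = {
          0,  2,  4,  6,  8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30,
         32, 34, 36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60, 62,
         64, 66, 68, 70, 72, 74, 76, 78, 80, 82, 84, 86, 88, 90, 92, 94,
         96, 98,100,102,104,106,108,110,112,114,116,118,120,122,124,126,
    };
    __m512i idx = _mm512_loadu_si512(ctrl);
    __m512i lo  = _mm512_permutexvar_epi8(idx, a);      /* vpermb */
    /* Merge-masked vpermb overwrites only the high 32 bytes. */
    return _mm512_mask_permutexvar_epi8(lo, 0xFFFFFFFF00000000ULL, idx, b);
}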