https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89346
Peter Cordes <peter at cordes dot ca> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |peter at cordes dot ca
--- Comment #1 from Peter Cordes <peter at cordes dot ca> ---
Still present in pre10.0.0 trunk 20191022. We pessimize vmovdqu/a in AVX2
intrinsics and autovectorization with -march=skylake-avx512 (and -march=native on
such machines).
It seems only VMOVDQU/A load/store/register-copy instructions are affected; we
get AVX2 VEX vpxor instead of AVX512VL EVEX vpxord for xor-zeroing, and
non-zeroing XOR. (And most other instructions have the same mnemonic for VEX
and EVEX, like vpaddd. This includes FP moves like VMOVUPS/PD)
(https://godbolt.org/z/TEvWiU for example)
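
A minimal reproducer sketch for the above, not necessarily what the Godbolt link contains. The function name `xor_block32` is made up for illustration, and the `target("avx2")` attribute is only there so the file builds without extra flags. Compiling with something like `gcc -O3 -march=skylake-avx512 -S` and reading the asm should show whether the 256-bit loads and store come out in an EVEX form (e.g. vmovdqu64) while the XOR stays a VEX vpxor.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical test function: 32-byte unaligned load, load, XOR, store.
 * Only the load/store instructions are the ones reported as pessimized;
 * the XOR is expected to remain a VEX-encoded vpxor. */
__attribute__((target("avx2")))
void xor_block32(uint8_t *dst, const uint8_t *a, const uint8_t *b)
{
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    _mm256_storeu_si256((__m256i *)dst, _mm256_xor_si256(va, vb));
}
```

Nothing here touches ymm16..31, so every instruction in the function has a usable VEX encoding.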
The good options are:
* Use VEX whenever possible instead of AVX512VL to save code-size. (2 or 3
byte VEX prefix instead of 4-byte EVEX prefix.)
* Avoid the need for vzeroupper by using only x/y/zmm16..31. (Still has a
max-turbo penalty so -mprefer-vector-width=256 is still appropriate for code
that doesn't spend a lot of time in vectorized loops.)
The second option might be appropriate for very simple functions / blocks that
only have a few SIMD instructions before the next vzeroupper would be needed
(e.g. copying or zeroing some memory); it could be competitive on code-size as
well as saving the 4-uop vzeroupper instruction.
VEX instructions can't access x/y/zmm16..31, so this forces an EVEX encoding
for everything involving the vector (and rules out using AVX2 and earlier
instructions, which may be a problem for KNL without AVX512VL unless we narrow
to 128-bit in an XMM reg).
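
To put rough numbers on the code-size point, here are encodings for the same 256-bit unaligned load in its VEX and EVEX forms. The byte sequences were worked out by hand from the VEX/EVEX prefix layouts, not taken from compiler output in this report, so check them against an assembler before relying on them. The array and function names are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* vmovdqu ymm0, [rdi]: 2-byte VEX prefix (C5 FE), opcode 6F, ModRM 07 */
static const uint8_t vex2_load[] = { 0xC5, 0xFE, 0x6F, 0x07 };

/* vmovdqu ymm0, [r8]: r8 as base register needs the 3-byte VEX prefix (C4 C1 7E) */
static const uint8_t vex3_load[] = { 0xC4, 0xC1, 0x7E, 0x6F, 0x00 };

/* vmovdqu64 ymm0, [rdi]: 4-byte EVEX prefix (62 F1 FE 28), opcode 6F, ModRM 07.
 * The EVEX form is also 6 bytes with r8 as the base register. */
static const uint8_t evex_load[] = { 0x62, 0xF1, 0xFE, 0x28, 0x6F, 0x07 };

/* Bytes saved per instruction by choosing VEX over EVEX. */
size_t saved_vs_vex2(void) { return sizeof evex_load - sizeof vex2_load; }
size_t saved_vs_vex3(void) { return sizeof evex_load - sizeof vex3_load; }
```

So in this sketch each move costs 1 or 2 extra bytes in EVEX form, depending on whether the VEX version would have needed the 3-byte prefix.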
----
(citation for not needing vzeroupper if y/zmm0..15 aren't written explicitly:
https://stackoverflow.com/questions/58568514/does-skylake-need-vzeroupper-for-turbo-clocks-to-recover-after-a-512-bit-instruc
- it's even safe to do

    vpxor     xmm0,xmm0,xmm0
    vpcmpeqb  k0, zmm0, [rdi]

without vzeroupper, although that will reduce max turbo *temporarily* because
it's a 512-bit uop.
Or more frequently useful: to zero some memory with vpxor xmm zeroing and YMM
stores.)
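
The memory-zeroing case can be sketched with intrinsics as below. `zero_bytes32` is a hypothetical name, and the sketch assumes the length is a multiple of 32. A compiler would be expected to materialize the zero with a vpxor on an xmm register and then use YMM stores, but what it actually emits (including whether it adds a vzeroupper, or turns the loop into a memset call) is up to the compiler and flags.

```c
#include <assert.h>
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Zero `len` bytes at `dst` with 256-bit unaligned stores.
 * Assumption for this sketch: len is a multiple of 32. */
__attribute__((target("avx2")))
void zero_bytes32(uint8_t *dst, size_t len)
{
    assert(len % 32 == 0);
    /* Zeroed vector; only the low 128 bits need an explicit xor-zeroing,
     * since writing an xmm register with a VEX instruction zeroes the rest. */
    const __m256i zero = _mm256_setzero_si256();
    for (size_t i = 0; i < len; i += 32)
        _mm256_storeu_si256((__m256i *)(dst + i), zero);
}
```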