[Bug target/91103] AVX512 vector element extract uses more than 1 shuffle instruction; VALIGND can grab any element

crazylht at gmail dot com via Gcc-bugs Sun, 12 Sep 2021 18:16:52 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91103


--- Comment #10 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Peter Cordes from comment #9)
> Thanks for implementing my idea :)
> 
> (In reply to Hongtao.liu from comment #6)
> > For elements located above 128bits, it seems always better(?) to use
> > valign{d,q}
> 
> TL;DR:
>  I think we should still use vextracti* / vextractf* when that can get the
> job done in a single instruction, especially when the VEX-encoded
> vextracti/f128 can save a byte of code size for v[4].
> 
> Extracts are simpler shuffles that might have better throughput on some
> future CPUs, especially the upcoming Zen4, so even without code-size savings
> we should use them when possible.  Tiger Lake has a 256-bit shuffle unit on
> port 1 that supports some common shuffles (like vpshufb); a future Intel
> might add 256->128-bit extracts to that.
> 
> It might also save a tiny bit of power, allowing on-average higher turbo
> clocks.
> 
> ---
> 
> On current CPUs with AVX-512, valignd is about equal to a single vextract,
Yes, they're equal, but considering the comments below, I think it's reasonable
to use vextract instead of valign when byte_offset % 16 == 0.

> and better than multiple instructions.  It doesn't really have downsides on
> current Intel, since I think Intel has continued to not have int/FP bypass
> delays for shuffles.
> 
> We don't know yet what AMD's Zen4 implementation of AVX-512 will look like. 
> If it's like Zen1 was for AVX2 (i.e. if it decodes 512-bit instructions other
> than insert/extract into at least 2x 256-bit uops) a lane-crossing shuffle
> like valignd probably costs more than 2 uops.  (vpermq is more than 2 uops
> on Piledriver/Zen1).  But a 128-bit extract will probably cost just one uop.
> (And especially an extract of the high 256 might be very cheap and low
> latency, like vextracti128 on Zen1, so we might prefer vextracti64x4 for
> v[8].)
> 
> So this change is good, but using a vextracti64x2 or vextracti64x4 could be
> a useful peephole optimization when byte_offset % 16 == 0.  Or of course
> vextracti128 when possible (x/ymm0..15, not 16..31 which are only accessible
> with an EVEX-encoded instruction).
> 
> vextractf-whatever allows an FP shuffle on FP data in case some future CPU
> cares about that for shuffles.
> 
> An extract is a simpler shuffle that might have better throughput on some
> future CPU even with full-width execution units.  Some future Intel CPU
> might add support for vextract uops to the extra shuffle unit on port 1. 
> (Which is available when no 512-bit uops are in flight.)  Currently (Ice
> Lake / Tiger Lake) it can only run some common shuffles like vpshufb ymm,
> but not including any vextract or valign.  Of course port 1 vector ALUs are
> shut down when 512-bit uops are in flight, but could be relevant for __m256
> vectors on these hypothetical future CPUs.
> 
> When we can get the job done with a single vextract-something, we should use
> that instead of valignd.  Otherwise use valignd.
> 
> We already check the index for low-128 special cases to use vunpckhqdq vs.
> vpshufd (or vpsrldq) or similar FP shuffles.
> 
> -----
> 
> On current Intel, with clean YMM/ZMM uppers (known by the CPU hardware to be
> zero), an extract that only writes a 128-bit register will keep them clean
> (even if it reads a ZMM), not needing a VZEROUPPER.  VZEROUPPER is only
> needed for dirty y/zmm0..15, not for dirty zmm16..31, so a function
> like
> 
> float foo(float *p) {
>   some vector stuff that can use high zmm regs;
>   return scalar that happens to be from the middle of a vector;
> }
> 
> could vextract into XMM0, but would need vzeroupper if it used valignd into
> ZMM0.
> 
> (Also related
> https://stackoverflow.com/questions/58568514/does-skylake-need-vzeroupper-for-turbo-clocks-to-recover-after-a-512-bit-instruc
> re reading a ZMM at all and turbo clock).
> 
> ---
> 
> Having known zeros outside the low 128 bits (from writing an xmm instead of
> rotating a zmm) is unlikely to matter, although for FP stuff copying fewer
> elements that might be subnormal could happen to be an advantage, maybe
> saving an FP assist for denormal.  We're unlikely to be able to take
> advantage of it to save instructions/uops (like OR instead of blend).  But
> it's not worse to use a single extract instruction instead of a single
> valignd.
