https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91103
--- Comment #10 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Peter Cordes from comment #9)
> Thanks for implementing my idea :)
>
> (In reply to Hongtao.liu from comment #6)
> > For elements located above 128bits, it seems always better(?) to use
> > valign{d,q}
>
> TL:DR:
> I think we should still use vextracti* / vextractf* when that can get the
> job done in a single instruction, especially when the VEX-encoded
> vextracti/f128 can save a byte of code size for v[4].
>
> Extracts are simpler shuffles that might have better throughput on some
> future CPUs, especially the upcoming Zen4, so even without code-size savings
> we should use them when possible. Tiger Lake has a 256-bit shuffle unit on
> port 1 that supports some common shuffles (like vpshufb); a future Intel
> might add 256->128-bit extracts to that.
>
> It might also save a tiny bit of power, allowing on-average higher turbo
> clocks.
>
> ---
>
> On current CPUs with AVX-512, valignd is about equal to a single vextract,

Yes, they're equal, but considering the comments below, I think it's
reasonable to use vextract instead of valign when byte_offset % 16 == 0.

> and better than multiple instruction. It doesn't really have downsides on
> current Intel, since I think Intel has continued to not have int/FP bypass
> delays for shuffles.
>
> We don't know yet what AMD's Zen4 implementation of AVX-512 will look like.
> If it's like Zen1 was AVX2 (i.e. if it decodes 512-bit instructions other
> than insert/extract into at least 2x 256-bit uops) a lane-crossing shuffle
> like valignd probably costs more than 2 uops. (vpermq is more than 2 uops
> on Piledriver/Zen1). But a 128-bit extract will probably cost just one uop.
> (And especially an extract of the high 256 might be very cheap and low
> latency, like vextracti128 on Zen1, so we might prefer vextracti64x4 for
> v[8].)
>
> So this change is good, but using a vextracti64x2 or vextracti64x4 could be
> a useful peephole optimization when byte_offset % 16 == 0. Or of course
> vextracti128 when possible (x/ymm0..15, not 16..31 which are only accessible
> with an EVEX-encoded instruction).
>
> vextractf-whatever allows an FP shuffle on FP data in case some future CPU
> cares about that for shuffles.
>
> An extract is a simpler shuffle that might have better throughput on some
> future CPU even with full-width execution units. Some future Intel CPU
> might add support for vextract uops to the extra shuffle unit on port 1.
> (Which is available when no 512-bit uops are in flight.) Currently (Ice
> Lake / Tiger Lake) it can only run some common shuffles like vpshufb ymm,
> but not including any vextract or valign. Of course port 1 vector ALUs are
> shut down when 512-bit uops are in flight, but could be relevant for __m256
> vectors on these hypothetical future CPUs.
>
> When we can get the job done with a single vextract-something, we should use
> that instead of valignd. Otherwise use valignd.
>
> We already check the index for low-128 special cases to use vunpckhqdq vs.
> vpshufd (or vpsrldq) or similar FP shuffles.
>
> -----
>
> On current Intel, with clean YMM/ZMM uppers (known by the CPU hardware to be
> zero), an extract that only writes a 128-bit register will keep them clean
> (even if it reads a ZMM), not needing a VZEROUPPER. Since VZEROUPPER is
> only needed for dirty y/zmm0..15, not with dirty zmm16..31, so a function
> like
>
> float foo(float *p) {
>   some vector stuff that can use high zmm regs;
>   return scalar that happens to be from the middle of a vector;
> }
>
> could vextract into XMM0, but would need vzeroupper if it used valignd into
> ZMM0.
>
> (Also related
> https://stackoverflow.com/questions/58568514/does-skylake-need-vzeroupper-for-turbo-clocks-to-recover-after-a-512-bit-instruc
> re reading a ZMM at all and turbo clock).
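To make the rule quoted above concrete, here is a rough sketch of the selection logic in plain C. It is only an illustration, not the actual GCC implementation: the enum, the function name choose_extract and the prefer_high256 tuning flag are all hypothetical, and it assumes dword or qword elements in a 64-byte vector (so byte_offset is a multiple of 4).

```c
#include <assert.h>

/* Hypothetical names for illustration only; these are not GCC identifiers. */
enum extract_insn {
    LOW128_CASE,    /* offset < 16: the existing low-128 special cases apply */
    VEXTRACTI128,   /* VEX-encoded, usable only with x/ymm0..15 */
    VEXTRACTI64X2,  /* EVEX-encoded, 128-bit lane selected by the immediate */
    VEXTRACTI64X4,  /* EVEX-encoded, extracts the high 256 bits */
    VALIGND         /* lane-crossing rotate, used as the fallback */
};

/* byte_offset:    offset of the wanted element inside a 64-byte vector
 *                 (assumed to be a multiple of 4).
 * src_reg/dst_reg: vector register numbers, 0..31.
 * prefer_high256:  assumed tuning knob; non-zero means an extract of the
 *                  high 256 bits is preferred for offset 32. */
static enum extract_insn
choose_extract(unsigned byte_offset, unsigned src_reg, unsigned dst_reg,
               int prefer_high256)
{
    assert(byte_offset < 64 && byte_offset % 4 == 0);
    assert(src_reg < 32 && dst_reg < 32);

    if (byte_offset < 16)
        return LOW128_CASE;
    if (byte_offset % 16 != 0)
        return VALIGND;            /* no single extract puts it in element 0 */
    if (byte_offset == 16 && src_reg < 16 && dst_reg < 16)
        return VEXTRACTI128;       /* shorter VEX encoding, e.g. float v[4] */
    if (byte_offset == 32 && prefer_high256)
        return VEXTRACTI64X4;      /* e.g. float v[8] */
    return VEXTRACTI64X2;          /* immediate would be byte_offset / 16 */
}
```

For example, choose_extract(16, 0, 0, 0) picks VEXTRACTI128, the same offset with a source register of 16 or above falls back to the EVEX form, and any offset that is not a multiple of 16 (such as 20) still ends up as VALIGND.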
>
> ---
>
> Having known zeros outside the low 128 bits (from writing an xmm instead of
> rotating a zmm) is unlikely to matter, although for FP stuff copying fewer
> elements that might be subnormal could happen to be an advantage, maybe
> saving an FP assist for denormal. We're unlikely to be able to take
> advantage of it to save instructions/uops (like OR instead of blend). But
> it's not worse to use a single extract instruction instead of a single
> valignd.
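The difference in the upper elements can be seen with a small scalar model of the two instructions. This is a sketch in plain C on a 16-dword array, not real intrinsics; the function names are made up for illustration. It assumes valignd is used with both sources equal (a rotate), and that the extract writes an xmm destination so the rest of the register is zeroed.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ZMM_DWORDS 16

/* Model of valignd with both source operands equal: the 512-bit value is
 * rotated right by imm dwords, so dst[i] = src[(i + imm) % 16]. */
static void model_valignd_same_src(uint32_t dst[ZMM_DWORDS],
                                   const uint32_t src[ZMM_DWORDS],
                                   unsigned imm)
{
    uint32_t tmp[ZMM_DWORDS];
    for (unsigned i = 0; i < ZMM_DWORDS; i++)
        tmp[i] = src[(i + imm) % ZMM_DWORDS];
    memcpy(dst, tmp, sizeof tmp);
}

/* Model of a 128-bit extract into an xmm destination: the selected lane
 * lands in the low 4 dwords and everything above bit 127 is zero. */
static void model_vextract128(uint32_t dst[ZMM_DWORDS],
                              const uint32_t src[ZMM_DWORDS],
                              unsigned lane)
{
    assert(lane < 4);
    uint32_t tmp[ZMM_DWORDS] = {0};
    memcpy(tmp, src + 4 * lane, 4 * sizeof(uint32_t));
    memcpy(dst, tmp, sizeof tmp);
}
```

With src[i] = i + 1, extracting element 8 either way leaves 9 in element 0. The rotate leaves the remaining source data in elements 4..15, while the extract leaves zeros there.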