https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80820

--- Comment #5 from Peter Cordes <peter at cordes dot ca> ---
AVX512F with merge-masking for integer->vector broadcasts gives us a single-uop
replacement for vpinsrq/d, which is 2 uops on Intel/AMD.

See my answer on
https://stackoverflow.com/questions/50779309/loading-an-xmm-from-gp-regs.  I
don't have access to real hardware, but according to reported uop counts, this
should be very good: 1 uop per instruction on Skylake-avx512 or KNL.

vmovq         xmm0, rax                        1 uop p5   2c latency
vpbroadcastq  xmm0{k1}, rdx   ; k1 = 0b0010    1 uop p5   3c latency
vpbroadcastq  ymm0{k2}, rdi   ; k2 = 0b0100    1 uop p5   3c latency
vpbroadcastq  ymm0{k3}, rsi   ; k3 = 0b1000    1 uop p5   3c latency

xmm vs. ymm vs. zmm makes no difference to latency, according to InstLatx64.
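
To make the merge-masking semantics of the sequence above concrete, here is a
minimal scalar model in portable C.  It is only a sketch of what the
instructions compute, not something a compiler would emit; the function names
and the LANES constant are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 4  /* qword lanes in a 256-bit YMM register */

/* Scalar model of "vpbroadcastq vec{k}, gpr" with merge-masking:
 * lanes whose mask bit is set take the GPR value, all other lanes
 * keep their previous contents. */
void masked_broadcast_q(uint64_t vec[LANES], unsigned k, uint64_t gpr)
{
    assert(k < (1u << LANES));   /* only the low LANES mask bits are modeled */
    for (int i = 0; i < LANES; i++)
        if (k & (1u << i))
            vec[i] = gpr;
}

/* Model of the 4-instruction sequence for a YMM result. */
void ymm_from_gprs(uint64_t vec[LANES], uint64_t rax, uint64_t rdx,
                   uint64_t rdi, uint64_t rsi)
{
    /* vmovq xmm0, rax: lane 0 = rax, upper lanes zeroed */
    vec[0] = rax;
    for (int i = 1; i < LANES; i++)
        vec[i] = 0;

    masked_broadcast_q(vec, 0x2, rdx);  /* k1 = 0b0010 */
    masked_broadcast_q(vec, 0x4, rdi);  /* k2 = 0b0100 */
    masked_broadcast_q(vec, 0x8, rsi);  /* k3 = 0b1000 */
}
```

Each broadcast depends on the previous vector value, which is why the real
sequence forms a single dependency chain.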

(For a full ZMM vector, maybe start a 2nd dep chain and vinsert to combine
256-bit halves.  Also means only 3 k registers instead of 7)

vpbroadcastq  zmm0{k4}, rcx   ; k4 = 0b10000    3c latency
... filling up the ZMM reg


Starting with k1 = 2 = 0b0010, we can init the rest with KSHIFT:

    mov      eax, 0b0010       ; = 2
    kmovw    k1, eax
    KSHIFTLW k2, k1, 1
    KSHIFTLW k3, k1, 2

  #  KSHIFTLW k4, k1, 3
     ...

KSHIFT runs only on port 5 (SKX), but so does KMOV, so loading every mask from
an integer register would not relieve port 5; it would just cost extra
instructions to set up the integer regs first.

It's actually ok if the upper bytes of the vector are filled with broadcasts,
not zeros, so we could use 0b1110 / 0b1100 etc. for the masks.  We could start
with kxnor to generate a -1 and left-shift that, but that's 2 port5 uops vs.
mov eax,2 / kmovw k1, eax being p0156 + p5.
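
As a sanity check on the claim that overlapping masks are fine, here is a small
scalar model in portable C.  It assumes the masks come from an all-ones value
(as kxnor would produce) shifted left by 1, 2 and 3; the function names and the
QLANES constant are hypothetical, chosen only for this sketch.

```c
#include <assert.h>
#include <stdint.h>

#define QLANES 4  /* qword lanes in a 256-bit YMM register */

/* Model of KSHIFTLW on a 16-bit mask, for shift counts 0..15. */
uint16_t kshiftlw_model(uint16_t k, unsigned imm)
{
    assert(imm < 16);
    return (uint16_t)((unsigned)k << imm);
}

/* Merge-masked qword broadcast; only the low QLANES mask bits matter. */
void merge_broadcast_q(uint64_t vec[QLANES], uint16_t k, uint64_t gpr)
{
    for (int i = 0; i < QLANES; i++)
        if (k & (1u << i))
            vec[i] = gpr;
}

void ymm_from_gprs_overlapping(uint64_t vec[QLANES], uint64_t rax,
                               uint64_t rdx, uint64_t rdi, uint64_t rsi)
{
    uint16_t ones = 0xFFFF;                  /* kxnorw k0, k0, k0 */
    uint16_t k1 = kshiftlw_model(ones, 1);   /* low bits 1110 */
    uint16_t k2 = kshiftlw_model(ones, 2);   /* low bits 1100 */
    uint16_t k3 = kshiftlw_model(ones, 3);   /* low bits 1000 */

    vec[0] = rax;                            /* vmovq: upper lanes zeroed */
    for (int i = 1; i < QLANES; i++)
        vec[i] = 0;

    merge_broadcast_q(vec, k1, rdx);  /* lanes 1..3 = rdx */
    merge_broadcast_q(vec, k2, rdi);  /* lanes 2..3 overwritten with rdi */
    merge_broadcast_q(vec, k3, rsi);  /* lane 3 overwritten with rsi */
}
```

Every extra copy written by an earlier broadcast is overwritten by a later one,
so the final vector matches what the single-bit masks would give.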

Loading k registers from memory is not helpful: according to IACA, it costs 3
uops.  (But that includes p237, and a store-AGU uop makes no sense, so it might
be wrong.)
