https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80820
--- Comment #5 from Peter Cordes <peter at cordes dot ca> ---
AVX512F with merge-masking for integer->vector broadcasts gives us a single-uop
replacement for vpinsrq/d, which is 2 uops on Intel/AMD.

See my answer on
https://stackoverflow.com/questions/50779309/loading-an-xmm-from-gp-regs.  I
don't have access to real hardware, but according to reported uop counts, this
should be very good: 1 uop per instruction on Skylake-avx512 or KNL.

    vmovq         xmm0, rax                        1 uop p5   2c latency
    vpbroadcastq  xmm0{k1}, rdx   ; k1 = 0b0010    1 uop p5   3c latency
    vpbroadcastq  ymm0{k2}, rdi   ; k2 = 0b0100    1 uop p5   3c latency
    vpbroadcastq  ymm0{k3}, rsi   ; k3 = 0b1000    1 uop p5   3c latency

xmm vs. ymm vs. zmm makes no difference to latency, according to InstLatx64.

(For a full ZMM vector, maybe start a 2nd dep chain and vinsert to combine
256-bit halves.  That also means only 3 k registers instead of 7.)

    vpbroadcastq  zmm0{k4}, rcx   ; k4 = 0b10000             3c latency
    ... filling up the ZMM reg

Starting with k1 = 2 = 0b0010, we can init the rest with KSHIFT:

    mov      eax, 0b0010          ; = 2
    kmovw    k1, eax
    KSHIFTLW k2, k1, 1
    KSHIFTLW k3, k1, 2

 #  KSHIFTLW k4, k1, 3
    ...

KSHIFT runs only on port 5 (SKX), but so does KMOV; moving from integer
registers would just cost extra instructions to set up integer regs first.

It's actually ok if the upper bytes of the vector are filled with broadcasts,
not zeros, so we could use 0b1110 / 0b1100 etc. for the masks.  We could start
with kxnor to generate a -1 and left-shift that, but that's 2 port5 uops vs.
mov eax,2 / kmovw k1, eax being p0156 + p5.

Loading k registers from memory is not helpful: according to IACA, it costs 3
uops.  (But that includes p237, and a store-AGU uop makes no sense, so it might
be wrong.)
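
As a sanity check on the masking logic only (not on uop counts or latency), here
is a minimal Python sketch that models the sequence as lists of 64-bit elements.
The helper names (vmovq, vpbroadcastq_merge, kshiftlw, build_ymm) are
hypothetical and exist only for this illustration; the model assumes the usual
EVEX behaviour that elements above the destination vector length are zeroed. It
shows that the one-hot masks and the relaxed 0b1110 / 0b1100 / 0b1000 masks
should produce the same final YMM contents, because each later broadcast
overwrites the extra elements written by an earlier one.

```python
MASK64 = (1 << 64) - 1
ZMM_ELEMS = 8  # model the full register as 8 qword elements


def vmovq(value):
    """Model `vmovq xmm0, r64`: element 0 = value, all other elements zeroed."""
    return [value & MASK64] + [0] * (ZMM_ELEMS - 1)


def vpbroadcastq_merge(dst, k, value, nelem):
    """Model `vpbroadcastq dst{k}, r64` with merge-masking.

    nelem is the destination width in qwords (2 = xmm, 4 = ymm, 8 = zmm).
    Elements whose mask bit is set take the broadcast value, the others keep
    the old dst contents, and elements above the vector length are zeroed
    (assumed EVEX behaviour).
    """
    out = []
    for i in range(ZMM_ELEMS):
        if i >= nelem:
            out.append(0)
        elif (k >> i) & 1:
            out.append(value & MASK64)
        else:
            out.append(dst[i])
    return out


def kshiftlw(k, imm):
    """Model `kshiftlw`: left shift of a 16-bit mask register."""
    return (k << imm) & 0xFFFF


def build_ymm(rax, rdx, rdi, rsi, k1, k2, k3):
    """Run the 4-instruction sequence from the comment and return 4 qwords."""
    v = vmovq(rax)
    v = vpbroadcastq_merge(v, k1, rdx, 2)  # xmm0{k1}
    v = vpbroadcastq_merge(v, k2, rdi, 4)  # ymm0{k2}
    v = vpbroadcastq_merge(v, k3, rsi, 4)  # ymm0{k3}
    return v[:4]


if __name__ == "__main__":
    # One-hot masks derived from k1 = 2 with KSHIFTLW.
    k1 = 0b0010
    k2, k3 = kshiftlw(k1, 1), kshiftlw(k1, 2)
    strict = build_ymm(1, 2, 3, 4, k1, k2, k3)

    # Relaxed masks derived from an all-ones mask (the kxnor idea).
    ones = 0xFFFF
    relaxed = build_ymm(1, 2, 3, 4,
                        kshiftlw(ones, 1), kshiftlw(ones, 2), kshiftlw(ones, 3))
    print(strict, relaxed, strict == relaxed)
```

The relaxed variant only works because the broadcasts run in increasing element
order; reordering them would leave stale broadcast values in the upper elements.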