https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #18 from Richard Biener <rguenth at gcc dot gnu.org> ---
There's another thing - we end up with
vmovq %rax, %xmm3
vpinsrq $1, %rdx, %xmm3, %xmm0
but that has way worse latency than the alternative you'd get w/o SSE 4.1:
vmovq %rax, %xmm3
vmovq %rdx, %xmm7
punpcklqdq %xmm7, %xmm3
for example on Zen3 vmovq and vpinsrq have latencies of 3 while punpcklqdq
has a latency of only one.  Since the two vmovq are independent and can
execute in parallel, the critical path is 3 + 1 = 4 cycles instead of
3 + 3 = 6, so the second variant should have 2 cycles less latency.
Testcase:
typedef long v2di __attribute__((vector_size(16)));
v2di foo (long a, long b)
{
  return (v2di){a, b};
}
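(FWIW, compiling the testcase with just -O2 on x86_64, i.e. the SSE2
baseline, should give the punpcklqdq sequence, while -O2 -msse4.1 should
give the vmovq/vpinsrq one.)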
Even on Skylake it's 2 (movq) + 3 (vpinsr), so there it's 5 vs. 3.  Not
sure whether we should do this late somehow (via a peephole or a splitter)
since it requires one more %xmm register.
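For illustration only (this is not what GCC emits internally, and the
function names are made up for this sketch), the two sequences correspond
roughly to the following intrinsics; the insert variant needs -msse4.1:

#include <emmintrin.h>
#include <smmintrin.h>

/* SSE4.1 form: vmovq + vpinsrq, the insert depends on the first move.  */
__m128i
build_insert (long long a, long long b)
{
  __m128i v = _mm_cvtsi64_si128 (a);   /* vmovq */
  return _mm_insert_epi64 (v, b, 1);   /* vpinsrq $1 */
}

/* SSE2 form: two independent vmovq feeding one punpcklqdq.  */
__m128i
build_unpack (long long a, long long b)
{
  __m128i lo = _mm_cvtsi64_si128 (a);  /* vmovq */
  __m128i hi = _mm_cvtsi64_si128 (b);  /* vmovq */
  return _mm_unpacklo_epi64 (lo, hi);  /* punpcklqdq */
}

The unpack form is the one that needs the extra %xmm register mentioned
above.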