https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87718
--- Comment #4 from Terry Guo <xuepeng.guo at intel dot com> ---
(In reply to Uroš Bizjak from comment #2)
> Following testcase:
>
> --cut here--
> typedef int V __attribute__((vector_size (8)));
>
> void foo (int x, int y)
> {
>   register int a __asm ("xmm1");
>   register int b __asm ("xmm2");
>   register V c __asm ("xmm3");
>   a = x;
>   b = y;
>   asm volatile ("" : "+v" (a), "+v" (b));
>   c = (V) { a, b };
>   asm volatile ("" : "+v" (c));
> }
> --cut here--
>
> gets compiled with -O2 -mavx -mtune=intel:
>
>         vmovd   %edi, %xmm1
>         vmovd   %esi, %xmm2
>         vmovd   %xmm2, %eax
>         vpinsrd $1, %eax, %xmm1, %xmm3
>         ret
>
> The relevant pattern is defined as:
>
> (define_insn "*vec_concatv2si_sse4_1"
>   [(set (match_operand:V2SI 0 "register_operand"
>           "=Yr,*x, x, v,Yr,*x, v, v, *y,*y")
>         (vec_concat:V2SI
>           (match_operand:SI 1 "nonimmediate_operand"
>           " 0, 0, x,Yv, 0, 0,Yv,rm, 0,rm")
>           (match_operand:SI 2 "nonimm_or_0_operand"
>           " rm,rm,rm,rm,Yr,*x,Yv, C,*ym, C")))]
>   "TARGET_SSE4_1 && !(MEM_P (operands[1]) && MEM_P (operands[2]))"
>   "@
>    pinsrd\t{$1, %2, %0|%0, %2, 1}
>    pinsrd\t{$1, %2, %0|%0, %2, 1}
>    vpinsrd\t{$1, %2, %1, %0|%0, %1, %2, 1}
>    vpinsrd\t{$1, %2, %1, %0|%0, %1, %2, 1}
>    punpckldq\t{%2, %0|%0, %2}
>    punpckldq\t{%2, %0|%0, %2}
>    vpunpckldq\t{%2, %1, %0|%0, %1, %2}
>    %vmovd\t{%1, %0|%0, %1}
>    punpckldq\t{%2, %0|%0, %2}
>    movd\t{%1, %0|%0, %1}"
>
> but for some reason RA chooses alternative 2 (x<-x,rm) instead of
> alternative 6 (v<-Yv,Yv), although alternative 2 needs an extra reload from
> %xmm2 to %eax.

I dug into this a bit, and it looks like we are missing something in the
combine pass, so we fail to get a pattern that can match alternative 6.

The combine pass dump of an older GCC shows:
-------------------
      REG_UNUSED flags:CC
insn_cost 4 for    10: r82:SI=xmm16:SI
      REG_DEAD xmm16:SI
insn_cost 4 for    11: r83:SI=xmm17:SI
      REG_DEAD xmm17:SI
insn_cost 4 for    12: r87:V2SI=vec_concat(r82:SI,r83:SI)
      REG_DEAD r83:SI
      REG_DEAD r82:SI
-------------------
Then we get:
-------------------
Trying 10 -> 12:
   10: r82:SI=xmm16:SI
      REG_DEAD xmm16:SI
   12: r87:V2SI=vec_concat(r82:SI,r83:SI)
      REG_DEAD r83:SI
      REG_DEAD r82:SI
Successfully matched this instruction:
(set (reg:V2SI 87)
    (vec_concat:V2SI (reg/v:SI 52 xmm16 [ a ])
        (reg:SI 83 [ b.1_2 ])))
allowing combination of insns 10 and 12
original costs 4 + 4 = 8
replacement cost 4
deferring deletion of insn with uid = 10.
modifying insn i3    12: r87:V2SI=vec_concat(xmm16:SI,r83:SI)
      REG_DEAD xmm16:SI
      REG_DEAD r83:SI
deferring rescan insn with uid = 12.

Trying 11 -> 12:
   11: r83:SI=xmm17:SI
      REG_DEAD xmm17:SI
   12: r87:V2SI=vec_concat(xmm16:SI,r83:SI)
      REG_DEAD xmm16:SI
      REG_DEAD r83:SI
Successfully matched this instruction:
(set (reg:V2SI 87)
    (vec_concat:V2SI (reg/v:SI 52 xmm16 [ a ])
        (reg/v:SI 53 xmm17 [ b ])))
allowing combination of insns 11 and 12
original costs 4 + 4 = 8
replacement cost 4
deferring deletion of insn with uid = 11.
modifying insn i3    12: r87:V2SI=vec_concat(xmm16:SI,xmm17:SI)
      REG_DEAD xmm17:SI
      REG_DEAD xmm16:SI
deferring rescan insn with uid = 12.
-------------------
There are two successful combine attempts, and we end up with a pattern that
can match alternative 6.

However, the dump from current GCC trunk shows:
-------------------
insn_cost 4 for    19: r90:SI=xmm16:SI
      REG_DEAD xmm16:SI
insn_cost 4 for    10: r82:SI=r90:SI
      REG_DEAD r90:SI
insn_cost 4 for    20: r91:SI=xmm17:SI
      REG_DEAD xmm17:SI
insn_cost 4 for    11: r83:SI=r91:SI
      REG_DEAD r91:SI
insn_cost 4 for    12: r87:V2SI=vec_concat(r82:SI,r83:SI)
      REG_DEAD r83:SI
      REG_DEAD r82:SI
insn_cost 4 for    13: xmm3:V2SI=r87:V2SI
      REG_DEAD r87:V2SI
-------------------
Trying 11 -> 12:
   11: r83:SI=r91:SI
      REG_DEAD r91:SI
   12: r87:V2SI=vec_concat(r90:SI,r83:SI)
      REG_DEAD r90:SI
      REG_DEAD r83:SI
Successfully matched this instruction:
(set (reg:V2SI 87)
    (vec_concat:V2SI (reg:SI 90)
        (reg:SI 91)))
allowing combination of insns 11 and 12
original costs 4 + 4 = 8
replacement cost 4
deferring deletion of insn with uid = 11.
modifying insn i3    12: r87:V2SI=vec_concat(r90:SI,r91:SI)
      REG_DEAD r91:SI
      REG_DEAD r90:SI
deferring rescan insn with uid = 12.
-------------------
We end up with "12: r87:V2SI=vec_concat(r90:SI,r91:SI)". Later, in the LRA
pass, operand r90 is replaced with an XMM register while r91 is kept in a
general register, so there is no chance to match the preferred alternative 6.
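
To make that last point concrete, here is a small C model of which AVX-form
alternative accepts operand 2 without a reload. It is only a sketch, not GCC
code: it covers just alternatives 2, 3 and 6 (the ones emitting vpinsrd and
vpunpckldq), assumes operand 1 is already in an SSE register, and does not
model how LRA actually costs alternatives. The names operand_kind,
avx_alternatives and matching_alternative are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified view of where operand 2 lives after register allocation. */
enum operand_kind { KIND_GPR, KIND_SSE, KIND_MEM };

/* One alternative of the simplified model (hypothetical structure). */
struct alternative {
    int index;          /* alternative number in the define_insn */
    const char *insn;   /* mnemonic emitted by that alternative */
    bool accepts[3];    /* operand 2 kinds accepted, indexed by operand_kind */
};

/* Operand 2 constraints taken from the pattern: "rm" accepts a general
   register or memory, "Yv" accepts an SSE register. */
static const struct alternative avx_alternatives[] = {
    { 2, "vpinsrd",    { true,  false, true  } },  /* x <- x,  rm */
    { 3, "vpinsrd",    { true,  false, true  } },  /* v <- Yv, rm */
    { 6, "vpunpckldq", { false, true,  false } },  /* v <- Yv, Yv */
};

/* Return the first modeled alternative whose operand 2 constraint accepts
   the given kind as-is, or -1 if none does. */
int matching_alternative(enum operand_kind op2)
{
    assert(op2 >= KIND_GPR && op2 <= KIND_MEM);
    size_t n = sizeof avx_alternatives / sizeof avx_alternatives[0];
    for (size_t i = 0; i < n; i++)
        if (avx_alternatives[i].accepts[op2])
            return avx_alternatives[i].index;
    return -1;
}
```

In this model, matching_alternative(KIND_GPR) gives 2 (vpinsrd), which is the
situation on trunk once r91 stays in a general register, while
matching_alternative(KIND_SSE) gives 6 (vpunpckldq), which is what the older
combine result made possible.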