https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #29 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Uroš Bizjak from comment #27)
> (In reply to Richard Biener from comment #26)
> > but that doesn't seem to match for some unknown reason.
>
> Try this:
>
> (define_peephole2
>   [(match_scratch:DI 5 "Yv")
>    (set (match_operand:DI 0 "sse_reg_operand")
>         (match_operand:DI 1 "general_reg_operand"))
>    (set (match_operand:V2DI 2 "sse_reg_operand")
>         (vec_concat:V2DI (match_operand:DI 3 "sse_reg_operand")
>                          (match_operand:DI 4 "nonimmediate_gr_operand")))]
>   ""
>   [(set (match_dup 0)
>         (match_dup 1))
>    (set (match_dup 5)
>         (match_dup 4))
>    (set (match_dup 2)
>         (vec_concat:V2DI (match_dup 3)
>                          (match_dup 5)))])

Ah, I messed up operands.  The following works (the above position of
match_scratch happily chooses an operand matching operand 0):

;; Further split pinsrq variants of vec_concatv2di with two GPR sources,
;; one already reloaded, to hide the latency of one GPR->XMM transitions.
(define_peephole2
  [(set (match_operand:DI 0 "sse_reg_operand")
        (match_operand:DI 1 "general_reg_operand"))
   (match_scratch:DI 2 "Yv")
   (set (match_operand:V2DI 3 "sse_reg_operand")
        (vec_concat:V2DI (match_dup 0)
                         (match_operand:DI 4 "nonimmediate_gr_operand")))]
  "reload_completed && optimize_insn_for_speed_p ()"
  [(set (match_dup 0)
        (match_dup 1))
   (set (match_dup 2)
        (match_dup 4))
   (set (match_dup 3)
        (vec_concat:V2DI (match_dup 0)
                         (match_dup 2)))])

but for some reason it again doesn't work for the important loop.  There
we have

  389: xmm0:DI=cx:DI
      REG_DEAD cx:DI
  390: dx:DI=[sp:DI+0x10]
   56: {dx:DI=dx:DI 0>>0x3f;clobber flags:CC;}
      REG_UNUSED flags:CC
   57: xmm0:V2DI=vec_concat(xmm0:DI,dx:DI)

I suppose the reason is that there's two unrelated insns between the
xmm0 = cx:DI and the vec_concat.  Which would hint that we somehow
need to not match this GPR->XMM move in the peephole pattern but
instead somehow in the condition (can we use DF there?)

The simplified variant below works but IMHO matches cases we do not
want to transform.  I can't find any example on how to achieve that
though.

;; Further split pinsrq variants of vec_concatv2di with two GPR sources,
;; one already reloaded, to hide the latency of one GPR->XMM transitions.
(define_peephole2
  [(match_scratch:DI 3 "Yv")
   (set (match_operand:V2DI 0 "sse_reg_operand")
        (vec_concat:V2DI (match_operand:DI 1 "sse_reg_operand")
                         (match_operand:DI 2 "nonimmediate_gr_operand")))]
  "reload_completed && optimize_insn_for_speed_p ()"
  [(set (match_dup 3)
        (match_dup 2))
   (set (match_dup 0)
        (vec_concat:V2DI (match_dup 1)
                         (match_dup 3)))])