On Mon, Mar 08, 2021 at 12:04:22PM +0100, Richard Biener wrote:
> +;; Further split pinsrq variants of vec_concatv2di to hide the latency
> +;; the GPR->XMM transition(s).
> +(define_peephole2
> +  [(match_scratch:DI 3 "Yv")
> +   (set (match_operand:V2DI 0 "sse_reg_operand")
> +        (vec_concat:V2DI (match_operand:DI 1 "sse_reg_operand")
> +                         (match_operand:DI 2 "nonimmediate_gr_operand")))]
> +  "TARGET_64BIT && TARGET_SSE4_1
> +   && !optimize_insn_for_size_p ()"
> +  [(set (match_dup 3)
> +        (match_dup 2))
> +   (set (match_dup 0)
> +        (vec_concat:V2DI (match_dup 1)
> +                         (match_dup 3)))])

Do we really want to do it for all vpinsrqs, and not just those where
operands[1] is set from a nonimmediate_gr_operand a few instructions
earlier (or perhaps e.g. other insertions from GPRs)?

I mean, whether this is a win should depend on the latency of the
operands[1] setter, provided it is not too far from the vec_concat.
If the setter has low latency, this will only grow code without benefit;
if it has high latency, the GPR -> XMM move could indeed be performed
concurrently with the previous operation.

Hardcoding the operands[1] setter in the peephole2 would mean we couldn't
match some small number of unrelated insns in between, but perhaps the
peephole2 condition could just call a function that walks the IL backward
a little and checks where the setter is and what latency it has?

	Jakub
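
To make the backward-walk idea concrete, here is a minimal sketch on a toy model, not GCC code. `ToyInsn`, `setter_latency`, `split_is_profitable`, the lookback window and the latency threshold are all hypothetical names and parameters invented for illustration; a real implementation would have to walk actual RTL insns and take latencies from the target's scheduling description.

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical, simplified stand-in for an insn: the register it writes
// and an assumed result latency in cycles. This is not GCC's RTL API.
struct ToyInsn {
    int dest_reg;  // register written by this insn (-1 if none)
    int latency;   // assumed result latency in cycles
};

// Walk backward from index `pos` (exclusive) over at most `max_lookback`
// insns and return the latency of the nearest setter of `reg`, if found.
std::optional<int> setter_latency(const std::vector<ToyInsn>& insns,
                                  std::size_t pos, int reg,
                                  std::size_t max_lookback) {
    std::size_t seen = 0;
    while (pos > 0 && seen < max_lookback) {
        --pos;
        ++seen;
        if (insns[pos].dest_reg == reg)
            return insns[pos].latency;
    }
    return std::nullopt;  // no setter within the window
}

// Assumed policy: split only when a nearby setter of operands[1] is slow
// enough that the extra GPR->XMM move can overlap with it.
bool split_is_profitable(const std::vector<ToyInsn>& insns, std::size_t pos,
                         int op1_reg, std::size_t max_lookback,
                         int min_latency) {
    std::optional<int> lat =
        setter_latency(insns, pos, op1_reg, max_lookback);
    return lat.has_value() && *lat >= min_latency;
}
```

In this sketch a setter that lies outside the window, or that has low latency, leaves the insn unsplit, which matches the concern about growing code without benefit.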