On Mon, Mar 08, 2021 at 12:04:22PM +0100, Richard Biener wrote:
> +;; Further split pinsrq variants of vec_concatv2di to hide the latency
> +;; the GPR->XMM transition(s).
> +(define_peephole2
> +  [(match_scratch:DI 3 "Yv")
> +   (set (match_operand:V2DI 0 "sse_reg_operand")
> +     (vec_concat:V2DI (match_operand:DI 1 "sse_reg_operand")
> +                      (match_operand:DI 2 "nonimmediate_gr_operand")))]
> +  "TARGET_64BIT && TARGET_SSE4_1
> +   && !optimize_insn_for_size_p ()"
> +  [(set (match_dup 3)
> +        (match_dup 2))
> +   (set (match_dup 0)
> +     (vec_concat:V2DI (match_dup 1)
> +                      (match_dup 3)))])

Do we really want to do it for all vpinsrqs and not just those where
operands[1] is set from a nonimmediate_gr_operand a few instructions
earlier (or perhaps e.g. other insertions from GPRs)?
I mean, whether this is a win should depend on the latency of the
operands[1] setter, provided it is not too far from the vec_concat.  If it
has low latency, this will only grow code without benefit; if it has high
latency, it indeed could perform the GPR -> XMM move concurrently with the
previous operation.
Hardcoding the operands[1] setter in the peephole2 would mean we couldn't
match some small number of unrelated insns in between, but perhaps the
peephole2 condition could just call a function that walks the IL backward a
little and checks where the setter is and what latency it has?
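
To make the idea concrete, here is a rough sketch of such a backward walk.
It is a toy model only, not GCC code: ToyInsn, setter_is_near_and_slow, the
register numbering and the latency threshold are all hypothetical names and
values made up for illustration, and a real version would have to walk the
actual insn chain and ask the scheduler model for latencies.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-in for an insn: the register it sets and an assumed latency.
// (Hypothetical type, not part of GCC.)
struct ToyInsn {
  int dest_reg;  // register written by this insn (made-up numbering)
  int latency;   // assumed result latency in cycles
};

// Walk backward from position `pos` (where the vec_concat would sit) over at
// most `max_lookback` insns.  Return true only if the most recent setter of
// `reg` is found inside that window and its latency is at least
// `min_latency`, i.e. the split move could plausibly overlap with it.
bool setter_is_near_and_slow(const std::vector<ToyInsn> &insns,
                             std::size_t pos, int reg,
                             std::size_t max_lookback, int min_latency) {
  assert(pos <= insns.size());
  for (std::size_t n = 0; n < max_lookback && pos > 0; ++n) {
    const ToyInsn &insn = insns[--pos];
    if (insn.dest_reg == reg)
      return insn.latency >= min_latency;
  }
  return false;  // setter not found within the window: do not split
}
```

The peephole2 condition would then only accept the split when such a check
succeeds, so a cheap or distant setter leaves the single vpinsrq alone.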

        Jakub
