On Fri, Oct 11, 2024 at 6:26 AM Jeff Law <j...@ventanamicro.com> wrote:
>
> I probably spent way more time on this than it's worth...
>
> I was looking at the code we generate for vector SAD and noticed that we
> were being a bit silly.  Specifically:
>
>          li      a4,0            # 272   [c=4 l=4]  *movsi_internal/1
>
> Followed shortly by:
>
>          vmv.s.x v3,a4   # 261   [c=4 l=4]  *pred_broadcastrvvm1si/6
>
> And no other uses of a4.  We could have used x0 trivially.
>
> First we adjust the expander so that it doesn't force the constant into
> a register.  In the matching pattern we change the appropriate source
> constraints from "r" to "rJ" and the output template is changed to use
> %z for the operand.  The net is we drop the li completely and emit
> vmv.s.x v3,x0.
>
> But wait, there's more.  If we're broadcasting a constant in the range
> [-16..15] into a vector, we currently load the constant into a register
> and use vmv.v.x.  We can instead use vmv.v.i, which avoids loading the
> constant into a GPR.  For that case we again avoid forcing the constant
> into a register in the expander and adjust the output template to emit
> vmv.v.x or vmv.v.i based on whether or not the appropriate operand is a
> constant or general purpose register.  So again, we'll drop a load
> immediate into a scalar for this case.
>
> Whether or not we should use vmv.v.i vs vmv.s.x for loading [-16..15]
> into the 0th element is probably uarch dependent.  The tradeoff is
> loading the GPR vs the broadcast in the vector unit.  I didn't bother
> with this case.

Note that this tradeoff is only interesting when LMUL is small.  When
LMUL is large, vmv.v.i does a lot more work than vmv.s.x (writing
multiple vector registers versus just one).
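
For reference, the operand selection described in the quoted text can be
sketched roughly as below.  This is only an illustration, not the code
from the patch: the enum, the function names and the string formatting
are all made up here, and it only models a constant that is known at
compile time.  It deliberately does not model the uarch-dependent choice
of vmv.v.i vs vmv.s.x for element 0.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical names for illustration; none of these exist in GCC.  */
enum mv_form {
    MV_S_X_ZERO,  /* vmv.s.x vd,x0  : element 0 := 0, no li needed      */
    MV_S_X_REG,   /* vmv.s.x vd,rs1 : element 0 := value held in a GPR  */
    MV_V_I,       /* vmv.v.i vd,imm : broadcast a 5-bit signed immediate */
    MV_V_X_REG    /* vmv.v.x vd,rs1 : broadcast a value held in a GPR   */
};

/* True if VALUE fits the 5-bit signed immediate of vmv.v.i.  */
static bool fits_simm5(int64_t value)
{
    return value >= -16 && value <= 15;
}

/* Pick a form for a known constant VALUE.  BROADCAST selects between
   writing every element (true) and writing element 0 only (false).  */
static enum mv_form choose_mv_form(bool broadcast, int64_t value)
{
    if (broadcast)
        return fits_simm5(value) ? MV_V_I : MV_V_X_REG;
    return value == 0 ? MV_S_X_ZERO : MV_S_X_REG;
}

/* Write the instruction text for FORM into BUF.  RS1 is only used by
   the forms that still need the constant in a GPR (loaded by an li
   that is not shown here).  Returns snprintf's result.  */
static int format_mv(char *buf, size_t len, enum mv_form form,
                     const char *vd, const char *rs1, int64_t value)
{
    switch (form) {
    case MV_S_X_ZERO:
        return snprintf(buf, len, "vmv.s.x %s,x0", vd);
    case MV_S_X_REG:
        return snprintf(buf, len, "vmv.s.x %s,%s", vd, rs1);
    case MV_V_I:
        return snprintf(buf, len, "vmv.v.i %s,%lld", vd, (long long)value);
    case MV_V_X_REG:
        return snprintf(buf, len, "vmv.v.x %s,%s", vd, rs1);
    }
    return -1;
}
```

With these helpers, choose_mv_form(false, 0) picks the x0 form and
choose_mv_form(true, 16) falls back to the GPR form, since 16 is just
outside the immediate range.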

>
> Tested in my tester (which tests rv64gcv as a default codegen option).
> Will wait for the pre-commit tester to render a verdict.
>
> Jeff
