On Fri, Oct 11, 2024 at 6:26 AM Jeff Law <j...@ventanamicro.com> wrote:
>
> I probably spent way more time on this than it's worth...
>
> I was looking at the code we generate for vector SAD and noticed that we
> were being a bit silly. Specifically:
>
> li a4,0 # 272 [c=4 l=4] *movsi_internal/1
>
> Followed shortly by:
>
> vmv.s.x v3,a4 # 261 [c=4 l=4] *pred_broadcastrvvm1si/6
>
> And no other uses of a4. We could have used x0 trivially.
>
> First we adjust the expander so that it doesn't force the constant into
> a register. In the matching pattern we change the appropriate source
> constraints from "r" to "rJ" and the output template is changed to use
> %z for the operand. The net is we drop the li completely and emit
> vmv.s.x v3,x0.
>
> But wait, there's more. If we're broadcasting a constant in the range
> [-16..15] into a vector, we currently load the constant into a register
> and use vmv.v.x. We can instead use vmv.v.i, which avoids loading the
> constant into a GPR. For that case we again avoid forcing the constant
> into a register in the expander and adjust the output template to emit
> vmv.v.x or vmv.v.i based on whether or not the appropriate operand is a
> constant or general purpose register. So again, we'll drop a load
> immediate into a scalar for this case.
>
> Whether or not we should use vmv.v.i vs vmv.s.x for loading [-16..15]
> into the 0th element is probably uarch dependent. The tradeoff is
> loading the GPR vs the broadcast in the vector unit. I didn't bother
> with this case.

Note that this tradeoff is only interesting when LMUL is small. When LMUL
is large, vmv.v.i does a lot more work than vmv.s.x (writing multiple
vector registers versus just one).

>
> Tested in my tester (which tests rv64gcv as a default codegen option).
> Will wait for the pre-commit tester to render a verdict.
>
> Jeff
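
For anyone following along, the operand selection described in the quoted text can be sketched as a small standalone C model. This is only an illustration, not the actual GCC expander or output template code: the enum `src_kind` and the functions `scalar_move_source` and `broadcast_source` are hypothetical names made up here. It assumes the behavior described above: a constant zero can feed vmv.s.x via x0, and a constant in [-16..15] can be broadcast with vmv.v.i.

```c
#include <stdbool.h>

/* Hypothetical model of the source-operand choice; not GCC code. */
enum src_kind {
    SRC_X0,   /* use the hard-wired zero register, no li needed */
    SRC_IMM5, /* use a 5-bit signed immediate (vmv.v.i), no li needed */
    SRC_GPR   /* value must live in a general purpose register */
};

/* vmv.s.x vd, rs1: a constant zero can be supplied as x0 directly,
   which is what the "rJ" constraint plus %z in the template allows.
   Any other value still needs a GPR. */
enum src_kind scalar_move_source(bool is_const, long value)
{
    if (is_const && value == 0)
        return SRC_X0;
    return SRC_GPR;
}

/* Broadcast: a constant in [-16..15] fits the immediate form vmv.v.i;
   anything else goes through a GPR and vmv.v.x. */
enum src_kind broadcast_source(bool is_const, long value)
{
    if (is_const && value >= -16 && value <= 15)
        return SRC_IMM5;
    return SRC_GPR;
}
```

With this model, `scalar_move_source(true, 0)` picks x0 (the SAD case above), while `broadcast_source(true, 16)` falls back to a GPR because 16 is outside the immediate range. The model deliberately ignores the LMUL-dependent cost question discussed above.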