https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111153
--- Comment #2 from Robin Dapp <rdapp at gcc dot gnu.org> ---
With the current trunk we don't spill anymore:
(VLS)
.L4:
vle32.v v2,0(a5)
vadd.vv v1,v1,v2
addi a5,a5,16
bne a5,a4,.L4
Considering just that loop, I'd say costing works as designed. Even though the
epilog and boilerplate code seem "crude", the main loop is as short as it can
be and is IMHO preferable.
(VLA)
.L3:
vsetvli a5,a1,e32,m1,tu,ma
slli a4,a5,2
sub a1,a1,a5
vle32.v v2,0(a0)
add a0,a0,a4
vadd.vv v1,v2,v1
bne a1,zero,.L3
This VLA loop has 6 instructions (disregarding the jump) and can't be faster
than the 3 instructions for the VLS loop. Provided we iterate often enough, the
VLS loop should always be a win.
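To make the instruction-count argument concrete, here is a rough sketch in C
that only counts executed loop-body instructions. The per-iteration counts
(3 and 6, branch excluded) come from the two listings; the function names, the
lane count passed in, and the assumption that vsetvli hands back
min(remaining, VLMAX) on every iteration are mine. Instruction count is of
course only a proxy for runtime.

```c
#include <assert.h>
#include <stddef.h>

/* Loop-body instruction counts read off the two listings, branch excluded. */
enum { VLS_BODY_INSNS = 3, VLA_BODY_INSNS = 6 };

/* Instructions executed by the fixed-length (VLS) main loop over n elements.
   Only full groups of `lanes` elements are counted; the n % lanes leftover
   elements are handled by the epilog, which this sketch ignores. */
size_t vls_loop_insns(size_t n, size_t lanes)
{
    assert(lanes > 0);
    return (n / lanes) * VLS_BODY_INSNS;
}

/* Instructions executed by the length-agnostic (VLA) loop over n elements.
   Assumption: each iteration consumes min(remaining, vlmax) elements, so the
   loop runs ceil(n / vlmax) times. */
size_t vla_loop_insns(size_t n, size_t vlmax)
{
    assert(vlmax > 0);
    return ((n + vlmax - 1) / vlmax) * VLA_BODY_INSNS;
}
```

With 4 elements per iteration in both loops (the VLS loop advances by 16 bytes
of 32-bit elements; for VLA this assumes VLMAX = 4), 1024 elements cost 768
body instructions in the VLS loop versus 1536 in the VLA loop.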
Regarding "looking slow" - I think ideally we would have the VLS loop followed
directly by the VLA loop for the residual iterations and next to no additional
statements. That would require changes in the vectorizer, though.
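The shape I have in mind can be modelled in plain C. This is only an
illustration of the control flow, not what the vectorizer emits: it assumes the
loop is a 32-bit sum reduction (which the accumulating vadd.vv suggests), uses
a hypothetical LANES = 4 to match the 16-byte stride, and models the vsetvli
result as min(remaining, LANES).

```c
#include <stddef.h>
#include <stdint.h>

enum { LANES = 4 };  /* 4 x 32-bit = 16 bytes, matching `addi a5,a5,16` */

static size_t min_size(size_t a, size_t b) { return a < b ? a : b; }

/* Sum reduction structured as: fixed-length main loop, then a
   length-controlled loop for the residual iterations, sharing one
   vector accumulator (modelled here as an array of lanes). */
int32_t sum_vls_then_vla(const int32_t *a, size_t n)
{
    uint32_t acc[LANES] = {0};  /* unsigned to get wrapping arithmetic */
    size_t i = 0;

    /* VLS-style main loop: always processes exactly LANES elements. */
    for (; i + LANES <= n; i += LANES)
        for (size_t l = 0; l < LANES; l++)
            acc[l] += (uint32_t)a[i + l];

    /* VLA-style residual loop: vl stands in for the vsetvli result.
       Lanes at index >= vl keep their value (tail undisturbed). */
    while (i < n) {
        size_t vl = min_size(n - i, LANES);
        for (size_t l = 0; l < vl; l++)
            acc[l] += (uint32_t)a[i + l];
        i += vl;
    }

    /* Final horizontal reduction of the accumulator lanes. */
    uint32_t sum = 0;
    for (size_t l = 0; l < LANES; l++)
        sum += acc[l];
    return (int32_t)sum;
}
```

The point is that the residual loop reuses the accumulator of the main loop
directly, so nothing beyond the final reduction is needed between or after the
two loops.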
In total: I think the current behavior is reasonable.