https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111153

--- Comment #2 from Robin Dapp <rdapp at gcc dot gnu.org> ---
With the current trunk we don't spill anymore:

(VLS)
.L4:
        vle32.v v2,0(a5)
        vadd.vv v1,v1,v2
        addi    a5,a5,16
        bne     a5,a4,.L4

Considering just that loop I'd say costing works as designed.  Even though the
epilog and boilerplate code seem "crude", the main loop is as short as it can
be and is IMHO preferable.

(VLA)
.L3:
        vsetvli a5,a1,e32,m1,tu,ma
        slli    a4,a5,2
        sub     a1,a1,a5
        vle32.v v2,0(a0)
        add     a0,a0,a4
        vadd.vv v1,v2,v1
        bne     a1,zero,.L3

The VLA loop has 6 instructions (disregarding the jump) and can't be faster
than the 3 instructions of the VLS loop.  Provided we iterate often enough, the
VLS loop should always be a win.
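
To make the instruction-count argument concrete, here is a rough sketch in C.
The per-iteration counts (3 and 6, branch excluded) are taken from the two
listings; everything else is an assumption for illustration: the helper name
loop_insns is hypothetical, and both loops are assumed to consume 4 32-bit
elements per iteration (the VLS loop advances by 16 bytes; for the VLA loop
this corresponds to VLEN = 128 with m1).  Instruction counts are only a proxy
for run time, so treat this as a back-of-the-envelope model.

```c
#include <assert.h>
#include <stddef.h>

/* Instructions per loop iteration, counted from the listings
   (the closing branch is excluded, as in the comment). */
enum { VLS_INSNS_PER_ITER = 3, VLA_INSNS_PER_ITER = 6 };

/* Hypothetical helper: dynamic instruction count of a loop body that
   handles elems_per_iter elements per trip.  Assumption: n is rounded
   up to a whole number of trips. */
static size_t loop_insns(size_t n, size_t elems_per_iter,
                         size_t insns_per_iter)
{
    assert(elems_per_iter > 0);
    size_t iters = (n + elems_per_iter - 1) / elems_per_iter; /* ceiling */
    return iters * insns_per_iter;
}
```

Under these assumptions, loop_insns(1024, 4, VLS_INSNS_PER_ITER) gives 768 and
loop_insns(1024, 4, VLA_INSNS_PER_ITER) gives 1536, i.e. the VLA loop body
executes twice as many instructions for the same trip count.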

Regarding "looking slow" - I think ideally we would have the VLS loop followed
directly by the VLA loop for the residual iterations and next to no additional
statements.  That would require changes in the vectorizer, though.
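
The loop structure described here can be modelled in plain C.  This is only a
sketch of the shape, not what the vectorizer emits today: the function name
sum_vls_then_vla, the helper set_vl (a stand-in for vsetvli) and the 4-lane
width are all assumptions made for illustration, and the input is assumed to
be a simple int32_t sum reduction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { VLS_ELEMS = 4 };  /* assumption: 128-bit vectors of int32_t */

/* Hypothetical stand-in for vsetvli: number of elements the next
   length-adaptive iteration processes. */
static size_t set_vl(size_t remaining)
{
    return remaining < VLS_ELEMS ? remaining : VLS_ELEMS;
}

int32_t sum_vls_then_vla(const int32_t *a, size_t n)
{
    assert(a != NULL || n == 0);

    int32_t acc[VLS_ELEMS] = {0};  /* models the vector accumulator */
    size_t i = 0;

    /* Main loop: fixed width, no per-iteration length bookkeeping. */
    for (; i + VLS_ELEMS <= n; i += VLS_ELEMS)
        for (size_t l = 0; l < VLS_ELEMS; l++)
            acc[l] += a[i + l];

    /* Residual loop: length-adaptive; lanes beyond vl stay untouched,
       similar in spirit to the tail-undisturbed policy. */
    while (i < n) {
        size_t vl = set_vl(n - i);
        for (size_t l = 0; l < vl; l++)
            acc[l] += a[i + l];
        i += vl;
    }

    /* Final horizontal reduction of the accumulator lanes. */
    int32_t sum = 0;
    for (size_t l = 0; l < VLS_ELEMS; l++)
        sum += acc[l];
    return sum;
}
```

For example, calling it on the seven values 1..7 runs the fixed-width loop once
and the residual loop once (with vl = 3), and returns 28.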

In total: I think the current behavior is reasonable.
