https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111153
--- Comment #2 from Robin Dapp <rdapp at gcc dot gnu.org> ---
With the current trunk we don't spill anymore:

(VLS)
.L4:
        vle32.v v2,0(a5)
        vadd.vv v1,v1,v2
        addi    a5,a5,16
        bne     a5,a4,.L4

Considering just that loop I'd say costing works as designed.  Even though
the epilog and boilerplate code seems "crude" the main loop is as short as
it can be and is IMHO preferable.

.L3:
        vsetvli a5,a1,e32,m1,tu,ma
        slli    a4,a5,2
        sub     a1,a1,a5
        vle32.v v2,0(a0)
        add     a0,a0,a4
        vadd.vv v1,v2,v1
        bne     a1,zero,.L3

This has 6 instructions (disregarding the jump) and can't be faster than
the 3 instructions for the VLS loop.  Provided we iterate often enough the
VLS loop should always be a win.

Regarding "looking slow" - I think ideally we would have the VLS loop
followed directly by the VLA loop for the residual iterations and next to
no additional statements.  That would require changes in the vectorizer,
though.

In total: I think the current behavior is reasonable.
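
For illustration only, here is a scalar C sketch of the "VLS loop followed directly by a VLA-style residual loop" shape described above. It is not the PR's test case and not what the vectorizer emits. The names sum_scalar, sum_vls_then_residual and VLS_LANES are hypothetical, and the lane count of 4 is an assumption inferred from the 16-byte stride and e32 elements in the VLS listing. The residual loop models a vsetvli-like step by taking vl = min(remaining, VLS_LANES) and leaving the unused accumulator lanes untouched, roughly in the spirit of the tu (tail-undisturbed) policy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 4 lanes, i.e. a 16-byte stride over 32-bit elements. */
enum { VLS_LANES = 4 };

/* Plain scalar reference reduction (wraps on overflow). */
int32_t sum_scalar(const int32_t *a, size_t n)
{
    uint32_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += (uint32_t)a[i];
    return (int32_t)total;
}

/* Fixed-width main loop followed directly by a length-controlled
   residual loop, sharing one set of per-lane accumulators. */
int32_t sum_vls_then_residual(const int32_t *a, size_t n)
{
    assert(a != NULL || n == 0);

    uint32_t acc[VLS_LANES] = {0};
    size_t i = 0;

    /* "VLS" part: load, add, bump the pointer; nothing else per trip. */
    for (; i + VLS_LANES <= n; i += VLS_LANES)
        for (size_t l = 0; l < VLS_LANES; l++)
            acc[l] += (uint32_t)a[i + l];

    /* "VLA" part: each trip picks vl = min(remaining, VLS_LANES) and
       only updates the first vl lanes; the other lanes keep their value. */
    while (i < n) {
        size_t remaining = n - i;
        size_t vl = remaining < VLS_LANES ? remaining : VLS_LANES;
        for (size_t l = 0; l < vl; l++)
            acc[l] += (uint32_t)a[i + l];
        i += vl;
    }

    /* Final reduction of the lanes to a single value. */
    uint32_t total = 0;
    for (size_t l = 0; l < VLS_LANES; l++)
        total += acc[l];
    return (int32_t)total;
}
```

With n = 11 the main loop runs twice (8 elements) and the residual loop runs once with vl = 3, so the short loop body is executed for the bulk of the work and the length bookkeeping is paid only for the tail.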