https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111153
--- Comment #3 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
(In reply to Robin Dapp from comment #2)
> With the current trunk we don't spill anymore:
>
> (VLS)
> .L4:
>         vle32.v v2,0(a5)
>         vadd.vv v1,v1,v2
>         addi    a5,a5,16
>         bne     a5,a4,.L4
>
> Considering just that loop I'd say costing works as designed. Even though
> the epilog and boilerplate code seems "crude", the main loop is as short
> as it can be and is IMHO preferable.
>
> .L3:
>         vsetvli a5,a1,e32,m1,tu,ma
>         slli    a4,a5,2
>         sub     a1,a1,a5
>         vle32.v v2,0(a0)
>         add     a0,a0,a4
>         vadd.vv v1,v2,v1
>         bne     a1,zero,.L3
>
> This has 6 instructions (disregarding the jump) and can't be faster than
> the 3 instructions for the VLS loop. Provided we iterate often enough the
> VLS loop should always be a win.
>
> Regarding "looking slow" - I think ideally we would have the VLS loop
> followed directly by the VLA loop for the residual iterations and next to
> no additional statements. That would require changes in the vectorizer,
> though.
>
> In total: I think the current behavior is reasonable.

Oh, I see. I just checked it now:

.L4:
        vle32.v v2,0(a5)
        addi    a5,a5,16
        vadd.vv v1,v1,v2
        bne     a5,a4,.L4
        lui     a4,%hi(.LC0)
        lui     a5,%hi(.LC1)
        addi    a4,a4,%lo(.LC0)
        vlm.v   v0,0(a4)
        addi    a5,a5,%lo(.LC1)
        andi    a1,a1,-4
        vmv1r.v v2,v3
        vlm.v   v4,0(a5)
        vcompress.vm    v2,v1,v0
        vmv1r.v v0,v4
        vadd.vv v1,v2,v1
        vcompress.vm    v3,v1,v0
        vadd.vv v3,v3,v1
        vmv.x.s a0,v3
        sext.w  a0,a0
        beq     a3,a1,.L12

It seems the codegen would be even better if we supported VLS-mode
reduction.

I agree that we should first take the VLS reduction choice and then fall
back to the VLA reduction choice. But I wonder why ARM SVE doesn't use this
approach, since it also has VLS modes (NEON/ADVSIMD).
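For context, the assembly above is consistent with a plain signed 32-bit sum
reduction along these lines (a minimal sketch of my own; the actual testcase
is not quoted in this comment, so the function name and signature are
assumptions):

        /* Sketch of the kind of loop under discussion: an int sum
           reduction that the vectorizer turns into the VLS main loop
           shown above.  */
        int
        sum (int *restrict a, int n)
        {
          int s = 0;
          for (int i = 0; i < n; i++)
            s += a[i];
          return s;
        }

Reading the epilogue that way: v1 holds the per-lane partial sums from the
main loop, the two vcompress.vm/vadd.vv pairs (driven by the mask constants
.LC0/.LC1) appear to fold the lanes together in log2 steps, and
vmv.x.s/sext.w extract and sign-extend the final 32-bit result into a0. A
dedicated VLS-mode reduction should be able to emit this more compactly.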