https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111153
--- Comment #3 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
(In reply to Robin Dapp from comment #2)
> With the current trunk we don't spill anymore:
>
> (VLS)
> .L4:
> vle32.v v2,0(a5)
> vadd.vv v1,v1,v2
> addi a5,a5,16
> bne a5,a4,.L4
>
> Considering just that loop I'd say costing works as designed. Even though
> the epilog and boilerplate code seems "crude" the main loop is as short as
> it can be and is IMHO preferable.
>
> .L3:
> vsetvli a5,a1,e32,m1,tu,ma
> slli a4,a5,2
> sub a1,a1,a5
> vle32.v v2,0(a0)
> add a0,a0,a4
> vadd.vv v1,v2,v1
> bne a1,zero,.L3
>
> This has 6 instructions (disregarding the jump) and can't be faster than the
> 3 instructions for the VLS loop. Provided we iterate often enough the VLS
> loop should always be a win.
>
> Regarding "looking slow" - I think ideally we would have the VLS loop
> followed directly by the VLA loop for the residual iterations and next to no
> additional statements. That would require changes in the vectorizer, though.
>
> In total: I think the current behavior is reasonable.
Oh. I see. I just checked it now.
.L4:
vle32.v v2,0(a5)
addi a5,a5,16
vadd.vv v1,v1,v2
bne a5,a4,.L4
lui a4,%hi(.LC0)
lui a5,%hi(.LC1)
addi a4,a4,%lo(.LC0)
vlm.v v0,0(a4)
addi a5,a5,%lo(.LC1)
andi a1,a1,-4
vmv1r.v v2,v3
vlm.v v4,0(a5)
vcompress.vm v2,v1,v0
vmv1r.v v0,v4
vadd.vv v1,v2,v1
vcompress.vm v3,v1,v0
vadd.vv v3,v3,v1
vmv.x.s a0,v3
sext.w a0,a0
beq a3,a1,.L12
It seems that the codegen would be even better if we supported VLS mode
reduction.
I agree that we should first take the VLS reduction choice and then move to
the VLA reduction choice.
But I wonder why ARM SVE doesn't use this approach, since they also have a
VLS mode (NEON/ADVSIMD).
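
To make the ordering discussed here concrete (fixed-length main loop first,
then the residual iterations), below is a minimal C sketch. It is not the
testcase from this PR. The function name sum_vls_then_tail, the 4-lane width
(32-bit elements in a 128-bit vector, matching the "addi a5,a5,16" step) and
the scalar tail standing in for the VLA loop are all assumptions made for
illustration. The pairwise lane fold is only one plausible reading of the
vcompress/vadd epilogue, not a statement of what the masks in .LC0/.LC1
actually contain.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical lane count: 128-bit vector / 32-bit int = 4 lanes. */
enum { VLS_LANES = 4 };

/* Sum n ints: fixed-width main loop, lane fold, then residual loop.
   Signed overflow is not handled; this is a structural sketch only. */
int sum_vls_then_tail(const int *a, size_t n)
{
    assert(a != NULL || n == 0);

    int acc[VLS_LANES] = {0, 0, 0, 0};

    /* Round n down to a multiple of the lane count (cf. "andi a1,a1,-4"). */
    size_t main_n = n & ~(size_t)(VLS_LANES - 1);
    size_t i = 0;

    /* Main loop: one vector load and one vector add per iteration. */
    for (; i < main_n; i += VLS_LANES)
        for (int lane = 0; lane < VLS_LANES; ++lane)
            acc[lane] += a[i + lane];

    /* Epilogue: fold 4 lanes -> 2 -> 1 in two steps. */
    acc[0] += acc[2];
    acc[1] += acc[3];
    int sum = acc[0] + acc[1];

    /* Residual iterations; a VLA loop would take this role in the real code. */
    for (; i < n; ++i)
        sum += a[i];

    return sum;
}
```

With n = 10, the main loop covers elements 0..7 and the residual loop covers
elements 8 and 9, so the epilogue cost is paid once regardless of n.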