--- Comment #3 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
(In reply to Robin Dapp from comment #2)
> With the current trunk we don't spill anymore:
> (VLS)
> .L4:
>       vle32.v v2,0(a5)
>       vadd.vv v1,v1,v2
>       addi    a5,a5,16
>       bne     a5,a4,.L4
> Considering just that loop I'd say costing works as designed.  Even though
> the epilog and boilerplate code seems "crude" the main loop is as short as
> it can be and is IMHO preferable.
> .L3:
>         vsetvli a5,a1,e32,m1,tu,ma
>         slli    a4,a5,2
>         sub     a1,a1,a5
>         vle32.v v2,0(a0)
>         add     a0,a0,a4
>         vadd.vv v1,v2,v1
>         bne     a1,zero,.L3
> This has 6 instructions (disregarding the jump) and can't be faster than the
> 3 instructions for the VLS loop.  Provided we iterate often enough the VLS
> loop should always be a win.
> Regarding "looking slow" - I think ideally we would have the VLS loop
> followed directly by the VLA loop for the residual iterations and next to no
> additional statements.  That would require changes in the vectorizer, though.
> In total: I think the current behavior is reasonable.

Oh, I see. I just checked it now:
        vle32.v v2,0(a5)
        addi    a5,a5,16
        vadd.vv v1,v1,v2
        bne     a5,a4,.L4
        lui     a4,%hi(.LC0)
        lui     a5,%hi(.LC1)
        addi    a4,a4,%lo(.LC0)
        vlm.v   v0,0(a4)
        addi    a5,a5,%lo(.LC1)
        andi    a1,a1,-4
        vmv1r.v v2,v3
        vlm.v   v4,0(a5)
        vcompress.vm    v2,v1,v0
        vmv1r.v v0,v4
        vadd.vv v1,v2,v1
        vcompress.vm    v3,v1,v0
        vadd.vv v3,v3,v1
        vmv.x.s a0,v3
        sext.w  a0,a0
        beq     a3,a1,.L12

It seems that the codegen would be even better if we supported VLS mode.

I agree that we should first take the VLS reduction and then move to the VLA reduction.

But I wonder why ARM SVE doesn't use this approach, since it also has VLS modes.
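
For reference, here is a rough C model of the shape discussed above: a fixed-width main loop, an epilogue that folds the lanes, then a loop for the residual elements. It is only a sketch. The original test case is not shown in this comment, so the function name, the 4-lane width (inferred from "addi a5,a5,16" with 32-bit elements) and the two-step fold (my reading of the two vcompress/vadd pairs) are all assumptions, and the residual loop is plain scalar code here rather than a real VLA loop.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 4 x int32 per 16-byte step, as suggested by "addi a5,a5,16". */
enum { LANES = 4 };

/* Hypothetical stand-in for the reduction being vectorized. */
int32_t sum_vls_then_residual(const int32_t *a, size_t n)
{
    /* Models the vector accumulator (v1 in the dump). Unsigned lanes keep
       wraparound well defined in C. */
    uint32_t acc[LANES] = {0, 0, 0, 0};

    /* Round n down to a multiple of LANES, like "andi a1,a1,-4". */
    size_t main_n = n & ~(size_t)(LANES - 1);
    size_t i;

    /* Main loop: one vector load and one lane-wise add per step. */
    for (i = 0; i < main_n; i += LANES)
        for (int l = 0; l < LANES; l++)
            acc[l] += (uint32_t)a[i + l];

    /* Epilogue: fold 4 lanes to 2, then 2 to 1 (assumed to correspond to
       the two vcompress.vm + vadd.vv pairs). */
    acc[0] += acc[2];
    acc[1] += acc[3];
    uint32_t total = acc[0] + acc[1];

    /* Residual: the remaining n - main_n elements. The generated code uses
       a VLA loop for this part; a scalar loop stands in for it here. */
    for (; i < n; i++)
        total += (uint32_t)a[i];

    return (int32_t)total;
}
```

With a 7-element input, the main loop covers the first 4 elements and the residual loop the last 3, which is the split the "beq a3,a1,.L12" check decides at run time.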
