https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122152
--- Comment #3 from Robin Dapp <rdapp at gcc dot gnu.org> ---
(In reply to Lin Li from comment #1)
> This is indeed a very good idea. However, I wonder if you have noticed that
> if this is done, certain scenarios will generate a lot of vcompress
> instructions, which will actually make the performance extremely poor. I
> encountered this problem on both the SG2044 and my own RISC-V platform.
>
> For 462.libquantum, doing this would result in a performance drop of around
> 60%. The fundamental reason is that the execution efficiency of vcompress
> is too low.
>
> .L77:
>         sub     a5,a5,a7
>         vsetvli zero,a5,e64,m2,ta,ma
>         vle64.v v2,0(t1)
>         vsetvli zero,a7,e64,m2,ta,ma
>         vle64.v v10,0(a1)
>         vmv1r.v v0,v6
>         vsetivli        zero,4,e64,m2,ta,ma
>         mv      a5,a4
>         vcompress.vm    v8,v2,v0
>         addi    t1,t1,64
>         vcompress.vm    v2,v10,v0
>         vslideup.vi     v2,v8,2
>         vand.vv v0,v2,v4
>         vxor.vv v2,v2,v12
>         vmseq.vv        v0,v0,v4
>         bleu    a4,t4,.L78
>         li      a5,4
>
> For the example you provided, gcc-trunk is indeed capable of using a
> strided load without adding the option '-mno-autovec-segment', but it
> still generates the vcompress instruction. For 462.libquantum, the
> performance issue I mentioned can be reproduced with
> '-march=rv64gcv_zvl*b -mrvv-vector-bits=zvl -mno-autovec-segment
> -mrvv-max-lmul=m2/dynamic' (https://godbolt.org/z/eWzT99n5T).
>
> I think this might not be entirely due to the RISC-V machine I'm using?

The code doesn't look terrible to me. Can you explain why it is slower, and compared against what? The vcompress instructions here are our "zip" replacement. Using them was done in the hope that vcompress is a single-cycle instruction.
