https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122152
--- Comment #3 from Robin Dapp <rdapp at gcc dot gnu.org> ---
(In reply to Lin Li from comment #1)
> This is indeed a very good idea. However, I wonder if you have noticed that
> if this is done, certain scenarios will generate a lot of vcompress
> instructions, which will actually make the performance extremely poor. I
> encountered this problem on both the SG2044 and my own RISC-V platform.
>
> For 462.libquantum, doing this would result in a performance drop of around
> 60%. The fundamental reason is that the execution efficiency of vcompress
> is too low.
>
> .L77:
>         sub     a5,a5,a7
>         vsetvli zero,a5,e64,m2,ta,ma
>         vle64.v v2,0(t1)
>         vsetvli zero,a7,e64,m2,ta,ma
>         vle64.v v10,0(a1)
>         vmv1r.v v0,v6
>         vsetivli        zero,4,e64,m2,ta,ma
>         mv      a5,a4
>         vcompress.vm    v8,v2,v0
>         addi    t1,t1,64
>         vcompress.vm    v2,v10,v0
>         vslideup.vi     v2,v8,2
>         vand.vv v0,v2,v4
>         vxor.vv v2,v2,v12
>         vmseq.vv        v0,v0,v4
>         bleu    a4,t4,.L78
>         li      a5,4
>
> For the example you provided, gcc-trunk is indeed capable of using a
> strided load without adding the option '-mno-autovec-segment', but it
> still generates the vcompress instruction. For 462.libquantum, the
> performance issue I mentioned can be reproduced with
> '-march=rv64gcv_zvl*b -mrvv-vector-bits=zvl -mno-autovec-segment
> -mrvv-max-lmul=m2/dynamic' (https://godbolt.org/z/eWzT99n5T).
>
> I think this might not be entirely due to the RISC-V machine I'm using?

The code doesn't look terrible to me. Can you explain why it is slower, and compared against what? The vcompress instructions here are our "zip" replacement. Using them was done in the hope that vcompress is a single-cycle instruction.
