https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120687
            Bug ID: 120687
           Summary: RISC-V: very poor vector code gen for LMbench bw_mem
                    test case
           Product: gcc
           Version: 16.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: bergner at gcc dot gnu.org
  Target Milestone: ---

We've seen some VERY poor vector code gen for the bw_mem test case in
LMbench for certain numbers of loads in the loop.  I have extracted a
simple test case that shows the issue.  For smallish numbers of loads,
I get what I would generally expect:

linux%~:PR$ cat bw_mem_8.c
int
frd (int *p, int *lastone)
{
  int sum = 0;
  for (; p <= lastone; p += 8)
    sum += p[0] + p[1] + p[2] + p[3] + p[4] + p[5] + p[6] + p[7];
  return sum;
}

linux%~$ riscv64-unknown-linux-gnu-gcc -S -O3 -march=rv64gcv bw_mem_8.c
linux%~$ cat bw_mem_8.s
[snip] looking at just main loop body...
.L3:
	vsetvli	a5,a4,e32,m1,tu,ma
	vlseg8e32.v	v8,(a0)
	slli	a3,a5,5
	sub	a4,a4,a5
	add	a0,a0,a3
	vadd.vv	v1,v9,v8
	vadd.vv	v1,v1,v10
	vadd.vv	v1,v1,v11
	vadd.vv	v1,v1,v12
	vadd.vv	v1,v1,v13
	vadd.vv	v1,v1,v14
	vadd.vv	v1,v1,v15
	vadd.vv	v2,v1,v2
	bne	a4,zero,.L3
[snip]

If I double the number of loads and update the loop increment to match,
I see:

linux%~$ cat bw_mem_16.c
int
frd (int *p, int *lastone)
{
  int sum = 0;
  for (; p <= lastone; p += 16)
    sum += p[0] + p[1] + p[2] + p[3] + p[4] + p[5] + p[6] + p[7]
         + p[8] + p[9] + p[10] + p[11] + p[12] + p[13] + p[14] + p[15];
  return sum;
}

linux%~$ riscv64-unknown-linux-gnu-gcc -S -O3 -march=rv64gcv bw_mem_16.c
linux%~$ cat bw_mem_16.s
[snip] looking at just main loop body...
.L4:
	add	s4,a4,a3
	addi	s2,s4,16
	vle32.v	v11,0(s2)
	vmv1r.v	v0,v2
	vle32.v	v13,0(s4)
	addi	s1,s4,32
	vle32.v	v19,0(s1)
	addi	s0,s4,48
	vcompress.vm	v21,v11,v0
	vmv1r.v	v0,v1
	vle32.v	v10,0(s0)
	addi	s5,s4,64
	vcompress.vm	v20,v11,v0
	vmv1r.v	v0,v2
	vle32.v	v18,0(s5)
	addi	t2,s4,80
	vcompress.vm	v11,v13,v0
	vmv1r.v	v0,v1
	vle32.v	v9,0(t2)
	vslideup.vi	v11,v21,2
	vcompress.vm	v15,v13,v0
	vmv1r.v	v0,v2
	addi	t0,s4,96
	vslideup.vi	v15,v20,2
	vcompress.vm	v13,v19,v0
	vmv1r.v	v0,v1
	vle32.v	v14,0(t0)
	addi	t6,s4,112
	vcompress.vm	v21,v19,v0
	vmv1r.v	v0,v2
	vle32.v	v8,0(t6)
	addi	t5,s4,128
	vcompress.vm	v20,v10,v0
	vmv1r.v	v0,v1
	vle32.v	v17,0(t5)
[snip] ...this goes on for many pages!
	vcompress.vm	v8,v7,v0
	vslideup.vi	v10,v9,2
	vcompress.vm	v7,v6,v0
	vadd.vv	v3,v3,v12
	vcompress.vm	v6,v5,v0
	vslideup.vi	v8,v7,2
	vadd.vv	v3,v3,v10
	vcompress.vm	v7,v4,v0
	vcompress.vm	v5,v24,v0
	vadd.vv	v3,v3,v8
	vslideup.vi	v6,v7,2
	vcompress.vm	v7,v22,v0
	vcompress.vm	v4,v20,v0
	vadd.vv	v3,v3,v6
	vslideup.vi	v5,v7,2
	vcompress.vm	v6,v18,v0
	vadd.vv	v3,v3,v5
	vslideup.vi	v4,v6,2
	vadd.vv	v3,v3,v4
	vle32.v	v4,0(sp)
	vadd.vv	v3,v4,v3
	vse32.v	v3,0(sp)
	bne	a3,s3,.L4
[snip] end of main loop.

Counting the insns in the loop, I'm seeing over 20 times as many
instructions as in the loop of the 8-element test case!  The original
bw_mem test case in LMbench does 128 loads within the loop, which
exacerbates the issue even further.

I'm marking this as a target bug for now until we know more...