https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117733
Bug ID: 117733
Summary: RISC-V SPEC2017 503.bwaves Inefficient fortran multi-dimensional array access
Product: gcc
Version: 15.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: middle-end
Assignee: unassigned at gcc dot gnu.org
Reporter: vineetg at gcc dot gnu.org
CC: jeffreyalaw at gmail dot com, rdapp at gcc dot gnu.org
Target Milestone: ---

bwaves/cam4 have a bunch of Fortran multi-dimensional array accesses and nested loops to traverse them, and the vector codegen doesn't seem pretty and/or efficient.

I have a reduced test:

      subroutine shell(q,nx)
      implicit none
      integer nx,ny,nzl
      real(kind=8) q(5,nx)
      real(kind=8) dqnorm
      integer l,i
      dqnorm = 0.0d0
      do i=1,nx
         do l=1,5
            dqnorm = dqnorm + q(l,i)*q(l,i)
         enddo
      enddo
      call use_val(dqnorm)
      return
      end

-Ofast -ftree-vectorize -march=rv64gcv_zvl256b_zba_zbb_zbs -mrvv-vector-bits=zvl -mabi=lp64d

The relevant output is not efficient:

        vsetivli        zero,4,e64,m1,ta,ma
        vmv.v.i         v2,0
        sh2add          a4,a4,a4
        addi            t5,a0,32
        vmv1r.v         v3,v2
        vmv1r.v         v4,v2
        vmv1r.v         v1,v2
        vmv1r.v         v5,v2
        addi            t4,a0,64
        addi            t3,a0,96
        addi            t1,a0,128
        li              t6,20
        li              a3,4
.L3:
        minu            a5,a4,t6
        minu            a7,a5,a3
        sub             a5,a5,a7
        minu            a6,a5,a3
        sub             a5,a5,a6
        minu            a1,a5,a3
        sub             a5,a5,a1
        vsetvli         zero,a7,e64,m1,ta,ma
        vle64.v         v10,0(a0)
        minu            a2,a5,a3
        vsetvli         zero,a6,e64,m1,ta,ma
        vle64.v         v9,0(t5)
        sub             a5,a5,a2
        vsetvli         zero,a1,e64,m1,ta,ma
        vle64.v         v8,0(t4)
        vsetvli         zero,a5,e64,m1,ta,ma
        vle64.v         v7,0(t1)
        vsetvli         zero,a2,e64,m1,ta,ma
        vle64.v         v6,0(t3)
        vsetvli         zero,a7,e64,m1,tu,ma
        vfmacc.vv       v5,v10,v10
        vsetvli         zero,a6,e64,m1,tu,ma
        vfmacc.vv       v1,v9,v9
        vsetvli         zero,a1,e64,m1,tu,ma
        vfmacc.vv       v4,v8,v8
        vsetvli         zero,a5,e64,m1,tu,ma
        vfmacc.vv       v2,v7,v7
        mv              t0,a4
        vsetvli         zero,a2,e64,m1,tu,ma
        vfmacc.vv       v3,v6,v6
        addi            a0,a0,160
        addi            t5,t5,160
        addi            t4,t4,160
        addi            t1,t1,160
        addi            t3,t3,160
        addi            a4,a4,-20
        bgtu            t0,t6,.L3
        ...

(1) There is a VLE64 per element fetch / loop entry unrolled (but Fortran is column-major, and the elements accessed in the inner loop are consecutive in memory).

(2) VL is used for predication, so at runtime some of these instructions will hit VL=0, which might be costly on some uarches.

(3) All the loads come first, followed by all the MAC ops, vs. batching similar ops under the same VL.
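To make the layout argument in (1) concrete, here is a minimal C sketch (not part of the reduced test; the function names and the NL constant are made up for illustration). It assumes the usual Fortran column-major storage, so q(l,i) of q(5,nx) sits at zero-based index (l-1) + 5*(i-1) of one contiguous buffer of 5*nx doubles. Under that assumption the nested loop visits memory in exactly the same order as a single flat loop over 5*nx elements.

```c
#include <stddef.h>

/* Hypothetical stand-in for the leading dimension of q(5,nx). */
enum { NL = 5 };

/* Nested form, mirroring the Fortran loop nest:
   q(l,i) maps to q[l + NL * i] with zero-based l and i. */
double sum_sq_nested(const double *q, size_t nx)
{
    double dqnorm = 0.0;
    for (size_t i = 0; i < nx; i++)
        for (size_t l = 0; l < NL; l++)
            dqnorm += q[l + NL * i] * q[l + NL * i];
    return dqnorm;
}

/* Flat form: the same elements in the same order, as one
   unit-stride reduction over NL * nx doubles. */
double sum_sq_flat(const double *q, size_t nx)
{
    double dqnorm = 0.0;
    for (size_t k = 0; k < (size_t)NL * nx; k++)
        dqnorm += q[k] * q[k];
    return dqnorm;
}
```

Both functions accumulate in the same order, so they return the same value for the same input. Whether the vectorizer could or should collapse the nest this way is a separate question; the sketch only shows that the access pattern is unit-stride end to end.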