https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114932
Bug ID: 114932
Summary: Improvement in CHREC can give large performance gains
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: tnfchris at gcc dot gnu.org
CC: rguenth at gcc dot gnu.org
Target Milestone: ---

With the original fix from PR114074 applied (e.g.
g:a0b1798042d033fd2cc2c806afbb77875dd2909b) we saw not only regressions but
also big improvements.

The following testcase:

---
module brute_force
  integer, parameter :: r=9
  integer block(r, r, r)
contains
  subroutine brute
    k = 1
    call digits_2(k)
  end
  recursive subroutine digits_2(row)
    integer, intent(in) :: row
    logical OK
    do i1 = 0, 1
      do i2 = 1, 1
        do i3 = 1, 1
          do i4 = 0, 1
            do i5 = 1, select
              do i6 = 0, 1
                do i7 = l0, u0
                  select case(1)
                  case(1)
                    block(:2, 7:, i7) = block(:2, 7:, i7) - 1
                  end select
                  do i8 = 1, 1
                    do i9 = 1, 1
                      if(row == 5) then
                      elseif(OK) then
                        call digits_2(row + 1)
                      end if
                    end do
                  end do
                  block(:, 1, i7) = select
                end do
              end do
            end do
          end do
          block = 1
        end do
        block = 1
        block = block0 + select
      end do
    end do
  end
end
---

compiled with: -mcpu=neoverse-v1 -Ofast -fomit-frame-pointer foo.f90

gets vectorized after sra and constprop. But the final addressing modes are so
complicated that IVopts generates a register offset mode:

  4c: 2f00041d  mvni  v29.2s, #0x0
  50: fc666842  ldr   d2, [x2, x6]
  54: fc656841  ldr   d1, [x2, x5]
  58: fc646840  ldr   d0, [x2, x4]
  5c: 0ebd8442  add   v2.2s, v2.2s, v29.2s
  60: 0ebd8421  add   v1.2s, v1.2s, v29.2s
  64: 0ebd8400  add   v0.2s, v0.2s, v29.2s

which is harder for prefetchers to follow.
When the patch was applied it was able to correctly lower these to the
immediate-offset loads that the scalar code was using:

  38: 2f00041d  mvni  v29.2s, #0x0
  34: fc594002  ldur  d2, [x0, #-108]
  40: fc5b8001  ldur  d1, [x0, #-72]
  44: fc5dc000  ldur  d0, [x0, #-36]
  48: 0ebd8442  add   v2.2s, v2.2s, v29.2s
  4c: 0ebd8421  add   v1.2s, v1.2s, v29.2s
  50: 0ebd8400  add   v0.2s, v0.2s, v29.2s

and also removes all the additional instructions needed to keep x6, x5 and x4
up to date. This gave 10%+ improvements on various workloads.

(PS: I'm looking at the __brute_force_MOD_digits_2.constprop.3.isra.0
specialization.)

I will try to reduce it further, but am filing this so we can keep track and
hopefully fix it.