On Wed, Aug 26, 2020 at 12:34 PM Prathamesh Kulkarni via Gcc <gcc@gcc.gnu.org> wrote:
>
> Hi,
> We're seeing a consistent regression >10% on calculix with -O2 -flto vs -O2
> on aarch64 in our validation CI. I tried to investigate this issue a
> bit, and it seems the regression comes from inlining of orthonl into
> e_c3d. Disabling that brings back the performance. However, inlining
> orthonl into e_c3d, increases its size from 3187 to 3837 by around
> 16.9% which isn't too large.
>
> I have attached two test-cases, e_c3d.f that has orthonl manually
> inlined into e_c3d to "simulate" LTO's inlining, and e_c3d-orig.f,
> which contains unmodified function.
> (gauss.f is included by e_c3d.f). For reproducing, just passing -O2 is
> sufficient.
>
> It seems that inlining orthonl, causes 20 hoistings into block 181,
> which are then hoisted to block 173, in particular hoistings of
> w(1, 1) ... w(3, 3), which wasn't
> possible without inlining. The hoistings happen because of basic block
> that computes orthonl in line 672 has w(1, 1) ... w(3, 3) and the
> following block in line 1035 in e_c3d.f:
>
>       senergy=
>      & (s11*w(1,1)+s12*(w(1,2)+w(2,1))
>      & +s13*(w(1,3)+w(3,1))+s22*w(2,2)
>      & +s23*(w(2,3)+w(3,2))+s33*w(3,3))*weight
>
> Disabling hoisting into blocks 173 (and 181), brings back most of the
> performance. I am not able to understand why (if?) these hoistings of
> w(1, 1) ... w(3, 3) are causing slowdown however.
> Looking at assembly, the hot
> code-path from perf in e_c3d shows following code-gen diff:
> For inlined version:
> .L122:
>         ldr     d15, [x1, -248]
>         add     w0, w0, 1
>         add     x2, x2, 24
>         add     x1, x1, 72
>         fmul    d15, d17, d15
>         fmul    d15, d15, d18
>         fmul    d14, d15, d14
>         fmadd   d16, d14, d31, d16
>         cmp     w0, 4
>         beq     .L121
>         ldr     d14, [x2, -8]
>         b       .L122
>
> and for non-inlined version:
> .L118:
>         ldr     d0, [x1, -248]
>         add     w0, w0, 1
>         ldr     d2, [x2, -8]
>         add     x1, x1, 72
>         add     x2, x2, 24
>         fmul    d0, d3, d0
>         fmul    d0, d0, d5
>         fmul    d0, d0, d2
>         fmadd   d1, d4, d0, d1
>         cmp     w0, 4
>         bne     .L118

I wonder if you have profiles.  The inlined version has a non-empty latch
block (looks like some PRE is happening there?).  Perhaps your uarch
does not like the closely spaced branches (does your assembly show the
layout as it is)?

> which corresponds to the following loop in line 1014.
>       do n1=1,3
>         s(iii1,jjj1)=s(iii1,jjj1)
>      &    +anisox(m1,k1,n1,l1)
>      &    *w(k1,l1)*vo(i1,m1)*vo(j1,n1)
>      &    *weight
>
> I am not sure why would hoisting have any direct effect on this loop
> except perhaps that hoisting allocated more registers, and led to
> increased register pressure. Perhaps that's why it's using higher
> numbered regs for code-gen in inlined version? However disabling
> hoisting in blocks 173 and 181, also leads to overall 6 extra spills
> (by grepping for str to sp), so
> hoisting is also helping here? I am not sure how to proceed further,
> and would be grateful for suggestions.
>
> Thanks,
> Prathamesh
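
To make the latch question above concrete, here is a rough C sketch of the
two loop shapes that the quoted assembly seems to show.  All names
(loop_single_block, loop_with_latch, a, b, scale) are made up for
illustration and nothing here is taken from calculix; C is used only
because it can spell out the control flow explicitly.  An optimizing
compiler is free to turn either form into the other, so this illustrates
the shapes only and is not expected to reproduce the regression.

```c
#include <assert.h>

/* Shape of the non-inlined loop (cf. .L118): a single basic block,
   both loads in the body, one backward conditional branch.
   Hypothetical helper, requires n >= 1. */
static double loop_single_block(const double *a, const double *b,
                                double scale, int n)
{
    double acc = 0.0;
    int i = 0;

    assert(n >= 1);
    do {
        acc += scale * a[i] * b[i];   /* both loads happen in the body */
        i++;
    } while (i != n);                 /* like "bne .L118" */
    return acc;
}

/* Shape of the inlined loop (cf. .L122): the load of b[] is done once
   before the loop and again in the latch, so every iteration runs a
   conditional exit branch followed by an unconditional back edge.
   Hypothetical helper, requires n >= 1. */
static double loop_with_latch(const double *a, const double *b,
                              double scale, int n)
{
    double acc = 0.0;
    double bi;
    int i = 0;

    assert(n >= 1);
    bi = b[0];                        /* first load, ahead of the loop */
    for (;;) {
        acc += scale * a[i] * bi;
        i++;
        if (i == n)
            break;                    /* exit test, like "beq .L121" */
        bi = b[i];                    /* latch block, like "ldr d14, [x2, -8]" */
    }                                 /* back edge, like "b .L122" */
    return acc;
}
```

Both functions do the same arithmetic in the same order, so they return the
same value for the same inputs; only the branch layout differs, which is
what a profile of the two builds would need to tell apart.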