On Wed, Aug 26, 2020 at 12:34 PM Prathamesh Kulkarni via Gcc
<gcc@gcc.gnu.org> wrote:
>
> Hi,
> We're seeing a consistent regression >10% on calculix with -O2 -flto vs -O2
> on aarch64 in our validation CI. I tried to investigate this issue a
> bit, and it seems the regression comes from inlining of orthonl into
> e_c3d. Disabling that brings back the performance. However, inlining
> orthonl into e_c3d increases its size from 3187 to 3837, by around
> 16.9%, which isn't too large.
>
> I have attached two test-cases, e_c3d.f that has orthonl manually
> inlined into e_c3d to "simulate" LTO's inlining, and e_c3d-orig.f,
> which contains the unmodified function.
> (gauss.f is included by e_c3d.f). To reproduce, just passing -O2 is
> sufficient.
>
> It seems that inlining orthonl causes 20 hoistings into block 181,
> which are then hoisted to block 173, in particular hoistings of
> w(1, 1) ... w(3, 3), which weren't possible without inlining. The
> hoistings happen because the basic block that computes orthonl in
> line 672 uses w(1, 1) ... w(3, 3), and so does the following block in
> line 1035 in e_c3d.f:
>
> senergy=
>      &                    (s11*w(1,1)+s12*(w(1,2)+w(2,1))
>      &                    +s13*(w(1,3)+w(3,1))+s22*w(2,2)
>      &                    +s23*(w(2,3)+w(3,2))+s33*w(3,3))*weight
>
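
As a rough illustration of the transformation being described (a
hypothetical C sketch, not the calculix code; the function and variable
names are made up): when the same load and multiply appear on both paths
leaving a block, code hoisting can compute them once in the common
predecessor. The hoisted value then stays live from the hoist point to
each use, which is one way register pressure could go up.

```c
/* Before hoisting: s * w[0] is evaluated separately on each path. */
static double before_hoist(int c, const double *w, double s)
{
    double r;
    if (c)
        r = s * w[0] + 1.0;
    else
        r = s * w[0] - 1.0;
    return r;
}

/* After hoisting (written out by hand): the common expression is
   computed once in the predecessor block, so t is live across the
   branch until it is used. */
static double after_hoist(int c, const double *w, double s)
{
    double t = s * w[0];   /* hoisted load and multiply */
    return c ? t + 1.0 : t - 1.0;
}
```

Both functions return the same value for any input; only the placement
of the computation, and so the live range of the temporary, differs.
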
> Disabling hoisting into blocks 173 (and 181) brings back most of the
> performance. However, I am not able to understand why (or whether) these
> hoistings of w(1, 1) ... w(3, 3) are causing a slowdown. Looking at the
> assembly, the hot code path from perf in e_c3d shows the following
> code-gen diff:
> For inlined version:
> .L122:
>         ldr     d15, [x1, -248]
>         add     w0, w0, 1
>         add     x2, x2, 24
>         add     x1, x1, 72
>         fmul    d15, d17, d15
>         fmul    d15, d15, d18
>         fmul    d14, d15, d14
>         fmadd   d16, d14, d31, d16
>         cmp     w0, 4
>         beq     .L121
>         ldr     d14, [x2, -8]
>         b       .L122
>
> and for non-inlined version:
> .L118:
>         ldr     d0, [x1, -248]
>         add     w0, w0, 1
>         ldr     d2, [x2, -8]
>         add     x1, x1, 72
>         add     x2, x2, 24
>         fmul    d0, d3, d0
>         fmul    d0, d0, d5
>         fmul    d0, d0, d2
>         fmadd   d1, d4, d0, d1
>         cmp     w0, 4
>         bne     .L118

I wonder if you have profiles.  The inlined version has a
non-empty latch block (looks like some PRE is happening
there?).  Perhaps your uarch does not like the closely spaced
branches (does your assembly show the layout as it is?).
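
To make the latch point concrete, here is a hand-written C sketch of the
two loop shapes (hypothetical names, only mirroring the structure of the
two assembly listings, not generated from e_c3d.f). In the first, every
load sits in the loop body and the back edge is the conditional branch
itself. In the second, the load of b[] for the next iteration lives in
the latch, so each iteration executes an exit test followed by an
unconditional branch back.

```c
/* Single-block loop: empty latch, bottom-tested (like .L118). */
static double sum_simple(const double *a, const double *b, int n)
{
    double acc = 0.0;
    for (int i = 0; i < n; i++)
        acc += a[i] * b[i];
    return acc;
}

/* Same computation with the b[] load moved: the first load happens
   before the loop, later loads happen in the latch (like .L122). */
static double sum_latch(const double *a, const double *b, int n)
{
    double acc = 0.0;
    if (n <= 0)
        return acc;
    double bv = b[0];          /* load for the first iteration */
    int i = 0;
    for (;;) {
        acc += a[i] * bv;
        i++;
        if (i == n)            /* exit test in the middle of the loop */
            break;
        bv = b[i];             /* latch: load for the next iteration */
    }
    return acc;
}
```

Both compute the same sum; whether the second shape is actually slower
would depend on the core, which is why profiles would help.
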

> which corresponds to the following loop in line 1014.
>                                 do n1=1,3
>                                   s(iii1,jjj1)=s(iii1,jjj1)
>      &                                  +anisox(m1,k1,n1,l1)
>      &                                  *w(k1,l1)*vo(i1,m1)*vo(j1,n1)
>      &                                  *weight
>
> I am not sure why hoisting would have any direct effect on this loop,
> except perhaps that hoisting allocated more registers and led to
> increased register pressure. Perhaps that's why it's using
> higher-numbered regs for code-gen in the inlined version? However,
> disabling hoisting in blocks 173 and 181 also leads to 6 extra spills
> overall (counted by grepping for str to sp), so hoisting is also
> helping here? I am not sure how to proceed further, and would be
> grateful for suggestions.
>
> Thanks,
> Prathamesh
