Re: LTO slows down calculix by more than 10% on aarch64

Prathamesh Kulkarni via Gcc Mon, 21 Sep 2020 05:45:42 -0700

On Mon, 21 Sep 2020 at 15:19, Prathamesh Kulkarni
<prathamesh.kulka...@linaro.org> wrote:
>
> On Fri, 4 Sep 2020 at 17:08, Alexander Monakov <amona...@ispras.ru> wrote:
> >
> > > I obtained perf stat results for following benchmark runs:
> > >
> > > -O2:
> > >
> > >     7856832.692380      task-clock (msec)         #    1.000 CPUs utilized
> > >               3758               context-switches          #    0.000 
> > > K/sec
> > >                 40                 cpu-migrations             #    0.000 
> > > K/sec
> > >              40847              page-faults                   #    0.005 
> > > K/sec
> > >      7856782413676      cycles                           #    1.000 GHz
> > >      6034510093417      instructions                   #    0.77  insn 
> > > per cycle
> > >       363937274287       branches                       #   46.321 M/sec
> > >        48557110132       branch-misses                #   13.34% of all 
> > > branches
> >
> > (ouch, 2+ hours per run is a lot, collecting a profile over a minute should 
> > be
> > enough for this kind of code)
> >
> > > -O2 with orthonl inlined:
> > >
> > >     8319643.114380      task-clock (msec)       #    1.000 CPUs utilized
> > >               4285               context-switches         #    0.001 K/sec
> > >                 28                 cpu-migrations            #    0.000 
> > > K/sec
> > >              40843              page-faults                  #    0.005 
> > > K/sec
> > >      8319591038295      cycles                          #    1.000 GHz
> > >      6276338800377      instructions                  #    0.75  insn per 
> > > cycle
> > >       467400726106       branches                      #   56.180 M/sec
> > >        45986364011        branch-misses              #    9.84% of all 
> > > branches
> >
> > So +100e9 branches, but +240e9 instructions and +480e9 cycles, probably 
> > implying
> > that extra instructions are appearing in this loop nest, but not in the 
> > innermost
> > loop. As a reminder for others, the innermost loop has only 3 iterations.
> >
> > > -O2 with orthonl inlined and PRE disabled (this removes the extra 
> > > branches):
> > >
> > >    8207331.088040      task-clock (msec)   #    1.000 CPUs utilized
> > >               2266               context-switches    #    0.000 K/sec
> > >                 32                 cpu-migrations       #    0.000 K/sec
> > >              40846              page-faults             #    0.005 K/sec
> > >      8207292032467      cycles                     #   1.000 GHz
> > >      6035724436440      instructions             #    0.74  insn per cycle
> > >       364415440156       branches                 #   44.401 M/sec
> > >        53138327276        branch-misses         #   14.58% of all branches
> >
> > This seems to match baseline in terms of instruction count, but without PRE
> > the loop nest may be carrying some dependencies over memory. I would simply
> > check the assembly for the entire 6-level loop nest in question, I hope it's
> > not very complicated (though Fortran array addressing...).
> >
> > > -O2 with orthonl inlined and hoisting disabled:
> > >
> > >    7797265.206850      task-clock (msec)         #    1.000 CPUs utilized
> > >               3139              context-switches          #    0.000 K/sec
> > >                 20                cpu-migrations             #    0.000 
> > > K/sec
> > >              40846              page-faults                  #    0.005 
> > > K/sec
> > >      7797221351467      cycles                          #    1.000 GHz
> > >      6187348757324      instructions                  #    0.79  insn per 
> > > cycle
> > >       461840800061       branches                      #   59.231 M/sec
> > >        26920311761        branch-misses             #    5.83% of all 
> > > branches
> >
> > There's a 20e9 reduction in branch misses and a 500e9 reduction in cycle 
> > count.
> > I don't think the former fully covers the latter (there's also a 90e9 
> > reduction
> > in insn count).
> >
> > Given that the inner loop iterates only 3 times, my main suggestion is to
> > consider how the profile for the entire loop nest looks like (it's 6 loops 
> > deep,
> > each iterating only 3 times).
> >
> > > Perf profiles for
> > > -O2 -fno-code-hoisting and inlined orthonl:
> > > https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
> > >
> > >           3196866 |1f04:    ldur   d1, [x1, #-248]
> > > 216348301800│            add    w0, w0, #0x1
> > >             985098 |            add    x2, x2, #0x18
> > > 216215999206│            add    x1, x1, #0x48
> > > 215630376504│            fmul   d1, d5, d1
> > > 863829148015│            fmul   d1, d1, d6
> > > 864228353526│            fmul   d0, d1, d0
> > > 864568163014│            fmadd  d2, d0, d16, d2
> > >                         │             cmp    w0, #0x4
> > > 216125427594│          ↓ b.eq   1f34
> > >         15010377│             ldur   d0, [x2, #-8]
> > > 143753737468│          ↑ b      1f04
> > >
> > > -O2 with inlined orthonl:
> > > https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
> > >
> > > 359871503840│ 1ef8:   ldur   d15, [x1, #-248]
> > > 144055883055│            add    w0, w0, #0x1
> > >   72262104254│            add    x2, x2, #0x18
> > > 143991169721│            add    x1, x1, #0x48
> > > 288648917780│            fmul   d15, d17, d15
> > > 864665644756│            fmul   d15, d15, d18
> > > 863868426387│            fmul   d14, d15, d14
> > > 865228159813│            fmadd  d16, d14, d31, d16
> > >             245967│            cmp    w0, #0x4
> > > 215396760545│         ↓ b.eq   1f28
> > >       704732365│            ldur   d14, [x2, #-8]
> > > 143775979620│         ↑ b      1ef8
> >
> > This indicates that the loop only covers about 46-48% of overall time.
> >
> > High count on the initial ldur instruction could be explained if the loop
> > is not entered by "fallthru" from the preceding block, or if its backedge
> > is mispredicted. Sampling mispredictions should be possible with perf 
> > record,
> > and you may be able to check if loop entry is fallthrough by inspecting
> > assembly.
> >
> > It may also be possible to check if code alignment matters, by compiling 
> > with
> > -falign-loops=32.
> Hi,
> Thanks a lot for the detailed feedback, and I am sorry for late response.
>
> The hoisting region is:
> if(mattyp.eq.1) then
>   4 loops
> elseif(mattyp.eq.2) then
>   {
>      orthonl inlined into basic block;
>      loads w[0] .. w[8]
>   }
> else
>    6 loops  // load anisox
>
> followed by basic block:
>
>  senergy=
>      &                    (s11*w(1,1)+s12*(w(1,2)+w(2,1))
>      &                    +s13*(w(1,3)+w(3,1))+s22*w(2,2)
>      &                    +s23*(w(2,3)+w(3,2))+s33*w(3,3))*weight
>                      s(ii1,jj1)=s(ii1,jj1)+senergy
>                      s(ii1+1,jj1+1)=s(ii1+1,jj1+1)+senergy
>                      s(ii1+2,jj1+2)=s(ii1+2,jj1+2)+senergy
>
> Hoisting hoists loads w[0] .. w[8] from orthonl and senergy block,
> right in block 181, which is:
> if (mattyp.eq.2) goto <bb 182> else goto <bb 193>
>
> which is then further hoisted to block 173:
> if (mattyp.eq.1) goto <bb 392> else goto <bb 181>
>
> From block 181, we have two paths towards senergy block (bb 194):
> bb 181 -> bb 182 (orthonl block) -> bb 194 (senergy block)
> AND
> bb 181 -> bb 392 <6 loops pre-header> ... -> bb 194
> which has a path length of around 18 blocks.
> (bb 194 post-dominates bb 181 and bb 173).
>
> Disabling only load hoisting within blocks 173 and 181
> (simply avoid inserting pre_expr if pre_expr->kind == REFERENCE),
> avoid hoisting of 'w' array and brings back most of performance. Which
> verifies that it is hoisting of the
> 'w' array (w[0] ... w[8]), which is causing the slowdown ?
>
> I obtained perf profiles for full hoisting, and disabled hoisting of
> 'w' array for the 6 loops, and the most drastic difference was
> for ldur instruction:
>
> With full hoisting:
> 359871503840│ 1ef8:   ldur   d15, [x1, #-248]
>
> Without full hoisting:
> 3441224 │1edc:   ldur   d1, [x1, #-248]
>
> (The loop entry seems to be fall thru in both cases. I have attached
> profiles for both cases).
>
> IIUC, the instruction seems to be loading the first element from anisox array,
> which makes me wonder if the issue was with data-cache miss for slower 
> version.
> I ran perf script on perf data for L1-dcache-load-misses with period = 
> 1million,
> and it reported two cache misses on the ldur instruction in full hoisting 
> case,
> while it reported zero for the disabled load hoisting case.
> So I wonder if the slowdown happens because hoisting of 'w' array
> possibly results
> in eviction of anisox thus causing a cache miss inside the inner loop
> and making load slower ?
>
> Hoisting also seems to improve the number of overall cache misses tho.
> For disabled hoisting  of 'w' array case, there were a total of 463
> cache misses, while with full hoisting there were 357 cache misses
> (with period = 1 million).
> Does that happen because hoisting probably reduces cache misses along
> the orthonl path (bb 173 - > bb 181 -> bb 182 -> bb 194) ?
Hi,
In general I feel for this or PR80155 case, the issues come with long
range hoistings, inside a large CFG, since we don't have an accurate
way to model target resources (register pressure in PR80155 case / or
possibly cache pressure in this case?) at tree level and we end up
with register spill or cache miss inside loops, which may offset the
benefit of hoisting. As previously discussed the right way to go is a
live range splitting pass, at GIMPLE -> RTL border which can also help
with other code-movement optimizations (or if the source had variables
with long live ranges).


I was wondering tho as a cheap workaround, would it make sense to
check if we are hoisting across a "large" region of nested loops, and
avoid in that case since hoisting may exert resource pressure inside
loop region ? (Especially, in the cases where hoisted expressions were
not originally AVAIL in any of the loop blocks, and the loop region
doesn't benefit from hoisting).

For instance:
FOR_EACH_EDGE (e, ei, block)
  {
    /* Avoid hoisting across more than 3 nested loops */
    if (e->dest is a loop pre-header or loop header
        && nesting depth of loop is > 3)
     return false;
  }

I think this would work for resolving the calculix issue because it
hoists across one region of 6 loops and another of 4 loops (didn' test
yet).
It's not bulletproof in that it will miss detecting cases where loop
header (or pre-header) isn't a successor of candidate block (checking
for
that might get expensive tho?). I will test it on gcc suite and SPEC
for any regressions.
Does this sound like a reasonable heuristic ?

Thanks,
Prathamesh



Thanks,
Prathamesh
>
> Thanks,
> Prathamesh
> >
> > Alexander

Re: LTO slows down calculix by more than 10% on aarch64

Reply via email to