> I obtained perf stat results for following benchmark runs:
> 
> -O2:
> 
>     7856832.692380      task-clock (msec)     #    1.000 CPUs utilized
>               3758      context-switches      #    0.000 K/sec
>                 40      cpu-migrations        #    0.000 K/sec
>              40847      page-faults           #    0.005 K/sec
>      7856782413676      cycles                #    1.000 GHz
>      6034510093417      instructions          #    0.77  insn per cycle
>       363937274287      branches              #   46.321 M/sec
>        48557110132      branch-misses         #   13.34% of all branches

(ouch, 2+ hours per run is a lot; collecting a profile over a minute should be
enough for this kind of code)
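
For reference, a minimal way to sample just a minute of a long run is to
attach perf to the already-running process ("ccx" below is a placeholder for
the actual benchmark process name):

    # attach to the running benchmark and sample cycles for 60 seconds
    perf record -e cycles -p "$(pidof ccx)" -- sleep 60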

> -O2 with orthonl inlined:
> 
>     8319643.114380      task-clock (msec)     #    1.000 CPUs utilized
>               4285      context-switches      #    0.001 K/sec
>                 28      cpu-migrations        #    0.000 K/sec
>              40843      page-faults           #    0.005 K/sec
>      8319591038295      cycles                #    1.000 GHz
>      6276338800377      instructions          #    0.75  insn per cycle
>       467400726106      branches              #   56.180 M/sec
>        45986364011      branch-misses         #    9.84% of all branches

So +100e9 branches, but +240e9 instructions and +460e9 cycles, probably implying
that the extra instructions appear somewhere in this loop nest, but not in the
innermost loop. As a reminder for others, the innermost loop has only 3 iterations.
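
Those deltas come straight from the two perf stat runs above; a quick shell
check, with the numbers copied verbatim:

    # inlined run minus -O2 baseline
    echo $(( 467400726106 - 363937274287 ))    # branches:     ~103e9
    echo $(( 6276338800377 - 6034510093417 ))  # instructions: ~242e9
    echo $(( 8319591038295 - 7856782413676 ))  # cycles:       ~463e9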

> -O2 with orthonl inlined and PRE disabled (this removes the extra branches):
> 
>     8207331.088040      task-clock (msec)     #    1.000 CPUs utilized
>               2266      context-switches      #    0.000 K/sec
>                 32      cpu-migrations        #    0.000 K/sec
>              40846      page-faults           #    0.005 K/sec
>      8207292032467      cycles                #    1.000 GHz
>      6035724436440      instructions          #    0.74  insn per cycle
>       364415440156      branches              #   44.401 M/sec
>        53138327276      branch-misses         #   14.58% of all branches

This seems to match the baseline in terms of instruction count, but without PRE
the loop nest may be carrying some dependencies over memory. I would simply
check the assembly for the entire 6-level loop nest in question; I hope it's
not very complicated (though Fortran array addressing...).
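
If it helps, one way to pull out that assembly (the source file name below is
a guess on my part, and I'm assuming PRE was disabled via -fno-tree-pre):

    # dump assembly for the file containing the loop nest
    gfortran -O2 -fno-tree-pre -S e_c3d.f -o e_c3d.s
    # or disassemble the linked binary and search for the loop labels
    objdump -d ./ccx | less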

> -O2 with orthonl inlined and hoisting disabled:
> 
>     7797265.206850      task-clock (msec)     #    1.000 CPUs utilized
>               3139      context-switches      #    0.000 K/sec
>                 20      cpu-migrations        #    0.000 K/sec
>              40846      page-faults           #    0.005 K/sec
>      7797221351467      cycles                #    1.000 GHz
>      6187348757324      instructions          #    0.79  insn per cycle
>       461840800061      branches              #   59.231 M/sec
>        26920311761      branch-misses         #    5.83% of all branches

There's a 20e9 reduction in branch misses and a 500e9 reduction in cycle count.
I don't think the former fully accounts for the latter (there's also a 90e9
reduction in insn count).
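
A rough cycle accounting supports that, assuming a ~15-cycle misprediction
penalty (a guess for this core, not a measured figure):

    # ~19e9 fewer misses at an assumed ~15-cycle penalty each
    echo $(( (45986364011 - 26920311761) * 15 ))  # ~286e9 cycles, short of the ~522e9 delta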

Given that the inner loop iterates only 3 times, my main suggestion is to
look at what the profile for the entire loop nest looks like (it's 6 loops
deep, each iterating only 3 times).
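
With the profile data you already have, annotating the containing function
would show how samples distribute across the outer loops (the symbol name
below is a placeholder, I don't know the actual name):

    # annotate the function containing the 6-level loop nest
    perf annotate -i perf_O2_inline.data e_c3d_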

> Perf profiles for
> -O2 -fno-code-hoisting and inlined orthonl:
> https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
> 
>      3196866 │1f04:   ldur   d1, [x1, #-248]
> 216348301800 │        add    w0, w0, #0x1
>       985098 │        add    x2, x2, #0x18
> 216215999206 │        add    x1, x1, #0x48
> 215630376504 │        fmul   d1, d5, d1
> 863829148015 │        fmul   d1, d1, d6
> 864228353526 │        fmul   d0, d1, d0
> 864568163014 │        fmadd  d2, d0, d16, d2
>              │        cmp    w0, #0x4
> 216125427594 │      ↓ b.eq   1f34
>     15010377 │        ldur   d0, [x2, #-8]
> 143753737468 │      ↑ b      1f04
> 
> -O2 with inlined orthonl:
> https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
> 
> 359871503840 │1ef8:   ldur   d15, [x1, #-248]
> 144055883055 │        add    w0, w0, #0x1
>  72262104254 │        add    x2, x2, #0x18
> 143991169721 │        add    x1, x1, #0x48
> 288648917780 │        fmul   d15, d17, d15
> 864665644756 │        fmul   d15, d15, d18
> 863868426387 │        fmul   d14, d15, d14
> 865228159813 │        fmadd  d16, d14, d31, d16
>       245967 │        cmp    w0, #0x4
> 215396760545 │      ↓ b.eq   1f28
>    704732365 │        ldur   d14, [x2, #-8]
> 143775979620 │      ↑ b      1ef8

This indicates that the loop accounts for only about 46-48% of overall time.

The high count on the initial ldur instruction could be explained if the loop
is not entered by fall-through from the preceding block, or if its backedge is
mispredicted. Mispredictions can be sampled with perf record, and you can
check whether the loop entry is fall-through by inspecting the assembly.
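
For instance (a sketch; "ccx" again stands in for the actual binary):

    # sample mispredicted branches instead of cycles, then see where they land
    perf record -e branch-misses ./ccx
    perf annotate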

It may also be worth checking whether code alignment matters by recompiling
with -falign-loops=32.
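
i.e. something like (again assuming gfortran and a guessed file name):

    # recompile the hot file with 32-byte loop alignment
    gfortran -O2 -falign-loops=32 -c e_c3d.f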

Alexander
