> I obtained perf stat results for the following benchmark runs:
>
> -O2:
>
>     7856832.692380      task-clock (msec)   #  1.000 CPUs utilized
>               3758      context-switches    #  0.000 K/sec
>                 40      cpu-migrations      #  0.000 K/sec
>              40847      page-faults         #  0.005 K/sec
>      7856782413676      cycles              #  1.000 GHz
>      6034510093417      instructions        #  0.77  insn per cycle
>       363937274287      branches            # 46.321 M/sec
>        48557110132      branch-misses       # 13.34% of all branches

(ouch, 2+ hours per run is a lot; collecting a profile over a minute should be enough for this kind of code)
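For example, attaching to the already-running benchmark and sampling a fixed interval keeps the collection short; a minimal sketch, where ./benchmark stands in for the actual invocation:

    # start the benchmark, then sample only 60 seconds of it
    ./benchmark &
    perf record -p $! -- sleep 60
    perf report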
> -O2 with orthonl inlined:
>
>     8319643.114380      task-clock (msec)   #  1.000 CPUs utilized
>               4285      context-switches    #  0.001 K/sec
>                 28      cpu-migrations      #  0.000 K/sec
>              40843      page-faults         #  0.005 K/sec
>      8319591038295      cycles              #  1.000 GHz
>      6276338800377      instructions        #  0.75  insn per cycle
>       467400726106      branches            # 56.180 M/sec
>        45986364011      branch-misses       #  9.84% of all branches

So +100e9 branches, but +240e9 instructions and +480e9 cycles, probably implying that the extra instructions are appearing in this loop nest, but not in the innermost loop. As a reminder for others, the innermost loop has only 3 iterations.

> -O2 with orthonl inlined and PRE disabled (this removes the extra branches):
>
>     8207331.088040      task-clock (msec)   #  1.000 CPUs utilized
>               2266      context-switches    #  0.000 K/sec
>                 32      cpu-migrations      #  0.000 K/sec
>              40846      page-faults         #  0.005 K/sec
>      8207292032467      cycles              #  1.000 GHz
>      6035724436440      instructions        #  0.74  insn per cycle
>       364415440156      branches            # 44.401 M/sec
>        53138327276      branch-misses       # 14.58% of all branches

This seems to match the baseline in terms of instruction count, but without PRE the loop nest may be carrying some dependencies over memory. I would simply check the assembly for the entire 6-level loop nest in question; I hope it's not very complicated (though Fortran array addressing...).

> -O2 with orthonl inlined and hoisting disabled:
>
>     7797265.206850      task-clock (msec)   #  1.000 CPUs utilized
>               3139      context-switches    #  0.000 K/sec
>                 20      cpu-migrations      #  0.000 K/sec
>              40846      page-faults         #  0.005 K/sec
>      7797221351467      cycles              #  1.000 GHz
>      6187348757324      instructions        #  0.79  insn per cycle
>       461840800061      branches            # 59.231 M/sec
>        26920311761      branch-misses       #  5.83% of all branches

There's a 20e9 reduction in branch misses and a 500e9 reduction in cycle count. I don't think the former fully accounts for the latter (there's also a 90e9 reduction in insn count). Given that the inner loop iterates only 3 times, my main suggestion is to consider what the profile looks like for the entire loop nest (it's 6 loops deep, each iterating only 3 times).

> Perf profiles for
> -O2 -fno-code-hoisting and inlined orthonl:
> https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
>
>          3196866 │1f04: ldur  d1, [x1, #-248]
>     216348301800 │      add   w0, w0, #0x1
>           985098 │      add   x2, x2, #0x18
>     216215999206 │      add   x1, x1, #0x48
>     215630376504 │      fmul  d1, d5, d1
>     863829148015 │      fmul  d1, d1, d6
>     864228353526 │      fmul  d0, d1, d0
>     864568163014 │      fmadd d2, d0, d16, d2
>                  │      cmp   w0, #0x4
>     216125427594 │    ↓ b.eq  1f34
>         15010377 │      ldur  d0, [x2, #-8]
>     143753737468 │    ↑ b     1f04
>
> -O2 with inlined orthonl:
> https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
>
>     359871503840 │1ef8: ldur  d15, [x1, #-248]
>     144055883055 │      add   w0, w0, #0x1
>      72262104254 │      add   x2, x2, #0x18
>     143991169721 │      add   x1, x1, #0x48
>     288648917780 │      fmul  d15, d17, d15
>     864665644756 │      fmul  d15, d15, d18
>     863868426387 │      fmul  d14, d15, d14
>     865228159813 │      fmadd d16, d14, d31, d16
>           245967 │      cmp   w0, #0x4
>     215396760545 │    ↓ b.eq  1f28
>        704732365 │      ldur  d14, [x2, #-8]
>     143775979620 │    ↑ b     1ef8

This indicates that the loop covers only about 46-48% of overall time. The high count on the initial ldur instruction could be explained if the loop is not entered by "fallthru" from the preceding block, or if its backedge is mispredicted. Sampling mispredictions should be possible with perf record, and you may be able to check whether loop entry is fallthrough by inspecting the assembly. It may also be possible to check whether code alignment matters, by compiling with -falign-loops=32.
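Something along these lines should work for both checks; a sketch, where ./benchmark and the gfortran invocation are placeholders for the actual build and run commands:

    # sample on mispredicted branches rather than cycles,
    # then see where the misses cluster in the annotated assembly
    perf record -e branch-misses ./benchmark
    perf annotate

    # rebuild with 32-byte loop alignment to test whether
    # code placement explains the difference
    gfortran -O2 -falign-loops=32 -o benchmark ...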
Alexander