On Wed, 23 Sep 2020 at 13:22, Richard Biener <richard.guent...@gmail.com> wrote:
>
> On Tue, Sep 22, 2020 at 6:25 PM Prathamesh Kulkarni
> <prathamesh.kulka...@linaro.org> wrote:
> >
> > On Tue, 22 Sep 2020 at 16:36, Richard Biener <richard.guent...@gmail.com> wrote:
> > >
> > > On Tue, Sep 22, 2020 at 11:37 AM Prathamesh Kulkarni
> > > <prathamesh.kulka...@linaro.org> wrote:
> > > >
> > > > On Tue, 22 Sep 2020 at 12:56, Richard Biener <richard.guent...@gmail.com> wrote:
> > > > >
> > > > > On Tue, Sep 22, 2020 at 7:08 AM Prathamesh Kulkarni
> > > > > <prathamesh.kulka...@linaro.org> wrote:
> > > > > >
> > > > > > On Mon, 21 Sep 2020 at 18:14, Prathamesh Kulkarni
> > > > > > <prathamesh.kulka...@linaro.org> wrote:
> > > > > > >
> > > > > > > On Mon, 21 Sep 2020 at 15:19, Prathamesh Kulkarni
> > > > > > > <prathamesh.kulka...@linaro.org> wrote:
> > > > > > > >
> > > > > > > > On Fri, 4 Sep 2020 at 17:08, Alexander Monakov <amona...@ispras.ru> wrote:
> > > > > > > > >
> > > > > > > > > > I obtained perf stat results for following benchmark runs:
> > > > > > > > > >
> > > > > > > > > > -O2:
> > > > > > > > > >
> > > > > > > > > >     7856832.692380  task-clock (msec)  #  1.000 CPUs utilized
> > > > > > > > > >               3758  context-switches   #  0.000 K/sec
> > > > > > > > > >                 40  cpu-migrations     #  0.000 K/sec
> > > > > > > > > >              40847  page-faults        #  0.005 K/sec
> > > > > > > > > >      7856782413676  cycles             #  1.000 GHz
> > > > > > > > > >      6034510093417  instructions       #  0.77  insn per cycle
> > > > > > > > > >       363937274287  branches           # 46.321 M/sec
> > > > > > > > > >        48557110132  branch-misses      # 13.34% of all branches
> > > > > > > > >
> > > > > > > > > (ouch, 2+ hours per run is a lot, collecting a profile over a minute should be
> > > > > > > > > enough for this kind of code)
> > > > > > > > >
> > > > > > > > > > -O2 with orthonl inlined:
> > > > > > > > > >
> > > > > > > > > >     8319643.114380  task-clock (msec)  #  1.000 CPUs utilized
> > > > > > > > > >               4285  context-switches   #  0.001 K/sec
> > > > > > > > > >                 28  cpu-migrations     #  0.000 K/sec
> > > > > > > > > >              40843  page-faults        #  0.005 K/sec
> > > > > > > > > >      8319591038295  cycles             #  1.000 GHz
> > > > > > > > > >      6276338800377  instructions       #  0.75  insn per cycle
> > > > > > > > > >       467400726106  branches           # 56.180 M/sec
> > > > > > > > > >        45986364011  branch-misses      #  9.84% of all branches
> > > > > > > > >
> > > > > > > > > So +100e9 branches, but +240e9 instructions and +480e9 cycles, probably implying
> > > > > > > > > that extra instructions are appearing in this loop nest, but not in the innermost
> > > > > > > > > loop. As a reminder for others, the innermost loop has only 3 iterations.
> > > > > > > > > >
> > > > > > > > > > -O2 with orthonl inlined and PRE disabled (this removes the extra branches):
> > > > > > > > > >
> > > > > > > > > >     8207331.088040  task-clock (msec)  #  1.000 CPUs utilized
> > > > > > > > > >               2266  context-switches   #  0.000 K/sec
> > > > > > > > > >                 32  cpu-migrations     #  0.000 K/sec
> > > > > > > > > >              40846  page-faults        #  0.005 K/sec
> > > > > > > > > >      8207292032467  cycles             #  1.000 GHz
> > > > > > > > > >      6035724436440  instructions       #  0.74  insn per cycle
> > > > > > > > > >       364415440156  branches           # 44.401 M/sec
> > > > > > > > > >        53138327276  branch-misses      # 14.58% of all branches
> > > > > > > > >
> > > > > > > > > This seems to match baseline in terms of instruction count, but without PRE
> > > > > > > > > the loop nest may be carrying some dependencies over memory. I would simply
> > > > > > > > > check the assembly for the entire 6-level loop nest in question, I hope it's
> > > > > > > > > not very complicated (though Fortran array addressing...).
> > > > > > > > >
> > > > > > > > > > -O2 with orthonl inlined and hoisting disabled:
> > > > > > > > > >
> > > > > > > > > >     7797265.206850  task-clock (msec)  #  1.000 CPUs utilized
> > > > > > > > > >               3139  context-switches   #  0.000 K/sec
> > > > > > > > > >                 20  cpu-migrations     #  0.000 K/sec
> > > > > > > > > >              40846  page-faults        #  0.005 K/sec
> > > > > > > > > >      7797221351467  cycles             #  1.000 GHz
> > > > > > > > > >      6187348757324  instructions       #  0.79  insn per cycle
> > > > > > > > > >       461840800061  branches           # 59.231 M/sec
> > > > > > > > > >        26920311761  branch-misses      #  5.83% of all branches
> > > > > > > > >
> > > > > > > > > There's a 20e9 reduction in branch misses and a 500e9 reduction in cycle count.
> > > > > > > > > I don't think the former fully covers the latter (there's also a 90e9 reduction
> > > > > > > > > in insn count).
> > > > > > > > >
> > > > > > > > > Given that the inner loop iterates only 3 times, my main suggestion is to
> > > > > > > > > consider how the profile for the entire loop nest looks like (it's 6 loops deep,
> > > > > > > > > each iterating only 3 times).
> > > > > > > > > >
> > > > > > > > > > Perf profiles for -O2 -fno-code-hoisting and inlined orthonl:
> > > > > > > > > > https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
> > > > > > > > > >
> > > > > > > > > >        3196866 │1f04:   ldur   d1, [x1, #-248]
> > > > > > > > > >   216348301800 │        add    w0, w0, #0x1
> > > > > > > > > >         985098 │        add    x2, x2, #0x18
> > > > > > > > > >   216215999206 │        add    x1, x1, #0x48
> > > > > > > > > >   215630376504 │        fmul   d1, d5, d1
> > > > > > > > > >   863829148015 │        fmul   d1, d1, d6
> > > > > > > > > >   864228353526 │        fmul   d0, d1, d0
> > > > > > > > > >   864568163014 │        fmadd  d2, d0, d16, d2
> > > > > > > > > >                │        cmp    w0, #0x4
> > > > > > > > > >   216125427594 │      ↓ b.eq   1f34
> > > > > > > > > >       15010377 │        ldur   d0, [x2, #-8]
> > > > > > > > > >   143753737468 │      ↑ b      1f04
> > > > > > > > > >
> > > > > > > > > > -O2 with inlined orthonl:
> > > > > > > > > > https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
> > > > > > > > > >
> > > > > > > > > >   359871503840 │1ef8:   ldur   d15, [x1, #-248]
> > > > > > > > > >   144055883055 │        add    w0, w0, #0x1
> > > > > > > > > >    72262104254 │        add    x2, x2, #0x18
> > > > > > > > > >   143991169721 │        add    x1, x1, #0x48
> > > > > > > > > >   288648917780 │        fmul   d15, d17, d15
> > > > > > > > > >   864665644756 │        fmul   d15, d15, d18
> > > > > > > > > >   863868426387 │        fmul   d14, d15, d14
> > > > > > > > > >   865228159813 │        fmadd  d16, d14, d31, d16
> > > > > > > > > >         245967 │        cmp    w0, #0x4
> > > > > > > > > >   215396760545 │      ↓ b.eq   1f28
> > > > > > > > > >      704732365 │        ldur   d14, [x2, #-8]
> > > > > > > > > >   143775979620 │      ↑ b      1ef8
> > > > > > > > >
> > > > > > > > > This indicates that the loop only covers about 46-48% of overall time.
> > > > > > > > >
> > > > > > > > > High count on the initial ldur instruction could be explained if the loop
> > > > > > > > > is not entered by "fallthru" from the preceding block, or if its backedge
> > > > > > > > > is mispredicted. Sampling mispredictions should be possible with perf record,
> > > > > > > > > and you may be able to check if loop entry is fallthrough by inspecting
> > > > > > > > > assembly.
> > > > > > > > >
> > > > > > > > > It may also be possible to check if code alignment matters, by compiling with
> > > > > > > > > -falign-loops=32.
> > > > > > > > Hi,
> > > > > > > > Thanks a lot for the detailed feedback, and I am sorry for the late response.
> > > > > > > >
> > > > > > > > The hoisting region is:
> > > > > > > >
> > > > > > > > if(mattyp.eq.1) then
> > > > > > > >    4 loops
> > > > > > > > elseif(mattyp.eq.2) then
> > > > > > > > {
> > > > > > > >    orthonl inlined into basic block;
> > > > > > > >    loads w[0] .. w[8]
> > > > > > > > }
> > > > > > > > else
> > > > > > > >    6 loops   // load anisox
> > > > > > > >
> > > > > > > > followed by basic block:
> > > > > > > >
> > > > > > > > senergy=
> > > > > > > >  &         (s11*w(1,1)+s12*(w(1,2)+w(2,1))
> > > > > > > >  &         +s13*(w(1,3)+w(3,1))+s22*w(2,2)
> > > > > > > >  &         +s23*(w(2,3)+w(3,2))+s33*w(3,3))*weight
> > > > > > > > s(ii1,jj1)=s(ii1,jj1)+senergy
> > > > > > > > s(ii1+1,jj1+1)=s(ii1+1,jj1+1)+senergy
> > > > > > > > s(ii1+2,jj1+2)=s(ii1+2,jj1+2)+senergy
> > > > > > > >
> > > > > > > > Hoisting hoists the loads w[0] .. w[8] from the orthonl and senergy blocks
> > > > > > > > right into block 181, which is:
> > > > > > > > if (mattyp.eq.2) goto <bb 182> else goto <bb 193>
> > > > > > > >
> > > > > > > > which is then further hoisted to block 173:
> > > > > > > > if (mattyp.eq.1) goto <bb 392> else goto <bb 181>
> > > > > > > >
> > > > > > > > From block 181, we have two paths towards the senergy block (bb 194):
> > > > > > > > bb 181 -> bb 182 (orthonl block) -> bb 194 (senergy block)
> > > > > > > > AND
> > > > > > > > bb 181 -> bb 392 <6 loops pre-header> ... -> bb 194
> > > > > > > > which has a path length of around 18 blocks.
> > > > > > > > (bb 194 post-dominates bb 181 and bb 173.)
> > > > > > > >
> > > > > > > > Disabling only load hoisting within blocks 173 and 181
> > > > > > > > (by simply avoiding insertion of a pre_expr if pre_expr->kind == REFERENCE)
> > > > > > > > avoids hoisting of the 'w' array and brings back most of the performance,
> > > > > > > > which verifies that it is the hoisting of the 'w' array (w[0] ... w[8])
> > > > > > > > that is causing the slowdown.
> > > > > > > >
> > > > > > > > I obtained perf profiles with full hoisting, and with hoisting of the 'w'
> > > > > > > > array disabled for the 6 loops, and the most drastic difference was for
> > > > > > > > the ldur instruction:
> > > > > > > >
> > > > > > > > With full hoisting:
> > > > > > > > 359871503840 │1ef8:   ldur   d15, [x1, #-248]
> > > > > > > >
> > > > > > > > Without full hoisting:
> > > > > > > >      3441224 │1edc:   ldur   d1, [x1, #-248]
> > > > > > > >
> > > > > > > > (The loop entry seems to be fall-through in both cases. I have attached
> > > > > > > > profiles for both cases.)
> > > > > > > >
> > > > > > > > IIUC, the instruction seems to be loading the first element from the anisox
> > > > > > > > array, which makes me wonder if the issue is a data-cache miss in the slower
> > > > > > > > version. I ran perf script on perf data for L1-dcache-load-misses with
> > > > > > > > period = 1 million, and it reported two cache misses on the ldur instruction
> > > > > > > > in the full-hoisting case, while it reported zero for the disabled
> > > > > > > > load-hoisting case.
> > > > > > > > So I wonder if the slowdown happens because hoisting of the 'w' array
> > > > > > > > possibly results in eviction of anisox, thus causing a cache miss inside
> > > > > > > > the inner loop and making the load slower?
> > > > > > > >
> > > > > > > > Hoisting also seems to improve the number of overall cache misses though.
> > > > > > > > For the disabled hoisting of the 'w' array case there were a total of 463
> > > > > > > > cache misses, while with full hoisting there were 357 cache misses
> > > > > > > > (with period = 1 million).
> > > > > > > > Does that happen because hoisting probably reduces cache misses along
> > > > > > > > the orthonl path (bb 173 -> bb 181 -> bb 182 -> bb 194)?
> > > > > > > Hi,
> > > > > > > In general I feel that for this case, or the PR80155 case, the issues come
> > > > > > > with long-range hoistings inside a large CFG, since we don't have an
> > > > > > > accurate way to model target resources (register pressure in the PR80155
> > > > > > > case, or possibly cache pressure in this case?)
> > > > > > > at tree level, and we end up with register spills or cache misses inside
> > > > > > > loops, which may offset the benefit of hoisting. As previously discussed,
> > > > > > > the right way to go is a live-range splitting pass at the GIMPLE -> RTL
> > > > > > > border, which can also help with other code-movement optimizations (or if
> > > > > > > the source had variables with long live ranges).
> > > > > > >
> > > > > > > I was wondering though, as a cheap workaround, would it make sense to
> > > > > > > check if we are hoisting across a "large" region of nested loops, and
> > > > > > > avoid hoisting in that case, since hoisting may exert resource pressure
> > > > > > > inside the loop region? (Especially in the cases where the hoisted
> > > > > > > expressions were not originally AVAIL in any of the loop blocks, and the
> > > > > > > loop region doesn't benefit from hoisting.)
> > > > > > >
> > > > > > > For instance:
> > > > > > > FOR_EACH_EDGE (e, ei, block)
> > > > > > >   {
> > > > > > >     /* Avoid hoisting across more than 3 nested loops.  */
> > > > > > >     if (e->dest is a loop pre-header or loop header
> > > > > > >         && nesting depth of loop is > 3)
> > > > > > >       return false;
> > > > > > >   }
> > > > > > >
> > > > > > > I think this would work for resolving the calculix issue because it
> > > > > > > hoists across one region of 6 loops and another of 4 loops (didn't
> > > > > > > test yet).
> > > > > > > It's not bulletproof in that it will miss detecting cases where the loop
> > > > > > > header (or pre-header) isn't a successor of the candidate block (checking
> > > > > > > for that might get expensive though?). I will test it on the gcc testsuite
> > > > > > > and SPEC for any regressions.
> > > > > > > Does this sound like a reasonable heuristic?
> > > > > > Hi,
> > > > > > The attached patch implements the above heuristic.
> > > > > > Bootstrapped + tested on x86_64-linux-gnu with no regressions.
> > > > > > It brings back most of the performance for calculix, on par with -O2
> > > > > > (without inlining orthonl).
> > > > > > I verified that with the patch there is no cache miss happening on the
> > > > > > load insn inside the loop
> > > > > > (with perf report -e L1-dcache-load-misses/period=1000000/).
> > > > > >
> > > > > > I am in the process of benchmarking the patch on aarch64 for SPEC for
> > > > > > speed and will report numbers in a couple of days. (If required, we could
> > > > > > parametrize the number of nested loops, hardcoded (arbitrarily) to 3 in
> > > > > > this patch, and set it in the backend so as not to affect other targets.)
> > > > >
> > > > > I don't think this patch captures the case in a sensible way - it will simply
> > > > > never hoist computations out of loop header blocks with depth > 3, which
> > > > > is certainly not what you want. Also the pre-header check is odd - we're
> > > > > looking for computations in successors of BLOCK, but clearly a computation
> > > > > in a pre-header is not at the same loop level as one in the header itself.
> > > > Well, my intent was to check if we are hoisting across a region which has
> > > > multiple nested loops, and in that case avoid hoisting expressions that do
> > > > not belong to any loop blocks, because that may increase resource pressure
> > > > inside the loops. For instance, in the calculix issue we hoist the 'w'
> > > > array from the post-dom, and neither loop region has any uses of 'w'.
> > > > I agree that checking just for loop level is too coarse.
> > > > The check with the pre-header was essentially the same: to see if we are
> > > > hoisting across a loop region, not necessarily from within the loops.
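For concreteness, the kind of region check sketched in the pseudo-code above
could look roughly like the following with GCC's CFG and loop APIs. This is
only an illustrative sketch, not the attached patch: it assumes "nesting depth"
means the depth of the loop nest being entered, and the helper names and the
cutoff parameter are made up. It stays as coarse as the surrounding discussion
notes.

  /* Depth of the deepest loop nest rooted at LOOP; LOOP itself counts as 1.  */
  static unsigned
  loop_nest_depth (struct loop *loop)
  {
    unsigned inner_depth = 0;
    for (struct loop *inner = loop->inner; inner; inner = inner->next)
      inner_depth = MAX (inner_depth, loop_nest_depth (inner));
    return inner_depth + 1;
  }

  /* Return true if BB is the header, or a single-successor pre-header,
     of a loop nest deeper than MAX_DEPTH.  */
  static bool
  enters_deep_loop_nest_p (basic_block bb, unsigned max_depth)
  {
    struct loop *loop = bb->loop_father;
    if (loop && bb == loop->header && loop_nest_depth (loop) > max_depth)
      return true;
    if (single_succ_p (bb))
      {
        basic_block header = single_succ (bb);
        loop = header->loop_father;
        if (loop && header == loop->header
            && loop_nest_depth (loop) > max_depth)
          return true;
      }
    return false;
  }

  /* The check the pseudo-code above describes: refuse hoisting into BLOCK
     if one of its successors enters a deep loop nest.  */
  static bool
  hoisting_crosses_deep_loop_nest_p (basic_block block, unsigned max_depth)
  {
    edge e;
    edge_iterator ei;
    FOR_EACH_EDGE (e, ei, block->succs)
      if (enters_deep_loop_nest_p (e->dest, max_depth))
        return true;
    return false;
  }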
> > >
> > > But it will fail to hoist *p in
> > >
> > >   if (x)
> > >     {
> > >       a = *p;
> > >     }
> > >   else
> > >     {
> > >       b = *p;
> > >     }
> > >
> > >   <make distance large>
> > >   pdom:
> > >     c = *p;
> > >
> > > so it isn't what matters either. What happens at the immediate post-dominator
> > > isn't directly relevant - what matters would be if the pdom is the one making
> > > the value antic on one of the outgoing edges. In that case we're also going
> > > to PRE *p into the arm not computing *p (albeit in a later position). But
> > > that property is impossible to compute from the sets themselves (not even
> > > mentioning the arbitrary CFG that can be in between the block and its pdom,
> > > or the weird pdoms we compute for regions not having a path to exit, like
> > > infinite loops).
> > >
> > > You could eventually look at the pdom predecessors and if *p is not AVAIL_OUT
> > > in each of them we _might_ have the situation you want to protect against.
> > > But as said, PRE insertion will likely have made sure it _is_ AVAIL_OUT in
> > > each of them ...
> > Hi Richard,
> > Thanks for the suggestions. Right, the issue here seems to be that the
> > post-dom block is making the expressions ANTIC. Before doing the insert,
> > could we simply copy the AVAIL_OUT set of each block into another set, say
> > ORIG_AVAIL_OUT, as a guard against PRE eventually inserting expressions in
> > the pred blocks of the pdom and making them available?
> > And during hoisting, we could check if each expr that is ANTIC_IN (pdom) is
> > ORIG_AVAIL_OUT in each pred of the pdom, if the distance is "large".
>
> Did you try if it works w/o copying AVAIL_OUT? Because AVAIL_OUT is
> very large (it's actually quadratic in size of the CFG * # values), in
> particular we're inserting in RPO and update AVAIL_OUT only up to the current
> block (from dominators), so the PDOM block should have the original AVAIL_OUT
> (but from the last iteration - we do iterate insertion).
>
> Note I'm still not happy with pulling off this kind of heuristic.
> What the suggestion means is that for
>
>   if (x)
>     y = *p;
>   z = *p;
>
> we'll end up with
>
>   if (x)
>     y = *p;
>   else
>     z = *p;
>
> instead of
>
>   tem = *p;
>   if (x)
>     y = tem;
>   else
>     z = tem;
>
> that is, we get the runtime benefit but not the code-size one
> (hoisting mostly helps code size, plus allows if-conversion as a followup
> in some cases). Now, if we iterate (like if we'd have a second hoisting pass)
> then the above would still cause hoisting - so the heuristic isn't
> sustainable. Again, something like "distance" isn't really available.
Hi Richard,
It doesn't work without copying AVAIL_OUT. I tried it on a small example with
the attached patch:
int foo(int cond, int x, int y)
{
  int t;
  void f(int);

  if (cond)
    t = x + y;
  else
    t = x - y;

  f (t);
  int t2 = (x + y) * 10;
  return t2;
}

By intersecting availout_in_some with the AVAIL_OUT of the preds of the pdom,
it does not hoist in the first pass, but then PRE inserts x + y into the "else"
block, and we eventually hoist before if (cond). Similarly for the e_c3d
hoistings in calculix.

IIUC, we want hoisting to be:
(ANTIC_IN (block) intersect AVAIL_OUT (preds of pdom)) - AVAIL_OUT (block)
to ensure that hoisted expressions are computed along all paths from block to
the post-dom?

If copying the AVAIL_OUT sets is too costly, could we keep another set that
precomputes the intersection of the AVAIL_OUT of the pdom preds for each
block, and then use this info during hoisting?

For computing "distance", I implemented a simple DFS walk from block to
post-dom, which gives up if the depth crosses a threshold before reaching the
post-dom. I am not sure, though, how expensive that can get. (Rough sketches
of both ideas are appended at the end of this mail, after the quoted text.)

Thanks,
Prathamesh

>
> Richard.
>
> > Thanks,
> > Prathamesh
> > > > >
> > > > > Note the difficulty to capture "distance" is that the distance is simply
> > > > > not available at this point - it is the anticipated values from the
> > > > > successors that do _not_ compute the value itself that are the issue. To
> > > > > capture "distance" you'd need to somehow "age" anticipated values when
> > > > > propagating them upwards during compute_antic (where it then
> > > > > would also apply to PRE in general).
> > > > Yes, indeed. As a hack, would it make sense to avoid inserting an
> > > > expression in the block, if it's ANTIC in the post-dom block, as a
> > > > trade-off between extending the live range and hoisting, if the "distance"
> > > > between the block and the post-dom is "too far"? In general, as you point
> > > > out, we'd need to compute distance info for successor blocks during
> > > > compute_antic, but special-casing the post-dom should be easy enough
> > > > during do_hoist_insertion, and hoisting an expr that is ANTIC in the
> > > > post-dom could potentially be "long range" if the region is large.
> > > > It's still a coarse heuristic though. I tried it in the attached patch.
> > > >
> > > > Thanks,
> > > > Prathamesh
> > > > >
> > > > > As with all other heuristics, the only place one could do hackish
> > > > > attempts with at least some reasoning is the elimination phase where
> > > > > we make use of the (hoist) insertions - of course for hoisting we already
> > > > > know we'll get the "close" use in one of the successors, so I fear even
> > > > > there it will be impossible to do something sensible.
> > > > >
> > > > > Richard.
> > > > >
> > > > > > Thanks,
> > > > > > Prathamesh
> > > > > > > Thanks,
> > > > > > > Prathamesh
> > > > > > > > Thanks,
> > > > > > > > Prathamesh
> > > > > > > > > Alexander
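A rough sketch of the pdom-preds check described above, assuming GCC's
dominance, CFG and bitmap APIs. ORIG_AVAIL_OUT stands for the hypothetical
pre-insertion snapshot of AVAIL_OUT discussed earlier in the thread (it does
not exist in tree-ssa-pre.c), and the candidate set is assumed to start out as
the usual "ANTIC_IN (block) minus AVAIL_OUT (block)" value set that hoisting
already computes:

  /* Illustrative sketch, not the attached patch.  Keep in CANDIDATE_VALUES
     only the value-ids that were already available on exit from every
     predecessor of BLOCK's post-dominator, approximating "the value is
     computed on all paths from BLOCK to its pdom".  ORIG_AVAIL_OUT is the
     hypothetical snapshot taken before insertion starts.  */
  static void
  prune_hoistable_by_pdom_preds (basic_block block, bitmap candidate_values)
  {
    basic_block pdom = get_immediate_dominator (CDI_POST_DOMINATORS, block);
    if (!pdom || pdom == EXIT_BLOCK_PTR_FOR_FN (cfun))
      return;

    edge e;
    edge_iterator ei;
    FOR_EACH_EDGE (e, ei, pdom->preds)
      bitmap_and_into (candidate_values, &ORIG_AVAIL_OUT (e->src)->values);
  }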
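And a sketch of the depth-bounded walk mentioned for "distance" (again
illustrative only; the names are made up, and as noted above the cost of such
a walk is the open question). With the visited bitmap the answer is
conservative: it may report "not within distance" even though a short path
exists, since blocks first reached on a long, failed path are not revisited.

  /* Return true if PDOM is reachable from BB within BUDGET CFG edges,
     giving up once the depth budget is exhausted.  VISITED (indexed by
     bb->index) keeps the walk roughly linear in the blocks looked at.  */
  static bool
  pdom_within_distance_p (basic_block bb, basic_block pdom, int budget,
                          bitmap visited)
  {
    if (bb == pdom)
      return true;
    if (budget <= 0 || !bitmap_set_bit (visited, bb->index))
      return false;

    edge e;
    edge_iterator ei;
    FOR_EACH_EDGE (e, ei, bb->succs)
      if (pdom_within_distance_p (e->dest, pdom, budget - 1, visited))
        return true;
    return false;
  }

A caller would pass the block's immediate post-dominator and a scratch bitmap
(e.g. an auto_bitmap) for the visited set.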
gnu-659-pdom-4.diff
Description: Binary data