On August 11, 2016 4:27:00 PM GMT+02:00, Jan Hubicka <hubi...@ucw.cz> wrote: >> On Thu, 11 Aug 2016, Jan Hubicka wrote: >> >> > Hi, >> > this patch adds early jump threading pass. Jump threading is one >of most >> > common cases where estimated profile becomes corrupted, because the >branches >> > are predicted independently beforehand. This patch simply adds >early mode to >> > jump threader that does not permit code growth and thus only >win/win threads >> > are done before profile is constructed. >> > >> > Naturally this also improves IPA decisions because code size/time >is estimated >> > more acurately. >> > >> > It is not very cool to add another pass and the jump threader is >already >> > run 5 times. I think incrementally we could drop one of late >threaders at least. >> > I tried to measure compile time effects but they are in wash. >Moreover the patch >> > pays for itself in cc1plus code size: >> > >> > Before patch to tweak size estimates: 27779964 >> > Current mainline: 27748900 >> > With patch applied: 27716173 >> > >> > So despite adding few functions the code size effect is positive >which I think >> > is quite nice. >> > >> > Given the fact that jump threading seems quite lightweit, I wonder >why it is >> > guarded by flag_expensive_optimizations? Is it expensive in terms >of code >> > size? >> > >> > The effectivity of individual threading passes is as follows (for >tramp3d) >> > >> > mainline with patch >> > pass thread count profile mismatches thread count profile >mismatch >> > early 525 >> > 1 1853 1900 316 337 >> > 2 4 812 4 288 >> > 3 24 1450 32 947 >> > 4 76 1457 75 975 >> > >> > So at least tramp3d data seems to suggest that we can drop the >second occurence >> > of jump threading and that most of the job thread1 does can be done >by the >> > size restricted early version (the lower absolute counts are caused >by the >> > fact that threadable paths gets duplicated by the inliner). 50% >drop in >> > profile distortion is not bad. I wonder why basically every >threaded paths seems >> > to introduce a mismatch. >> > >> > The patch distorts testusite somewhat, in most cases we only want >to update >> > dump files or disable early threading: >> > >> > +XPASS: gcc.dg/uninit-15.c (test for warnings, line 13) >> > +XPASS: gcc.dg/uninit-15.c (test for warnings, line 23) >> > +FAIL: gcc.dg/uninit-15.c (test for warnings, line 24) >> > +FAIL: gcc.dg/tree-ssa/pr68198.c scan-tree-dump-times thread1 >"Registering FSM" 1 >> > +FAIL: gcc.dg/tree-ssa/pr69196-1.c scan-tree-dump thread1 "FSM did >not thread around loop and would copy too many statements" >> > +FAIL: gcc.dg/tree-ssa/ssa-dom-thread-2b.c scan-tree-dump-times >thread1 "Jumps threaded: 1" 1 >> > +FAIL: gcc.dg/tree-ssa/ssa-thread-13.c scan-tree-dump thread1 "FSM" >> > +FAIL: gcc.dg/tree-ssa/vrp01.c scan-tree-dump-times vrp1 "Folding >predicate p_.*to 1" 1 >> > +FAIL: gcc.dg/tree-ssa/vrp56.c scan-tree-dump thread1 "Jumps >threaded: 1" >> > +FAIL: gcc.dg/tree-ssa/vrp92.c scan-tree-dump vrp1 "res_.: \\\\[1, >1\\\\]" >> > >> > This testcase is the now probably unnecesary heuristics (it may >still be >> > relevant in cases we do not thread because of size bounds but its >effectivity >> > may not be worth the maintenance cost): >> > >> > +FAIL: g++.dg/predict-loop-exit-1.C -std=gnu++11 >scan-tree-dump-times profile_estimate "extra loop exit heuristics of >edge[^:]*:" 2 >> > +FAIL: g++.dg/predict-loop-exit-1.C -std=gnu++11 >scan-tree-dump-times profile_estimate "loop exit heuristics of >edge[^:]*:" 3 >> > +FAIL: g++.dg/predict-loop-exit-1.C -std=gnu++14 >scan-tree-dump-times profile_estimate "extra loop exit heuristics of >edge[^:]*:" 2 >> > +FAIL: g++.dg/predict-loop-exit-1.C -std=gnu++14 >scan-tree-dump-times profile_estimate "loop exit heuristics of >edge[^:]*:" 3 >> > +FAIL: g++.dg/predict-loop-exit-1.C -std=gnu++98 >scan-tree-dump-times profile_estimate "extra loop exit heuristics of >edge[^:]*:" 2 >> > +FAIL: g++.dg/predict-loop-exit-1.C -std=gnu++98 >scan-tree-dump-times profile_estimate "loop exit heuristics of >edge[^:]*:" 3 >> > +FAIL: g++.dg/predict-loop-exit-2.C -std=gnu++11 >scan-tree-dump-times profile_estimate "extra loop exit heuristics of >edge[^:]*:" 1 >> > +FAIL: g++.dg/predict-loop-exit-2.C -std=gnu++11 >scan-tree-dump-times profile_estimate "loop exit heuristics of >edge[^:]*:" 2 >> > +FAIL: g++.dg/predict-loop-exit-2.C -std=gnu++14 >scan-tree-dump-times profile_estimate "extra loop exit heuristics of >edge[^:]*:" 1 >> > +FAIL: g++.dg/predict-loop-exit-2.C -std=gnu++14 >scan-tree-dump-times profile_estimate "loop exit heuristics of >edge[^:]*:" 2 >> > +FAIL: g++.dg/predict-loop-exit-2.C -std=gnu++98 >scan-tree-dump-times profile_estimate "extra loop exit heuristics of >edge[^:]*:" 1 >> > +FAIL: g++.dg/predict-loop-exit-2.C -std=gnu++98 >scan-tree-dump-times profile_estimate "loop exit heuristics of >edge[^:]*:" 2 >> > +FAIL: g++.dg/predict-loop-exit-3.C -std=gnu++11 >scan-tree-dump-times profile_estimate "extra loop exit heuristics of >edge[^:]*:" 2 >> > +FAIL: g++.dg/predict-loop-exit-3.C -std=gnu++11 >scan-tree-dump-times profile_estimate "loop exit heuristics of >edge[^:]*:" 3 >> > +FAIL: g++.dg/predict-loop-exit-3.C -std=gnu++14 >scan-tree-dump-times profile_estimate "extra loop exit heuristics of >edge[^:]*:" 2 >> > +FAIL: g++.dg/predict-loop-exit-3.C -std=gnu++14 >scan-tree-dump-times profile_estimate "loop exit heuristics of >edge[^:]*:" 3 >> > +FAIL: g++.dg/predict-loop-exit-3.C -std=gnu++98 >scan-tree-dump-times profile_estimate "extra loop exit heuristics of >edge[^:]*:" 2 >> > +FAIL: g++.dg/predict-loop-exit-3.C -std=gnu++98 >scan-tree-dump-times profile_estimate "loop exit heuristics of >edge[^:]*:" 3 >> > >> > If the patch seems acceptable, I will do the updates. One option >why I did >> > not do that is that it seems to be now posisble to pass parameters >to passes >> > from passes.def, so perhaps we do not need early_thread_jumps, but >doing so is >> > consistent with way we handle other early passes. >> >> I wonder why you choose to put the FSM threader early which only does >> backward threading(?!). I'd expect forward threading to be more >> profitable (though we don't have a separate threader for that and >> would need VRP or DOM - but it seems we will get an early VRP >anyway). > >On tramp3d all VRP passes threads together 730 branches, all DOM passes >393, so >FSM threading (with 1957 branches) is the most effective one. Perhaps >eventually >early VRP can also do bit of work. > >I am not 100% sure from where "backward" is comming from. I guess is >means that >analysis goes backward from conditionals to definitions: it looks for >conditional driven by a PHI statement that has a constant value on some >paths >and duplicates for those. This seems cheap and rather effective way of >getting >good part of the threading oppurtunities (most expensive part is >probably >identifying and walking paths that will not be threaded at the end).
Ah, I thought it was exclusively dealing with threading through back edges which is sth I'd avoid doing early? >BTW I wonder if the same analysis can't be done for other instructions >where constant >operands are very profitable, like division or multiplication. No idea, but Jeff will likely know. Richard. > >Honza