Re: Early jump threading

Richard Biener Thu, 11 Aug 2016 08:51:07 -0700

On August 11, 2016 4:27:00 PM GMT+02:00, Jan Hubicka <hubi...@ucw.cz> wrote:
>> On Thu, 11 Aug 2016, Jan Hubicka wrote:
>> 
>> > Hi,
>> > this patch adds early jump threading pass.  Jump threading is one
>of most
>> > common cases where estimated profile becomes corrupted, because the
>branches
>> > are predicted independently beforehand. This patch simply adds
>early mode to
>> > jump threader that does not permit code growth and thus only
>win/win threads
>> > are done before profile is constructed.
>> > 
>> > Naturally this also improves IPA decisions because code size/time
>is estimated
>> > more acurately.
>> > 
>> > It is not very cool to add another pass and the jump threader is
>already
>> > run 5 times. I think incrementally we could drop one of late
>threaders at least.
>> > I tried to measure compile time effects but they are in wash.
>Moreover the patch
>> > pays for itself in cc1plus code size:
>> > 
>> > Before patch to tweak size estimates: 27779964  
>> > Current mainline:                     27748900  
>> > With patch applied:                   27716173  
>> > 
>> > So despite adding few functions the code size effect is positive
>which I think
>> > is quite nice.
>> > 
>> > Given the fact that jump threading seems quite lightweit, I wonder
>why it is
>> > guarded by flag_expensive_optimizations? Is it expensive in terms
>of code
>> > size?
>> > 
>> > The effectivity of individual threading passes is as follows (for
>tramp3d)
>> > 
>> >       mainline                              with patch
>> > pass  thread count     profile mismatches   thread count    profile
>mismatch
>> > early                                       525
>> > 1     1853             1900                 316             337
>> > 2     4                812                  4               288
>> > 3     24               1450                 32              947
>> > 4     76               1457                 75              975
>> > 
>> > So at least tramp3d data seems to suggest that we can drop the
>second occurence
>> > of jump threading and that most of the job thread1 does can be done
>by the
>> > size restricted early version (the lower absolute counts are caused
>by the
>> > fact that threadable paths gets duplicated by the inliner). 50%
>drop in
>> > profile distortion is not bad. I wonder why basically every
>threaded paths seems
>> > to introduce a mismatch.
>> > 
>> > The patch distorts testusite somewhat, in most cases we only want
>to update
>> > dump files or disable early threading:
>> > 
>> > +XPASS: gcc.dg/uninit-15.c  (test for warnings, line 13)
>> > +XPASS: gcc.dg/uninit-15.c  (test for warnings, line 23)
>> > +FAIL: gcc.dg/uninit-15.c  (test for warnings, line 24)
>> > +FAIL: gcc.dg/tree-ssa/pr68198.c scan-tree-dump-times thread1
>"Registering FSM" 1
>> > +FAIL: gcc.dg/tree-ssa/pr69196-1.c scan-tree-dump thread1 "FSM did
>not thread around loop and would copy too many statements"
>> > +FAIL: gcc.dg/tree-ssa/ssa-dom-thread-2b.c scan-tree-dump-times
>thread1 "Jumps threaded: 1" 1
>> > +FAIL: gcc.dg/tree-ssa/ssa-thread-13.c scan-tree-dump thread1 "FSM"
>> > +FAIL: gcc.dg/tree-ssa/vrp01.c scan-tree-dump-times vrp1 "Folding
>predicate p_.*to 1" 1
>> > +FAIL: gcc.dg/tree-ssa/vrp56.c scan-tree-dump thread1 "Jumps
>threaded: 1"
>> > +FAIL: gcc.dg/tree-ssa/vrp92.c scan-tree-dump vrp1 "res_.: \\\\[1,
>1\\\\]"
>> > 
>> > This testcase is the now probably unnecesary heuristics (it may
>still be
>> > relevant in cases we do not thread because of size bounds but its
>effectivity
>> > may not be worth the maintenance cost):
>> > 
>> > +FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++11 
>scan-tree-dump-times profile_estimate "extra loop exit heuristics of
>edge[^:]*:" 2
>> > +FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++11 
>scan-tree-dump-times profile_estimate "loop exit heuristics of
>edge[^:]*:" 3
>> > +FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++14 
>scan-tree-dump-times profile_estimate "extra loop exit heuristics of
>edge[^:]*:" 2
>> > +FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++14 
>scan-tree-dump-times profile_estimate "loop exit heuristics of
>edge[^:]*:" 3
>> > +FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++98 
>scan-tree-dump-times profile_estimate "extra loop exit heuristics of
>edge[^:]*:" 2
>> > +FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++98 
>scan-tree-dump-times profile_estimate "loop exit heuristics of
>edge[^:]*:" 3
>> > +FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++11 
>scan-tree-dump-times profile_estimate "extra loop exit heuristics of
>edge[^:]*:" 1
>> > +FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++11 
>scan-tree-dump-times profile_estimate "loop exit heuristics of
>edge[^:]*:" 2
>> > +FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++14 
>scan-tree-dump-times profile_estimate "extra loop exit heuristics of
>edge[^:]*:" 1
>> > +FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++14 
>scan-tree-dump-times profile_estimate "loop exit heuristics of
>edge[^:]*:" 2
>> > +FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++98 
>scan-tree-dump-times profile_estimate "extra loop exit heuristics of
>edge[^:]*:" 1
>> > +FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++98 
>scan-tree-dump-times profile_estimate "loop exit heuristics of
>edge[^:]*:" 2
>> > +FAIL: g++.dg/predict-loop-exit-3.C  -std=gnu++11 
>scan-tree-dump-times profile_estimate "extra loop exit heuristics of
>edge[^:]*:" 2
>> > +FAIL: g++.dg/predict-loop-exit-3.C  -std=gnu++11 
>scan-tree-dump-times profile_estimate "loop exit heuristics of
>edge[^:]*:" 3
>> > +FAIL: g++.dg/predict-loop-exit-3.C  -std=gnu++14 
>scan-tree-dump-times profile_estimate "extra loop exit heuristics of
>edge[^:]*:" 2
>> > +FAIL: g++.dg/predict-loop-exit-3.C  -std=gnu++14 
>scan-tree-dump-times profile_estimate "loop exit heuristics of
>edge[^:]*:" 3
>> > +FAIL: g++.dg/predict-loop-exit-3.C  -std=gnu++98 
>scan-tree-dump-times profile_estimate "extra loop exit heuristics of
>edge[^:]*:" 2
>> > +FAIL: g++.dg/predict-loop-exit-3.C  -std=gnu++98 
>scan-tree-dump-times profile_estimate "loop exit heuristics of
>edge[^:]*:" 3
>> > 
>> > If the patch seems acceptable, I will do the updates. One option
>why I did
>> > not do that is that it seems to be now posisble to pass parameters
>to passes
>> > from passes.def, so perhaps we do not need early_thread_jumps, but
>doing so is
>> > consistent with way we handle other early passes.
>> 
>> I wonder why you choose to put the FSM threader early which only does
>> backward threading(?!).  I'd expect forward threading to be more
>> profitable (though we don't have a separate threader for that and
>> would need VRP or DOM - but it seems we will get an early VRP
>anyway).
>
>On tramp3d all VRP passes threads together 730 branches, all DOM passes
>393, so
>FSM threading (with 1957 branches) is the most effective one. Perhaps
>eventually
>early VRP can also do bit of work.
>
>I am not 100% sure from where "backward" is comming from. I guess is
>means that
>analysis goes backward from conditionals to definitions: it looks for
>conditional driven by a PHI statement that has a constant value on some
>paths
>and duplicates for those. This seems cheap and rather effective way of
>getting
>good part of the threading oppurtunities (most expensive part is
>probably
>identifying and walking paths that will not be threaded at the end).


Ah, I thought it was exclusively dealing with threading through back edges 
which is sth I'd avoid doing early?

>BTW I wonder if the same analysis can't be done for other instructions
>where constant
>operands are very profitable, like division or multiplication.

No idea, but Jeff will likely know.

Richard.

>
>Honza

Re: Early jump threading

Reply via email to