Re: [PATCH][RFC] Radically simplify emission of balanced tree for switch statements.

Martin Liška Wed, 20 Sep 2017 00:24:54 -0700

On 09/16/2017 12:19 AM, Jeff Law wrote:
> On 09/14/2017 06:17 AM, Martin Liška wrote:
>> Hello.
>>
>> As mentioned at Cauldron 2017, second step in switch lowering should be 
>> massive
>> simplification in code that does expansion of balanced tree. Basically it 
>> includes
>> VRP and DCE, which we can for obvious reason do by our own.
>>
>> The patch does that, and introduces a separate pass for -O0 that's 
>> responsible
>> for lowering at the end of tree pass.
>>
>> There's a small fallback that I would like to discuss:
>>
>> 1) vrp105.c - there's no longer catches threading opportunity in between 
>> default cases:
>> adding Patrick who can probably help why is the opportunity skipped with 
>> expanded tree
> Well, VRP knows how to analyze a switch statement and determine when
> paths out of one switch imply a particular case will be taken in a later
> switch.  In this particular case we're trying to verify that the default
> case in the first switch threads directly to the default case in the
> second (though it seems we ought to be able to totally thread the cases).
> 
> Obviously we don't have switches anymore after your lowering pass.  But
> ISTM we still ought to be able to evaluate how your lowering affects
> jump threading on this case.  In theory the lowered switch statements
> should be easier to thread.
> 
> Reality is sadly different.  There's two cases to consider.  One when i
> < 0 (should go to first default, then directly to second default).  The
> other when i > 1 which should also go to the first default, then
> directly to the second default).
> 
> When i < 0 we do in fact thread through the second switch and go from
> the first default case directly to the second default case.
> 
> When i > 1 we still go through the test for the second switch.
> 
> These are the key blocks:
> 
> ;;   basic block 2, loop depth 0, freq 10000, maybe hot
> ;;    prev block 0, next block 3, flags: (NEW, REACHABLE, VISITED)
> ;;    pred:       ENTRY [always]  (FALLTHRU,EXECUTABLE)
>   i.0_1 = i;
>   if (i.0_1 > 0)
>     goto <bb 4>; [50.01%] [count: INV]
>   else
>     goto <bb 3>; [49.99%] [count: INV]
> ;;    succ:       3 [50.0% (guessed)]  (FALSE_VALUE,EXECUTABLE)
> ;;                4 [50.0% (guessed)]  (TRUE_VALUE,EXECUTABLE)
> 
> ;;   basic block 3, loop depth 0, freq 10000, maybe hot
> ;;   Invalid sum of incoming frequencies 4999, should be 10000
> ;;    prev block 2, next block 4, flags: (NEW, REACHABLE, VISITED)
> ;;    pred:       2 [50.0% (guessed)]  (FALSE_VALUE,EXECUTABLE)
>   if (i.0_1 == 0)
>     goto <bb 5>; [40.00%] [count: INV]
>   else
>     goto <bb 14>; [60.00%] [count: INV]
> ;;    succ:       14 [60.0% (guessed)]  (FALSE_VALUE,EXECUTABLE)
> ;;                5 [40.0% (guessed)]  (TRUE_VALUE,EXECUTABLE)
> 
> ;;   basic block 4, loop depth 0, freq 10000, maybe hot
> ;;   Invalid sum of incoming frequencies 5001, should be 10000
> ;;    prev block 3, next block 5, flags: (NEW, REACHABLE, VISITED)
> ;;    pred:       2 [50.0% (guessed)]  (TRUE_VALUE,EXECUTABLE)
>   _13 = i.0_1 != 1;
>   if (_13 != 0)
>     goto <bb 13>; [16.68%] [count: INV]
>   else
>     goto <bb 6>; [83.32%] [count: INV]
> ;;    succ:       6 [83.3% (guessed)]  (FALSE_VALUE,EXECUTABLE)
> ;;                13 [16.7% (guessed)]  (TRUE_VALUE,EXECUTABLE)
> 
> [ ... ]
> 
> ;;   basic block 9, loop depth 0, freq 9999, maybe hot
> ;;   Invalid sum of incoming frequencies 3999, should be 9999
> ;;    prev block 8, next block 10, flags: (NEW, REACHABLE, VISITED)
> ;;    pred:       13 [always]  (FALLTHRU,EXECUTABLE)
> ;;                7 [always (guessed)]  (TRUE_VALUE,EXECUTABLE)
>   # prephitmp_19 = PHI <prephitmp_15(13), prephitmp_12(7)>
>   _2 = prephitmp_19 != 1;
>   if (_2 != 0)
>     goto <bb 12>; [16.68%] [count: INV]
>   else
>     goto <bb 11>; [83.32%] [count: INV]
> ;;    succ:       11 [83.3% (guessed)]  (FALSE_VALUE,EXECUTABLE)
> ;;                12 [16.7% (guessed)]  (TRUE_VALUE,EXECUTABLE)
> 
> [ ... ]
> ;;   basic block 13, loop depth 0, freq 1668, maybe hot
> ;;    prev block 12, next block 14, flags: (NEW, REACHABLE, VISITED)
> ;;    pred:       4 [16.7% (guessed)]  (TRUE_VALUE,EXECUTABLE)
>   # prephitmp_15 = PHI <i.0_1(4)>
>   goto <bb 9>; [100.00%] [count: INV]
> ;;    succ:       9 [always]  (FALLTHRU,EXECUTABLE)
> 
> 
> 
> WHen bb9 is reached by bb13 we know by back substitution that the assignemnt
> 
>  _2 = prehitmp_19 != 1
>  _2 = prehitmp_15 != 1
>  _2 = i.0_1 != 1
> 
> And we should know enough about the range of i.0_1 to determine that is
> true which would allow us to thread the jump.  But a few things get in
> the way.
> 
> First, the VRP thread doesn't try hard at all to simplify gimple
> assignments.  It really just tries to simplify gimple_cond and
> gimple_switch.  This could be trivially improved.
> 
> So if I do the right things by hand we end up trying to evaluate i.0_1
> != 1.  So that's good.  But we don't get a useful result back from
> vrp_evaluate_conditional.  WHy?  Because the recorded range for i.0_1 is
> [1,INF] -- that's the global range for i.0_1 not the range on the path.
> 
> Now it turns out this is precisely the problem that I've got code to
> address in my queue which fixes a regression I was working on earlier
> this year in the gcc-7 cycle.  It's backed up behind some improvements
> to tree-ssa-uninit.c that Richi and I still need to hash through.
> 
> This would likely also be fixed by Andrew's work on ranges.  It's
> precisely the kind of query its designed to answer.
> 
> In summary, your switch lowering does regress the code we get for this
> test.  But the regression is more a failing of the jump threader than
> anything.
> 
>>
>> 2) uninit-18.c where we currently report:
>>
>> /home/marxin/Programming/gcc/gcc/testsuite/gcc.dg/uninit-18.c:13:12: 
>> warning: ‘tmp’ may be used uninitialized in this function 
>> [-Wmaybe-uninitialized]
>>      tmp[5] = 7;    /* { dg-bogus "may be used uninitialized" } */
>>
>> Am I right that the pass uses VRP?
> No, it doesn't use VRP.   If you look at the history of uninit-18 it shows:
> commit cac06b024306772e2be2d357064f7ca2aa82bde8
> Author: rguenth <rguenth@138bc75d-0d04-0410-961f-82ee72b054a4>
> Date:   Fri Jan 16 13:26:10 2015 +0000
> 
>     2015-01-16  Richard Biener  <rguent...@suse.de>
> 
>             PR middle-end/64614
>             * tree-ssa-uninit.c: Include tree-cfg.h.
>             (MAX_SWITCH_CASES): New define.
>             (convert_control_dep_chain_into_preds): Handle switch
> statements.
>             (is_pred_expr_subset_of): Handle x == CST vs. (x & CST) != 0.
>             (normalize_one_pred_1): Do not split bit-manipulations.
>             Record (x & CST).
> 
>             * gcc.dg/uninit-18.c: New testcase.
> 
> Essentially tree-ssa-uninit.c has some specialized code to handle switch
> statements in predicates with BIT_AND_EXPR and uninit-18.c tests that
> capability.
> 
> Your new switch lowering breaks that relatively specialized code.
> 
> You could further enhance uninit to detect this.   From the .uninit dump:
> 
> MEM[(char *)tmp_2 + 5B] = 7;
> is guarded by :
> 
> bar_5(D) > 0 (.AND.)  (.NOT.) bar_5(D) > 1
> 
> 
> [CHECK]: Found unguarded use: MEM[(char *)tmp_2 + 5B] = 7;
> 
> 
> So that check is really bar_5 == 1 and in theory if bar_5 == 1, then we
> would have assigned a value to tmp.   VRP is capable of making that
> determination and I think when VRP does that, things simplify enough
> that the general predicate code in tree-ssa-uninit.c will DTRT.
> 
> I wonder if we could get the tighter test coming out of switch lowering,
> or discover it in DOM, then there's a reasonable change
> tree-ssa-uninit.c would just DTRT.
> 
> Looking at the DOM dumps we have:
> 
> Optimizing block #5
> 
> 1>>> STMT 1 = bar_5(D) le_expr 1
> 1>>> STMT 0 = bar_5(D) gt_expr 1
> Optimizing statement if (bar_5(D) >= 1)
> LKUP STMT bar_5(D) ge_expr 1
> 
> 
> DOM doesn't usually try to do any kind of range optimization.  But this
> is a case where it might make sense.
> 
> So the first two lines indicate that we know bar5 <= 1 is true and that
> bar5 > 1 must be false.  Those are expressions we can look up in the
> hash table.
> 
> So when the lookup bar5 >= 1 fails we could in theory lookup bar5 <= 1
> and if that's true, then we know the test could be simplified to bar5 == 1
> 
> I don't know how often this would hit in general and when it hits how
> often the resulting code is actually better.   I can try to do some
> instrumentation on that...  I wonder if that might help the first case
> as well.
> 
> 
>>
>> Last question is whether the pass is properly moved in optimization pipeline?
>> Patch can bootstrap on ppc64le-redhat-linux and survives regression tests.
> Well, that's a *real* interesting question.  THe two cases above argue
> that it's too early.  However, I could easily make an argument that the
> spot you chose was reasonable -- namely it's after the first set of
> threading/vrp/dom passes, but before the second set of threading/vrp/dom
> passes.
> 
> Let me poke a bit with DOM and see if it can reasonably clean up this
> stuff and whether or not the transformation is generally useful.
> 
> Jeff
>


Hello.

Thank you Jeff for very verbose explanation what's happening. I'm planning to do
follow-up of this patch that will include clustering for bit-tests and jump 
tables.
Maybe that will make aforementioned issues even more difficult, but we'll see.

Martin

Re: [PATCH][RFC] Radically simplify emission of balanced tree for switch statements.

Reply via email to