On Fri, Jan 13, 2017 at 7:48 PM, Aldy Hernandez <al...@redhat.com> wrote:
> [Sorry for the delay, I was sick.]
>
>
> On 01/09/2017 04:30 AM, Richard Biener wrote:
>>
>> On Sat, Jan 7, 2017 at 1:54 PM, Aldy Hernandez <al...@redhat.com> wrote:
>>>
>>> On 01/04/2017 07:11 AM, Richard Biener wrote:
>>>>
>>>>
>>>> On Tue, Jan 3, 2017 at 6:36 PM, Aldy Hernandez <al...@redhat.com> wrote:
>>>>>
>>>>>
>>>>> On 12/20/2016 09:16 AM, Richard Biener wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Dec 16, 2016 at 3:41 PM, Aldy Hernandez <al...@redhat.com>
>>>>>> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Hi folks.
>>>>>>>
>>>>>>> This is a follow-up on Jeff and Richi's interaction on the
>>>>>>> aforementioned
>>>>>>> PR
>>>>>>> here:
>>>>>>>
>>>>>>>         https://gcc.gnu.org/ml/gcc-patches/2016-08/msg01397.html
>>>>>>>
>>>>>>> I decided to explore the idea of analyzing may-undefness on-demand,
>>>>>>> which
>>>>>>> actually looks rather cheap.
>>>>>>>
>>>>>>> BTW, I don't understand why we don't have auto_bitmap's, as we
>>>>>>> already
>>>>>>> have
>>>>>>> auto_sbitmap's.  I've implemented the former based on auto_sbitmap's
>>>>>>> code
>>>>>>> we
>>>>>>> already have.
>>>>>>>
>>>>>>> The attached patch fixes the bug without introducing any regressions.
>>>>>>>
>>>>>>> I also tested the patch by compiling 242 .ii files with -O3.  These
>>>>>>> were
>>>>>>> gathered from a stage1 build with -save-temps.  There is a slight
>>>>>>> time
>>>>>>> degradation of 4 seconds within 27 minutes of user time:
>>>>>>>
>>>>>>>         tainted:        26:52
>>>>>>>         orig:           26:48
>>>>>>>
>>>>>>> This was the average aggregate time of two runs compiling all 242 .ii
>>>>>>> files.
>>>>>>> IMO, this looks reasonable.  It is after all, -O3.    Is it
>>>>>>> acceptable?
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> +  while (!worklist.is_empty ())
>>>>>> +    {
>>>>>> +      name = worklist.pop ();
>>>>>> +      gcc_assert (TREE_CODE (name) == SSA_NAME);
>>>>>> +
>>>>>> +      if (ssa_undefined_value_p (name, true))
>>>>>> +       return true;
>>>>>> +
>>>>>> +      bitmap_set_bit (visited_ssa, SSA_NAME_VERSION (name));
>>>>>>
>>>>>> it should be already set as we use visited_ssa as "was it ever on the
>>>>>> worklist",
>>>>>> so maybe renaming it would be a good thing as well.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> I don't understand what you would prefer here.
>>>>
>>>>
>>>>
>>>> Set the bit when you put name on the worklist (and do not do that if the
>>>> bit is set).  Thus simply remove the above and add a bitmap_set_bit
>>>> for the initial name you put on the worklist.
>>>>
>>>>>>
>>>>>> +             if (TREE_CODE (name) == SSA_NAME)
>>>>>> +               {
>>>>>> +                 /* If an SSA has already been seen, this may be a
>>>>>> loop.
>>>>>> +                    Fail conservatively.  */
>>>>>> +                 if (bitmap_bit_p (visited_ssa, SSA_NAME_VERSION
>>>>>> (name)))
>>>>>> +                   return false;
>>>>>>
>>>>>> so to me "conservative" is returning true, not false.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> OK
>>>>>
>>>>>>
>>>>>> +                 bitmap_set_bit (visited_ssa, SSA_NAME_VERSION
>>>>>> (name));
>>>>>> +                 worklist.safe_push (name);
>>>>>>
>>>>>> but for loops we can just continue and ignore this use.  And
>>>>>> bitmap_set_bit
>>>>>> returns whether it set a bit, thus
>>>>>>
>>>>>>                 if (bitmap_set_bit (visited_ssa, SSA_NAME_VERSION
>>>>>> (name)))
>>>>>>                   worklist.safe_push (name);
>>>>>>
>>>>>> should work?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Fixed.
>>>>>
>>>>>>
>>>>>> +      /* Check that any SSA names used to define NAME is also fully
>>>>>> +        defined.  */
>>>>>> +      use_operand_p use_p;
>>>>>> +      ssa_op_iter iter;
>>>>>> +      FOR_EACH_SSA_USE_OPERAND (use_p, def, iter, SSA_OP_USE)
>>>>>> +       {
>>>>>> +         name = USE_FROM_PTR (use_p);
>>>>>> +         if (TREE_CODE (name) == SSA_NAME)
>>>>>>
>>>>>> always true.
>>>>>>
>>>>>> +           {
>>>>>> +             /* If an SSA has already been seen, this may be a loop.
>>>>>> +                Fail conservatively.  */
>>>>>> +             if (bitmap_bit_p (visited_ssa, SSA_NAME_VERSION (name)))
>>>>>> +               return false;
>>>>>> +             bitmap_set_bit (visited_ssa, SSA_NAME_VERSION (name));
>>>>>> +             worklist.safe_push (name);
>>>>>>
>>>>>> See above.
>>>>>>
>>>>>> In principle the thing is sound but I'd like to be able to pass in a
>>>>>> bitmap of
>>>>>> known maybe-undefined/must-defined SSA names to have a cache for
>>>>>> multiple invocations from within a pass (like this unswitching case).
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Done, though bitmaps are now calculated as part of the instantiation.
>>>>>
>>>>>>
>>>>>> Also once you hit defs that are in a post-dominated region of the loop
>>>>>> entry
>>>>>> you can treat them as not undefined (as their use invokes undefined
>>>>>> behavior).  This is also how you treat function parameters (well,
>>>>>> ssa_undefined_value_p does), where the call site invokes undefined
>>>>>> behavior
>>>>>> when passing in undefined values.  So we need an extra parameter
>>>>>> specifying
>>>>>> the post-dominance region.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Done.
>>>>>
>>>>>>
>>>>>> You do not handle memory or calls conservatively which means the
>>>>>> existing
>>>>>> testcase only needs some obfuscation to become a problem again.  To
>>>>>> fix
>>>>>> that before /* Check that any SSA names used to define NAME is also
>>>>>> fully
>>>>>> defined.  */ bail out conservatively, like
>>>>>>
>>>>>>    if (! is_gimple_assign (def)
>>>>>>       || gimple_assign_single_p (def))
>>>>>>     return true;
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> As I asked previously, I understand the !is_gimple_assign, which will
>>>>> skip
>>>>> over GIMPLE_CALLs setting a value, but the "gimple_assign_single_p(def)
>>>>> ==
>>>>> true"??
>>>>>
>>>>> Won't this last one bail on things like e.3_7 = e, or x_77 = y_88?
>>>>> Don't
>>>>> we
>>>>> want to follow up the def chain precisely on these?
>>>>>
>>>>> The attached implementation uses a cache, and a pre-computed
>>>>> post-dominance
>>>>> region.  A previous incantation of this patch (sans the post-dominance
>>>>> stuff) had performance within the noise of the previous implementation.
>>>>>
>>>>> I am testing again, and will do some performance comparisons in a bit,
>>>>> but
>>>>> for now-- are we on the same page here?  Is this what you wanted?
>>>>
>>>>
>>>>
>>>> +      /* DEFs in the post-dominated region of the loop entry invoke
>>>> +        undefined behavior.  Adding any use won't make things any
>>>> +        worse.  */
>>>> +      for (unsigned i = 0; i < postdom.length (); ++i)
>>>> +       if (def->bb == postdom[i])
>>>>
>>>> gimple_bb (def)
>>>>
>>>> +         {
>>>> +           set_defined (t);
>>>> +           return false;
>>>> +         }
>>>>
>>>> I think you can't really return false here but need to continue
>>>> processing
>>>> the worklist.  I'd say rather than a linear walk you can as well use
>>>> dominated_by_p (CDI_POST_DOMINATORS, ...) and record the entry
>>>> block instead?
>>>>
>>>> Note that the way you compute post-dominators doesn't work -- nothing
>>>> keeps them up-to-date when unswitching does CFG manipulations :/
>>>> Fortunately we have a way to compute them per region, see how
>>>> tree-if-conv.c
>>>> does that (build_region, calculate_dominance_info_for_region).
>>>
>>>
>>>
>>> If I understand correctly, we could compute them per region in
>>> tree_unswitch_single_loop() for each individual loop with what
>>> tree-if-conv.c uses.
>>>
>>> I could certainly abstract
>>> build_region/calculate_dominance_info_for_region
>>> into something new, say calculate_dominance_info_for_loop_region().  Then
>>> we
>>> could use dominated_by_p(CDI_POST_DOMINATORS, ...) in my class as you
>>> suggest.
>>>
>>> Question... computing the post-dom tree per region as I've just described
>>> will only create post-dom information for the loop region, which means
>>> that
>>> any definition outside of the loop will have zero dominance info.
>>
>>
>> Yes.
>>
>>> What if the definition in question post-dominates the loop entry but is
>>> outside/after of the loop?
>>
>>
>> How would we ever arrive at such def?  Once we reach a def before the
>> region/loop
>> we know it's fine to use (the missing "stop" point I pointed out).
>
>
> Hmmm, it looks like we can't even build the post-dom region appropriately in
> the presence of infinite loops.
>
> First, build_region() fails if it can't find a loop post-header:
>
>   /* The last element is loop post-header.  */
>   gcc_assert (exit_bb);
>
> which we won't have in an infinite loop like:
>
> bb2: //loop header
> |
> V
> loop 1:
> bb3:            <-------+
>         bar();          |
> |                       |
> V                       |
> bb4:                    |
>         goto bb3;   >---+
>
>
> We could just build the region without the post-header, but then we fail
> while building the DFS dominance region:
>
> dom_info::calc_dfs_tree_nonrec (basic_block bb)
> ....
>           gcc_assert (bn != m_start_block);
>
> Presumably because we've looped back to the start block (bb4 in a post
> dominator world).  This doesn't happen calculating going forward because we
> always have a start block outside of the region (the function entry block).
>
> I tried to fake edge my way out of it, but things get really messed up while
> putting/removing fake edges in the presence of loop info.  I'm assuming all
> this is by design.

Heh.  Infinite loops are indeed fun, and yes, the fake edges confuse some
of the loop stuff.

> Would you prefer me to continue down this path, trying to build a post-dom
> region with infinite loops and fixing build_region / calc_dfs_tree_nonrec,
> or could we do without this dominance optimization?  Pretty please?

It will pessimize some cases though, but see below.

>>
>>>  We will have no post-dom information for this.
>>> In this case, could we just ignore and continue evaluating the SSA_NAME
>>> with
>>> the rest of our heuristics, or did you have something else in mind?
>>> Another
>>> option would be to calculate the post-dom information for the entire
>>> function on every loop (calculate_dominance_info(CDI_POST_DOMINATORS)),
>>> but
>>> that seems rather expensive.
>>>
>>> As a side note, at this point I just want to fix/close the PR in an
>>> acceptable manner, not come up with the end-all catch-all most-efficient
>>> patch for an unlikely scenario ;-).
>>
>>
>> Yeah, which is why I wonder whether the caching is worth the trouble (in
>> it's
>> current form) for the unswitching use (given it's other restrictions
>> on conditions
>> to unswitch).
>
>
> We could go back to my original, no caching version (with the other
> suggestions).  That solves the problem quite simply, with no regressions.

So let's go with a unswitching-local solution then.  Based on
tree_may_unswitch_on:

  /* Condition must be invariant.  */
  FOR_EACH_SSA_TREE_OPERAND (use, stmt, iter, SSA_OP_USE)
    {
      /* Unswitching on undefined values would introduce undefined
         behavior that the original program might never exercise.  */
      if (ssa_undefined_value_p (use, true))
        return NULL_TREE;
      def = SSA_NAME_DEF_STMT (use);
      def_bb = gimple_bb (def);
      if (def_bb
          && flow_bb_inside_loop_p (loop, def_bb))
        return NULL_TREE;

we only have to look for uses in blocks dominating the loop header block
(or in blocks post-dominating that loop header, but we can probably implement
that by simply including the loop header itself with a FIXME comment).

Note that we only need to know whether a BB will be always executed when
the loop is entered -- there's "infrastructure" elsewhere that computes this
w/o the need of post-dominance.  For example see fill_always_executed_in_1
tree-ssa-loop-im.c (we can't use that 1:1 I think because we already use ->aux
via the original copy tables, but we could simplify it as we're only
interested in
the loop which preheader we put the unswitch condition on so we can use
a bitmap to record whether a block of the loop is always executed or not).

Can you send a patch that does this maybe?

Thanks,
Richard.


> Thanks again.
> Aldy

Reply via email to