Re: [PATCH GCC][3/4]Generalize dead store elimination (or store motion) across loop iterations in predcom

Richard Biener Tue, 25 Jul 2017 04:46:57 -0700

On Mon, Jul 24, 2017 at 4:27 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
> Ping^1.


Ok.

Thanks and sorry for the delay.
Richard.

> Thanks,
> bin
>
> On Mon, Jul 10, 2017 at 9:23 AM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>> On Tue, Jul 4, 2017 at 1:29 PM, Richard Biener
>> <richard.guent...@gmail.com> wrote:
>>> On Tue, Jul 4, 2017 at 2:06 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>>>> On Tue, Jul 4, 2017 at 12:19 PM, Richard Biener
>>>> <richard.guent...@gmail.com> wrote:
>>>>> On Mon, Jul 3, 2017 at 4:17 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>>>>>> On Mon, Jul 3, 2017 at 10:38 AM, Richard Biener
>>>>>> <richard.guent...@gmail.com> wrote:
>>>>>>> On Tue, Jun 27, 2017 at 12:49 PM, Bin Cheng <bin.ch...@arm.com> wrote:
>>>>>>>> Hi,
>>>>>>>> For the moment, tree-predcom.c only supports 
>>>>>>>> invariant/load-loads/store-loads chains.
>>>>>>>> This patch generalizes dead store elimination (or store motion) across 
>>>>>>>> loop iterations in
>>>>>>>> predictive commoning pass by supporting store-store chain.  As comment 
>>>>>>>> in the patch:
>>>>>>>>
>>>>>>>>    Apart from predictive commoning on Load-Load and Store-Load chains, 
>>>>>>>> we
>>>>>>>>    also support Store-Store chains -- stores killed by other store can 
>>>>>>>> be
>>>>>>>>    eliminated.  Given below example:
>>>>>>>>
>>>>>>>>      for (i = 0; i < n; i++)
>>>>>>>>        {
>>>>>>>>          a[i] = 1;
>>>>>>>>          a[i+2] = 2;
>>>>>>>>        }
>>>>>>>>
>>>>>>>>    It can be replaced with:
>>>>>>>>
>>>>>>>>      t0 = a[0];
>>>>>>>>      t1 = a[1];
>>>>>>>>      for (i = 0; i < n; i++)
>>>>>>>>        {
>>>>>>>>          a[i] = 1;
>>>>>>>>          t2 = 2;
>>>>>>>>          t0 = t1;
>>>>>>>>          t1 = t2;
>>>>>>>>        }
>>>>>>>>      a[n] = t0;
>>>>>>>>      a[n+1] = t1;
>>>>>>>>
>>>>>>>>    If the loop runs more than 1 iterations, it can be further 
>>>>>>>> simplified into:
>>>>>>>>
>>>>>>>>      for (i = 0; i < n; i++)
>>>>>>>>        {
>>>>>>>>          a[i] = 1;
>>>>>>>>        }
>>>>>>>>      a[n] = 2;
>>>>>>>>      a[n+1] = 2;
>>>>>>>>
>>>>>>>>    The interesting part is this can be viewed either as general store 
>>>>>>>> motion
>>>>>>>>    or general dead store elimination in either intra/inter-iterations 
>>>>>>>> way.
>>>>>>>>
>>>>>>>> There are number of interesting facts about this enhancement:
>>>>>>>> a) This patch supports dead store elimination for both 
>>>>>>>> across-iteration case and single-iteration
>>>>>>>>      case.  For the latter, it is dead store elimination.
>>>>>>>> b) There are advantages supporting dead store elimination in predcom, 
>>>>>>>> for example, it has
>>>>>>>>      complete information about memory address.  On the contrary, DSE 
>>>>>>>> pass can only handle
>>>>>>>>      memory references with exact the same memory address expression.
>>>>>>>> c) It's cheap to support store-stores chain in predcom based on 
>>>>>>>> existing code.
>>>>>>>> d) As commented, the enhancement can be viewed as either generalized 
>>>>>>>> dead store elimination
>>>>>>>>      or generalized store motion.  I prefer DSE here.
>>>>>>>>
>>>>>>>> Bootstrap(O2/O3) in patch series on x86_64 and AArch64.  Is it OK?
>>>>>>>
>>>>>>> Looks mostly ok.  I have a few questions though.
>>>>>>>
>>>>>>> +  /* Don't do store elimination if loop has multiple exit edges.  */
>>>>>>> +  bool eliminate_store_p = single_exit (loop) != NULL;
>>>>>>>
>>>>>>> handling this would be an enhancement?  IIRC LIM store-motion handles 
>>>>>>> this
>>>>>>> just fine by emitting code on all exits.
>>>>>> It is an enhancement with a little bit more complication.  We would
>>>>>> need to setup/record finalizer memory references for different exit
>>>>>> edges.  I added TODO description for this (and following one).  Is it
>>>>>> okay to pick up this in the future?
>>>>>
>>>>> Yes.
>>>>>
>>>>>>>
>>>>>>> @@ -1773,6 +2003,9 @@ determine_unroll_factor (vec<chain_p> chains)
>>>>>>>      {
>>>>>>>        if (chain->type == CT_INVARIANT)
>>>>>>>         continue;
>>>>>>> +      /* Don't unroll when eliminating stores.  */
>>>>>>> +      else if (chain->type == CT_STORE_STORE)
>>>>>>> +       return 1;
>>>>>>>
>>>>>>> this is a hard exit value so we do not handle the case where another 
>>>>>>> chain
>>>>>>> in the loop would want to unroll? (enhancement?)  I'd have expected to 
>>>>>>> do
>>>>>>> the same as for CT_INVARIANT here.
>>>>>> I didn't check what change is needed in case of unrolling.  I am not
>>>>>> very sure if we should prefer unroll for *load chains or prefer not
>>>>>> unroll for store-store chains, because unroll in general increases
>>>>>> loop-carried register pressure for store-store chains rather than
>>>>>> decreases register pressure for *load chains.
>>>>>> I was also thinking if it's possible to restrict unrolling somehow in
>>>>>> order to enable predcom at O2.  BTW, this is not common, it only
>>>>>> happens once in spec2k6 with factor forced to 1.  So okay if as it is
>>>>>> now?
>>>>>
>>>>> I think it is ok for now with a TODO added.  Please change the comment
>>>>> to "we can't handle unrolling when eliminating stores" (it's not clear if 
>>>>> we
>>>>> can -- did you try?  maybe add a testcase that would ICE)
>>>>>
>>>>>>
>>>>>>>
>>>>>>> +      tree init = ref_at_iteration (dr, (int) 0 - i, &stmts);
>>>>>>> +      if (!chain->all_always_accessed && tree_could_trap_p (init))
>>>>>>> +       {
>>>>>>> +         gimple_seq_discard (stmts);
>>>>>>> +         return false;
>>>>>>>
>>>>>>> so this is the only place that remotely cares for not always performed 
>>>>>>> stores.
>>>>>>> But as-is the patch doesn't seem to avoid speculating stores and thus
>>>>>>> violates the C++ memory model, aka, introduces store-data-races?  The 
>>>>>>> LIM
>>>>>>> store-motion code was fixed to avoid this by keeping track of whether a 
>>>>>>> BB
>>>>>>> has executed to guard the stores done in the compensation code on the 
>>>>>>> loop
>>>>>>> exit.
>>>>>>>
>>>>>>> That said, to "fix" this all && tree_could_trap_p cases would need to 
>>>>>>> be removed
>>>>>>> (or similarly flag vars be introduced).  Speculating loads that do not
>>>>>>> trap is ok
>>>>>>> (might only introduce false uninit use reports by tools like valgrind).
>>>>>> Hmm, not sure IIUC.  Patch updated, is it correct (though conservative)?
>>>>>
>>>>> +static bool
>>>>> +prepare_initializers_chain_store_elim (struct loop *, chain_p chain)
>>>>> +{
>>>>> ...
>>>>> +      tree init = ref_at_iteration (dr, (int) 0 - i, &stmts);
>>>>> +      if (!chain->all_always_accessed && tree_could_trap_p (init))
>>>>> +       {
>>>>>
>>>>> remove && tree_could_trap_p and hoist the other check.
>>>>>
>>>>> Same in prepare_finalizers_chain.  I think you should check
>>>>>
>>>>>    if (! chain->all_always_accessed
>>>>>       && ! PARAM_VALUE (PARAM_ALLOW_STORE_DATA_RACES))
>>>>>      return false;
>>>>>
>>>>> at the beginning of both functions and retain
>>>>>
>>>>>     if (! chain->all_always_accessed && tree_could_trap_p ())
>>>>>
>>>>> in the loops.
>>>>>
>>>>> Ok with that change and a testcase that would exercise failure/ICE of
>>>>> store chains w/ unrolling.
>>>> Hmm, now I remember maybe these
>>>> all_always_accessed/trap/data_store_race checks can be simplified.  In
>>>> function suitable_component_p, we call just_once_each_iteration_p for
>>>> all references.  So we shouldn't not end up with
>>>> chain->all_always_accessed == false cases, right?  why means we don't
>>>> really need to check at all?
>>>
>>> Not sure.  We have this check in the existing code already.  Please come
>>> up with a testcase that we might not DSE-pcom because of store speculation.
>> So just_once_each_iteration_p function call skips all if-condition
>> cases, while chain->all_always_accessed skips multi-exit cases.  As
>> for store-elimination, given we already requires single-exit loops, we
>> can assume chain->all_always_accessed is true in
>> prepare_initializers_chain_store_elim and prepare_finalizers_chain.
>> To make code clearer, this updated patch explicitly checks
>> chain->all_always_accessed at the beginning of both functions.  The
>> fact is, we can't eliminate conditional stores (at least with current
>> code) because both the condition and niters need to be compilation
>> time constants to propagate the value for storing.  It's complicated
>> and may not worth the effort if they are constants.
>> I also added two new tests, one for conditional store elimination, the
>> other for unrolling/store-elimination.
>> Bootstrap and test on x86_64 and AArch64.  Is it OK?
>>
>> Thanks,
>> bin
>> 2017-07-04  Bin Cheng  <bin.ch...@arm.com>
>>
>>     * tree-predcom.c: Revise general description of pass.
>>     (enum chain_type): New enum type for store elimination.
>>     (struct chain): New field supporting store elimination.
>>     (dump_chain): Dump store-stores chain.
>>     (release_chain): Release resources.
>>     (split_data_refs_to_components): Compute and create component
>>     contains only stores for elimination.
>>     (get_chain_last_ref_at): New function.
>>     (make_invariant_chain): Initialization.
>>     (make_rooted_chain): Specify chain type in parameter.
>>     (add_looparound_copies): Skip for store-stores chain.
>>     (determine_roots_comp): Compute type of chain and pass it to
>>     make_rooted_chain.
>>     (initialize_root_vars_store_elim_2): New function.
>>     (finalize_eliminated_stores): New function.
>>     (remove_stmt): Handle store for elimination.
>>     (execute_pred_commoning_chain): Execute predictive commoning on
>>     store-store chains.
>>     (determine_unroll_factor): Skip unroll for store-stores chain.
>>     (prepare_initializers_chain_store_elim): New function.
>>     (prepare_initializers_chain): Hanlde store-store chain.
>>     (prepare_finalizers_chain, prepare_finalizers): New function.
>>     (tree_predictive_commoning_loop): Return integer value indicating
>>     if loop is unrolled or lcssa form is corrupted.
>>     (tree_predictive_commoning): Rewrite for lcssa form if necessary.
>>
>> gcc/testsuite/ChangeLog
>> 2017-07-04  Bin Cheng  <bin.ch...@arm.com>
>>
>>     * gcc.dg/tree-ssa/predcom-dse-1.c: New test.
>>     * gcc.dg/tree-ssa/predcom-dse-2.c: New test.
>>     * gcc.dg/tree-ssa/predcom-dse-3.c: New test.
>>     * gcc.dg/tree-ssa/predcom-dse-4.c: New test.
>>     * gcc.dg/tree-ssa/predcom-dse-5.c: New test.
>>     * gcc.dg/tree-ssa/predcom-dse-6.c: New test.
>>     * gcc.dg/tree-ssa/predcom-dse-7.c: New test.
>>     * gcc.dg/tree-ssa/predcom-dse-8.c: New test.
>>     * gcc.dg/tree-ssa/predcom-dse-9.c: New test.
>>     * gcc.dg/tree-ssa/predcom-dse-10.c: New test.
>>     * gcc.dg/tree-ssa/predcom-dse-11.c: New test.

Re: [PATCH GCC][3/4]Generalize dead store elimination (or store motion) across loop iterations in predcom

Reply via email to