On Mon, Jul 24, 2017 at 4:27 PM, Bin.Cheng <amker.ch...@gmail.com> wrote: > Ping^1.
Ok. Thanks and sorry for the delay. Richard. > Thanks, > bin > > On Mon, Jul 10, 2017 at 9:23 AM, Bin.Cheng <amker.ch...@gmail.com> wrote: >> On Tue, Jul 4, 2017 at 1:29 PM, Richard Biener >> <richard.guent...@gmail.com> wrote: >>> On Tue, Jul 4, 2017 at 2:06 PM, Bin.Cheng <amker.ch...@gmail.com> wrote: >>>> On Tue, Jul 4, 2017 at 12:19 PM, Richard Biener >>>> <richard.guent...@gmail.com> wrote: >>>>> On Mon, Jul 3, 2017 at 4:17 PM, Bin.Cheng <amker.ch...@gmail.com> wrote: >>>>>> On Mon, Jul 3, 2017 at 10:38 AM, Richard Biener >>>>>> <richard.guent...@gmail.com> wrote: >>>>>>> On Tue, Jun 27, 2017 at 12:49 PM, Bin Cheng <bin.ch...@arm.com> wrote: >>>>>>>> Hi, >>>>>>>> For the moment, tree-predcom.c only supports >>>>>>>> invariant/load-loads/store-loads chains. >>>>>>>> This patch generalizes dead store elimination (or store motion) across >>>>>>>> loop iterations in >>>>>>>> predictive commoning pass by supporting store-store chain. As comment >>>>>>>> in the patch: >>>>>>>> >>>>>>>> Apart from predictive commoning on Load-Load and Store-Load chains, >>>>>>>> we >>>>>>>> also support Store-Store chains -- stores killed by other store can >>>>>>>> be >>>>>>>> eliminated. Given below example: >>>>>>>> >>>>>>>> for (i = 0; i < n; i++) >>>>>>>> { >>>>>>>> a[i] = 1; >>>>>>>> a[i+2] = 2; >>>>>>>> } >>>>>>>> >>>>>>>> It can be replaced with: >>>>>>>> >>>>>>>> t0 = a[0]; >>>>>>>> t1 = a[1]; >>>>>>>> for (i = 0; i < n; i++) >>>>>>>> { >>>>>>>> a[i] = 1; >>>>>>>> t2 = 2; >>>>>>>> t0 = t1; >>>>>>>> t1 = t2; >>>>>>>> } >>>>>>>> a[n] = t0; >>>>>>>> a[n+1] = t1; >>>>>>>> >>>>>>>> If the loop runs more than 1 iterations, it can be further >>>>>>>> simplified into: >>>>>>>> >>>>>>>> for (i = 0; i < n; i++) >>>>>>>> { >>>>>>>> a[i] = 1; >>>>>>>> } >>>>>>>> a[n] = 2; >>>>>>>> a[n+1] = 2; >>>>>>>> >>>>>>>> The interesting part is this can be viewed either as general store >>>>>>>> motion >>>>>>>> or general dead store elimination in either intra/inter-iterations >>>>>>>> way. >>>>>>>> >>>>>>>> There are number of interesting facts about this enhancement: >>>>>>>> a) This patch supports dead store elimination for both >>>>>>>> across-iteration case and single-iteration >>>>>>>> case. For the latter, it is dead store elimination. >>>>>>>> b) There are advantages supporting dead store elimination in predcom, >>>>>>>> for example, it has >>>>>>>> complete information about memory address. On the contrary, DSE >>>>>>>> pass can only handle >>>>>>>> memory references with exact the same memory address expression. >>>>>>>> c) It's cheap to support store-stores chain in predcom based on >>>>>>>> existing code. >>>>>>>> d) As commented, the enhancement can be viewed as either generalized >>>>>>>> dead store elimination >>>>>>>> or generalized store motion. I prefer DSE here. >>>>>>>> >>>>>>>> Bootstrap(O2/O3) in patch series on x86_64 and AArch64. Is it OK? >>>>>>> >>>>>>> Looks mostly ok. I have a few questions though. >>>>>>> >>>>>>> + /* Don't do store elimination if loop has multiple exit edges. */ >>>>>>> + bool eliminate_store_p = single_exit (loop) != NULL; >>>>>>> >>>>>>> handling this would be an enhancement? IIRC LIM store-motion handles >>>>>>> this >>>>>>> just fine by emitting code on all exits. >>>>>> It is an enhancement with a little bit more complication. We would >>>>>> need to setup/record finalizer memory references for different exit >>>>>> edges. I added TODO description for this (and following one). Is it >>>>>> okay to pick up this in the future? >>>>> >>>>> Yes. >>>>> >>>>>>> >>>>>>> @@ -1773,6 +2003,9 @@ determine_unroll_factor (vec<chain_p> chains) >>>>>>> { >>>>>>> if (chain->type == CT_INVARIANT) >>>>>>> continue; >>>>>>> + /* Don't unroll when eliminating stores. */ >>>>>>> + else if (chain->type == CT_STORE_STORE) >>>>>>> + return 1; >>>>>>> >>>>>>> this is a hard exit value so we do not handle the case where another >>>>>>> chain >>>>>>> in the loop would want to unroll? (enhancement?) I'd have expected to >>>>>>> do >>>>>>> the same as for CT_INVARIANT here. >>>>>> I didn't check what change is needed in case of unrolling. I am not >>>>>> very sure if we should prefer unroll for *load chains or prefer not >>>>>> unroll for store-store chains, because unroll in general increases >>>>>> loop-carried register pressure for store-store chains rather than >>>>>> decreases register pressure for *load chains. >>>>>> I was also thinking if it's possible to restrict unrolling somehow in >>>>>> order to enable predcom at O2. BTW, this is not common, it only >>>>>> happens once in spec2k6 with factor forced to 1. So okay if as it is >>>>>> now? >>>>> >>>>> I think it is ok for now with a TODO added. Please change the comment >>>>> to "we can't handle unrolling when eliminating stores" (it's not clear if >>>>> we >>>>> can -- did you try? maybe add a testcase that would ICE) >>>>> >>>>>> >>>>>>> >>>>>>> + tree init = ref_at_iteration (dr, (int) 0 - i, &stmts); >>>>>>> + if (!chain->all_always_accessed && tree_could_trap_p (init)) >>>>>>> + { >>>>>>> + gimple_seq_discard (stmts); >>>>>>> + return false; >>>>>>> >>>>>>> so this is the only place that remotely cares for not always performed >>>>>>> stores. >>>>>>> But as-is the patch doesn't seem to avoid speculating stores and thus >>>>>>> violates the C++ memory model, aka, introduces store-data-races? The >>>>>>> LIM >>>>>>> store-motion code was fixed to avoid this by keeping track of whether a >>>>>>> BB >>>>>>> has executed to guard the stores done in the compensation code on the >>>>>>> loop >>>>>>> exit. >>>>>>> >>>>>>> That said, to "fix" this all && tree_could_trap_p cases would need to >>>>>>> be removed >>>>>>> (or similarly flag vars be introduced). Speculating loads that do not >>>>>>> trap is ok >>>>>>> (might only introduce false uninit use reports by tools like valgrind). >>>>>> Hmm, not sure IIUC. Patch updated, is it correct (though conservative)? >>>>> >>>>> +static bool >>>>> +prepare_initializers_chain_store_elim (struct loop *, chain_p chain) >>>>> +{ >>>>> ... >>>>> + tree init = ref_at_iteration (dr, (int) 0 - i, &stmts); >>>>> + if (!chain->all_always_accessed && tree_could_trap_p (init)) >>>>> + { >>>>> >>>>> remove && tree_could_trap_p and hoist the other check. >>>>> >>>>> Same in prepare_finalizers_chain. I think you should check >>>>> >>>>> if (! chain->all_always_accessed >>>>> && ! PARAM_VALUE (PARAM_ALLOW_STORE_DATA_RACES)) >>>>> return false; >>>>> >>>>> at the beginning of both functions and retain >>>>> >>>>> if (! chain->all_always_accessed && tree_could_trap_p ()) >>>>> >>>>> in the loops. >>>>> >>>>> Ok with that change and a testcase that would exercise failure/ICE of >>>>> store chains w/ unrolling. >>>> Hmm, now I remember maybe these >>>> all_always_accessed/trap/data_store_race checks can be simplified. In >>>> function suitable_component_p, we call just_once_each_iteration_p for >>>> all references. So we shouldn't not end up with >>>> chain->all_always_accessed == false cases, right? why means we don't >>>> really need to check at all? >>> >>> Not sure. We have this check in the existing code already. Please come >>> up with a testcase that we might not DSE-pcom because of store speculation. >> So just_once_each_iteration_p function call skips all if-condition >> cases, while chain->all_always_accessed skips multi-exit cases. As >> for store-elimination, given we already requires single-exit loops, we >> can assume chain->all_always_accessed is true in >> prepare_initializers_chain_store_elim and prepare_finalizers_chain. >> To make code clearer, this updated patch explicitly checks >> chain->all_always_accessed at the beginning of both functions. The >> fact is, we can't eliminate conditional stores (at least with current >> code) because both the condition and niters need to be compilation >> time constants to propagate the value for storing. It's complicated >> and may not worth the effort if they are constants. >> I also added two new tests, one for conditional store elimination, the >> other for unrolling/store-elimination. >> Bootstrap and test on x86_64 and AArch64. Is it OK? >> >> Thanks, >> bin >> 2017-07-04 Bin Cheng <bin.ch...@arm.com> >> >> * tree-predcom.c: Revise general description of pass. >> (enum chain_type): New enum type for store elimination. >> (struct chain): New field supporting store elimination. >> (dump_chain): Dump store-stores chain. >> (release_chain): Release resources. >> (split_data_refs_to_components): Compute and create component >> contains only stores for elimination. >> (get_chain_last_ref_at): New function. >> (make_invariant_chain): Initialization. >> (make_rooted_chain): Specify chain type in parameter. >> (add_looparound_copies): Skip for store-stores chain. >> (determine_roots_comp): Compute type of chain and pass it to >> make_rooted_chain. >> (initialize_root_vars_store_elim_2): New function. >> (finalize_eliminated_stores): New function. >> (remove_stmt): Handle store for elimination. >> (execute_pred_commoning_chain): Execute predictive commoning on >> store-store chains. >> (determine_unroll_factor): Skip unroll for store-stores chain. >> (prepare_initializers_chain_store_elim): New function. >> (prepare_initializers_chain): Hanlde store-store chain. >> (prepare_finalizers_chain, prepare_finalizers): New function. >> (tree_predictive_commoning_loop): Return integer value indicating >> if loop is unrolled or lcssa form is corrupted. >> (tree_predictive_commoning): Rewrite for lcssa form if necessary. >> >> gcc/testsuite/ChangeLog >> 2017-07-04 Bin Cheng <bin.ch...@arm.com> >> >> * gcc.dg/tree-ssa/predcom-dse-1.c: New test. >> * gcc.dg/tree-ssa/predcom-dse-2.c: New test. >> * gcc.dg/tree-ssa/predcom-dse-3.c: New test. >> * gcc.dg/tree-ssa/predcom-dse-4.c: New test. >> * gcc.dg/tree-ssa/predcom-dse-5.c: New test. >> * gcc.dg/tree-ssa/predcom-dse-6.c: New test. >> * gcc.dg/tree-ssa/predcom-dse-7.c: New test. >> * gcc.dg/tree-ssa/predcom-dse-8.c: New test. >> * gcc.dg/tree-ssa/predcom-dse-9.c: New test. >> * gcc.dg/tree-ssa/predcom-dse-10.c: New test. >> * gcc.dg/tree-ssa/predcom-dse-11.c: New test.