On Fri, May 27, 2016 at 12:56 PM, Richard Biener <richard.guent...@gmail.com> wrote:
> On Fri, May 27, 2016 at 1:11 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>> On Fri, May 27, 2016 at 11:45 AM, Richard Biener
>> <richard.guent...@gmail.com> wrote:
>>> On Wed, May 25, 2016 at 1:22 PM, Bin Cheng <bin.ch...@arm.com> wrote:
>>>> Hi,
>>>> As analyzed in PR68030 and PR69710, the vectorizer generates duplicated
>>>> computations in the loop's pre-header basic block when creating base
>>>> addresses for vector references to the same memory object.  Because the
>>>> duplicated code is outside of the loop, IVOPT fails to track the base
>>>> object for these vector references, resulting in missed strength
>>>> reduction.
>>>> It's agreed that the vectorizer should be improved to generate optimal
>>>> (IVOPT-friendly) code; the difficult part is that we want a generic
>>>> infrastructure.  After investigation, I tried to introduce a
>>>> generic/simple local CSE interface by reusing the existing
>>>> algorithm/data structures from tree-ssa-dom (tree-ssa-scopedtables).
>>>> The interface runs local CSE for each basic block recorded in a bitmap;
>>>> customers of this interface only need to record basic blocks in the
>>>> bitmap when necessary.  Note we don't need scopedtables' unwinding
>>>> facility since the interface runs only on single basic blocks, which
>>>> should be good in terms of compilation time.
>>>> Besides the CSE issue, this patch also re-associates address expressions
>>>> in vect_create_addr_base_for_vector_ref; specifically, it splits out the
>>>> constant offset and adds it back near the expression root in the IR.
>>>> This is necessary because GCC only handles re-association for
>>>> commutative operators in CSE.
>>>>
>>>> I checked its impact on various test cases.
>>>> With this patch, PR68030's generated assembly is reduced from ~750 lines
>>>> to ~580 lines on x86_64, with both the pre-header and the loop body
>>>> simplified.  But,
>>>> 1) It doesn't fix the whole problem on x86_64.  The root cause is that
>>>> the computation of the base address of the first reference is somehow
>>>> moved outside of the loop's pre-header, so local CSE can't help in this
>>>> case.  Though split_constant_offset can back-track the SSA def chain, it
>>>> can introduce redundancies when there are no CSE opportunities in the
>>>> pre-header.
>>>> 2) It causes a regression for PR68030 on AArch64.  I think the
>>>> regression is caused by IVOPT issues which are exposed by this patch.
>>>> The check on offset validity in get_address_cost is wrong/inaccurate
>>>> now.  It considers an offset as valid if it's within the maximum offset
>>>> range that the backend supports.  This is not true; AArch64, for
>>>> example, additionally requires the offset to be aligned.  For instance,
>>>> LDR [base + 2060] is invalid for V4SFmode, although "2060" is within the
>>>> maximum offset range.  Another issue is also in get_address_cost: an
>>>> inaccurate cost is computed for the "base + offset + INDEX" address
>>>> expression.  When register pressure is low, "base + offset" can be
>>>> hoisted out of the loop and we can use the [base + INDEX] addressing
>>>> mode, which is the current behavior.
>>>>
>>>> Bootstrapped and tested on x86_64 and AArch64.  Any comments appreciated.
>>>
>>> It looks quite straightforward with the caveat that it has one obvious
>>> piece whose cost is not in the order of the complexity of a basic block:
>>> threadedge_initialize_values creates the SSA value array
>> I noticed this too, and think it's better to get rid of these init/fini
>> functions by some kind of re-design.  I find it quite weird to call
>> threadedge_X in tree-vrp.c.  I will keep investigating this.
>>> which is zero-initialized (upon use).  That's probably a non-issue for
>>> the use you propose for the vectorizer (call cse_bbs once per function).
>>> As ideally I would like this facility to replace the loop unroller's own
>>> propagate_constants_for_unrolling, it might become an issue though.
>>> In that regard the unroller facility is also more powerful because it
>>> handles regions (including PHIs).
>> With the current data structure, I think it's not very hard to extend
>> the interface to regions.  I will keep investigating this too.  BTW,
>> if it's okay, I intend to do that in follow-up patches.
>
> I'm fine with doing enhancements to this in follow-up patches (adding Jeff
> to CC for his opinions).

Hi,
I further investigated the impact of this patch and believe my previous
conclusion still holds.  On targets that don't support the
[base + index + offset] addressing mode, it may cause IVO to generate worse
code than it does now.  I found and investigated a new case showing exactly
the same problem.  Taking AArch64 as an example, before this patch IVO
computes costs and chooses candidates as below:
Loop preheader:
  vect_p_1 = ...
  vect_p_2 = ...

Loop header:
  MEM[vect_p_1 + VAR]
  MEM[vect_p_2 + VAR]

After this patch, CSE opportunities are realized between vect_p_1 and
vect_p_2, and IVO computes the cost for the second MEM_REF differently,
as below:

Loop preheader:
  vect_p_1 = ...
  vect_p_2 = vect_p_1 + offset

Loop header:
  MEM[vect_p_1 + VAR]
  MEM[vect_p_1 + VAR + offset]

On x86, this is optimal because the addressing mode is supported.  On
AArch64, the cost computed for [vect_p_1 + VAR + offset] is too large.  As
a result, VAR is not chosen, and more candidates are chosen instead.  The
inaccurate cost is computed because IVO relies on the LEGITIMIZE_ADDRESS
hook, as well as on use buffering, when constructing the rtx address
expression.  IVO should be aware that "vect_p_1 + offset" can be computed
as a loop invariant, just as before this patch.  I am doing experiments to
rewrite the address computation in IVO; in the meantime, I collected data
for various benchmarks on AArch64 and there is no big regression caused by
the mentioned issue.

Hi Jeff,
What's your opinion on this (and on how to extend it to a region-based
interface)?  I will update this patch for further review if you are okay
with it.

Thanks,
bin

>
>>>
>>> Note that in particular for SLP vectorization the vectorizer itself
>>> may end up creating quite some redundancies, so I wonder if it's
>>> worthwhile to CSE the vectorized loop body as well
>> Maybe.  The next step is the condition block created by the vectorizer.
>> Both prune_runtime_alias_test_list and the generated alias checks are
>> sub-optimal now; even worse, computations in the alias checks can somehow
>> be propagated to the loop pre-header.  With the help of this interface,
>> the alias checks (and later code) can be simplified.
>
> Yeah.
>
> Thanks,
> Richard.
>
>>> (and given we have PHIs eventually CSE the whole vectorized loop with
>>> pre-header as a region...)
>>>
>>> Thanks,
>>> Richard.
>>>
>>>> Thanks,
>>>> bin
>>>>
>>>> 2016-05-17  Bin Cheng  <bin.ch...@arm.com>
>>>>
>>>>         PR tree-optimization/68030
>>>>         PR tree-optimization/69710
>>>>         * tree-ssa-dom.c (cse_bbs): New function.
>>>>         * tree-ssa-dom.h (cse_bbs): New declaration.
>>>>         * tree-vect-data-refs.c (vect_create_addr_base_for_vector_ref):
>>>>         Re-associate address by splitting constant offset.
>>>>         (vect_create_data_ref_ptr, vect_setup_realignment): Record
>>>>         changed basic block.
>>>>         * tree-vect-loop-manip.c (vect_gen_niters_for_prolog_loop):
>>>>         Record changed basic block.
>>>>         * tree-vectorizer.c (tree-ssa-dom.h): Include header file.
>>>>         (changed_bbs): New variable.
>>>>         (vectorize_loops): Allocate and free CHANGED_BBS.  Call cse_bbs.
>>>>         * tree-vectorizer.h (changed_bbs): New declaration.
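
To make the shape of the cse_bbs interface described at the top of the
thread a bit more concrete, here is a minimal, self-contained C++ sketch of
block-local CSE driven by a set of recorded blocks.  All names below
(toy_stmt, toy_block, local_cse_block, cse_blocks) are hypothetical and the
toy IR is invented purely for illustration; the actual patch works on GIMPLE
and reuses the hash tables from tree-ssa-scopedtables rather than anything
shown here.

#include <iostream>
#include <map>
#include <set>
#include <string>
#include <tuple>
#include <vector>

// Toy statement: lhs = rhs1 op rhs2.
struct toy_stmt {
  std::string lhs, op, rhs1, rhs2;
};

using toy_block = std::vector<toy_stmt>;

// CSE within a single block.  The availability table lives and dies with
// the block, so no scope unwinding is needed.
static void local_cse_block (toy_block &bb)
{
  std::map<std::tuple<std::string, std::string, std::string>, std::string> avail;
  std::map<std::string, std::string> copy_of;  // lhs -> earlier equivalent def

  for (toy_stmt &s : bb)
    {
      // Canonicalize operands through earlier copies (simple copy propagation).
      auto canon = [&copy_of] (const std::string &name) {
        auto it = copy_of.find (name);
        return it == copy_of.end () ? name : it->second;
      };
      s.rhs1 = canon (s.rhs1);
      s.rhs2 = canon (s.rhs2);

      auto key = std::make_tuple (s.op, s.rhs1, s.rhs2);
      auto it = avail.find (key);
      if (it != avail.end ())
        {
          // Redundant computation: turn it into a copy of the earlier result.
          copy_of[s.lhs] = it->second;
          s.op = "copy";
          s.rhs1 = it->second;
          s.rhs2.clear ();
        }
      else
        avail[key] = s.lhs;
    }
}

// The interface idea: callers record the ids of the blocks they changed,
// and CSE runs over exactly those blocks.
static void cse_blocks (std::map<int, toy_block> &cfg, const std::set<int> &changed)
{
  for (int id : changed)
    local_cse_block (cfg[id]);
}

int main ()
{
  // Pre-header-like block with two base-address computations that become
  // identical once the constant offset has been split out.
  std::map<int, toy_block> cfg;
  cfg[0] = { { "vect_p_1", "+", "base", "step" },
             { "vect_p_2", "+", "base", "step" } };

  cse_blocks (cfg, { 0 });
  for (const toy_stmt &s : cfg[0])
    {
      if (s.op == "copy")
        std::cout << s.lhs << " = " << s.rhs1 << "\n";
      else
        std::cout << s.lhs << " = " << s.rhs1 << " " << s.op << " " << s.rhs2 << "\n";
    }
}

The point of the sketch is only the control flow: walk just the blocks that
were recorded as changed, keep one availability table per block, and throw
it away afterwards, which is why the scopedtables unwinding facility is not
needed.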
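
On the AArch64 offset-validity point above ([base + 2060] being invalid for
V4SFmode): a 128-bit LDR/STR with an unsigned immediate requires the offset
to be a multiple of 16 (the immediate field is scaled by 16), and the
unscaled LDUR/STUR form only covers -256..255.  The tiny sketch below only
illustrates that constraint; valid_q_reg_offset_p is a made-up name, and the
real test lives in the AArch64 backend's address legitimization code, not in
this form.

#include <cstdio>

/* Illustrative only: roughly when an immediate offset is usable for a
   128-bit (Q-register) load/store on AArch64.  */
static bool
valid_q_reg_offset_p (long offset)
{
  /* LDR Qt, [Xn, #imm]: unsigned 12-bit immediate scaled by 16,
     i.e. 0..65520 in steps of 16.  */
  if (offset >= 0 && offset <= 65520 && offset % 16 == 0)
    return true;
  /* LDUR Qt, [Xn, #imm]: signed 9-bit unscaled immediate, -256..255.  */
  return offset >= -256 && offset <= 255;
}

int
main (void)
{
  printf ("2048 -> %d\n", valid_q_reg_offset_p (2048));  /* 1: multiple of 16 */
  printf ("2060 -> %d\n", valid_q_reg_offset_p (2060));  /* 0: the case from the mail */
  return 0;
}

This is why a check that the offset merely falls within the maximum offset
range, as get_address_cost currently does, is not sufficient on this target.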