On Fri, May 27, 2016 at 1:11 PM, Bin.Cheng <amker.ch...@gmail.com> wrote: > On Fri, May 27, 2016 at 11:45 AM, Richard Biener > <richard.guent...@gmail.com> wrote: >> On Wed, May 25, 2016 at 1:22 PM, Bin Cheng <bin.ch...@arm.com> wrote: >>> Hi, >>> As analyzed in PR68303 and PR69710, vectorizer generates duplicated >>> computations in loop's pre-header basic block when creating base address >>> for vector reference to the same memory object. Because the duplicated >>> code is out of loop, IVOPT fails to track base object for these vector >>> references, resulting in missed strength reduction. >>> It's agreed that vectorizer should be improved to generate optimal >>> (IVOPT-friendly) code, the difficult part is we want a generic >>> infrastructure. After investigation, I tried to introduce a generic/simple >>> local CSE interface by reusing existing algorithm/data-structure from >>> tree-ssa-dom (tree-ssa-scopedtables). The interface runs local CSE for >>> each basic block in a bitmap, customers of this interface only need to >>> record basic blocks in the bitmap when necessary. Note we don't need >>> scopedtables' unwinding facility since the interface runs only for single >>> basic block, this should be good in terms of compilation time. >>> Besides CSE issue, this patch also re-associates address expressions in >>> vect_create_addr_base_for_vector_ref, specifically, it splits constant >>> offset and adds it back near the expression root in IR. This is necessary >>> because GCC only handles re-association for commutative operators in CSE. >>> >>> I checked its impact on various test cases. >>> With this patch, PR68030's generated assembly is reduced from ~750 lines to >>> ~580 lines on x86_64, with both pre-header and loop body simplified. But, >>> 1) It doesn't fix all the problem on x86_64. Root cause is computation for >>> base address of the first reference is somehow moved outside of loop's >>> pre-header, local CSE can't help in this case. Though >>> split_constant_offset can back track ssa def chain, it causes possible >>> redundant when there is no CSE opportunities in pre-header. >>> 2) It causes regression for PR68030 on AArch64. I think the regression is >>> caused by IVOPT issues which are exposed by this patch. Checks on offset >>> validity in get_address_cost is wrong/inaccurate now. It considers an >>> offset as valid if it's within the maximum offset range that backend >>> supports. This is not true, for example, AArch64 requires aligned offset >>> additionally. For example, LDR [base + 2060] is invalid for V4SFmode, >>> although "2060" is within the maximum offset range. Another issue is also >>> in get_address_cost. Inaccurate cost is computed for "base + offset + >>> INDEX" address expression. When register pressure is low, "base+offset" >>> can be hoisted out and we can use [base + INDEX] addressing mode, whichhis >>> is current behavior. >>> >>> Bootstrap and test on x86_64 and AArch64. Any comments appreciated. >> >> It looks quite straight-forward with the caveat that it has one >> obvious piece that is not in the order >> of the complexity of a basic-block. threadedge_initialize_values >> creates the SSA value array > I noticed this too, and think it's better to get rid of this init/fini > functions by some kind re-design. I found it's quite weird to call > threadege_X in tree-vrp.c. I will keep investigating this. >> which is zero-initialized (upon use). That's probably a non-issue for >> the use you propose for >> the vectorizer (call cse_bbs once per function). As Ideally I would >> like this facility to replace >> the loop unrollers own propagate_constants_for_unrolling it might >> become an issue though. >> In that regard the unroller facility is also more powerful because it >> handles regions (including >> PHIs). > With the current data-structure, I think it's not very hard to extend > the interface to regions. I will keep investigating this too. BTW, > if it's okay, I tend to do that in following patches.
I'm fine with doing enhancements to this in followup patches (adding Jeff to CC for his opinions). >> >> Note that in particular for SLP vectorization the vectorizer itself >> may end up creating quite >> some redundancies so I wonder if it's worth to CSE the vectorized loop >> body as well > Maybe. The next step is condition block created by vectorizer. Both > prune_runtime_alias_test_list and generated alias checks are > sub-optimal now, even worse, somehow computations in alias checks can > be propagated to loop pre-header. With help of this interface, alias > checks (and later code) can be simplified. Yeah. Thanks, Richard. >> (and given we have PHIs eventually CSE the whole vectorized loop with >> pre-header as a region...) >> >> Thanks, >> Richard. >> >>> Thanks, >>> bin >>> >>> 2016-05-17 Bin Cheng <bin.ch...@arm.com> >>> >>> PR tree-optimization/68030 >>> PR tree-optimization/69710 >>> * tree-ssa-dom.c (cse_bbs): New function. >>> * tree-ssa-dom.h (cse_bbs): New declaration. >>> * tree-vect-data-refs.c (vect_create_addr_base_for_vector_ref): >>> Re-associate address by splitting constant offset. >>> (vect_create_data_ref_ptr, vect_setup_realignment): Record changed >>> basic block. >>> * tree-vect-loop-manip.c (vect_gen_niters_for_prolog_loop): Record >>> changed basic block. >>> * tree-vectorizer.c (tree-ssa-dom.h): Include header file. >>> (changed_bbs): New variable. >>> (vectorize_loops): Allocate and free CHANGED_BBS. Call cse_bbs. >>> * tree-vectorizer.h (changed_bbs): New declaration.