On Fri, May 27, 2016 at 1:11 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
> On Fri, May 27, 2016 at 11:45 AM, Richard Biener
> <richard.guent...@gmail.com> wrote:
>> On Wed, May 25, 2016 at 1:22 PM, Bin Cheng <bin.ch...@arm.com> wrote:
>>> Hi,
>>> As analyzed in PR68303 and PR69710, vectorizer generates duplicated 
>>> computations in loop's pre-header basic block when creating base address 
>>> for vector reference to the same memory object.  Because the duplicated 
>>> code is out of loop, IVOPT fails to track base object for these vector 
>>> references, resulting in missed strength reduction.
>>> It's agreed that vectorizer should be improved to generate optimal 
>>> (IVOPT-friendly) code, the difficult part is we want a generic 
>>> infrastructure.  After investigation, I tried to introduce a generic/simple 
>>> local CSE interface by reusing existing algorithm/data-structure from 
>>> tree-ssa-dom (tree-ssa-scopedtables).  The interface runs local CSE for 
>>> each basic block in a bitmap, customers of this interface only need to 
>>> record basic blocks in the bitmap when necessary.  Note we don't need 
>>> scopedtables' unwinding facility since the interface runs only for single 
>>> basic block, this should be good in terms of compilation time.
>>> Besides CSE issue, this patch also re-associates address expressions in 
>>> vect_create_addr_base_for_vector_ref, specifically, it splits constant 
>>> offset and adds it back near the expression root in IR.  This is necessary 
>>> because GCC only handles re-association for commutative operators in CSE.
>>>
>>> I checked its impact on various test cases.
>>> With this patch, PR68030's generated assembly is reduced from ~750 lines to 
>>> ~580 lines on x86_64, with both pre-header and loop body simplified.  But,
>>> 1) It doesn't fix all the problem on x86_64.  Root cause is computation for 
>>> base address of the first reference is somehow moved outside of loop's 
>>> pre-header, local CSE can't help in this case.  Though 
>>> split_constant_offset can back track ssa def chain, it causes possible 
>>> redundant when there is no CSE opportunities in pre-header.
>>> 2) It causes regression for PR68030 on AArch64.  I think the regression is 
>>> caused by IVOPT issues which are exposed by this patch.  Checks on offset 
>>> validity in get_address_cost is wrong/inaccurate now.  It considers an 
>>> offset as valid if it's within the maximum offset range that backend 
>>> supports.  This is not true, for example, AArch64 requires aligned offset 
>>> additionally.  For example, LDR [base + 2060] is invalid for V4SFmode, 
>>> although "2060" is within the maximum offset range.  Another issue is also 
>>> in get_address_cost.  Inaccurate cost is computed for "base + offset + 
>>> INDEX" address expression.  When register pressure is low, "base+offset" 
>>> can be hoisted out and we can use [base + INDEX] addressing mode, whichhis 
>>> is current behavior.
>>>
>>> Bootstrap and test on x86_64 and AArch64.  Any comments appreciated.
>>
>> It looks quite straight-forward with the caveat that it has one
>> obvious piece that is not in the order
>> of the complexity of a basic-block.  threadedge_initialize_values
>> creates the SSA value array
> I noticed this too, and think it's better to get rid of this init/fini
> functions by some kind re-design.  I found it's quite weird to call
> threadege_X in tree-vrp.c.  I will keep investigating this.
>> which is zero-initialized (upon use).  That's probably a non-issue for
>> the use you propose for
>> the vectorizer (call cse_bbs once per function).  As Ideally I would
>> like this facility to replace
>> the loop unrollers own propagate_constants_for_unrolling it might
>> become an issue though.
>> In that regard the unroller facility is also more powerful because it
>> handles regions (including
>> PHIs).
> With the current data-structure, I think it's not very hard to extend
> the interface to regions.  I will keep investigating this too.  BTW,
> if it's okay, I tend to do that in following patches.

I'm fine with doing enhancements to this in followup patches (adding Jeff
to CC for his opinions).

>>
>> Note that in particular for SLP vectorization the vectorizer itself
>> may end up creating quite
>> some redundancies so I wonder if it's worth to CSE the vectorized loop
>> body as well
> Maybe.  The next step is condition block created by vectorizer.  Both
> prune_runtime_alias_test_list and generated alias checks are
> sub-optimal now, even worse, somehow computations in alias checks can
> be propagated to loop pre-header.  With help of this interface, alias
> checks (and later code) can be simplified.

Yeah.

Thanks,
Richard.

>> (and given we have PHIs eventually CSE the whole vectorized loop with
>> pre-header as a region...)
>>
>> Thanks,
>> Richard.
>>
>>> Thanks,
>>> bin
>>>
>>> 2016-05-17 Bin Cheng  <bin.ch...@arm.com>
>>>
>>>         PR tree-optimization/68030
>>>         PR tree-optimization/69710
>>>         * tree-ssa-dom.c (cse_bbs): New function.
>>>         * tree-ssa-dom.h (cse_bbs): New declaration.
>>>         * tree-vect-data-refs.c (vect_create_addr_base_for_vector_ref):
>>>         Re-associate address by splitting constant offset.
>>>         (vect_create_data_ref_ptr, vect_setup_realignment): Record changed
>>>         basic block.
>>>         * tree-vect-loop-manip.c (vect_gen_niters_for_prolog_loop): Record
>>>         changed basic block.
>>>         * tree-vectorizer.c (tree-ssa-dom.h): Include header file.
>>>         (changed_bbs): New variable.
>>>         (vectorize_loops): Allocate and free CHANGED_BBS.  Call cse_bbs.
>>>         * tree-vectorizer.h (changed_bbs): New declaration.

Reply via email to