On Fri, May 27, 2016 at 12:56 PM, Richard Biener
<richard.guent...@gmail.com> wrote:
> On Fri, May 27, 2016 at 1:11 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>> On Fri, May 27, 2016 at 11:45 AM, Richard Biener
>> <richard.guent...@gmail.com> wrote:
>>> On Wed, May 25, 2016 at 1:22 PM, Bin Cheng <bin.ch...@arm.com> wrote:
>>>> Hi,
>>>> As analyzed in PR68030 and PR69710, the vectorizer generates duplicated
>>>> computations in the loop's pre-header basic block when creating base
>>>> addresses for vector references to the same memory object.  Because the
>>>> duplicated code is outside the loop, IVOPT fails to track the base object
>>>> for these vector references, resulting in missed strength reduction.
>>>> It's agreed that the vectorizer should be improved to generate optimal
>>>> (IVOPT-friendly) code; the difficult part is that we want a generic
>>>> infrastructure.  After investigation, I tried to introduce a simple,
>>>> generic local CSE interface by reusing the existing algorithm and data
>>>> structures from tree-ssa-dom (tree-ssa-scopedtables).  The interface
>>>> runs local CSE for each basic block recorded in a bitmap; clients of
>>>> this interface only need to record basic blocks in the bitmap when
>>>> necessary.  Note we don't need scopedtables' unwinding facility since
>>>> the interface runs only on a single basic block at a time, which should
>>>> be good in terms of compilation time.
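
For reference, the caller side is meant to look roughly like the sketch
below.  BITMAP_ALLOC, bitmap_set_bit and BITMAP_FREE are the usual GCC
bitmap routines; the cse_bbs argument and pre_header_bb here are just for
illustration, not copied from the patch:

    bitmap changed_bbs = BITMAP_ALLOC (NULL);

    /* Whenever statements are emitted into a block, e.g. the loop
       pre-header while creating vector base addresses, record it.  */
    bitmap_set_bit (changed_bbs, pre_header_bb->index);

    /* After vectorization, run local CSE once over each recorded block;
       no scoped-table unwinding is needed for a single block.  */
    cse_bbs (changed_bbs);

    BITMAP_FREE (changed_bbs);
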
>>>> Besides the CSE issue, this patch also re-associates address expressions
>>>> in vect_create_addr_base_for_vector_ref; specifically, it splits out the
>>>> constant offset and adds it back near the root of the expression in the
>>>> IR.  This is necessary because GCC only handles re-association for
>>>> commutative operators in CSE.
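
As a rough illustration in plain C (made-up helper names, not vectorizer
code): before re-association each address has the form base + (i*4 + C)
with a different constant C, so there is no common subexpression; after
splitting the constant offset both references share base + i*4:

    static inline char *
    addr_before (char *base, long i, long c)
    {
      return base + (i * 4 + c);   /* constant buried in the middle */
    }

    static inline char *
    addr_after (char *base, long i, long c)
    {
      char *common = base + i * 4; /* shared part, a CSE candidate */
      return common + c;           /* constant added back at the root */
    }
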
>>>>
>>>> I checked its impact on various test cases.
>>>> With this patch, PR68030's generated assembly is reduced from ~750 lines 
>>>> to ~580 lines on x86_64, with both pre-header and loop body simplified.  
>>>> But,
>>>> 1) It doesn't fix all of the problem on x86_64.  The root cause is that
>>>> the computation of the base address of the first reference is somehow
>>>> moved outside of the loop's pre-header, so local CSE can't help in this
>>>> case.  Though split_constant_offset can backtrack the SSA def chain, it
>>>> can introduce redundancies when there are no CSE opportunities in the
>>>> pre-header.
>>>> 2) It causes a regression for PR68030 on AArch64.  I think the
>>>> regression is caused by IVOPT issues which are exposed by this patch.
>>>> The check on offset validity in get_address_cost is wrong/inaccurate
>>>> now.  It considers an offset valid if it's within the maximum offset
>>>> range that the backend supports.  This is not sufficient; AArch64, for
>>>> example, additionally requires the offset to be aligned: LDR [base +
>>>> 2060] is invalid for V4SFmode, although "2060" is within the maximum
>>>> offset range, because the unsigned immediate is scaled by the access
>>>> size and a 16-byte access needs a multiple-of-16 offset.  Another issue
>>>> is also in get_address_cost: an inaccurate cost is computed for the
>>>> "base + offset + INDEX" address expression.  When register pressure is
>>>> low, "base + offset" can be hoisted out and we can use the [base +
>>>> INDEX] addressing mode, which is the current behavior.
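
On the offset-validity point above, the constraint for a 16-byte access
can be sketched roughly as below (helper name made up); 2060 fails the
multiple-of-16 check while 2064 would pass:

    /* Rough check for the LDR Qt, [Xn, #imm] unsigned scaled-immediate
       form for a 16-byte (e.g. V4SFmode) access: a 12-bit immediate
       scaled by the access size, i.e. a multiple of 16 up to 65520.  */
    static bool
    q_reg_offset_ok_p (long offset)
    {
      return offset >= 0 && offset % 16 == 0 && offset / 16 <= 4095;
    }
    /* q_reg_offset_ok_p (2060) == false; q_reg_offset_ok_p (2064) == true.  */
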
>>>>
>>>> Bootstrapped and tested on x86_64 and AArch64.  Any comments appreciated.
>>>
>>> It looks quite straight-forward, with the caveat that it has one
>>> obvious piece whose cost is not on the order of the complexity of a
>>> basic block: threadedge_initialize_values creates the SSA value array
>> I noticed this too, and think it's better to get rid of these init/fini
>> functions by some kind of re-design.  I found it quite weird to call
>> threadedge_X in tree-vrp.c.  I will keep investigating this.
>>> which is zero-initialized (upon use).  That's probably a non-issue for
>>> the use you propose for the vectorizer (call cse_bbs once per function).
>>> Ideally I would like this facility to replace the loop unroller's own
>>> propagate_constants_for_unrolling, so it might become an issue though.
>>> In that regard the unroller facility is also more powerful because it
>>> handles regions (including PHIs).
>> With the current data structure, I think it's not very hard to extend
>> the interface to regions.  I will keep investigating this too.  BTW,
>> if it's okay, I intend to do that in follow-up patches.
>
> I'm fine with doing enhancements to this in followup patches (adding Jeff
> to CC for his opinions).
Hi,
I further investigated the impact of this patch, and believe my
previous conclusion still holds.  On targets that don't support the
[base + index + offset] addressing mode, it may cause IVO to generate
worse code than now.  I found and investigated a new case showing
exactly the same problem.  Take AArch64 as an example: before this
patch, IVO computes costs and chooses candidates as below:

Loop preheader:
    vect_p_1 = ...
    vect_p_2 = ...
Loop header:
    MEM[vect_p_1 + VAR]
    MEM[vect_p_2 + VAR]

After this patch, CSE opportunities are realized between vect_p_1 and
vect_p_2, and IVO computes the cost for the second MEM_REF differently,
as below:
Loop preheader:
    vect_p_1 = ...
    vect_p_2 = vect_p_1 + offset
Loop header:
    MEM[vect_p_1 + VAR]
    MEM[vect_p_1 + VAR + offset]

On x86, it's optimal because the addressing mode is supported.  On
AArch64, the cost computed for [vect_p_1 + VAR + offset] is too large.
As a result, VAR is not chosen, and more candidates are chosen.  The
inaccurate cost is computed because IVO relies on the
LEGITIMIZE_ADDRESS hook and also uses buffering when constructing the
rtx address expression.  IVO should be aware that "vect_p_1 + offset"
can be computed as an invariant, just like before this patch.
I am doing experiments to rewrite the address computation in IVO; in
the meantime, I collected data from various benchmarks on AArch64 and
there is no big regression caused by the mentioned issue.

Hi Jeff,
What's your opinion on this (and on how to extend it to a region-based
interface)?  I will update this patch for further review if you are
okay with it.

Thanks,
bin
>
>>>
>>> Note that in particular for SLP vectorization the vectorizer itself
>>> may end up creating quite some redundancies, so I wonder if it's
>>> worth CSEing the vectorized loop body as well
>> Maybe.  The next step is the condition block created by the
>> vectorizer.  Both prune_runtime_alias_test_list and the generated
>> alias checks are sub-optimal now; even worse, computations in the
>> alias checks can somehow be propagated to the loop pre-header.  With
>> the help of this interface, the alias checks (and later code) can be
>> simplified.
>
> Yeah.
>
> Thanks,
> Richard.
>
>>> (and given we have PHIs, eventually CSE the whole vectorized loop
>>> with its pre-header as a region...)
>>>
>>> Thanks,
>>> Richard.
>>>
>>>> Thanks,
>>>> bin
>>>>
>>>> 2016-05-17 Bin Cheng  <bin.ch...@arm.com>
>>>>
>>>>         PR tree-optimization/68030
>>>>         PR tree-optimization/69710
>>>>         * tree-ssa-dom.c (cse_bbs): New function.
>>>>         * tree-ssa-dom.h (cse_bbs): New declaration.
>>>>         * tree-vect-data-refs.c (vect_create_addr_base_for_vector_ref):
>>>>         Re-associate address by splitting constant offset.
>>>>         (vect_create_data_ref_ptr, vect_setup_realignment): Record changed
>>>>         basic block.
>>>>         * tree-vect-loop-manip.c (vect_gen_niters_for_prolog_loop): Record
>>>>         changed basic block.
>>>>         * tree-vectorizer.c (tree-ssa-dom.h): Include header file.
>>>>         (changed_bbs): New variable.
>>>>         (vectorize_loops): Allocate and free CHANGED_BBS.  Call cse_bbs.
>>>>         * tree-vectorizer.h (changed_bbs): New declaration.
