"Bin.Cheng" <amker.ch...@gmail.com> writes:
> On Wed, Aug 16, 2017 at 6:50 PM, Richard Sandiford
> <richard.sandif...@linaro.org> wrote:
>> "Bin.Cheng" <amker.ch...@gmail.com> writes:
>>> On Wed, Aug 16, 2017 at 5:00 PM, Richard Sandiford
>>> <richard.sandif...@linaro.org> wrote:
>>>> "Bin.Cheng" <amker.ch...@gmail.com> writes:
>>>>> On Wed, Aug 16, 2017 at 2:38 PM, Richard Sandiford
>>>>> <richard.sandif...@linaro.org> wrote:
>>>>>> The first loop in the testcase regressed after my recent changes to
>>>>>> dr_analyze_innermost.  Previously we would treat "i" as an iv even
>>>>>> for bb analysis and end up with:
>>>>>>
>>>>>>    DR_BASE_ADDRESS: p or q
>>>>>>    DR_OFFSET: 0
>>>>>>    DR_INIT: 0 or 4
>>>>>>    DR_STEP: 16
>>>>>>
>>>>>> We now always keep the step as 0 instead, so for an int "i" we'd have:
>>>>>>
>>>>>>    DR_BASE_ADDRESS: p or q
>>>>>>    DR_OFFSET: (intptr_t) i
>>>>>>    DR_INIT: 0 or 4
>>>>>>    DR_STEP: 0
>>>>>>
>>>>>> This is also what we'd like to have for the unsigned "i", but the
>>>>>> problem is that strip_constant_offset thinks that the "i + 1" in
>>>>>> "(intptr_t) (i + 1)" could wrap and so doesn't peel off the "+ 1".
>>>>>> The [i + 1] accesses therefore have a DR_OFFSET equal to the SSA
>>>>>> name that holds "(intptr_t) (i + 1)", meaning that the accesses no
>>>>>> longer seem to be related to the [i] ones.
>>>>>
>>>>> Didn't read the change in detail, so sorry if I mis-understood the issue.
>>>>> I made changes in scev to better fold type conversion by various sources
>>>>> of information, for example, vrp, niters, undefined overflow behavior etc.
>>>>> In theory these information should be available for other
>>>>> optimizers without
>>>>> querying scev.  For the mentioned test, vrp should compute accurate range
>>>>> information for "i" so that we can fold (intptr_t) (i + 1) it without
>>>>> worrying
>>>>> overflow.  Note we don't do it in generic folding because
>>>>> (intptr_t) (i) + 1
>>>>> could be more expensive (especially in case of (T)(i + j)), or because the
>>>>> CST part is in bigger precision after conversion.
>>>>> But such folding is wanted in several places, e.g, IVOPTs.  To provide 
>>>>> such
>>>>> an interface, we changed tree-affine and made it do aggressive fold.  I am
>>>>> curious if it's possible to use aff_tree to implement 
>>>>> strip_constant_offset
>>>>> here since aggressive folding is wanted.  After all, using additional 
>>>>> chrec
>>>>> looks like a little heavy wrto the simple test.
>>>>
>>>> Yeah, using aff_tree does work here when the bounds are constant.
>>>> It doesn't look like it works for things like:
>>>>
>>>>     double p[1000];
>>>>     double q[1000];
>>>>
>>>>     void
>>>>     f4 (unsigned int n)
>>>>     {
>>>>       for (unsigned int i = 0; i < n; i += 4)
>>>>         {
>>>>           double a = q[i] + p[i];
>>>>           double b = q[i + 1] + p[i + 1];
>>>>           q[i] = a;
>>>>           q[i + 1] = b;
>>>>         }
>>>>     }
>>>>
>>>> though, where the bounds on the global arrays guarantee that [i + 1] can't
>>>> overflow, even though "n" is unconstrained.  The patch as posted handles
>>>> this case too.
>>> BTW is this a missed optimization in value range analysis?  The range
>>> information for i should flow in a way like: array boundary -> niters
>>> -> scev/vrp.
>>> I think that's what niters/scev do in analysis.
>>
>> Yeah, maybe :-)  It looks like the problem is that when SLP runs,
>> the previous VRP pass came before loop header copying, so the (single)
>> header has to cope with n == 0 case.  Thus we get:
> Ah, there are several passes in between vrp and pass_ch, not sure if
> any such pass depends on vrp intensively.  I would suggestion reorder
> the two passes, or standalone VRP interface updating information for
> loop region after header copied?   This is a non-trivial issue that
> needs to be fixed.  Niters analyzer rely on
> simplify_using_initial_conditions heavily to get the same information,
> which in my opinion should be provided by VRP.  Though this won't be
> able to obsolete simplify_using_initial_conditions because VRP is weak
> in symbolic range...
>
>>
>>   Visiting statement:
>>   i_15 = ASSERT_EXPR <i_6, i_6 < n_9(D)>;
>>   Intersecting
>>     [0, n_9(D) + 4294967295]  EQUIVALENCES: { i_6 } (1 elements)
>>   and
>>     [0, 0]
>>   to
>>     [0, 0]  EQUIVALENCES: { i_6 } (1 elements)
>>   Intersecting
>>     [0, 0]  EQUIVALENCES: { i_6 } (1 elements)
>>   and
>>     [0, 1000]
>>   to
>>     [0, 0]  EQUIVALENCES: { i_6 } (1 elements)
>>   Found new range for i_15: [0, 0]
>>
>>   Visiting statement:
>>   _3 = i_15 + 1;
>>   Match-and-simplified i_15 + 1 to 1
>>   Intersecting
>>     [1, 1]
>>   and
>>     [0, +INF]
>>   to
>>     [1, 1]
>>   Found new range for _3: [1, 1]
>>
>> (where _3 is the index we care about), followed by:
>>
>>   Visiting statement:
>>   i_15 = ASSERT_EXPR <i_6, i_6 < n_9(D)>;
>>   Intersectings/aarch64-linux/trunk-orig/debug/gcc'
>>     [0, n_9(D) + 4294967295]  EQUIVALENCES: { i_6 } (1 elements)
>>   and
>>     [0, 4]
>>   to
>>     [0, n_9(D) + 4294967295]  EQUIVALENCES: { i_6 } (1 elements)
>>   Intersecting
>>     [0, n_9(D) + 4294967295]  EQUIVALENCES: { i_6 } (1 elements)
>>   and
>>     [0, 1000]
>>   to
>>     [0, n_9(D) + 4294967295]  EQUIVALENCES: { i_6 } (1 elements)
>>   Found new range for i_15: [0, n_9(D) + 4294967295]
>>
>>   Visiting statement:
>>   _3 = i_15 + 1;
>>   Intersecting
>>     [0, +INF]
>>   and
>>     [0, +INF]
>>   to
>>     [0, +INF]
>>   Found new range for _3: [0, +INF]
>>
>> I guess in this case it would be better to intersect the i_15 ranges
>> to [0, 1000] rather than [0, n_9(D) + 4294967295].
>>
>> It does work if another VRP pass runs after CH.  But even then,
>> is it a good idea to rely on the range info being kept up-to-date
>> all the way through to SLP?  A lot happens inbetween.
> To some extend yes.  Now I understand that SCEV uses niters
> information to prove no_overflow.  Niters analysis does infer better
> information from array boundary, while VRP fails to do that.  I don't
> worry much about gap between vrp pass and slp, it's basically the same
> as niters.  Both information are analyzed at one point and meant to be
> used by following passes.  It's also not common for vrp information
> become invalid given we are on SSA?

I'm not worried so much about vrp information becoming invalid on
an SSA name that existed when VRP was run.  It's more a question
of what happens about SSA names that get introduced after VRP,
e.g. by things like dom, reassoc, PRE, etc.

> Now that data address is not analyzed against loop, VRP would be the
> only information we can use unless adding back scev analysis.  IMHO,
> the patch is doing so in another way than before.  It requires
> additional chrec data structure.  I remember the previous patch
> enables more slp vectorization, is it because of "step" information
> from scev?

Do you mean that:

2017-07-03  Richard Sandiford  <richard.sandif...@linaro.org>

        * tree-data-ref.c (dr_analyze_innermost): Replace the "nest"
        parameter with a "loop" parameter and use it instead of the
        loop containing DR_STMT.  Don't check simple_iv when doing
        BB analysis.  Describe the two analysis modes in the comment.

enabled more SLP vectorisation in bb-slp-pr65935.c?  That was due
to us not doing IV analysis for BB vectorisation, and ensuring that
the step was always zero.

> In this patch, step information is simply discarded.  I am
> wondering if possible to always analyze scev within innermost loop for
> slp while discards step information.

Well, we don't calculate a step for bb analysis (i.e. it's not the case
the patch calculates step information and then simply discards it).
I don't see how that would work.  For bb analysis, the DR_OFFSET + DR_INIT
has to give the offset for every execution of the block, not just the
first iteration of the containing loop.  So if we get back a nonzero
step, we have to do something with it.

But:

(a) the old simple_iv analysis is more expensive than simply calling
    analyze_scev, so I don't think this is a win in terms of complexity.

(b) for bb analysis, there's nothing particularly special about the
    innermost loop.  It makes more sense to analyse it in the innermost
    loop for which the offset is invariant, as shown by the second
    testcase in the patch.

(c) The patch helps with loop vectorisation too, since analysing the
    starting DR_OFFSET in the context of the containing loop can help
    in a similar way as analysing the full offset does for SLP.

Thanks,
Richard

>
> Thanks,
> bin
>>
>> FWIW, the old simple_iv check that I removed for bb data-ref analysis
>> relies on SCEV analysis too, so I don't think this is more expensive
>> than what we had before.
>>
>> Thanks,
>> Richard

Reply via email to