On 2021/1/26 6:53 PM, Richard Biener wrote:
> On Tue, 26 Jan 2021, Kewen.Lin wrote:
> 
>> Hi Segher/Richard B./Richard S.,
>>
>> Many thanks to you all for your help and comments on this!
>>
>> On 2021/1/25 3:56 PM, Richard Biener wrote:
>>> On Fri, 22 Jan 2021, Segher Boessenkool wrote:
>>>
>>>> On Fri, Jan 22, 2021 at 02:47:06PM +0100, Richard Biener wrote:
>>>>> On Thu, 21 Jan 2021, Segher Boessenkool wrote:
>>>>>> What is holding up this patch still?  Ke Wen has pinged it every month
>>>>>> since May, and there has still not been a review.
>>>>
>>>> Richard Sandiford wrote:
>>>>> FAOD (since I'm on cc:), I don't feel qualified to review this.
>>>>> Tree-level loop stuff isn't really my area.
>>>>
>>>> And Richard Biener wrote:
>>>>> I don't like it, it feels wrong but I don't have a good suggestion
>>>>> that had positive feedback.  Since a reviewer / approver is indirectly
>>>>> responsible for at least the design I do not want to ack this patch.
>>>>> Bin made forward progress on the other parts of the series but clearly
>>>>> there's somebody missing with the appropriate privileges who feels
>>>>> positive about the patch and its general direction.
>>>>>
>>>>> Sorry to be of no help here.
>>>>
>>>> How unfortunate :-(
>>>>
>>>> So, first off, this will then have to work for next stage 1 to make any
>>>> progress.  Rats.
>>>>
>>>> But what could have been done differently that would have helped?  Of
>>>> course Ke Wen could have written a better patch (aka one that is more
>>>> acceptable); either of you could have made your current replies earlier,
>>>> so that it is clear help needs to be sought elsewhere; and I could have
>>>> pushed people earlier, too.  No one really did anything wrong, I'm not
>>>> seeking who to blame, I'm just trying to find out how to prevent
>>>> deadlocks like this in the future (where one party waits for replies
>>>> that will never come).
>>>>
>>>> Is it just that we have a big gaping hole in reviewers with experience
>>>> in such loop optimisations?
>>>
>>> May be.  But what I think is the biggest problem is that we do not
>>> have a good way to achieve what the patch tries (if you review the
>>> communications you'll see many ideas tossed around) first and foremost
>>> because IV selection is happening early on GIMPLE and unrolling
>>> happens late on RTL.  Both need a quite accurate estimate of costs
>>> but unrolling has an even harder time than IV selection where we've
>>> got along with throwing dummy RTL at costing functions.
>>>
>>
>> Yeah, exactly.
>>
>>> IMHO the patch is the wrong "start" to try fixing the issue and my
>>> fear is that wiring this kind of "features" into the current
>>> (fundamentally broken) state will make it much harder to rework
>>> that state without introducing regressions on said features (I'm
>>> there with trying to turn the vectorizer upside down - for three
>>> years now, struggling to not regress any of the "features" we've
>>> accumulated for various targets where most of them feel a
>>> "bolted-on" rather than well-designed ;/).
>>>
>>
>> OK, understandable.
>>
>>> I think IV selection and unrolling (and scheduling FWIW) need to move
>>> closer together.  I do not have a good idea how that can work out
>>> though but I very much believe that this "most wanted" GIMPLE unroller
>>> will not be a good way of progressing here.  Maybe taking the bullet
>>> and moving IV selection back to RTL is the answer.
>>>
>>
>> I haven't looked into loop-iv.c, but IVOPTS on GIMPLE can leverage
>> SCEV analysis for iv detection; if we moved it to RTL, wouldn't it
>> be much more expensive to detect the full iv set there?
>>
>>> For a "short term" solution I still think that trying to perform
>>> unrolling and IV selection (for the D-form case you're targeting)
>>> at the same time is a better design, even if it means complicating
>>> the IV selection pass (and yeah, it'll still be at GIMPLE and w/o
>>> any good idea about scheduling).  There are currently 20+ GIMPLE
>>> optimization passes and 10+ RTL optimization passes between
>>> IV selection and unrolling, the idea that you can have transform
>>> decision and transform apply this far apart looks scary.
>>>
>>
>> I have some questions in mind for this part.  "Perform unrolling
>> and IV selection at the same time" can be interpreted in two
>> different implementation ways to me:
>>
>> 1) Run one GIMPLE unrolling pass just before IVOPTS, probably using
>>    the same gate as IVOPTS.  The unrolling factor is computed by
>>    the same method as the RTL unroller's.  But this sounds very much
>>    like the "most wanted gimple unrolling" which is what we want to avoid.
>>
>>    The positive aspect here is that what IVOPTS faces is already the
>>    unrolled loop, so it can see the whole picture and get the optimal
>>    IV set.  The downside/question is how we position these GIMPLE and
>>    RTL unrolling passes, i.e. whether we still need RTL unrolling.
>>    If not, it's doubtful that one GIMPLE unrolling pass can fully
>>    replace the RTL one, since it probably lacks some actual target
>>    information/instructions.  If so, it's still possible to end up
>>    with inconsistent unrolling factors between what IVOPTS optimizes
>>    for and what the late RTL unrolling pass settles on.
>>
>> 2) Make IVOPTS determine the unrolling factor by considering the
>>    reg-offset addressing (D-form), unroll the loop, and do the rest.
> 
> Yes, that's what I meant.
> 
>>    I don't think you referred to this though.  Compared to reg-reg
>>    addressing, reg-offset addressing only saves some bumps, so it
>>    seems too weak to be the deciding factor for the unrolling factor.
>>    Unlike the vectorizer or modulo scheduling, there are likely more
>>    important factors for computing the unrolling factor than this one.
> 
> But you _are_ using that bump cost to compute the unroll estimate
> (and IIRC you are doing the D-form IV selection then).

The patch series works in the opposite direction: it uses the estimated
unrolling factor (UF for short below) to adjust the bump costs.  The
process looks like:
  - estimate UF by following UF determination similar to the RTL unroller's.
  - identify the iv use groups for which reg-offset (D-form) addressing is
    available, and mark some reg-offset iv cands.
  - scale up all pair costs if needed; for reg-offset iv cands, the bump
    cost (step cost) is counted just once, while for the others it is
    counted UF-1 times.
  - run the IV selection algorithm as usual.
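
The step-cost part of the scaling above can be sketched roughly as below
(a hypothetical Python sketch with illustrative names, not actual GCC
code; group costs are assumed to be scaled separately):

```python
def scaled_step_cost(step_cost, uf, is_reg_offset_cand):
    """Step (bump) cost of one iv candidate over a body unrolled by `uf`.

    A reg-offset (D-form) candidate folds the per-copy increments into
    constant displacements, so it is bumped only once per unrolled
    iteration; other candidates still pay a bump between each of the
    uf copies except the last, i.e. uf - 1 times.
    """
    if is_reg_offset_cand:
        return step_cost          # one bump per unrolled iteration
    return step_cost * (uf - 1)   # a bump between consecutive copies

# e.g. with step cost 4 and UF 8: reg-offset cand 4, other cands 28
```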

From the perspective of bump cost, we can compute one UF (let's call it
ivopts_UF here); it means:
  - when the actual UF > ivopts_UF, the reg-offset iv selection wins with
    fewer bump costs.
  - when the actual UF <= ivopts_UF, the reg-reg (indexed addressing)
    selection wins.

It looks like we can derive it as below:

  G for group cost, B for basic iv cost, S for step iv cost, N for iv count.

    G1 = (gcost<oiv,grp1> + gcost<oiv,grp2> + ...) * ivopts_UF;
    B1 = bcost<oiv>;
    S1 = scost<oiv> * (ivopts_UF - 1);
    N1 = 1;

    G2 = (gcost<iv1,grp1> + gcost<iv2,grp2> + ...) * ivopts_UF;
    B2 = bcost<iv1> + bcost<iv2> + ...;
    S2 = scost<iv1> + scost<iv2> + ...;
    N2 = count of (iv1, iv2, ...);

  Set G1 + B1 + S1 + N1 = G2 + B2 + S2 + N2, solve for ivopts_UF, then
  round the result up to the next power of two.
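
Assuming the per-copy costs scale linearly as in the formulas above, the
break-even equation can be solved in closed form.  A hedged sketch
(illustrative names, not GCC code), where g1/g2 stand for the per-copy
group cost sums and the remaining letters match the formulas above:

```python
import math

def break_even_uf(g1, b1, s1, g2, b2, s2, n2, n1=1):
    """Solve G1 + B1 + S1 + N1 == G2 + B2 + S2 + N2 for U = ivopts_UF,
    with G1 = g1*U, S1 = s1*(U - 1), G2 = g2*U, i.e.
        g1*U + b1 + s1*(U - 1) + n1 == g2*U + b2 + s2 + n2,
    then round up to the next power of two.  Returns None when the two
    sides never cross (non-positive denominator)."""
    denom = g1 + s1 - g2
    if denom <= 0:
        return None               # the single-iv side never gets cheaper
    u = (b2 + s2 + n2 - b1 + s1 - n1) / denom
    if u <= 1:
        return 1
    return 1 << math.ceil(math.log2(u))
```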

Here ivopts_UF should also have an upper bound determined by the target's
supported offset range (let's call it offset_UF_bound).

I have some concerns about ivopts_UF computed this way: we compute it by
considering only the address type iv uses, not the other iv uses.  It's
possible that some iv uses prefer iv cands which lead to indexed
addressing, and we could miss that.  Considering all iv uses globally
would amount to running the IV selection itself to conclude ivopts_UF.
That looks very hard in the current framework; we could instead take a
range from a lower bound to an upper bound and iterate it, running iv
selection for each candidate UF until one succeeds or all finish (even
with binary search), but that still seems undesirable because of the
compile time it would consume.  // (A)

Apart from this, assuming we have determined one ivopts_UF (having taken
offset_UF_bound into account), do we need to take care of some other
possible upper bounds, especially the ones we apply in the RTL unroller?

This question originates from the concern that when we determine
ivopts_UF we only focus on the iv cands and iv uses (whether all of them
or just the address type ones); it's likely that some or even most
statements of the loop don't qualify as iv uses.  We could then compute
ivopts_UF as 4 while the loop body is actually big and the RTL unroller
would have chosen UF 2; since IVOPTs runs early, we would unroll 4 times
first and degrade performance (causing more spilling etc.).  The
ivopts_UF computation focuses on saving bump costs, which may not be the
critical factor for the loop, compared to similar passes which drive
unrolling and truly gain more (typically more parallelism, as in
vectorization and SMS).  // (B)

So I think we may need to take the RTL unroller's UF determination into
account somehow, e.g. respect the unrolling parameters and the
loop_unroll_adjust hook if it works in the GIMPLE phase.  I assume there
is some tuning work behind those parameters and hooks.

As above, the approach then looks like:
  1. compute ivopts_UF.
  2. adjust it according to the allowable reg-offset range.
  3. adjust it to respect the RTL UF determination. // for (B)
  4. adjust the costs according to this ivopts_UF.
  5. run iv selection as usual.
  6. check that the iv set is as expected. // 4,5,6 for (A)
  7. unroll by ivopts_UF in a post pass.
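
Steps 1-3 (computing ivopts_UF and clamping it) could look roughly like
the following sketch (hypothetical names, not GCC code; the
displacement-range check is simplified to "all constant offsets
0, step, ..., (UF-1)*step must fit"):

```python
def offset_uf_bound(max_disp, step):
    """Largest power-of-two UF whose unrolled copies' constant offsets
    (0, step, ..., (UF - 1) * step) all fit the target's reg-offset
    (D-form) displacement range."""
    uf = 1
    while (2 * uf - 1) * step <= max_disp:
        uf *= 2
    return uf

def clamp_ivopts_uf(ivopts_uf, max_disp, step, rtl_uf):
    """Steps 2 and 3: respect both the D-form displacement range and
    whatever the RTL unroller (params / loop_unroll_adjust) allows."""
    return min(ivopts_uf, offset_uf_bound(max_disp, step), rtl_uf)

# e.g. cost-derived UF 8, but only offsets up to 60 fit with step 16,
# so the final UF is capped at 4
```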

For 5-7, we could instead halt the ivopts processing for the loop, unroll
it by ivopts_UF, and rerun ivopts, but then we would need to rerun the
analysis on the unrolled loop.  Since we have already decided the UF, it
looks no different from unrolling in a post pass, or at the end of the
pass, as long as we make sure the unrolling happens.

> 
>>    For IVOPTS, it's more that it doesn't care what the unrolling
>>    factor should be; it needs to know what the unrolling factor
>>    will be, and then do optimal IV selection based on that.  So it's
>>    not a good fit to have it decide the unrolling factor.
> 
> Well, currently with the stupid RTL unroller you will know it will
> be unrolled 8 times unless it is cold or doesn't iterate enough.
> There's not much costing going on in the RTL unroller (unless you
> go wild with target hooks).

But I guess there was some tuning work behind those things, like the
condition checks, parameters and hooks that adjust the value?


BR,
Kewen
