on 2021/1/27 5:43 PM, Kewen.Lin via Gcc-patches wrote:
> on 2021/1/26 6:53 PM, Richard Biener wrote:
>> On Tue, 26 Jan 2021, Kewen.Lin wrote:
>>
>>> Hi Segher/Richard B./Richard S.,
>>>
>>> Many thanks for all your help and comments on this!
>>>
>>> on 2021/1/25 3:56 PM, Richard Biener wrote:
>>>> On Fri, 22 Jan 2021, Segher Boessenkool wrote:
>>>>
>>>>> On Fri, Jan 22, 2021 at 02:47:06PM +0100, Richard Biener wrote:
>>>>>> On Thu, 21 Jan 2021, Segher Boessenkool wrote:
>>>>>>> What is holding up this patch still?  Ke Wen has pinged it every
>>>>>>> month since May, and there has still not been a review.
>>>>>
>>>>> Richard Sandiford wrote:
>>>>>> FAOD (since I'm on cc:), I don't feel qualified to review this.
>>>>>> Tree-level loop stuff isn't really my area.
>>>>>
>>>>> And Richard Biener wrote:
>>>>>> I don't like it; it feels wrong, but I don't have a good suggestion
>>>>>> that has had positive feedback.  Since a reviewer / approver is
>>>>>> indirectly responsible for at least the design, I do not want to
>>>>>> ack this patch.  Bin made forward progress on the other parts of
>>>>>> the series, but clearly there's somebody missing with the
>>>>>> appropriate privileges who feels positive about the patch and its
>>>>>> general direction.
>>>>>>
>>>>>> Sorry to be of no help here.
>>>>>
>>>>> How unfortunate :-(
>>>>>
>>>>> So, first off, this will then have to wait for the next stage 1 to
>>>>> make any progress.  Rats.
>>>>>
>>>>> But what could have been done differently that would have helped?
>>>>> Of course Ke Wen could have written a better patch (aka one that is
>>>>> more acceptable); either of you could have made your current replies
>>>>> earlier, so that it was clear help needed to be sought elsewhere;
>>>>> and I could have pushed people earlier, too.  No one really did
>>>>> anything wrong; I'm not looking for someone to blame, I'm just
>>>>> trying to find out how to prevent deadlocks like this in the future
>>>>> (where one party waits for replies that will never come).
>>>>>
>>>>> Is it just that we have a big gaping hole in reviewers with
>>>>> experience in such loop optimisations?
>>>>
>>>> Maybe.  But I think the biggest problem is that we do not have a good
>>>> way to achieve what the patch tries (if you review the communications
>>>> you'll see many ideas tossed around), first and foremost because IV
>>>> selection happens early on GIMPLE while unrolling happens late on
>>>> RTL.  Both need a quite accurate estimate of costs, but unrolling has
>>>> an even harder time than IV selection, where we have gotten along
>>>> with throwing dummy RTL at costing functions.
>>>>
>>>
>>> Yeah, exactly.
>>>
>>>> IMHO the patch is the wrong "start" for trying to fix the issue, and
>>>> my fear is that wiring this kind of "feature" into the current
>>>> (fundamentally broken) state will make it much harder to rework that
>>>> state without introducing regressions on said features (I'm there
>>>> myself, trying to turn the vectorizer upside down - for three years
>>>> now, struggling not to regress any of the "features" we've
>>>> accumulated for various targets, most of which feel "bolted-on"
>>>> rather than well-designed ;/).
>>>>
>>>
>>> OK, understandable.
>>>
>>>> I think IV selection and unrolling (and scheduling FWIW) need to move
>>>> closer together.  I do not have a good idea how that can work out,
>>>> though, but I very much believe that this "most wanted" GIMPLE
>>>> unroller will not be a good way of progressing here.  Maybe biting
>>>> the bullet and moving IV selection back to RTL is the answer.
>>>>
>>>
>>> I haven't looked into loop-iv.c, but IVOPTS on GIMPLE can leverage
>>> SCEV analysis for IV detection; if we moved it to RTL, wouldn't it be
>>> much heavier to detect the full set there?
>>>
>>>> For a "short term" solution I still think that trying to perform
>>>> unrolling and IV selection (for the D-form case you're targeting) at
>>>> the same time is a better design, even if it means complicating the
>>>> IV selection pass (and yeah, it'll still be at GIMPLE and w/o any
>>>> good idea about scheduling).  There are currently 20+ GIMPLE
>>>> optimization passes and 10+ RTL optimization passes between IV
>>>> selection and unrolling; the idea that you can have the transform
>>>> decision and the transform application this far apart looks scary.
>>>>
>>>
>>> I have some questions in mind for this part.  "Perform unrolling and
>>> IV selection at the same time" can be interpreted as two different
>>> implementations to me:
>>>
>>> 1) Run one GIMPLE unrolling pass just before IVOPTS, probably using
>>> the same gate as IVOPTS.  The unrolling factor is computed by the
>>> same method as that of RTL unrolling.  But this sounds very much like
>>> the "most wanted GIMPLE unrolling", which is what we want to avoid.
>>>
>>> The positive aspect here is that what IVOPTS faces is an already
>>> unrolled loop, so it can see the whole picture and get the optimal IV
>>> set.  The downside/question is how we position these GIMPLE unrolling
>>> and RTL unrolling passes, and whether we still need RTL unrolling.
>>> If not, it's doubtful that one GIMPLE unrolling pass can fully replace
>>> RTL unrolling, since it probably lacks some actual target
>>> information/instructions.  If yes, it's still possible to have
>>> inconsistent unrolling factors between what IVOPTS optimizes for and
>>> what the late RTL unrolling pass ends up with.
>>>
>>> 2) Make IVOPTS determine the unrolling factor by considering
>>> reg-offset addressing (D-form), unroll the loop, and do the remaining
>>> work.
>>
>> Yes, that's what I meant.
>>
>>> I don't think you referred to this, though.  Since, compared to
>>> reg-reg addressing, reg-offset addressing can only save some bumps,
>>> it's too weak to be a deciding factor for the unrolling factor.
>>> Unlike the vectorizer or modulo scheduling, it's likely there are
>>> more important factors for unrolling-factor computation than this
>>> one.
>>
>> But you _are_ using that bump cost to compute the unroll estimate
>> (and IIRC you are doing the D-form IV selection then).
>
> The patch series works in the opposite direction: it uses the
> estimated unrolling factor (UF for short below) to adjust the bump
> costs.  The process looks like:
> - estimate the UF by following the UF determination of the RTL
>   unroller.
> - identify the iv use groups for which reg-offset (D-form) addressing
>   is available and mark some reg-offset iv cands.
> - scale up all pair costs if needed; for reg-offset iv cands the bump
>   cost (step cost) is counted just once, for the others it is counted
>   UF-1 times.
> - run the IV selection algorithm as usual.
>
> From the perspective of bump cost, we can compute one UF (let's call
> it ivopts_UF here), which means:
> - when the actual UF > ivopts_UF, the reg-offset iv selection wins
>   with fewer bump costs.
> - when the actual UF <= ivopts_UF, the reg-reg (indexed addressing)
>   selection wins.
>
> It looks like we can derive it as below, where G is the group cost,
> B the basic iv cost, S the step iv cost, and N the iv count:
>
> G1 = (gcost<oiv,grp1> + gcost<oiv,grp2> + ...) * ivopts_UF;
> B1 = bcost<oiv>;
> S1 = scost<oiv> * (ivopts_UF - 1);
> N1 = 1;
>
> G2 = (gcost<iv1,grp1> + gcost<iv2,grp2> + ...) * ivopts_UF;
> B2 = bcost(iv1) + bcost(iv2) + ...;
> S2 = scost(iv1) + scost(iv2) + ...;
> N2 = count of (iv1, iv2, ...)
>
> Let G1 + B1 + S1 + N1 = G2 + B2 + S2 + N2, solve for ivopts_UF, then
> use the ceiling of the result, rounded up to match 2^n.
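To make the break-even equation concrete, here is a small standalone
C++ sketch that solves G1 + B1 + S1 + N1 = G2 + B2 + S2 + N2 for
ivopts_UF and rounds the result up to a power of two.  All struct and
function names are invented for the illustration; this is not code from
the patch series or from GCC.

#include <algorithm>

/* Aggregated costs for one candidate scheme; purely illustrative,
   not GCC's cost data structures.  */
struct iv_scheme_cost
{
  double group_cost; /* G: sum of per-group use costs (scaled by UF).  */
  double basic_cost; /* B: sum of bcost over the chosen ivs.  */
  double step_cost;  /* S: sum of scost over the chosen ivs.  */
  double iv_count;   /* N: number of chosen ivs.  */
};

/* Scheme 1 (single original iv, indexed addressing):
     cost1(UF) = G1*UF + B1 + S1*(UF-1) + N1
   Scheme 2 (one reg-offset/D-form iv per group):
     cost2(UF) = G2*UF + B2 + S2 + N2
   Solve cost1(UF) = cost2(UF) for UF, round up to a power of two and
   clamp by the reach of the target's reg-offset addressing.  */
unsigned
estimate_ivopts_uf (const iv_scheme_cost &s1, const iv_scheme_cost &s2,
                    unsigned offset_uf_bound)
{
  double denom = s1.group_cost + s1.step_cost - s2.group_cost;
  if (denom <= 0.0)
    return 1; /* The reg-offset scheme never wins on bump cost.  */

  double numer = (s2.basic_cost + s2.step_cost + s2.iv_count)
                 - s1.basic_cost + s1.step_cost - s1.iv_count;
  double uf = numer / denom;
  if (uf <= 1.0)
    return 1;

  /* Round up to the next power of two, capped by offset_uf_bound.  */
  unsigned rounded = 1;
  while (rounded < uf && rounded < offset_uf_bound)
    rounded <<= 1;
  return std::min (rounded, offset_uf_bound);
}

With equal group costs the break-even UF reduces to roughly
(B2 + S2 + N2 - B1 + S1 - 1) / S1, i.e. it grows with the setup cost of
the extra D-form ivs and shrinks as the step cost of the shared iv
grows, matching the intuition that reg-offset only pays off for large
enough unroll factors.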
>
> There should also be an upper bound here, derived from the range of
> offsets the target supports (call it offset_UF_bound).
>
> I have some concerns about ivopts_UF computed this way, since here we
> compute it by considering only the address type iv uses and not the
> other iv uses.  It's possible that some iv uses prefer iv cands which
> lead to indexed addressing, and we could miss that.  If we want to
> consider all iv uses globally, that effectively means running the IV
> selection and concluding the ivopts_UF from it.  To me that looks
> very hard in the current framework, but we could take a range from a
> lower bound to an upper bound and iterate the iv selection over it
> until we find a good one or exhaust the range (it could even be a
> binary search); that still doesn't look good because of the possible
> compile time cost. // (A)
>
> Apart from this, assuming we have determined one ivopts_UF (having
> taken offset_UF_bound into account), do we need to take care of some
> other possible upper bounds, especially the ones we apply in the RTL
> unroller?
>
> This question originates from the concern that when we determine the
> ivopts_UF, we only focus on the iv cands and iv uses (whether all of
> them or just the address type ones); it's likely that some or most
> statements of the loop don't qualify to show up as iv uses.  Then we
> may have calculated an ivopts_UF of 4 while the loop body is actually
> big and the UF is 2 according to the RTL unroller's handling; since
> IVOPTS runs early, we would unroll 4 times first and then degrade the
> performance (causing more spilling etc.).  The ivopts_UF computation
> here focuses on saving bump costs, which may not be the critical
> consideration for the loop, compared to similar passes that want to
> drive unrolling and truly gain more from it (typically more
> parallelism, as with vectorization and SMS). // (B)
>
> So I think we may need to take the RTL unroller's UF determination
> into account somehow, e.g. respect the unrolling parameters and the
> loop_unroll_adjust hook if that works in the GIMPLE phase.  I assume
> there is some tuning work behind those parameters and hooks.
>
> Given the above, the approach seems to be:
> 1. compute ivopts_UF.
> 2. adjust it according to the reg-offset allowable range.
> 3. adjust it to respect the RTL UF determination. // for (B)
> 4. adjust the costs according to this ivopts_UF.
> 5. iv selection as usual.
> 6. check that the iv set is as expected. // 4,5,6 for (A)
> 7. post-pass unrolling by ivopts_UF.
>
> For 5-7, an alternative is to halt the IVOPTS processing for the
> loop, unroll it by ivopts_UF, and rerun IVOPTS, but then I think we
> need to rerun the analysis on the unrolled loop.  Since we have
> already decided the UF, it looks no different from unrolling it in a
> post pass, or at the end of the pass, as long as we make sure the
> unrolling happens.
>
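As a rough sketch of how steps 1-7 above might hang together, assuming
entirely hypothetical helper functions (none of these names exist in
GCC or in the patch series; they only mark where each step would plug
in):

#include <algorithm>

struct loop;        /* Stand-in for GCC's loop structure.  */
struct ivopts_data; /* Stand-in for the per-loop IVOPTS state.  */

/* Hypothetical helpers, one per step of the proposal above.  */
unsigned compute_ivopts_uf (ivopts_data *);
unsigned offset_uf_bound (ivopts_data *);
unsigned rtl_unroller_uf_bound (loop *);
void scale_costs_for_uf (ivopts_data *, unsigned);
void run_iv_selection (ivopts_data *);
bool selected_set_is_reg_offset (ivopts_data *);
void unroll_loop_on_gimple (loop *, unsigned);

void
iv_optimize_loop_with_unroll (ivopts_data *data, loop *l)
{
  /* 1.  Estimate the unroll factor from the bump-cost equation.  */
  unsigned uf = compute_ivopts_uf (data);

  /* 2.  Clamp by what the target's reg-offset (D-form) addressing can
     reach across the unrolled copies.  */
  uf = std::min (uf, offset_uf_bound (data));

  /* 3.  Respect the RTL unroller's own limits (loop size, params,
     loop_unroll_adjust), so GIMPLE and RTL agree.  // for (B)  */
  uf = std::min (uf, rtl_unroller_uf_bound (l));

  /* 4.  Scale the costs as if the body were copied UF times:
     reg-offset cands pay the step once, the rest UF-1 times.  */
  if (uf > 1)
    scale_costs_for_uf (data, uf);

  /* 5.  Candidate selection as usual, on the adjusted costs.  */
  run_iv_selection (data);

  /* 6.  If the chosen set is not the reg-offset one the UF estimate
     assumed, don't drive the unrolling.  // 4,5,6 for (A)  */
  if (!selected_set_is_reg_offset (data))
    uf = 1;

  /* 7.  Make sure the loop really gets unrolled by the same factor,
     here or in a small post pass.  */
  if (uf > 1)
    unroll_loop_on_gimple (l, uf);
}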
Hi Richard(s)/Segher,

I'd like to ping this to keep the "boat" from sinking again. :-)

What do you think of the above proposal and the related concerns?

BR,
Kewen

>>
>>> For IVOPTS, it's more that it doesn't care what the unrolling factor
>>> should be, but it needs to know what the unrolling factor will be,
>>> and then do optimal IV selection based on that.  So it's not good to
>>> have it decide the unrolling factor.
>>
>> Well, currently with the stupid RTL unroller you know it will be
>> unrolled 8 times unless it is cold or doesn't iterate enough.
>> There's not much costing going on in the RTL unroller (unless you go
>> wild with target hooks).
>
> But I guess there was some tuning work behind those things?
> Like some condition checks, parameters, and hooks to adjust the
> value.
>
> BR,
> Kewen
>
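For reference on that last point, the kind of condition checks the RTL
unroller applies can be sketched roughly as below.  The struct and the
hard-coded limits are simplified stand-ins for the real --param
max-unroll-times / max-unrolled-insns / max-average-unrolled-insns
values and the loop_unroll_adjust target hook; this is not the actual
logic in loop-unroll.c.

#include <algorithm>

struct loop_info
{
  unsigned ninsns;    /* Estimated size of one loop body copy.  */
  unsigned av_ninsns; /* Size weighted by execution frequency.  */
  bool cold;          /* Loop not hot enough to unroll.  */
  unsigned est_niter; /* Estimated iteration count (0 = unknown).  */
};

/* Simplified stand-ins for the unrolling --param values.  */
static const unsigned max_unroll_times = 8;
static const unsigned max_unrolled_insns = 200;
static const unsigned max_average_unrolled_insns = 80;

unsigned
rtl_style_uf_bound (const loop_info &l)
{
  if (l.cold)
    return 1;

  /* Size limits: keep the unrolled body within the insn budgets.  */
  unsigned nunroll = max_unrolled_insns / std::max (l.ninsns, 1u);
  unsigned by_av = max_average_unrolled_insns / std::max (l.av_ninsns, 1u);
  nunroll = std::min ({nunroll, by_av, max_unroll_times});

  /* Don't unroll more times than the loop is expected to iterate.  */
  if (l.est_niter && l.est_niter < nunroll)
    nunroll = l.est_niter;

  /* A target hook like loop_unroll_adjust could tweak this further
     (e.g. for cache or issue-width reasons); omitted here.  */
  return std::max (nunroll, 1u);
}

A GIMPLE-side step 3 that mirrors this would at least keep ivopts_UF
from exceeding what the RTL unroller would later have chosen for the
same loop.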