Hi!

On Mon, Jan 25, 2021 at 05:59:23PM +0000, Richard Sandiford wrote:
> Richard Biener <rguent...@suse.de> writes:
> > On Fri, 22 Jan 2021, Segher Boessenkool wrote:
> >> But what could have been done differently that would have helped?
> >> Of course Ke Wen could have written a better patch (aka one that is
> >> more acceptable); either of you could have made your current replies
> >> earlier, so that it is clear help needs to be sought elsewhere; and
> >> I could have pushed people earlier, too.  No one really did anything
> >> wrong, I'm not seeking who to blame, I'm just trying to find out how
> >> to prevent deadlocks like this in the future (where one party waits
> >> for replies that will never come).
> >>
> >> Is it just that we have a big gaping hole in reviewers with
> >> experience in such loop optimisations?
> >
> > May be.  But what I think is the biggest problem is that we do not
> > have a good way to achieve what the patch tries (if you review the
> > communications you'll see many ideas tossed around), first and
> > foremost because IV selection happens early, on GIMPLE, and
> > unrolling happens late, on RTL.  Both need a quite accurate estimate
> > of costs, but unrolling has an even harder time than IV selection,
> > where we've got along with throwing dummy RTL at costing functions.
GIMPLE already needs at least an *estimate* of how much any loop will
be unrolled (for similar reasons as the IV selection).  The actual
mechanics can happen later (in RTL), and we could even use a different
unroll factor (in some cases) than what we first estimated; but for
the GIMPLE optimisations it can be important to know what the target
code will eventually look like.

> > IMHO the patch is the wrong "start" to try fixing the issue, and my
> > fear is that wiring this kind of "features" into the current
> > (fundamentally broken) state will make it much harder to rework
> > that state without introducing regressions on said features (I'm
> > there with trying to turn the vectorizer upside down - for three
> > years now, struggling to not regress any of the "features" we've
> > accumulated for various targets, where most of them feel
> > "bolted-on" rather than well-designed ;/).
>
> Thinking of any features in particular here?
>
> Most of the ones I can think of seem to be doing things in the way
> that the current infrastructure expects.  But of course, the current
> infrastructure isn't perfect, so the end result isn't either.
>
> Still, I agree with the above apart from maybe that last bit. ;-)
>
> > I think IV selection and unrolling (and scheduling FWIW) need to
> > move closer together.  I do not have a good idea how that can work
> > out though, but I very much believe that this "most wanted" GIMPLE
> > unroller will not be a good way of progressing here.
>
> What do you feel about unrolling in the vectoriser (by doubling the
> VF, etc.) in cases where something about the target indicates that
> that would be useful?  I think that's a good place to do it (for the
> cases that it handles) because it's hard to unroll later and then
> interleave.

Yeah, in such cases doing the actual unrolling can be quite
beneficial.  My gut feeling is you'd do it at the same time as
creating the actual vector code here.
> > Maybe taking the bullet and moving IV selection back to RTL is the
> > answer.
>
> I think that would be a bad move.

I agree.

> The trend recently seems to have been to lower stuff to individual
> machine operations earlier in the rtl pass pipeline (often
> immediately during expand) rather than split them later.

I've pushed rs6000 towards this for many years :-)

> The reasoning behind that is that (1) gimple has already heavily
> optimised the unlowered form and (2) lowering earlier gives the more
> powerful rtl optimisers a chance to do something with the individual
> machine operations.

And (3) you need to do much less work at expand then.  More than half
of what expand currently does is a) unnecessary, because (not much)
later passes will do the same anyway; or b) harmful premature
optimisation.

It indeed helps a lot that GIMPLE has already done pretty much all
optimisations that are not machine-specific :-)

> It's going to be hard for an RTL ivopts pass to piece everything
> back together.

"Loops" is a pretty hard abstraction for RTL already, simply because
RTL is very concrete, very close to the hardware.  But on the other
hand it can be challenging to have a good cost estimate earlier on.

> > For a "short term" solution I still think that trying to perform
> > unrolling and IV selection (for the D-form case you're targeting)
> > at the same time is a better design, even if it means complicating
> > the IV selection pass (and yeah, it'll still be at GIMPLE and w/o
> > any good idea about scheduling).  There are currently 20+ GIMPLE
> > optimization passes and 10+ RTL optimization passes between IV
> > selection and unrolling; the idea that you can have the transform
> > decision and the transform application this far apart looks scary.
> > FWIW, another option might be to go back to something like:
> > https://gcc.gnu.org/pipermail/gcc-patches/2019-October/532676.html
>
> I agree that it was worth putting that series on hold and trying a
> more target-independent approach, but I think in the end it didn't
> work out, for the reasons Richard says.  At least the target-specific
> pass would be making a strict improvement to the IL that it sees,
> rather than having to predict what future passes might do or might
> want.

In my experience an approach like that patch, i.e. fixing things up
after the "normal" stuff, is not a good way to get good results.  On
the other hand it *is* a good way to "fine tune" your results.  It is
a bit like RTL peepholes (and applies to many more things than loop
opts).


Segher