On Fri, Nov 15, 2019 at 6:13 PM Wilco Dijkstra <wilco.dijks...@arm.com> wrote: > > Hi Richard, > > > So what do we actually do unpatched with -funroll-loops here? > > Yes so it does the insane "fully unrolled trailing loop before the unrolled > loop" thing. One always does the trailing loop last (and typically as an > actual loop of course) and then the code ends up much faster, close to > the ideal version shown in the PR.
Well, you can't do the unrolled loop first unless you keep all exit tests. Not keeping them is the whole point of unrolling! > > When I force "stupid" unrolling the unrolled > > part is clearly worse and I see no benefit from unrolling here > > (eventually over the histogram of the number of iterations the > > repeated exits are now better predicted because they are duplicated). > > For these kinds of loops, stupid unrolling is clearly better than the > default unrolling, both in size and in performance. For the example > we only ever execute part of the "trailing" loop, and never enter the > unrolled main loop! Well, then you don't want unrolling you want peeling. You'd be actually happy with four peeled iterations and then the regular, not unrolled loop at the tail. > > I wonder if structuring it more like the "stupid" case would be > > better, thus > > The key thing is to avoid emitting an unrolled trailing loop before > the actual unrolled loop. The stupid strategy can already do that, > hence it often faster than unrolling badly. The stupid strategy is what it says - stupid. We've removed the peeling code from RTL a few releases back but I do agree that the emitted code for the peeled copies for the runtime algorithm is "bad" - it doesn't match what the comment indicates (a switch, implying a jump table) nor does it do what one would trivially expect ("stupid" peeling). > > that would cater for your case, does not require -funroll-all-loops > > or a new target hook. It might be that all that is needed is > > reordering of basic blocks with the current scheme to decrease > > branch density. Alternatively limit the unroll factor or pad > > the prologue jumps to make the branch predictor happy. > > We shouldn't treat symptoms but fix the underlying problems. Sure, which is why I suggest to change how we emit the prologue here. We can select the variant of the prologue with a target hook based on preference for example, between doing it peeling-like (which you prefer), using a scheme like current (preferably in some optimized form). As said, it might be layouting the prologue in a better way is all that is needed. Richard. > Wilco