On Wed, Nov 27, 2019 at 2:17 PM Wilco Dijkstra <wilco.dijks...@arm.com> wrote:
>
> Hi Richard,
>
> >> Yes so it does the insane "fully unrolled trailing loop before the unrolled
> >> loop" thing. One always does the trailing loop last (and typically as an
> >> actual loop of course) and then the code ends up much faster, close to
> >> the ideal version shown in the PR.
> >
> > Well, you can't do the unrolled loop first unless you keep all exit tests.
> > Not keeping them is the whole point of unrolling!
>
> You always need a loop entry test, but rather than testing iterations > 0,
> we can just test iterations >= 4 before entering a 4x unrolled loop.
>
> >> For these kinds of loops, stupid unrolling is clearly better than the
> >> default unrolling, both in size and in performance. For the example
> >> we only ever execute part of the "trailing" loop, and never enter the
> >> unrolled main loop!
> >
> > Well, then you don't want unrolling you want peeling. You'd be
> > actually happy with four peeled iterations and then the regular,
> > not unrolled loop at the tail.
>
> While peeling would work in this case since the average number of
> iterations is so small, that's not what you'd want in general. The key is
> not to do the trailing loop before the unrolled loop.
>
> > The stupid strategy is what it says - stupid.
>
> Absolutely, it still can be improved significantly. We need to characterize
> loops and unroll smartly using different unroll strategies rather than
> bluntly unroll every loop 8 times.
>
> > Sure, which is why I suggest to change how we emit the
> > prologue here. We can select the variant of the prologue
> > with a target hook based on preference for example, between
> > doing it peeling-like (which you prefer), using a scheme
> > like current (preferably in some optimized form).
>
> Well what I'm suggesting is to move the prologue to the epilogue
> similar to how the vectorizer executes the trailing loop at the end
> (rather than before the vectorized loop).

OK, that works as well. The current scheme tries to combine peeling and
unrolling to get the benefit of both. For the case here [insert sound
heuristics] we want the peeling done as a loop. Whether that's
placed before or after the unrolled copy doesn't matter I guess, you'd
either have

  for (; i < n % unroll-factor; )
    prologue;
  if (n >= unroll-factor)
    for ()
      unrolled-loop

or

  if (n / unroll-factor > 0)
    for ()
      unrolled-loop
  if (n % unroll-factor > 0)
    for ()
      epilogue

I think a prologue might be more efficient and easier to set up as far
as IV-reuse is concerned? In theory loop-unroll can then still decide
(with heuristics) to peel the prologue/epilogue (though we removed the
peeling code).

Richard.

> Cheers,
> Wilco
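
For illustration only, here is a minimal C sketch of the two placements discussed in this message, applied to a simple array sum. The unroll factor of 4, the macro UNROLL and the function names sum_prologue_first and sum_epilogue_last are assumptions made up for this example. It is hand-written source, not what GCC's RTL unroller actually emits.

```c
#include <stddef.h>

/* Hypothetical unroll factor, chosen only for this illustration. */
#define UNROLL 4

/* Variant 1: the n % UNROLL remainder iterations run first, as a
   rolled prologue loop, followed by the unrolled main loop. */
long sum_prologue_first(const int *a, size_t n)
{
    long s = 0;
    size_t i = 0;

    /* Prologue: remainder iterations, kept as a loop (not peeled). */
    for (; i < n % UNROLL; i++)
        s += a[i];

    /* n - i is now a multiple of UNROLL, so a single entry test is
       enough and the body needs no intermediate exit tests. */
    if (n - i >= UNROLL) {
        do {
            s += a[i];
            s += a[i + 1];
            s += a[i + 2];
            s += a[i + 3];
            i += UNROLL;
        } while (i < n);
    }
    return s;
}

/* Variant 2: the unrolled main loop runs first, guarded by an
   "at least UNROLL iterations left" test, and the remainder runs
   last as a rolled epilogue loop. */
long sum_epilogue_last(const int *a, size_t n)
{
    long s = 0;
    size_t i = 0;

    /* i never exceeds n here, so n - i cannot wrap around. */
    for (; n - i >= UNROLL; i += UNROLL) {
        s += a[i];
        s += a[i + 1];
        s += a[i + 2];
        s += a[i + 3];
    }

    /* Epilogue: at most UNROLL - 1 remaining iterations. */
    for (; i < n; i++)
        s += a[i];

    return s;
}
```

Both functions execute the same number of iterations for any n. They differ in which loop is reached first and in how the induction variable i is carried from one loop into the next.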