On Mon, 10 Oct 2022, Andrew Stubbs wrote:

> On 10/10/2022 12:03, Richard Biener wrote:
> > The following picks up the prototype by Ju-Zhe Zhong for vectorizing
> > first order recurrences.  That solves two TSVC missed optimization PRs.
> > 
> > There's a new scalar cycle def kind, vect_first_order_recurrence
> > and it's handling of the backedge value vectorization is complicated
> > by the fact that the vectorized value isn't the PHI but instead
> > a (series of) permute(s) shifting in the recurring value from the
> > previous iteration.  I've implemented this by creating both the
> > single vectorized PHI and the series of permutes when vectorizing
> > the scalar PHI but leave the backedge values in both unassigned.
> > The backedge values are (for the testcases) computed by a load
> > which is also the place after which the permutes are inserted.
> > That placement also restricts the cases we can handle (without
> > resorting to code motion).
> > 
> > I added both costing and SLP handling though SLP handling is
> > restricted to the case where a single vectorized PHI is enough.
> > 
> > Missing is epilogue handling - while prologue peeling would
> > be handled transparently by adjusting iv_phi_p the epilogue
> > case doesn't work with just inserting a scalar LC PHI since
> > that a) keeps the scalar load live and b) that loads is the
> > wrong one, it has to be the last, much like when we'd vectorize
> > the LC PHI as live operation.  Unfortunately LIVE
> > compute/analysis happens too early before we decide on
> > peeling.  When using fully masked loop vectorization the
> > vect-recurr-6.c works as expected though.
> > 
> > I have tested this on x86_64 for now, but since epilogue
> > handling is missing there's probably no practical cases.
> > My prototype WHILE_ULT AVX512 patch can handle vect-recurr-6.c
> > just fine but I didn't feel like running SPEC within SDE nor
> > is the WHILE_ULT patch complete enough.  Builds of SPEC 2k7
> > with fully masked loops succeed (minus three cases of
> > PR107096, caused by my WHILE_ULT prototype).
> > 
> > Bootstrapped and tested on x86_64-unknown-linux-gnu.
> > 
> > Testing with SVE, GCN or RVV appreciated, ideas how to cleanly
> > handle epilogues welcome.
> 
> The testcases all produce correct code on GCN and pass the execution tests.
> 
> The code isn't terribly optimal because we don't have a two-input permutation
> instruction, so we permute each half separately and vec_merge the results. In
> this case the first vector is always a no-op permutation so that's wasted
> cycles. We'd really want a vector rotate and write-lane (or the other way
> around). I think the special-case permutations can be recognised and coded
> into the backend, but I don't know if we can easily tell that the first vector
> is just a bunch of duplicates, when it's not constant.

It's not actually a bunch of duplicates in all but the first iteration.
But what you can recognize is that we're only using lane N - 1 of the
first vector, so you could model the permute as extract last
+ shift in scalar (the extracted lane).  IIRC VLA vector targets usually
have something like shift the vector and set the low lane from a
scalar?  The extract lane N - 1 might be more difficult but then
a rotate plus extracting lane 0 might work as well.

So yes, I think special-casing this constant permutation makes most
sense.

Richard.

Reply via email to