On 10/10/2022 12:03, Richard Biener wrote:
The following picks up the prototype by Ju-Zhe Zhong for vectorizing
first order recurrences. That solves two TSVC missed optimization PRs.
There's a new scalar cycle def kind, vect_first_order_recurrence
and it's handling of the backedge value vectorization is complicated
by the fact that the vectorized value isn't the PHI but instead
a (series of) permute(s) shifting in the recurring value from the
previous iteration. I've implemented this by creating both the
single vectorized PHI and the series of permutes when vectorizing
the scalar PHI but leave the backedge values in both unassigned.
The backedge values are (for the testcases) computed by a load
which is also the place after which the permutes are inserted.
That placement also restricts the cases we can handle (without
resorting to code motion).
I added both costing and SLP handling though SLP handling is
restricted to the case where a single vectorized PHI is enough.
Missing is epilogue handling - while prologue peeling would
be handled transparently by adjusting iv_phi_p the epilogue
case doesn't work with just inserting a scalar LC PHI since
that a) keeps the scalar load live and b) that loads is the
wrong one, it has to be the last, much like when we'd vectorize
the LC PHI as live operation. Unfortunately LIVE
compute/analysis happens too early before we decide on
peeling. When using fully masked loop vectorization the
vect-recurr-6.c works as expected though.
I have tested this on x86_64 for now, but since epilogue
handling is missing there's probably no practical cases.
My prototype WHILE_ULT AVX512 patch can handle vect-recurr-6.c
just fine but I didn't feel like running SPEC within SDE nor
is the WHILE_ULT patch complete enough. Builds of SPEC 2k7
with fully masked loops succeed (minus three cases of
PR107096, caused by my WHILE_ULT prototype).
Bootstrapped and tested on x86_64-unknown-linux-gnu.
Testing with SVE, GCN or RVV appreciated, ideas how to cleanly
handle epilogues welcome.
The testcases all produce correct code on GCN and pass the execution tests.
The code isn't terribly optimal because we don't have a two-input
permutation instruction, so we permute each half separately and
vec_merge the results. In this case the first vector is always a no-op
permutation so that's wasted cycles. We'd really want a vector rotate
and write-lane (or the other way around). I think the special-case
permutations can be recognised and coded into the backend, but I don't
know if we can easily tell that the first vector is just a bunch of
duplicates, when it's not constant.
Andrew