Catching up on backlog, so this might already be resolved, but:

Richard Biener <rguent...@suse.de> writes:
> On Tue, 7 Nov 2023, Tamar Christina wrote:
>
>> > -----Original Message-----
>> > From: Richard Biener <rguent...@suse.de>
>> > Sent: Tuesday, November 7, 2023 9:43 AM
>> > To: Tamar Christina <tamar.christ...@arm.com>
>> > Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>
>> > Subject: RE: [PATCH v6 0/21]middle-end: Support early break/return auto-
>> > vectorization
>> > 
>> > On Mon, 6 Nov 2023, Tamar Christina wrote:
>> > 
>> > > > -----Original Message-----
>> > > > From: Richard Biener <rguent...@suse.de>
>> > > > Sent: Monday, November 6, 2023 2:25 PM
>> > > > To: Tamar Christina <tamar.christ...@arm.com>
>> > > > Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>
>> > > > Subject: Re: [PATCH v6 0/21]middle-end: Support early break/return
>> > > > auto- vectorization
>> > > >
>> > > > On Mon, 6 Nov 2023, Tamar Christina wrote:
>> > > >
>> > > > > Hi All,
>> > > > >
>> > > > > This patch adds initial support for early break vectorization in GCC.
>> > > > > The support is added for any target that implements a vector
>> > > > > cbranch optab, this includes both fully masked and non-masked 
>> > > > > targets.
>> > > > >
>> > > > > Depending on the operation, the vectorizer may also require
>> > > > > support for boolean mask reductions using Inclusive OR.  This is
>> > > > > however only checked then the comparison would produce multiple
>> > statements.
>> > > > >
>> > > > > Note: I am currently struggling to get patch 7 correct in all
>> > > > > cases and could
>> > > > use
>> > > > >       some feedback there.
>> > > > >
>> > > > > Concretely the kind of loops supported are of the forms:
>> > > > >
>> > > > >  for (int i = 0; i < N; i++)
>> > > > >  {
>> > > > >    <statements1>
>> > > > >    if (<condition>)
>> > > > >      {
>> > > > >        ...
>> > > > >        <action>;
>> > > > >      }
>> > > > >    <statements2>
>> > > > >  }
>> > > > >
>> > > > > where <action> can be:
>> > > > >  - break
>> > > > >  - return
>> > > > >  - goto
>> > > > >
>> > > > > Any number of statements can be used before the <action> occurs.
>> > > > >
>> > > > > Since this is an initial version for GCC 14 it has the following
>> > > > > limitations and
>> > > > > features:
>> > > > >
>> > > > > - Only fixed sized iterations and buffers are supported.  That is to 
>> > > > > say any
>> > > > >   vectors loaded or stored must be to statically allocated arrays 
>> > > > > with
>> > known
>> > > > >   sizes. N must also be known.  This limitation is because our 
>> > > > > primary
>> > target
>> > > > >   for this optimization is SVE.  For VLA SVE we can't easily do 
>> > > > > cross page
>> > > > >   iteraion checks. The result is likely to also not be beneficial. 
>> > > > > For that
>> > > > >   reason we punt support for variable buffers till we have 
>> > > > > First-Faulting
>> > > > >   support in GCC.
>> > 
>> > Btw, for this I wonder if you thought about marking memory accesses 
>> > required
>> > for the early break condition as required to be vector-size aligned, thus 
>> > peeling
>> > or versioning them for alignment?  That should ensure they do not fault.
>> > 
>> > OTOH I somehow remember prologue peeling isn't supported for early break
>> > vectorization?  ..
>> > 
>> > > > > - any stores in <statements1> should not be to the same objects as in
>> > > > >   <condition>.  Loads are fine as long as they don't have the 
>> > > > > possibility to
>> > > > >   alias.  More concretely, we block RAW dependencies when the
>> > > > > intermediate
>> > > > value
>> > > > >   can't be separated fromt the store, or the store itself can't be 
>> > > > > moved.
>> > > > > - Prologue peeling, alignment peelinig and loop versioning are 
>> > > > > supported.
>> > 
>> > .. but here you say it is.  Not sure if peeling for alignment works for 
>> > VLA vectors
>> > though.  Just to say x86 doesn't support first-faulting loads.
>> 
>> For VLA we support it through masking.  i.e. if you need to peel N 
>> iterations, we
>> generate a masked copy of the loop vectorized which masks off the first N 
>> bits.
>> 
>> This is not typically needed, but we do support it.  But the problem with 
>> this
>> scheme and early break is obviously that the peeled loop needs to be 
>> vectorized
>> so you kinda end up with the same issue again.  So Atm it rejects it for VLA.
>
> Hmm, I see.  I thought peeling by masking is an optimization.

Yeah, it's an opt-in optimisation.  No current Arm cores opt in though.

> Anyhow, I think it should still work here - since all accesses are aligned
> and we know that there's at least one original scalar iteration in the
> first masked and the following "unmasked" vector iterations there
> should never be faults for any of the aligned accesses.

Peeling via masking works by using the main loop for the "peeled"
iteration (so it's a bit of a misnomer).  The vector pointers start
out lower than the original scalar pointers, with some leading
inactive elements.

The awkwardness would be in skipping those leading inactive elements
in the epilogue, if an early break occurs in the first vector iteration.
Definitely doable, but I imagine not trivial.

> I think going via alignment is a way easier method to guarantee this
> than handwaving about "declared" arrays and niter.  One can try that
> in addition of course - it's not always possible to align all
> vector loads we are going to speculate (for VLA one could also
> find common runtime (mis-)alignment and restrict the vector length based
> on that, for RISC-V it seems to be efficient, not sure whether altering
> that for SVE is though).

I think both techniques (alignment and reasoning about accessibility)
are useful.  And they each help with different cases.  Like you say,
if there are two vector loads that need to be aligned, we'd need to
version for alignment on fixed-length architectures, with a scalar
fallback when the alignment requirement isn't met.  In contrast,
static reasoning about accessibility allows the vector loop to be
used for all relative misalignments.

So I think the aim should be to support both techniques.  But IMO it's
reasonable to start with either one.  It sounds from Tamar's results
like starting with static reasoning does fire quite often, and it
should have less runtime overhead than the alignment approach.

Plus, when the loop operates on chars, it's hard to predict whether
peeling for alignment pays for itself, or whether the scalar prologue
will end up handling the majority of cases.  If we have the option
of not peeling for alignment, then it's probably worth taking it
for chars.

Capping the VL at runtime is possible on SVE.  It's on the backlog
for handling runtime aliases, where we can vectorise with a lower VF
rather than falling back to scalar code.  But first-faulting loads
are likely to be better than halving or quartering the VL at runtime,
so I don't think capping the VL would be the right SVE technique for
early exits.

Thanks,
Richard

Reply via email to