On Mon, 27 Nov 2023, Richard Sandiford wrote:

> Catching up on backlog, so this might already be resolved, but:
> 
> Richard Biener <rguent...@suse.de> writes:
> > On Tue, 7 Nov 2023, Tamar Christina wrote:
> >
> >> > -----Original Message-----
> >> > From: Richard Biener <rguent...@suse.de>
> >> > Sent: Tuesday, November 7, 2023 9:43 AM
> >> > To: Tamar Christina <tamar.christ...@arm.com>
> >> > Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>
> >> > Subject: RE: [PATCH v6 0/21]middle-end: Support early break/return auto-
> >> > vectorization
> >> > 
> >> > On Mon, 6 Nov 2023, Tamar Christina wrote:
> >> > 
> >> > > > -----Original Message-----
> >> > > > From: Richard Biener <rguent...@suse.de>
> >> > > > Sent: Monday, November 6, 2023 2:25 PM
> >> > > > To: Tamar Christina <tamar.christ...@arm.com>
> >> > > > Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>
> >> > > > Subject: Re: [PATCH v6 0/21]middle-end: Support early break/return
> >> > > > auto- vectorization
> >> > > >
> >> > > > On Mon, 6 Nov 2023, Tamar Christina wrote:
> >> > > >
> >> > > > > Hi All,
> >> > > > >
> >> > > > > This patch adds initial support for early break vectorization in 
> >> > > > > GCC.
> >> > > > > The support is added for any target that implements a vector
> >> > > > > cbranch optab, this includes both fully masked and non-masked 
> >> > > > > targets.
> >> > > > >
> >> > > > > Depending on the operation, the vectorizer may also require
> >> > > > > support for boolean mask reductions using Inclusive OR.  This is
> >> > > > > however only checked then the comparison would produce multiple
> >> > statements.
> >> > > > >
> >> > > > > Note: I am currently struggling to get patch 7 correct in all
> >> > > > > cases and could
> >> > > > use
> >> > > > >       some feedback there.
> >> > > > >
> >> > > > > Concretely the kind of loops supported are of the forms:
> >> > > > >
> >> > > > >  for (int i = 0; i < N; i++)
> >> > > > >  {
> >> > > > >    <statements1>
> >> > > > >    if (<condition>)
> >> > > > >      {
> >> > > > >        ...
> >> > > > >        <action>;
> >> > > > >      }
> >> > > > >    <statements2>
> >> > > > >  }
> >> > > > >
> >> > > > > where <action> can be:
> >> > > > >  - break
> >> > > > >  - return
> >> > > > >  - goto
> >> > > > >
> >> > > > > Any number of statements can be used before the <action> occurs.
> >> > > > >
> >> > > > > Since this is an initial version for GCC 14 it has the following
> >> > > > > limitations and
> >> > > > > features:
> >> > > > >
> >> > > > > - Only fixed sized iterations and buffers are supported.  That is 
> >> > > > > to say any
> >> > > > >   vectors loaded or stored must be to statically allocated arrays 
> >> > > > > with
> >> > known
> >> > > > >   sizes. N must also be known.  This limitation is because our 
> >> > > > > primary
> >> > target
> >> > > > >   for this optimization is SVE.  For VLA SVE we can't easily do 
> >> > > > > cross page
> >> > > > >   iteraion checks. The result is likely to also not be beneficial. 
> >> > > > > For that
> >> > > > >   reason we punt support for variable buffers till we have 
> >> > > > > First-Faulting
> >> > > > >   support in GCC.
> >> > 
> >> > Btw, for this I wonder if you thought about marking memory accesses 
> >> > required
> >> > for the early break condition as required to be vector-size aligned, 
> >> > thus peeling
> >> > or versioning them for alignment?  That should ensure they do not fault.
> >> > 
> >> > OTOH I somehow remember prologue peeling isn't supported for early break
> >> > vectorization?  ..
> >> > 
> >> > > > > - any stores in <statements1> should not be to the same objects as 
> >> > > > > in
> >> > > > >   <condition>.  Loads are fine as long as they don't have the 
> >> > > > > possibility to
> >> > > > >   alias.  More concretely, we block RAW dependencies when the
> >> > > > > intermediate
> >> > > > value
> >> > > > >   can't be separated fromt the store, or the store itself can't be 
> >> > > > > moved.
> >> > > > > - Prologue peeling, alignment peelinig and loop versioning are 
> >> > > > > supported.
> >> > 
> >> > .. but here you say it is.  Not sure if peeling for alignment works for 
> >> > VLA vectors
> >> > though.  Just to say x86 doesn't support first-faulting loads.
> >> 
> >> For VLA we support it through masking.  i.e. if you need to peel N 
> >> iterations, we
> >> generate a masked copy of the loop vectorized which masks off the first N 
> >> bits.
> >> 
> >> This is not typically needed, but we do support it.  But the problem with 
> >> this
> >> scheme and early break is obviously that the peeled loop needs to be 
> >> vectorized
> >> so you kinda end up with the same issue again.  So Atm it rejects it for 
> >> VLA.
> >
> > Hmm, I see.  I thought peeling by masking is an optimization.
> 
> Yeah, it's an opt-in optimisation.  No current Arm cores opt in though.
> 
> > Anyhow, I think it should still work here - since all accesses are aligned
> > and we know that there's at least one original scalar iteration in the
> > first masked and the following "unmasked" vector iterations there
> > should never be faults for any of the aligned accesses.
> 
> Peeling via masking works by using the main loop for the "peeled"
> iteration (so it's a bit of a misnomer).  The vector pointers start
> out lower than the original scalar pointers, with some leading
> inactive elements.
> 
> The awkwardness would be in skipping those leading inactive elements
> in the epilogue, if an early break occurs in the first vector iteration.
> Definitely doable, but I imagine not trivial.
> 
> > I think going via alignment is a way easier method to guarantee this
> > than handwaving about "declared" arrays and niter.  One can try that
> > in addition of course - it's not always possible to align all
> > vector loads we are going to speculate (for VLA one could also
> > find common runtime (mis-)alignment and restrict the vector length based
> > on that, for RISC-V it seems to be efficient, not sure whether altering
> > that for SVE is though).
> 
> I think both techniques (alignment and reasoning about accessibility)
> are useful.  And they each help with different cases.  Like you say,
> if there are two vector loads that need to be aligned, we'd need to
> version for alignment on fixed-length architectures, with a scalar
> fallback when the alignment requirement isn't met.  In contrast,
> static reasoning about accessibility allows the vector loop to be
> used for all relative misalignments.
> 
> So I think the aim should be to support both techniques.  But IMO it's
> reasonable to start with either one.  It sounds from Tamar's results
> like starting with static reasoning does fire quite often, and it
> should have less runtime overhead than the alignment approach.

Fair enough, we need to fix the correctness issues then though
(as said, correctness is way easier to assert for alignment).

> Plus, when the loop operates on chars, it's hard to predict whether
> peeling for alignment pays for itself, or whether the scalar prologue
> will end up handling the majority of cases.  If we have the option
> of not peeling for alignment, then it's probably worth taking it
> for chars.

That's true.

> Capping the VL at runtime is possible on SVE.  It's on the backlog
> for handling runtime aliases, where we can vectorise with a lower VF
> rather than falling back to scalar code.  But first-faulting loads
> are likely to be better than halving or quartering the VL at runtime,
> so I don't think capping the VL would be the right SVE technique for
> early exits.

For targets with no first-faulting loads we only have alignment as
additional possibility then.  I can look at this for next stage1.

Richard.

> Thanks,
> Richard

Reply via email to