On Mon, 27 Nov 2023, Richard Sandiford wrote: > Catching up on backlog, so this might already be resolved, but: > > Richard Biener <rguent...@suse.de> writes: > > On Tue, 7 Nov 2023, Tamar Christina wrote: > > > >> > -----Original Message----- > >> > From: Richard Biener <rguent...@suse.de> > >> > Sent: Tuesday, November 7, 2023 9:43 AM > >> > To: Tamar Christina <tamar.christ...@arm.com> > >> > Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com> > >> > Subject: RE: [PATCH v6 0/21]middle-end: Support early break/return auto- > >> > vectorization > >> > > >> > On Mon, 6 Nov 2023, Tamar Christina wrote: > >> > > >> > > > -----Original Message----- > >> > > > From: Richard Biener <rguent...@suse.de> > >> > > > Sent: Monday, November 6, 2023 2:25 PM > >> > > > To: Tamar Christina <tamar.christ...@arm.com> > >> > > > Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com> > >> > > > Subject: Re: [PATCH v6 0/21]middle-end: Support early break/return > >> > > > auto- vectorization > >> > > > > >> > > > On Mon, 6 Nov 2023, Tamar Christina wrote: > >> > > > > >> > > > > Hi All, > >> > > > > > >> > > > > This patch adds initial support for early break vectorization in > >> > > > > GCC. > >> > > > > The support is added for any target that implements a vector > >> > > > > cbranch optab, this includes both fully masked and non-masked > >> > > > > targets. > >> > > > > > >> > > > > Depending on the operation, the vectorizer may also require > >> > > > > support for boolean mask reductions using Inclusive OR. This is > >> > > > > however only checked then the comparison would produce multiple > >> > statements. > >> > > > > > >> > > > > Note: I am currently struggling to get patch 7 correct in all > >> > > > > cases and could > >> > > > use > >> > > > > some feedback there. > >> > > > > > >> > > > > Concretely the kind of loops supported are of the forms: > >> > > > > > >> > > > > for (int i = 0; i < N; i++) > >> > > > > { > >> > > > > <statements1> > >> > > > > if (<condition>) > >> > > > > { > >> > > > > ... > >> > > > > <action>; > >> > > > > } > >> > > > > <statements2> > >> > > > > } > >> > > > > > >> > > > > where <action> can be: > >> > > > > - break > >> > > > > - return > >> > > > > - goto > >> > > > > > >> > > > > Any number of statements can be used before the <action> occurs. > >> > > > > > >> > > > > Since this is an initial version for GCC 14 it has the following > >> > > > > limitations and > >> > > > > features: > >> > > > > > >> > > > > - Only fixed sized iterations and buffers are supported. That is > >> > > > > to say any > >> > > > > vectors loaded or stored must be to statically allocated arrays > >> > > > > with > >> > known > >> > > > > sizes. N must also be known. This limitation is because our > >> > > > > primary > >> > target > >> > > > > for this optimization is SVE. For VLA SVE we can't easily do > >> > > > > cross page > >> > > > > iteraion checks. The result is likely to also not be beneficial. > >> > > > > For that > >> > > > > reason we punt support for variable buffers till we have > >> > > > > First-Faulting > >> > > > > support in GCC. > >> > > >> > Btw, for this I wonder if you thought about marking memory accesses > >> > required > >> > for the early break condition as required to be vector-size aligned, > >> > thus peeling > >> > or versioning them for alignment? That should ensure they do not fault. > >> > > >> > OTOH I somehow remember prologue peeling isn't supported for early break > >> > vectorization? .. > >> > > >> > > > > - any stores in <statements1> should not be to the same objects as > >> > > > > in > >> > > > > <condition>. Loads are fine as long as they don't have the > >> > > > > possibility to > >> > > > > alias. More concretely, we block RAW dependencies when the > >> > > > > intermediate > >> > > > value > >> > > > > can't be separated fromt the store, or the store itself can't be > >> > > > > moved. > >> > > > > - Prologue peeling, alignment peelinig and loop versioning are > >> > > > > supported. > >> > > >> > .. but here you say it is. Not sure if peeling for alignment works for > >> > VLA vectors > >> > though. Just to say x86 doesn't support first-faulting loads. > >> > >> For VLA we support it through masking. i.e. if you need to peel N > >> iterations, we > >> generate a masked copy of the loop vectorized which masks off the first N > >> bits. > >> > >> This is not typically needed, but we do support it. But the problem with > >> this > >> scheme and early break is obviously that the peeled loop needs to be > >> vectorized > >> so you kinda end up with the same issue again. So Atm it rejects it for > >> VLA. > > > > Hmm, I see. I thought peeling by masking is an optimization. > > Yeah, it's an opt-in optimisation. No current Arm cores opt in though. > > > Anyhow, I think it should still work here - since all accesses are aligned > > and we know that there's at least one original scalar iteration in the > > first masked and the following "unmasked" vector iterations there > > should never be faults for any of the aligned accesses. > > Peeling via masking works by using the main loop for the "peeled" > iteration (so it's a bit of a misnomer). The vector pointers start > out lower than the original scalar pointers, with some leading > inactive elements. > > The awkwardness would be in skipping those leading inactive elements > in the epilogue, if an early break occurs in the first vector iteration. > Definitely doable, but I imagine not trivial. > > > I think going via alignment is a way easier method to guarantee this > > than handwaving about "declared" arrays and niter. One can try that > > in addition of course - it's not always possible to align all > > vector loads we are going to speculate (for VLA one could also > > find common runtime (mis-)alignment and restrict the vector length based > > on that, for RISC-V it seems to be efficient, not sure whether altering > > that for SVE is though). > > I think both techniques (alignment and reasoning about accessibility) > are useful. And they each help with different cases. Like you say, > if there are two vector loads that need to be aligned, we'd need to > version for alignment on fixed-length architectures, with a scalar > fallback when the alignment requirement isn't met. In contrast, > static reasoning about accessibility allows the vector loop to be > used for all relative misalignments. > > So I think the aim should be to support both techniques. But IMO it's > reasonable to start with either one. It sounds from Tamar's results > like starting with static reasoning does fire quite often, and it > should have less runtime overhead than the alignment approach.
Fair enough, we need to fix the correctness issues then though (as said, correctness is way easier to assert for alignment). > Plus, when the loop operates on chars, it's hard to predict whether > peeling for alignment pays for itself, or whether the scalar prologue > will end up handling the majority of cases. If we have the option > of not peeling for alignment, then it's probably worth taking it > for chars. That's true. > Capping the VL at runtime is possible on SVE. It's on the backlog > for handling runtime aliases, where we can vectorise with a lower VF > rather than falling back to scalar code. But first-faulting loads > are likely to be better than halving or quartering the VL at runtime, > so I don't think capping the VL would be the right SVE technique for > early exits. For targets with no first-faulting loads we only have alignment as additional possibility then. I can look at this for next stage1. Richard. > Thanks, > Richard