I have a few more thoughts during the vacation: we don't really need
those fixes (work around for VL=0) *IF* we know the VL is constant or
VLMAX, and that the situation for out-loop reduction, the only thing
we need to fix is the in-loop reduction, that should be the only case
will affected.

On Fri, Dec 27, 2024 at 8:17 AM Jeff Law <jeffreya...@gmail.com> wrote:
>
>
>
> On 12/24/24 2:24 AM, Robin Dapp wrote:
> >> `.MASK_LEN_FOLD_LEFT_PLUS`(or `mask_len_fold_left_plus_m`) is expecting the
> >> return value will be the start value even if the length is 0.
> >>
> >> However current code gen in RISC-V backend is not meet that semantic, it 
> >> will
> >> result a random garbage value if length is 0.
> >>
> >> Let example by current code gen for MASK_LEN_FOLD_LEFT_PLUS with f64:
> >>          # _148 = .MASK_LEN_FOLD_LEFT_PLUS (stmp__148.33_134, 
> >> vect__70.32_138, { -1, ... }, loop_len_161, 0);
> >>          vsetvli zero,a5,e64,m1,ta,ma
> >>          vfmv.s.f        v2,fa5     # insn 1
> >>          vfredosum.vs    v1,v1,v2   # insn 2
> >>          vfmv.f.s        fa5,v1     # insn 3
> >>
> >> insn 1:
> >> - vfmv.s.f won't do anything if VL=0, which means v2 will contain garbage 
> >> value.
> >> insn 2:
> >> - vfredosum.vs won't do anything if VL=0, and keep vd unchanged even TA.
> >> (v-spec say: `If vl=0, no operation is performed and the destination 
> >> register
> >>   is not updated.`)
> >> insn 3:
> >> - vfmv.f.s will move the value from v1 even VL=0, so this is safe.
> >>
> >> So how we fix that? we need two fix for that:
> >>
> >> 1. insn 1: need always execute with VL=1, so that we can guarantee it will
> >>             always work as expect.
> >> 2. insn 2: Add new pattern to force `vd` use same reg as `vs1` (start 
> >> value) for
> >>             all reduction patterns, then we can guarantee vd[0] will 
> >> contain the
> >>             start value when vl=0
> >>
> >> For 1, it's just a simple change to riscv_vector::expand_reduction, but 
> >> for 2,
> >> we have to add _AV variant reduction to force `vd` use same reg as `vs1` 
> >> (start
> >> value).
> >
> > I think I'm OK with both approaches.  If the second one doesn't negatively
> > affect register pressure it certainly looks a bit nicer.  On the other hand 
> > I
> > would expect the reduction to dominate latency anyway so an additional 
> > vsetvl
> > shouldn't really matter.
> >
> > Another thing to consider - can we even hit the vl=0 case with the regular
> > reductions?  They are VLMAX only so just affected by when we enable 
> > SELECT_VL
> > which we only do with one rgroup so vl=0 shouldn't happen (as opposed to
> > fold left).  I still need to convince myself that this is correct but the 
> > fact
> > that you only saw fallout with fold left partially confirms it.
> No strong opinions either.  I had to deal with something in this space a
> while back on another design.  I don't remember the impedance mismatch
> between the fold-left GCC semantics and the chip, but it was solvable
> with a single copy and (IIRC) a mask which was all set up in the expander.
>
> In general I wouldn't expect a fold-left reduction to be a significant
> performance factor -- you still have to do the ops in order which is th
> performance killer.  But supporting fold-left was still useful, even if
> for no other reason than allowing other parts of a loop to vectorize,
> even if the reduction step was a performance bottleneck.
>
> jeff

Reply via email to