I have a few more thoughts during the vacation: we don't really need those fixes (work around for VL=0) *IF* we know the VL is constant or VLMAX, and that the situation for out-loop reduction, the only thing we need to fix is the in-loop reduction, that should be the only case will affected.
On Fri, Dec 27, 2024 at 8:17 AM Jeff Law <jeffreya...@gmail.com> wrote: > > > > On 12/24/24 2:24 AM, Robin Dapp wrote: > >> `.MASK_LEN_FOLD_LEFT_PLUS`(or `mask_len_fold_left_plus_m`) is expecting the > >> return value will be the start value even if the length is 0. > >> > >> However current code gen in RISC-V backend is not meet that semantic, it > >> will > >> result a random garbage value if length is 0. > >> > >> Let example by current code gen for MASK_LEN_FOLD_LEFT_PLUS with f64: > >> # _148 = .MASK_LEN_FOLD_LEFT_PLUS (stmp__148.33_134, > >> vect__70.32_138, { -1, ... }, loop_len_161, 0); > >> vsetvli zero,a5,e64,m1,ta,ma > >> vfmv.s.f v2,fa5 # insn 1 > >> vfredosum.vs v1,v1,v2 # insn 2 > >> vfmv.f.s fa5,v1 # insn 3 > >> > >> insn 1: > >> - vfmv.s.f won't do anything if VL=0, which means v2 will contain garbage > >> value. > >> insn 2: > >> - vfredosum.vs won't do anything if VL=0, and keep vd unchanged even TA. > >> (v-spec say: `If vl=0, no operation is performed and the destination > >> register > >> is not updated.`) > >> insn 3: > >> - vfmv.f.s will move the value from v1 even VL=0, so this is safe. > >> > >> So how we fix that? we need two fix for that: > >> > >> 1. insn 1: need always execute with VL=1, so that we can guarantee it will > >> always work as expect. > >> 2. insn 2: Add new pattern to force `vd` use same reg as `vs1` (start > >> value) for > >> all reduction patterns, then we can guarantee vd[0] will > >> contain the > >> start value when vl=0 > >> > >> For 1, it's just a simple change to riscv_vector::expand_reduction, but > >> for 2, > >> we have to add _AV variant reduction to force `vd` use same reg as `vs1` > >> (start > >> value). > > > > I think I'm OK with both approaches. If the second one doesn't negatively > > affect register pressure it certainly looks a bit nicer. On the other hand > > I > > would expect the reduction to dominate latency anyway so an additional > > vsetvl > > shouldn't really matter. > > > > Another thing to consider - can we even hit the vl=0 case with the regular > > reductions? They are VLMAX only so just affected by when we enable > > SELECT_VL > > which we only do with one rgroup so vl=0 shouldn't happen (as opposed to > > fold left). I still need to convince myself that this is correct but the > > fact > > that you only saw fallout with fold left partially confirms it. > No strong opinions either. I had to deal with something in this space a > while back on another design. I don't remember the impedance mismatch > between the fold-left GCC semantics and the chip, but it was solvable > with a single copy and (IIRC) a mask which was all set up in the expander. > > In general I wouldn't expect a fold-left reduction to be a significant > performance factor -- you still have to do the ops in order which is th > performance killer. But supporting fold-left was still useful, even if > for no other reason than allowing other parts of a loop to vectorize, > even if the reduction step was a performance bottleneck. > > jeff