Issue | 123069
Summary | RISC-V EVL tail folding
Labels | backend:RISC-V, vectorizers
Assignees |
Reporter | lukel97
On the spacemit-x60, [GCC 14 is ~24% faster on the 525.x264_r SPEC CPU 2017 benchmark than a recent build of Clang](https://lnt.lukelau.me/db_default/v4/nts/profile/13/18/15).
A big chunk of this difference is due to GCC tail folding its loops with VL, whereas LLVM doesn't by default.
Because LLVM doesn't tail fold its loops, it generates both a vectorized body and a scalar epilogue. A minimum trip count >= VF is required to execute the vectorized body; otherwise only the scalar epilogue runs.
On 525.x264_r, some very hot functions (e.g. `get_ref`) never meet the minimum trip count, so the vector code is never run. Tail folding avoids this issue and allows the vectorized body to run every time.
There are likely other performance benefits to be had with tail folding with VL, so it seems worthwhile exploring.
"EVL tail folding" (LLVM's vector-predication terminology for VL tail folding) can be enabled from Clang with `-mllvm -prefer-predicate-over-epilogue=predicate-else-scalar-epilogue -mllvm -force-tail-folding-style=data-with-evl`. It initially landed in #76172, but it isn't enabled by default yet because support for it is not fully complete, both in the loop vectorizer and elsewhere in the RISC-V backend.
This issue aims to track what work is needed across the LLVM project to bring it up to a stable state, at which point we can evaluate its performance to see if it should be enabled by default.
This is not a complete list and only contains the tasks I've noticed so far. Please feel free to edit and add to it!
I expect we will find more things that need to be addressed as time goes on.
- [ ] Set up CI infrastructure for `-force-tail-folding-style=data-with-evl`
- [ ] Likely need a buildbot that runs llvm-test-suite in this configuration, similar to the [AArch64 sve2 buildbots](https://lab.llvm.org/buildbot/#/builders/198)
- [X] Igalia is running [nightly SPEC CPU 2017 benchmarking with EVL tail folding via LNT](https://lnt.lukelau.me/db_default/v4/nts/machine/26)
- [ ] Address known miscompiles
- #122461
- [ ] Fix cases that abort vectorization entirely
  - On SPEC CPU 2017 as of 02403f4e450b86d93197dd34045ff40a34b21494, EVL tail folding vectorizes 57% fewer loops than were previously vectorized. This is likely due to vectorization aborting when it encounters unimplemented cases:
- VPWidenIntOrFpInductionRecipe
- #115274
- #118638
- VPWidenPointerInductionRecipe
- Fixed-length VFs: There are cases where scalable vectorization isn’t possible and we currently don't allow fixed-length VFs, so presumably nothing gets vectorized in this case.
- Cases where the RISC-V cost model may have become unprofitable with EVL tail folding
- [ ] Implement support for EVL tail folding in other parts of the loop vectorizer
- [ ] Fixed-order recurrences (will fall back to DataWithoutLaneMask style after #122458)
- #100755
- #114205 (see note on RISCVVLOptimizer below)
- [ ] Extend RISC-V VP intrinsic codegen
- Segmented accesses #120490
- Strided accesses in RISCVGatherScatterLowering
- #122244
- #122232
  - Eventually, the loop vectorizer should be taught to emit `vp.strided.{load,store}` intrinsics directly (cc @nikolaypanchenko)
- [ ] RISCVVLOptimizer
  - The VL optimizer may have made non-trapping VP intrinsics redundant. We should evaluate whether we still need to transform intrinsics/calls/binops into VP intrinsics in the loop vectorizer.
- #91796
_______________________________________________
llvm-bugs mailing list
llvm-bugs@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-bugs