Hi Richard,

> On 23 Oct 2024, at 11:30, Richard Sandiford <richard.sandif...@arm.com> wrote:
>
> Kyrylo Tkachov <ktkac...@nvidia.com> writes:
>> Hi all,
>>
>> Some vector rotate operations can be implemented in a single instruction
>> rather than using the fallback SHL+USRA sequence.
>> In particular, when the rotate amount is half the bitwidth of the element
>> we can use a REV64, REV32 or REV16 instruction.
>> This patch adds this transformation to the recently added splitter for
>> vector rotates.
>> Bootstrapped and tested on aarch64-none-linux-gnu.
>>
>> Signed-off-by: Kyrylo Tkachov <ktkac...@nvidia.com>
>>
>> gcc/
>>
>> 	* config/aarch64/aarch64-protos.h (aarch64_emit_opt_vec_rotate):
>> 	Declare prototype.
>> 	* config/aarch64/aarch64.cc (aarch64_emit_opt_vec_rotate): Implement.
>> 	* config/aarch64/aarch64-simd.md (*aarch64_simd_rotate_imm<mode>):
>> 	Call the above.
>>
>> gcc/testsuite/
>>
>> 	* gcc.target/aarch64/simd/pr117048_2.c: New test.
>
> Sorry to be awkward, but I still think at least part of this should be
> target-independent.  Any rotate by a byte amount can be expressed as a
> vector permutation in a target-independent way.  Target-independent code
> can then use the usual optab routines to query whether the permutation
> is possible and/or try to generate it.
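As a quick sanity check of the rotate-by-half observation quoted above: rotating each element by half its bit width simply swaps the element's two halves, which is the per-element effect of a single REV64/REV32/REV16. A standalone Python sketch (illustrative only, not GCC code):

```python
def rotl(v, r, width):
    """Rotate the width-bit value v left by r bits."""
    mask = (1 << width) - 1
    return ((v << r) | (v >> (width - r))) & mask

# Rotating a 32-bit element by 16 swaps its 16-bit halves,
# the per-element effect of REV32 operating on 16-bit chunks.
assert rotl(0x12345678, 16, 32) == 0x56781234

# Likewise, rotating a 16-bit element by 8 swaps its bytes (REV16).
assert rotl(0xABCD, 8, 16) == 0xCDAB
```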
Thank you for elaborating.  I had already prototyped the permute
index-computing code in my tree, but I was reluctant to use it during
expand because I wanted the rotate RTX to be available for combining
into XAR, so I felt a bit stuck.  Having the code in a generic place
but called from the backend at a time of its choosing makes sense to me.

> I can see that it probably makes sense to leave target code to make
> the decision about when to use the permutation strategy vs. other
> approaches.  But the code to implement that strategy shouldn't need
> to be target-specific.
>
> E.g. we could have a routine:
>
>   expand_rotate_as_vec_perm
>
> which checks whether the rotation amount is suitable and tries to
> generate the permutation if so.

I’ve implemented something like that in the attached patch.  It seems
to work on AArch64, but as mentioned in the commit message I’d like a
check on the big-endian logic, and perhaps some pointers on how/whether
it should be extended to VLA vectors.

I’m updating the other patches in the series according to your feedback
and will repost them once I’m done; I just wanted to get this out for
further iteration in the meantime.

Thanks,
Kyrill
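For reference, the index computation a routine like expand_rotate_as_vec_perm needs can be sketched as follows. This is illustrative Python under my own naming, not the code in the patch, and it assumes a little-endian byte layout (byte 0 of each element is the least significant) — the big-endian case mentioned above would need the indices adjusted:

```python
def rotate_perm_indices(elt_bytes, rot_bits, nelts):
    """Byte-permutation indices implementing a per-element rotate-left
    by rot_bits, for nelts elements of elt_bytes bytes each.
    Assumes little-endian layout: byte 0 is the least significant."""
    assert rot_bits % 8 == 0 and 0 < rot_bits < elt_bytes * 8
    k = rot_bits // 8
    # Result byte j of an element takes its value from source byte
    # (j - k) mod elt_bytes of the same element.
    return [e * elt_bytes + (j - k) % elt_bytes
            for e in range(nelts) for j in range(elt_bytes)]

# Rotating 4-byte elements left by 16 bits swaps their 16-bit halves
# (the REV case); two elements give selector [2,3,0,1, 6,7,4,5].
sel = rotate_perm_indices(4, 16, 2)
assert sel == [2, 3, 0, 1, 6, 7, 4, 5]

# Cross-check one element against an integer rotate:
src = [0x44, 0x33, 0x22, 0x11]               # 0x11223344, little-endian
permuted = [src[i] for i in sel[:4]]
assert permuted == [0x22, 0x11, 0x44, 0x33]  # 0x33441122 = rotl16
```

Target code would then ask whether this constant permutation is supported before committing to the strategy, which is what makes the split between generic implementation and target policy workable.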
0001-aarch64-Optimize-vector-rotates-as-vector-permutes-w.patch