Hi Richard,

> On 23 Oct 2024, at 11:30, Richard Sandiford <richard.sandif...@arm.com> wrote:
>
> Kyrylo Tkachov <ktkac...@nvidia.com> writes:
>> Hi all,
>>
>> Some vector rotate operations can be implemented in a single instruction
>> rather than using the fallback SHL+USRA sequence.
>> In particular, when the rotate amount is half the bitwidth of the element
>> we can use a REV64, REV32 or REV16 instruction.
>> This patch adds this transformation in the recently added splitter for vector
>> rotates.
>> Bootstrapped and tested on aarch64-none-linux-gnu.
>>
>> Signed-off-by: Kyrylo Tkachov <ktkac...@nvidia.com>
>>
>> gcc/
>>
>> * config/aarch64/aarch64-protos.h (aarch64_emit_opt_vec_rotate):
>> Declare prototype.
>> * config/aarch64/aarch64.cc (aarch64_emit_opt_vec_rotate): Implement.
>> * config/aarch64/aarch64-simd.md (*aarch64_simd_rotate_imm<mode>):
>> Call the above.
>>
>> gcc/testsuite/
>>
>> * gcc.target/aarch64/simd/pr117048_2.c: New test.
>
> Sorry to be awkward, but I still think at least part of this should be
> target-independent.  Any rotate by a byte amount can be expressed as a
> vector permutation in a target-independent way.  Target-independent code
> can then use the usual optab routines to query whether the permutation
> is possible and/or try to generate it.

Thank you for elaborating. I had already prototyped the permute index-computing
code in my tree, but I was reluctant to use it during expand as I wanted the
rotate RTX to be available for combining into XAR, so I felt a bit stuck.
Having the code in a generic place but called from the backend at a time of its
choosing makes sense to me.

>
> I can see that it probably makes sense to leave target code to make
> the decision about when to use the permutation strategy vs. other
> approaches.  But the code to implement that strategy shouldn't need
> to be target-specific.
>
> E.g. we could have a routine:
>
>  expand_rotate_as_vec_perm
>
> which checks whether the rotation amount is suitable and tries to
> generate the permutation if so.

I've implemented something like that in the attached patch.
It seems to work on AArch64, but as mentioned in the commit message I'd like a
check on the big-endian logic, and perhaps some pointers on how/whether it
should be extended to VLA vectors.

I'm updating the other patches in the series according to your feedback and
will repost them once I'm done; I just wanted to get this one out for further
iteration in the meantime.
Thanks,
Kyrill
Attachment: 0001-aarch64-Optimize-vector-rotates-as-vector-permutes-w.patch
