Hi Jeff,

> On 31 Oct 2024, at 16:25, Jeff Law <jeffreya...@gmail.com> wrote:
>
> On 10/27/24 10:22 AM, Kyrylo Tkachov wrote:
>> Hi all,
>> Some vector rotate operations can be implemented in a single instruction
>> rather than using the fallback SHL+USRA sequence.
>> In particular, when the rotate amount is half the bitwidth of the element
>> we can use a REV64, REV32 or REV16 instruction.
>> More generally, rotates by a byte amount can be implemented using vector
>> permutes.
>> This patch adds such a generic routine in expmed.cc called
>> expand_rotate_as_vec_perm that calculates the required permute indices
>> and uses the expand_vec_perm_const interface.
>> On aarch64 this ends up generating the single-instruction sequences above
>> where possible and can use LDR+TBL sequences too, which are a good choice.
>> With help from Richard, the routine should be VLA-safe.
>> However, the only use of expand_rotate_as_vec_perm introduced in this patch
>> is in aarch64-specific code that for now only handles fixed-width modes.
>> A runtime aarch64 test is added to ensure the permute indices are not
>> messed up.
>> Bootstrapped and tested on aarch64-none-linux-gnu.
>> Richard had approved these changes in the previous iteration, but I'll only
>> push this after the prerequisites in the series.
>> Thanks,
>> Kyrill
>> Signed-off-by: Kyrylo Tkachov <ktkac...@nvidia.com>
>> gcc/
>> 	* expmed.h (expand_rotate_as_vec_perm): Declare.
>> 	* expmed.cc (expand_rotate_as_vec_perm): Define.
>> 	* config/aarch64/aarch64-protos.h (aarch64_emit_opt_vec_rotate):
>> 	Declare prototype.
>> 	* config/aarch64/aarch64.cc (aarch64_emit_opt_vec_rotate): Implement.
>> 	* config/aarch64/aarch64-simd.md (*aarch64_simd_rotate_imm<mode>):
>> 	Call the above.
>> gcc/testsuite/
>> 	* gcc.target/aarch64/vec-rot-exec.c: New test.
>> 	* gcc.target/aarch64/simd/pr117048_2.c: New test.
>
> High level question, and forgive me if it's already been asked and answered.
>
> Is there a reason why we don't have the target reject the vector rotation
> cases it can't natively handle, with the expander code then trying the
> rotate via permutation?
>
> It seems like your patch defers everything to the target, which calls back
> into the expander code for the special case.
There are two reasons:
* In the cases I've been looking at, the rotate is only detected in
post-expand RTL passes, so the post-combine split is the first place where
we even get the chance to make that decision.
* On AArch64, once the rotate has been detected we want to give combine an
opportunity to propagate it further into a combined XOR+rotate instruction
(XAR), and only resort to what this patch does if no such combination was
made.

I think the generic rotate expansion code could use the new routine in the
future, but the target needs to be heavily involved in the decision.
More broadly, the expansion sequence for these rotates on aarch64 can be:
* a shift-left + combined shift-and-accumulate (SHL+USRA);
* for some rotate amounts, a single REV* instruction;
* for targets that support certain ISA extensions, a zeroing instruction +
XAR instruction;
* for byte-multiple rotate amounts, a permute mask creation + vector
permute (a sketch of the index calculation is in the P.S. below).

So using something like RTX costs to guide these decisions would be
somewhat unwieldy.

Thanks,
Kyrill

>
> The implementation of rotate via permute seems generally reasonable.  So
> this is really about how we use that implementation.
>
> jeff
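P.S. For the archives, a minimal standalone sketch of the little-endian
index calculation behind the permute-based expansion. The names and the
driver below are made up for illustration; the real
expand_rotate_as_vec_perm in expmed.cc builds a vec_perm_builder and also
handles big-endian and VLA modes:

/* Compute the byte-permute selector implementing a per-element left
   rotate by ROT bits (ROT a multiple of 8) on a little-endian vector
   of VEC_BYTES bytes with ELT_BYTES-byte elements.  Illustrative only.  */
#include <stdio.h>

static void
build_rotate_perm (unsigned elt_bytes, unsigned vec_bytes,
		   unsigned rot, unsigned char *sel)
{
  /* Result byte I of each element comes from source byte
     (I - ROT/8) mod ELT_BYTES of the same element.  */
  unsigned k = elt_bytes - rot / 8;
  for (unsigned base = 0; base < vec_bytes; base += elt_bytes)
    for (unsigned i = 0; i < elt_bytes; i++)
      sel[base + i] = base + (k + i) % elt_bytes;
}

int
main (void)
{
  unsigned char sel[16];
  /* 32-bit elements rotated left by 16 bits: the selector comes out as
     2 3 0 1 6 7 4 5 ..., i.e. exactly the REV32 byte pattern, which is
     why expand_vec_perm_const can pick the single-instruction form.  */
  build_rotate_perm (4, 16, 16, sel);
  for (unsigned i = 0; i < 16; i++)
    printf ("%u ", sel[i]);
  printf ("\n");
  return 0;
}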