On 10/27/24 10:22 AM, Kyrylo Tkachov wrote:
Hi all,
Some vector rotate operations can be implemented in a single instruction
rather than using the fallback SHL+USRA sequence.
In particular, when the rotate amount is half the bitwidth of the element
we can use a REV64, REV32 or REV16 instruction.
More generally, rotates by a byte amount can be implemented using vector
permutes.
This patch adds such a generic routine in expmed.cc called
expand_rotate_as_vec_perm that calculates the required permute indices
and uses the expand_vec_perm_const interface.
On aarch64 this ends up generating the single-instruction sequences above
where possible and can use LDR+TBL sequences too, which are a good choice.
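To illustrate the idea of computing permute indices for a rotate by a whole number of bytes, here is a minimal standalone sketch. It is not the code from the patch: the helper names (rotate_perm_indices, apply_perm, rotl32_via_perm) are hypothetical, and it assumes little-endian byte order within each element and a fixed-width vector.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not GCC's routine): fill SEL with byte indices
   such that out[i] = in[SEL[i]] rotates every ELT_BYTES-wide element
   left by ROT_BITS.  Assumes little-endian byte order inside each
   element.  Returns 0 if the rotate is not a whole number of bytes.  */
static int
rotate_perm_indices (uint8_t *sel, unsigned vec_bytes,
                     unsigned elt_bytes, unsigned rot_bits)
{
  assert (elt_bytes != 0 && vec_bytes % elt_bytes == 0);
  if (rot_bits % 8 != 0 || rot_bits >= elt_bytes * 8)
    return 0;

  unsigned nbytes = rot_bits / 8;
  for (unsigned i = 0; i < vec_bytes; i++)
    {
      unsigned base = i - i % elt_bytes;  /* First byte of this element.  */
      unsigned lane = i % elt_bytes;      /* Byte position in the element.  */
      /* A left rotate moves source byte (lane - nbytes) into LANE.  */
      sel[i] = (uint8_t) (base + (lane + elt_bytes - nbytes) % elt_bytes);
    }
  return 1;
}

/* Apply a byte permute in the style of a table lookup.  */
static void
apply_perm (uint8_t *out, const uint8_t *in, const uint8_t *sel,
            unsigned vec_bytes)
{
  for (unsigned i = 0; i < vec_bytes; i++)
    out[i] = in[sel[i]];
}

/* Rotate one 32-bit value through the permute.  The bytes are packed
   explicitly so the result does not depend on host endianness.  */
static uint32_t
rotl32_via_perm (uint32_t x, unsigned rot_bits)
{
  uint8_t in[4], out[4], sel[4];
  for (unsigned i = 0; i < 4; i++)
    in[i] = (uint8_t) (x >> (8 * i));

  int ok = rotate_perm_indices (sel, 4, 4, rot_bits);
  assert (ok);
  (void) ok;
  apply_perm (out, in, sel, 4);

  uint32_t r = 0;
  for (unsigned i = 0; i < 4; i++)
    r |= (uint32_t) out[i] << (8 * i);
  return r;
}
```

In this sketch a rotate of 32-bit elements by 16 yields the index pattern 2, 3, 0, 1 per element, which simply swaps the two 16-bit halves.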
With help from Richard, the routine should be VLA-safe.
However, the only use of expand_rotate_as_vec_perm introduced in this patch
is in aarch64-specific code that for now only handles fixed-width modes.
A runtime aarch64 test is added to ensure the permute indices are not messed
up.
Bootstrapped and tested on aarch64-none-linux-gnu.
Richard had approved these changes in the previous iteration, but I’ll only push
this after the prerequisites in the series.
Thanks,
Kyrill
Signed-off-by: Kyrylo Tkachov <ktkac...@nvidia.com>
gcc/
* expmed.h (expand_rotate_as_vec_perm): Declare.
* expmed.cc (expand_rotate_as_vec_perm): Define.
* config/aarch64/aarch64-protos.h (aarch64_emit_opt_vec_rotate):
Declare prototype.
* config/aarch64/aarch64.cc (aarch64_emit_opt_vec_rotate): Implement.
* config/aarch64/aarch64-simd.md (*aarch64_simd_rotate_imm<mode>):
Call the above.
gcc/testsuite/
* gcc.target/aarch64/vec-rot-exec.c: New test.
* gcc.target/aarch64/simd/pr117048_2.c: New test.
High-level question, and forgive me if it's already been asked and answered.
Is there a reason why we don't have the target reject the vector
rotation cases it can't natively handle and the expander code would then
try the rotate via permutation?
It seems like your patch defers everything to the target which calls
back into the expander code for the special case.
The implementation of rotate via permute seems generally reasonable. So
this is really about how we use that implementation.
jeff