Hi all,

Some vector rotate operations can be implemented in a single instruction
rather than using the fallback SHL+USRA sequence.
In particular, when the rotate amount is half the bit width of the element
we can use a single REV64, REV32 or REV16 instruction.
More generally, rotates by a whole number of bytes can be implemented using
vector permutes.
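For instance, rotating each 32-bit element left by 16 just swaps the two
16-bit halves of the element, which is exactly what REV32 on a .4h/.8h
arrangement does.  A minimal C sketch of the kind of loop this applies to
(the function name is purely illustrative):

#include <stdint.h>

/* Rotate each 32-bit element left by 16: a candidate for a single
   REV32 rather than an SHL + USRA pair.  */
void
rot16 (uint32_t *restrict dst, const uint32_t *restrict src, int n)
{
  for (int i = 0; i < n; i++)
    dst[i] = (src[i] << 16) | (src[i] >> 16);
}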
This patch adds a generic routine in expmed.cc, expand_rotate_as_vec_perm,
that calculates the required permute indices and uses the
expand_vec_perm_const interface.
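Roughly, the idea is that on a little-endian byte layout a left rotate by
a multiple of 8 bits is just a per-element byte shuffle.  A standalone
sketch of that index calculation (illustrative only, not the actual
expmed.cc code):

#include <stddef.h>

/* For NBYTES total vector bytes split into ESIZE-byte elements,
   compute selector indices so that result byte J of each element is
   taken from source byte (J - ROT_BITS/8) mod ESIZE of the same
   element (little-endian byte numbering).  */
static void
rotate_perm_indices (size_t nbytes, size_t esize, unsigned rot_bits,
                     unsigned char *sel)
{
  size_t shift = rot_bits / 8;
  for (size_t i = 0; i < nbytes; i++)
    {
      size_t elt = i / esize;   /* element this byte belongs to  */
      size_t byte = i % esize;  /* byte offset within the element  */
      sel[i] = elt * esize + (byte + esize - shift) % esize;
    }
}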

On aarch64 this ends up generating the single-instruction REV sequences above
where possible, and otherwise an LDR+TBL sequence, which is still preferable
to the generic SHL+USRA fallback.

With Richard's help the routine should be VLA-safe.
However, the only use of expand_rotate_as_vec_perm introduced in this patch
is in aarch64-specific code that for now handles only fixed-width modes.

A runtime aarch64 test is added to ensure the permute indices are computed
correctly.
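Something along the following shape would exercise that (a rough sketch only,
not the actual contents of vec-rot-exec.c): compare a loop the compiler is
expected to turn into permute-based rotates against a non-inlined scalar
reference.

#include <stdint.h>
#include <stdlib.h>

#define N 64

__attribute__ ((noipa)) static uint32_t
ref_rot (uint32_t x, unsigned r)
{
  return (x << r) | (x >> (32 - r));
}

int
main (void)
{
  uint32_t in[N], out[N];
  for (int i = 0; i < N; i++)
    in[i] = 0x01020304u * (i + 1);

  /* The loop the compiler should vectorize into permute-based rotates.  */
  for (int i = 0; i < N; i++)
    out[i] = (in[i] << 16) | (in[i] >> 16);

  /* Check against the scalar, non-inlined reference.  */
  for (int i = 0; i < N; i++)
    if (out[i] != ref_rot (in[i], 16))
      abort ();
  return 0;
}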

Bootstrapped and tested on aarch64-none-linux-gnu.
Richard approved these changes in the previous iteration, but I'll only push
this once the prerequisite patches in the series have gone in.

Thanks,
Kyrill

Signed-off-by: Kyrylo Tkachov <ktkac...@nvidia.com>

gcc/

        * expmed.h (expand_rotate_as_vec_perm): Declare.
        * expmed.cc (expand_rotate_as_vec_perm): Define.
        * config/aarch64/aarch64-protos.h (aarch64_emit_opt_vec_rotate):
        Declare prototype.
        * config/aarch64/aarch64.cc (aarch64_emit_opt_vec_rotate): Implement.
        * config/aarch64/aarch64-simd.md (*aarch64_simd_rotate_imm<mode>):
        Call the above.

gcc/testsuite/

        * gcc.target/aarch64/vec-rot-exec.c: New test.
        * gcc.target/aarch64/simd/pr117048_2.c: New test.

Attachment: v3-0004-aarch64-Optimize-vector-rotates-as-vector-permutes-w.patch
