Kyrylo Tkachov <ktkac...@nvidia.com> writes:
> Hi all,
>
> We can make use of the integrated rotate step of the XAR instruction
> to implement most vector integer rotates, as long we zero out one
> of the input registers for it.  This allows for a lower-latency sequence
> than the fallback SHL+USRA, especially when we can hoist the zeroing operation
> away from loops and hot parts.
> We can also use it for 64-bit vectors as long
> as we zero the top half of the vector to be rotated.  That should still be
> preferable to the default sequence.
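
(For the record, since it's the identity being relied on here: XAR
computes a per-lane rotate right of the XOR of its two inputs, so with
one input zeroed it degenerates to a plain rotate.  A scalar model of
a single 32-bit lane, purely for illustration and not part of the patch:

  /* Models one .s lane of XAR: rotate right by IMM of (A ^ B),
     for 0 < IMM < 32.  */
  static inline unsigned int
  xar_lane (unsigned int a, unsigned int b, unsigned int imm)
  {
    unsigned int x = a ^ b;
    return (x >> imm) | (x << (32 - imm));
  }

so xar_lane (r, 0, 23) matches G1's (r >> 23) | (r << 9) below.)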

Is the zeroing necessary?  We don't expect/require that 64-bit vector
modes are maintained in zero-extended form, or that 64-bit ops act as
strict_lowparts, so it should be OK to take a paradoxical subreg.
Or we could just extend the patterns to 64-bit modes, to avoid the
punning.
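
Very roughly, and untested, I had something like the below in mind for
aarch64_emit_opt_vec_rotate, where "src" stands for the 64-bit input
and "qmode" for its 128-bit counterpart mode (placeholder names, not
taken from the patch):

  /* Sketch only: reinterpret the 64-bit vector as its 128-bit
     counterpart via a paradoxical subreg, rather than emitting a
     zeroing fmov for the upper half.  */
  rtx wide = lowpart_subreg (qmode, src, GET_MODE (src));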

> With this patch we can generate, for the input:
> v4si
> G1 (v4si r)
> {
>     return (r >> 23) | (r << 9);
> }
>
> v8qi
> G2 (v8qi r)
> {
>   return (r << 3) | (r >> 5);
> }
> the following assembly for +sve2:
> G1:
>         movi    v31.4s, 0
>         xar     z0.s, z0.s, z31.s, #23
>         ret
>
> G2:
>         movi    v31.4s, 0
>         fmov    d0, d0
>         xar     z0.b, z0.b, z31.b, #5
>         ret
>
> instead of the current:
> G1:
>         shl     v31.4s, v0.4s, 9
>         usra    v31.4s, v0.4s, 23
>         mov     v0.16b, v31.16b
>         ret
> G2:
>         shl     v31.8b, v0.8b, 3
>         usra    v31.8b, v0.8b, 5
>         mov     v0.8b, v31.8b
>         ret
>
> Bootstrapped and tested on aarch64-none-linux-gnu.
>
> Signed-off-by: Kyrylo Tkachov <ktkac...@nvidia.com>
>
> gcc/
>
>       * config/aarch64/aarch64.cc (aarch64_emit_opt_vec_rotate): Add
>       generation of XAR sequences when possible.
>
> gcc/testsuite/
>
>       * gcc.target/aarch64/rotate_xar_1.c: New test.
> [...]
> +/*
> +** G1:
> +**   movi?   [vdz]([0-9]+)\.?(?:[0-9]*[bhsd])?, #?0
> +**   xar     v0\.2d, v([0-9]+)\.2d, v([0-9]+)\.2d, 39

FWIW, the (...) captures aren't necessary, since we never use backslash
references to them later.
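
E.g. the G1 body could just be (untested):

**	movi?	[vdz][0-9]+\.?(?:[0-9]*[bhsd])?, #?0
**	xar	v0\.2d, v[0-9]+\.2d, v[0-9]+\.2d, 39
**	ret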

Thanks,
Richard

> +**      ret
> +*/
> +v2di
> +G1 (v2di r) {
> +    return (r >> 39) | (r << 25);
> +}
> +
> +/*
> +** G2:
> +**   movi?   [vdz]([0-9]+)\.?(?:[0-9]*[bhsd])?, #?0
> +**   xar     z0\.s, z([0-9]+)\.s, z([0-9]+)\.s, #23
> +**      ret
> +*/
> +v4si
> +G2 (v4si r) {
> +    return (r >> 23) | (r << 9);
> +}
> +
> +/*
> +** G3:
> +**   movi?   [vdz]([0-9]+)\.?(?:[0-9]*[bhsd])?, #?0
> +**   xar     z0\.h, z([0-9]+)\.h, z([0-9]+)\.h, #5
> +**      ret
> +*/
> +v8hi
> +G3 (v8hi r) {
> +    return (r >> 5) | (r << 11);
> +}
> +
> +/*
> +** G4:
> +**   movi?   [vdz]([0-9]+)\.?(?:[0-9]*[bhsd])?, #?0
> +**   xar     z0\.b, z([0-9]+)\.b, z([0-9]+)\.b, #6
> +**      ret
> +*/
> +v16qi
> +G4 (v16qi r)
> +{
> +  return (r << 2) | (r >> 6);
> +}
> +
> +/*
> +** G5:
> +**   movi?   [vdz]([0-9]+)\.?(?:[0-9]*[bhsd])?, #?0
> +**   fmov    d[0-9]+, d[0-9]+
> +**   xar     z0\.s, z([0-9]+)\.s, z([0-9]+)\.s, #22
> +**      ret
> +*/
> +v2si
> +G5 (v2si r) {
> +    return (r >> 22) | (r << 10);
> +}
> +
> +/*
> +** G6:
> +**   movi?   [vdz]([0-9]+)\.?(?:[0-9]*[bhsd])?, #?0
> +**   fmov    d[0-9]+, d[0-9]+
> +**   xar     z0\.h, z([0-9]+)\.h, z([0-9]+)\.h, #7
> +**      ret
> +*/
> +v4hi
> +G6 (v4hi r) {
> +    return (r >> 7) | (r << 9);
> +}
> +
> +/*
> +** G7:
> +**   movi?   [vdz]([0-9]+)\.?(?:[0-9]*[bhsd])?, #?0
> +**   fmov    d[0-9]+, d[0-9]+
> +**   xar     z0\.b, z([0-9]+)\.b, z([0-9]+)\.b, #5
> +**      ret
> +*/
> +v8qi
> +G7 (v8qi r)
> +{
> +  return (r << 3) | (r >> 5);
> +}
> +
