Kyrylo Tkachov <ktkac...@nvidia.com> writes:
> Hi all,
>
> We can make use of the integrated rotate step of the XAR instruction
> to implement most vector integer rotates, as long as we zero out one
> of the input registers for it.  This allows for a lower-latency sequence
> than the fallback SHL+USRA, especially when we can hoist the zeroing
> operation away from loops and hot parts.
> We can also use it for 64-bit vectors as long as we zero the top half
> of the vector to be rotated.  That should still be preferable to the
> default sequence.
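For reference, XAR rotates the exclusive-OR of its two inputs right by
the immediate, so zeroing one operand leaves a plain rotate right.  A
minimal scalar model of one 32-bit lane, purely illustrative and not
part of the patch:

#include <stdint.h>

/* Illustrative model of one 32-bit XAR lane: rotate (A ^ B) right by
   IMM.  With B == 0 this degenerates into a plain rotate right of A,
   which is what the sequence above relies on.  */
static inline uint32_t
xar32_model (uint32_t a, uint32_t b, unsigned imm)
{
  uint32_t x = a ^ b;
  imm &= 31;
  return imm ? (x >> imm) | (x << (32 - imm)) : x;
}

/* xar32_model (r, 0, 23) == (r >> 23) | (r << 9), i.e. the per-lane
   operation in the quoted G1 below.  */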
Is the zeroing necessary?  We don't expect/require that 64-bit vector
modes are maintained in zero-extended form, or that 64-bit ops act as
strict_lowparts, so it should be OK to take a paradoxical subreg.

Or we could just extend the patterns to 64-bit modes, to avoid the
punning.

> With this patch we can generate for the input:
>
> v4si
> G1 (v4si r)
> {
>     return (r >> 23) | (r << 9);
> }
>
> v8qi
> G2 (v8qi r)
> {
>     return (r << 3) | (r >> 5);
> }
>
> the assembly for +sve2:
>
> G1:
>         movi    v31.4s, 0
>         xar     z0.s, z0.s, z31.s, #23
>         ret
>
> G2:
>         movi    v31.4s, 0
>         fmov    d0, d0
>         xar     z0.b, z0.b, z31.b, #5
>         ret
>
> instead of the current:
>
> G1:
>         shl     v31.4s, v0.4s, 9
>         usra    v31.4s, v0.4s, 23
>         mov     v0.16b, v31.16b
>         ret
>
> G2:
>         shl     v31.8b, v0.8b, 3
>         usra    v31.8b, v0.8b, 5
>         mov     v0.8b, v31.8b
>         ret
>
> Bootstrapped and tested on aarch64-none-linux-gnu.
>
> Signed-off-by: Kyrylo Tkachov <ktkac...@nvidia.com>
>
> gcc/
>
>     * config/aarch64/aarch64.cc (aarch64_emit_opt_vec_rotate): Add
>     generation of XAR sequences when possible.
>
> gcc/testsuite/
>
>     * gcc.target/aarch64/rotate_xar_1.c: New test.
> [...]
> +/*
> +** G1:
> +**  movi?   [vdz]([0-9]+)\.?(?:[0-9]*[bhsd])?, #?0
> +**  xar     v0\.2d, v([0-9]+)\.2d, v([0-9]+)\.2d, 39

FWIW, the (...) captures aren't necessary, since we never use
backslash references to them later (a capture-free version of this
body is sketched after the quoted tests below).

Thanks,
Richard

> +**  ret
> +*/
> +v2di
> +G1 (v2di r) {
> +    return (r >> 39) | (r << 25);
> +}
> +
> +/*
> +** G2:
> +**  movi?   [vdz]([0-9]+)\.?(?:[0-9]*[bhsd])?, #?0
> +**  xar     z0\.s, z([0-9]+)\.s, z([0-9]+)\.s, #23
> +**  ret
> +*/
> +v4si
> +G2 (v4si r) {
> +    return (r >> 23) | (r << 9);
> +}
> +
> +/*
> +** G3:
> +**  movi?   [vdz]([0-9]+)\.?(?:[0-9]*[bhsd])?, #?0
> +**  xar     z0\.h, z([0-9]+)\.h, z([0-9]+)\.h, #5
> +**  ret
> +*/
> +v8hi
> +G3 (v8hi r) {
> +    return (r >> 5) | (r << 11);
> +}
> +
> +/*
> +** G4:
> +**  movi?   [vdz]([0-9]+)\.?(?:[0-9]*[bhsd])?, #?0
> +**  xar     z0\.b, z([0-9]+)\.b, z([0-9]+)\.b, #6
> +**  ret
> +*/
> +v16qi
> +G4 (v16qi r)
> +{
> +    return (r << 2) | (r >> 6);
> +}
> +
> +/*
> +** G5:
> +**  movi?   [vdz]([0-9]+)\.?(?:[0-9]*[bhsd])?, #?0
> +**  fmov    d[0-9]+, d[0-9]+
> +**  xar     z0\.s, z([0-9]+)\.s, z([0-9]+)\.s, #22
> +**  ret
> +*/
> +v2si
> +G5 (v2si r) {
> +    return (r >> 22) | (r << 10);
> +}
> +
> +/*
> +** G6:
> +**  movi?   [vdz]([0-9]+)\.?(?:[0-9]*[bhsd])?, #?0
> +**  fmov    d[0-9]+, d[0-9]+
> +**  xar     z0\.h, z([0-9]+)\.h, z([0-9]+)\.h, #7
> +**  ret
> +*/
> +v4hi
> +G6 (v4hi r) {
> +    return (r >> 7) | (r << 9);
> +}
> +
> +/*
> +** G7:
> +**  movi?   [vdz]([0-9]+)\.?(?:[0-9]*[bhsd])?, #?0
> +**  fmov    d[0-9]+, d[0-9]+
> +**  xar     z0\.b, z([0-9]+)\.b, z([0-9]+)\.b, #5
> +**  ret
> +*/
> +v8qi
> +G7 (v8qi r)
> +{
> +    return (r << 3) | (r >> 5);
> +}
> +
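To illustrate the captures point: the first quoted body could be
written without the capturing groups, e.g. something like the
following (keeping the non-capturing (?:...) group, which only
controls grouping, not a back-reference):

/*
** G1:
**  movi?   [vdz][0-9]+\.?(?:[0-9]*[bhsd])?, #?0
**  xar     v0\.2d, v[0-9]+\.2d, v[0-9]+\.2d, 39
**  ret
*/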