Tejas Belagod <tbela...@arm.com> writes: > Richard Sandiford wrote: >> Tejas Belagod <tbela...@arm.com> writes: >>> Richard Sandiford wrote: >>>> Tejas Belagod <tbela...@arm.com> writes: >>>>>> The problem is that one reg rtx can span several hard registers. >>>>>> E.g. (reg:V4SI 32) might represent one 64-bit register (no. 32), >>>>>> but it might instead represent two 32-bit registers (nos. 32 and 33). >>>>>> Obviously the latter's not very likely for vectors this small, >>>>>> but more likely for larger ones (including on NEON IIRC). >>>>>> >>>>>> So if we had 2 32-bit registers being treated as a V4HI, it would be: >>>>>> >>>>>> <--32--><--33--> >>>>>> msb lsb >>>>>> 0000111122223333 >>>>>> VVVVVVVV >>>>>> 00001111 >>>>>> msb lsb >>>>>> <--32--> >>>>>> >>>>>> for big endian and: >>>>>> >>>>>> <--33--><--32--> >>>>>> msb lsb >>>>>> 3333222211110000 >>>>>> VVVVVVVV >>>>>> 11110000 >>>>>> msb lsb >>>>>> <--32--> >>>>>> >>>>>> for little endian. >>>>> Ah, ok, that makes things clearer. Thanks for that. >>>>> >>>>> I can't find any helper function that figures out if we're writing >>>>> partial or >>>>> full result regs. Would something like >>>>> >>>>> REGNO (src) == REGNO (dst) && >>>>> HARD_REGNO_NREGS (src) == HARD_REGNO_NREGS (dst) == 1 >>>>> >>>>> be a sane check for partial result regs? >>>> Yeah, that should work. I think a more general alternative would be: >>>> >>>> simplify_subreg_regno (REGNO (src), GET_MODE (src), >>>> offset, GET_MODE (dst)) == (int) REGNO (dst) >>>> >>>> where: >>>> >>>> offset = GET_MODE_UNIT_SIZE (GET_MODE (src)) * INTVAL (XVECEXP (sel, 0)) >>>> >>>> That offset is the byte offset of the first selected element from the >>>> start of a vector in memory, which is also the way that SUBREG_BYTEs >>>> are counted. For little-endian it gives the offset of the lsb of the >>>> slice, while for big-endian it gives the offset of the msb (which is >>>> also how SUBREG_BYTEs work). >>>> >>>> The simplify_subreg_regno should cope with both single-register vectors >>>> and multi-register vectors. >>> Sorry for the delayed response to this. >>> >>> Thanks for the tip. Here's an improved patch that implements the >>> simplify_sureg_regno () method of eliminating redundant moves. Regarding >>> the >>> test case, I failed to get the ppc back-end to generate RTL pattern >>> that this >>> patch checks for. I can easily write a test case for aarch64(big and little >>> endian) on these lines >>> >>> typedef float float32x4_t __attribute__ ((__vector_size__ (16))); >>> >>> float foo_be (float32x4_t x) >>> { >>> return x[3]; >>> } >>> >>> float foo_le (float32x4_t x) >>> { >>> return x[0]; >>> } >>> >>> where I know that the vector indexing will generate a vec_select on >>> the same src and dst regs that could be optimized away and hence test >>> it. But I'm struggling to get a test case that the ppc altivec >>> back-end will generate such a vec_select for. I see that altivec does >>> not define vec_extract, so a simple indexing like this seems to happen >>> via memory. Also, I don't know enough about the ppc PCS or >>> architecture to write a test that will check for this optimization >>> opportunity on same src and dst hard-registers. Any hints? >> >> Me neither, sorry. >> >> FWIW, the MIPS tests: >> >> typedef float float32x2_t __attribute__ ((__vector_size__ (8))); >> void bar (float); >> void foo_be (float32x2_t x) { bar (x[1]); } >> void foo_le (float32x2_t x) { bar (x[0]); } >> >> also exercise it, but I don't think they add anything over the aarch64 >> versions. I can add them to the testsuite anyway if it helps though. >> >>> diff --git a/gcc/rtlanal.c b/gcc/rtlanal.c >>> index 0cd0c7e..ca25ce5 100644 >>> --- a/gcc/rtlanal.c >>> +++ b/gcc/rtlanal.c >>> @@ -1180,6 +1180,22 @@ set_noop_p (const_rtx set) >>> dst = SUBREG_REG (dst); >>> } >>> >>> + /* It is a NOOP if destination overlaps with selected src vector >>> + elements. */ >>> + if (GET_CODE (src) == VEC_SELECT >>> + && REG_P (XEXP (src, 0)) && REG_P (dst) >>> + && HARD_REGISTER_P (XEXP (src, 0)) >>> + && HARD_REGISTER_P (dst)) >>> + { >>> + rtx par = XEXP (src, 1); >>> + rtx src0 = XEXP (src, 0); >>> + HOST_WIDE_INT offset = >>> + GET_MODE_UNIT_SIZE (GET_MODE (src0)) * INTVAL (XVECEXP (par, 0, 0)); >>> + >>> + return simplify_subreg_regno (REGNO (src0), GET_MODE (src0), >>> + offset, GET_MODE (dst)) == (int)REGNO (dst); >>> + } >>> + >> >> Since this also (correctly) triggers for vector results, we need to keep >> the check for consecutive indices that you had originally. (It's always >> the first index that should be used for the simplify_subreg_regno though.) >> >> Looks good to me otherwise, thanks. > > Thanks Richard. Here is a revised patch. Sorry about the delay - I was > investigating to make sure an LRA ICE I was seeing on aarch64 was > unrelated to this patch. I've added a test case that I expect to pass > for aarch64. I've also added the tests that you suggested for MIPS, > but haven't checked for the target because I'm not sure what > optimizations happen on MIPS.
Thanks, looks good to me, but I can't approve it. Just one minor formatting nit: > + return simplify_subreg_regno (REGNO (src0), GET_MODE (src0), > + offset, GET_MODE (dst)) == (int)REGNO (dst); space after "(int)". Richard