Jeff Law wrote:
On 12/04/13 09:06, Tejas Belagod wrote:
Richard Sandiford wrote:
Tejas Belagod <tbela...@arm.com> writes:
Richard Sandiford wrote:
Tejas Belagod <tbela...@arm.com> writes:
The problem is that one reg rtx can span several hard registers.
E.g. (reg:V4SI 32) might represent one 64-bit register (no. 32),
but it might instead represent two 32-bit registers (nos. 32 and 33).
Obviously the latter's not very likely for vectors this small,
but more likely for larger ones (including on NEON IIRC).

So if we had 2 32-bit registers being treated as a V4HI, it would be:

   <--32--><--33-->
   msb          lsb
   0000111122223333
   VVVVVVVV
   00001111
   msb  lsb
   <--32-->

for big endian and:

   <--33--><--32-->
   msb          lsb
   3333222211110000
           VVVVVVVV
           11110000
           msb  lsb
           <--32-->

for little endian.
Ah, ok, that makes things clearer. Thanks for that.

I can't find any helper function that figures out if we're writing
partial or
full result regs. Would something like

     REGNO (src) == REGNO (dst) &&
     HARD_REGNO_NREGS (src) == HARD_REGNO_NREGS (dst) == 1

be a sane check for partial result regs?
Yeah, that should work.  I think a more general alternative would be:

  simplify_subreg_regno (REGNO (src), GET_MODE (src),
                         offset, GET_MODE (dst)) == (int) REGNO (dst)

where:

  offset = GET_MODE_UNIT_SIZE (GET_MODE (src)) * INTVAL (XVECEXP (sel, 0, 0))

That offset is the byte offset of the first selected element from the
start of a vector in memory, which is also the way that SUBREG_BYTEs
are counted.  For little-endian it gives the offset of the lsb of the
slice, while for big-endian it gives the offset of the msb (which is
also how SUBREG_BYTEs work).

The simplify_subreg_regno should cope with both single-register vectors
and multi-register vectors.
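For concreteness, a tiny stand-alone sketch of that arithmetic (the unit and register sizes are assumptions standing in for the GET_MODE_* queries, and the register calculation ignores the representability checks that simplify_subreg_regno also performs):

#include <assert.h>

int
main (void)
{
  /* V4SF (4-byte units), first selected element 3: the offset is the
     element's byte position in the in-memory vector.  */
  int offset_sf = 4 * 3;
  assert (offset_sf == 12);

  /* V4HI (2-byte units) spread over 4-byte regs 32 and 33 as in the
     diagrams above, first selected element 2: offset 4 falls in the
     second hard register, so the set can only be a no-op if the
     destination is reg 33.  */
  int offset_hi = 2 * 2;
  int regno = 32 + offset_hi / 4;
  assert (regno == 33);
  return 0;
}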
Sorry for the delayed response to this.

Thanks for the tip. Here's an improved patch that uses the
simplify_subreg_regno () approach to eliminate redundant moves.
Regarding the test case, I failed to get the ppc back-end to generate
the RTL pattern that this patch checks for. I can easily write a test
case for aarch64 (big and little endian) along these lines:

typedef float float32x4_t __attribute__ ((__vector_size__ (16)));

float foo_be (float32x4_t x)
{
   return x[3];
}

float foo_le (float32x4_t x)
{
   return x[0];
}

where I know that the vector indexing will generate a vec_select with
the same src and dst regs that can be optimized away, and hence tests
the optimization. But I'm struggling to write a test case for which
the ppc altivec back-end will generate such a vec_select. I see that
altivec does not define vec_extract, so simple indexing like this
seems to go via memory. Also, I don't know enough about the ppc PCS or
architecture to write a test that checks for this optimization
opportunity on the same src and dst hard registers. Any hints?
Me neither, sorry.

FWIW, the MIPS tests:

  typedef float float32x2_t __attribute__ ((__vector_size__ (8)));
  void bar (float);
  void foo_be (float32x2_t x) { bar (x[1]); }
  void foo_le (float32x2_t x) { bar (x[0]); }

also exercise it, but I don't think they add anything over the aarch64
versions.  I can add them to the testsuite anyway if it helps though.

diff --git a/gcc/rtlanal.c b/gcc/rtlanal.c
index 0cd0c7e..ca25ce5 100644
--- a/gcc/rtlanal.c
+++ b/gcc/rtlanal.c
@@ -1180,6 +1180,22 @@ set_noop_p (const_rtx set)
       dst = SUBREG_REG (dst);
     }

+  /* It is a NOOP if destination overlaps with selected src vector
+     elements.  */
+  if (GET_CODE (src) == VEC_SELECT
+      && REG_P (XEXP (src, 0)) && REG_P (dst)
+      && HARD_REGISTER_P (XEXP (src, 0))
+      && HARD_REGISTER_P (dst))
+    {
+      rtx par = XEXP (src, 1);
+      rtx src0 = XEXP (src, 0);
+      HOST_WIDE_INT offset =
+    GET_MODE_UNIT_SIZE (GET_MODE (src0)) * INTVAL (XVECEXP (par, 0, 0));
+
+      return simplify_subreg_regno (REGNO (src0), GET_MODE (src0),
+                    offset, GET_MODE (dst)) == (int)REGNO (dst);
+    }
+
Since this also (correctly) triggers for vector results, we need to keep
the check for consecutive indices that you had originally.  (It's always
the first index that should be used for the simplify_subreg_regno
though.)

Looks good to me otherwise, thanks.
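A minimal stand-alone sketch of the combined condition being described here, i.e. consecutive lane indices plus the first-index offset check (not the actual patch; the function name, the sizes, and the register arithmetic standing in for simplify_subreg_regno are all illustrative):

#include <stdbool.h>
#include <stdio.h>

/* Return true if selecting lanes SEL[0..NSEL-1] (UNIT_SIZE bytes each)
   from a vector starting in hard register SRC_REGNO, and writing the
   result to DST_REGNO, would be a no-op: the lanes must be consecutive
   and the first lane's byte offset must land on DST_REGNO.  */
static bool
vec_select_noop_p (int src_regno, int dst_regno, int unit_size,
                   int reg_size, const int *sel, int nsel)
{
  /* Keep the consecutive-index check; only the first index feeds the
     offset calculation.  */
  for (int i = 1; i < nsel; i++)
    if (sel[i] != sel[0] + i)
      return false;

  int offset = unit_size * sel[0];
  /* Crude stand-in for simplify_subreg_regno: which hard register does
     the first selected byte fall in?  (The real function also checks
     that such a subreg is representable at all.)  */
  return src_regno + offset / reg_size == dst_regno;
}

int
main (void)
{
  int lanes23[] = { 2, 3 };   /* the two lanes held in reg 33 above */
  int lanes13[] = { 1, 3 };   /* not consecutive */
  printf ("%d\n", vec_select_noop_p (32, 33, 2, 4, lanes23, 2));  /* 1 */
  printf ("%d\n", vec_select_noop_p (32, 33, 2, 4, lanes13, 2));  /* 0 */
  return 0;
}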
Thanks Richard. Here is a revised patch. Sorry about the delay - I was
investigating to make sure an LRA ICE I was seeing on aarch64 was
unrelated to this patch. I've added a test case that I expect to pass
for aarch64. I've also added the tests that you suggested for MIPS,
but haven't added checks for that target because I'm not sure what
optimizations happen on MIPS.

OK for trunk?

Thanks,
Tejas.

2013-12-04  Tejas Belagod  <tejas.bela...@arm.com>

gcc/
     * rtlanal.c (set_noop_p): Return nonzero in case of redundant
     vec_select for overlapping register lanes.

testsuite/
     * gcc.dg/vect/vect-nop-move.c: New.
Per HJ's request, please test vect-nop-move on x86/x86_64 and, if the redundant move is properly eliminated, enable the test on those targets (i?86-*-* x86_64-*-*).

Approved with that change for the trunk.

Thanks Jeff.

Now that Kirill's looking at why this doesn't work for x86, could I check this in without enabling vect-nop-move.c for targets (i?86-*-* x86_64-*-*)? If not, I'm happy to wait.

Thanks,
Tejas.
