Segher: I know you're not officially noted as a maintainer or reviewer for combine.c, but that's something I'd like to change if you're interested in a larger role. In the meantime, any feedback you have would be appreciated.

So the issue mentioned in the BZ is that fairly obvious code sequences that ought to use simple byte moves are expanding into hideous sequences (load, store, a couple of bitwise logicals, maybe a shift or extension thrown in for good measure).

As mentioned in the BZ, one of the issues is that combine is limited in terms of how many insns it will look at. As it turns out that was addressed not terribly long ago and we can now do 4 insn combinations. With just a little work in combine.c we can get the desired code for the first two testcases as well as two of my own.

The first issue is that 4 insn combinations are (reasonably) guarded in such a way as to avoid them when they are unlikely to succeed. We basically look at the operands of the 4 insns and try to guess whether there's a reasonable chance a combination would succeed. If not, no 4 insn combinations are tried.
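
For reference, the existing gate (visible in the second hunk below) is just a cheap screen: roughly speaking, it counts sources that look like good combination fodder (ngood) and shift operations (nshift) across the four insns, and bails out early when neither count reaches two:

if (ngood < 2 && nshift < 2)
  return 0;

The patch works within that framework rather than replacing it.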

So the first part of this patch improves that heuristic. What we see with these byte accesses is a pattern like

load word from memory
masking
bit-ior
store word to memory

That's very consistent. The form of the masking and bit-ior in the middle varies quite a bit, but across all the tests I've looked at the memory operations are consistently in i0 and i3. It's worth noting that the memory load might have a zero (or sign?) extension as well and that the load/store might be using different modes.
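
To make that concrete, the first testcase in the PR (f0 in the new test below) is

x = (x & 0xFFFFFF00) | (unsigned int)c;

which, going into combine, is roughly (the exact insn split varies with target and options):

i0: load the word x from memory
i1: mask it with 0xffffff00
i2: ior in the (zero extended) value of c
i3: store the result back into x

i.e. the memory load and store bracket the value manipulation.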

So when we see that overall structure, we increase the "goodness" of trying to combine i0-i3.

The second change we need is an additional simplification.

If we have
(subreg:M1 (zero_extend:M2 (x)))

where M1 > M2 and both are scalar integer modes, it's advantageous to strip the SUBREG and instead have a wider extension. As a concrete example,

(subreg:SI (zero_extend:HI (X:QI)))

will get turned into

(zero_extend:SI (X:QI))


Now this does have an interesting side effect, namely that the bits between HI and SI now have a defined value whereas they previously had "don't care" values. In theory, we lose a bit of flexibility in that regard, but eliminating the subreg in favor of the wider extension allows the existing simplification machinery to do a better job and lowers the cost of the resulting simplified insn.
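
To put a number on it: if X holds 0xab, then (subreg:SI (zero_extend:HI (X:QI))) has low sixteen bits equal to 0x00ab and bits 16-31 that are "don't care", while (zero_extend:SI (X:QI)) is 0x000000ab with every bit defined. Any consumer that only looks at the low bits sees the same value either way; the only change is that the formerly don't-care bits are now known to be zero.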

The net result is we get a simple movb for the first two testcases.

Getting a movb for the other two testcases in the PR is a bit harder, but possible. The patch to handle those testcases needs further cleanup.


Bootstrapped and regression tested on x86_64-unknown-linux-gnu. Thoughts? OK for the trunk?


        PR target/15184 (partial)
        * combine.c (try_combine): If I0 is a memory load and I3 a store
        to a related address, increase the "goodness" of doing a 4-insn
        combination with I0-I3.
        (combine_simplify_rtx): Given (subreg:M1 (any_extend:M2 (x))) where
        M1 > M2, strip the subreg and use a wider extension.

        PR target/15184 (partial)
        * gcc.target/i386/pr15184-1.c: New test.

diff --git a/gcc/combine.c b/gcc/combine.c
index 5c763b4..2d85cf7 100644
--- a/gcc/combine.c
+++ b/gcc/combine.c
@@ -2620,6 +2620,7 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
       int i;
       int ngood = 0;
       int nshift = 0;
+      rtx set0, set3;
 
       if (!flag_expensive_optimizations)
        return 0;
@@ -2643,6 +2644,24 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
                   || GET_CODE (src) == LSHIFTRT)
            nshift++;
        }
+
+      /* If I0 loads from memory and I3 stores to the same memory, then I1
+        and I2 are likely manipulating its value.  Ideally we'll be able to
+        combine all four insns into a bitfield insertion of some kind.
+
+        Note the source in I0 might be inside a sign/zero extension and the
+        memory modes in I0 and I3 might be different.  So extract the address
+        from the destination of I3 and search for it in the source of I0.
+
+        In the event that there's a match but the source/dest do not actually
+        refer to the same memory, the worst that happens is we try some
+        combinations that we wouldn't have otherwise.  */
+      if ((set0 = single_set (i0))
+         && (set3 = single_set (i3))
+         && MEM_P (SET_DEST (set3))
+         && rtx_referenced_p (XEXP (SET_DEST (set3), 0), SET_SRC (set0)))
+       ngood += 2;
+
       if (ngood < 2 && nshift < 2)
        return 0;
     }
@@ -5663,6 +5682,26 @@ combine_simplify_rtx (rtx x, machine_mode op0_mode, int in_dest,
          return CONST0_RTX (mode);
       }
 
+      /* If we have (subreg:M1 (zero_extend:M2 (x))) or
+        (subreg:M1 (sign_extend:M2 (x))) where M1 is wider
+        than M2, then go ahead and just widen the original extension.
+
+        While the subreg is useful in saying "I don't care about those
+        upper bits", squashing out the subreg results in simpler RTL
+        that is more easily matched.  */
+      if ((GET_CODE (SUBREG_REG (x)) == ZERO_EXTEND
+          || GET_CODE (SUBREG_REG (x)) == SIGN_EXTEND)
+         && SCALAR_INT_MODE_P (GET_MODE (x))
+         && SCALAR_INT_MODE_P (GET_MODE (SUBREG_REG (x)))
+         && (GET_MODE_PRECISION (GET_MODE (x))
+             > GET_MODE_PRECISION (GET_MODE (SUBREG_REG (x)))))
+       {
+         if (GET_CODE (SUBREG_REG (x)) == ZERO_EXTEND)
+           return gen_rtx_ZERO_EXTEND (GET_MODE (x), XEXP (SUBREG_REG (x), 0));
+         else
+           return gen_rtx_SIGN_EXTEND (GET_MODE (x), XEXP (SUBREG_REG (x), 0));
+       }
+
       /* Don't change the mode of the MEM if that would change the meaning
         of the address.  */
       if (MEM_P (SUBREG_REG (x))
diff --git a/gcc/testsuite/gcc.target/i386/pr15184-1.c b/gcc/testsuite/gcc.target/i386/pr15184-1.c
new file mode 100644
index 0000000..9eb544c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr15184-1.c
@@ -0,0 +1,33 @@
+/* The first two tests from PR 15184, plus two additional ones.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -m32 -march=pentiumpro" } */
+
+#define regparm __attribute__((__regparm__(3)))
+
+extern unsigned int x;
+extern unsigned short y;
+
+void regparm f0(unsigned char c)
+{
+  x = (x & 0xFFFFFF00) | (unsigned int)c;
+}
+
+void regparm f1(unsigned char c)
+{
+  x = (x & 0xFFFF00FF) | ((unsigned int)c << 8);
+}
+
+void regparm f2(unsigned char c)
+{
+  x = (x & 0xFF00FFFF) | ((unsigned int)c << 16);
+}
+void regparm f3(unsigned char c)
+{
+  x = (x & 0x00FFFFFF) | ((unsigned int)c << 24);
+}
+
+
+/* Each function should compile down to a byte move from
+   the input register into x, possibly at an offset within x.  */
+/* { dg-final { scan-assembler-times "\tmovb\t%al, x" 4 } } */
+
