Segher: I know you're not officially noted as a maintainer or reviewer for combine.c, but that's something I'd like to change if you're interested in a larger role. In the meantime, any feedback you have would be appreciated.

So the issue mentioned in the BZ is that fairly obvious code sequences that ought to use simple byte moves are expanding into hideous sequences (load, store, a couple of bitwise logicals, maybe a shift or extension thrown in for good measure).

As mentioned in the BZ, one of the issues is that combine is limited in terms of how many insns it will look at. As it turns out that was addressed not terribly long ago and we can now do 4 insn combinations. With just a little work in combine.c we can get the desired code for the first two testcases as well as two of my own.

The first issue is that 4 insn combinations are (reasonably) guarded in such a way as to avoid them when they are unlikely to succeed. We basically look at the operands of the 4 insns and try to guess whether there's a reasonable chance a combination would succeed. If not, no 4 insn combinations are tried.
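
For reference, the existing gate (visible in the second hunk below) is just a cheap screen: roughly speaking, it counts sources that look like good combination fodder (ngood) and shift operations (nshift) across the four insns, and bails out early when neither count reaches two:

if (ngood < 2 && nshift < 2)
  return 0;

The patch works within that framework rather than replacing it.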

So the first part of this patch improves that heuristic. What we see with these byte accesses is a pattern like

load word from memory
masking
bit-ior
store word to memory

That's very consistent. The form of the masking and bit-ior in the middle varies quite a bit, but across all the tests I've looked at the memory operations are consistently in i0 and i3. It's worth noting that the memory load might have a zero (or sign?) extension as well and that the load/store might be using different modes.
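
To make that concrete, the first testcase in the PR (f0 in the new test below) is

x = (x & 0xFFFFFF00) | (unsigned int)c;

which, going into combine, is roughly (the exact insn split varies with target and options):

i0: load the word x from memory
i1: mask it with 0xffffff00
i2: ior in the (zero extended) value of c
i3: store the result back into x

i.e. the memory load and store bracket the value manipulation.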

So when we see that overall structure, we increase the "goodness" of trying to combine i0-i3.

The second change we need is an additional simplification.

If we have
(subreg:M1 (zero_extend:M2 (x)))

where M1 > M2 and both are scalar integer modes, it's advantageous to strip the SUBREG and instead have a wider extension. As a concrete example,

(subreg:SI (zero_extend:HI (X:QI)))

will get turned into

(zero_extend:SI (X:QI))


Now this does have an interesting side effect, namely that the bits between HI and SI now have a defined value whereas they previously had "don't care" values. In theory, we lose a bit of flexibility in that regard, but eliminating the subreg in favor of the wider extension allows the existing simplification machinery to do a better job and lowers the cost of the resulting simplified insn.
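
To put a number on it: if X holds 0xab, then (subreg:SI (zero_extend:HI (X:QI))) has low sixteen bits equal to 0x00ab and bits 16-31 that are "don't care", while (zero_extend:SI (X:QI)) is 0x000000ab with every bit defined. Any consumer that only looks at the low bits sees the same value either way; the only change is that the formerly don't-care bits are now known to be zero.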

The net result is we get a simple movb for the first two testcases.

Getting a movb for the other two testcases in the PR is a bit harder, but possible. The patch to handle those testcases needs further cleanup.


Bootstrapped and regression tested on x86_64-unknown-linux-gnu. Thoughts? OK for the trunk?


        PR target/15184 (partial)
        * combine.c (try_combine): If I0 is a memory load and I3 a store
        to a related address, increase the "goodness" of doing a 4-insn
        combination with I0-I3.
        (combine_simplify_rtx): Given (subreg:M1 (any_extend:M2 (x))) where
        M1 > M2, strip the subreg and use a wider extension.

        PR target/15184 (partial)
        * gcc.target/i386/pr15184-1.c: New test.

diff --git a/gcc/combine.c b/gcc/combine.c
index 5c763b4..2d85cf7 100644
--- a/gcc/combine.c
+++ b/gcc/combine.c
@@ -2620,6 +2620,7 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
       int i;
       int ngood = 0;
       int nshift = 0;
+      rtx set0, set3;
 
       if (!flag_expensive_optimizations)
        return 0;
@@ -2643,6 +2644,24 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
                   || GET_CODE (src) == LSHIFTRT)
            nshift++;
        }
+
+      /* If I0 loads from memory and I3 stores to the same memory, then I1
+        and I2 are likely manipulating its value.  Ideally we'll be able to
+        combine all four insns into a bitfield insertion of some kind.
+
+        Note the source in I0 might be inside a sign/zero extension and the
+        memory modes in I0 and I3 might be different.  So extract the address
+        from the destination of I3 and search for it in the source of I0.
+
+        In the event that there's a match but the source/dest do not actually
+        refer to the same memory, the worst that happens is we try some
+        combinations that we wouldn't have otherwise.  */
+      if ((set0 = single_set (i0))
+         && (set3 = single_set (i3))
+         && MEM_P (SET_DEST (set3))
+         && rtx_referenced_p (XEXP (SET_DEST (set3), 0), SET_SRC (set0)))
+       ngood += 2;
+
       if (ngood < 2 && nshift < 2)
        return 0;
     }
@@ -5663,6 +5682,26 @@ combine_simplify_rtx (rtx x, machine_mode op0_mode, int in_dest,
          return CONST0_RTX (mode);
       }
 
+      /* If we have (subreg:M1 (zero_extend:M2 (x))) or
+        (subreg:M1 (sign_extend:M2 (x))) where M1 is wider
+        than M2, then go ahead and just widen the original extension.
+
+        While the subreg is useful in saying "I don't care about those
+        upper bits", squashing out the subreg results in simpler RTL
+        that is more easily matched.  */
+      if ((GET_CODE (SUBREG_REG (x)) == ZERO_EXTEND
+          || GET_CODE (SUBREG_REG (x)) == SIGN_EXTEND)
+         && SCALAR_INT_MODE_P (GET_MODE (x))
+         && SCALAR_INT_MODE_P (GET_MODE (SUBREG_REG (x)))
+         && (GET_MODE_PRECISION (GET_MODE (x))
+             > GET_MODE_PRECISION (GET_MODE (SUBREG_REG (x)))))
+       {
+         if (GET_CODE (SUBREG_REG (x)) == ZERO_EXTEND)
+           return gen_rtx_ZERO_EXTEND (GET_MODE (x), XEXP (SUBREG_REG (x), 0));
+         else
+           return gen_rtx_SIGN_EXTEND (GET_MODE (x), XEXP (SUBREG_REG (x), 0));
+       }
+
       /* Don't change the mode of the MEM if that would change the meaning
         of the address.  */
       if (MEM_P (SUBREG_REG (x))
diff --git a/gcc/testsuite/gcc.target/i386/pr15184-1.c b/gcc/testsuite/gcc.target/i386/pr15184-1.c
new file mode 100644
index 0000000..9eb544c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr15184-1.c
@@ -0,0 +1,33 @@
+/* The first two tests from PR 15184, plus two additional ones.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -m32 -march=pentiumpro" } */
+
+#define regparm __attribute__((__regparm__(3)))
+
+extern unsigned int x;
+extern unsigned short y;
+
+void regparm f0(unsigned char c)
+{
+  x = (x & 0xFFFFFF00) | (unsigned int)c;
+}
+
+void regparm f1(unsigned char c)
+{
+  x = (x & 0xFFFF00FF) | ((unsigned int)c << 8);
+}
+
+void regparm f2(unsigned char c)
+{
+  x = (x & 0xFF00FFFF) | ((unsigned int)c << 16);
+}
+void regparm f3(unsigned char c)
+{
+  x = (x & 0x00FFFFFF) | ((unsigned int)c << 24);
+}
+
+
+/* Each function should compile down to a byte move from
+   the input register into x, possibly at an offset within x.  */
+/* { dg-final { scan-assembler-times "\tmovb\t%al, x" 4 } } */
+
