extensions

Jeff Law Tue, 14 Jan 2025 21:00:10 -0800


On 1/14/25 7:57 AM, Richard Sandiford wrote:

Jeff Law <jeffreya...@gmail.com> writes:

On 12/30/24 3:02 PM, Richard Sandiford wrote:


So it seems like it's a bit of a mess :(

If we do try to fix combine, I think something like the attached
would fit within the current scheme.  It is a pure shift-for-shift
transformation, avoiding any extensions.

Will think more about it, but wanted to get the above stream of
consciousness out before I finish for the day :)

So your approach in simplify-rtx ended up much cleaner than mine; unless
there's some objection from you, I'd like to go with it.  Obviously I'll
run it through the usual testing cycles first ;-)


Yeah, that's fine with me.  Reading it back, I realised that we should
probably also check for an in-range shift value, fixed below.

Thanks. This required a minor twiddle on one risc-v specific test.Essentially we turned a 32-bit shift into a 64-bit shift, which isgenerally a good thing.


Attached is your patch + the testsuite changes for archival purposes.

jeff

commit cab2e12362287bd50f9abdd7fb5a775138e02d1b
Author: Richard Sandiford <richard.sandif...@arm.com>
Date:   Tue Jan 14 21:51:41 2025 -0700

    [PR rtl-optimization/109592] Simplify nested shifts
    
    > The BZ in question is a failure to recognize a pair of shifts as a sign
    > extension.
    >
    > I originally thought simplify-rtx would be the right framework to
    > address this problem, but fwprop is actually better.  We can write the
    > recognizer much simpler in that framework.
    >
    > fwprop already simplifies nested shifts/extensions to the desired RTL,
    > but it's not considered profitable and we throw away the good work done
    > by fwprop & simplifiers.
    >
    > It's hard to see a scenario where nested shifts or nested extensions
    > that simplify down to a single sign/zero extension isn't a profitable
    > transformation.  So when fwprop has nested shifts/extensions that
    > simplifies to an extension, we consider it profitable.
    >
    > This allow us to simplify the testcase on rv64 with ZBB enabled from a
    > pair of shifts to a single byte or half-word sign extension.
    
    Hmm.  So just to summarise something that was discussed in the PR
    comments, this is a case where combine's expand_compound_operation/
    make_compound_operation wrangler hurts us, because the process isn't
    idempotent, and combine produces two complex instructions:
    
    (insn 6 3 7 2 (set (reg:DI 137 [ _3 ])
            (ashift:DI (reg:DI 139 [ x ])
                (const_int 24 [0x18]))) "foo.c":2:20 305 {ashldi3}
         (expr_list:REG_DEAD (reg:DI 139 [ x ])
            (nil)))
    (insn 12 7 13 2 (set (reg/i:DI 10 a0)
            (sign_extend:DI (ashiftrt:SI (subreg:SI (reg:DI 137 [ _3 ]) 0)
                    (const_int 24 [0x18])))) "foo.c":2:27 321 {ashrsi3_extend}
         (expr_list:REG_DEAD (reg:DI 137 [ _3 ])
            (nil)))
    
    given two simple instructions:
    
    (insn 6 3 7 2 (set (reg:SI 137 [ _3 ])
            (sign_extend:SI (subreg:QI (reg/v:DI 136 [ x ]) 0))) "foo.c":2:20 
533 {*extendqisi2_bitmanip}
         (expr_list:REG_DEAD (reg/v:DI 136 [ x ])
            (nil)))
    (insn 7 6 12 2 (set (reg:DI 138 [ _3 ])
            (sign_extend:DI (reg:SI 137 [ _3 ]))) "foo.c":2:20 discrim 1 133 
{*extendsidi2_internal}
         (expr_list:REG_DEAD (reg:SI 137 [ _3 ])
            (nil)))
    
    If I run with -fdisable-rtl-combine then late_combine1 already does the
    expected transformation.
    
    Although it would be nice to fix combine, that might be difficult.
    If we treat combine as immutable then the options are:
    
    (1) Teach simplify-rtx to simplify combine's output into a single 
sign_extend.
    
    (2) Allow fwprop1 to get in first, before combine has a chance to mess
        things up.
    
    The patch goes for (2).
    
    Is that a fair summary?
    
    Playing devil's advocate, I suppose one advantage of (1) is that it
    would allow the optimisation even if the original rtl looked like
    combine's output.  And fwprop1 doesn't distinguish between cases in
    which the source instruction disappears from cases in which the source
    instruction is kept.  Thus we could transform:
    
      (set (reg:SI R2) (sign_extend:SI (reg:QI R1)))
      (set (reg:DI R3) (sign_extend:DI (reg:SI R2)))
    
    into:
    
      (set (reg:SI R2) (sign_extend:SI (reg:QI R1)))
      (set (reg:DI R3) (sign_extend:DI (reg:QI R1)))
    
    which increases the register pressure between the two instructions
    (since R2 and R1 are both now live).  In general, there could be
    quite a gap between the two instructions.
    
    On the other hand, even in that case, fwprop1 would be parallelising
    the extensions.  And since we're talking about unary operations,
    even two-address targets would allow R1 to be extended without
    tying the source and destination.
    
    Also, it seems relatively unlikely that expand would produce code
    that looks like combine's, since the gimple optimisers should have
    simplified it into conversions.
    
    So initially I was going to agree that it's worth trying in fwprop.  But...
    
    [ commentary on Jeff's original approach dropped. ]
    
    So it seems like it's a bit of a mess 🙁
    
    If we do try to fix combine, I think something like the attached
    would fit within the current scheme.  It is a pure shift-for-shift
    transformation, avoiding any extensions.
    
    Will think more about it, but wanted to get the above stream of
    consciousness out before I finish for the day 🙂
    
            PR rtl-optimization/109592
    gcc/
            * simplify-rtx.cc (simplify_context::simplify_binary_operation_1):
            Simplify nested shifts with subregs.
    
    gcc/testsuite
            * gcc.target/riscv/pr109592.c: New test.
            * gcc.target/riscv/sign-extend-rshift.c: Adjust expected output
    
    Co-authored-by: Jeff Law <j...@ventanamicro.com>

diff --git a/gcc/simplify-rtx.cc b/gcc/simplify-rtx.cc
index dda8fc689e7..c478bd060fc 100644
--- a/gcc/simplify-rtx.cc
+++ b/gcc/simplify-rtx.cc
@@ -4494,6 +4494,60 @@ simplify_context::simplify_binary_operation_1 (rtx_code 
code,
            return simplify_gen_binary (code, mode, op0,
                                        gen_int_shift_amount (mode, val));
        }
+
+      /* Simplify:
+
+          (code:M1
+            (subreg:M1
+              ([al]shiftrt:M2
+                (subreg:M2
+                  (ashift:M1 X C1))
+                C2))
+            C3)
+
+        to:
+
+          (code:M1
+            ([al]shiftrt:M1
+              (ashift:M1 X C1+N)
+              C2+N)
+            C3)
+
+        where M1 is N bits wider than M2.  Optimizing the (subreg:M1 ...)
+        directly would be arithmetically correct, but restricting the
+        simplification to shifts by constants is more conservative,
+        since it is more likely to lead to further simplifications.  */
+      if (is_a<scalar_int_mode> (mode, &int_mode)
+         && paradoxical_subreg_p (op0)
+         && is_a<scalar_int_mode> (GET_MODE (SUBREG_REG (op0)), &inner_mode)
+         && (GET_CODE (SUBREG_REG (op0)) == ASHIFTRT
+             || GET_CODE (SUBREG_REG (op0)) == LSHIFTRT)
+         && CONST_INT_P (op1))
+       {
+         auto xcode = GET_CODE (SUBREG_REG (op0));
+         rtx xop0 = XEXP (SUBREG_REG (op0), 0);
+         rtx xop1 = XEXP (SUBREG_REG (op0), 1);
+         if (SUBREG_P (xop0)
+             && GET_MODE (SUBREG_REG (xop0)) == mode
+             && GET_CODE (SUBREG_REG (xop0)) == ASHIFT
+             && CONST_INT_P (xop1)
+             && UINTVAL (xop1) < GET_MODE_PRECISION (inner_mode))
+           {
+             rtx yop0 = XEXP (SUBREG_REG (xop0), 0);
+             rtx yop1 = XEXP (SUBREG_REG (xop0), 1);
+             if (CONST_INT_P (yop1)
+                 && UINTVAL (yop1) < GET_MODE_PRECISION (inner_mode))
+               {
+                 auto bias = (GET_MODE_BITSIZE (int_mode)
+                              - GET_MODE_BITSIZE (inner_mode));
+                 tem = simplify_gen_binary (ASHIFT, mode, yop0,
+                                            GEN_INT (INTVAL (yop1) + bias));
+                 tem = simplify_gen_binary (xcode, mode, tem,
+                                            GEN_INT (INTVAL (xop1) + bias));
+                 return simplify_gen_binary (code, mode, tem, op1);
+               }
+           }
+       }
       break;
 
     case SS_ASHIFT:
diff --git a/gcc/testsuite/gcc.target/riscv/pr109592.c 
b/gcc/testsuite/gcc.target/riscv/pr109592.c
new file mode 100644
index 00000000000..2b388c4f7a1
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/pr109592.c
@@ -0,0 +1,9 @@
+/* { dg-do compile } */
+/* { dg-options "-march=rv64gc_zbb -mabi=lp64d" } */
+
+int sextb32(int x) { return (x << 24) >> 24; }
+
+/* { dg-final { scan-assembler-times {sext.b} 1 } } */
+/* { dg-final { scan-assembler-not {slli} } } */
+/* { dg-final { scan-assembler-not {srai} } } */
+
diff --git a/gcc/testsuite/gcc.target/riscv/sign-extend-rshift.c 
b/gcc/testsuite/gcc.target/riscv/sign-extend-rshift.c
index 86f40375722..bcb015efe03 100644
--- a/gcc/testsuite/gcc.target/riscv/sign-extend-rshift.c
+++ b/gcc/testsuite/gcc.target/riscv/sign-extend-rshift.c
@@ -84,7 +84,7 @@ SLONG_EXT_SSHORT_RSHIFT_N_SLONG(63)
 
 #if __riscv_xlen == 64
 // Below "slli ((32+16)-N); srai (32+16)" for rv64
-//    or "slli (16-N); sraiw 16" for rv64
+//    or "slli (16-N); srai 16" for rv64
 SINT_EXT_SSHORT_RSHIFT_N_SINT(1)
 SINT_EXT_SSHORT_RSHIFT_N_SINT(7)
 SINT_EXT_SSHORT_RSHIFT_N_SINT(8)
@@ -104,7 +104,7 @@ SINT_EXT_SSHORT_RSHIFT_N_SINT(31)
 
 // Below "slli (16-N); srai 16" for rv32
 // Below "slli ((32+16)-N); srai (32+16)" for rv64
-//    or "slli (16-N); sraiw 16" for rv64
+//    or "slli (16-N); srai 16" for rv64
 SINT_EXT_SSHORT_RSHIFT_N_SLONG(9)
 SINT_EXT_SSHORT_RSHIFT_N_SLONG(15)
 
@@ -119,5 +119,5 @@ SLONG_EXT_SSHORT_RSHIFT_N_SINT(15)
 /* { dg-final { scan-assembler-times "srai\t" 26 { target { rv32 } } } } */
 
 /* { dg-final { scan-assembler-times "slli\t" 44 { target { rv64 } } } } */
-/* { dg-final { scan-assembler-times "srai\t" 51 { target { rv64 } } } } */
-/* { dg-final { scan-assembler-times "sraiw\t" 10 { target { rv64 } } } } */
+/* { dg-final { scan-assembler-times "srai\t" 58 { target { rv64 } } } } */
+/* { dg-final { scan-assembler-times "sraiw\t" 3 { target { rv64 } } } } */

Re: [RFA] [PR rtl-optimization/109592] Improve fwprop's handling of nested shifts/extensions

Reply via email to