[gcc r15-7208] [RISC-V][PR target/116256] Improve handling of single bit constants

Jeff Law via Gcc-cvs Sat, 25 Jan 2025 08:44:55 -0800

https://gcc.gnu.org/g:e5990a6ce611f522b8f48c2b469983da19d39777


commit r15-7208-ge5990a6ce611f522b8f48c2b469983da19d39777
Author: Jeff Law <j...@ventanamicro.com>
Date:   Sat Jan 25 09:42:19 2025 -0700

    [RISC-V][PR target/116256] Improve handling of single bit constants
    
    So under the umbrella of pr116256 (P3 regression) I've been exploring 
removal
    of the mvconst_internal pattern.   Not surprisingly, that's going to cause 
all
    kinds of undesirable fallout.  While I can kind of see a path forward for 
that
    work, it's going to require some combine work that I don't think we want to
    tackle in the context of gcc-15.
    
    Essentially without mvconst_internal we'll have fully exposed constant
    synthesis prior to combine.  Remember that combine has limits on what
    combinations it will perform based on how many instructions are in the 
source
    sequence.  If we need 2+ instructions to synthesize the constant, those eat
    into our budget.
    
    In a world without mvconst_internal we'd need to either improve combine to
    handle 5 insns cases (which do show up in the testsuite) or we need to
    significantly improve how combine handles REG_EQUAL notes.  5 insn 
combinations
    sound like insanity to me.  So I'd tend to lean towards the latter, though
    that's going to need some refactoring and diving into note redistribution
    (ugh!).
    
    In the mean time we can start limiting mvconst_internal.  For the remaining
    case in pr116256 we have this code in combine:
    
    > (insn 8 5 10 2 (set (reg:V2048HF 138 [ _5 ])
    >         (vec_duplicate:V2048HF (reg:HF 142 [ x ]))) "j.c":152:11 3712 
{*vec_duplicatev2048hf}
    >      (expr_list:REG_DEAD (reg:HF 142 [ x ])
    >         (nil)))
    > (insn 10 8 11 2 (set (reg:DI 139)
    >         (const_int 2048 [0x800])) "j.c":152:11 275 {*mvconst_internal}
    >      (nil))      (insn 11 10 0 2 (set (mem:V2048HF (reg/f:DI 141 [ in ]) 
[1 MEM <vector(2048) _Float16> [(_Float16 *)in_7(D)]+0 S4096 A128])
    >         (if_then_else:V2048HF (unspec:V2048BI [
    >                     (const_vector:V2048BI [
    >                             (const_int 1 [0x1]) repeated x2048
    >                         ])
    >                     (reg:DI 139)
    >                     (const_int 2 [0x2]) repeated x3
    >                     (reg:SI 66 vl)
    >                     (reg:SI 67 vtype)
    >                 ] UNSPEC_VPREDICATE)
    >             (reg:V2048HF 138 [ _5 ])
    >             (unspec:V2048HF [
    >                     (reg:DI 0 zero)
    >                 ] UNSPEC_VUNDEF))) "j.c":152:11 3843 {*pred_movv2048hf}
    >      (expr_list:REG_DEAD (reg/f:DI 141 [ in ])
    >         (expr_list:REG_DEAD (reg:DI 0 zero)
    >             (expr_list:REG_DEAD (reg:SI 66 vl)
    >                 (expr_list:REG_DEAD (reg:SI 67 vtype)
    >                     (expr_list:REG_DEAD (reg:V2048HF 138 [ _5 ])
    >                         (expr_list:REG_DEAD (reg:DI 139)
    >                             (nil))))))))
    
    Note a couple things.  First insn 8 will be split shortly after combine and
    will need the constant 2048.  But that's obviously exposed  late. Second (of
    course) is the mvconst_internal pattern at insn 10.  After split1 we'll 
have:
    
    > (insn 16 5 17 2 (set (reg:DI 144)         (const_int 4096 [0x1000])) 
"j.c":152:11 -1
    >      (nil))
    > (insn 17 16 18 2 (set (reg:DI 143)
    >         (plus:DI (reg:DI 144)
    >             (const_int -2048 [0xfffffffffffff800]))) "j.c":152:11 -1
    >      (expr_list:REG_EQUAL (const_int 2048 [0x800])
    >         (nil)))
    > (insn 18 17 19 2 (set (reg:V2048HF 138 [ _5 ])
    >         (if_then_else:V2048HF (unspec:V2048BI [                     
(const_vector:V2048BI [
    >                             (const_int 1 [0x1]) repeated x2048
    >                         ])
    >                     (reg:DI 143)
    >                     (const_int 2 [0x2]) repeated x3
    >                     (reg:SI 66 vl)
    >                     (reg:SI 67 vtype)
    >                 ] UNSPEC_VPREDICATE)
    >             (vec_duplicate:V2048HF (reg:HF 142 [ x ]))
    >             (unspec:V2048HF [                     (reg:DI 0 zero)
    >                 ] UNSPEC_VUNDEF))) "j.c":152:11 -1
    >      (nil))
    > (insn 19 18 20 2 (set (reg:DI 145)
    >         (const_int 4096 [0x1000])) "j.c":152:11 -1
    >      (nil))
    > (insn 20 19 11 2 (set (reg:DI 139)
    >         (plus:DI (reg:DI 145)
    >             (const_int -2048 [0xfffffffffffff800]))) "j.c":152:11 -1
    >      (expr_list:REG_EQUAL (const_int 2048 [0x800])
    >         (nil)))
    > (insn 11 20 0 2 (set (mem:V2048HF (reg/f:DI 141 [ in ]) [1 MEM 
<vector(2048) _Float16> [(_Float16 *)in_7(D)]+0 S4096 A128])
    >         (if_then_else:V2048HF (unspec:V2048BI [
    >                     (const_vector:V2048BI [
    >                             (const_int 1 [0x1]) repeated x2048
    >                         ])
    >                     (reg:DI 139)                     (const_int 2 [0x2]) 
repeated x3
    >                     (reg:SI 66 vl)
    >                     (reg:SI 67 vtype)
    >                 ] UNSPEC_VPREDICATE)
    >             (reg:V2048HF 138 [ _5 ])
    >             (unspec:V2048HF [                     (reg:DI 0 zero)
    >                 ] UNSPEC_VUNDEF))) "j.c":152:11 3843 {*pred_movv2048hf}
    >      (expr_list:REG_DEAD (reg/f:DI 141 [ in ])
    >         (expr_list:REG_DEAD (reg:DI 0 zero)             
(expr_list:REG_DEAD (reg:SI 66 vl)
    >                 (expr_list:REG_DEAD (reg:SI 67 vtype)
    >                     (expr_list:REG_DEAD (reg:V2048HF 138 [ _5 ])
    >                         (expr_list:REG_DEAD (reg:DI 139)
    >                             (nil))))))))
    Note the synthesis of 2048 appears twice.  I seriously considered adding a
    local cprop pass at this point.  That could be done with a bit of work.  It
    didn't look too bad -- the biggest problem is cprop isn't designed to run 
once
    we've left cfglayout.  But we could probably finesse that by not allowing 
it to
    change jumps if we've left cfglayout or converting it to do the more complex
    jump fixups.
    
    You might ask why the post-reload optimizers don't help since this at least
    looks like a case where they could.  After LRA the RTL looks like:
    
    > (insn 26 5 25 2 (set (reg:DI 15 a5 [144])
    >         (const_int 4096 [0x1000])) 
"/home/jlaw/test/gcc/gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/dup-1.c":152:11
 277 {*movdi_64bit}      (expr_list:REG_EQUIV (const_int 4096 [0x1000])
    >         (nil)))
    > (insn 25 26 19 2 (set (reg:DI 15 a5 [143])
    >         (plus:DI (reg:DI 15 a5 [144])
    >             (const_int -2048 [0xfffffffffffff800]))) 
"/home/jlaw/test/gcc/gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/dup-1.c":152:11
 5 {adddi3}
    >      (expr_list:REG_EQUIV (const_int 2048 [0x800])
    >         (nil)))
    > (insn 19 25 20 2 (set (reg:V2048QI 100 v4 [orig:138 _11 ] [138])
    >         (if_then_else:V2048QI (unspec:V2048BI [
    >                     (const_vector:V2048BI [
    >                             (const_int 1 [0x1]) repeated x2048
    >                         ])
    >                     (reg:DI 15 a5 [143])
    >                     (const_int 2 [0x2]) repeated x3
    >                     (reg:SI 66 vl)
    >                     (reg:SI 67 vtype)
    >                 ] UNSPEC_VPREDICATE)
    >             (vec_duplicate:V2048QI (reg:QI 12 a2 [145]))
    >             (unspec:V2048QI [                     (reg:DI 0 zero)
    >                 ] UNSPEC_VUNDEF))) 
"/home/jlaw/test/gcc/gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/dup-1.c":152:11
 4172 {*pred_broadcastv2048qi}
    >      (nil)) (insn 20 19 21 2 (set (reg:DI 15 a5 [146])
    >         (const_int 4096 [0x1000])) 
"/home/jlaw/test/gcc/gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/dup-1.c":152:11
 277 {*movdi_64bit}       (expr_list:REG_EQUIV (const_int 4096 [0x1000])
    >         (nil)))
    > (insn 21 20 11 2 (set (reg:DI 15 a5 [139])
    >         (plus:DI (reg:DI 15 a5 [146])
    >             (const_int -2048 [0xfffffffffffff800]))) 
"/home/jlaw/test/gcc/gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/dup-1.c":152:11
 5 {adddi3}
    >      (expr_list:REG_EQUIV (const_int 2048 [0x800])
    >         (nil)))
    
    Note the re-use of a5 for the constant synthesis steps.  That's going to 
spoil
    any chance of reload_cse saving us.  That re-use also gets in the way of 
vsetvl
    elimination and we ultimately get this code:
    
    > foo10:
    >         li      a5,4096
    >         addi    a5,a5,-2048
    >         vsetvli zero,a5,e16,m8,ta,ma
    >         vfmv.v.f        v8,fa0
    >         li      a5,4096
    >         addi    a5,a5,-2048
    >         vsetvli zero,a5,e16,m8,ta,ma
    >         vse16.v v8,0(a0)
    >         ret
    The regression is we have the obviously redundant vsetvl.  The additional 
copy
    of the synthesis is undesirable as well.
    
    If we filter out single bit constants from mvconst_internal we trivially fix
    that regression.  The only fallout is a class of saturation tests which 
want to
    test against 0x80000000.   Under the hood this is a minor codegen issue
    interacting badly with combine's deliberate rejection of simplification of
    extensions of constants.  Rather than constructing the SImode constant, then
    zero extending the result we can just generate the constant we actually want
    directly in DImode.
    
    The net is we fix the regression, don't introduce any obvious new 
regressions
    and slightly reduce our dependence on mvconst_internal.  All good in my 
book.
    Obviously I'll wait for pre-commit CI to render a verdict.
    
            PR target/116256
    gcc/
            * config/riscv/riscv.md (mvconst_internal): Reject single bit
            constants.
            * config/riscv/riscv.cc (riscv_gen_zero_extend_rtx): Improve
            handling constants.

Diff:
---
 gcc/config/riscv/riscv.cc | 12 +++++++++---
 gcc/config/riscv/riscv.md |  3 ++-
 2 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/gcc/config/riscv/riscv.cc b/gcc/config/riscv/riscv.cc
index 5a3a05041773..4652454b8fec 100644
--- a/gcc/config/riscv/riscv.cc
+++ b/gcc/config/riscv/riscv.cc
@@ -12684,10 +12684,16 @@ riscv_gen_zero_extend_rtx (rtx x, machine_mode mode)
     emit_move_insn (xmode_reg, x);
   else
     {
-      rtx reg_x = gen_reg_rtx (mode);
+      /* Combine deliberately does not simplify extensions of constants
+        (long story).  So try to generate the zero extended constant
+        efficiently.
 
-      emit_move_insn (reg_x, x);
-      riscv_emit_unary (ZERO_EXTEND, xmode_reg, reg_x);
+        First extract the constant and mask off all the bits not in MODE.  */
+      HOST_WIDE_INT val = INTVAL (x);
+      val &= GET_MODE_MASK (mode);
+
+      /* X may need synthesis, so do not blindly copy it.  */
+      xmode_reg = force_reg (Xmode, gen_int_mode (val, Xmode));
     }
 
   return xmode_reg;
diff --git a/gcc/config/riscv/riscv.md b/gcc/config/riscv/riscv.md
index e4123c912dcb..09053df1eb9b 100644
--- a/gcc/config/riscv/riscv.md
+++ b/gcc/config/riscv/riscv.md
@@ -2470,7 +2470,8 @@
         (match_operand:GPR 1 "splittable_const_int_operand" "i"))]
   "!ira_in_progress
    && !(p2m1_shift_operand (operands[1], <MODE>mode)
-        || high_mask_shift_operand (operands[1], <MODE>mode))"
+       || high_mask_shift_operand (operands[1], <MODE>mode)
+       || exact_log2 (INTVAL (operands[1])) >= 0)"
   "#"
   "&& 1"
   [(const_int 0)]

[gcc r15-7208] [RISC-V][PR target/116256] Improve handling of single bit constants

Reply via email to