https://gcc.gnu.org/g:ba773a86f0377abccecd3e398dceb9408bba5a7c

commit r15-4292-gba773a86f0377abccecd3e398dceb9408bba5a7c
Author: Jeff Law <j...@ventanamicro.com>
Date:   Sat Oct 12 07:12:53 2024 -0600

    [RISC-V] Slightly improve broadcasting small constants into vectors
    
    I probably spent way more time on this than it's worth...
    
    I was looking at the code we generate for vector SAD and noticed that we
    were being a bit silly.  Specifically:
    
            li      a4,0            # 272   [c=4 l=4]  *movsi_internal/1
    
    Followed shortly by:
    
            vmv.s.x v3,a4   # 261   [c=4 l=4]  *pred_broadcastrvvm1si/6
    
    And no other uses of a4.  We could have used x0 trivially.
    
    First we adjust the expander so that it doesn't force the constant into a
    register.  In the matching pattern we change the appropriate source
    constraints from "r" to "rJ" and the output template is changed to use %z
    for the operand.  The net is we drop the li completely and emit
    vmv.s.x v3,x0.
    
    But wait, there's more.  If we're broadcasting a constant in the range
    [-16..15] into a vector, we currently load the constant into a register
    and use vmv.v.x.  We can instead use vmv.v.i, which avoids loading the
    constant into a GPR.  For that case we again avoid forcing the constant
    into a register in the expander and adjust the output template to emit
    vmv.v.x or vmv.v.i based on whether the appropriate operand is a general
    purpose register or a constant.  So again, we'll drop a load immediate
    into a scalar register for this case.
    
    Whether or not we should use vmv.v.i vs vmv.s.x for loading [-16..15]
    into the 0th element is probably uarch dependent.  The tradeoff is loading
    the GPR vs the broadcast in the vector unit.  I didn't bother with this
    case.
    
    Tested in my tester (which tests rv64gcv as a default codegen option). Will
    wait for the pre-commit tester to render a verdict.
    
    gcc/
            * config/riscv/constraints.md (P): New constraint.
            * config/riscv/vector.md (pred_broadcast<mode> expander): Do
            not force small integers into GPRs so aggressively.
            (pred_broadcast<mode> insn & splitter): Allow splatting small
            constants across the vector register directly.  Allow splatting
            (const_int 0) into element 0 directly.

Diff:
---
 gcc/config/riscv/constraints.md |  5 +++++
 gcc/config/riscv/vector.md      | 22 ++++++++++++++++------
 2 files changed, 21 insertions(+), 6 deletions(-)

diff --git a/gcc/config/riscv/constraints.md b/gcc/config/riscv/constraints.md
index 3ab6d5426223..eb5a0bb75c72 100644
--- a/gcc/config/riscv/constraints.md
+++ b/gcc/config/riscv/constraints.md
@@ -70,6 +70,11 @@
   (and (match_code "const_int")
        (match_test "ival == 8")))
 
+(define_constraint "P"
+  "A 5-bit signed immediate for vmv.v.i."
+  (and (match_code "const_int")
+       (match_test "IN_RANGE (ival, -16, 15)")))
+
 (define_constraint "K"
   "A 5-bit unsigned immediate for CSR access instructions."
   (and (match_code "const_int")
diff --git a/gcc/config/riscv/vector.md b/gcc/config/riscv/vector.md
index 92e3061c7f85..a21288f7af2a 100644
--- a/gcc/config/riscv/vector.md
+++ b/gcc/config/riscv/vector.md
@@ -2095,6 +2095,16 @@
       emit_move_insn (tmp, gen_int_mode (value, Pmode));
       operands[3] = gen_rtx_SIGN_EXTEND (<VEL>mode, tmp);
     }
+  /* Never load (const_int 0) into a register, that's silly.  */
+  else if (operands[3] == CONST0_RTX (<VEL>mode))
+    ;
+  /* If we're broadcasting [-16..15] across more than just
+     element 0, then we can use vmv.v.i directly, thus avoiding
+     the load of the constant into a GPR.  */
+  else if (CONST_INT_P (operands[3])
+          && IN_RANGE (INTVAL (operands[3]), -16, 15)
+          && !satisfies_constraint_Wb1 (operands[1]))
+    ;
   else
     operands[3] = force_reg (<VEL>mode, operands[3]);
 })
@@ -2111,18 +2121,18 @@
             (reg:SI VL_REGNUM)
             (reg:SI VTYPE_REGNUM)] UNSPEC_VPREDICATE)
          (vec_duplicate:V_VLSI
-           (match_operand:<VEL> 3 "direct_broadcast_operand"       " r,  r,Wdm,Wdm,Wdm,Wdm,  r,  r"))
-         (match_operand:V_VLSI 2 "vector_merge_operand"            "vu,  0, vu,  0, vu,  0, vu,  0")))]
+           (match_operand:<VEL> 3 "direct_broadcast_operand"       "rP,rP,Wdm,Wdm,Wdm,Wdm, rJ, rJ"))
+         (match_operand:V_VLSI 2 "vector_merge_operand"            "vu, 0, vu,  0, vu,  0, vu,  0")))]
   "TARGET_VECTOR"
   "@
-   vmv.v.x\t%0,%3
-   vmv.v.x\t%0,%3
+   vmv.v.%o3\t%0,%3
+   vmv.v.%o3\t%0,%3
    vlse<sew>.v\t%0,%3,zero,%1.t
    vlse<sew>.v\t%0,%3,zero,%1.t
    vlse<sew>.v\t%0,%3,zero
    vlse<sew>.v\t%0,%3,zero
-   vmv.s.x\t%0,%3
-   vmv.s.x\t%0,%3"
+   vmv.s.x\t%0,%z3
+   vmv.s.x\t%0,%z3"
   "(register_operand (operands[3], <VEL>mode)
   || CONST_POLY_INT_P (operands[3]))
   && GET_MODE_BITSIZE (<VEL>mode) > GET_MODE_BITSIZE (Pmode)"
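
The instruction choice described in the commit message can be summarized
with a small stand-alone C model.  This is only a sketch and is not code
from GCC: the names scalar_src, fits_vmv_v_i and model_broadcast are
hypothetical, the registers v3 and a4 are borrowed from the example in the
commit message, and the model ignores the memory (vlse) alternatives and
the case where the element mode is wider than Pmode.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical model of the scalar being broadcast: either an integer
   constant or a value that already lives in a GPR (shown as a4).  */
struct scalar_src
{
  bool is_const;  /* true if the source is an integer constant */
  long value;     /* only meaningful when is_const is true */
};

/* Same range as the new "P" constraint: a 5-bit signed immediate.  */
static bool
fits_vmv_v_i (long v)
{
  return v >= -16 && v <= 15;
}

/* Return the instruction sequence this model expects after the patch.
   elt0_only selects the "write element 0 only" case (vmv.s.x);
   otherwise the value is splatted across the vector.  A constant that
   still needs a GPR is shown with a preceding li.  The result lives in
   a static buffer that is overwritten by the next call.  */
static const char *
model_broadcast (struct scalar_src src, bool elt0_only)
{
  static char buf[64];

  if (elt0_only)
    {
      if (src.is_const && src.value == 0)
        /* Zero never needs a GPR: use x0 directly.  */
        snprintf (buf, sizeof buf, "vmv.s.x v3,x0");
      else if (src.is_const)
        /* Other constants are still loaded first; the commit message
           leaves the vmv.v.i alternative for element 0 alone.  */
        snprintf (buf, sizeof buf, "li a4,%ld; vmv.s.x v3,a4", src.value);
      else
        snprintf (buf, sizeof buf, "vmv.s.x v3,a4");
    }
  else
    {
      if (src.is_const && fits_vmv_v_i (src.value))
        /* Small constants are splatted as an immediate.  */
        snprintf (buf, sizeof buf, "vmv.v.i v3,%ld", src.value);
      else if (src.is_const)
        snprintf (buf, sizeof buf, "li a4,%ld; vmv.v.x v3,a4", src.value);
      else
        snprintf (buf, sizeof buf, "vmv.v.x v3,a4");
    }
  return buf;
}
```

Calling model_broadcast with a constant 0 and elt0_only set models the
vector SAD case from the commit message, where the li disappears.  Calling
it with a constant such as 5 and elt0_only clear models the vmv.v.i case.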
