Given a sequence such as

int foo ()
{
#pragma GCC unroll 4
  for (int i = 0; i < N; i++)
    if (a[i] == 124)
      return 1;

  return 0;
}

where a[i] is long long, we will unroll the loop and use an OR reduction for
early break on Adv. SIMD.  Afterwards the sequence is followed by a compression
sequence to compress the 128-bit vectors into 64-bits for use by the branch.

However if we have support for add halving and narrowing then we can instead of
using an OR, use an ADDHN which will do the combining and narrowing.

Note that for now I only do the last OR, however if we have more than one level
of unrolling we could technically chain them.  I will revisit this in another
upcoming early break series; however, an unroll of 2 is fairly common.

Bootstrapped Regtested on aarch64-none-linux-gnu,
arm-none-linux-gnueabihf, x86_64-pc-linux-gnu
-m32, -m64 with no issues and about a 10% improvement
in this sequence for Adv. SIMD.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

        * internal-fn.def (VEC_ADD_HALVING_NARROW): New.
        * doc/generic.texi: Document it.
        * optabs.def (vec_addh_narrow): New.
        * doc/md.texi: Document it.
        * tree-vect-stmts.cc (vectorizable_early_exit): Use addhn if supported.

gcc/testsuite/ChangeLog:

        * gcc.target/aarch64/vect-early-break-addhn_1.c: New test.
        * gcc.target/aarch64/vect-early-break-addhn_2.c: New test.
        * gcc.target/aarch64/vect-early-break-addhn_3.c: New test.
        * gcc.target/aarch64/vect-early-break-addhn_4.c: New test.

---
diff --git a/gcc/doc/generic.texi b/gcc/doc/generic.texi
index 
d4ac580a7a8b9cd339d26cb97f7eb963f83746a4..ff16ff47bbf45e795df0d230e9a885d9d218d9af
 100644
--- a/gcc/doc/generic.texi
+++ b/gcc/doc/generic.texi
@@ -1834,6 +1834,7 @@ a value from @code{enum annot_expr_kind}, the third is an 
@code{INTEGER_CST}.
 @tindex IFN_VEC_WIDEN_MINUS_LO
 @tindex IFN_VEC_WIDEN_MINUS_EVEN
 @tindex IFN_VEC_WIDEN_MINUS_ODD
+@tindex IFN_VEC_ADD_HALVING_NARROW
 @tindex VEC_UNPACK_HI_EXPR
 @tindex VEC_UNPACK_LO_EXPR
 @tindex VEC_UNPACK_FLOAT_HI_EXPR
@@ -1956,6 +1957,24 @@ vector of @code{N/2} subtractions.  In the case of
 vector are subtracted from the odd @code{N/2} of the first to produce the
 vector of @code{N/2} subtractions.
 
+@item IFN_VEC_ADD_HALVING_NARROW
+This internal function performs an addition of two input vectors,
+then extracts the most significant half of each result element and
+narrows it back to the original element width.
+
+Concretely, it computes:
+@code{(bits(a)/2)((a + b) >> (bits(a)/2))}
+
+where @code{bits(a)} is the width in bits of each input element.
+
+Its operands are vectors containing the same number of elements (@code{N})
+of the same integral type.  The result is a vector of length @code{N}, with
+elements of an integral type whose size is half that of the input element
+type.
+
+This operation is currently only used for early break result compression when the
+result of a vector boolean can be represented as 0 or -1.
+
 @item VEC_UNPACK_HI_EXPR
 @itemx VEC_UNPACK_LO_EXPR
 These nodes represent unpacking of the high and low parts of the input vector,
diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
index 
aba93f606eca59d31c103a05b2567fd4f3be55f3..ec0193e4eee079e00168bbaf9b28ba8d52e5d464
 100644
--- a/gcc/doc/md.texi
+++ b/gcc/doc/md.texi
@@ -6087,6 +6087,25 @@ vectors with N signed/unsigned elements of size S@.  
Find the absolute
 difference between operands 1 and 2 and widen the resulting elements.
 Put the N/2 results of size 2*S in the output vector (operand 0).
 
+@cindex @code{vec_addh_narrow@var{m}} instruction pattern
+@item @samp{vec_addh_narrow@var{m}}
+Signed or unsigned addition of two input vectors, then extracts the
+most significant half of each result element and narrows it back to the
+original element width.
+
+Concretely, it computes:
+@code{(bits(a)/2)((a + b) >> (bits(a)/2))}
+
+where @code{bits(a)} is the width in bits of each input element.
+
+Its operands (@code{1} and @code{2}) are vectors containing the same number
+of signed or unsigned integral elements (@code{N}) of size @code{S}.  The
+result (operand @code{0}) is a vector of length @code{N}, with elements of
+an integral type whose size is half that of @code{S}.
+
+This operation is currently only used for early break result compression when the
+result of a vector boolean can be represented as 0 or -1.
+
 @cindex @code{vec_addsub@var{m}3} instruction pattern
 @item @samp{vec_addsub@var{m}3}
 Alternating subtract, add with even lanes doing subtract and odd
diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
index 
d2480a1bf7927476215bc7bb99c0b74197d2b7e9..cb18058d9f48cc0dff96ed4b31d0abc9adb67867
 100644
--- a/gcc/internal-fn.def
+++ b/gcc/internal-fn.def
@@ -422,6 +422,8 @@ DEF_INTERNAL_OPTAB_FN (COMPLEX_ADD_ROT270, ECF_CONST, 
cadd270, binary)
 DEF_INTERNAL_OPTAB_FN (COMPLEX_MUL, ECF_CONST, cmul, binary)
 DEF_INTERNAL_OPTAB_FN (COMPLEX_MUL_CONJ, ECF_CONST, cmul_conj, binary)
 DEF_INTERNAL_OPTAB_FN (VEC_ADDSUB, ECF_CONST, vec_addsub, binary)
+DEF_INTERNAL_OPTAB_FN (VEC_ADD_HALVING_NARROW, ECF_CONST | ECF_NOTHROW,
+                      vec_addh_narrow, binary)
 DEF_INTERNAL_WIDENING_OPTAB_FN (VEC_WIDEN_PLUS,
                                ECF_CONST | ECF_NOTHROW,
                                first,
diff --git a/gcc/optabs.def b/gcc/optabs.def
index 
87a8b85da1592646d0a3447572e842ceb158cd97..b2bedc3692f914c2b80d7972db81b542b32c9eb8
 100644
--- a/gcc/optabs.def
+++ b/gcc/optabs.def
@@ -492,6 +492,7 @@ OPTAB_D (vec_widen_uabd_hi_optab, "vec_widen_uabd_hi_$a")
 OPTAB_D (vec_widen_uabd_lo_optab, "vec_widen_uabd_lo_$a")
 OPTAB_D (vec_widen_uabd_odd_optab, "vec_widen_uabd_odd_$a")
 OPTAB_D (vec_widen_uabd_even_optab, "vec_widen_uabd_even_$a")
+OPTAB_D (vec_addh_narrow_optab, "vec_addh_narrow$a")
 OPTAB_D (vec_addsub_optab, "vec_addsub$a3")
 OPTAB_D (vec_fmaddsub_optab, "vec_fmaddsub$a4")
 OPTAB_D (vec_fmsubadd_optab, "vec_fmsubadd$a4")
diff --git a/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_1.c 
b/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_1.c
new file mode 100644
index 
0000000000000000000000000000000000000000..4ecb187513e525e0cd9b8b063e418a75a23c525d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_1.c
@@ -0,0 +1,33 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-O3 -fdump-tree-vect-details -std=c99" } */
+/* { dg-final { check-function-bodies "**" "" "" } } */
+
+#define TYPE int
+#define N 800
+
+#pragma GCC target "+nosve"
+
+TYPE a[N];
+
+/*
+** foo:
+**     ...
+**     ldp     q[0-9]+, q[0-9]+, \[x[0-9]+\], 32
+**     cmeq    v[0-9]+.4s, v[0-9]+.4s, v[0-9]+.4s
+**     cmeq    v[0-9]+.4s, v[0-9]+.4s, v[0-9]+.4s
+**     addhn   v[0-9]+.4h, v[0-9]+.4s, v[0-9]+.4s
+**     fmov    x[0-9]+, d[0-9]+
+**     ...
+*/
+
+int foo ()
+{
+#pragma GCC unroll 8
+  for (int i = 0; i < N; i++)
+    if (a[i] == 124)
+      return 1;
+
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump "VEC_ADD_HALVING_NARROW" "vect" } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_2.c 
b/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_2.c
new file mode 100644
index 
0000000000000000000000000000000000000000..d67d0d13d1733935aaf805e59188eb8155cb5f06
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_2.c
@@ -0,0 +1,33 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-O3 -fdump-tree-vect-details -std=c99" } */
+/* { dg-final { check-function-bodies "**" "" "" } } */
+
+#define TYPE long long
+#define N 800
+
+#pragma GCC target "+nosve"
+
+TYPE a[N];
+
+/*
+** foo:
+**     ...
+**     ldp     q[0-9]+, q[0-9]+, \[x[0-9]+\], 32
+**     cmeq    v[0-9]+.2d, v[0-9]+.2d, v[0-9]+.2d
+**     cmeq    v[0-9]+.2d, v[0-9]+.2d, v[0-9]+.2d
+**     addhn   v[0-9]+.2s, v[0-9]+.2d, v[0-9]+.2d
+**     fmov    x[0-9]+, d[0-9]+
+**     ...
+*/
+
+int foo ()
+{
+#pragma GCC unroll 4
+  for (int i = 0; i < N; i++)
+    if (a[i] == 124)
+      return 1;
+
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump "VEC_ADD_HALVING_NARROW" "vect" } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_3.c 
b/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_3.c
new file mode 100644
index 
0000000000000000000000000000000000000000..57dbc44ae0cdcbcdccd3d8dbe98c79713eaf5607
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_3.c
@@ -0,0 +1,33 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-O3 -fdump-tree-vect-details -std=c99" } */
+/* { dg-final { check-function-bodies "**" "" "" } } */
+
+#define TYPE short
+#define N 800
+
+#pragma GCC target "+nosve"
+
+TYPE a[N];
+
+/*
+** foo:
+**     ...
+**     ldp     q[0-9]+, q[0-9]+, \[x[0-9]+\], 32
+**     cmeq    v[0-9]+.8h, v[0-9]+.8h, v[0-9]+.8h
+**     cmeq    v[0-9]+.8h, v[0-9]+.8h, v[0-9]+.8h
+**     addhn   v[0-9]+.8b, v[0-9]+.8h, v[0-9]+.8h
+**     fmov    x[0-9]+, d[0-9]+
+**     ...
+*/
+
+int foo ()
+{
+#pragma GCC unroll 16
+  for (int i = 0; i < N; i++)
+    if (a[i] == 124)
+      return 1;
+
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump "VEC_ADD_HALVING_NARROW" "vect" } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_4.c 
b/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_4.c
new file mode 100644
index 
0000000000000000000000000000000000000000..8ad42b22024479283d6814d815ef1dce411d1c72
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_4.c
@@ -0,0 +1,21 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-O3 -fdump-tree-vect-details -std=c99" } */
+
+#define TYPE char
+#define N 800
+
+#pragma GCC target "+nosve"
+
+TYPE a[N];
+
+int foo ()
+{
+#pragma GCC unroll 32
+  for (int i = 0; i < N; i++)
+    if (a[i] == 124)
+      return 1;
+
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-not "VEC_ADD_HALVING_NARROW" "vect" } } */
diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index 
1545fab364792f75bcc786ba1311b8bdc82edd70..179ce5e0a66b6f88976ffb544c6874d7bec999a8
 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -12328,7 +12328,7 @@ vectorizable_early_exit (loop_vec_info loop_vinfo, 
stmt_vec_info stmt_info,
   gimple *orig_stmt = STMT_VINFO_STMT (vect_orig_stmt (stmt_info));
   gcond *cond_stmt = as_a <gcond *>(orig_stmt);
 
-  tree cst = build_zero_cst (vectype);
+  tree vectype_out = vectype;
   auto bb = gimple_bb (cond_stmt);
   edge exit_true_edge = EDGE_SUCC (bb, 0);
   if (exit_true_edge->flags & EDGE_FALSE_VALUE)
@@ -12452,12 +12452,40 @@ vectorizable_early_exit (loop_vec_info loop_vinfo, 
stmt_vec_info stmt_info,
       else
        workset.splice (stmts);
 
+      /* See if we support ADDHN and use that for the reduction.  */
+      internal_fn ifn = IFN_VEC_ADD_HALVING_NARROW;
+      bool addhn_supported_p
+       = direct_internal_fn_supported_p (ifn, vectype, OPTIMIZE_FOR_SPEED);
+      tree narrow_type = NULL_TREE;
+      if (addhn_supported_p)
+       {
+         /* Calculate the narrowing type for the result.  */
+         auto halfprec = TYPE_PRECISION (TREE_TYPE (vectype)) / 2;
+         auto unsignedp = TYPE_UNSIGNED (TREE_TYPE (vectype));
+         tree itype = build_nonstandard_integer_type (halfprec, unsignedp);
+         poly_uint64 nunits = TYPE_VECTOR_SUBPARTS (vectype);
+         tree tmp_type = build_vector_type (itype, nunits);
+         narrow_type = truth_type_for (tmp_type);
+       }
+
       while (workset.length () > 1)
        {
-         new_temp = make_temp_ssa_name (vectype, NULL, "vexit_reduc");
          tree arg0 = workset.pop ();
          tree arg1 = workset.pop ();
-         new_stmt = gimple_build_assign (new_temp, BIT_IOR_EXPR, arg0, arg1);
+         if (addhn_supported_p && workset.length () == 0)
+           {
+             new_stmt = gimple_build_call_internal (ifn, 2, arg0, arg1);
+             vectype_out = narrow_type;
+             new_temp = make_temp_ssa_name (vectype_out, NULL, "vexit_reduc");
+             gimple_call_set_lhs (as_a <gcall *> (new_stmt), new_temp);
+             gimple_call_set_nothrow (as_a <gcall *> (new_stmt), true);
+           }
+         else
+           {
+             new_temp = make_temp_ssa_name (vectype_out, NULL, "vexit_reduc");
+             new_stmt
+               = gimple_build_assign (new_temp, BIT_IOR_EXPR, arg0, arg1);
+           }
          vect_finish_stmt_generation (loop_vinfo, stmt_info, new_stmt,
                                       &cond_gsi);
          workset.quick_insert (0, new_temp);
@@ -12480,6 +12508,7 @@ vectorizable_early_exit (loop_vec_info loop_vinfo, 
stmt_vec_info stmt_info,
 
   gcc_assert (new_temp);
 
+  tree cst = build_zero_cst (vectype_out);
   gimple_cond_set_condition (cond_stmt, NE_EXPR, new_temp, cst);
   update_stmt (orig_stmt);
 


-- 
diff --git a/gcc/doc/generic.texi b/gcc/doc/generic.texi
index d4ac580a7a8b9cd339d26cb97f7eb963f83746a4..ff16ff47bbf45e795df0d230e9a885d9d218d9af 100644
--- a/gcc/doc/generic.texi
+++ b/gcc/doc/generic.texi
@@ -1834,6 +1834,7 @@ a value from @code{enum annot_expr_kind}, the third is an @code{INTEGER_CST}.
 @tindex IFN_VEC_WIDEN_MINUS_LO
 @tindex IFN_VEC_WIDEN_MINUS_EVEN
 @tindex IFN_VEC_WIDEN_MINUS_ODD
+@tindex IFN_VEC_ADD_HALVING_NARROW
 @tindex VEC_UNPACK_HI_EXPR
 @tindex VEC_UNPACK_LO_EXPR
 @tindex VEC_UNPACK_FLOAT_HI_EXPR
@@ -1956,6 +1957,24 @@ vector of @code{N/2} subtractions.  In the case of
 vector are subtracted from the odd @code{N/2} of the first to produce the
 vector of @code{N/2} subtractions.
 
+@item IFN_VEC_ADD_HALVING_NARROW
+This internal function performs an addition of two input vectors,
+then extracts the most significant half of each result element and
+narrows it back to the original element width.
+
+Concretely, it computes:
+@code{(bits(a)/2)((a + b) >> (bits(a)/2))}
+
+where @code{bits(a)} is the width in bits of each input element.
+
+Its operands are vectors containing the same number of elements (@code{N})
+of the same integral type.  The result is a vector of length @code{N}, with
+elements of an integral type whose size is half that of the input element
+type.
+
+This operation is currently only used for early break result compression when the
+result of a vector boolean can be represented as 0 or -1.
+
 @item VEC_UNPACK_HI_EXPR
 @itemx VEC_UNPACK_LO_EXPR
 These nodes represent unpacking of the high and low parts of the input vector,
diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
index aba93f606eca59d31c103a05b2567fd4f3be55f3..ec0193e4eee079e00168bbaf9b28ba8d52e5d464 100644
--- a/gcc/doc/md.texi
+++ b/gcc/doc/md.texi
@@ -6087,6 +6087,25 @@ vectors with N signed/unsigned elements of size S@.  Find the absolute
 difference between operands 1 and 2 and widen the resulting elements.
 Put the N/2 results of size 2*S in the output vector (operand 0).
 
+@cindex @code{vec_addh_narrow@var{m}} instruction pattern
+@item @samp{vec_addh_narrow@var{m}}
+Signed or unsigned addition of two input vectors, then extracts the
+most significant half of each result element and narrows it back to the
+original element width.
+
+Concretely, it computes:
+@code{(bits(a)/2)((a + b) >> (bits(a)/2))}
+
+where @code{bits(a)} is the width in bits of each input element.
+
+Its operands (@code{1} and @code{2}) are vectors containing the same number
+of signed or unsigned integral elements (@code{N}) of size @code{S}.  The
+result (operand @code{0}) is a vector of length @code{N}, with elements of
+an integral type whose size is half that of @code{S}.
+
+This operation is currently only used for early break result compression when the
+result of a vector boolean can be represented as 0 or -1.
+
 @cindex @code{vec_addsub@var{m}3} instruction pattern
 @item @samp{vec_addsub@var{m}3}
 Alternating subtract, add with even lanes doing subtract and odd
diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
index d2480a1bf7927476215bc7bb99c0b74197d2b7e9..cb18058d9f48cc0dff96ed4b31d0abc9adb67867 100644
--- a/gcc/internal-fn.def
+++ b/gcc/internal-fn.def
@@ -422,6 +422,8 @@ DEF_INTERNAL_OPTAB_FN (COMPLEX_ADD_ROT270, ECF_CONST, cadd270, binary)
 DEF_INTERNAL_OPTAB_FN (COMPLEX_MUL, ECF_CONST, cmul, binary)
 DEF_INTERNAL_OPTAB_FN (COMPLEX_MUL_CONJ, ECF_CONST, cmul_conj, binary)
 DEF_INTERNAL_OPTAB_FN (VEC_ADDSUB, ECF_CONST, vec_addsub, binary)
+DEF_INTERNAL_OPTAB_FN (VEC_ADD_HALVING_NARROW, ECF_CONST | ECF_NOTHROW,
+		       vec_addh_narrow, binary)
 DEF_INTERNAL_WIDENING_OPTAB_FN (VEC_WIDEN_PLUS,
 				ECF_CONST | ECF_NOTHROW,
 				first,
diff --git a/gcc/optabs.def b/gcc/optabs.def
index 87a8b85da1592646d0a3447572e842ceb158cd97..b2bedc3692f914c2b80d7972db81b542b32c9eb8 100644
--- a/gcc/optabs.def
+++ b/gcc/optabs.def
@@ -492,6 +492,7 @@ OPTAB_D (vec_widen_uabd_hi_optab, "vec_widen_uabd_hi_$a")
 OPTAB_D (vec_widen_uabd_lo_optab, "vec_widen_uabd_lo_$a")
 OPTAB_D (vec_widen_uabd_odd_optab, "vec_widen_uabd_odd_$a")
 OPTAB_D (vec_widen_uabd_even_optab, "vec_widen_uabd_even_$a")
+OPTAB_D (vec_addh_narrow_optab, "vec_addh_narrow$a")
 OPTAB_D (vec_addsub_optab, "vec_addsub$a3")
 OPTAB_D (vec_fmaddsub_optab, "vec_fmaddsub$a4")
 OPTAB_D (vec_fmsubadd_optab, "vec_fmsubadd$a4")
diff --git a/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_1.c b/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_1.c
new file mode 100644
index 0000000000000000000000000000000000000000..4ecb187513e525e0cd9b8b063e418a75a23c525d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_1.c
@@ -0,0 +1,33 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-O3 -fdump-tree-vect-details -std=c99" } */
+/* { dg-final { check-function-bodies "**" "" "" } } */
+
+#define TYPE int
+#define N 800
+
+#pragma GCC target "+nosve"
+
+TYPE a[N];
+
+/*
+** foo:
+**	...
+**	ldp	q[0-9]+, q[0-9]+, \[x[0-9]+\], 32
+**	cmeq	v[0-9]+.4s, v[0-9]+.4s, v[0-9]+.4s
+**	cmeq	v[0-9]+.4s, v[0-9]+.4s, v[0-9]+.4s
+**	addhn	v[0-9]+.4h, v[0-9]+.4s, v[0-9]+.4s
+**	fmov	x[0-9]+, d[0-9]+
+**	...
+*/
+
+int foo ()
+{
+#pragma GCC unroll 8
+  for (int i = 0; i < N; i++)
+    if (a[i] == 124)
+      return 1;
+
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump "VEC_ADD_HALVING_NARROW" "vect" } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_2.c b/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_2.c
new file mode 100644
index 0000000000000000000000000000000000000000..d67d0d13d1733935aaf805e59188eb8155cb5f06
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_2.c
@@ -0,0 +1,33 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-O3 -fdump-tree-vect-details -std=c99" } */
+/* { dg-final { check-function-bodies "**" "" "" } } */
+
+#define TYPE long long
+#define N 800
+
+#pragma GCC target "+nosve"
+
+TYPE a[N];
+
+/*
+** foo:
+**	...
+**	ldp	q[0-9]+, q[0-9]+, \[x[0-9]+\], 32
+**	cmeq	v[0-9]+.2d, v[0-9]+.2d, v[0-9]+.2d
+**	cmeq	v[0-9]+.2d, v[0-9]+.2d, v[0-9]+.2d
+**	addhn	v[0-9]+.2s, v[0-9]+.2d, v[0-9]+.2d
+**	fmov	x[0-9]+, d[0-9]+
+**	...
+*/
+
+int foo ()
+{
+#pragma GCC unroll 4
+  for (int i = 0; i < N; i++)
+    if (a[i] == 124)
+      return 1;
+
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump "VEC_ADD_HALVING_NARROW" "vect" } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_3.c b/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_3.c
new file mode 100644
index 0000000000000000000000000000000000000000..57dbc44ae0cdcbcdccd3d8dbe98c79713eaf5607
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_3.c
@@ -0,0 +1,33 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-O3 -fdump-tree-vect-details -std=c99" } */
+/* { dg-final { check-function-bodies "**" "" "" } } */
+
+#define TYPE short
+#define N 800
+
+#pragma GCC target "+nosve"
+
+TYPE a[N];
+
+/*
+** foo:
+**	...
+**	ldp	q[0-9]+, q[0-9]+, \[x[0-9]+\], 32
+**	cmeq	v[0-9]+.8h, v[0-9]+.8h, v[0-9]+.8h
+**	cmeq	v[0-9]+.8h, v[0-9]+.8h, v[0-9]+.8h
+**	addhn	v[0-9]+.8b, v[0-9]+.8h, v[0-9]+.8h
+**	fmov	x[0-9]+, d[0-9]+
+**	...
+*/
+
+int foo ()
+{
+#pragma GCC unroll 16
+  for (int i = 0; i < N; i++)
+    if (a[i] == 124)
+      return 1;
+
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump "VEC_ADD_HALVING_NARROW" "vect" } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_4.c b/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_4.c
new file mode 100644
index 0000000000000000000000000000000000000000..8ad42b22024479283d6814d815ef1dce411d1c72
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_4.c
@@ -0,0 +1,21 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-O3 -fdump-tree-vect-details -std=c99" } */
+
+#define TYPE char
+#define N 800
+
+#pragma GCC target "+nosve"
+
+TYPE a[N];
+
+int foo ()
+{
+#pragma GCC unroll 32
+  for (int i = 0; i < N; i++)
+    if (a[i] == 124)
+      return 1;
+
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-not "VEC_ADD_HALVING_NARROW" "vect" } } */
diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index 1545fab364792f75bcc786ba1311b8bdc82edd70..179ce5e0a66b6f88976ffb544c6874d7bec999a8 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -12328,7 +12328,7 @@ vectorizable_early_exit (loop_vec_info loop_vinfo, stmt_vec_info stmt_info,
   gimple *orig_stmt = STMT_VINFO_STMT (vect_orig_stmt (stmt_info));
   gcond *cond_stmt = as_a <gcond *>(orig_stmt);
 
-  tree cst = build_zero_cst (vectype);
+  tree vectype_out = vectype;
   auto bb = gimple_bb (cond_stmt);
   edge exit_true_edge = EDGE_SUCC (bb, 0);
   if (exit_true_edge->flags & EDGE_FALSE_VALUE)
@@ -12452,12 +12452,40 @@ vectorizable_early_exit (loop_vec_info loop_vinfo, stmt_vec_info stmt_info,
       else
 	workset.splice (stmts);
 
+      /* See if we support ADDHN and use that for the reduction.  */
+      internal_fn ifn = IFN_VEC_ADD_HALVING_NARROW;
+      bool addhn_supported_p
+	= direct_internal_fn_supported_p (ifn, vectype, OPTIMIZE_FOR_SPEED);
+      tree narrow_type = NULL_TREE;
+      if (addhn_supported_p)
+	{
+	  /* Calculate the narrowing type for the result.  */
+	  auto halfprec = TYPE_PRECISION (TREE_TYPE (vectype)) / 2;
+	  auto unsignedp = TYPE_UNSIGNED (TREE_TYPE (vectype));
+	  tree itype = build_nonstandard_integer_type (halfprec, unsignedp);
+	  poly_uint64 nunits = TYPE_VECTOR_SUBPARTS (vectype);
+	  tree tmp_type = build_vector_type (itype, nunits);
+	  narrow_type = truth_type_for (tmp_type);
+	}
+
       while (workset.length () > 1)
 	{
-	  new_temp = make_temp_ssa_name (vectype, NULL, "vexit_reduc");
 	  tree arg0 = workset.pop ();
 	  tree arg1 = workset.pop ();
-	  new_stmt = gimple_build_assign (new_temp, BIT_IOR_EXPR, arg0, arg1);
+	  if (addhn_supported_p && workset.length () == 0)
+	    {
+	      new_stmt = gimple_build_call_internal (ifn, 2, arg0, arg1);
+	      vectype_out = narrow_type;
+	      new_temp = make_temp_ssa_name (vectype_out, NULL, "vexit_reduc");
+	      gimple_call_set_lhs (as_a <gcall *> (new_stmt), new_temp);
+	      gimple_call_set_nothrow (as_a <gcall *> (new_stmt), true);
+	    }
+	  else
+	    {
+	      new_temp = make_temp_ssa_name (vectype_out, NULL, "vexit_reduc");
+	      new_stmt
+		= gimple_build_assign (new_temp, BIT_IOR_EXPR, arg0, arg1);
+	    }
 	  vect_finish_stmt_generation (loop_vinfo, stmt_info, new_stmt,
 				       &cond_gsi);
 	  workset.quick_insert (0, new_temp);
@@ -12480,6 +12508,7 @@ vectorizable_early_exit (loop_vec_info loop_vinfo, stmt_vec_info stmt_info,
 
   gcc_assert (new_temp);
 
+  tree cst = build_zero_cst (vectype_out);
   gimple_cond_set_condition (cond_stmt, NE_EXPR, new_temp, cst);
   update_stmt (orig_stmt);
 

Reply via email to