On Fri, May 24, 2024 at 11:27 AM Feng Xue OS <f...@os.amperecomputing.com> wrote:
>
> Hi,
>
> The patch has been updated against the latest trunk and also contains some
> minor changes.
>
> I am working on another new feature that is meant to support pattern
> recognition of lane-reducing operations in an affine closure originating
> from a loop reduction variable, like:
>
>   sum += cst1 * dot_prod_1 + cst2 * sad_2 + ... + cstN * lane_reducing_op_N
>
> The WIP feature depends on this patch. It has been quite a while since it
> was posted; would you please take some time to review it? Thanks.
This seems to do multiple things so I wonder if you can split up the patch
a bit? For example adding lane_reducing_op_p can be split out, it also seems
like the vect_transform_reduction change to better distribute work can be
done separately? Likewise refactoring like splitting out
vect_reduction_use_partial_vector.

When we have

  sum += d0[i] * d1[i];      // dot-prod <vector(16) char>
  sum += w[i];               // widen-sum <vector(16) short>
  sum += abs(s0[i] - s1[i]); // sad <vector(8) short>
  sum += n[i];               // normal <vector(4) int>

the vector DOT_PROD and friend ops can end up mixing different lanes since
it is not specified which lanes are reduced into which output lane. So,
DOT_PROD might combine 0-3, 4-7, ... but SAD might combine 0,4,8,12;
1,5,9,13; ... I think this isn't worse than what one op itself is doing,
but it's worth pointing out (it's probably unlikely a target mixes different
reduction strategies anyway).

Can you make sure to add at least one SLP reduction example to show this
works for SLP as well?

Richard.

> Feng > ---- > > gcc/ > PR tree-optimization/114440 > * tree-vectorizer.h (struct _stmt_vec_info): Add a new field > reduc_result_pos. > (lane_reducing_op_p): New function. > (vectorizable_lane_reducing): New function declaration. > * tree-vect-stmts.cc (vectorizable_condition): Treat the condition > statement that is pointed by stmt_vec_info of reduction PHI as the > real "for_reduction" statement. > (vect_analyze_stmt): Call new function vectorizable_lane_reducing > to analyze lane-reducing operation. > * tree-vect-loop.cc (vect_is_emulated_mixed_dot_prod): Remove > parameter > loop_vinfo. Get input vectype from stmt_info instead of reduction PHI. > (vect_model_reduction_cost): Remove cost computation code related to > emulated_mixed_dot_prod. > (vect_reduction_use_partial_vector): New function. > (vectorizable_lane_reducing): New function. > (vectorizable_reduction): Allow multiple lane-reducing operations in > loop reduction. Move some original lane-reducing related code to > vectorizable_lane_reducing, and move partial vectorization checking > code to vect_reduction_use_partial_vector. > (vect_transform_reduction): Extend transformation to support reduction > statements with mixed input vectypes. > * tree-vect-slp.cc (vect_analyze_slp): Use new function > lane_reducing_op_p to check statement code. 
> > gcc/testsuite/ > PR tree-optimization/114440 > * gcc.dg/vect/vect-reduc-chain-1.c > * gcc.dg/vect/vect-reduc-chain-2.c > * gcc.dg/vect/vect-reduc-chain-3.c > * gcc.dg/vect/vect-reduc-dot-slp-1.c > * gcc.dg/vect/vect-reduc-dot-slp-2.c > --- > .../gcc.dg/vect/vect-reduc-chain-1.c | 62 ++ > .../gcc.dg/vect/vect-reduc-chain-2.c | 77 ++ > .../gcc.dg/vect/vect-reduc-chain-3.c | 66 ++ > .../gcc.dg/vect/vect-reduc-dot-slp-1.c | 97 +++ > .../gcc.dg/vect/vect-reduc-dot-slp-2.c | 81 +++ > gcc/tree-vect-loop.cc | 680 ++++++++++++------ > gcc/tree-vect-slp.cc | 4 +- > gcc/tree-vect-stmts.cc | 13 +- > gcc/tree-vectorizer.h | 14 + > 9 files changed, 873 insertions(+), 221 deletions(-) > create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c > create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-2.c > create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-3.c > create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-1.c > create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-2.c > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c > b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c > new file mode 100644 > index 00000000000..04bfc419dbd > --- /dev/null > +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c > @@ -0,0 +1,62 @@ > +/* Disabling epilogues until we find a better way to deal with scans. */ > +/* { dg-additional-options "--param vect-epilogues-nomask=0" } */ > +/* { dg-require-effective-target vect_int } */ > +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { > aarch64*-*-* || arm*-*-* } } } */ > +/* { dg-add-options arm_v8_2a_dotprod_neon } */ > + > +#include "tree-vect.h" > + > +#define N 50 > + > +#ifndef SIGNEDNESS_1 > +#define SIGNEDNESS_1 signed > +#define SIGNEDNESS_2 signed > +#endif > + > +SIGNEDNESS_1 int __attribute__ ((noipa)) > +f (SIGNEDNESS_1 int res, > + SIGNEDNESS_2 char *restrict a, > + SIGNEDNESS_2 char *restrict b, > + SIGNEDNESS_2 char *restrict c, > + SIGNEDNESS_2 char *restrict d, > + SIGNEDNESS_1 int *restrict e) > +{ > + for (int i = 0; i < N; ++i) > + { > + res += a[i] * b[i]; > + res += c[i] * d[i]; > + res += e[i]; > + } > + return res; > +} > + > +#define BASE ((SIGNEDNESS_2 int) -1 < 0 ? -126 : 4) > +#define OFFSET 20 > + > +int > +main (void) > +{ > + check_vect (); > + > + SIGNEDNESS_2 char a[N], b[N]; > + SIGNEDNESS_2 char c[N], d[N]; > + SIGNEDNESS_1 int e[N]; > + int expected = 0x12345; > + for (int i = 0; i < N; ++i) > + { > + a[i] = BASE + i * 5; > + b[i] = BASE + OFFSET + i * 4; > + c[i] = BASE + i * 2; > + d[i] = BASE + OFFSET + i * 3; > + e[i] = i; > + asm volatile ("" ::: "memory"); > + expected += a[i] * b[i]; > + expected += c[i] * d[i]; > + expected += e[i]; > + } > + if (f (0x12345, a, b, c, d, e) != expected) > + __builtin_abort (); > +} > + > +/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" > "vect" } } */ > +/* { dg-final { scan-tree-dump-times "vectorizing statement: \\S+ = > DOT_PROD_EXPR" 2 "vect" { target vect_sdot_qi } } } */ > diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-2.c > b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-2.c > new file mode 100644 > index 00000000000..6c803b80120 > --- /dev/null > +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-2.c > @@ -0,0 +1,77 @@ > +/* Disabling epilogues until we find a better way to deal with scans. 
*/ > +/* { dg-additional-options "--param vect-epilogues-nomask=0" } */ > +/* { dg-require-effective-target vect_int } */ > +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { > aarch64*-*-* || arm*-*-* } } } */ > +/* { dg-add-options arm_v8_2a_dotprod_neon } */ > + > +#include "tree-vect.h" > + > +#define N 50 > + > +#ifndef SIGNEDNESS_1 > +#define SIGNEDNESS_1 signed > +#define SIGNEDNESS_2 unsigned > +#define SIGNEDNESS_3 signed > +#define SIGNEDNESS_4 signed > +#endif > + > +SIGNEDNESS_1 int __attribute__ ((noipa)) > +fn (SIGNEDNESS_1 int res, > + SIGNEDNESS_2 char *restrict a, > + SIGNEDNESS_2 char *restrict b, > + SIGNEDNESS_3 char *restrict c, > + SIGNEDNESS_3 char *restrict d, > + SIGNEDNESS_4 short *restrict e, > + SIGNEDNESS_4 short *restrict f, > + SIGNEDNESS_1 int *restrict g) > +{ > + for (int i = 0; i < N; ++i) > + { > + res += a[i] * b[i]; > + res += i + 1; > + res += c[i] * d[i]; > + res += e[i] * f[i]; > + res += g[i]; > + } > + return res; > +} > + > +#define BASE2 ((SIGNEDNESS_2 int) -1 < 0 ? -126 : 4) > +#define BASE3 ((SIGNEDNESS_3 int) -1 < 0 ? -126 : 4) > +#define BASE4 ((SIGNEDNESS_4 int) -1 < 0 ? -1026 : 373) > +#define OFFSET 20 > + > +int > +main (void) > +{ > + check_vect (); > + > + SIGNEDNESS_2 char a[N], b[N]; > + SIGNEDNESS_3 char c[N], d[N]; > + SIGNEDNESS_4 short e[N], f[N]; > + SIGNEDNESS_1 int g[N]; > + int expected = 0x12345; > + for (int i = 0; i < N; ++i) > + { > + a[i] = BASE2 + i * 5; > + b[i] = BASE2 + OFFSET + i * 4; > + c[i] = BASE3 + i * 2; > + d[i] = BASE3 + OFFSET + i * 3; > + e[i] = BASE4 + i * 6; > + f[i] = BASE4 + OFFSET + i * 5; > + g[i] = i; > + asm volatile ("" ::: "memory"); > + expected += a[i] * b[i]; > + expected += i + 1; > + expected += c[i] * d[i]; > + expected += e[i] * f[i]; > + expected += g[i]; > + } > + if (fn (0x12345, a, b, c, d, e, f, g) != expected) > + __builtin_abort (); > +} > + > +/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" > "vect" } } */ > +/* { dg-final { scan-tree-dump "vectorizing statement: \\S+ = DOT_PROD_EXPR" > "vect" { target { vect_sdot_qi } } } } */ > +/* { dg-final { scan-tree-dump "vectorizing statement: \\S+ = DOT_PROD_EXPR" > "vect" { target { vect_udot_qi } } } } */ > +/* { dg-final { scan-tree-dump "vectorizing statement: \\S+ = DOT_PROD_EXPR" > "vect" { target { vect_sdot_hi } } } } */ > diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-3.c > b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-3.c > new file mode 100644 > index 00000000000..a41e4b176c4 > --- /dev/null > +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-3.c > @@ -0,0 +1,66 @@ > +/* Disabling epilogues until we find a better way to deal with scans. */ > +/* { dg-additional-options "--param vect-epilogues-nomask=0" } */ > +/* { dg-require-effective-target vect_int } */ > + > +#include "tree-vect.h" > + > +#define N 50 > + > +#ifndef SIGNEDNESS_1 > +#define SIGNEDNESS_1 signed > +#define SIGNEDNESS_2 unsigned > +#define SIGNEDNESS_3 signed > +#endif > + > +SIGNEDNESS_1 int __attribute__ ((noipa)) > +f (SIGNEDNESS_1 int res, > + SIGNEDNESS_2 char *restrict a, > + SIGNEDNESS_2 char *restrict b, > + SIGNEDNESS_3 short *restrict c, > + SIGNEDNESS_3 short *restrict d, > + SIGNEDNESS_1 int *restrict e) > +{ > + for (int i = 0; i < N; ++i) > + { > + short diff = a[i] - b[i]; > + SIGNEDNESS_2 short abs = diff < 0 ? -diff : diff; > + res += abs; > + res += c[i] * d[i]; > + res += e[i]; > + } > + return res; > +} > + > +#define BASE2 ((SIGNEDNESS_2 int) -1 < 0 ? 
-126 : 4) > +#define BASE3 ((SIGNEDNESS_3 int) -1 < 0 ? -1236 : 373) > +#define OFFSET 20 > + > +int > +main (void) > +{ > + check_vect (); > + > + SIGNEDNESS_2 char a[N], b[N]; > + SIGNEDNESS_3 short c[N], d[N]; > + SIGNEDNESS_1 int e[N]; > + int expected = 0x12345; > + for (int i = 0; i < N; ++i) > + { > + a[i] = BASE2 + i * 5; > + b[i] = BASE2 - i * 4; > + c[i] = BASE3 + i * 2; > + d[i] = BASE3 + OFFSET + i * 3; > + e[i] = i; > + asm volatile ("" ::: "memory"); > + short diff = a[i] - b[i]; > + SIGNEDNESS_2 short abs = diff < 0 ? -diff : diff; > + expected += abs; > + expected += c[i] * d[i]; > + expected += e[i]; > + } > + if (f (0x12345, a, b, c, d, e) != expected) > + __builtin_abort (); > +} > + > +/* { dg-final { scan-tree-dump "vectorizing statement: \\S+ = SAD_EXPR" > "vect" { target vect_udot_qi } } } */ > +/* { dg-final { scan-tree-dump "vectorizing statement: \\S+ = DOT_PROD_EXPR" > "vect" { target vect_sdot_hi } } } */ > diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-1.c > b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-1.c > new file mode 100644 > index 00000000000..51ef4eaaed8 > --- /dev/null > +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-1.c > @@ -0,0 +1,97 @@ > +/* Disabling epilogues until we find a better way to deal with scans. */ > +/* { dg-additional-options "--param vect-epilogues-nomask=0" } */ > +/* { dg-require-effective-target vect_int } */ > +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { > aarch64*-*-* || arm*-*-* } } } */ > +/* { dg-add-options arm_v8_2a_dotprod_neon } */ > + > +#include "tree-vect.h" > + > +#define N 50 > + > +#ifndef SIGNEDNESS_1 > +#define SIGNEDNESS_1 signed > +#define SIGNEDNESS_2 signed > +#endif > + > +SIGNEDNESS_1 int __attribute__ ((noipa)) > +f (SIGNEDNESS_1 int res, > + SIGNEDNESS_2 char *a, > + SIGNEDNESS_2 char *b, > + int step, int n) > +{ > + for (int i = 0; i < n; i++) > + { > + res += a[0] * b[0]; > + res += a[1] * b[1]; > + res += a[2] * b[2]; > + res += a[3] * b[3]; > + res += a[4] * b[4]; > + res += a[5] * b[5]; > + res += a[6] * b[6]; > + res += a[7] * b[7]; > + res += a[8] * b[8]; > + res += a[9] * b[9]; > + res += a[10] * b[10]; > + res += a[11] * b[11]; > + res += a[12] * b[12]; > + res += a[13] * b[13]; > + res += a[14] * b[14]; > + res += a[15] * b[15]; > + > + a += step; > + b += step; > + } > + > + return res; > +} > + > +#define BASE ((SIGNEDNESS_2 int) -1 < 0 ? 
-126 : 4) > +#define OFFSET 20 > + > +int > +main (void) > +{ > + check_vect (); > + > + SIGNEDNESS_2 char a[100], b[100]; > + int expected = 0x12345; > + int step = 16; > + int n = 2; > + int t = 0; > + > + for (int i = 0; i < sizeof (a) / sizeof (a[0]); ++i) > + { > + a[i] = BASE + i * 5; > + b[i] = BASE + OFFSET + i * 4; > + asm volatile ("" ::: "memory"); > + } > + > + for (int i = 0; i < n; i++) > + { > + asm volatile ("" ::: "memory"); > + expected += a[t + 0] * b[t + 0]; > + expected += a[t + 1] * b[t + 1]; > + expected += a[t + 2] * b[t + 2]; > + expected += a[t + 3] * b[t + 3]; > + expected += a[t + 4] * b[t + 4]; > + expected += a[t + 5] * b[t + 5]; > + expected += a[t + 6] * b[t + 6]; > + expected += a[t + 7] * b[t + 7]; > + expected += a[t + 8] * b[t + 8]; > + expected += a[t + 9] * b[t + 9]; > + expected += a[t + 10] * b[t + 10]; > + expected += a[t + 11] * b[t + 11]; > + expected += a[t + 12] * b[t + 12]; > + expected += a[t + 13] * b[t + 13]; > + expected += a[t + 14] * b[t + 14]; > + expected += a[t + 15] * b[t + 15]; > + t += step; > + } > + > + if (f (0x12345, a, b, step, n) != expected) > + __builtin_abort (); > +} > + > +/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" > "vect" } } */ > +/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */ > +/* { dg-final { scan-tree-dump-times "vectorizing statement: \\S+ = > DOT_PROD_EXPR" 16 "vect" } } */ > diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-2.c > b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-2.c > new file mode 100644 > index 00000000000..1532833c3ae > --- /dev/null > +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-2.c > @@ -0,0 +1,81 @@ > +/* Disabling epilogues until we find a better way to deal with scans. */ > +/* { dg-additional-options "--param vect-epilogues-nomask=0" } */ > +/* { dg-require-effective-target vect_int } */ > +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { > aarch64*-*-* || arm*-*-* } } } */ > +/* { dg-add-options arm_v8_2a_dotprod_neon } */ > + > +#include "tree-vect.h" > + > +#define N 50 > + > +#ifndef SIGNEDNESS_1 > +#define SIGNEDNESS_1 signed > +#define SIGNEDNESS_2 signed > +#endif > + > +SIGNEDNESS_1 int __attribute__ ((noipa)) > +f (SIGNEDNESS_1 int res, > + SIGNEDNESS_2 short *a, > + SIGNEDNESS_2 short *b, > + int step, int n) > +{ > + for (int i = 0; i < n; i++) > + { > + res += a[0] * b[0]; > + res += a[1] * b[1]; > + res += a[2] * b[2]; > + res += a[3] * b[3]; > + res += a[4] * b[4]; > + res += a[5] * b[5]; > + res += a[6] * b[6]; > + res += a[7] * b[7]; > + > + a += step; > + b += step; > + } > + > + return res; > +} > + > +#define BASE ((SIGNEDNESS_2 int) -1 < 0 ? 
-1026 : 373) > +#define OFFSET 20 > + > +int > +main (void) > +{ > + check_vect (); > + > + SIGNEDNESS_2 short a[100], b[100]; > + int expected = 0x12345; > + int step = 8; > + int n = 2; > + int t = 0; > + > + for (int i = 0; i < sizeof (a) / sizeof (a[0]); ++i) > + { > + a[i] = BASE + i * 5; > + b[i] = BASE + OFFSET + i * 4; > + asm volatile ("" ::: "memory"); > + } > + > + for (int i = 0; i < n; i++) > + { > + asm volatile ("" ::: "memory"); > + expected += a[t + 0] * b[t + 0]; > + expected += a[t + 1] * b[t + 1]; > + expected += a[t + 2] * b[t + 2]; > + expected += a[t + 3] * b[t + 3]; > + expected += a[t + 4] * b[t + 4]; > + expected += a[t + 5] * b[t + 5]; > + expected += a[t + 6] * b[t + 6]; > + expected += a[t + 7] * b[t + 7]; > + t += step; > + } > + > + if (f (0x12345, a, b, step, n) != expected) > + __builtin_abort (); > +} > + > +/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" > "vect" } } */ > +/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */ > +/* { dg-final { scan-tree-dump-times "vectorizing statement: \\S+ = > DOT_PROD_EXPR" 8 "vect" { target vect_sdot_hi } } } */ > diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc > index 83c0544b6aa..92d07df2890 100644 > --- a/gcc/tree-vect-loop.cc > +++ b/gcc/tree-vect-loop.cc > @@ -5270,8 +5270,7 @@ have_whole_vector_shift (machine_mode mode) > See vect_emulate_mixed_dot_prod for the actual sequence used. */ > > static bool > -vect_is_emulated_mixed_dot_prod (loop_vec_info loop_vinfo, > - stmt_vec_info stmt_info) > +vect_is_emulated_mixed_dot_prod (stmt_vec_info stmt_info) > { > gassign *assign = dyn_cast<gassign *> (stmt_info->stmt); > if (!assign || gimple_assign_rhs_code (assign) != DOT_PROD_EXPR) > @@ -5282,10 +5281,9 @@ vect_is_emulated_mixed_dot_prod (loop_vec_info > loop_vinfo, > if (TYPE_SIGN (TREE_TYPE (rhs1)) == TYPE_SIGN (TREE_TYPE (rhs2))) > return false; > > - stmt_vec_info reduc_info = info_for_reduction (loop_vinfo, stmt_info); > - gcc_assert (reduc_info->is_reduc_info); > + gcc_assert (STMT_VINFO_REDUC_VECTYPE_IN (stmt_info)); > return !directly_supported_p (DOT_PROD_EXPR, > - STMT_VINFO_REDUC_VECTYPE_IN (reduc_info), > + STMT_VINFO_REDUC_VECTYPE_IN (stmt_info), > optab_vector_mixed_sign); > } > > @@ -5324,8 +5322,6 @@ vect_model_reduction_cost (loop_vec_info loop_vinfo, > if (!gimple_extract_op (orig_stmt_info->stmt, &op)) > gcc_unreachable (); > > - bool emulated_mixed_dot_prod > - = vect_is_emulated_mixed_dot_prod (loop_vinfo, stmt_info); > if (reduction_type == EXTRACT_LAST_REDUCTION) > /* No extra instructions are needed in the prologue. The loop body > operations are costed in vectorizable_condition. */ > @@ -5360,12 +5356,8 @@ vect_model_reduction_cost (loop_vec_info loop_vinfo, > initial result of the data reduction, initial value of the index > reduction. */ > prologue_stmts = 4; > - else if (emulated_mixed_dot_prod) > - /* We need the initial reduction value and two invariants: > - one that contains the minimum signed value and one that > - contains half of its negative. */ > - prologue_stmts = 3; > else > + /* We need the initial reduction value. 
*/ > prologue_stmts = 1; > prologue_cost += record_stmt_cost (cost_vec, prologue_stmts, > scalar_to_vec, stmt_info, 0, > @@ -7384,6 +7376,244 @@ build_vect_cond_expr (code_helper code, tree vop[3], > tree mask, > } > } > > +/* Given an operation with CODE in loop reduction path whose reduction PHI is > + specified by REDUC_INFO, the operation has TYPE of scalar result, and its > + input vectype is represented by VECTYPE_IN. The vectype of vectorized > result > + may be different from VECTYPE_IN, either in base type or vectype lanes, > + lane-reducing operation is the case. This function check if it is > possible, > + and how to perform partial vectorization on the operation in the context > + of LOOP_VINFO. */ > + > +static void > +vect_reduction_use_partial_vector (loop_vec_info loop_vinfo, > + stmt_vec_info reduc_info, > + slp_tree slp_node, code_helper code, > + tree type, tree vectype_in) > +{ > + if (!LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)) > + return; > + > + enum vect_reduction_type reduc_type = STMT_VINFO_REDUC_TYPE (reduc_info); > + internal_fn reduc_fn = STMT_VINFO_REDUC_FN (reduc_info); > + internal_fn cond_fn = get_conditional_internal_fn (code, type); > + > + if (reduc_type != FOLD_LEFT_REDUCTION > + && !use_mask_by_cond_expr_p (code, cond_fn, vectype_in) > + && (cond_fn == IFN_LAST > + || !direct_internal_fn_supported_p (cond_fn, vectype_in, > + OPTIMIZE_FOR_SPEED))) > + { > + if (dump_enabled_p ()) > + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, > + "can't operate on partial vectors because" > + " no conditional operation is available.\n"); > + LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false; > + } > + else if (reduc_type == FOLD_LEFT_REDUCTION > + && reduc_fn == IFN_LAST > + && !expand_vec_cond_expr_p (vectype_in, truth_type_for > (vectype_in), > + SSA_NAME)) > + { > + if (dump_enabled_p ()) > + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, > + "can't operate on partial vectors because" > + " no conditional operation is available.\n"); > + LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false; > + } > + else if (reduc_type == FOLD_LEFT_REDUCTION > + && internal_fn_mask_index (reduc_fn) == -1 > + && FLOAT_TYPE_P (vectype_in) > + && HONOR_SIGN_DEPENDENT_ROUNDING (vectype_in)) > + { > + if (dump_enabled_p ()) > + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, > + "can't operate on partial vectors because" > + " signed zeros cannot be preserved.\n"); > + LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false; > + } > + else > + { > + internal_fn mask_reduc_fn > + = get_masked_reduction_fn (reduc_fn, vectype_in); > + vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo); > + vec_loop_lens *lens = &LOOP_VINFO_LENS (loop_vinfo); > + unsigned nvectors; > + > + if (slp_node) > + nvectors = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node); > + else > + nvectors = vect_get_num_copies (loop_vinfo, vectype_in); > + > + if (mask_reduc_fn == IFN_MASK_LEN_FOLD_LEFT_PLUS) > + vect_record_loop_len (loop_vinfo, lens, nvectors, vectype_in, 1); > + else > + vect_record_loop_mask (loop_vinfo, masks, nvectors, vectype_in, NULL); > + } > +} > + > +/* Check if STMT_INFO is a lane-reducing operation that can be vectorized in > + the context of LOOP_VINFO, and vector cost will be recorded in COST_VEC. > + Now there are three such kinds of operations: dot-prod/widen-sum/sad > + (sum-of-absolute-differences). 
> + > + For a lane-reducing operation, the loop reduction path that it lies in, > + may contain normal operation, or other lane-reducing operation of > different > + input type size, an example as: > + > + int sum = 0; > + for (i) > + { > + ... > + sum += d0[i] * d1[i]; // dot-prod <vector(16) char> > + sum += w[i]; // widen-sum <vector(16) char> > + sum += abs(s0[i] - s1[i]); // sad <vector(8) short> > + sum += n[i]; // normal <vector(4) int> > + ... > + } > + > + Vectorization factor is essentially determined by operation whose input > + vectype has the most lanes ("vector(16) char" in the example), while we > + need to choose input vectype with the least lanes ("vector(4) int" in the > + example) for the reduction PHI statement. */ > + > +bool > +vectorizable_lane_reducing (loop_vec_info loop_vinfo, stmt_vec_info > stmt_info, > + slp_tree slp_node, stmt_vector_for_cost *cost_vec) > +{ > + gassign *stmt = dyn_cast <gassign *> (stmt_info->stmt); > + if (!stmt) > + return false; > + > + enum tree_code code = gimple_assign_rhs_code (stmt); > + > + if (!lane_reducing_op_p (code)) > + return false; > + > + tree type = TREE_TYPE (gimple_assign_lhs (stmt)); > + > + if (!INTEGRAL_TYPE_P (type) && !SCALAR_FLOAT_TYPE_P (type)) > + return false; > + > + /* Do not try to vectorize bit-precision reductions. */ > + if (!type_has_mode_precision_p (type)) > + return false; > + > + tree vectype_in = NULL_TREE; > + > + for (int i = 0; i < (int) gimple_num_ops (stmt) - 1; i++) > + { > + stmt_vec_info def_stmt_info; > + slp_tree slp_op; > + tree op; > + tree vectype; > + enum vect_def_type dt; > + > + if (!vect_is_simple_use (loop_vinfo, stmt_info, slp_node, i, &op, > + &slp_op, &dt, &vectype, &def_stmt_info)) > + { > + if (dump_enabled_p ()) > + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, > + "use not simple.\n"); > + return false; > + } > + > + if (!vectype) > + { > + vectype = get_vectype_for_scalar_type (loop_vinfo, TREE_TYPE (op), > + slp_op); > + if (!vectype) > + return false; > + } > + > + if (slp_node && !vect_maybe_update_slp_op_vectype (slp_op, vectype)) > + { > + if (dump_enabled_p ()) > + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, > + "incompatible vector types for invariants\n"); > + return false; > + } > + > + if (i == STMT_VINFO_REDUC_IDX (stmt_info)) > + continue; > + > + /* There should be at most one cycle def in the stmt. */ > + if (VECTORIZABLE_CYCLE_DEF (dt)) > + return false; > + > + /* To properly compute ncopies we are interested in the widest > + non-reduction input type in case we're looking at a widening > + accumulation that we later handle in vect transformation. */ > + if (!vectype_in > + || (GET_MODE_SIZE (SCALAR_TYPE_MODE (TREE_TYPE (vectype_in))) > + < GET_MODE_SIZE (SCALAR_TYPE_MODE (TREE_TYPE (vectype))))) > + vectype_in = vectype; > + } > + > + STMT_VINFO_REDUC_VECTYPE_IN (stmt_info) = vectype_in; > + > + stmt_vec_info reduc_info = STMT_VINFO_REDUC_DEF (vect_orig_stmt > (stmt_info)); > + > + /* TODO: Support lane-reducing operation that does not directly participate > + in loop reduction. */ > + if (!reduc_info || STMT_VINFO_REDUC_IDX (stmt_info) < 0) > + return false; > + > + /* Lane-reducing pattern inside any inner loop of LOOP_VINFO is not > + recoginized. 
*/ > + gcc_assert (STMT_VINFO_DEF_TYPE (reduc_info) == vect_reduction_def); > + gcc_assert (STMT_VINFO_REDUC_TYPE (reduc_info) == TREE_CODE_REDUCTION); > + > + tree vphi_vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (reduc_info); > + > + /* To accommodate lane-reducing operations of mixed input vectypes, choose > + input vectype with the least lanes for the reduction PHI statement, > which > + would result in the most ncopies for vectorized reduction results. */ > + if (!vphi_vectype_in > + || (GET_MODE_SIZE (SCALAR_TYPE_MODE (TREE_TYPE (vectype_in))) > + > GET_MODE_SIZE (SCALAR_TYPE_MODE (TREE_TYPE (vphi_vectype_in))))) > + STMT_VINFO_REDUC_VECTYPE_IN (reduc_info) = vectype_in; > + > + int ncopies_for_cost; > + > + if (slp_node) > + { > + /* Now lane-reducing operations in a slp node should only come from > + the same loop reduction path. */ > + gcc_assert (REDUC_GROUP_FIRST_ELEMENT (stmt_info)); > + ncopies_for_cost = 1; > + } > + else > + { > + ncopies_for_cost = vect_get_num_copies (loop_vinfo, vectype_in); > + gcc_assert (ncopies_for_cost >= 1); > + } > + > + if (vect_is_emulated_mixed_dot_prod (stmt_info)) > + { > + /* We need extra two invariants: one that contains the minimum signed > + value and one that contains half of its negative. */ > + int prologue_stmts = 2; > + unsigned cost = record_stmt_cost (cost_vec, prologue_stmts, > + scalar_to_vec, stmt_info, 0, > + vect_prologue); > + if (dump_enabled_p ()) > + dump_printf (MSG_NOTE, "vectorizable_lane_reducing: " > + "extra prologue_cost = %d .\n", cost); > + > + /* Three dot-products and a subtraction. */ > + ncopies_for_cost *= 4; > + } > + > + record_stmt_cost (cost_vec, ncopies_for_cost, vector_stmt, stmt_info, 0, > + vect_body); > + > + vect_reduction_use_partial_vector (loop_vinfo, reduc_info, slp_node, code, > + type, vectype_in); > + > + STMT_VINFO_TYPE (stmt_info) = reduc_vec_info_type; > + return true; > +} > + > /* Function vectorizable_reduction. > > Check if STMT_INFO performs a reduction operation that can be vectorized. > @@ -7449,7 +7679,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo, > bool single_defuse_cycle = false; > bool nested_cycle = false; > bool double_reduc = false; > - int vec_num; > tree cr_index_scalar_type = NULL_TREE, cr_index_vector_type = NULL_TREE; > tree cond_reduc_val = NULL_TREE; > > @@ -7530,6 +7759,7 @@ vectorizable_reduction (loop_vec_info loop_vinfo, > (gimple_bb (reduc_def_phi)->loop_father)); > unsigned reduc_chain_length = 0; > bool only_slp_reduc_chain = true; > + bool only_lane_reducing = true; > stmt_info = NULL; > slp_tree slp_for_stmt_info = slp_node ? slp_node_instance->root : NULL; > while (reduc_def != PHI_RESULT (reduc_def_phi)) > @@ -7551,14 +7781,15 @@ vectorizable_reduction (loop_vec_info loop_vinfo, > all lanes here - even though we only will vectorize from > the SLP node with live lane zero the other live lanes also > need to be identified as part of a reduction to be able > - to skip code generation for them. */ > + to skip code generation for them. For lane-reducing operation > + vectorizable analysis needs the reduction PHI information. 
*/ > if (slp_for_stmt_info) > { > for (auto s : SLP_TREE_SCALAR_STMTS (slp_for_stmt_info)) > if (STMT_VINFO_LIVE_P (s)) > STMT_VINFO_REDUC_DEF (vect_orig_stmt (s)) = phi_info; > } > - else if (STMT_VINFO_LIVE_P (vdef)) > + else > STMT_VINFO_REDUC_DEF (def) = phi_info; > gimple_match_op op; > if (!gimple_extract_op (vdef->stmt, &op)) > @@ -7579,9 +7810,16 @@ vectorizable_reduction (loop_vec_info loop_vinfo, > return false; > } > } > - else if (!stmt_info) > - /* First non-conversion stmt. */ > - stmt_info = vdef; > + else > + { > + /* First non-conversion stmt. */ > + if (!stmt_info) > + stmt_info = vdef; > + > + if (!lane_reducing_op_p (op.code)) > + only_lane_reducing = false; > + } > + > reduc_def = op.ops[STMT_VINFO_REDUC_IDX (vdef)]; > reduc_chain_length++; > if (!stmt_info && slp_node) > @@ -7643,9 +7881,7 @@ vectorizable_reduction (loop_vec_info loop_vinfo, > gimple_match_op op; > if (!gimple_extract_op (stmt_info->stmt, &op)) > gcc_unreachable (); > - bool lane_reduc_code_p = (op.code == DOT_PROD_EXPR > - || op.code == WIDEN_SUM_EXPR > - || op.code == SAD_EXPR); > + bool lane_reducing = lane_reducing_op_p (op.code); > > if (!POINTER_TYPE_P (op.type) && !INTEGRAL_TYPE_P (op.type) > && !SCALAR_FLOAT_TYPE_P (op.type)) > @@ -7655,23 +7891,11 @@ vectorizable_reduction (loop_vec_info loop_vinfo, > if (!type_has_mode_precision_p (op.type)) > return false; > > - /* For lane-reducing ops we're reducing the number of reduction PHIs > - which means the only use of that may be in the lane-reducing operation. > */ > - if (lane_reduc_code_p > - && reduc_chain_length != 1 > - && !only_slp_reduc_chain) > - { > - if (dump_enabled_p ()) > - dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, > - "lane-reducing reduction with extra stmts.\n"); > - return false; > - } > - > /* Lane-reducing ops also never can be used in a SLP reduction group > since we'll mix lanes belonging to different reductions. But it's > OK to use them in a reduction chain or when the reduction group > has just one element. */ > - if (lane_reduc_code_p > + if (lane_reducing > && slp_node > && !REDUC_GROUP_FIRST_ELEMENT (stmt_info) > && SLP_TREE_LANES (slp_node) > 1) > @@ -7710,9 +7934,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo, > "use not simple.\n"); > return false; > } > - if (i == STMT_VINFO_REDUC_IDX (stmt_info)) > - continue; > - > /* For an IFN_COND_OP we might hit the reduction definition operand > twice (once as definition, once as else). */ > if (op.ops[i] == op.ops[STMT_VINFO_REDUC_IDX (stmt_info)]) > @@ -7731,7 +7952,7 @@ vectorizable_reduction (loop_vec_info loop_vinfo, > /* To properly compute ncopies we are interested in the widest > non-reduction input type in case we're looking at a widening > accumulation that we later handle in vect_transform_reduction. 
*/ > - if (lane_reduc_code_p > + if (lane_reducing > && vectype_op[i] > && (!vectype_in > || (GET_MODE_SIZE (SCALAR_TYPE_MODE (TREE_TYPE (vectype_in))) > @@ -7758,12 +7979,21 @@ vectorizable_reduction (loop_vec_info loop_vinfo, > } > if (!vectype_in) > vectype_in = STMT_VINFO_VECTYPE (phi_info); > - STMT_VINFO_REDUC_VECTYPE_IN (reduc_info) = vectype_in; > > - enum vect_reduction_type v_reduc_type = STMT_VINFO_REDUC_TYPE (phi_info); > - STMT_VINFO_REDUC_TYPE (reduc_info) = v_reduc_type; > + /* If there is a normal (non-lane-reducing) operation in the loop reduction > + path, to ensure there will be enough copies to hold vectorized results > of > + the operation, we need set the input vectype of the reduction PHI to be > + same as the reduction output vectype somewhere, here is a suitable > place. > + Otherwise the input vectype is set to the one with the least lanes, > which > + can only be determined in vectorizable analysis routine of lane-reducing > + operation. */ > + if (!only_lane_reducing) > + STMT_VINFO_REDUC_VECTYPE_IN (reduc_info) = STMT_VINFO_VECTYPE (phi_info); > + > + enum vect_reduction_type reduction_type = STMT_VINFO_REDUC_TYPE (phi_info); > + STMT_VINFO_REDUC_TYPE (reduc_info) = reduction_type; > /* If we have a condition reduction, see if we can simplify it further. */ > - if (v_reduc_type == COND_REDUCTION) > + if (reduction_type == COND_REDUCTION) > { > if (slp_node) > return false; > @@ -7929,8 +8159,8 @@ vectorizable_reduction (loop_vec_info loop_vinfo, > } > > STMT_VINFO_REDUC_CODE (reduc_info) = orig_code; > + reduction_type = STMT_VINFO_REDUC_TYPE (reduc_info); > > - vect_reduction_type reduction_type = STMT_VINFO_REDUC_TYPE (reduc_info); > if (reduction_type == TREE_CODE_REDUCTION) > { > /* Check whether it's ok to change the order of the computation. > @@ -8204,14 +8434,11 @@ vectorizable_reduction (loop_vec_info loop_vinfo, > && loop_vinfo->suggested_unroll_factor == 1) > single_defuse_cycle = true; > > - if (single_defuse_cycle || lane_reduc_code_p) > + if (single_defuse_cycle && !lane_reducing) > { > gcc_assert (op.code != COND_EXPR); > > - /* 4. Supportable by target? */ > - bool ok = true; > - > - /* 4.1. check support for the operation in the loop > + /* 4. check support for the operation in the loop > > This isn't necessary for the lane reduction codes, since they > can only be produced by pattern matching, and it's up to the > @@ -8220,14 +8447,13 @@ vectorizable_reduction (loop_vec_info loop_vinfo, > mixed-sign dot-products can be implemented using signed > dot-products. */ > machine_mode vec_mode = TYPE_MODE (vectype_in); > - if (!lane_reduc_code_p > - && !directly_supported_p (op.code, vectype_in, optab_vector)) > + if (!directly_supported_p (op.code, vectype_in, optab_vector)) > { > if (dump_enabled_p ()) > dump_printf (MSG_NOTE, "op not supported by target.\n"); > if (maybe_ne (GET_MODE_SIZE (vec_mode), UNITS_PER_WORD) > || !vect_can_vectorize_without_simd_p (op.code)) > - ok = false; > + single_defuse_cycle = false; > else > if (dump_enabled_p ()) > dump_printf (MSG_NOTE, "proceeding using word mode.\n"); > @@ -8240,35 +8466,12 @@ vectorizable_reduction (loop_vec_info loop_vinfo, > dump_printf (MSG_NOTE, "using word mode not possible.\n"); > return false; > } > - > - /* lane-reducing operations have to go through > vect_transform_reduction. > - For the other cases try without the single cycle optimization. 
*/ > - if (!ok) > - { > - if (lane_reduc_code_p) > - return false; > - else > - single_defuse_cycle = false; > - } > } > STMT_VINFO_FORCE_SINGLE_CYCLE (reduc_info) = single_defuse_cycle; > > - /* If the reduction stmt is one of the patterns that have lane > - reduction embedded we cannot handle the case of ! single_defuse_cycle. > */ > - if ((ncopies > 1 && ! single_defuse_cycle) > - && lane_reduc_code_p) > - { > - if (dump_enabled_p ()) > - dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, > - "multi def-use cycle not possible for lane-reducing " > - "reduction operation\n"); > - return false; > - } > - > - if (slp_node > - && !(!single_defuse_cycle > - && !lane_reduc_code_p > - && reduction_type != FOLD_LEFT_REDUCTION)) > + /* Reduction type of lane-reducing operation is TREE_CODE_REDUCTION, the > + below processing will be done in its own vectorizable function. */ > + if (slp_node && reduction_type == FOLD_LEFT_REDUCTION) > for (i = 0; i < (int) op.num_ops; i++) > if (!vect_maybe_update_slp_op_vectype (slp_op[i], vectype_op[i])) > { > @@ -8278,36 +8481,24 @@ vectorizable_reduction (loop_vec_info loop_vinfo, > return false; > } > > - if (slp_node) > - vec_num = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node); > - else > - vec_num = 1; > - > vect_model_reduction_cost (loop_vinfo, stmt_info, reduc_fn, > reduction_type, ncopies, cost_vec); > /* Cost the reduction op inside the loop if transformed via > - vect_transform_reduction. Otherwise this is costed by the > - separate vectorizable_* routines. */ > - if (single_defuse_cycle || lane_reduc_code_p) > - { > - int factor = 1; > - if (vect_is_emulated_mixed_dot_prod (loop_vinfo, stmt_info)) > - /* Three dot-products and a subtraction. */ > - factor = 4; > - record_stmt_cost (cost_vec, ncopies * factor, vector_stmt, > - stmt_info, 0, vect_body); > - } > + vect_transform_reduction for non-lane-reducing operation. Otherwise > + this is costed by the separate vectorizable_* routines. */ > + if (single_defuse_cycle && !lane_reducing) > + record_stmt_cost (cost_vec, ncopies, vector_stmt, stmt_info, 0, > vect_body); > > if (dump_enabled_p () > && reduction_type == FOLD_LEFT_REDUCTION) > dump_printf_loc (MSG_NOTE, vect_location, > "using an in-order (fold-left) reduction.\n"); > STMT_VINFO_TYPE (orig_stmt_of_analysis) = cycle_phi_info_type; > - /* All but single defuse-cycle optimized, lane-reducing and fold-left > - reductions go through their own vectorizable_* routines. */ > - if (!single_defuse_cycle > - && !lane_reduc_code_p > - && reduction_type != FOLD_LEFT_REDUCTION) > + > + /* All but single defuse-cycle optimized and fold-left reductions go > + through their own vectorizable_* routines. 
*/ > + if ((!single_defuse_cycle && reduction_type != FOLD_LEFT_REDUCTION) > + || lane_reducing) > { > stmt_vec_info tem > = vect_stmt_to_vectorize (STMT_VINFO_REDUC_DEF (phi_info)); > @@ -8319,60 +8510,10 @@ vectorizable_reduction (loop_vec_info loop_vinfo, > STMT_VINFO_DEF_TYPE (vect_orig_stmt (tem)) = vect_internal_def; > STMT_VINFO_DEF_TYPE (tem) = vect_internal_def; > } > - else if (loop_vinfo && LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)) > - { > - vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo); > - vec_loop_lens *lens = &LOOP_VINFO_LENS (loop_vinfo); > - internal_fn cond_fn = get_conditional_internal_fn (op.code, op.type); > - > - if (reduction_type != FOLD_LEFT_REDUCTION > - && !use_mask_by_cond_expr_p (op.code, cond_fn, vectype_in) > - && (cond_fn == IFN_LAST > - || !direct_internal_fn_supported_p (cond_fn, vectype_in, > - OPTIMIZE_FOR_SPEED))) > - { > - if (dump_enabled_p ()) > - dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, > - "can't operate on partial vectors because" > - " no conditional operation is available.\n"); > - LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false; > - } > - else if (reduction_type == FOLD_LEFT_REDUCTION > - && reduc_fn == IFN_LAST > - && !expand_vec_cond_expr_p (vectype_in, > - truth_type_for (vectype_in), > - SSA_NAME)) > - { > - if (dump_enabled_p ()) > - dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, > - "can't operate on partial vectors because" > - " no conditional operation is available.\n"); > - LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false; > - } > - else if (reduction_type == FOLD_LEFT_REDUCTION > - && internal_fn_mask_index (reduc_fn) == -1 > - && FLOAT_TYPE_P (vectype_in) > - && HONOR_SIGN_DEPENDENT_ROUNDING (vectype_in)) > - { > - if (dump_enabled_p ()) > - dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, > - "can't operate on partial vectors because" > - " signed zeros cannot be preserved.\n"); > - LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false; > - } > - else > - { > - internal_fn mask_reduc_fn > - = get_masked_reduction_fn (reduc_fn, vectype_in); > + else > + vect_reduction_use_partial_vector (loop_vinfo, reduc_info, slp_node, > + op.code, op.type, vectype_in); > > - if (mask_reduc_fn == IFN_MASK_LEN_FOLD_LEFT_PLUS) > - vect_record_loop_len (loop_vinfo, lens, ncopies * vec_num, > - vectype_in, 1); > - else > - vect_record_loop_mask (loop_vinfo, masks, ncopies * vec_num, > - vectype_in, NULL); > - } > - } > return true; > } > > @@ -8463,6 +8604,7 @@ vect_transform_reduction (loop_vec_info loop_vinfo, > class loop *loop = LOOP_VINFO_LOOP (loop_vinfo); > int i; > int ncopies; > + int stmt_ncopies; > int vec_num; > > stmt_vec_info reduc_info = info_for_reduction (loop_vinfo, stmt_info); > @@ -8486,15 +8628,28 @@ vect_transform_reduction (loop_vec_info loop_vinfo, > gphi *reduc_def_phi = as_a <gphi *> (phi_info->stmt); > int reduc_index = STMT_VINFO_REDUC_IDX (stmt_info); > tree vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (reduc_info); > + tree stmt_vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (stmt_info); > + > + /* Get input vectypes from the reduction PHI and the statement to be > + transformed, these two vectypes may have different lanes when > + lane-reducing operation is present. 
*/ > + if (!vectype_in) > + vectype_in = STMT_VINFO_REDUC_VECTYPE (reduc_info); > + > + if (!stmt_vectype_in) > + stmt_vectype_in = STMT_VINFO_VECTYPE (stmt_info); > > if (slp_node) > { > ncopies = 1; > + stmt_ncopies = 1; > vec_num = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node); > } > else > { > ncopies = vect_get_num_copies (loop_vinfo, vectype_in); > + stmt_ncopies = vect_get_num_copies (loop_vinfo, stmt_vectype_in); > + gcc_assert (stmt_ncopies >= 1 && stmt_ncopies <= ncopies); > vec_num = 1; > } > > @@ -8503,14 +8658,10 @@ vect_transform_reduction (loop_vec_info loop_vinfo, > > vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo); > vec_loop_lens *lens = &LOOP_VINFO_LENS (loop_vinfo); > - bool mask_by_cond_expr = use_mask_by_cond_expr_p (code, cond_fn, > vectype_in); > - > + bool mask_by_cond_expr = use_mask_by_cond_expr_p (code, cond_fn, > + stmt_vectype_in); > /* Transform. */ > - tree new_temp = NULL_TREE; > - auto_vec<tree> vec_oprnds0; > - auto_vec<tree> vec_oprnds1; > - auto_vec<tree> vec_oprnds2; > - tree def0; > + auto_vec<tree> vec_oprnds[3]; > > if (dump_enabled_p ()) > dump_printf_loc (MSG_NOTE, vect_location, "transform reduction.\n"); > @@ -8534,8 +8685,6 @@ vect_transform_reduction (loop_vec_info loop_vinfo, > == op.ops[internal_fn_else_index ((internal_fn) > code)])); > } > > - bool masked_loop_p = LOOP_VINFO_FULLY_MASKED_P (loop_vinfo); > - > vect_reduction_type reduction_type = STMT_VINFO_REDUC_TYPE (reduc_info); > if (reduction_type == FOLD_LEFT_REDUCTION) > { > @@ -8543,69 +8692,172 @@ vect_transform_reduction (loop_vec_info loop_vinfo, > gcc_assert (code.is_tree_code () || cond_fn_p); > return vectorize_fold_left_reduction > (loop_vinfo, stmt_info, gsi, vec_stmt, slp_node, reduc_def_phi, > - code, reduc_fn, op.ops, op.num_ops, vectype_in, > + code, reduc_fn, op.ops, op.num_ops, stmt_vectype_in, > reduc_index, masks, lens); > } > > bool single_defuse_cycle = STMT_VINFO_FORCE_SINGLE_CYCLE (reduc_info); > - gcc_assert (single_defuse_cycle > - || code == DOT_PROD_EXPR > - || code == WIDEN_SUM_EXPR > - || code == SAD_EXPR); > + bool lane_reducing = lane_reducing_op_p (code); > + > + gcc_assert (single_defuse_cycle || lane_reducing); > > /* Create the destination vector */ > tree scalar_dest = gimple_get_lhs (stmt_info->stmt); > tree vec_dest = vect_create_destination_var (scalar_dest, vectype_out); > > - /* Get NCOPIES vector definitions for all operands except the reduction > - definition. */ > - if (!cond_fn_p) > + gcc_assert (reduc_index < 3); > + > + if (slp_node) > { > - vect_get_vec_defs (loop_vinfo, stmt_info, slp_node, ncopies, > - single_defuse_cycle && reduc_index == 0 > - ? NULL_TREE : op.ops[0], &vec_oprnds0, > - single_defuse_cycle && reduc_index == 1 > - ? NULL_TREE : op.ops[1], &vec_oprnds1, > - op.num_ops == 3 > - && !(single_defuse_cycle && reduc_index == 2) > - ? op.ops[2] : NULL_TREE, &vec_oprnds2); > + gcc_assert (!single_defuse_cycle && op.num_ops <= 3); > + > + for (i = 0; i < (int) op.num_ops; i++) > + vect_get_slp_defs (SLP_TREE_CHILDREN (slp_node)[i], &vec_oprnds[i]); > } > else > { > - /* For a conditional operation pass the truth type as mask > - vectype. */ > - gcc_assert (single_defuse_cycle > - && (reduc_index == 1 || reduc_index == 2)); > - vect_get_vec_defs (loop_vinfo, stmt_info, slp_node, ncopies, > - op.ops[0], truth_type_for (vectype_in), &vec_oprnds0, > - reduc_index == 1 ? NULL_TREE : op.ops[1], > - NULL_TREE, &vec_oprnds1, > - reduc_index == 2 ? 
NULL_TREE : op.ops[2], > - NULL_TREE, &vec_oprnds2); > - } > + int result_pos = 0; > > - /* For single def-use cycles get one copy of the vectorized reduction > - definition. */ > - if (single_defuse_cycle) > - { > - gcc_assert (!slp_node); > - vect_get_vec_defs_for_operand (loop_vinfo, stmt_info, 1, > - op.ops[reduc_index], > - reduc_index == 0 ? &vec_oprnds0 > - : (reduc_index == 1 ? &vec_oprnds1 > - : &vec_oprnds2)); > + /* The input vectype of the reduction PHI determines copies of > + vectorized def-use cycles, which might be more than effective copies > + of vectorized lane-reducing reduction statements. This could be > + complemented by generating extra trivial pass-through copies. For > + example: > + > + int sum = 0; > + for (i) > + { > + sum += d0[i] * d1[i]; // dot-prod <vector(16) char> > + sum += abs(s0[i] - s1[i]); // sad <vector(8) short> > + sum += n[i]; // normal <vector(4) int> > + } > + > + The vector size is 128-bit,vectorization factor is 16. Reduction > + statements would be transformed as: > + > + vector<4> int sum_v0 = { 0, 0, 0, 0 }; > + vector<4> int sum_v1 = { 0, 0, 0, 0 }; > + vector<4> int sum_v2 = { 0, 0, 0, 0 }; > + vector<4> int sum_v3 = { 0, 0, 0, 0 }; > + > + for (i / 16) > + { > + sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0); > + sum_v1 = sum_v1; // copy > + sum_v2 = sum_v2; // copy > + sum_v3 = sum_v3; // copy > + > + sum_v0 = sum_v0; // copy > + sum_v1 = SAD (s0_v1[i: 0 ~ 7 ], s1_v1[i: 0 ~ 7 ], sum_v1); > + sum_v2 = SAD (s0_v2[i: 8 ~ 15], s1_v2[i: 8 ~ 15], sum_v2); > + sum_v3 = sum_v3; // copy > + > + sum_v0 += n_v0[i: 0 ~ 3 ]; > + sum_v1 += n_v1[i: 4 ~ 7 ]; > + sum_v2 += n_v2[i: 8 ~ 11]; > + sum_v3 += n_v3[i: 12 ~ 15]; > + } > + > + Moreover, for a higher instruction parallelism in final vectorized > + loop, it is considered to make those effective vectorized > + lane-reducing statements be distributed evenly among all def-use > + cycles. In the above example, SADs are generated into other cycles > + rather than that of DOT_PROD. */ > + > + if (stmt_ncopies < ncopies) > + { > + gcc_assert (lane_reducing); > + result_pos = reduc_info->reduc_result_pos; > + reduc_info->reduc_result_pos = (result_pos + stmt_ncopies) % > ncopies; > + gcc_assert (result_pos >= 0 && result_pos < ncopies); > + } > + > + for (i = 0; i < MIN (3, (int) op.num_ops); i++) > + { > + tree vectype = NULL_TREE; > + int used_ncopies = ncopies; > + > + if (cond_fn_p && i == 0) > + { > + /* For a conditional operation pass the truth type as mask > + vectype. */ > + gcc_assert (single_defuse_cycle && reduc_index > 0); > + vectype = truth_type_for (vectype_in); > + } > + > + if (i != reduc_index) > + { > + /* For non-reduction operand, deduce effictive copies that are > + involved in vectorized def-use cycles based on the input > + vectype of the reduction statement. */ > + used_ncopies = stmt_ncopies; > + } > + else if (single_defuse_cycle) > + { > + /* For single def-use cycles get one copy of the vectorized > + reduction definition. */ > + used_ncopies = 1; > + } > + > + vect_get_vec_defs_for_operand (loop_vinfo, stmt_info, used_ncopies, > + op.ops[i], &vec_oprnds[i], vectype); > + > + if (used_ncopies < ncopies) > + { > + vec_oprnds[i].safe_grow_cleared (ncopies); > + > + /* Find suitable def-use cycles to generate vectorized > + statements into, and reorder operands based on the > + selection. 
*/ > + if (i != reduc_index && result_pos) > + { > + int count = ncopies - used_ncopies; > + int start = result_pos - count; > + > + if (start < 0) > + { > + count = result_pos; > + start = 0; > + } > + > + for (int j = used_ncopies - 1; j >= start; j--) > + { > + std::swap (vec_oprnds[i][j], vec_oprnds[i][j + count]); > + gcc_assert (!vec_oprnds[i][j]); > + } > + } > + } > + } > } > > - bool emulated_mixed_dot_prod > - = vect_is_emulated_mixed_dot_prod (loop_vinfo, stmt_info); > - FOR_EACH_VEC_ELT (vec_oprnds0, i, def0) > + bool masked_loop_p = LOOP_VINFO_FULLY_MASKED_P (loop_vinfo); > + bool emulated_mixed_dot_prod = vect_is_emulated_mixed_dot_prod (stmt_info); > + tree def0; > + > + FOR_EACH_VEC_ELT (vec_oprnds[0], i, def0) > { > gimple *new_stmt; > - tree vop[3] = { def0, vec_oprnds1[i], NULL_TREE }; > - if (masked_loop_p && !mask_by_cond_expr) > + tree new_temp = NULL_TREE; > + tree vop[3] = { def0, vec_oprnds[1][i], NULL_TREE }; > + > + if (!vop[0] || !vop[1]) > { > - /* No conditional ifns have been defined for dot-product yet. */ > - gcc_assert (code != DOT_PROD_EXPR); > + tree reduc_vop = vec_oprnds[reduc_index][i]; > + > + /* Insert trivial copy if no need to generate vectorized > + statement. */ > + gcc_assert (reduc_vop && stmt_ncopies < ncopies); > + > + new_stmt = gimple_build_assign (vec_dest, reduc_vop); > + new_temp = make_ssa_name (vec_dest, new_stmt); > + gimple_set_lhs (new_stmt, new_temp); > + vect_finish_stmt_generation (loop_vinfo, stmt_info, new_stmt, gsi); > + } > + else if (masked_loop_p && !mask_by_cond_expr) > + { > + /* No conditional ifns have been defined for dot-product and sad > + yet. */ > + gcc_assert (code != DOT_PROD_EXPR && code != SAD_EXPR); > > /* Make sure that the reduction accumulator is vop[0]. */ > if (reduc_index == 1) > @@ -8614,7 +8866,8 @@ vect_transform_reduction (loop_vec_info loop_vinfo, > std::swap (vop[0], vop[1]); > } > tree mask = vect_get_loop_mask (loop_vinfo, gsi, masks, > - vec_num * ncopies, vectype_in, i); > + vec_num * stmt_ncopies, > + stmt_vectype_in, i); > gcall *call = gimple_build_call_internal (cond_fn, 4, mask, > vop[0], vop[1], vop[0]); > new_temp = make_ssa_name (vec_dest, call); > @@ -8626,12 +8879,13 @@ vect_transform_reduction (loop_vec_info loop_vinfo, > else > { > if (op.num_ops >= 3) > - vop[2] = vec_oprnds2[i]; > + vop[2] = vec_oprnds[2][i]; > > if (masked_loop_p && mask_by_cond_expr) > { > tree mask = vect_get_loop_mask (loop_vinfo, gsi, masks, > - vec_num * ncopies, vectype_in, > i); > + vec_num * stmt_ncopies, > + stmt_vectype_in, i); > build_vect_cond_expr (code, vop, mask, gsi); > } > > @@ -8658,16 +8912,8 @@ vect_transform_reduction (loop_vec_info loop_vinfo, > > if (slp_node) > slp_node->push_vec_def (new_stmt); > - else if (single_defuse_cycle > - && i < ncopies - 1) > - { > - if (reduc_index == 0) > - vec_oprnds0.safe_push (gimple_get_lhs (new_stmt)); > - else if (reduc_index == 1) > - vec_oprnds1.safe_push (gimple_get_lhs (new_stmt)); > - else if (reduc_index == 2) > - vec_oprnds2.safe_push (gimple_get_lhs (new_stmt)); > - } > + else if (single_defuse_cycle && i < ncopies - 1) > + vec_oprnds[reduc_index][i + 1] = gimple_get_lhs (new_stmt); > else > STMT_VINFO_VEC_STMTS (stmt_info).safe_push (new_stmt); > } > diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc > index c7ed520b629..5713e32f545 100644 > --- a/gcc/tree-vect-slp.cc > +++ b/gcc/tree-vect-slp.cc > @@ -3924,9 +3924,7 @@ vect_analyze_slp (vec_info *vinfo, unsigned > max_tree_size) > /* Do not discover SLP reductions for lane-reducing ops, 
> that > will fail later. */ > && (!(g = dyn_cast <gassign *> (STMT_VINFO_STMT > (next_info))) > - || (gimple_assign_rhs_code (g) != DOT_PROD_EXPR > - && gimple_assign_rhs_code (g) != WIDEN_SUM_EXPR > - && gimple_assign_rhs_code (g) != SAD_EXPR))) > + || !lane_reducing_op_p (gimple_assign_rhs_code (g)))) > scalar_stmts.quick_push (next_info); > } > if (scalar_stmts.length () > 1) > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc > index 4219ad832db..41ee3051756 100644 > --- a/gcc/tree-vect-stmts.cc > +++ b/gcc/tree-vect-stmts.cc > @@ -12096,11 +12096,20 @@ vectorizable_condition (vec_info *vinfo, > vect_reduction_type reduction_type = TREE_CODE_REDUCTION; > bool for_reduction > = STMT_VINFO_REDUC_DEF (vect_orig_stmt (stmt_info)) != NULL; > + if (for_reduction) > + { > + reduc_info = info_for_reduction (vinfo, stmt_info); > + if (STMT_VINFO_REDUC_DEF (reduc_info) != vect_orig_stmt (stmt_info)) > + { > + for_reduction = false; > + reduc_info = NULL; > + } > + } > + > if (for_reduction) > { > if (slp_node) > return false; > - reduc_info = info_for_reduction (vinfo, stmt_info); > reduction_type = STMT_VINFO_REDUC_TYPE (reduc_info); > reduc_index = STMT_VINFO_REDUC_IDX (stmt_info); > gcc_assert (reduction_type != EXTRACT_LAST_REDUCTION > @@ -13289,6 +13298,8 @@ vect_analyze_stmt (vec_info *vinfo, > NULL, NULL, node, cost_vec) > || vectorizable_load (vinfo, stmt_info, NULL, NULL, node, cost_vec) > || vectorizable_store (vinfo, stmt_info, NULL, NULL, node, cost_vec) > + || vectorizable_lane_reducing (as_a <loop_vec_info> (vinfo), > + stmt_info, node, cost_vec) > || vectorizable_reduction (as_a <loop_vec_info> (vinfo), stmt_info, > node, node_instance, cost_vec) > || vectorizable_induction (as_a <loop_vec_info> (vinfo), stmt_info, > diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h > index 93bc30ef660..392fff5b799 100644 > --- a/gcc/tree-vectorizer.h > +++ b/gcc/tree-vectorizer.h > @@ -1399,6 +1399,12 @@ public: > /* The vector type for performing the actual reduction. */ > tree reduc_vectype; > > + /* For loop reduction with multiple vectorized results (ncopies > 1), a > + lane-reducing operation participating in it may not use all of those > + results, this field specifies result index starting from which any > + following land-reducing operation would be assigned to. */ > + int reduc_result_pos; > + > /* If IS_REDUC_INFO is true and if the vector code is performing > N scalar reductions in parallel, this variable gives the initial > scalar values of those N reductions. */ > @@ -2166,6 +2172,12 @@ vect_apply_runtime_profitability_check_p > (loop_vec_info loop_vinfo) > && th >= vect_vf_for_cost (loop_vinfo)); > } > > +inline bool > +lane_reducing_op_p (code_helper code) > +{ > + return code == DOT_PROD_EXPR || code == WIDEN_SUM_EXPR || code == SAD_EXPR; > +} > + > /* Source location + hotness information. 
*/ > extern dump_user_location_t vect_location; > > @@ -2434,6 +2446,8 @@ extern loop_vec_info vect_create_loop_vinfo (class loop > *, vec_info_shared *, > extern bool vectorizable_live_operation (vec_info *, stmt_vec_info, > slp_tree, slp_instance, int, > bool, stmt_vector_for_cost *); > +extern bool vectorizable_lane_reducing (loop_vec_info, stmt_vec_info, > + slp_tree, stmt_vector_for_cost *); > extern bool vectorizable_reduction (loop_vec_info, stmt_vec_info, > slp_tree, slp_instance, > stmt_vector_for_cost *); > > ________________________________________ > From: Feng Xue OS <f...@os.amperecomputing.com> > Sent: Sunday, April 7, 2024 2:59 PM > To: Richard Biener > Cc: gcc-patches@gcc.gnu.org > Subject: [PATCH] vect: Support multiple lane-reducing operations for loop > reduction [PR114440] > > For lane-reducing operation(dot-prod/widen-sum/sad) in loop reduction, current > vectorizer could only handle the pattern if the reduction chain does not > contain other operation, no matter the other is normal or lane-reducing. > > Acctually, to allow multiple arbitray lane-reducing operations, we need to > support vectorization of loop reduction chain with mixed input vectypes. Since > lanes of vectype may vary with operation, the effective ncopies of vectorized > statements for operation also may not be same to each other, this causes > mismatch on vectorized def-use cycles. A simple way is to align all operations > with the one that has the most ncopies, the gap could be complemented by > generating extra trival pass-through copies. For example: > > int sum = 0; > for (i) > { > sum += d0[i] * d1[i]; // dot-prod <vector(16) char> > sum += w[i]; // widen-sum <vector(16) char> > sum += abs(s0[i] - s1[i]); // sad <vector(8) short> > sum += n[i]; // normal <vector(4) int> > } > > The vector size is 128-bit,vectorization factor is 16. Reduction statements > would be transformed as: > > vector<4> int sum_v0 = { 0, 0, 0, 0 }; > vector<4> int sum_v1 = { 0, 0, 0, 0 }; > vector<4> int sum_v2 = { 0, 0, 0, 0 }; > vector<4> int sum_v3 = { 0, 0, 0, 0 }; > > for (i / 16) > { > sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0); > sum_v1 = sum_v1; // copy > sum_v2 = sum_v2; // copy > sum_v3 = sum_v3; // copy > > sum_v0 = sum_v0; // copy > sum_v1 = WIDEN_SUM (w_v1[i: 0 ~ 15], sum_v1); > sum_v2 = sum_v2; // copy > sum_v3 = sum_v3; // copy > > sum_v0 = sum_v0; // copy > sum_v1 = sum_v1; // copy > sum_v2 = SAD (s0_v2[i: 0 ~ 7 ], s1_v2[i: 0 ~ 7 ], sum_v2); > sum_v3 = SAD (s0_v3[i: 8 ~ 15], s1_v3[i: 8 ~ 15], sum_v3); > > sum_v0 += n_v0[i: 0 ~ 3 ]; > sum_v1 += n_v1[i: 4 ~ 7 ]; > sum_v2 += n_v2[i: 8 ~ 11]; > sum_v3 += n_v3[i: 12 ~ 15]; > } > > Moreover, for a higher instruction parallelism in final vectorized loop, it > is considered to make those effective vectorized lane-reducing statements be > distributed evenly among all def-use cycles. In the above example, DOT_PROD, > WIDEN_SUM and SADs are generated into disparate cycles. > > Bootstrapped/regtested on x86_64-linux and aarch64-linux. > > Feng
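
P.S. To make the lane-grouping remark above concrete, here is a minimal
scalar sketch. It is not part of the patch: the 16-char-to-4-int shapes and
the particular "adjacent" vs. "strided" groupings are only assumptions for
illustration, since which lanes a lane-reducing op folds into which output
lane is unspecified.

#include <stdio.h>
#include <stdlib.h>

#define IN_LANES 16
#define OUT_LANES 4

/* DOT_PROD-style grouping: fold adjacent input lanes 0-3, 4-7, ...
   into one accumulator lane.  */
static void
reduce_adjacent (const signed char *a, const signed char *b, int *acc)
{
  for (int lane = 0; lane < IN_LANES; lane++)
    acc[lane / (IN_LANES / OUT_LANES)] += a[lane] * b[lane];
}

/* SAD-style grouping (hypothetical): fold strided input lanes
   0,4,8,12; 1,5,9,13; ... into one accumulator lane.  */
static void
reduce_strided (const signed char *s0, const signed char *s1, int *acc)
{
  for (int lane = 0; lane < IN_LANES; lane++)
    acc[lane % OUT_LANES] += abs (s0[lane] - s1[lane]);
}

int
main (void)
{
  signed char a[IN_LANES], b[IN_LANES], s0[IN_LANES], s1[IN_LANES];
  int acc[OUT_LANES] = { 0, 0, 0, 0 };

  for (int i = 0; i < IN_LANES; i++)
    {
      a[i] = i;
      b[i] = 2 * i;
      s0[i] = i;
      s1[i] = IN_LANES - i;
    }

  /* Both groupings feed the same accumulator vector.  */
  reduce_adjacent (a, b, acc);
  reduce_strided (s0, s1, acc);

  int sum = 0;
  for (int i = 0; i < OUT_LANES; i++)
    sum += acc[i];
  printf ("final reduction = %d\n", sum);
  return 0;
}

Because the plus-reduction is associative and commutative, the per-lane
partial sums differ between the two groupings but the final horizontal sum
does not, which is why mixing them in one accumulator is no worse than what
a single lane-reducing op already does.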