Hi,

Thanks for the suggestion.  I tested it and hit a gcc_assert failure:

  gcc.target/aarch64/sve/cost_model_13.c
  (internal compiler error: in info_for_reduction, at tree-vect-loop.cc:5473)

It is caused by an empty STMT_VINFO_REDUC_DEF, so I added an extra check for
that before checking single_defuse_cycle.
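Spelled out, the reduction-latency logic in count_ops now has the following
shape (this just restates the hunk in the patch below, with comments added;
the names are the existing parameters and locals of count_ops):

  if (STMT_VINFO_REDUC_DEF (stmt_info)
      && STMT_VINFO_FORCE_SINGLE_CYCLE (info_for_reduction (m_vinfo, stmt_info)))
    /* Single def-use cycle: all COUNT copies are serialized through the
       accumulator, so their latencies add up.  */
    ops->reduction_latency = MAX (ops->reduction_latency, base * count);
  else
    /* Otherwise the copies' reductions can in principle complete in
       parallel, so only one BASE latency is loop-carried.  */
    ops->reduction_latency = MAX (ops->reduction_latency, base);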
The updated patch is below.  Is it OK for trunk?

---

The new costs should only multiply the reduction latency by COUNT for
single_defuse_cycle.  In other situations, multiplying increases the
reduction latency a lot and makes us miss vectorization opportunities.

Tested on aarch64-linux-gnu.

gcc/ChangeLog:

	PR target/110625
	* config/aarch64/aarch64.cc (count_ops): Only '* count' for
	single_defuse_cycle while counting reduction_latency.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/pr110625_1.c: New testcase.
	* gcc.target/aarch64/pr110625_2.c: New testcase.
---
 gcc/config/aarch64/aarch64.cc                 | 13 ++++--
 gcc/testsuite/gcc.target/aarch64/pr110625_1.c | 46 +++++++++++++++++++
 gcc/testsuite/gcc.target/aarch64/pr110625_2.c | 14 ++++++
 3 files changed, 69 insertions(+), 4 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/pr110625_1.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/pr110625_2.c

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 560e5431636..478a4e00110 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -16788,10 +16788,15 @@ aarch64_vector_costs::count_ops (unsigned int count, vect_cost_for_stmt kind,
     {
       unsigned int base
	 = aarch64_in_loop_reduction_latency (m_vinfo, stmt_info, m_vec_flags);
-
-      /* ??? Ideally we'd do COUNT reductions in parallel, but unfortunately
-	 that's not yet the case.  */
-      ops->reduction_latency = MAX (ops->reduction_latency, base * count);
+      if (STMT_VINFO_REDUC_DEF (stmt_info)
+	  && STMT_VINFO_FORCE_SINGLE_CYCLE (
+	       info_for_reduction (m_vinfo, stmt_info)))
+	/* ??? Ideally we'd use a tree to reduce the copies down to 1 vector,
+	   and then accumulate that, but at the moment the loop-carried
+	   dependency includes all copies.  */
+	ops->reduction_latency = MAX (ops->reduction_latency, base * count);
+      else
+	ops->reduction_latency = MAX (ops->reduction_latency, base);
     }
 
   /* Assume that multiply-adds will become a single operation.  */
diff --git a/gcc/testsuite/gcc.target/aarch64/pr110625_1.c b/gcc/testsuite/gcc.target/aarch64/pr110625_1.c
new file mode 100644
index 00000000000..0965cac33a0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr110625_1.c
@@ -0,0 +1,46 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast -mcpu=neoverse-n2 -fdump-tree-vect-details -fno-tree-slp-vectorize" } */
+/* { dg-final { scan-tree-dump-not "reduction latency = 8" "vect" } } */
+
+/* Do not increase the vector body cost due to the incorrect reduction latency
+    Original vector body cost = 51
+    Scalar issue estimate:
+      ...
+      reduction latency = 2
+      estimated min cycles per iteration = 2.000000
+      estimated cycles per vector iteration (for VF 2) = 4.000000
+    Vector issue estimate:
+      ...
+      reduction latency = 8      <-- Too large
+      estimated min cycles per iteration = 8.000000
+    Increasing body cost to 102 because scalar code would issue more quickly
+      ...
+    missed:  cost model: the vector iteration cost = 102 divided by the scalar iteration cost = 44 is greater or equal to the vectorization factor = 2.
+    missed:  not vectorized: vectorization not profitable.  */
+
+typedef struct
+{
+  unsigned short m1, m2, m3, m4;
+} the_struct_t;
+typedef struct
+{
+  double m1, m2, m3, m4, m5;
+} the_struct2_t;
+
+double
+bar (the_struct2_t *);
+
+double
+foo (double *k, unsigned int n, the_struct_t *the_struct)
+{
+  unsigned int u;
+  the_struct2_t result;
+  for (u = 0; u < n; u++, k--)
+    {
+      result.m1 += (*k) * the_struct[u].m1;
+      result.m2 += (*k) * the_struct[u].m2;
+      result.m3 += (*k) * the_struct[u].m3;
+      result.m4 += (*k) * the_struct[u].m4;
+    }
+  return bar (&result);
+}
diff --git a/gcc/testsuite/gcc.target/aarch64/pr110625_2.c b/gcc/testsuite/gcc.target/aarch64/pr110625_2.c
new file mode 100644
index 00000000000..7a84aa8355e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr110625_2.c
@@ -0,0 +1,14 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast -mcpu=neoverse-n2 -fdump-tree-vect-details -fno-tree-slp-vectorize" } */
+/* { dg-final { scan-tree-dump "reduction latency = 8" "vect" } } */
+
+/* The reduction latency should be multiplied by the count for
+   single_defuse_cycle.  */
+
+long
+f (long res, short *ptr1, short *ptr2, int n)
+{
+  for (int i = 0; i < n; ++i)
+    res += (long) ptr1[i] << ptr2[i];
+  return res;
+}
-- 
2.34.1
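As a back-of-the-envelope check of the pr110625_1.c numbers, here is a tiny
standalone model of the old and new formulas (throwaway illustration, not
vectorizer code; base = 2 and count = 4 are inferred from the dump quoted in
the testcase):

  #include <stdio.h>

  #define MAX(a, b) ((a) > (b) ? (a) : (b))

  int
  main (void)
  {
    unsigned int base = 2;   /* latency of one in-loop reduction op  */
    unsigned int count = 4;  /* vector copies of the reduction stmt  */

    /* Old: every reduction was costed as if the copies were serialized.  */
    unsigned int old_latency = MAX (0, base * count);  /* = 8  */
    /* New: without a forced single def-use cycle, copies don't chain.  */
    unsigned int new_latency = MAX (0, base);          /* = 2  */

    printf ("old = %u, new = %u\n", old_latency, new_latency);
    return 0;
  }

With the old value of 8, the issue estimate makes the scalar loop look faster
(8.0 cycles per vector iteration against the scalar code's 4.0 for the same
work), so the vector body cost is doubled from 51 to 102 and vectorization is
rejected; the new value of 2 avoids that.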
________________________________________
From: Richard Sandiford <richard.sandif...@arm.com>
Sent: Monday, July 24, 2023 19:10
To: Hao Liu OS
Cc: GCC-patches@gcc.gnu.org
Subject: Re: [PATCH] AArch64: Do not increase the vect reduction latency by multiplying count [PR110625]

Hao Liu OS <h...@os.amperecomputing.com> writes:
> This only affects the new costs in aarch64 backend.  Currently, the
> reduction latency of vector body is too large as it is multiplied by stmt
> count.  As the scalar reduction latency is small, the new costs model may
> think "scalar code would issue more quickly" and increase the vector body
> cost a lot, which will miss vectorization opportunities.
>
> Tested by bootstrapping on aarch64-linux-gnu.
>
> gcc/ChangeLog:
>
>	PR target/110625
>	* config/aarch64/aarch64.cc (count_ops): Remove the '* count'
>	for reduction_latency.
>
> gcc/testsuite/ChangeLog:
>
>	* gcc.target/aarch64/pr110625.c: New testcase.
> ---
>  gcc/config/aarch64/aarch64.cc               |  5 +--
>  gcc/testsuite/gcc.target/aarch64/pr110625.c | 46 +++++++++++++++++++++
>  2 files changed, 47 insertions(+), 4 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/pr110625.c
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 560e5431636..27afa64b7d5 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -16788,10 +16788,7 @@ aarch64_vector_costs::count_ops (unsigned int count, vect_cost_for_stmt kind,
>      {
>        unsigned int base
>	  = aarch64_in_loop_reduction_latency (m_vinfo, stmt_info, m_vec_flags);
> -
> -      /* ??? Ideally we'd do COUNT reductions in parallel, but unfortunately
> -	 that's not yet the case.  */
> -      ops->reduction_latency = MAX (ops->reduction_latency, base * count);
> +      ops->reduction_latency = MAX (ops->reduction_latency, base);

The ??? is referring to the single_defuse_cycle code in
vectorizable_reduction.  E.g. consider:

  long
  f (long res, short *ptr1, short *ptr2, int n)
  {
    for (int i = 0; i < n; ++i)
      res += (long) ptr1[i] << ptr2[i];
    return res;
  }

compiled at -O3.
The main loop is:

        movi    v25.4s, 0                // init accumulator
        lsl     x5, x5, 4
        .p2align 3,,7
.L4:
        ldr     q31, [x1, x4]
        ldr     q29, [x2, x4]
        add     x4, x4, 16
        sxtl    v30.4s, v31.4h
        sxtl2   v31.4s, v31.8h
        sxtl    v28.4s, v29.4h
        sxtl2   v29.4s, v29.8h
        sxtl    v27.2d, v30.2s
        sxtl2   v30.2d, v30.4s
        sxtl    v23.2d, v28.2s
        sxtl2   v26.2d, v28.4s
        sxtl    v24.2d, v29.2s
        sxtl    v28.2d, v31.2s
        sshl    v27.2d, v27.2d, v23.2d
        sshl    v30.2d, v30.2d, v26.2d
        sxtl2   v31.2d, v31.4s
        sshl    v28.2d, v28.2d, v24.2d
        add     v27.2d, v27.2d, v25.2d   // v25 -> v27
        sxtl2   v29.2d, v29.4s
        add     v30.2d, v30.2d, v27.2d   // v27 -> v30
        sshl    v31.2d, v31.2d, v29.2d
        add     v30.2d, v28.2d, v30.2d   // v30 -> v30
        add     v25.2d, v31.2d, v30.2d   // v30 -> v25
        cmp     x4, x5
        bne     .L4

Here count is 4 and the latency is 4 additions (8 cycles).  So as a worst
case, the current cost is correct.

To remove the count in all cases, we would need to (1) avoid single def-use
cycles or (2) reassociate the reduction as a tree (4->2, 2->1, 1+acc->acc).

But looking again, it seems we do have the information to distinguish the
cases.  We can do something like:

  stmt_vec_info reduc_info = info_for_reduction (m_vinfo, stmt_info);
  if (STMT_VINFO_FORCE_SINGLE_CYCLE (reduc_info))
    /* ??? Ideally we'd use a tree to reduce the copies down to 1 vector,
       and then accumulate that, but at the moment the loop-carried
       dependency includes all copies.  */
    ops->reduction_latency = MAX (ops->reduction_latency, base * count);
  else
    ops->reduction_latency = MAX (ops->reduction_latency, base);

(completely untested)

Thanks,
Richard
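To make the reassociation in (2) concrete, here is a plain-C sketch using
GCC's generic vector extensions (illustrative only, not what the vectorizer
emits today) of how a tree of adds leaves just one loop-carried addition:

  typedef long v2di __attribute__ ((vector_size (16)));

  /* Reduce the four per-copy sums as a tree; only the final addition
     depends on the accumulator carried around the loop.  */
  static inline v2di
  accumulate_tree (v2di acc, v2di c0, v2di c1, v2di c2, v2di c3)
  {
    v2di s01 = c0 + c1;   /* 4 -> 2, independent of acc  */
    v2di s23 = c2 + c3;
    v2di s = s01 + s23;   /* 2 -> 1  */
    return acc + s;       /* 1 + acc -> acc  */
  }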