Hi Richard, Update the patch with a simple case (see below case and comments). It shows a live stmt may not have reduction def, which introduce the ICE.
Is it OK for trunk? ---- Fix the assertion failure on empty reduction define in info_for_reduction. Even a stmt is live, it may still have empty reduction define. Check the reduction definition instead of live info before calling info_for_reduction. gcc/ChangeLog: PR target/110625 * config/aarch64/aarch64.cc (aarch64_force_single_cycle): check STMT_VINFO_REDUC_DEF to avoid failures in info_for_reduction. gcc/testsuite/ChangeLog: * gcc.target/aarch64/pr110625_3.c: New testcase. --- gcc/config/aarch64/aarch64.cc | 2 +- gcc/testsuite/gcc.target/aarch64/pr110625_3.c | 34 +++++++++++++++++++ 2 files changed, 35 insertions(+), 1 deletion(-) create mode 100644 gcc/testsuite/gcc.target/aarch64/pr110625_3.c diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc index d4d76025545..5b8d8fa8e2d 100644 --- a/gcc/config/aarch64/aarch64.cc +++ b/gcc/config/aarch64/aarch64.cc @@ -16776,7 +16776,7 @@ aarch64_adjust_stmt_cost (vect_cost_for_stmt kind, stmt_vec_info stmt_info, static bool aarch64_force_single_cycle (vec_info *vinfo, stmt_vec_info stmt_info) { - if (!STMT_VINFO_LIVE_P (stmt_info)) + if (!STMT_VINFO_REDUC_DEF (stmt_info)) return false; auto reduc_info = info_for_reduction (vinfo, stmt_info); diff --git a/gcc/testsuite/gcc.target/aarch64/pr110625_3.c b/gcc/testsuite/gcc.target/aarch64/pr110625_3.c new file mode 100644 index 00000000000..35a50290cb0 --- /dev/null +++ b/gcc/testsuite/gcc.target/aarch64/pr110625_3.c @@ -0,0 +1,34 @@ +/* { dg-do compile } */ +/* { dg-options "-O3 -mcpu=neoverse-n2" } */ + +/* Avoid ICE on empty reduction def in single_defuse_cycle. + + E.g. + <bb 3> [local count: 858993456]: + # sum_18 = PHI <sum_15(5), 0(2)> + sum.0_5 = (unsigned int) sum_18; + _6 = _4 + sum.0_5; <-- it is "live" but doesn't have reduction def + sum_15 = (int) _6; + ... + if (ivtmp_29 != 0) + goto <bb 5>; [75.00%] + else + goto <bb 4>; [25.00%] + + <bb 5> [local count: 644245086]: + goto <bb 3>; [100.00%] + + <bb 4> [local count: 214748368]: + # _31 = PHI <_6(3)> + _8 = _31 >> 1; +*/ + +int +f (unsigned int *tmp) +{ + int sum = 0; + for (int i = 0; i < 4; i++) + sum += tmp[i]; + + return (unsigned int) sum >> 1; +} -- 2.34.1 ________________________________________ From: Hao Liu OS <h...@os.amperecomputing.com> Sent: Tuesday, August 1, 2023 17:43 To: Richard Sandiford Cc: Richard Biener; GCC-patches@gcc.gnu.org Subject: Re: [PATCH] AArch64: Do not increase the vect reduction latency by multiplying count [PR110625] Hi Richard, This is a quick fix to the several ICEs. It seems even STMT_VINFO_LIVE_P is true, some reduct stmts still don't have REDUC_DEF. So I change the check to STMT_VINFO_REDUC_DEF. Is it OK for trunk? --- Fix the ICEs on empty reduction define. Even STMT_VINFO_LIVE_P is true, some reduct stmts still don't have definition. gcc/ChangeLog: PR target/110625 * config/aarch64/aarch64.cc (aarch64_force_single_cycle): check STMT_VINFO_REDUC_DEF to avoid failures in info_for_reduction --- gcc/config/aarch64/aarch64.cc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc index d4d76025545..5b8d8fa8e2d 100644 --- a/gcc/config/aarch64/aarch64.cc +++ b/gcc/config/aarch64/aarch64.cc @@ -16776,7 +16776,7 @@ aarch64_adjust_stmt_cost (vect_cost_for_stmt kind, stmt_vec_info stmt_info, static bool aarch64_force_single_cycle (vec_info *vinfo, stmt_vec_info stmt_info) { - if (!STMT_VINFO_LIVE_P (stmt_info)) + if (!STMT_VINFO_REDUC_DEF (stmt_info)) return false; auto reduc_info = info_for_reduction (vinfo, stmt_info); -- 2.40.0 ________________________________________ From: Richard Sandiford <richard.sandif...@arm.com> Sent: Monday, July 31, 2023 17:11 To: Hao Liu OS Cc: Richard Biener; GCC-patches@gcc.gnu.org Subject: Re: [PATCH] AArch64: Do not increase the vect reduction latency by multiplying count [PR110625] Hao Liu OS <h...@os.amperecomputing.com> writes: >> Which test case do you see this for? The two tests in the patch still >> seem to report correct latencies for me if I make the change above. > > Not the newly added tests. It is still the existing case causing the > previous ICE (i.e. assertion problem): gcc.target/aarch64/sve/cost_model_13.c. > > It's not the test case itself failed, but the dump message of vect says the > "reduction latency" is 0: > > Before the change: > cost_model_13.c:7:21: note: Original vector body cost = 6 > cost_model_13.c:7:21: note: Scalar issue estimate: > cost_model_13.c:7:21: note: load operations = 1 > cost_model_13.c:7:21: note: store operations = 0 > cost_model_13.c:7:21: note: general operations = 1 > cost_model_13.c:7:21: note: reduction latency = 1 > cost_model_13.c:7:21: note: estimated min cycles per iteration = 1.000000 > cost_model_13.c:7:21: note: estimated cycles per vector iteration (for VF > 8) = 8.000000 > cost_model_13.c:7:21: note: Vector issue estimate: > cost_model_13.c:7:21: note: load operations = 1 > cost_model_13.c:7:21: note: store operations = 0 > cost_model_13.c:7:21: note: general operations = 1 > cost_model_13.c:7:21: note: reduction latency = 2 > cost_model_13.c:7:21: note: estimated min cycles per iteration = 2.000000 > > After the change: > cost_model_13.c:7:21: note: Original vector body cost = 6 > cost_model_13.c:7:21: note: Scalar issue estimate: > cost_model_13.c:7:21: note: load operations = 1 > cost_model_13.c:7:21: note: store operations = 0 > cost_model_13.c:7:21: note: general operations = 1 > cost_model_13.c:7:21: note: reduction latency = 0 <--- seems not > consistent with above result > cost_model_13.c:7:21: note: estimated min cycles per iteration = 1.000000 > cost_model_13.c:7:21: note: estimated cycles per vector iteration (for VF > 8) = 8.000000 > cost_model_13.c:7:21: note: Vector issue estimate: > cost_model_13.c:7:21: note: load operations = 1 > cost_model_13.c:7:21: note: store operations = 0 > cost_model_13.c:7:21: note: general operations = 1 > cost_model_13.c:7:21: note: reduction latency = 0 <--- seems not > consistent with above result > cost_model_13.c:7:21: note: estimated min cycles per iteration = 1.000000 > <--- seems not consistent with above result > > BTW. this should be caused by the reduction stmt is not live, which indicates > whether this stmts is part of a computation whose result is used outside the > loop (tree-vectorized.h:1204): > <bb 3>: > # res_18 = PHI <res_15(7), 0(6)> > # i_20 = PHI <i_16(7), 0(6)> > _1 = (long unsigned int) i_20; > _2 = _1 * 2; > _3 = x_14(D) + _2; > _4 = *_3; > _5 = (unsigned short) _4; > res.0_6 = (unsigned short) res_18; > _7 = _5 + res.0_6; <-- This is not live, may be > caused by the below type cast stmt. > res_15 = (short int) _7; > i_16 = i_20 + 1; > if (n_11(D) > i_16) > goto <bb 7>; > else > goto <bb 4>; > > <bb 7>: > goto <bb 3>; Ah, I see, thanks. My concern was: if requiring !STMT_VINFO_LIVE_P stmts can cause "normal" reductions to have a latency of 0, could the same thing happen for single-cycle reductions? But I suppose the answer is "no". Introducing a cast like the above would cause reduc_chain_length > 1, and so: if (ncopies > 1 && (STMT_VINFO_RELEVANT (stmt_info) <= vect_used_only_live) && reduc_chain_length == 1 && loop_vinfo->suggested_unroll_factor == 1) single_defuse_cycle = true; wouldn't trigger. Which makes the single-cycle thing a bit hit-and-miss... So yeah, I agree the patch is safe after all. Please split the check out into a helper though, to avoid the awkward formatting: /* Return true if STMT_INFO is part of a reduction that has the form: r = r op ...; r = r op ...; with the single accumulator being read and written multiple times. */ static bool aarch64_force_single_cycle (vec_info *vinfo, stmt_vec_info stmt_info) { if (!STMT_VINFO_LIVE_P (stmt_info)) return false; auto reduc_info = info_for_reduction (vinfo, stmt_info); return STMT_VINFO_FORCE_SINGLE_CYCLE (reduc_info); } OK with that change, thanks. Richard