https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110625
            Bug ID: 110625
           Summary: [AArch64] Vect: SLP fails to vectorize a loop as the
                    reduction_latency calculated by new costs is too large
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: hliu at amperecomputing dot com
  Target Milestone: ---

This problem causes a performance regression in SPEC2017 538.imagick.
Consider the following simple case (modified from pr96208):

  typedef struct {
    unsigned short m1, m2, m3, m4;
  } the_struct_t;

  typedef struct {
    double m1, m2, m3, m4, m5;
  } the_struct2_t;

  double bar1 (the_struct2_t*);

  double
  foo (double *k, unsigned int n, the_struct_t *the_struct)
  {
    unsigned int u;
    the_struct2_t result;
    for (u = 0; u < n; u++, k--)
      {
        result.m1 += (*k) * the_struct[u].m1;
        result.m2 += (*k) * the_struct[u].m2;
        result.m3 += (*k) * the_struct[u].m3;
        result.m4 += (*k) * the_struct[u].m4;
      }
    return bar1 (&result);
  }

Compile it with "-Ofast -S -mcpu=neoverse-n2 -fdump-tree-vect-details
-fno-tree-slp-vectorize". SLP fails to vectorize the loop because the vector
body cost is increased due to the too-large "reduction latency". See the dump
of the vect pass:

  Original vector body cost = 51
  Scalar issue estimate:
    ...
    reduction latency = 2
    estimated min cycles per iteration = 2.000000
    estimated cycles per vector iteration (for VF 2) = 4.000000
  Vector issue estimate:
    ...
    reduction latency = 8            <-- Too large
    estimated min cycles per iteration = 8.000000
  Increasing body cost to 102 because scalar code would issue more quickly
  Cost model analysis:
    Vector inside of loop cost: 102
    ...
    Scalar iteration cost: 44
    ...
  missed:  cost model: the vector iteration cost = 102 divided by the scalar
  iteration cost = 44 is greater or equal to the vectorization factor = 2.
  missed:  not vectorized: vectorization not profitable.

SLP succeeds with "-mcpu=neoverse-n1", as N1 doesn't use the new vector costs
and the vector body cost is not increased.

The "reduction latency" is calculated in count_ops() in aarch64.cc:

  /* ??? Ideally we'd do COUNT reductions in parallel, but unfortunately
     that's not yet the case.  */
  ops->reduction_latency = MAX (ops->reduction_latency, base * count);

For this case, "base" is 2 and "count" is 4. To my understanding, for SLP
"count" is the number of scalar stmts (i.e. result.m1 +=, ...) in a
permutation group that are merged into a single vector stmt. Multiplying the
latency by "count" seems unreasonable here (the code probably doesn't take
the SLP situation into account). So I'm thinking of calculating it
differently for the SLP case, e.g.:

  unsigned int latency = PURE_SLP_STMT (stmt_info) ? base : base * count;
  ops->reduction_latency = MAX (ops->reduction_latency, latency);

Is this reasonable?
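
For reference, a sketch of how the change could sit in
aarch64_vector_costs::count_ops. The surrounding guard and the call that
computes "base" are paraphrased from the current aarch64.cc sources and may
not match a given revision exactly; PURE_SLP_STMT is the existing macro from
tree-vectorizer.h:

  /* Calculate the minimum cycles per iteration imposed by a reduction
     operation (guard paraphrased from the existing count_ops code).  */
  if ((kind == scalar_stmt || kind == vector_stmt || kind == vec_to_scalar)
      && vect_is_reduction (stmt_info))
    {
      unsigned int base
        = aarch64_in_loop_reduction_latency (m_vinfo, stmt_info, m_vec_flags);

      /* For a pure SLP statement, COUNT is the number of scalar stmts in a
         permutation group that are merged into one vector stmt, so the
         COUNT scalar reduction chains become lanes of a single vector
         accumulator and BASE alone bounds the latency.  Otherwise keep the
         existing BASE * COUNT.  */
      unsigned int latency
        = PURE_SLP_STMT (stmt_info) ? base : base * count;
      ops->reduction_latency = MAX (ops->reduction_latency, latency);
    }

With this, for the example above the vector reduction latency should drop
from 8 back to 2, so the reduction would no longer dominate the vector issue
estimate and the body cost should not be doubled to 102.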