https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121290
--- Comment #24 from GCC Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Tamar Christina <[email protected]>:

https://gcc.gnu.org/g:e3a2fff040204ce71f8b7fbc6aa09e839982735a

commit r16-6497-ge3a2fff040204ce71f8b7fbc6aa09e839982735a
Author: Tamar Christina <[email protected]>
Date:   Mon Jan 5 14:27:14 2026 +0000

    AArch64: tweak inner-loop penalty when doing outer-loop vect [PR121290]

    r16-3394-g28ab83367e8710a78fffa2513e6e008ebdfbee3e added a cost model
    adjustment to detect invariant load-and-replicate cases when doing
    outer-loop vectorization where the inner loop uses a value defined in
    the outer loop.  In other words, it tries to detect the cases where
    the inner loop would need to do an ld1r and all inputs then work on
    replicated values.  The argument is that in this case the vector loop
    is just the scalar loop, since each lane just works on the duplicated
    values.

    But it had two shortcomings:

    1. It is an all-or-nothing check.  The load and replicate may only be
       a small percentage of the data being processed.  As such, this
       patch now requires the load and replicate to be at least 50% of
       the leaves of an SLP tree.  Ideally we would only increase the
       body cost by VF * invariant leaves, but we can't, since the
       middle-end cost model applies a rather large penalty (* 50) to the
       scalar code, so the base cost ends up being too high and we just
       never vectorize.  The 50% threshold is an attempt to strike a
       balance in this awkward situation.  Experiments show it works
       reasonably well and we get the right codegen in all the test
       cases.

    2. It does not take into account that a load + replicate whose vector
       value is used in a by-index operation results in the load being
       decomposed back to scalar.  E.g.

         ld1r {v0.4s}, [x0]
         mul  v1.4s, v2.4s, v0.4s

       is transformed into

         ldr s0, [x0]
         mul v1.4s, v2.4s, v0.s[0]

       and as such this case may actually be profitable, because we are
       only doing a scalar load of a single element, similar to the
       scalar loop.  This patch tries to detect (loosely) such cases and
       does not apply the penalty to them.  It is a bit hard to tell this
       early whether we end up with a by-index operation, as the
       vectorizer itself is not aware of them, so the patch does not do
       an exhaustive check, only the most obvious one.

    gcc/ChangeLog:

            PR target/121290
            * config/aarch64/aarch64.cc (aarch64_possible_by_lane_insn_p):
            New.
            (aarch64_vector_costs): Add m_num_dup_stmts and
            m_num_total_stmts.
            (aarch64_vector_costs::add_stmt_cost): Use them.
            (adjust_body_cost): Likewise.

    gcc/testsuite/ChangeLog:

            PR target/121290
            * gcc.target/aarch64/pr121290.c: Move to...
            * gcc.target/aarch64/pr121290_1.c: ...here.
            * g++.target/aarch64/pr121290_1.C: New test.
            * gcc.target/aarch64/pr121290_2.c: New test.
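For concreteness, here is a minimal C kernel of the shape being costed
(a hypothetical sketch, not the actual pr121290 testcase; the function
and variable names are illustrative):

    /* When the j-loop is vectorized, a[i] is invariant in it and must
       be broadcast to every lane, classically via ld1r.  If the
       multiply can instead use a by-index form (mul v1.4s, v2.4s,
       v0.s[0]), the broadcast decomposes back to a scalar ldr, as in
       the commit message above.  */
    void
    f (float *restrict out, const float *restrict b,
       const float *restrict a, int n, int m)
    {
      for (int i = 0; i < n; i++)
        for (int j = 0; j < m; j++)
          out[i * m + j] = b[j] * a[i];
    }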

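And a loose sketch of the 50% threshold from point 1, using the counter
names given in the ChangeLog (only an illustration of the ratio check;
the function itself is hypothetical, and the real logic in
adjust_body_cost is more involved):

    #include <stdbool.h>

    /* Treat the loop as load-and-replicate dominated, and thus a
       candidate for the inner-loop penalty, only when the duplicated
       leaves account for at least half of the statements counted.
       m_num_dup_stmts and m_num_total_stmts are the counters named in
       the ChangeLog; everything around them is assumed.  */
    static bool
    mostly_load_and_replicate_p (unsigned m_num_dup_stmts,
                                 unsigned m_num_total_stmts)
    {
      return m_num_total_stmts != 0
             && m_num_dup_stmts * 2 >= m_num_total_stmts;
    }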