https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121290

--- Comment #24 from GCC Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Tamar Christina <[email protected]>:

https://gcc.gnu.org/g:e3a2fff040204ce71f8b7fbc6aa09e839982735a

commit r16-6497-ge3a2fff040204ce71f8b7fbc6aa09e839982735a
Author: Tamar Christina <[email protected]>
Date:   Mon Jan 5 14:27:14 2026 +0000

    AArch64: tweak inner-loop penalty when doing outer-loop vect [PR121290]

    r16-3394-g28ab83367e8710a78fffa2513e6e008ebdfbee3e added a cost model
    adjustment to detect invariant load-and-replicate cases when doing
    outer-loop vectorization where the inner loop uses a value defined in
    the outer loop.

    In other words, it is trying to detect the cases where the inner loop
    would need to do an ld1r and all inputs then work on replicated values.
    The argument is that in this case the vector loop is just the scalar
    loop, since each lane merely works on the duplicated values.
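
    As a rough, hypothetical illustration (not the PR's testcase), consider
    a kernel where the outer loop over 'j' is vectorized and the inner loop
    reads data indexed only by 'i'; such loads are uniform across the
    vector lanes and become a load and replicate (ld1r) in the vector body:

       void
       f (float *restrict out, const float *restrict a,
          const float *restrict b, int n, int m)
       {
         for (int j = 0; j < m; j++)      /* outer loop, vectorized  */
           {
             float acc = 0.0f;
             for (int i = 0; i < n; i++)  /* inner loop              */
               /* a[i] does not depend on j, so across the lanes
                  j..j+VF-1 it is the same value: an ld1r.  */
               acc += a[i] * b[i + j];
             out[j] = acc;
           }
       }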

    But it had two shortcomings.

    1. It was all-or-nothing.  The load and replicate may be only a small
       percentage of the data being processed.  As such, this patch now
       requires the load-and-replicate statements to make up at least 50%
       of the leaves of an SLP tree.  Ideally we'd just increase the body
       cost by VF * invariant leaves, but we can't, since the middle-end
       cost model applies a rather large penalty to the scalar code (* 50),
       so the base cost ends up too high and we just never vectorize.  The
       50% threshold is an attempt to strike a balance in this awkward
       situation.  Experiments show it works reasonably well and we get the
       right codegen in all the test cases.  (A hedged sketch of the
       threshold check follows below.)
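
       In pseudo-terms the threshold amounts to something like the
       following (a minimal sketch with hypothetical names; per the
       ChangeLog the actual counters are m_num_dup_stmts and
       m_num_total_stmts in aarch64_vector_costs):

       /* Hedged sketch, not the actual GCC code: apply the penalty only
          when load-and-replicate (dup) statements account for at least
          half of the statements counted for the SLP tree.  */
       static bool
       mostly_dup_leaves_p (unsigned num_dup_stmts,
                            unsigned num_total_stmts)
       {
         return num_total_stmts != 0
                && num_dup_stmts * 2 >= num_total_stmts;
       }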

    2. It does not take into account that a load + replicate whose vector
       value is used in a by-index operation results in the load being
       decomposed back to scalar.  E.g.

       ld1r {v0.4s}, [x0]
       mul  v1.4s, v2.4s, v0.4s

       is transformed into

       ldr  s0, [x0]
       mul  v1.4s, v2.4s, v0.s[0]

       and as such this case may actually be profitable, because we're only
       doing a scalar load of a single element, similar to the scalar loop.

       This patch tries to detect (loosely) such cases and doesn't apply
       the penalty to them.  It is hard to tell this early whether we end
       up with a by-index operation, as the vectorizer itself is not aware
       of them; as such the patch does not do an exhaustive check, but only
       handles the most obvious case.  (A hedged source-level illustration
       follows below.)
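
       As a source-level illustration (hypothetical, not one of the new
       testcases), multiplying a loop's data by a single loaded scalar is
       the kind of pattern where the splat can fold into a lane-indexed
       multiply; the exact codegen naturally depends on target and options:

       void
       g (int *restrict out, const int *restrict a,
          const int *restrict k, int n)
       {
         int c = *k;                  /* single scalar load       */
         for (int i = 0; i < n; i++)
           /* The splat of 'c' can fold into the multiply as a
              by-index operand (mul v1.4s, v2.4s, v0.s[0]), so
              only a scalar ldr s0, [x0] is needed, not an ld1r.  */
           out[i] = a[i] * c;
       }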

    gcc/ChangeLog:

            PR target/121290
            * config/aarch64/aarch64.cc (aarch64_possible_by_lane_insn_p): New.
            (aarch64_vector_costs): Add m_num_dup_stmts and m_num_total_stmts.
            (aarch64_vector_costs::add_stmt_cost): Use them.
            (adjust_body_cost): Likewise.

    gcc/testsuite/ChangeLog:

            PR target/121290
            * gcc.target/aarch64/pr121290.c: Move to...
            * gcc.target/aarch64/pr121290_1.c: ...here.
            * g++.target/aarch64/pr121290_1.C: New test.
            * gcc.target/aarch64/pr121290_2.c: New test.
