https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110625
            Bug ID: 110625
           Summary: [AArch64] Vect: SLP fails to vectorize a loop as the
                    reduction_latency calculated by new costs is too large
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: hliu at amperecomputing dot com
  Target Milestone: ---

This problem causes a performance regression in SPEC2017 538.imagick.
Consider the following simple case (modified from pr96208):

  typedef struct {
    unsigned short m1, m2, m3, m4;
  } the_struct_t;

  typedef struct {
    double m1, m2, m3, m4, m5;
  } the_struct2_t;

  double bar1 (the_struct2_t*);

  double
  foo (double *k, unsigned int n, the_struct_t *the_struct)
  {
    unsigned int u;
    the_struct2_t result;
    for (u = 0; u < n; u++, k--)
      {
        result.m1 += (*k) * the_struct[u].m1;
        result.m2 += (*k) * the_struct[u].m2;
        result.m3 += (*k) * the_struct[u].m3;
        result.m4 += (*k) * the_struct[u].m4;
      }
    return bar1 (&result);
  }

Compile it with "-Ofast -S -mcpu=neoverse-n2 -fdump-tree-vect-details
-fno-tree-slp-vectorize". SLP fails to vectorize the loop because the vector
body cost is increased due to the too-large "reduction latency". See the dump
of the vect pass:

  Original vector body cost = 51
  Scalar issue estimate:
    ...
    reduction latency = 2
    estimated min cycles per iteration = 2.000000
    estimated cycles per vector iteration (for VF 2) = 4.000000
  Vector issue estimate:
    ...
    reduction latency = 8            <-- Too large
    estimated min cycles per iteration = 8.000000
  Increasing body cost to 102 because scalar code would issue more quickly
  Cost model analysis:
    Vector inside of loop cost: 102
    ...
    Scalar iteration cost: 44
    ...
  missed:  cost model: the vector iteration cost = 102 divided by the scalar
  iteration cost = 44 is greater or equal to the vectorization factor = 2.
  missed:  not vectorized: vectorization not profitable.

SLP succeeds with "-mcpu=neoverse-n1", as N1 doesn't use the new vector costs
and the vector body cost is not increased.

The "reduction latency" is calculated in count_ops() in aarch64.cc:

  /* ??? Ideally we'd do COUNT reductions in parallel, but unfortunately
     that's not yet the case.  */
  ops->reduction_latency = MAX (ops->reduction_latency, base * count);

For this case, "base" is 2 and "count" is 4. To my understanding, for SLP
"count" is the number of scalar stmts (i.e. result.m1 +=, ...) in a
permutation group that are merged into a single vector stmt. Multiplying the
latency by "count" seems unreasonable here (the code probably doesn't take
the SLP situation into account). So I'm thinking of calculating it
differently for the SLP case, e.g.:

  unsigned int latency = PURE_SLP_STMT (stmt_info) ? base : base * count;
  ops->reduction_latency = MAX (ops->reduction_latency, latency);

Is this reasonable?
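
For reference, a sketch of how the change could sit in
aarch64_vector_costs::count_ops. The surrounding guard and the call that
computes "base" are paraphrased from the current aarch64.cc sources and may
not match a given revision exactly; PURE_SLP_STMT is the existing macro from
tree-vectorizer.h:

  /* Calculate the minimum cycles per iteration imposed by a reduction
     operation (guard paraphrased from the existing count_ops code).  */
  if ((kind == scalar_stmt || kind == vector_stmt || kind == vec_to_scalar)
      && vect_is_reduction (stmt_info))
    {
      unsigned int base
        = aarch64_in_loop_reduction_latency (m_vinfo, stmt_info, m_vec_flags);

      /* For a pure SLP statement, COUNT is the number of scalar stmts in a
         permutation group that are merged into one vector stmt, so the
         COUNT scalar reduction chains become lanes of a single vector
         accumulator and BASE alone bounds the latency.  Otherwise keep the
         existing BASE * COUNT.  */
      unsigned int latency
        = PURE_SLP_STMT (stmt_info) ? base : base * count;
      ops->reduction_latency = MAX (ops->reduction_latency, latency);
    }

With this, for the example above the vector reduction latency should drop
from 8 back to 2, so the reduction would no longer dominate the vector issue
estimate and the body cost should not be doubled to 102.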