https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116186

            Bug ID: 116186
           Summary: the scalar cost for popcount is off for
                    -mcpu=neoverse-n2 (and generic-armv9-a)
           Product: gcc
           Version: 15.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: enhancement
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: pinskia at gcc dot gnu.org
  Target Milestone: ---
            Target: aarch64

Take:
```
void
f_v4si (unsigned int *__restrict b, unsigned int *__restrict d)
{
  d[0] = __builtin_popcountll (b[0]);
  d[1] = __builtin_popcountll (b[1]);
  d[2] = __builtin_popcountll (b[2]);
  d[3] = __builtin_popcountll (b[3]);
}
```

This should be SLP-vectorized, but with `-O3 -mcpu=neoverse-n2` it currently is
not, because of the cost model:
```
/app/example.cpp:5:8: note: Cost model analysis for part in loop 0:
  Vector cost: 7
  Scalar cost: 4
/app/example.cpp:5:8: missed: not vectorized: vectorization is not profitable.
```

But the cost of the scalar popcount here is roughly the same as the cost of
doing the popcount in V2SI.
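
To make the comparison a bit more concrete, here is a rough, portable C model of how a popcount on 32-bit lanes can be composed from a per-byte bit count followed by two pairwise widening adds, which is the general shape of an Advanced SIMD sequence (CNT followed by UADDLP steps). The names `ref_popcount32`, `model_lane_popcount32` and `model_v4si` are made up for illustration. This is only a sketch of the idea, not the code GCC emits and not its cost model.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar reference: clear the lowest set bit until none remain. */
static unsigned ref_popcount32(uint32_t x)
{
  unsigned n = 0;
  while (x) {
    x &= x - 1;
    n++;
  }
  return n;
}

/* Hypothetical model of one 32-bit lane:
   step 1: count bits in each byte (what a byte-wise CNT would produce),
   step 2: add adjacent bytes into 16-bit sums,
   step 3: add adjacent 16-bit sums into the 32-bit result. */
static unsigned model_lane_popcount32(uint32_t x)
{
  uint8_t cnt[4];
  for (int i = 0; i < 4; i++)
    cnt[i] = (uint8_t) ref_popcount32((x >> (8 * i)) & 0xffu);

  uint16_t h0 = (uint16_t) (cnt[0] + cnt[1]);
  uint16_t h1 = (uint16_t) (cnt[2] + cnt[3]);
  assert(h0 <= 16 && h1 <= 16);  /* each byte holds at most 8 set bits */

  return (unsigned) h0 + (unsigned) h1;
}

/* Apply the lane model to four elements, mirroring f_v4si above. */
static void model_v4si(const uint32_t *restrict b, uint32_t *restrict d)
{
  for (int i = 0; i < 4; i++)
    d[i] = model_lane_popcount32(b[i]);
}
```

Calling `model_v4si` on four inputs gives the same results as four calls to `ref_popcount32`. The only point is that the lane-wise form needs a couple of extra reduction steps after the byte count.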

With `-mcpu=generic-armv8-a` we get:
```
/app/example.cpp:5:8: note: Cost model analysis for part in loop 0:
  Vector cost: 3
  Scalar cost: 4
```

And it is vectorized.
