https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116186
Bug ID: 116186
Summary: the scalar cost for popcount is off for -mcpu=neoverse-n2 (and generic-armv9-a)
Product: gcc
Version: 15.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: enhancement
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: pinskia at gcc dot gnu.org
Target Milestone: ---
Target: aarch64

Take:
```
void f_v4si (unsigned int *__restrict b, unsigned int *__restrict d)
{
  d[0] = __builtin_popcountll (b[0]);
  d[1] = __builtin_popcountll (b[1]);
  d[2] = __builtin_popcountll (b[2]);
  d[3] = __builtin_popcountll (b[3]);
}
```

This should SLP but currently does not with `-O3 -mcpu=neoverse-n2` due to the cost model:
```
/app/example.cpp:5:8: note: Cost model analysis for part in loop 0:
  Vector cost: 7
  Scalar cost: 4
/app/example.cpp:5:8: missed: not vectorized: vectorization is not profitable.
```

But the cost of the scalar popcount here is basically similar to the cost of doing V2SI.

With generic-armv8 we get:
```
/app/example.cpp:5:8: note: Cost model analysis for part in loop 0:
  Vector cost: 3
  Scalar cost: 4
```
And it is vectorized.
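
For reference, here is a rough scalar emulation of what a vectorized (V4SI) popcount could compute. It is only a sketch: it assumes the Advanced SIMD lowering is a per-byte count followed by two pairwise widening adds (the shape of a `CNT` plus two `UADDLP` steps), which this report does not confirm and which may differ from what GCC emits for neoverse-n2. The names `popcount_byte` and `f_v4si_emulated` are hypothetical, and `unsigned int` is assumed to be 32 bits wide.

```c
#include <stdint.h>

/* Per-byte population count: what one lane of a byte-wise vector
   count would compute.  Hypothetical helper, not a GCC function.  */
static uint8_t popcount_byte (uint8_t x)
{
  uint8_t n = 0;
  while (x)
    {
      x &= (uint8_t) (x - 1);  /* Clear the lowest set bit.  */
      n++;
    }
  return n;
}

/* Scalar emulation of an assumed V4SI popcount sequence.
   Assumes unsigned int is 32 bits wide.  */
void f_v4si_emulated (const unsigned int *__restrict b,
                      unsigned int *__restrict d)
{
  uint8_t bytes[16];
  uint16_t halves[8];

  /* Step 1: view the four 32-bit inputs as 16 bytes and count the
     set bits in each byte.  */
  for (int i = 0; i < 4; i++)
    for (int j = 0; j < 4; j++)
      bytes[4 * i + j] = popcount_byte ((uint8_t) (b[i] >> (8 * j)));

  /* Step 2: pairwise widening add, 16 x 8-bit -> 8 x 16-bit.  */
  for (int i = 0; i < 8; i++)
    halves[i] = (uint16_t) (bytes[2 * i] + bytes[2 * i + 1]);

  /* Step 3: pairwise widening add, 8 x 16-bit -> 4 x 32-bit.  */
  for (int i = 0; i < 4; i++)
    d[i] = (unsigned int) halves[2 * i] + (unsigned int) halves[2 * i + 1];
}
```

Under those assumptions, all four results come out of three whole-vector steps, whereas the scalar version needs one popcount per element.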