https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113860
Bug ID: 113860
Summary: SVE popcount can be used for 16bit, 32bit and 64bit
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: enhancement
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: pinskia at gcc dot gnu.org
Target Milestone: ---
Target: aarch64

Take:
```
void f(unsigned long * __restrict b, unsigned long * __restrict d)
{
  d[0] = __builtin_popcountll(b[0]);
}
```
Currently with `-march=armv9-a`, GCC produces:
```
        ldr     d31, [x0]
        cnt     v31.8b, v31.8b
        addv    b31, v31.8b
        str     d31, [x1]
```
But I think we could do:
```
        ptrue   p6.b, all
        ldr     d31, [x0]
        cnt     z31.d, p6/m, z31.d
        str     d31, [x1]
```
instead, especially if this is inside a loop (not vectorized), as the `ptrue p6.b` assignment could be pulled out of the loop. Or something similar to that.

Likewise for short (.h) and int (.s).
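For reference, here is a minimal C sketch of what the current sequence computes, plus the narrower test cases implied by "likewise". The names `popcount64_bytewise`, `f32` and `f16` are hypothetical and not part of the report. `popcount64_bytewise` models the `cnt v.8b` + `addv` pair (per-byte bit counts, then a horizontal sum), which is the reduction step a single SVE `cnt z.d` would make unnecessary. The suffixes in the comments are the element sizes the report suggests, not a claim about what GCC emits today.

```c
#include <stdint.h>

/* Hypothetical model of `cnt v.8b` followed by `addv b, v.8b`:
   count the set bits in each of the 8 bytes, then add the counts. */
unsigned popcount64_bytewise(uint64_t x)
{
    unsigned sum = 0;
    for (int i = 0; i < 8; i++) {
        uint8_t byte = (uint8_t)(x >> (8 * i));
        sum += (unsigned)__builtin_popcount(byte);
    }
    return sum;
}

/* 32-bit variant of the test case (suggested element size: .s). */
void f32(unsigned int *__restrict b, unsigned int *__restrict d)
{
    d[0] = __builtin_popcount(b[0]);
}

/* 16-bit variant of the test case (suggested element size: .h). */
void f16(unsigned short *__restrict b, unsigned short *__restrict d)
{
    d[0] = __builtin_popcount(b[0]);
}
```

Compiling `f32` and `f16` with `-O2 -march=armv9-a` and comparing the output against the 64-bit case should show whether the same `cnt` + `addv` pattern is used for the narrower types.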