https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113860

            Bug ID: 113860
           Summary: SVE popcount can be used for 16bit, 32bit and 64bit
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: enhancement
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: pinskia at gcc dot gnu.org
  Target Milestone: ---
            Target: aarch64

Take:
```
void f(unsigned long *__restrict b, unsigned long *__restrict d)
{
    d[0] = __builtin_popcountll(b[0]);
}
```

Currently with `-march=armv9-a`, GCC produces:
```
        ldr     d31, [x0]
        cnt     v31.8b, v31.8b
        addv    b31, v31.8b
        str     d31, [x1]
```

But I think we could do:
```
        ptrue   p6.b, all
        ldr     d31, [x0]
        cnt     z31.d, p6/m, z31.d
        str     d31, [x1]
```

instead, or something similar to that. This would help especially when the
code is inside a loop that is not vectorized, as the `ptrue` that sets up p6
could then be hoisted out of the loop.
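A sketch of the loop case, as a hypothetical reduced test case that is not part of the original report: the function name `popcount_walk` and its parameters are made up for illustration. The index chain `i = next[i]` is a loop-carried dependency, so the loop is not expected to be vectorized, and the scalar popcount in its body is where a hoisted `ptrue` would pay off.

```c
/* Hypothetical test case (name and signature are assumptions, not from the
   report). Each iteration depends on the index loaded by the previous one,
   so the loop is expected to stay scalar. */
unsigned long popcount_walk(const unsigned long *vals, const unsigned *next,
                            unsigned start, unsigned steps)
{
    unsigned long total = 0;
    unsigned i = start;
    for (unsigned s = 0; s < steps; s++) {
        total += __builtin_popcountll(vals[i]); /* candidate for SVE cnt .d */
        i = next[i];                            /* loop-carried dependency */
    }
    return total;
}
```

With the sequence proposed above, the predicate setup would ideally sit once before the loop, leaving only the load and `cnt` per iteration; whether GCC actually hoists it is exactly what such a test case would check.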

Likewise for short (.h) and int (.s).
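The 16-bit and 32-bit variants could be exercised with test cases shaped like `f` above. The names `f16` and `f32` are hypothetical and only mirror the 64-bit example; the comments state the element size one would hope to see on the SVE `cnt`, not what GCC currently emits.

```c
/* Hypothetical 16-bit analogue of f; the hoped-for form is cnt on .h. */
void f16(unsigned short *__restrict b, unsigned short *__restrict d)
{
    d[0] = __builtin_popcount(b[0]);
}

/* Hypothetical 32-bit analogue of f; the hoped-for form is cnt on .s. */
void f32(unsigned int *__restrict b, unsigned int *__restrict d)
{
    d[0] = __builtin_popcount(b[0]);
}
```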
