https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101927
Bug ID: 101927
Summary: There is no vector mode popcount for aarch64
Product: gcc
Version: 12.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: enhancement
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: pinskia at gcc dot gnu.org
Target Milestone: ---
Target: aarch64
Take:
#include <stdlib.h>
#include <stdint.h>
size_t hd (const uint8_t *restrict a, const uint8_t *restrict b, size_t l) {
  size_t r = 0, x;
  for (x = 0; x < l; x++)
    r += __builtin_popcount (a[x] ^ b[x]);
  return r;
}
At -O3, GCC does not vectorize this loop.
Clang/LLVM does:
.LBB0_5: // =>This Inner Loop Header: Depth=1
ld1 { v3.b }[0], [x8]
sub x12, x8, #2
ld1 { v5.b }[0], [x10]
ld1 { v4.b }[0], [x12]
sub x12, x10, #2
ld1 { v6.b }[0], [x12]
add x12, x8, #1
ld1 { v3.b }[4], [x12]
add x12, x10, #1
ld1 { v5.b }[4], [x12]
sub x12, x8, #1
ld1 { v4.b }[4], [x12]
sub x12, x10, #1
ld1 { v6.b }[4], [x12]
eor v3.8b, v5.8b, v3.8b
ushll v3.2d, v3.2s, #0
and v3.16b, v3.16b, v1.16b
eor v4.8b, v6.8b, v4.8b
ushll v4.2d, v4.2s, #0
and v4.16b, v4.16b, v1.16b
cnt v3.16b, v3.16b
cnt v4.16b, v4.16b
uaddlp v3.8h, v3.16b
uaddlp v4.8h, v4.16b
uaddlp v3.4s, v3.8h
uaddlp v4.4s, v4.8h
add x8, x8, #4
subs x11, x11, #4
uadalp v2.2d, v3.4s
uadalp v0.2d, v4.4s
add x10, x10, #4
b.ne .LBB0_5
------ CUT ----
Note: I think we could do better than the LLVM code above.
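
As a rough illustration of what "better" might look like (a sketch only, not a proposed patch and not what GCC is expected to emit): the inputs are bytes, so CNT can work directly on the 16 XORed byte lanes, with no widening to 64-bit lanes first. The name hd_sketch and the use of ACLE intrinsics are assumptions made for this example; on non-AArch64 hosts it simply falls back to the scalar loop.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Hypothetical name; same contract as hd () in the testcase above.  */
size_t hd_sketch (const uint8_t *restrict a, const uint8_t *restrict b, size_t l)
{
  size_t r = 0, x = 0;
#if defined(__aarch64__)
  /* 16 bytes per iteration: EOR the inputs, CNT each byte lane, then do a
     widening horizontal add (UADDLV).  The per-iteration sum is at most
     16 * 8 = 128, so it fits the uint16_t that vaddlvq_u8 returns.  */
  for (; x + 16 <= l; x += 16)
    {
      uint8x16_t v = veorq_u8 (vld1q_u8 (a + x), vld1q_u8 (b + x));
      r += vaddlvq_u8 (vcntq_u8 (v));
    }
#endif
  /* Scalar tail; this is the whole loop when not compiling for AArch64.  */
  for (; x < l; x++)
    r += __builtin_popcount (a[x] ^ b[x]);
  return r;
}
```

Reducing with UADDLV on every iteration is just the simplest form to write down; a vectorizer would more likely keep a vector accumulator (for example via UADALP) and reduce once after the loop.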