| Issue | 81469 |
| --- | --- |
| Summary | [X86] Prefer trunc(reduce(x)) over reduce(trunc(x)) |
| Labels | backend:X86, missed-optimization |
| Assignees | |
| Reporter | RKSimon |

Reported here: https://discourse.llvm.org/t/avx2-popcount-regression/76926
```cpp
#include <cstdint>

// Sum the population counts of eight 64-bit words.
int popcount8(uint64_t data[8]) {
  int count = 0;
  for (int i = 0; i < 8; ++i)
    count += __builtin_popcountll(data[i]);
  return count;
}
```
```ll
define i32 @popcount8(ptr %data) {
entry:
  %0 = load <8 x i64>, ptr %data, align 8
  %1 = tail call <8 x i64> @llvm.ctpop.v8i64(<8 x i64> %0)
  %2 = trunc <8 x i64> %1 to <8 x i32>
  %3 = tail call i32 @llvm.vector.reduce.add.v8i32(<8 x i32> %2)
  ret i32 %3
}

declare <8 x i64> @llvm.ctpop.v8i64(<8 x i64>)
declare i32 @llvm.vector.reduce.add.v8i32(<8 x i32>)
```
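We can replace the vector truncation with a free scalar truncation if we perform the reduction on the v8i64 and truncate the result. Integer addition wraps, so the low 32 bits of the i64 sum match the sum of the truncated i32 elements: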
```ll
define i32 @popcount8(ptr %data) {
entry:
  %0 = load <8 x i64>, ptr %data, align 8
  %1 = tail call <8 x i64> @llvm.ctpop.v8i64(<8 x i64> %0)
  %2 = tail call i64 @llvm.vector.reduce.add.v8i64(<8 x i64> %1)
  %3 = trunc i64 %2 to i32
  ret i32 %3
}

declare <8 x i64> @llvm.ctpop.v8i64(<8 x i64>)
declare i64 @llvm.vector.reduce.add.v8i64(<8 x i64>)
```
Godbolt: https://simd.godbolt.org/z/ooK497x7s
We might be best off attempting this fold in vector-combine.
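
For reference, a minimal scalar sketch of why swapping the truncation and the add reduction preserves the result. The function names `reduce_of_trunc` and `trunc_of_reduce` are hypothetical and exist only for illustration; they model the two IR forms with wrapping unsigned arithmetic and are not part of LLVM or of any proposed patch.

```cpp
#include <cstdint>

// Models reduce(trunc(x)): truncate every 64-bit element to 32 bits,
// then add the narrow values. Unsigned addition wraps modulo 2^32.
uint32_t reduce_of_trunc(const uint64_t (&elems)[8]) {
  uint32_t sum = 0;
  for (uint64_t e : elems)
    sum += static_cast<uint32_t>(e);
  return sum;
}

// Models trunc(reduce(x)): add the wide values (wrapping modulo 2^64),
// then truncate the single scalar result to 32 bits.
uint32_t trunc_of_reduce(const uint64_t (&elems)[8]) {
  uint64_t sum = 0;
  for (uint64_t e : elems)
    sum += e;
  return static_cast<uint32_t>(sum);
}
```

Both functions should agree for any input, because the low 32 bits of a sum depend only on the low 32 bits of its operands. In the popcount case each element is at most 64, so no wraparound occurs in either form.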