Issue 81469
Summary [X86] Prefer trunc(reduce(x)) over reduce(trunc(x))
Labels backend:X86, missed-optimization
Assignees
Reporter RKSimon
    Reported here: https://discourse.llvm.org/t/avx2-popcount-regression/76926

```cpp
#include <cstdint>

int popcount8(uint64_t data[8]) {
  int count = 0;
  for (int i = 0; i < 8; ++i)
    count += __builtin_popcountll(data[i]);
  return count;
}
```
```ll
define i32 @popcount8(ptr %data) {
entry:
  %0 = load <8 x i64>, ptr %data, align 8
  %1 = tail call <8 x i64> @llvm.ctpop.v8i64(<8 x i64> %0)
  %2 = trunc <8 x i64> %1 to <8 x i32>
  %3 = tail call i32 @llvm.vector.reduce.add.v8i32(<8 x i32> %2)
  ret i32 %3
}
declare <8 x i64> @llvm.ctpop.v8i64(<8 x i64>)
declare i32 @llvm.vector.reduce.add.v8i32(<8 x i32>)
```

We can avoid the vector truncation, replacing it with a free scalar truncation, if we perform the reduction on the v8i64 instead:
```ll
define i32 @popcount8(ptr %data) {
entry:
  %0 = load <8 x i64>, ptr %data, align 8
  %1 = tail call <8 x i64> @llvm.ctpop.v8i64(<8 x i64> %0)
  %2 = tail call i64 @llvm.vector.reduce.add.v8i64(<8 x i64> %1)
  %3 = trunc i64 %2 to i32
  ret i32 %3
}
declare <8 x i64> @llvm.ctpop.v8i64(<8 x i64>)
declare i64 @llvm.vector.reduce.add.v8i64(<8 x i64>)
```
Godbolt: https://simd.godbolt.org/z/ooK497x7s
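
The fold should be safe because truncation commutes with wrapping addition: the low 32 bits of a 64-bit sum depend only on the low 32 bits of each addend. A minimal scalar sketch of the two orderings follows; the helper names `reduce_of_trunc`, `trunc_of_reduce` and `folds_agree` are hypothetical and only model the IR above, they are not part of LLVM:

```cpp
#include <cstddef>
#include <cstdint>

// Models reduce(trunc(x)): truncate every 64-bit lane to 32 bits,
// then accumulate with wrapping 32-bit addition.
inline uint32_t reduce_of_trunc(const uint64_t *lanes, size_t n) {
  uint32_t sum = 0;
  for (size_t i = 0; i < n; ++i)
    sum += static_cast<uint32_t>(lanes[i]);
  return sum;
}

// Models trunc(reduce(x)): accumulate with wrapping 64-bit addition,
// then truncate the scalar result once.
inline uint32_t trunc_of_reduce(const uint64_t *lanes, size_t n) {
  uint64_t sum = 0;
  for (size_t i = 0; i < n; ++i)
    sum += lanes[i];
  return static_cast<uint32_t>(sum);
}

// Returns true when both orderings produce the same 32-bit value.
inline bool folds_agree(const uint64_t *lanes, size_t n) {
  return reduce_of_trunc(lanes, n) == trunc_of_reduce(lanes, n);
}
```

Both helpers use unsigned arithmetic, so overflow wraps in the same way as the IR `add` without `nuw`/`nsw` flags. Calling `folds_agree` on any array of lanes is expected to return true, including inputs whose 64-bit sum overflows.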

We might be best off attempting this in vector-combine.