| Issue | 131588 |
| Summary | x86 avx2 vpor is first done on calculation-heavy operands |
| Labels | new issue |
| Assignees | |
| Reporter | ImpleLee |
See the code and the compilation result at https://godbolt.org/z/Kchh341vW . The loop ORs several operands together with `vpor`; some of the operands are relatively cheap to compute, while others are not. Compilation flags: `-O3 -std=c++2b -march=skylake`.
```c++
#include <experimental/simd>
#include <cstdint>

namespace stdx = std::experimental;

template <class T, std::size_t N>
using simd_of = stdx::simd<T, stdx::simd_abi::deduce_t<T, N>>;
using data_t = simd_of<std::uint64_t, 4>;

data_t f(data_t a, data_t b) {
    while (true) {
        data_t result = a;
        result |= (a << 1) & std::uint64_t(0x802008020080200);
        result |= a >> 1;
        result |= a >> 10;
        data_t temp = a << 50;
        result |= data_t([=](auto i) {
            if constexpr (i + 1 >= 4) return 0;
            else return temp[i + 1];
        });
        result &= b;
        if (all_of((result & ~a) == 0)) return a;
        a = result;
    }
}
```
The assembly of the loop is as follows.
```asm
.LBB0_1:
vmovdqa %ymm4, %ymm3
vpaddq %ymm4, %ymm4, %ymm4
vpand %ymm1, %ymm4, %ymm4
vpsrlq $1, %ymm3, %ymm5
vpsrlq $10, %ymm3, %ymm6
vpor %ymm6, %ymm5, %ymm5
vpsllq $50, %ymm3, %ymm6
vpermq $249, %ymm6, %ymm6 # latency 3 on skylake
vpblendd $192, %ymm2, %ymm6, %ymm6
vpor %ymm6, %ymm5, %ymm5 # ymm6 is heavy to calculate, but or'ed first
vpor %ymm3, %ymm5, %ymm5 # ymm3 and ymm4 are cheap to calculate, but or'ed later
vpor %ymm4, %ymm5, %ymm4
vpand %ymm0, %ymm4, %ymm4
vptest %ymm4, %ymm3
jae .LBB0_1
```
The critical path of this loop is `vmovdqa -> vpsllq $50 -> vpermq -> vpblendd -> vpor -> vpor -> vpor -> vpand`. If ymm6 were OR'ed last instead, the other two `vpor`s would not need to be on the critical path.
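
To make the difference concrete, here is a rough dependency-chain model of one loop iteration. It is only a sketch. The 3-cycle `vpermq` latency is the figure quoted in the listing above; the 1-cycle latency for every other instruction, the choice to treat the register copy as free, and all names in the code (`finish`, `Ready`, `emitted_order`, `heavy_last_order`) are assumptions made for illustration. It does not model ports, throughput, or the `vptest`/branch.

```cpp
#include <algorithm>

// Assumed latencies in cycles. Only the vpermq value comes from the report.
constexpr int kSimpleLatency = 1;  // vpaddq, vpand, vpor, vpsrlq, vpsllq, vpblendd (assumed)
constexpr int kVpermqLatency = 3;  // vpermq on Skylake, as noted in the listing

// Cycle at which an instruction finishes, given when its two inputs are ready.
constexpr int finish(int lhs_ready, int rhs_ready, int latency = kSimpleLatency) {
    return std::max(lhs_ready, rhs_ready) + latency;
}

// Ready times of the four values that get OR'ed together.
struct Ready {
    int a;            // ymm3: loop-carried input, ready at cycle 0
    int shl1_masked;  // ymm4: (a << 1) & mask
    int shifts;       // ymm5: (a >> 1) | (a >> 10)
    int heavy;        // ymm6: lane-shifted (a << 50)
};

constexpr Ready operand_ready_times() {
    const int a = 0;
    const int shl1 = finish(a, a);                          // vpaddq
    const int shl1_masked = finish(shl1, 0);                // vpand with the constant mask
    const int shifts = finish(finish(a, a), finish(a, a));  // vpsrlq $1, vpsrlq $10, vpor
    const int shl50 = finish(a, a);                         // vpsllq $50
    const int perm = finish(shl50, shl50, kVpermqLatency);  // vpermq
    const int heavy = finish(perm, 0);                      // vpblendd with a constant
    return Ready{a, shl1_masked, shifts, heavy};
}

// OR order as in the listing: the heavy operand is OR'ed first.
constexpr int emitted_order() {
    const Ready r = operand_ready_times();
    const int t1 = finish(r.shifts, r.heavy);
    const int t2 = finish(t1, r.a);
    const int t3 = finish(t2, r.shl1_masked);
    return finish(t3, 0);  // final vpand with b
}

// Alternative order: the cheap operands are OR'ed first, the heavy one last.
constexpr int heavy_last_order() {
    const Ready r = operand_ready_times();
    const int t1 = finish(r.shifts, r.a);
    const int t2 = finish(t1, r.shl1_masked);
    const int t3 = finish(t2, r.heavy);
    return finish(t3, 0);  // final vpand with b
}
```

Under these assumptions the model gives 9 cycles from the loop-carried input to the `vpand` result for the emitted order, and 7 cycles when the heavy operand is OR'ed last.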
_______________________________________________
llvm-bugs mailing list
llvm-bugs@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-bugs