| Issue | 172217 |
| Summary | Loop vectorizer generates inefficient code |
| Labels | new issue |
| Assignees | |
| Reporter | zijinshanren |
https://godbolt.org/z/MPPnvT5h8
The code is simple:
```cpp
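#include <bit>      // std::byteswap (C++23)
#include <cstddef>  // size_t
#include <cstdint>  // int64_t
#include <span>

// Indexed loop over a raw pointer.
void swap_ptr_impl(int64_t* ptr, size_t len) {
    for (size_t i = 0; i < len; i++) {
        ptr[i] = std::byteswap(ptr[i]);
    }
}

// Pointer-increment loop over the same range.
void swap_ptr2_impl(int64_t* ptr, size_t len) {
    auto end = ptr + len;
    for (; ptr < end; ptr++) {
        *ptr = std::byteswap(*ptr);
    }
}

// Range-based loop over a dynamic-extent span.
void swap_span_impl(std::span<int64_t> sp) {
    for (auto& x : sp) {
        x = std::byteswap(x);
    }
}

// Range-based loop over a fixed-extent span (length known at compile time).
void swap_span_2(std::span<int64_t, 1024> sp) {
    for (auto& x : sp) {
        x = std::byteswap(x);
    }
}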
```
On an i9-14900KF, swap_ptr_impl is about 2x slower than swap_ptr2_impl and swap_span_impl; on Quick Bench it is about 2.8x slower.
swap_span_2, where the span length is known at compile time, is also about 2x slower.
```text
Run on (32 X 3187 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x16)
  L1 Instruction 32 KiB (x16)
  L2 Unified 2048 KiB (x16)
  L3 Unified 36864 KiB (x1)
------------------------------------------------------
Benchmark            Time             CPU   Iterations
------------------------------------------------------
swap_ptr           400 ns          390 ns      1723077
swap_ptr2          184 ns          180 ns      4072727
swap_span          176 ns          165 ns      4072727
swap_span_2        403 ns          399 ns      1723077
```
With -fno-vectorize, all four functions run at about the same speed:
```text
Run on (32 X 3187 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x16)
  L1 Instruction 32 KiB (x16)
  L2 Unified 2048 KiB (x16)
  L3 Unified 36864 KiB (x1)
------------------------------------------------------
Benchmark            Time             CPU   Iterations
------------------------------------------------------
swap_ptr           181 ns          184 ns      4072727
swap_ptr2          181 ns          180 ns      3733333
swap_span          173 ns          172 ns      3733333
swap_span_2        175 ns          173 ns      4072727
```
So I assume something is wrong in the loop vectorizer. The problem reproduces with clang 17 and later.
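Since -fno-vectorize applies to the whole translation unit, a per-loop variant may help narrow this down. The following is a minimal sketch, not taken from the godbolt link: it applies clang's `#pragma clang loop vectorize(disable) interleave(disable)` to the indexed loop only. The names `swap_ptr_novec` and `bswap64` are hypothetical, and whether this restores the scalar timings on the i9-14900KF is an assumption I have not measured.

```cpp
#include <bit>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: use std::byteswap when the standard library provides
// it (C++23), otherwise fall back to the GCC/Clang builtin.
inline std::int64_t bswap64(std::int64_t v) {
#if defined(__cpp_lib_byteswap)
    return std::byteswap(v);
#else
    return static_cast<std::int64_t>(
        __builtin_bswap64(static_cast<std::uint64_t>(v)));
#endif
}

// Same indexed loop as swap_ptr_impl, with the loop vectorizer and
// interleaving turned off for this one loop (the pragma is clang-specific).
void swap_ptr_novec(std::int64_t* ptr, std::size_t len) {
#if defined(__clang__)
#pragma clang loop vectorize(disable) interleave(disable)
#endif
    for (std::size_t i = 0; i < len; i++) {
        ptr[i] = bswap64(ptr[i]);
    }
}
```

If swap_ptr_novec matches the -fno-vectorize numbers while swap_ptr_impl stays slow in the same build, that would point at the vectorized form of this particular loop and not at something else in the pipeline.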