| Field | Value |
| --- | --- |
| Issue | 162644 |
| Summary | Invalid tail-predication in ARMLowOverheadLoops |
| Labels | new issue |
| Assignees | |
| Reporter | statham-arm |

When compiling the following C source, the ARMLowOverheadLoops pass changes the semantics of the code by performing tail-predication.
```c
// clang --target=arm-none-eabi -mcpu=cortex-m52 -mfloat-abi=hard -fno-inline-functions -O1 -S -o - -mllvm -print-before-all llvmaeng4598.c
#include <arm_mve.h>

float32x4_t inactive = {0.0, 0.0, 0.0, 0.0};

float32x4_t test_func(float32_t *array, int32_t len) {
    float32x4_t acc = vdupq_n_f32(0.1f);
    do {
        mve_pred16_t tailpred = vctp32q(len);
        float32x4_t vecSrc = vldrwq_z_f32(array, tailpred);
        acc = vaddq_m_f32(inactive, acc, vecSrc, tailpred);
        array += 4;
        len -= 4;
    } while (len > 0);
    return acc;
}
```
The source code loads four floats at a time from the input array, and adds them elementwise to the vector `acc`. In case `len` is not a multiple of 4, an explicit MVE predicate is constructed using `vctp32q` so that the final loop iteration will load fewer than 4 floats.
Usually in this kind of code, the `vaddq_m_f32` intrinsic would pass `acc` as its first operand as well as its second, so that any vector lanes disabled by the predicate would be left unchanged from their value in the previous iteration. However, in _this_ code, the `vaddq_m_f32` takes its inactive lanes from the constant all-zero vector `inactive`.
So, as written, the code returns a vector in which any lane not used by the last loop iteration is _zero_, rather than containing the sum of array elements from previous iterations.
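To make the intended result concrete, here is a minimal scalar sketch of what `test_func` computes, written in plain C so it does not need `arm_mve.h`. The name `model_source`, the `LANES` macro, and the `out` parameter are hypothetical and exist only for this illustration; the model assumes that `vctp32q(len)` activates exactly the lanes whose index is less than `len`.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 4

/* Scalar model of test_func (hypothetical helper, not part of the report).
   out[] receives the vector that test_func would return. */
void model_source(const float *array, int len, float out[LANES]) {
    assert(array != NULL && out != NULL);
    float acc[LANES] = {0.1f, 0.1f, 0.1f, 0.1f};   /* vdupq_n_f32(0.1f) */
    do {
        float next[LANES];
        for (int lane = 0; lane < LANES; lane++) {
            if (lane < len) {
                /* Active lane under vctp32q(len): load and accumulate. */
                next[lane] = acc[lane] + array[lane];
            } else {
                /* Inactive lane: taken from 'inactive', which is all zero. */
                next[lane] = 0.0f;
            }
        }
        for (int lane = 0; lane < LANES; lane++)
            acc[lane] = next[lane];
        array += LANES;
        len -= LANES;
    } while (len > 0);
    for (int lane = 0; lane < LANES; lane++)
        out[lane] = acc[lane];
}
```

With six input elements, the second iteration has only lanes 0 and 1 active, so under this model lanes 2 and 3 of the result are zero even though they accumulated values in the first iteration.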
When this code is compiled with the extra option `-mllvm -arm-loloops-disable-tailpred`, the ARMLowOverheadLoops pass generates a low-overhead loop using `dls` and `le`, but leaves the tail-predication alone. The `vmov q0,q1` in the middle of the loop is unpredicated, and copies _all_ of the `inactive` vector into q0, including the lanes disabled by the current loop iteration's predicate. Then the predicated `vaddt` after that overwrites only the active lanes with the sum of the previous `acc` and the loaded values, just as the source code says.
```
        dls     lr, r2
.LBB0_1:
        vctp.32 r1
        vmov    q2, q0
        vpst
        vldrwt.u32 q3, [r0], #16    // tail-predicated: load from input array
        vmov    q0, q1              // unpredicated: copy 'inactive' into q0
        vpst
        vaddt.f32 q0, q2, q3        // tail-pred: overwrite some of q0 with q2+q3
        subs    r1, #4
        le      lr, .LBB0_1
```
But removing `-mllvm -arm-loloops-disable-tailpred` causes ARMLowOverheadLoops to perform a transformation that changes the semantics (as of commit b256d0a7aa00079e7ff0e64d52b8055ed6440682):
```
        dlstp.32 lr, r1
.LBB0_1:
        vmov    q2, q0
        vldrw.u32 q3, [r0], #16     // tail-predication now done by LTPSIZE
        vmov    q0, q1              // ALSO TAIL-PREDICATED but shouldn't be
        vadd.f32 q0, q2, q3         // tail-predicated as before
        letp    lr, .LBB0_1
```
Now the tail-predication in the last loop iteration is done by the `dlstp` and `letp` instructions setting the LTPSIZE field in FPSCR, instead of by constructing a predicate in VPR. This means that _all_ the instructions in the loop are affected by the tail-predication. In particular, the `vmov q0,q1` is now copying only the _active_ lanes into q0. So the inactive lanes in the final iteration will not be zeroed: they will take whatever value was left in q0 after the previous iteration.
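The difference between the two loops can be sketched with a per-lane model of the four vector instructions in the loop body. This is an illustrative simplification in plain C, not a simulation of the hardware: the function name `model_loop` and the `predicate_all` flag are hypothetical, and the model assumes that a predicated write leaves the inactive lanes of its destination register unchanged.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 4

/* Per-lane model of the loop body (hypothetical helper for illustration).
   predicate_all == 0: only the load and the add are masked (vctp/vpst form).
   predicate_all != 0: every vector write is masked (dlstp/letp form). */
void model_loop(const float *array, int len, int predicate_all,
                float out[LANES]) {
    assert(array != NULL && out != NULL);
    float q0[LANES] = {0.1f, 0.1f, 0.1f, 0.1f};        /* acc */
    const float q1[LANES] = {0.0f, 0.0f, 0.0f, 0.0f};  /* inactive */
    float q2[LANES] = {0.0f, 0.0f, 0.0f, 0.0f};
    float q3[LANES] = {0.0f, 0.0f, 0.0f, 0.0f};
    do {
        /* No instruction here mixes lanes, so each lane can run the whole
           instruction sequence independently. */
        for (int i = 0; i < LANES; i++) {
            int active = i < len;                /* tail predicate */
            if (active || !predicate_all)
                q2[i] = q0[i];                   /* vmov q2, q0 */
            if (active)
                q3[i] = array[i];                /* vldrw q3, [r0], #16 */
            if (active || !predicate_all)
                q0[i] = q1[i];                   /* vmov q0, q1 */
            if (active)
                q0[i] = q2[i] + q3[i];           /* vadd q0, q2, q3 */
        }
        array += LANES;
        len -= LANES;
    } while (len > 0);
    for (int i = 0; i < LANES; i++)
        out[i] = q0[i];
}
```

Calling `model_loop` on six elements with `predicate_all` set to 0 leaves lanes 2 and 3 of the result at zero, while setting it to 1 leaves them holding the partial sums from the first iteration, which is the divergence described above.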
In this situation, ARMLowOverheadLoops should recognize that tail-predicating the loop via LTPSIZE is an invalid transformation: the inactive lanes written by that `vmov` are needed, so the write to them cannot be discarded.