Issue 162644
Summary Invalid tail-predication in ARMLowOverheadLoops
Labels new issue
Assignees
Reporter statham-arm
When compiling the following C source, the ARMLowOverheadLoops pass changes the semantics of the code by performing tail-predication.
```c
// clang --target=arm-none-eabi -mcpu=cortex-m52 -mfloat-abi=hard -fno-inline-functions -O1 -S -o - -mllvm -print-before-all llvmaeng4598.c

#include <arm_mve.h>

float32x4_t inactive = {0.0, 0.0, 0.0, 0.0};

float32x4_t test_func(float32_t *array, int32_t len) {
    float32x4_t acc = vdupq_n_f32(0.1f);

    do {
        mve_pred16_t tailpred = vctp32q(len);
        float32x4_t vecSrc = vldrwq_z_f32(array, tailpred);
        acc = vaddq_m_f32(inactive, acc, vecSrc, tailpred);
        array += 4;
        len -= 4;
    } while (len > 0);

    return acc;
}
```

The source code loads four floats at a time from the input array, and adds them elementwise to the vector `acc`. In case `len` is not a multiple of 4, an explicit MVE predicate is constructed using `vctp32q` so that the final loop iteration will load fewer than 4 floats.

Usually in this kind of code the `vaddq_m_f32` intrinsic would pass `acc` as its first operand as well as its second, so that any vector lanes disabled by the predicate would be left unchanged from their value in the previous iteration. However, in _this_ code, the `vaddq_m_f32` takes its inactive lanes from the constant all-zero vector `inactive`.

So the semantics of this code as written is that any vector lane not used by the last loop iteration will be _zero_ in the returned vector, rather than containing the sum of array elements from previous iterations.
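The intended per-lane behaviour can be sketched in plain scalar C, without MVE. This is only an illustrative model under stated assumptions: `test_func_model` and `LANES` are hypothetical names, `len` is assumed to be positive, and each intrinsic is approximated by the per-lane expression noted in the comments.

```c
#include <stdint.h>

#define LANES 4

/* Scalar stand-in for the source loop (hypothetical helper, not MVE code).
 * Assumes len > 0 and that array holds at least len floats. */
static void test_func_model(const float *array, int32_t len, float out[LANES]) {
    const float inactive[LANES] = {0.0f, 0.0f, 0.0f, 0.0f};
    float acc[LANES] = {0.1f, 0.1f, 0.1f, 0.1f};   /* vdupq_n_f32(0.1f) */

    do {
        for (int lane = 0; lane < LANES; lane++) {
            int active = lane < len;                  /* models vctp32q(len) */
            /* models vldrwq_z_f32: inactive lanes load as zero, memory is
             * only touched for active lanes */
            float src = active ? array[lane] : 0.0f;
            /* models vaddq_m_f32(inactive, acc, vecSrc, tailpred): inactive
             * lanes come from 'inactive', not from the old 'acc' */
            acc[lane] = active ? acc[lane] + src : inactive[lane];
        }
        array += LANES;
        len -= LANES;
    } while (len > 0);

    for (int lane = 0; lane < LANES; lane++)
        out[lane] = acc[lane];
}
```

With an input of six floats, the second iteration has only lanes 0 and 1 active, so under this model lanes 2 and 3 of the result are exactly zero.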

When this code is compiled with the extra option `-mllvm -arm-loloops-disable-tailpred`, the ARMLowOverheadLoops pass generates a low-overhead loop using `dls` and `le`, but leaves the tail-predication alone. The `vmov q0,q1` in the middle of the loop is unpredicated, and copies _all_ of the `inactive` vector into q0, including the lanes disabled by the current loop iteration's predicate. Then the predicated `vaddt` after that overwrites only the active lanes with the sum of the previous `acc` and the loaded values, just as the source code says.
```
        dls         lr, r2
.LBB0_1:
        vctp.32     r1
        vmov        q2, q0
        vpst
        vldrwt.u32  q3, [r0], #16 // tail-predicated: load from input array
        vmov        q0, q1        // unpredicated: copy 'inactive' into q0
        vpst
        vaddt.f32   q0, q2, q3    // tail-pred: overwrite some of q0 with q2+q3
        subs        r1, #4
        le          lr, .LBB0_1
```

But removing `-mllvm -arm-loloops-disable-tailpred` causes ARMLowOverheadLoops to perform a transformation that changes the semantics (as of commit b256d0a7aa00079e7ff0e64d52b8055ed6440682):
```
        dlstp.32    lr, r1
.LBB0_1:
        vmov        q2, q0
        vldrw.u32   q3, [r0], #16 // tail-predication now done by LTPSIZE
        vmov        q0, q1        // ALSO TAIL-PREDICATED but shouldn't be
        vadd.f32    q0, q2, q3    // tail-predicated as before
        letp        lr, .LBB0_1
```

Now the tail-predication in the last loop iteration is done by the `dlstp` and `letp` instructions setting the LTPSIZE field in FPSCR, instead of by constructing a predicate in VPR. This means that _all_ the instructions in the loop are affected by the tail-predication. In particular, the `vmov q0,q1` is now copying only the _active_ lanes into q0. So the inactive lanes in the final iteration will not be zeroed: they will take whatever value was left in q0 after the previous iteration.
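The difference between the two loops can be reproduced with a small lane-level simulation in plain C. This is a rough sketch rather than a faithful model of the hardware: `run_loop_model` and its `ltpsize_predication` flag are hypothetical, the arrays `q0` to `q3` merely stand in for the registers in the listings above, and `len` is assumed to be positive.

```c
#include <stdbool.h>
#include <stdint.h>

#define LANES 4

/* Hypothetical lane-level model of the loop body.
 * ltpsize_predication == false: only the load and the add are predicated
 *                                (the dls/le loop with explicit VPT blocks).
 * ltpsize_predication == true:  every vector instruction is predicated
 *                                (the dlstp/letp loop). */
static void run_loop_model(const float *array, int32_t len,
                           bool ltpsize_predication, float q0[LANES]) {
    const float q1[LANES] = {0.0f, 0.0f, 0.0f, 0.0f};  /* 'inactive' */
    float q2[LANES] = {0.0f};
    float q3[LANES] = {0.0f};

    for (int lane = 0; lane < LANES; lane++)
        q0[lane] = 0.1f;                                /* initial 'acc' */

    do {
        /* Lanes are independent, so each lane runs the four instructions
         * in program order. */
        for (int lane = 0; lane < LANES; lane++) {
            bool active = lane < len;                   /* tail predicate */
            bool mov_writes = active || !ltpsize_predication;

            if (mov_writes) q2[lane] = q0[lane];        /* vmov q2, q0 */
            if (active)     q3[lane] = array[lane];     /* vldrw q3, [r0] */
            if (mov_writes) q0[lane] = q1[lane];        /* vmov q0, q1 */
            if (active)     q0[lane] = q2[lane] + q3[lane]; /* vadd q0, q2, q3 */
        }
        array += LANES;
        len -= LANES;
    } while (len > 0);
}
```

Running the model on six input floats with the flag cleared leaves lanes 2 and 3 at zero, matching the source semantics. With the flag set, those lanes keep the sums from the first iteration, which is the behaviour described for the `dlstp`/`letp` loop.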

In this situation, ARMLowOverheadLoops should recognize that tail-predicating the loop via LTPSIZE is an invalid transformation: the inactive lanes written by that `vmov` are needed, so the write to them cannot be discarded.