| Issue | 173274 |
| Summary | A few issues with fma and fcma optimizations on aarch64 |
| Labels | new issue |
| Assignees | |
| Reporter | yuyichao |
There are a few different and possibly related issues here. I can separate them out later, but I'd like to file them as a single report first and see which of these, if any, should be split out. All of the issues can be shown with the following code, which is based on complex number multiplication. Everything was tested with 21.1.7, and I've double-checked the assembly output for some of these on master with Compiler Explorer.
```c++
struct Complex {
    double real;
    double imag;
};
static inline void f1(Complex &a, Complex &b, Complex &c)
{
    auto vb = b; // make sure it's safe to access the memory for both fields.
    a = {(a.real + vb.real * c.real),
         (a.imag + vb.real * c.imag)};
}
static inline void f2(Complex &a, Complex &b, Complex &c)
{
    a = {(a.real + b.real * c.real) - b.imag * c.imag,
         (a.imag + b.real * c.imag) + b.imag * c.real};
}
void g11(Complex *a, Complex *b, Complex *c, int n)
{
    for (int i = 0; i < n; i++) {
        f1(a[i], b[i], c[i]);
    }
}
void g12(Complex *a, Complex *b, Complex *c)
{
    f1(*a, *b, *c);
}
void g21(Complex *a, Complex *b, Complex *c, int n)
{
    for (int i = 0; i < n; i++) {
        f2(a[i], b[i], c[i]);
    }
}
void g22(Complex *a, Complex *b, Complex *c)
{
    f2(*a, *b, *c);
}
```
First of all, when compiling with `-ffp-contract=fast -O3`, `g22` (the scalar version of the full computation) compiles to
```asm
ldp d4, d0, [x1]
ldp d1, d2, [x2]
ldp d5, d6, [x0]
fmul d3, d1, d0
fmadd d1, d1, d4, d5
fmadd d3, d2, d4, d3
fmsub d0, d0, d2, d1
fadd d1, d3, d6
stp d0, d1, [x0]
ret
```
whereas the `-ffp-contract=fast -O2` version compiles to
```asm
ldp d0, d4, [x0]
ldp d1, d5, [x2]
ldp d2, d3, [x1]
fmadd d0, d1, d2, d0
fmadd d1, d3, d1, d4
fmsub d0, d3, d5, d0
fmadd d1, d5, d2, d1
stp d0, d1, [x0]
ret
```
instead, with the array version (`g21`) showing a similar difference, i.e. the `-O2` version uses 4 fma instructions whereas the `-O3` version fails to fuse one of them. Testing the array version on an Apple M1 shows that the `-O3` version is ~30% slower.
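A rough harness like the sketch below should be enough to reproduce the difference (the array size and repeat count here are arbitrary, and `g21` is assumed to be built in a separate translation unit with the flags being compared so the optimization level under test actually applies):
```c++
#include <chrono>
#include <cstdio>
#include <vector>

struct Complex { double real; double imag; };
// Defined in a separately compiled TU, built with e.g. -O3 -ffp-contract=fast
// or -O2 -ffp-contract=fast.
void g21(Complex *a, Complex *b, Complex *c, int n);

int main()
{
    constexpr int n = 1 << 16;
    std::vector<Complex> a(n, {1.0, 2.0}), b(n, {3.0, 4.0}), c(n, {5.0, 6.0});
    auto t0 = std::chrono::steady_clock::now();
    for (int rep = 0; rep < 1000; rep++)
        g21(a.data(), b.data(), c.data(), n);
    auto t1 = std::chrono::steady_clock::now();
    // Print a result alongside the time so the work can't be optimized away.
    std::printf("%f ms, a[0] = {%f, %f}\n",
                std::chrono::duration<double, std::milli>(t1 - t0).count(),
                a[0].real, a[0].imag);
}
```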
Now, with the FCMA extension, the computation here can be done with fewer instructions (`fcmla`). However, clang refuses to do so with `-march=armv8.3-a -ffp-contract=fast` and only does it with `-march=armv8.3-a -ffast-math`. Based on a brief read of the pattern-matching code, it seems that LLVM requires the reassociation flag for this to work. That should be unnecessary: the `fcmla` instruction is simply a fused multiply-add and should only require the contract flag. It also means there doesn't seem to be a way for a frontend to generate `fcmla` instructions without adding the reassociation flag, which could have a much broader impact on the result.
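(For the source-level side, clang's block-scoped FP pragmas are the obvious way to ask for contraction without enabling reassociation; a sketch, just `f2` from above with `#pragma clang fp contract(fast)` applied locally, is below. I would expect, though I haven't exhaustively verified it, that this still doesn't produce `fcmla`, since the generated instructions would only carry the `contract` fast-math flag.)
```c++
static inline void f2_contract(Complex &a, Complex &b, Complex &c)
{
    // Request FP contraction for this block only, without -ffast-math;
    // the fmul/fadd in the IR should carry `contract` but not `reassoc`.
#pragma clang fp contract(fast)
    a = {(a.real + b.real * c.real) - b.imag * c.imag,
         (a.imag + b.real * c.imag) + b.imag * c.real};
}
```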
Even with `-march=armv8.3-a -ffast-math`, clang does not use `fcmla` for the scalar version. This is similar to GCC right now: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123260.
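For comparison, the scalar case is easy to write by hand with the ACLE complex-number intrinsics. A sketch (assuming `-march=armv8.3-a`, reusing the `Complex` struct from the test code; `g22_fcmla` is just an illustrative name) would be:
```c++
#include <arm_neon.h>

// Hand-written equivalent of g22/f2: two fcmla instructions do the full
// complex multiply-accumulate, even for a single (scalar) complex number.
void g22_fcmla(Complex *a, Complex *b, Complex *c)
{
    float64x2_t va = vld1q_f64(&a->real); // {a.real, a.imag}
    float64x2_t vb = vld1q_f64(&b->real);
    float64x2_t vc = vld1q_f64(&c->real);
    va = vcmlaq_f64(va, vb, vc);          // += {b.real*c.real, b.real*c.imag}
    va = vcmlaq_rot90_f64(va, vb, vc);    // += {-b.imag*c.imag, b.imag*c.real}
    vst1q_f64(&a->real, va);
}
```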
Unlike GCC, however, clang only uses `fcmla` in the unrolled (vectorized) loop in `g21`:
```asm
_Z3g21P7ComplexS0_S0_i: // @_Z3g21P7ComplexS0_S0_i
.cfi_startproc
// %bb.0:
cmp w3, #1
b.lt .LBB2_5
// %bb.1:
cmp w3, #4
mov w8, w3
b.hs .LBB2_6
// %bb.2:
mov x9, xzr
.LBB2_3:
mov w10, #8 // =0x8
sub x8, x8, x9
orr x11, x10, x9, lsl #4
add x9, x2, x11
add x10, x0, x11
add x11, x1, x11
.LBB2_4: // =>This Inner Loop Header: Depth=1
ldp d0, d4, [x10, #-8]
subs x8, x8, #1
ldp d1, d5, [x9, #-8]
add x9, x9, #16
ldp d2, d3, [x11, #-8]
add x11, x11, #16
fmadd d0, d1, d2, d0
fmadd d1, d3, d1, d4
fmsub d0, d3, d5, d0
fmadd d1, d5, d2, d1
stp d0, d1, [x10, #-8]
add x10, x10, #16
b.ne .LBB2_4
.LBB2_5:
ret
.LBB2_6:
lsl x9, x8, #4
add x10, x2, x9
add x11, x0, x9
add x9, x1, x9
cmp x0, x10
ccmp x2, x11, #2, lo
cset w10, lo
cmp x1, x11
ccmp x0, x9, #2, lo
mov x9, xzr
b.lo .LBB2_3
// %bb.7:
tbnz w10, #0, .LBB2_3
// %bb.8:
and x9, x8, #0x7ffffffe
mov x10, x0
mov x11, x1
mov x12, x2
mov x13, x9
.LBB2_9: // =>This Inner Loop Header: Depth=1
ldp q0, q4, [x11], #32
ldp q1, q5, [x12], #32
ldp q2, q3, [x10]
subs x13, x13, #2
fcmla v2.2d, v0.2d, v1.2d, #0
fcmla v3.2d, v4.2d, v5.2d, #0
fcmla v2.2d, v0.2d, v1.2d, #90
fcmla v3.2d, v4.2d, v5.2d, #90
stp q2, q3, [x10], #32
b.ne .LBB2_9
// %bb.10:
cmp x9, x8
b.ne .LBB2_3
b .LBB2_5
.Lfunc_end2:
.size _Z3g21P7ComplexS0_S0_i, .Lfunc_end2-_Z3g21P7ComplexS0_S0_i
.cfi_endproc
```
Note that the `.LBB2_4` block uses normal fma instructions whereas the unrolled `.LBB2_9` block uses `fcmla`. Both should be able to use `fcmla`.
Finally, the pattern matching doesn't seem to work for partial computations: neither `g11` nor `g12` uses `fcmla` under any flag combination I've tested, even though they could. This might be understandable for the scalar version `g12`, since not using `fcmla` replaces a 16-byte load with an 8-byte one (though I still doubt it's a win), but for the loop version (`g11`) using `fcmla` would get rid of some of the shuffling that is currently generated. This is again similar to GCC: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121925.
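For reference, the partial computation in `f1` should map to a single rotation-0 `fcmla`; here is a sketch with the same assumptions as the intrinsics example above (and relying on my reading that the rotation-0 form multiplies the real part of the second argument with both parts of the third):
```c++
static inline void f1_fcmla(Complex &a, Complex &b, Complex &c)
{
    float64x2_t va = vld1q_f64(&a.real); // {a.real, a.imag}
    float64x2_t vb = vld1q_f64(&b.real); // loads both fields, like `vb = b` above
    float64x2_t vc = vld1q_f64(&c.real);
    va = vcmlaq_f64(va, vb, vc);         // += {b.real*c.real, b.real*c.imag}
    vst1q_f64(&a.real, va);
}
```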
# Summary of the sub-issues
1. `-O3` fuses fewer multiply-add pairs than `-O2`.
2. `fcmla` requires the `reassoc` flag when it should only require `contract`.
3. `fcmla` is not generated for the scalar operation.
4. `fcmla` is not generated for non-unrolled loops.
5. `fcmla` is not used for the partial complex number operation, only for the full one.