| Issue | 173274 |
| Summary | A few issues with fma and fcma optimizations on aarch64 |
| Labels | new issue |
| Assignees | |
| Reporter | yuyichao |
There are a few different and possibly related issues here. I can separate them out later, but I'd like to file them as a single report first and see which of these, if any, should be split out. All of the issues can be shown with the following code, which is based on complex number multiplication. Everything was tested with 21.1.7, and I've double-checked the assembly output for some of these on master with Compiler Explorer.
```c++
struct Complex {
    double real;
    double imag;
};
static inline void f1(Complex &a, Complex &b, Complex &c)
{
    auto vb = b; // make sure it's safe to access the memory for both fields.
    a = {(a.real + vb.real * c.real),
         (a.imag + vb.real * c.imag)};
}
static inline void f2(Complex &a, Complex &b, Complex &c)
{
    a = {(a.real + b.real * c.real) - b.imag * c.imag,
         (a.imag + b.real * c.imag) + b.imag * c.real};
}
void g11(Complex *a, Complex *b, Complex *c, int n)
{
    for (int i = 0; i < n; i++) {
        f1(a[i], b[i], c[i]);
    }
}
void g12(Complex *a, Complex *b, Complex *c)
{
    f1(*a, *b, *c);
}
void g21(Complex *a, Complex *b, Complex *c, int n)
{
    for (int i = 0; i < n; i++) {
        f2(a[i], b[i], c[i]);
    }
}
void g22(Complex *a, Complex *b, Complex *c)
{
    f2(*a, *b, *c);
}
```
First of all, when compiling with `-ffp-contract=fast -O3`, `g22` (the scalar version of the full computation) compiles to
```asm
ldp d4, d0, [x1]
ldp d1, d2, [x2]
ldp d5, d6, [x0]
fmul d3, d1, d0
fmadd d1, d1, d4, d5
fmadd d3, d2, d4, d3
fmsub d0, d0, d2, d1
fadd d1, d3, d6
stp d0, d1, [x0]
ret
```
whereas the `-ffp-contract=fast -O2` version compiles to
```asm
ldp d0, d4, [x0]
ldp d1, d5, [x2]
ldp d2, d3, [x1]
fmadd d0, d1, d2, d0
fmadd d1, d3, d1, d4
fmsub d0, d3, d5, d0
fmadd d1, d5, d2, d1
stp d0, d1, [x0]
ret
```
instead, with the array version (`g21`) showing a similar difference, i.e. the `-O2` version uses 4 fma instructions whereas the `-O3` version fails to fuse one of them. Testing the array version on an Apple M1 shows that the `-O3` version is ~30% slower.
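A rough harness like the sketch below should be enough to reproduce the difference (the array size and repeat count here are arbitrary, and `g21` is assumed to be built in a separate translation unit with the flags being compared so the optimization level under test actually applies):
```c++
#include <chrono>
#include <cstdio>
#include <vector>

struct Complex { double real; double imag; };
// Defined in a separately compiled TU, built with e.g. -O3 -ffp-contract=fast
// or -O2 -ffp-contract=fast.
void g21(Complex *a, Complex *b, Complex *c, int n);

int main()
{
    constexpr int n = 1 << 16;
    std::vector<Complex> a(n, {1.0, 2.0}), b(n, {3.0, 4.0}), c(n, {5.0, 6.0});
    auto t0 = std::chrono::steady_clock::now();
    for (int rep = 0; rep < 1000; rep++)
        g21(a.data(), b.data(), c.data(), n);
    auto t1 = std::chrono::steady_clock::now();
    // Print a result alongside the time so the work can't be optimized away.
    std::printf("%f ms, a[0] = {%f, %f}\n",
                std::chrono::duration<double, std::milli>(t1 - t0).count(),
                a[0].real, a[0].imag);
}
```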
Now, with the FCMA extension, the computation here can be done with fewer instructions (`fcmla`). However, clang refuses to do so with `-march=armv8.3-a -ffp-contract=fast` and only does it with `-march=armv8.3-a -ffast-math`. Based on a brief read of the pattern-matching code, it seems that LLVM requires the reassociation flag for this to work. That should be unnecessary: the `fcmla` instruction is simply a fused multiply-add and should only require the contract flag. It also means there doesn't seem to be a way for a frontend to generate `fcmla` instructions without adding the reassociation flag, which could have a much broader impact on the result.
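(For the source-level side, clang's block-scoped FP pragmas are the obvious way to ask for contraction without enabling reassociation; a sketch, just `f2` from above with `#pragma clang fp contract(fast)` applied locally, is below. I would expect, though I haven't exhaustively verified it, that this still doesn't produce `fcmla`, since the generated instructions would only carry the `contract` fast-math flag.)
```c++
static inline void f2_contract(Complex &a, Complex &b, Complex &c)
{
    // Request FP contraction for this block only, without -ffast-math;
    // the fmul/fadd in the IR should carry `contract` but not `reassoc`.
#pragma clang fp contract(fast)
    a = {(a.real + b.real * c.real) - b.imag * c.imag,
         (a.imag + b.real * c.imag) + b.imag * c.real};
}
```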
Even with `-march=armv8.3-a -ffast-math`, clang does not use `fcmla` for the scalar version. This is similar to GCC right now: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123260.
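For comparison, the scalar case is easy to write by hand with the ACLE complex-number intrinsics. A sketch (assuming `-march=armv8.3-a`, reusing the `Complex` struct from the test code; `g22_fcmla` is just an illustrative name) would be:
```c++
#include <arm_neon.h>

// Hand-written equivalent of g22/f2: two fcmla instructions do the full
// complex multiply-accumulate, even for a single (scalar) complex number.
void g22_fcmla(Complex *a, Complex *b, Complex *c)
{
    float64x2_t va = vld1q_f64(&a->real); // {a.real, a.imag}
    float64x2_t vb = vld1q_f64(&b->real);
    float64x2_t vc = vld1q_f64(&c->real);
    va = vcmlaq_f64(va, vb, vc);          // += {b.real*c.real, b.real*c.imag}
    va = vcmlaq_rot90_f64(va, vb, vc);    // += {-b.imag*c.imag, b.imag*c.real}
    vst1q_f64(&a->real, va);
}
```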
Unlike GCC, however, clang only uses `fcmla` in the unrolled (vectorized) loop in `g21`:
```asm
_Z3g21P7ComplexS0_S0_i: // @_Z3g21P7ComplexS0_S0_i
.cfi_startproc
// %bb.0:
cmp w3, #1
b.lt .LBB2_5
// %bb.1:
cmp w3, #4
mov w8, w3
b.hs .LBB2_6
// %bb.2:
mov x9, xzr
.LBB2_3:
mov w10, #8 // =0x8
sub x8, x8, x9
orr x11, x10, x9, lsl #4
add x9, x2, x11
add x10, x0, x11
add x11, x1, x11
.LBB2_4: // =>This Inner Loop Header: Depth=1
ldp d0, d4, [x10, #-8]
subs x8, x8, #1
ldp d1, d5, [x9, #-8]
add x9, x9, #16
ldp d2, d3, [x11, #-8]
add x11, x11, #16
fmadd d0, d1, d2, d0
fmadd d1, d3, d1, d4
fmsub d0, d3, d5, d0
fmadd d1, d5, d2, d1
stp d0, d1, [x10, #-8]
add x10, x10, #16
b.ne .LBB2_4
.LBB2_5:
ret
.LBB2_6:
lsl x9, x8, #4
add x10, x2, x9
add x11, x0, x9
add x9, x1, x9
cmp x0, x10
ccmp x2, x11, #2, lo
cset w10, lo
cmp x1, x11
ccmp x0, x9, #2, lo
mov x9, xzr
b.lo .LBB2_3
// %bb.7:
tbnz w10, #0, .LBB2_3
// %bb.8:
and x9, x8, #0x7ffffffe
mov x10, x0
mov x11, x1
mov x12, x2
mov x13, x9
.LBB2_9: // =>This Inner Loop Header: Depth=1
ldp q0, q4, [x11], #32
ldp q1, q5, [x12], #32
ldp q2, q3, [x10]
subs x13, x13, #2
fcmla v2.2d, v0.2d, v1.2d, #0
fcmla v3.2d, v4.2d, v5.2d, #0
fcmla v2.2d, v0.2d, v1.2d, #90
fcmla v3.2d, v4.2d, v5.2d, #90
stp q2, q3, [x10], #32
b.ne .LBB2_9
// %bb.10:
cmp x9, x8
b.ne .LBB2_3
b .LBB2_5
.Lfunc_end2:
.size _Z3g21P7ComplexS0_S0_i, .Lfunc_end2-_Z3g21P7ComplexS0_S0_i
.cfi_endproc
```
Note that the `.LBB2_4` block uses normal fma instructions whereas the unrolled `.LBB2_9` block uses `fcmla`. Both should be able to use `fcmla`.
Finally, the pattern matching doesn't seem to work for partial computations: neither `g11` nor `g12` uses `fcmla` under any flag combination I've tested, even though they could. This might be understandable for the scalar version `g12`, since not using `fcmla` replaces a 16-byte load with an 8-byte one (though I still doubt it's a win), but for the loop version (`g11`) using `fcmla` would get rid of some of the shuffling that is currently generated. This is again similar to GCC: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121925.
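For reference, the partial computation in `f1` should map to a single rotation-0 `fcmla`; here is a sketch with the same assumptions as the intrinsics example above (and relying on my reading that the rotation-0 form multiplies the real part of the second argument with both parts of the third):
```c++
static inline void f1_fcmla(Complex &a, Complex &b, Complex &c)
{
    float64x2_t va = vld1q_f64(&a.real); // {a.real, a.imag}
    float64x2_t vb = vld1q_f64(&b.real); // loads both fields, like `vb = b` above
    float64x2_t vc = vld1q_f64(&c.real);
    va = vcmlaq_f64(va, vb, vc);         // += {b.real*c.real, b.real*c.imag}
    vst1q_f64(&a.real, va);
}
```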
# Summary of the sub-issues
1. `-O3` fuses fewer multiply-add pairs than `-O2`.
2. `fcmla` requires the `reassoc` flag when it should only require `contract`.
3. `fcmla` is not generated for the scalar operation.
4. `fcmla` is not generated for non-unrolled loops.
5. `fcmla` is not used for the partial complex number operation, only for the full one.