[llvm-bugs] [Bug 161451] [AArch64] Suboptimal code for 128bit multiplication by constants

LLVM Bugs via llvm-bugs Tue, 30 Sep 2025 14:53:31 -0700

Issue	161451
Summary	[AArch64] Suboptimal code for 128bit multiplication by constants
Labels	backend:AArch64, missed-optimization
Assignees
Reporter	Kmeakin

    https://godbolt.org/z/ejhrxofdb

For certain constants, GCC generates faster and/or smaller code than LLVM.


# Example 1
For `x * 3`, GCC generates code that is both smaller and faster:

## LLVM
```asm
mul_3(unsigned __int128):
        mov     w8, #3
        add     x9, x1, x1, lsl #1
        umulh   x8, x0, x8
        add     x0, x0, x0, lsl #1
        add     x1, x8, x9
        ret

Iterations:        100
Instructions:      600
Total Cycles:      602
Total uOps:        600

Dispatch Width:    3
uOps Per Cycle:    1.00
IPC:               1.00
Block RThroughput: 2.0


Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      1     0.33                        mov	w8, #3
 1      2     0.33                        add	x9, x1, x1, lsl #1
 1      5     2.00                        umulh	x8, x0, x8
 1      2     0.33                        add	x0, x0, x0, lsl #1
 1      1     0.33                        add	x1, x8, x9
 1      1     1.00                  U     ret
```

## GCC
```asm
mul_3(unsigned __int128):
        lsl     x2, x0, 1
        extr    x3, x1, x0, 63
        adds    x0, x2, x0
        adc     x1, x3, x1
        ret

Iterations:        100
Instructions:      500
Total Cycles:      302
Total uOps:        500

Dispatch Width:    3
uOps Per Cycle:    1.66
IPC:               1.66
Block RThroughput: 1.7


Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      2     0.33                        lsl	x2, x0, #1
 1      2     0.33                        extr	x3, x1, x0, #63
 1      1     0.33                        adds	x0, x2, x0
 1      1     0.33                        adc	x1, x3, x1
 1      1     1.00                  U     ret
```
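
For reference, the GCC sequence above is the decomposition `x * 3 = (x << 1) + x` carried out on two 64-bit halves. Below is a minimal C++ sketch of that arithmetic, assuming the `unsigned __int128` extension available in GCC and Clang. The names `mul3_shift_add` and `check_mul3` are hypothetical and used only for illustration; the comments map each step to the instruction it mirrors.

```cpp
#include <cassert>
#include <cstdint>

using u128 = unsigned __int128;  // GCC/Clang extension, not standard C++

// Hypothetical helper: x * 3 computed as (x << 1) + x on 64-bit halves.
inline u128 mul3_shift_add(u128 x) {
    const std::uint64_t lo = static_cast<std::uint64_t>(x);
    const std::uint64_t hi = static_cast<std::uint64_t>(x >> 64);

    const std::uint64_t lo2 = lo << 1;                 // lsl  x2, x0, 1
    const std::uint64_t hi2 = (hi << 1) | (lo >> 63);  // extr x3, x1, x0, 63

    const std::uint64_t out_lo = lo2 + lo;             // adds x0, x2, x0
    const std::uint64_t carry = out_lo < lo2 ? 1 : 0;  // carry flag set by adds
    const std::uint64_t out_hi = hi2 + hi + carry;     // adc  x1, x3, x1

    return (static_cast<u128>(out_hi) << 64) | out_lo;
}

// Compare against the compiler's own 128-bit multiply on a few inputs.
inline void check_mul3() {
    const u128 samples[] = {
        0, 1, 5,
        static_cast<u128>(UINT64_MAX),   // forces a carry out of the low half
        static_cast<u128>(1) << 64,
        ~static_cast<u128>(0),           // result wraps modulo 2^128
        (static_cast<u128>(0x0123456789abcdefULL) << 64) | 0xfedcba9876543210ULL,
    };
    for (u128 x : samples) {
        assert(mul3_shift_add(x) == x * 3);
    }
}
```

Calling `check_mul3()` exercises the carry and wrap-around cases. The sketch contains no multiply, which is why the sequence avoids the 5-cycle `umulh` shown in the LLVM report.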

# Example 2
For `x * 10`, GCC generates code that is longer than LLVM's, but faster:

## LLVM
```asm
mul_10(unsigned __int128):
        mov     w8, #10
        umulh   x9, x0, x8
        madd    x1, x1, x8, x9
        add     x8, x0, x0, lsl #2
        lsl     x0, x8, #1
        ret

Iterations:        100
Instructions:      600
Total Cycles:      1002
Total uOps:        600

Dispatch Width:    3
uOps Per Cycle:    0.60
IPC:               0.60
Block RThroughput: 4.0


Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      1     0.33                        mov	w8, #10
 1      5     2.00                        umulh	x9, x0, x8
 1      5     2.00                        madd	x1, x1, x8, x9
 1      2     0.33                        add	x8, x0, x0, lsl #2
 1      2     0.33                        lsl	x0, x8, #1
 1      1     1.00                  U     ret
```

## GCC
```asm
mul_10(unsigned __int128):
        lsl     x2, x0, 2
        extr    x3, x1, x0, 62
        adds    x2, x2, x0
        adc     x1, x3, x1
        lsl     x0, x2, 1
        extr    x1, x1, x2, 63
        ret

Iterations:        100
Instructions:      700
Total Cycles:      502
Total uOps:        700

Dispatch Width:    3
uOps Per Cycle:    1.39
IPC:               1.39
Block RThroughput: 2.3


Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      2     0.33                        lsl	x2, x0, #2
 1      2     0.33                        extr	x3, x1, x0, #62
 1      1     0.33                        adds	x2, x2, x0
 1      1     0.33                        adc	x1, x3, x1
 1      2     0.33                        lsl	x0, x2, #1
 1      2     0.33                        extr	x1, x1, x2, #63
 1      1     1.00                  U     ret
```
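
The GCC sequence above reads as `x * 10 = ((x << 2) + x) << 1`, i.e. multiply by 5 with a shift and an add, then double. Below is a minimal C++ sketch of the same steps on 64-bit halves, again assuming the `unsigned __int128` extension in GCC and Clang. `mul10_shift_add` and `check_mul10` are hypothetical names chosen for illustration.

```cpp
#include <cassert>
#include <cstdint>

using u128 = unsigned __int128;  // GCC/Clang extension, not standard C++

// Hypothetical helper: x * 10 computed as ((x << 2) + x) << 1 on 64-bit halves.
inline u128 mul10_shift_add(u128 x) {
    const std::uint64_t lo = static_cast<std::uint64_t>(x);
    const std::uint64_t hi = static_cast<std::uint64_t>(x >> 64);

    // Step 1: x << 2
    const std::uint64_t lo4 = lo << 2;                 // lsl  x2, x0, 2
    const std::uint64_t hi4 = (hi << 2) | (lo >> 62);  // extr x3, x1, x0, 62

    // Step 2: (x << 2) + x == x * 5
    const std::uint64_t lo5 = lo4 + lo;                // adds x2, x2, x0
    const std::uint64_t carry = lo5 < lo4 ? 1 : 0;     // carry flag set by adds
    const std::uint64_t hi5 = hi4 + hi + carry;        // adc  x1, x3, x1

    // Step 3: (x * 5) << 1 == x * 10
    const std::uint64_t out_lo = lo5 << 1;                  // lsl  x0, x2, 1
    const std::uint64_t out_hi = (hi5 << 1) | (lo5 >> 63);  // extr x1, x1, x2, 63

    return (static_cast<u128>(out_hi) << 64) | out_lo;
}

// Compare against the compiler's own 128-bit multiply on a few inputs.
inline void check_mul10() {
    const u128 samples[] = {
        0, 1, 7,
        static_cast<u128>(UINT64_MAX),   // bits cross from the low to the high half
        static_cast<u128>(1) << 64,
        ~static_cast<u128>(0),           // result wraps modulo 2^128
        (static_cast<u128>(0x0123456789abcdefULL) << 64) | 0xfedcba9876543210ULL,
    };
    for (u128 x : samples) {
        assert(mul10_shift_add(x) == x * 10);
    }
}
```

Every step here maps to an instruction that the llvm-mca report lists with a latency of 1 or 2 cycles, whereas the LLVM version chains `umulh` into `madd`, each listed at 5 cycles.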
