| Field | Value |
|---|---|
| Issue | 161451 |
| Summary | [AArch64] Suboptimal code for 128bit multiplication by constants |
| Labels | backend:AArch64, missed-optimization |
| Assignees | |
| Reporter | Kmeakin |

https://godbolt.org/z/ejhrxofdb
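For certain constants, GCC generates faster and/or smaller code than LLVM when multiplying an `unsigned __int128` on AArch64. Each listing below is followed by its llvm-mca report.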
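
The actual source is behind the Compiler Explorer link above. As a minimal sketch, assuming the test functions are plain multiplications by a literal, the input would look roughly like this. The names `mul_3` and `mul_10` come from the assembly labels in the listings; the function bodies and the `u128` alias are assumptions.

```cpp
// Assumed reconstruction of the test functions; the real source is behind
// the Compiler Explorer link. unsigned __int128 is a GCC/Clang extension.
using u128 = unsigned __int128;

// Multiply a 128-bit value by 3 (Example 1).
u128 mul_3(u128 x) { return x * 3; }

// Multiply a 128-bit value by 10 (Example 2).
u128 mul_10(u128 x) { return x * 10; }
```

On AArch64 the argument arrives as two 64-bit halves (`x0` low, `x1` high) and the result is returned the same way, which is why every listing works on register pairs.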
# Example 1
For example, for `x * 3`, GCC generates code that is both smaller (5 instructions vs. 6) and faster (302 vs. 602 total cycles over 100 iterations):
## LLVM
```asm
mul_3(unsigned __int128):
    mov     w8, #3
    add     x9, x1, x1, lsl #1
    umulh   x8, x0, x8
    add     x0, x0, x0, lsl #1
    add     x1, x8, x9
    ret

Iterations:        100
Instructions:      600
Total Cycles:      602
Total uOps:        600

Dispatch Width:    3
uOps Per Cycle:    1.00
IPC:               1.00
Block RThroughput: 2.0

Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      1     0.33                        mov     w8, #3
 1      2     0.33                        add     x9, x1, x1, lsl #1
 1      5     2.00                        umulh   x8, x0, x8
 1      2     0.33                        add     x0, x0, x0, lsl #1
 1      1     0.33                        add     x1, x8, x9
 1      1     1.00                  U     ret
```
## GCC
```asm
mul_3(unsigned __int128):
    lsl     x2, x0, 1
    extr    x3, x1, x0, 63
    adds    x0, x2, x0
    adc     x1, x3, x1
    ret

Iterations:        100
Instructions:      500
Total Cycles:      302
Total uOps:        500

Dispatch Width:    3
uOps Per Cycle:    1.66
IPC:               1.66
Block RThroughput: 1.7

Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      2     0.33                        lsl     x2, x0, #1
 1      2     0.33                        extr    x3, x1, x0, #63
 1      1     0.33                        adds    x0, x2, x0
 1      1     0.33                        adc     x1, x3, x1
 1      1     1.00                  U     ret
```
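
To make the difference between the two sequences easier to follow, here is a rough C++ model of both on explicit 64-bit halves. It is only a sketch: `Halves`, `mul3_shift_add`, and `mul3_umulh` are hypothetical names introduced for illustration, and the carry flag is emulated with an unsigned comparison. GCC forms `2*x` with `lsl`/`extr` and adds `x` with a carry chain, while LLVM uses `umulh` only to recover the bits that spill from the low half into the high half.

```cpp
#include <cstdint>

using u128 = unsigned __int128;  // GCC/Clang extension

// Hypothetical helper type: a 128-bit value as two 64-bit registers.
struct Halves { std::uint64_t lo, hi; };

// Models GCC's mul_3: compute 2*x with lsl/extr, then add x with adds/adc.
Halves mul3_shift_add(std::uint64_t lo, std::uint64_t hi) {
    std::uint64_t lo2 = lo << 1;                 // lsl  x2, x0, 1
    std::uint64_t hi2 = (hi << 1) | (lo >> 63);  // extr x3, x1, x0, 63
    std::uint64_t sum_lo = lo2 + lo;             // adds x0, x2, x0
    std::uint64_t carry = sum_lo < lo2;          // carry flag set by adds
    std::uint64_t sum_hi = hi2 + hi + carry;     // adc  x1, x3, x1
    return {sum_lo, sum_hi};
}

// Models LLVM's mul_3: both halves by shift-add, plus umulh for the
// bits of lo*3 that overflow into the high half.
Halves mul3_umulh(std::uint64_t lo, std::uint64_t hi) {
    std::uint64_t three = 3;                     // mov   w8, #3
    std::uint64_t hi3 = hi + (hi << 1);          // add   x9, x1, x1, lsl #1
    std::uint64_t spill =                        // umulh x8, x0, x8
        static_cast<std::uint64_t>((static_cast<u128>(lo) * three) >> 64);
    std::uint64_t lo3 = lo + (lo << 1);          // add   x0, x0, x0, lsl #1
    return {lo3, spill + hi3};                   // add   x1, x8, x9
}
```

Both functions return the same halves for every input; in the llvm-mca reports above, the `umulh` carries a latency of 5 and a reciprocal throughput of 2.00, against 1 to 2 cycles and 0.33 for each instruction in the shift-and-add form.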
# Example 2
For `x * 10`, GCC generates code that is longer (7 instructions vs. 6) but faster (502 vs. 1002 total cycles over 100 iterations) than LLVM's:
## LLVM
```asm
mul_10(unsigned __int128):
    mov     w8, #10
    umulh   x9, x0, x8
    madd    x1, x1, x8, x9
    add     x8, x0, x0, lsl #2
    lsl     x0, x8, #1
    ret

Iterations:        100
Instructions:      600
Total Cycles:      1002
Total uOps:        600

Dispatch Width:    3
uOps Per Cycle:    0.60
IPC:               0.60
Block RThroughput: 4.0

Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      1     0.33                        mov     w8, #10
 1      5     2.00                        umulh   x9, x0, x8
 1      5     2.00                        madd    x1, x1, x8, x9
 1      2     0.33                        add     x8, x0, x0, lsl #2
 1      2     0.33                        lsl     x0, x8, #1
 1      1     1.00                  U     ret
```
## GCC
```asm
mul_10(unsigned __int128):
    lsl     x2, x0, 2
    extr    x3, x1, x0, 62
    adds    x2, x2, x0
    adc     x1, x3, x1
    lsl     x0, x2, 1
    extr    x1, x1, x2, 63
    ret

Iterations:        100
Instructions:      700
Total Cycles:      502
Total uOps:        700

Dispatch Width:    3
uOps Per Cycle:    1.39
IPC:               1.39
Block RThroughput: 2.3

Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      2     0.33                        lsl     x2, x0, #2
 1      2     0.33                        extr    x3, x1, x0, #62
 1      1     0.33                        adds    x2, x2, x0
 1      1     0.33                        adc     x1, x3, x1
 1      2     0.33                        lsl     x0, x2, #1
 1      2     0.33                        extr    x1, x1, x2, #63
 1      1     1.00                  U     ret
```
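
GCC's `x * 10` sequence appears to decompose the constant as `10 = (4 + 1) * 2`: a 128-bit shift by 2, a 128-bit add of `x`, then a 128-bit shift by 1. The following C++ sketch models that on explicit 64-bit halves. `Halves` and `mul10_shift_add` are hypothetical names introduced for illustration, and the carry flag is emulated with an unsigned comparison.

```cpp
#include <cstdint>

// Hypothetical helper type: a 128-bit value as two 64-bit registers.
struct Halves { std::uint64_t lo, hi; };

// Models GCC's mul_10 as ((x << 2) + x) << 1 on a register pair.
Halves mul10_shift_add(std::uint64_t lo, std::uint64_t hi) {
    // Step 1: 4*x as a 128-bit shift left by 2.
    std::uint64_t lo4 = lo << 2;                      // lsl  x2, x0, 2
    std::uint64_t hi4 = (hi << 2) | (lo >> 62);       // extr x3, x1, x0, 62

    // Step 2: 5*x = 4*x + x, with the carry propagated into the high half.
    std::uint64_t lo5 = lo4 + lo;                     // adds x2, x2, x0
    std::uint64_t carry = lo5 < lo4;                  // carry flag set by adds
    std::uint64_t hi5 = hi4 + hi + carry;             // adc  x1, x3, x1

    // Step 3: 10*x = 5*x as a 128-bit shift left by 1.
    std::uint64_t out_lo = lo5 << 1;                  // lsl  x0, x2, 1
    std::uint64_t out_hi = (hi5 << 1) | (lo5 >> 63);  // extr x1, x1, x2, 63
    return {out_lo, out_hi};
}
```

Every step here maps to an instruction with latency 1 or 2 in the report above, whereas LLVM's version chains `umulh` into `madd`, each with latency 5 and reciprocal throughput 2.00.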