- **Issue:** 144562
- **Summary:** [RISCV] Investigation of Memory Operation Expansion Thresholds for RISC-V
- **Labels:** new issue
- **Assignees:**
- **Reporter:** mikhailramalho
## Summary
We've been investigating the impact of memory operation expansion thresholds as measured on the SpacemiT X60. A sweep of a large number of threshold values showed little change in execution time, but a closer look at the generated code revealed some potential areas for improvement (and if addressed, it would be worth re-testing the impact of changing the thresholds).
We ran our experiments on the SpacemiT X60 with rva22u64_v, using the SPEC CPU2017 benchmark suite. We tested various combinations of the `MaxStoresPerMemset`, `MaxStoresPerMemcpy`, `MaxStoresPerMemmove`, and `MaxLoadsPerMemcmp` parameters (each ranging from 2 to 16).
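For reference, these thresholds govern whether fixed-size calls of the shape below are expanded into inline stores/loads or left as library calls. This is a minimal illustration (the function names and sizes are chosen to mirror the examples later in this report, not taken from SPEC):

```c
#include <string.h>

/* Fixed-size, statically-known operations like these are candidates for
 * inline expansion. Whether each becomes inline stores/loads or stays a
 * libcall depends on MaxStoresPerMemset, MaxStoresPerMemcpy,
 * MaxStoresPerMemmove, and MaxLoadsPerMemcmp respectively. */
void zero_buf(char *dst) { memset(dst, 0, 216); }

void copy_buf(char *dst, const char *src) { memcpy(dst, src, 216); }

int same_block(const char *a, const char *b) { return memcmp(a, b, 64) == 0; }
```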
## Key Findings
### 1. Performance is surprisingly stable
After testing 66 different configurations, performance across all benchmarks varied by less than 2% (ranging from ~267 to ~271 in geometric mean). The default configuration (8,8,8,8) performs just as well as any other combination we tried.
### 2. But the generated code has potential issues to explore
(Note: all the examples here are taken from diffing the generated code for different memory operation expansion thresholds, so sometimes the diff shows non-expanded => expanded and other times the other way around.)
Below is a sampling of issues that stood out; it is not exhaustive.
#### a) Potential mismatch between the costed value and the complexity of the generated code
A 256-byte memset expands to just a handful of instructions:
```asm
- li a0, 32
+ li a2, 256
+ li a1, 0
+ call memset
li s1, -1
- vsetvli zero, a0, e64, m8, ta, ma
- vmv.v.i v8, 0
- vse64.v v8, (s0)
```
But for a 216-byte memset, the expansion produces 13 instructions. It's not clear that the cost model accounts for the full expansion cost (although there is an opportunity to do something more sophisticated in `findOptimalMemOpLowering`).
```asm
- vsetivli zero, 16, e64, m4, ta, ma
- vmv.v.i v8, 0
- addi a1, a0, 184
- vsetivli zero, 4, e64, m1, ta, ma
- vmv.v.i v12, 0
- vsetivli zero, 16, e64, m4, ta, ma
- vse64.v v8, (a0)
- vsetivli zero, 4, e64, m1, ta, ma
- vse64.v v12, (a1)
- addi a1, a0, 160
- addi a2, a0, 128
- vse64.v v12, (a1)
- vse64.v v12, (a2)
- ret
+ li a2, 216
+ li a1, 0
+ tail memset
```
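One way to see the asymmetry: with greedy power-of-two chunking and a 128-byte maximum inline chunk, a 256-byte memset decomposes into two equal chunks, while 216 bytes needs four chunks of different sizes, each size change costing an extra `vsetvli`. The sketch below illustrates that decomposition; the greedy strategy and the 128-byte cap are assumptions for illustration, not the actual `findOptimalMemOpLowering` logic:

```c
/* Rough illustration only: count the non-overlapping power-of-two store
 * chunks a memset of `size` bytes decomposes into, capped at a hypothetical
 * 128-byte maximum vector store. NOT the actual SelectionDAG algorithm. */
int store_chunks(int size, int max_chunk) {
    int n = 0;
    while (size > 0) {
        int c = max_chunk;
        while (c > size)
            c >>= 1; /* largest power of two <= remaining size */
        size -= c;
        n++;
    }
    return n;
}
/* store_chunks(256, 128) == 2, store_chunks(216, 128) == 4 */
```

Under this model, the per-byte cost of a "round" size like 256 is much lower than that of an awkward size like 216, which a cost model based only on total size would miss.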
#### b) Too many vsetvli instructions
We're seeing unnecessary vector configuration changes:
```asm
-.Lpcrel_hi3:
- auipc a1, %pcrel_hi(.L.str.11)
- li a2, 32
- addi a1, a1, %pcrel_lo(.Lpcrel_hi3)
- ld a3, 112(sp)
- vsetvli zero, a2, e8, m1, ta, ma
- vle8.v v8, (a1)
- li a4, 64
- addi a5, a1, 32
+ mv s1, a0
+ ld s0, 112(sp)
sd a0, 8(sp)
- sd a3, 24(sp)
- vse8.v v8, (a0)
- vsetvli zero, a4, e8, m2, ta, ma
- vle8.v v8, (a5)
- addi a4, a0, 32
- vse8.v v8, (a4)
- li a4, 128
- addi a5, a1, 96
- addi a1, a1, 202
- vsetvli zero, a4, e8, m4, ta, ma
- vle8.v v8, (a5)
- vsetvli zero, a2, e8, m1, ta, ma
- vle8.v v12, (a1)
- addi a1, a0, 96
- vsetvli zero, a4, e8, m4, ta, ma
- vse8.v v8, (a1)
- addi a1, a0, 202
- add a0, a0, a3
- vsetvli zero, a2, e8, m1, ta, ma
- vse8.v v12, (a1)
- sd a3, 16(sp)
- sb zero, 0(a0)
+ sd s0, 24(sp)
+.Lpcrel_hi3:
+ auipc a0, %pcrel_hi(.L.str.11)
+ addi a1, a0, %pcrel_lo(.Lpcrel_hi3)
+ li a2, 234
+ mv a0, s1
+ call memcpy
+ add s1, s1, s0
+ sd s0, 16(sp)
+ sb zero, 0(s1)
```
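A reduced C equivalent of the pattern above (the 234-byte size is taken from the `li a2, 234` in the diff; the function and parameter names are hypothetical):

```c
#include <string.h>

/* Hypothetical reduction: copying a fixed 234-byte constant. The inline
 * expansion in the diff above splits this into 32-, 64-, 128-, and 32-byte
 * vector chunks, and each change of chunk size gets its own vsetvli, even
 * when an earlier configuration could be reused. */
void fill_header(char *dst, const char *src) {
    memcpy(dst, src, 234);
}
```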
#### c) At least one instance of inefficient offset calculation
This is from Blender:
```asm
-.Lpcrel_hi12:
- auipc a1, %pcrel_hi(.L.str.15)
- li a2, 32
- addi a1, a1, %pcrel_lo(.Lpcrel_hi12)
- ld a3, 40(sp)
- vsetvli zero, a2, e8, m1, ta, ma
- vle8.v v8, (a1)
- li a4, 64
- addi a5, a1, 32
+ mv s0, a0
+ ld s1, 40(sp)
sd a0, 144(sp)
- sd a3, 160(sp)
- vse8.v v8, (a0)
- vsetvli zero, a4, e8, m2, ta, ma
- vle8.v v8, (a5)
- addi a4, a0, 32
- vse8.v v8, (a4)
- li a4, 128
- addi a5, a1, 96
- addi a1, a1, 221
- vsetvli zero, a4, e8, m4, ta, ma
- vle8.v v8, (a5)
- vsetvli zero, a2, e8, m1, ta, ma
- vle8.v v12, (a1)
- addi a1, a0, 96
- vsetvli zero, a4, e8, m4, ta, ma
- vse8.v v8, (a1)
- addi a1, a0, 221
- add a0, a0, a3
- vsetvli zero, a2, e8, m1, ta, ma
- vse8.v v12, (a1)
- sd a3, 152(sp)
- sb zero, 0(a0)
+ sd s1, 160(sp)
+.Lpcrel_hi12:
+ auipc a0, %pcrel_hi(.L.str.15)
+ addi a1, a0, %pcrel_lo(.Lpcrel_hi12)
+ li a2, 253
+ mv a0, s0
+ call memcpy
+ add s0, s0, s1
+ sd s1, 152(sp)
+ sb zero, 0(s0)
```
#### d) memcmp expansion not using vectors
For a 64-byte comparison, instead of using vector instructions, we get:
```asm
ld a1, 112(a0)
ld a0, 104(a0)
ld a6, 0(a1)
ld a7, 8(a1)
ld t0, 16(a1)
ld t1, 24(a1)
ld t5, 0(a0)
ld t4, 8(a0)
ld t2, 16(a0)
ld t3, 24(a0)
ld t6, 32(a1)
ld s3, 40(a1)
ld s8, 48(a1)
ld s9, 56(a1)
ld a3, 32(a0)
ld s0, 40(a0)
ld s1, 48(a0)
ld a0, 56(a0)
xor a4, a6, t5
xor a5, a7, t4
xor a2, t0, t2
xor a1, t1, t3
xor a3, t6, a3
xor s0, s3, s0
xor s1, s8, s1
xor a0, s9, a0
or a4, a4, a5
or a1, a1, a2
or a3, a3, s0
or a0, a0, s1
or a1, a1, a4
or a0, a0, a3
or a0, a0, a1
ld a1, 104(a0)
ld a0, 112(a0)
li a2, 64
call bcmp
```
However, this should now be addressed by https://github.com/llvm/llvm-project/commit/4903c11a7e14.
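For reference, a comparison of this shape (name hypothetical) exercises the `MaxLoadsPerMemcmp` path:

```c
#include <string.h>

/* A fixed 64-byte equality check. MaxLoadsPerMemcmp and the load types the
 * backend reports as legal for memcmp expansion decide between a scalar
 * ld/xor/or chain, vector compares, or a bcmp/memcmp libcall. */
int blocks_equal(const void *a, const void *b) {
    return memcmp(a, b, 64) == 0;
}
```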
---
The full data from the sweep of expansion parameters is available at https://docs.google.com/spreadsheets/d/1M6x5-SE0zsqebgdAHth8abbw5X3Gm38Z536Pnho6bjE/edit?gid=0#gid=0
Our main takeaway is that while changing the thresholds has limited impact on the X60 as things stand, there are avenues to explore for improving the expansion itself, and some of those improvements may make threshold changes more impactful.
_______________________________________________
llvm-bugs mailing list
llvm-bugs@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-bugs