Issue 144562
Summary [RISCV] Investigation of Memory Operation Expansion Thresholds for RISC-V
Labels new issue
Assignees
Reporter mikhailramalho
    ## Summary

We've been investigating the impact of the memory operation expansion thresholds, measured on the SpacemiT X60. A sweep over a large number of different threshold values showed little change in execution time, but a closer look at the generated code revealed some potential areas of improvement (and if those are addressed, it would be worth re-testing the impact of changing the thresholds).

We ran our experiments on the SpacemiT X60 with rva22u64_v, using the SPEC CPU2017 benchmark suite. We tested various combinations of the `MaxStoresPerMemset`, `MaxStoresPerMemcpy`, `MaxStoresPerMemmove`, and `MaxLoadsPerMemcmp` parameters (ranging from 2 to 16).
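
For reference, these four thresholds are generic `TargetLowering` members that a backend can override; the sweep varied them directly in the RISC-V backend. Below is a minimal illustrative fragment of where such settings live (the member names are LLVM's, the surrounding constructor is omitted, and the values shown are just the default (8,8,8,8) configuration rather than a proposed change):

```cpp
// Illustrative fragment only: the TargetLowering members that bound how many
// load/store operations an inline expansion may use before the backend falls
// back to calling the libc routine. A target sets them in its TargetLowering
// constructor; the values below are the default (8,8,8,8) configuration.
MaxStoresPerMemset  = 8; // stores allowed when expanding memset inline
MaxStoresPerMemcpy  = 8; // stores allowed when expanding memcpy inline
MaxStoresPerMemmove = 8; // stores allowed when expanding memmove inline
MaxLoadsPerMemcmp   = 8; // loads allowed when expanding memcmp inline
```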

## Key Findings

### 1. Performance is surprisingly stable

After testing 66 different configurations, performance across all benchmarks varied by less than 2% (ranging from ~267 to ~271 in geometric mean). The default configuration (8,8,8,8) performs just as well as any other combination we tried. 

### 2. But the generated code has potential issues to explore

(Note: all of the examples below come from diffing the generated code for different memory operation expansion thresholds, so sometimes the diff shows non-expanded => expanded and other times the other way round.)

Below is a sampling of issues that leapt out; it is not meant to be exhaustive.

#### a) Potential mismatch between the costed value and the complexity of the generated code
A 256-byte memset expands to just a handful of instructions:
```asm
-       li      a0, 32
+       li      a2, 256
+       li      a1, 0
+       call    memset
        li      s1, -1
-       vsetvli zero, a0, e64, m8, ta, ma
-       vmv.v.i v8, 0
-       vse64.v v8, (s0)
```
But for a 216-byte memset, the expansion produces 13 instructions. It's not clear that the cost model takes the full expansion cost into account (although there is an opportunity to do something more sophisticated in `findOptimalMemOpLowering`):
```asm
-       vsetivli zero, 16, e64, m4, ta, ma
-       vmv.v.i v8, 0
-       addi    a1, a0, 184
-       vsetivli zero, 4, e64, m1, ta, ma
-       vmv.v.i v12, 0
-       vsetivli zero, 16, e64, m4, ta, ma
-       vse64.v v8, (a0)
-       vsetivli zero, 4, e64, m1, ta, ma
-       vse64.v v12, (a1)
-       addi    a1, a0, 160
-       addi    a2, a0, 128
-       vse64.v v12, (a1)
-       vse64.v v12, (a2)
-       ret
+       li      a2, 216
+       li      a1, 0
+       tail    memset
```
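
The gap is easy to quantify from the listing above: the threshold check counts stores, while the emitted code also needs vtype changes, splat materialization, and address arithmetic. Below is a small self-contained tally of the 216-byte case; it is plain arithmetic over the instructions quoted above, not a model of LLVM's actual costing:

```cpp
#include <cstdio>

// Tally of the 216-byte memset expansion quoted above (excluding ret).
// Only the vse64.v stores are counted against MaxStoresPerMemset, but the
// expansion also emits vsetivli, vmv.v.i and addi instructions.
int main() {
  const unsigned Stores = 4;       // vse64.v
  const unsigned VtypeChanges = 4; // vsetivli
  const unsigned Splats = 2;       // vmv.v.i
  const unsigned AddrCalcs = 3;    // addi
  std::printf("stores counted against the threshold: %u\n", Stores);
  std::printf("instructions actually emitted:        %u\n",
              Stores + VtypeChanges + Splats + AddrCalcs);
  // By contrast, the 256-byte case above is a single counted store that
  // expands to roughly four instructions (li, vsetvli, vmv.v.i, vse64.v).
  return 0;
}
```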

#### b) Too many vsetvli instructions
We're seeing unnecessary vector configuration changes:
```asm
-.Lpcrel_hi3:
-       auipc   a1, %pcrel_hi(.L.str.11)
-       li      a2, 32
-       addi    a1, a1, %pcrel_lo(.Lpcrel_hi3)
-       ld      a3, 112(sp)
-       vsetvli zero, a2, e8, m1, ta, ma
-       vle8.v  v8, (a1)
-       li      a4, 64
-       addi    a5, a1, 32
+       mv      s1, a0
+       ld      s0, 112(sp)
        sd      a0, 8(sp)
-       sd      a3, 24(sp)
-       vse8.v  v8, (a0)
-       vsetvli zero, a4, e8, m2, ta, ma
-       vle8.v  v8, (a5)
-       addi    a4, a0, 32
-       vse8.v  v8, (a4)
-       li      a4, 128
-       addi    a5, a1, 96
-       addi    a1, a1, 202
-       vsetvli zero, a4, e8, m4, ta, ma
-       vle8.v  v8, (a5)
-       vsetvli zero, a2, e8, m1, ta, ma
-       vle8.v  v12, (a1)
-       addi    a1, a0, 96
-       vsetvli zero, a4, e8, m4, ta, ma
-       vse8.v  v8, (a1)
-       addi    a1, a0, 202
-       add     a0, a0, a3
-       vsetvli zero, a2, e8, m1, ta, ma
-       vse8.v  v12, (a1)
-       sd      a3, 16(sp)
-       sb      zero, 0(a0)
+       sd      s0, 24(sp)
+.Lpcrel_hi3:
+       auipc   a0, %pcrel_hi(.L.str.11)
+       addi    a1, a0, %pcrel_lo(.Lpcrel_hi3)
+       li      a2, 234
+       mv      a0, s1
+       call    memcpy
+       add     s1, s1, s0
+       sd      s0, 16(sp)
+       sb      zero, 0(s1)
```
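
For reference, the pattern behind this diff looks like nothing more exotic than a fixed-size copy from a constant string plus a terminating NUL store. The sketch below is a hypothetical reduced source reconstructed from the assembly (the names `Source`, `copyInto`, and `Len` are made up; the real SPEC code was not inspected):

```cpp
#include <cstring>

// Hypothetical reduced source for the diff above: a 234-byte copy from a
// constant string followed by a NUL store at a variable offset, matching the
// "li a2, 234 / call memcpy / sb zero, 0(...)" side. When expanded inline,
// the copy is split into 32/64/128/32-byte chunks whose loads and stores
// alternate between the e8,m1 / e8,m2 / e8,m4 configurations, so the same
// vtype ends up being set more than once.
extern const char Source[234]; // stands in for .L.str.11

void copyInto(char *Dst, std::size_t Len) {
  std::memcpy(Dst, Source, 234);
  Dst[Len] = '\0';
}
```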

#### c) At least one instance of inefficient offset calculation 
This is from Blender:
```asm
-.Lpcrel_hi12:
-       auipc   a1, %pcrel_hi(.L.str.15)
-       li      a2, 32
-       addi    a1, a1, %pcrel_lo(.Lpcrel_hi12)
-       ld      a3, 40(sp)
-       vsetvli zero, a2, e8, m1, ta, ma
-       vle8.v  v8, (a1)
-       li      a4, 64
-       addi    a5, a1, 32
+       mv      s0, a0
+       ld      s1, 40(sp)
        sd      a0, 144(sp)
-       sd      a3, 160(sp)
-       vse8.v  v8, (a0)
-       vsetvli zero, a4, e8, m2, ta, ma
-       vle8.v  v8, (a5)
-       addi    a4, a0, 32
-       vse8.v  v8, (a4)
-       li      a4, 128
-       addi    a5, a1, 96
-       addi    a1, a1, 221
-       vsetvli zero, a4, e8, m4, ta, ma
-       vle8.v  v8, (a5)
-       vsetvli zero, a2, e8, m1, ta, ma
-       vle8.v  v12, (a1)
-       addi    a1, a0, 96
-       vsetvli zero, a4, e8, m4, ta, ma
-       vse8.v  v8, (a1)
-       addi    a1, a0, 221
-       add     a0, a0, a3
-       vsetvli zero, a2, e8, m1, ta, ma
-       vse8.v  v12, (a1)
-       sd      a3, 152(sp)
-       sb      zero, 0(a0)
+       sd      s1, 160(sp)
+.Lpcrel_hi12:
+       auipc   a0, %pcrel_hi(.L.str.15)
+       addi    a1, a0, %pcrel_lo(.Lpcrel_hi12)
+       li      a2, 253
+       mv      a0, s0
+       call    memcpy
+       add     s0, s0, s1
+       sd      s1, 152(sp)
+       sb      zero, 0(s0)
```

#### d) memcmp expansion not using vectors
For a 64-byte comparison, instead of using vector instructions, we get:
```asm
        ld      a1, 112(a0)
        ld      a0, 104(a0)
        ld      a6, 0(a1)
        ld      a7, 8(a1)
        ld      t0, 16(a1)
        ld      t1, 24(a1)
        ld      t5, 0(a0)
        ld      t4, 8(a0)
        ld      t2, 16(a0)
        ld      t3, 24(a0)
        ld      t6, 32(a1)
        ld      s3, 40(a1)
        ld      s8, 48(a1)
        ld      s9, 56(a1)
        ld      a3, 32(a0)
        ld      s0, 40(a0)
        ld      s1, 48(a0)
        ld      a0, 56(a0)
        xor     a4, a6, t5
        xor     a5, a7, t4
        xor     a2, t0, t2
        xor     a1, t1, t3
        xor     a3, t6, a3
        xor     s0, s3, s0
        xor     s1, s8, s1
        xor     a0, s9, a0
        or      a4, a4, a5
        or      a1, a1, a2
        or      a3, a3, s0
        or      a0, a0, s1
        or      a1, a1, a4
        or      a0, a0, a3
        or      a0, a0, a1
        ld      a1, 104(a0)
        ld      a0, 112(a0)
        li      a2, 64
        call    bcmp
```

However, this should now be addressed by https://github.com/llvm/llvm-project/commit/4903c11a7e14.
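
For context, the load widths the memcmp expansion may use come from the target's `enableMemCmpExpansion` TTI hook together with `MaxLoadsPerMemcmp`. The fragment below is a hedged sketch of where that decision lives; the field names follow `TargetTransformInfo::MemCmpExpansionOptions` as we understand them, the values are illustrative, and it is not a description of what the linked commit actually changed:

```cpp
// Illustrative fragment only: how a target advertises the load sizes that
// the memcmp expansion may use. Scalar-only LoadSizes give the ld/xor/or
// sequence above; vector widths would have to be offered here (and costed
// sensibly) for a 64-byte compare to use vector loads.
TargetTransformInfo::MemCmpExpansionOptions
RISCVTTIImpl::enableMemCmpExpansion(bool OptSize, bool IsZeroCmp) const {
  TargetTransformInfo::MemCmpExpansionOptions Options;
  Options.MaxNumLoads = OptSize ? 4 : 8; // illustrative; in-tree this comes
                                         // from the MaxLoadsPerMemcmp knobs
  Options.LoadSizes = {8, 4, 2, 1};      // scalar widths only (XLEN = 64)
  Options.AllowOverlappingLoads = true;
  return Options;
}
```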

---

The full data from the sweep of expansion parameters is available at https://docs.google.com/spreadsheets/d/1M6x5-SE0zsqebgdAHth8abbw5X3Gm38Z536Pnho6bjE/edit?gid=0#gid=0

Our main takeaway is that, while changing the thresholds has limited impact on the X60 as things stand, there are opportunities to improve the expansion itself, and addressing some of them may make changes to the thresholds more impactful.