| Issue | 153402 |
|---|---|
| Summary | [RISCV] Inefficient constant pool access |
| Labels | backend:RISC-V |
| Assignees | |
| Reporter | asb |
This issue documents the problem of inefficient constant pool access on RISC-V. It focuses on `double` values, but it could equally apply to e.g. strings.
In general, constant `double` values are accessed from the constant pool (unless they can be materialised in two instructions). Such a value is emitted in the assembly output like:
```
.LCPI0_0:
.quad 0xbff0000000000000 # double -1
```
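For reference, here is a minimal source-level example that typically produces such an entry (a sketch; `scale` is a hypothetical function, and 0.1 is chosen because it has no cheap two-instruction materialisation):
```
// Hypothetical example: 0.1 cannot be materialised cheaply, so the
// RISC-V backend typically emits it as a constant pool entry and loads it.
double scale(double x) { return x * 0.1; }
```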
When compiling with PIE (default for Linux targets), this will be loaded with a code sequence like:
```
.Lpcrel_hi1:
auipc a2, %pcrel_hi(.LCPI0_0)
fld ft10, %pcrel_lo(.Lpcrel_hi1)(a2)
```
For some workloads (e.g. lbm from SPEC), we see very poor codegen where, after the `auipc`, that value is spilled and later reloaded when the constant pool access actually happens. We can gain some control over the hoisting of these accesses with e.g. `isAsCheapAsAMove`, but for the PIE codepath we start with a `PseudoLLA` that is only later expanded. Potential options for addressing this:
* Re-evaluate how we handle `PseudoLLA` and when it is expanded, with the aim of preventing hoisting of the `auipc`. This isn't completely trivial, as keeping `PseudoLLA` later in the pipeline would have an impact on things like `RISCVMergeBaseOffset`.
* Although it doesn't fix the problem directly, reducing usage of the constant pool where it isn't necessary will reduce the impact of this kind of poor code generation, e.g. being more liberal about materialising an integer and converting it to a double, or adding more optimisations that generate one constant from another as a base.
* Don't access such constants via separate symbols, sidestepping the issue, i.e. access all the constants for a function at offsets from a common base (the source-level sketch below illustrates the idea).
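To make the last option concrete, here is a rough source-level analogy (illustrative only; the prototype works on LLVM IR, and `kFnConsts`/`f` are made-up names): all of a function's pooled constants live in one array, so a single base-address computation can serve every access.
```
// Conceptual illustration of the common-base option: one per-function
// array of constants, accessed at fixed offsets from a single base.
static const double kFnConsts[] = {-1.0, 0.1}; // hypothetical pool
double f(double x) {
  // One base-address computation (auipc/addi) can serve both loads.
  return x * kFnConsts[0] + kFnConsts[1];
}
```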
I prototyped the last option as something of a limit study to see what we could gain, and found that a simple pass actually works very well.
## Promoting constants
There are multiple ways this could be done, but I prototyped a pass that will:
* Iterate over each function in a module.
* Examine all `double` constants used within a function and collect any that would be accessed via the constant pool into a new private global array.
* Replace all uses of those constants with an explicit load from that global array.
* Most commonly the calculation of the pool's base address should happen near the function entry, but I found that explicitly placing it there vs. placing the calculation of the array's address next to each use made no difference in practice, as later passes clean it up appropriately. (A sketch of such a pass follows this list.)
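As a reference point, here's a minimal sketch of what such a pass can look like under the new pass manager. This is not the linked prototype: the check for whether a constant would actually be accessed via the constant pool is elided, PHI operands are skipped for simplicity, and names like `PromoteFPConstantsPass` are mine.
```
#include "llvm/ADT/STLExtras.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/Constants.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/InstIterator.h"
#include "llvm/IR/Instructions.h"
#include "llvm/IR/Module.h"
#include "llvm/IR/PassManager.h"

using namespace llvm;

namespace {
struct PromoteFPConstantsPass : PassInfoMixin<PromoteFPConstantsPass> {
  PreservedAnalyses run(Module &M, ModuleAnalysisManager &) {
    for (Function &F : M) {
      if (F.isDeclaration())
        continue;
      // Collect the distinct double constants used in this function, and
      // remember each (instruction, operand index) use. A real pass would
      // also check that the constant can't be materialised cheaply.
      SmallVector<Constant *, 8> Pool;
      SmallVector<std::pair<Instruction *, unsigned>, 16> Uses;
      for (Instruction &I : instructions(F)) {
        if (isa<PHINode>(I)) // Keep the sketch simple: skip PHI operands.
          continue;
        for (unsigned OpIdx = 0; OpIdx != I.getNumOperands(); ++OpIdx)
          if (auto *CFP = dyn_cast<ConstantFP>(I.getOperand(OpIdx)))
            if (CFP->getType()->isDoubleTy()) {
              if (!is_contained(Pool, CFP))
                Pool.push_back(CFP);
              Uses.push_back({&I, OpIdx});
            }
      }
      if (Pool.empty())
        continue;
      // Emit one private constant global array holding all of them.
      auto *ArrTy =
          ArrayType::get(Type::getDoubleTy(M.getContext()), Pool.size());
      auto *GV = new GlobalVariable(M, ArrTy, /*isConstant=*/true,
                                    GlobalValue::PrivateLinkage,
                                    ConstantArray::get(ArrTy, Pool),
                                    F.getName() + ".fpconsts");
      // Replace each constant operand with a load from the array. The
      // address calculation is placed next to each use; later passes
      // clean this up.
      for (auto &[I, OpIdx] : Uses) {
        Constant *C = cast<Constant>(I->getOperand(OpIdx));
        uint64_t Slot = find(Pool, C) - Pool.begin();
        IRBuilder<> B(I);
        Value *Addr = B.CreateConstInBoundsGEP2_64(ArrTy, GV, 0, Slot);
        I->setOperand(OpIdx, B.CreateLoad(B.getDoubleTy(), Addr));
      }
    }
    return PreservedAnalyses::none();
  }
};
} // namespace
```
Pipeline registration is omitted; the gist linked below is the reference for the actual heuristics.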
For the worst-affected SPEC benchmark (lbm), this results in significantly better codegen in `LBM_performStreamCollideTRT` (where the majority of execution time is spent and many constants are accessed). The rough-cut patch is [here](https://gist.github.com/asb/6a3db97b6498f1c149e530bb2d88dae7). The impact on executed instruction count for SPEC 2017 benchmarks (compiled for rva22u64 at -O3) is:
```
Benchmark           Baseline  WithDoublePromotion  Diff (%)
=============================================================
500.perlbench_r   180485513127         180486905438     0.00%
502.gcc_r         221180482926         221181065104     0.00%
505.mcf_r         131217768764         131217768764     0.00%
508.namd_r        220536219869         220612466820     0.03%
510.parest_r      291735122413         291853988954     0.04%
511.povray_r       30915528669          30990240444     0.24%
519.lbm_r          91960216684          87897183782    -4.42%
520.omnetpp_r     137704535105         138074982921     0.27%
523.xalancbmk_r   283930786269         284431739183     0.18%
525.x264_r        379364891237         379357617460    -0.00%
526.blender_r     660274620672         660527712107     0.04%
531.deepsjeng_r   350830911210         350830911210     0.00%
538.imagick_r     238456376537         238486280612     0.01%
541.leela_r       406267274643         406267274643     0.00%
544.nab_r         397560964084         390704449786    -1.72%
557.xz_r          129480350182         129480350182     0.00%
```
As noted, `lbm` sees the largest impact, with `nab` also positively affected. I looked at the cause of the slightly increased instruction count for povray; it seemed to come down to ever so slightly different register allocation in one function (leading to more stack accesses), but this seemed more like expected variation than a case where the transformation was forcing an obviously "bad" choice.
Notes:
* We lose out on deduplication of constants, but at least for `double` as handled here this seems unlikely to be a big deal (compared to e.g. materialising the constants in the instruction stream, which other targets might do).
* If large constant pools turn out to be common, it may be worth arranging the constants based on frequency of access etc.
* AArch64 has an `AArch64PromoteConstant` pass, but it only targets vectors.
* The pass could alternatively have promoted constants to individual globals and then let global merging handle it. For the sake of testing out the approach, having a simple pass directly implementing the desired transformation seemed most straightforward.
* More optimisations are possible but not implemented so far (or tested for how often they might kick in), e.g. the pass could recognise when a function accesses both a constant and its negation, put just one of those values in the array, and use `fneg` when the negated form is needed (see the helper sketch after these notes).
* In rare cases FP constants can be introduced later in the pipeline (e.g. during SelectionDAG legalisation). If moving forward with this pass, we'd want to look more closely at that.
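For the negation note above, the change at the point where a use is materialised could look something like this (hypothetical helper, not part of the prototype):
```
#include "llvm/IR/GlobalVariable.h"
#include "llvm/IR/IRBuilder.h"

using namespace llvm;

// Hypothetical helper: load pool entry Slot and apply fneg when the
// function actually wanted the negated constant, so C and -C can share
// one array slot.
static Value *loadPoolEntry(IRBuilder<> &B, ArrayType *ArrTy,
                            GlobalVariable *GV, uint64_t Slot, bool Negate) {
  Value *Addr = B.CreateConstInBoundsGEP2_64(ArrTy, GV, 0, Slot);
  Value *V = B.CreateLoad(B.getDoubleTy(), Addr);
  return Negate ? B.CreateFNeg(V) : V;
}
```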
## What's next
The point of this issue was to document my investigations and approaches so far, and to get feedback / experiences from anyone else looking at a similar area. As noted above, I was surprised that the pass-based approach worked so well with no real downside to it. Possibly it makes sense to package that up and ship the improvement while looking at other approaches in parallel.