On Tue, Oct 17, 2023 at 1:52 PM Alex Coplan <alex.cop...@arm.com> wrote:
>
> This adds a new aarch64-specific RTL-SSA pass dedicated to forming load
> and store pairs (LDPs and STPs).
>
> As a motivating example for the kind of thing this improves, take the
> following testcase:
>
> extern double c[20];
>
> double f(double x)
> {
>   double y = x*x;
>   y += c[16];
>   y += c[17];
>   y += c[18];
>   y += c[19];
>   return y;
> }
>
> for which we currently generate (at -O2):
>
> f:
>         adrp    x0, c
>         add     x0, x0, :lo12:c
>         ldp     d31, d29, [x0, 128]
>         ldr     d30, [x0, 144]
>         fmadd   d0, d0, d0, d31
>         ldr     d31, [x0, 152]
>         fadd    d0, d0, d29
>         fadd    d0, d0, d30
>         fadd    d0, d0, d31
>         ret
>
> but with the pass, we generate:
>
> f:
> .LFB0:
>         adrp    x0, c
>         add     x0, x0, :lo12:c
>         ldp     d31, d29, [x0, 128]
>         fmadd   d0, d0, d0, d31
>         ldp     d30, d31, [x0, 144]
>         fadd    d0, d0, d29
>         fadd    d0, d0, d30
>         fadd    d0, d0, d31
>         ret
>
> The pass is local (it only considers a BB at a time).  In theory, it
> should be possible to extend it to run over EBBs, at least in the case
> of pure (MEM_READONLY_P) loads, but this is left for future work.
>
> The pass works by identifying two kinds of bases: tree decls obtained
> via MEM_EXPR, and RTL register bases in the form of RTL-SSA def_infos.
> If a candidate memory access has a MEM_EXPR base, then we track it via
> this base; otherwise, if it is of a simple reg + <imm> form, we track
> it via the RTL-SSA def_info for the register.
>
> For each BB, for a given kind of base, we build up a hash table mapping
> the base to an access_group.  The access_group data structure holds a
> list of accesses at each offset relative to the same base.  It uses a
> splay tree to support efficient insertion (while walking the bb), and
> the nodes are chained using a linked list to support efficient
> iteration (while doing the transformation).
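Just to check I'm reading the bookkeeping right: the per-base grouping
amounts to something like the following standalone C++ sketch, where a
std::map stands in for the splay tree + intrusive linked list (an ordered
map also gives insert-while-walking and iterate-in-offset-order), and all
of the type names are illustrative rather than the patch's own:

```cpp
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical stand-in for an insn reference; the real pass tracks
// RTL-SSA insn_info objects here.
struct access_record { int insn_uid; };

// Sketch of the access_group idea: all accesses sharing one base,
// keyed by their byte offset from that base.  Iterating the map visits
// accesses in offset order, which is what the pairing walk needs.
struct access_group {
  std::map<int64_t, std::vector<access_record>> by_offset;

  void add (int64_t offset, access_record r)
  {
    // Multiple accesses can share an offset (e.g. a load and a store),
    // hence a list per offset rather than a single slot.
    by_offset[offset].push_back (r);
  }
};
```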
>
> For each base, we then iterate over the access_group to identify
> adjacent accesses, and try to form load/store pairs for those insns
> that access adjacent memory.
>
> The pass is currently run twice, both before and after register
> allocation.  The first copy of the pass is run late in the pre-RA RTL
> pipeline, immediately after sched1, since it was found that sched1 was
> increasing register pressure when the pass was run before it.  The
> second copy of the pass runs immediately before peephole2, so as to get
> any opportunities that the existing ldp/stp peepholes can handle.
>
> There are some cases that we punt on before RA, e.g.
> accesses relative to eliminable regs (such as the soft frame pointer).
> We do this since we can't know the elimination offset before RA, and we
> want to avoid the RA reloading the offset (due to being out of ldp/stp
> immediate range) as this can generate worse code.
>
> The post-RA copy of the pass is there to pick up the crumbs that were
> left behind / things we punted on in the pre-RA pass.  Among other
> things, it's needed to handle accesses relative to the stack pointer
> (see the previous patch in the series for an example).  It can also
> handle code that didn't exist at the time the pre-RA pass was run
> (spill code, prologue/epilogue code).
>
> The following table shows the effect of the passes on code size in
> SPEC CPU 2017 with -Os -flto=auto -mcpu=neoverse-v1:
>
> +-----------------+-------------+--------------+---------+
> | Benchmark       | Pre-RA pass | Post-RA pass | Overall |
> +-----------------+-------------+--------------+---------+
> | 541.leela_r     |       0.04% |       -0.03% |   0.01% |
> | 502.gcc_r       |      -0.07% |       -0.02% |  -0.09% |
> | 510.parest_r    |      -0.06% |       -0.04% |  -0.10% |
> | 505.mcf_r       |      -0.12% |        0.00% |  -0.12% |
> | 500.perlbench_r |      -0.12% |       -0.02% |  -0.15% |
> | 520.omnetpp_r   |      -0.13% |       -0.03% |  -0.16% |
> | 538.imagick_r   |      -0.17% |       -0.02% |  -0.19% |
> | 525.x264_r      |      -0.17% |       -0.02% |  -0.19% |
> | 544.nab_r       |      -0.22% |       -0.01% |  -0.23% |
> | 557.xz_r        |      -0.27% |       -0.01% |  -0.28% |
> | 507.cactuBSSN_r |      -0.26% |       -0.03% |  -0.29% |
> | 526.blender_r   |      -0.37% |       -0.02% |  -0.38% |
> | 523.xalancbmk_r |      -0.41% |       -0.01% |  -0.42% |
> | 531.deepsjeng_r |      -0.41% |       -0.05% |  -0.46% |
> | 511.povray_r    |      -0.60% |       -0.05% |  -0.65% |
> | 548.exchange2_r |      -0.55% |       -0.32% |  -0.86% |
> | 527.cam4_r      |      -0.82% |       -0.16% |  -0.98% |
> | 503.bwaves_r    |      -0.63% |       -0.41% |  -1.04% |
> | 521.wrf_r       |      -1.04% |       -0.06% |  -1.10% |
> | 549.fotonik3d_r |      -0.91% |       -0.35% |  -1.26% |
> | 554.roms_r      |      -1.20% |       -0.20% |  -1.40% |
> | 519.lbm_r       |      -1.91% |        0.00% |  -1.91% |
> | 508.namd_r      |      -2.40% |       -0.07% |  -2.47% |
> +-----------------+-------------+--------------+---------+
> | SPEC CPU 2017   |      -0.51% |       -0.05% |  -0.56% |
> +-----------------+-------------+--------------+---------+
>
> Performance-wise, with SPEC CPU 2017 on Neoverse V1, the pass seems to
> be approximately neutral with LTO enabled (overall geomean regression
> of 0.07%).
> With LTO disabled, I see results as follows (geomean change in
> runtime, %):
>
> +----------+--------+---------+---------+
> | Suite    | Pre-RA | Post-RA | Overall |
> +----------+--------+---------+---------+
> | SPECint  | -0.31% |  -0.28% |  -0.60% |
> | SPECfp   | -0.24% |   0.44% |   0.20% |
> | SPECrate | -0.27% |   0.12% |  -0.15% |
> +----------+--------+---------+---------+
>
> This is an initial implementation, and there are (among other possible
> improvements) the following notable caveats / missing features that are
> left for future work, but could give further improvements:
>
> - Moving accesses between BBs within an EBB, see above.
> - Out-of-range opportunities: currently the pass refuses to form pairs
>   if there isn't a suitable base register with an immediate in range
>   for ldp/stp, but it can be profitable to emit anchor addresses in
>   the case that there are four or more out-of-range nearby accesses
>   that can be formed into pairs.  This is handled by the current
>   ldp/stp peepholes, so it would be good to support it in the future.
> - Auto-increment addressing: this currently isn't supported by the
>   pass, but it would be nice to support it.
>
> There are a couple of params added with the patch.  The first of these
> is --param=aarch64-ldp-alias-check-limit.  This is needed to avoid
> unbounded quadratic behaviour when the pass performs alias checks to
> see if it is safe to re-order loads over stores, or stores over stores.
> A lower value will cause the alias analysis to bail out earlier
> (conservatively assuming an alias conflict, so potentially missing
> opportunities, but saving on compile time).  There is likely some
> tuning work needed for the default value of this param to find a good
> trade-off between optimization opportunities and compile time.
>
> The other param added with this patch switches between two strategies
> for making the modes of the two load/store pair arms agree
> (--param=aarch64-ldp-canonicalize-modes).
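To make sure I follow what the alias-check-limit param bounds: I read it
as a budget on how many intervening insns get queried against the alias
oracle before the pass conservatively gives up.  Roughly like this
standalone sketch (the `insn` struct and `may_alias_p` callback are
placeholders for the real RTL-SSA insns and alias machinery, not the
patch's actual interface):

```cpp
#include <vector>

// Minimal stand-in for an intervening insn between the two accesses we
// would like to fuse.
struct insn { bool is_store; };

// Can a load be moved over the insns in `between`?  Each store costs
// one alias query; once `limit` queries are spent, conservatively
// assume a conflict and refuse, bounding compile time.
bool
safe_to_hoist_load (const std::vector<insn> &between,
                    bool (*may_alias_p) (const insn &), int limit)
{
  int checks = 0;
  for (const insn &i : between)
    {
      if (!i.is_store)
        continue;  // load-over-load reordering needs no alias check
      if (++checks > limit)
        return false;  // budget exhausted: assume an alias conflict
      if (may_alias_p (i))
        return false;  // a real (possible) conflict: give up
    }
  return true;
}
```

With a lower limit the function returns false sooner, which matches the
described trade-off: fewer fusion opportunities, less compile time.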
> unify_access_modes tries to avoid conversions as much as possible, and
> just tries to ensure that the modes agree on VECTOR_MODE_P.
> canonicalize_access_modes, by contrast, is more aggressive in trying
> to pun accesses such that there is a single mode used for a given
> access size.  The unify_access_modes strategy works because, for modes
> of the same size, provided they agree on VECTOR_MODE_P, the existing
> load/store pair patterns allow the modes to differ by using distinct
> mode iterators (see e.g. "load_pair_dw_<DX:mode><DX2:mode>").
> Unconditionally using canonicalize_access_modes would in theory allow
> us to reduce the number of load/store pair patterns, but it currently
> seems to give slightly worse performance than using unify_access_modes.
> This needs some further investigation, but for now it seems reasonable
> to use the unify_access_modes strategy.
>
> Bootstrapped/regtested as a series on aarch64-linux-gnu, OK for trunk?
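If I've understood the unify_access_modes criterion correctly, the check
is essentially: two same-size accesses can form a pair without any mode
conversion as long as both are vector or both are scalar.  A toy
standalone sketch of that predicate, with a hypothetical mode enum
standing in for GCC's machine modes (these are not the real machmode
types or sizes):

```cpp
// Illustrative 8-byte modes: two scalar (DI, DF), two vector
// (V2SI, V8QI).  In GCC proper these would be machine_mode values.
enum mode_kind { DI, DF, V2SI, V8QI };

// Stand-in for VECTOR_MODE_P.
static bool
is_vector_mode (mode_kind m) { return m == V2SI || m == V8QI; }

// Stand-in for GET_MODE_SIZE; all four example modes are 8 bytes.
static int
mode_size (mode_kind) { return 8; }

// The unify_access_modes-style test: same size, and agreement on
// vector-ness, is enough for the pair patterns' distinct mode
// iterators to accept the two arms as-is.
bool
pairable_without_conversion (mode_kind a, mode_kind b)
{
  return mode_size (a) == mode_size (b)
         && is_vector_mode (a) == is_vector_mode (b);
}
```

So e.g. a DI load and a DF load could pair directly, while pairing DI
with V2SI would need the more aggressive canonicalize_access_modes
punning.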
This was something I always wanted to look into implementing but never
got around to.  So thanks for implementing this.

>
> Thanks,
> Alex
>
> gcc/ChangeLog:
>
>         * config.gcc: Add aarch64-ldp-fusion.o to extra_objs for
>         aarch64; add aarch64-ldp-fusion.cc to target_gtfiles.
>         * config/aarch64/aarch64-passes.def: Add copies of
>         pass_ldp_fusion before and after RA.
>         * config/aarch64/aarch64-protos.h (make_pass_ldp_fusion):
>         Declare.
>         * config/aarch64/aarch64.opt (-mearly-ldp-fusion): New.
>         (-mlate-ldp-fusion): New.
>         (--param=aarch64-ldp-alias-check-limit): New.
>         (--param=aarch64-ldp-canonicalize-modes): New.
>         * config/aarch64/t-aarch64: Add rule for aarch64-ldp-fusion.o.
>         * config/aarch64/aarch64-ldp-fusion.cc: New file.
> ---
>  gcc/config.gcc                           |    4 +-
>  gcc/config/aarch64/aarch64-ldp-fusion.cc | 2378 ++++++++++++++++++++++
>  gcc/config/aarch64/aarch64-passes.def    |    2 +
>  gcc/config/aarch64/aarch64-protos.h      |    1 +
>  gcc/config/aarch64/aarch64.opt           |   20 +
>  gcc/config/aarch64/t-aarch64             |    7 +
>  6 files changed, 2410 insertions(+), 2 deletions(-)
>  create mode 100644 gcc/config/aarch64/aarch64-ldp-fusion.cc
>