On Tue, Oct 17, 2023 at 1:52 PM Alex Coplan <alex.cop...@arm.com> wrote:
>
> This adds a new aarch64-specific RTL-SSA pass dedicated to forming load
> and store pairs (LDPs and STPs).
>
> As a motivating example for the kind of thing this improves, take the
> following testcase:
>
> extern double c[20];
>
> double f(double x)
> {
>   double y = x*x;
>   y += c[16];
>   y += c[17];
>   y += c[18];
>   y += c[19];
>   return y;
> }
>
> for which we currently generate (at -O2):
>
> f:
>         adrp    x0, c
>         add     x0, x0, :lo12:c
>         ldp     d31, d29, [x0, 128]
>         ldr     d30, [x0, 144]
>         fmadd   d0, d0, d0, d31
>         ldr     d31, [x0, 152]
>         fadd    d0, d0, d29
>         fadd    d0, d0, d30
>         fadd    d0, d0, d31
>         ret
>
> but with the pass, we generate:
>
> f:
> .LFB0:
>         adrp    x0, c
>         add     x0, x0, :lo12:c
>         ldp     d31, d29, [x0, 128]
>         fmadd   d0, d0, d0, d31
>         ldp     d30, d31, [x0, 144]
>         fadd    d0, d0, d29
>         fadd    d0, d0, d30
>         fadd    d0, d0, d31
>         ret
>
> The pass is local (only considers a BB at a time).  In theory, it should
> be possible to extend it to run over EBBs, at least in the case of pure
> (MEM_READONLY_P) loads, but this is left for future work.
>
> The pass works by identifying two kinds of bases: tree decls obtained
> via MEM_EXPR, and RTL register bases in the form of RTL-SSA def_infos.
> If a candidate memory access has a MEM_EXPR base, then we track it via
> this base, and otherwise if it is of a simple reg + <imm> form, we track
> it via the RTL-SSA def_info for the register.
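>
> For illustration only, here's a much-simplified, self-contained sketch
> of that classification (opaque stand-in types rather than the real
> tree / RTL-SSA classes; none of these names are from the patch):
>
> #include <variant>
> #include <cstdint>
>
> struct decl_base { const void *decl; };  // stands in for a tree decl (MEM_EXPR)
> struct reg_base  { const void *def; };   // stands in for an RTL-SSA def_info *
> using access_base = std::variant<decl_base, reg_base>;
>
> struct candidate_mem
> {
>   const void *mem_expr;  // decl behind MEM_EXPR, null if there isn't one
>   const void *base_def;  // def of the base register, for reg + <imm> addresses
>   int64_t offset;        // constant offset from the base
> };
>
> // Pick the key we group the access under: prefer the MEM_EXPR decl,
> // otherwise fall back to the definition of the base register.
> static access_base
> classify_base (const candidate_mem &m)
> {
>   if (m.mem_expr)
>     return decl_base { m.mem_expr };
>   return reg_base { m.base_def };
> }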
>
> For each BB, for a given kind of base, we build up a hash table mapping
> the base to an access_group.  The access_group data structure holds a
> list of accesses at each offset relative to the same base.  It uses a
> splay tree to support efficient insertion (while walking the bb), and
> the nodes are chained using a linked list to support efficient
> iteration (while doing the transformation).
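>
> In equally simplified terms, the shape of the data is roughly as below
> (std::map stands in for the splay tree, and the chained-node list is
> elided; again these are illustrative stand-ins, not the patch's types):
>
> #include <map>
> #include <unordered_map>
> #include <vector>
> #include <cstdint>
>
> struct access_record
> {
>   const void *insn;  // stand-in for an rtl_ssa::insn_info *
>   bool is_load;
> };
>
> // All accesses in the current BB that share one base, bucketed by
> // offset and kept in offset order so adjacent accesses are neighbours.
> struct access_group
> {
>   std::map<int64_t, std::vector<access_record>> by_offset;
>
>   void add (int64_t offset, access_record rec)
>   {
>     by_offset[offset].push_back (rec);
>   }
> };
>
> // One hash table per kind of base; an opaque base pointer stands in
> // for either a tree decl or a def_info here.
> using base_table = std::unordered_map<const void *, access_group>;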
>
> For each base, we then iterate over the access_group to identify
> adjacent accesses, and try to form load/store pairs for those insns that
> access adjacent memory.
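>
> Continuing the sketch above, the pairing walk is conceptually just a
> scan over neighbouring offsets (try_fuse_pair here is a placeholder for
> the real transformation, which also has to check dataflow and aliasing):
>
> #include <iterator>
>
> static void
> form_pairs (access_group &group, int64_t access_size)
> {
>   auto it = group.by_offset.begin ();
>   while (it != group.by_offset.end ())
>     {
>       auto next = std::next (it);
>       if (next != group.by_offset.end ()
>           && next->first - it->first == access_size
>           && !it->second.empty ()
>           && !next->second.empty ()
>           && it->second.front ().is_load == next->second.front ().is_load)
>         {
>           // Two same-kind accesses at adjacent offsets: ldp/stp candidate.
>           // try_fuse_pair (it->second.front (), next->second.front ());
>           it = std::next (next);  // consume both offsets
>           continue;
>         }
>       ++it;
>     }
> }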
>
> The pass is currently run twice, both before and after register
> allocation.  The first copy of the pass is run late in the pre-RA RTL
> pipeline, immediately after sched1, since it was found that sched1
> increased register pressure when the pass was run before it.  The second
> copy of the pass runs immediately before peephole2, so as to get any
> opportunities that the existing ldp/stp peepholes can handle.
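>
> In aarch64-passes.def terms the two insertion points would look roughly
> like the following; the anchor passes here are just my shorthand for
> the placement described above, so see the patch itself for the exact
> entries:
>
> /* Pre-RA copy, immediately after sched1.  */
> INSERT_PASS_AFTER (pass_sched, 1, pass_ldp_fusion);
> /* Post-RA copy, immediately before peephole2.  */
> INSERT_PASS_BEFORE (pass_peephole2, 1, pass_ldp_fusion);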
>
> There are some cases that we punt on before RA, e.g.
> accesses relative to eliminable regs (such as the soft frame pointer).
> We do this since we can't know the elimination offset before RA, and we
> want to avoid the RA reloading the offset (due to being out of ldp/stp
> immediate range) as this can generate worse code.
>
> The post-RA copy of the pass is there to pick up the crumbs that were
> left behind / things we punted on in the pre-RA pass.  Among other
> things, it's needed to handle accesses relative to the stack pointer
> (see the previous patch in the series for an example).  It can also
> handle code that didn't exist at the time the pre-RA pass was run (spill
> code, prologue/epilogue code).
>
> The following table shows the effect of the passes on code size in
> SPEC CPU 2017 with -Os -flto=auto -mcpu=neoverse-v1:
>
> +-----------------+-------------+--------------+---------+
> |    Benchmark    | Pre-RA pass | Post-RA pass | Overall |
> +-----------------+-------------+--------------+---------+
> | 541.leela_r     | 0.04%       | -0.03%       | 0.01%   |
> | 502.gcc_r       | -0.07%      | -0.02%       | -0.09%  |
> | 510.parest_r    | -0.06%      | -0.04%       | -0.10%  |
> | 505.mcf_r       | -0.12%      | 0.00%        | -0.12%  |
> | 500.perlbench_r | -0.12%      | -0.02%       | -0.15%  |
> | 520.omnetpp_r   | -0.13%      | -0.03%       | -0.16%  |
> | 538.imagick_r   | -0.17%      | -0.02%       | -0.19%  |
> | 525.x264_r      | -0.17%      | -0.02%       | -0.19%  |
> | 544.nab_r       | -0.22%      | -0.01%       | -0.23%  |
> | 557.xz_r        | -0.27%      | -0.01%       | -0.28%  |
> | 507.cactuBSSN_r | -0.26%      | -0.03%       | -0.29%  |
> | 526.blender_r   | -0.37%      | -0.02%       | -0.38%  |
> | 523.xalancbmk_r | -0.41%      | -0.01%       | -0.42%  |
> | 531.deepsjeng_r | -0.41%      | -0.05%       | -0.46%  |
> | 511.povray_r    | -0.60%      | -0.05%       | -0.65%  |
> | 548.exchange2_r | -0.55%      | -0.32%       | -0.86%  |
> | 527.cam4_r      | -0.82%      | -0.16%       | -0.98%  |
> | 503.bwaves_r    | -0.63%      | -0.41%       | -1.04%  |
> | 521.wrf_r       | -1.04%      | -0.06%       | -1.10%  |
> | 549.fotonik3d_r | -0.91%      | -0.35%       | -1.26%  |
> | 554.roms_r      | -1.20%      | -0.20%       | -1.40%  |
> | 519.lbm_r       | -1.91%      | 0.00%        | -1.91%  |
> | 508.namd_r      | -2.40%      | -0.07%       | -2.47%  |
> +-----------------+-------------+--------------+---------+
> | SPEC CPU 2017   | -0.51%      | -0.05%       | -0.56%  |
> +-----------------+-------------+--------------+---------+
>
> Performance-wise, with SPEC CPU 2017 on Neoverse V1, the pass seems to be
> approximately neutral with LTO enabled (overall geomean regression of
> 0.07%).  With LTO disabled, I see results as follows (geomean change in
> runtime, %):
>
> +----------+--------+---------+---------+
> |  Suite   | Pre-RA | Post-RA | Overall |
> +----------+--------+---------+---------+
> | SPECint  | -0.31% | -0.28%  | -0.60%  |
> | SPECfp   | -0.24% | 0.44%   | 0.20%   |
> | SPECrate | -0.27% | 0.12%   | -0.15%  |
> +----------+--------+---------+---------+
>
> This is an initial implementation; the following notable caveats /
> missing features (among other possible improvements) are left for
> future work and could give further gains:
>
>  - Moving accesses between BBs within an EBB, see above.
>  - Out-of-range opportunities: currently the pass refuses to form pairs
>    if there isn't a suitable base register with an immediate in range
>    for ldp/stp, but it can be profitable to emit anchor addresses in the
>    case that there are four or more out-of-range nearby accesses that can
>    be formed into pairs.  This is handled by the current ldp/stp
>    peepholes, so it would be good to support this in the future.
>  - Auto-increment addressing: this currently isn't supported by the
>    pass, but it would be nice to add.
>
> The patch adds a couple of params.  The first of these is
> --param=aarch64-ldp-alias-check-limit, which is needed to avoid
> unbounded quadratic behaviour when the pass performs alias checks to see
> if it is safe to re-order loads over stores, or stores over stores.  A
> lower value will cause the alias analysis to bail out earlier
> (conservatively assuming an alias conflict, so potentially missing
> opportunities, but saving on compile time).  There is likely some tuning
> work needed for the default value of this param to find a good trade-off
> between optimization opportunities and compile time.
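>
> To make the behaviour concrete, a simplified, self-contained sketch of
> what the limit guards is below; may_alias stands in for the real RTL
> alias oracle, and none of this is lifted from the patch:
>
> #include <vector>
>
> // Return true if it is safe to move the candidate access past all the
> // intervening memory accesses.  Each intervening access costs one
> // alias check; once the budget is exhausted we conservatively punt.
> static bool
> safe_to_move_past (const std::vector<const void *> &intervening,
>                    const void *candidate,
>                    unsigned alias_check_limit,
>                    bool (*may_alias) (const void *, const void *))
> {
>   unsigned budget = alias_check_limit;
>   for (const void *other : intervening)
>     {
>       if (budget == 0)
>         return false;  // out of budget: assume a conflict
>       --budget;
>       if (may_alias (candidate, other))
>         return false;  // possible overlap: can't reorder
>     }
>   return true;
> }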
>
> The other param added with this patch switches between two strategies
> for making the modes of the two load/store pair arms agree
> (--param=aarch64-ldp-canonicalize-modes).  unify_access_modes avoids
> conversions as far as possible, only ensuring that the two modes agree
> on VECTOR_MODE_P.  canonicalize_access_modes, by contrast, is
> more aggressive in trying to pun accesses such that there is a single
> mode used for a given access size.  The unify_access_modes strategy
> works because for modes of the same size, provided they agree on
> VECTOR_MODE_P, the existing load/store pair patterns allow the modes to
> differ by using distinct mode iterators (see e.g.
> "load_pair_dw_<DX:mode><DX2:mode>").  Unconditionally using
> canonicalize_access_modes would in theory allow us to reduce the number
> of load/store pair patterns, but it currently seems to give slightly
> worse performance than using unify_access_modes.  This needs some further
> investigation, but for now it seems reasonable to use the
> unify_access_modes strategy.
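>
> To show the difference between the two strategies in miniature (a toy
> enum stands in for real machine modes; the helper names below are
> illustrative, not the patch's internals):
>
> enum class mode_kind { scalar_int, scalar_fp, vector };
>
> // unify_access_modes-style: leave the two modes alone as long as they
> // agree on "is a vector", relying on the pair patterns' independent
> // mode iterators to accept the mix.
> static bool
> modes_compatible_for_pair (mode_kind m1, mode_kind m2)
> {
>   return (m1 == mode_kind::vector) == (m2 == mode_kind::vector);
> }
>
> // canonicalize_access_modes-style: pick one representative mode per
> // access size and pun both accesses to it.
> static mode_kind
> canonical_mode_for_size (bool is_vector)
> {
>   return is_vector ? mode_kind::vector : mode_kind::scalar_int;
> }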
>
> Bootstrapped/regtested as a series on aarch64-linux-gnu, OK for trunk?

This was something I always wanted to look into implementing but never
got around to. So thanks for implementing this.

>
> Thanks,
> Alex
>
> gcc/ChangeLog:
>
>         * config.gcc: Add aarch64-ldp-fusion.o to extra_objs for aarch64; add
>         aarch64-ldp-fusion.cc to target_gtfiles.
>         * config/aarch64/aarch64-passes.def: Add copies of pass_ldp_fusion
>         before and after RA.
>         * config/aarch64/aarch64-protos.h (make_pass_ldp_fusion): Declare.
>         * config/aarch64/aarch64.opt (-mearly-ldp-fusion): New.
>         (-mlate-ldp-fusion): New.
>         (--param=aarch64-ldp-alias-check-limit): New.
>         (--param=aarch64-ldp-canonicalize-modes): New.
>         * config/aarch64/t-aarch64: Add rule for aarch64-ldp-fusion.o.
>         * config/aarch64/aarch64-ldp-fusion.cc: New file.
> ---
>  gcc/config.gcc                           |    4 +-
>  gcc/config/aarch64/aarch64-ldp-fusion.cc | 2378 ++++++++++++++++++++++
>  gcc/config/aarch64/aarch64-passes.def    |    2 +
>  gcc/config/aarch64/aarch64-protos.h      |    1 +
>  gcc/config/aarch64/aarch64.opt           |   20 +
>  gcc/config/aarch64/t-aarch64             |    7 +
>  6 files changed, 2410 insertions(+), 2 deletions(-)
>  create mode 100644 gcc/config/aarch64/aarch64-ldp-fusion.cc
>
