Re: [PATCH v6] aarch64: New RTL optimization pass avoid-store-forwarding.

Richard Biener Wed, 06 Dec 2023 23:35:48 -0800

On Wed, Dec 6, 2023 at 7:44 PM Philipp Tomsich <philipp.toms...@vrull.eu> wrote:
>
> On Wed, 6 Dec 2023 at 23:32, Richard Biener <richard.guent...@gmail.com> 
> wrote:
> >
> > On Wed, Dec 6, 2023 at 2:48 PM Manos Anagnostakis
> > <manos.anagnosta...@vrull.eu> wrote:
> > >
> > > This is an RTL pass that detects store forwarding from stores to larger 
> > > loads (load pairs).
> > >
> > > This optimization is SPEC2017-driven and was found to be beneficial for 
> > > some benchmarks,
> > > through testing on ampere1/ampere1a machines.
> > >
> > > For example, it can transform cases like
> > >
> > > str  d5, [sp, #320]
> > > fmul d5, d31, d29
> > > ldp  d31, d17, [sp, #312] # Large load from small store
> > >
> > > to
> > >
> > > str  d5, [sp, #320]
> > > fmul d5, d31, d29
> > > ldr  d31, [sp, #312]
> > > ldr  d17, [sp, #320]
> > >
> > > Currently, the pass is disabled by default on all architectures and 
> > > enabled by a target-specific option.
> > >
> > > If deemed beneficial enough for a default, it will be enabled on 
> > > ampere1/ampere1a,
> > > or other architectures as well, without needing to be turned on by this 
> > > option.
> >
> > What is aarch64-specific about the pass?
> >
> > I see an increasingly large number of target specific passes pop up 
> > (probably
> > for the excuse we can generalize them if necessary).  But GCC isn't LLVM
> > and this feels like getting out of hand?
>
> We had an OK from Richard Sandiford on the earlier (v5) version with
> v6 just fixing an obvious bug... so I was about to merge this earlier
> just when you commented.
>
> Given that this had months of test exposure on our end, I would prefer
> to move this forward for GCC14 in its current form.
> The project of replacing architecture-specific store-forwarding passes
> with a generalized infrastructure could then be addressed in the GCC15
> timeframe (or beyond)?


It's up to target maintainers, I just picked this pass (randomly) to make this
comment (of course also knowing that STLF fails are a common issue on
pipelined uarchs).

Richard.

>
> --Philipp.
>
> >
> > The x86 backend also has its store-forwarding "pass" as part of mdreorg
> > in ix86_split_stlf_stall_load.
> >
> > Richard.
> >
> > > Bootstrapped and regtested on aarch64-linux.
> > >
> > > gcc/ChangeLog:
> > >
> > >         * config.gcc: Add aarch64-store-forwarding.o to extra_objs.
> > >         * config/aarch64/aarch64-passes.def (INSERT_PASS_AFTER): New pass.
> > >         * config/aarch64/aarch64-protos.h 
> > > (make_pass_avoid_store_forwarding): Declare.
> > >         * config/aarch64/aarch64.opt (mavoid-store-forwarding): New 
> > > option.
> > >         (aarch64-store-forwarding-threshold): New param.
> > >         * config/aarch64/t-aarch64: Add aarch64-store-forwarding.o
> > >         * doc/invoke.texi: Document new option and new param.
> > >         * config/aarch64/aarch64-store-forwarding.cc: New file.
> > >
> > > gcc/testsuite/ChangeLog:
> > >
> > >         * gcc.target/aarch64/ldp_ssll_no_overlap_address.c: New test.
> > >         * gcc.target/aarch64/ldp_ssll_no_overlap_offset.c: New test.
> > >         * gcc.target/aarch64/ldp_ssll_overlap.c: New test.
> > >
> > > Signed-off-by: Manos Anagnostakis <manos.anagnosta...@vrull.eu>
> > > Co-Authored-By: Manolis Tsamis <manolis.tsa...@vrull.eu>
> > > Co-Authored-By: Philipp Tomsich <philipp.toms...@vrull.eu>
> > > ---
> > > Changes in v6:
> > >         - An obvious change. insn_cnt was incremented only on
> > >           stores and not for every insn in the bb. Now restored.
> > >
> > >  gcc/config.gcc                                |   1 +
> > >  gcc/config/aarch64/aarch64-passes.def         |   1 +
> > >  gcc/config/aarch64/aarch64-protos.h           |   1 +
> > >  .../aarch64/aarch64-store-forwarding.cc       | 318 ++++++++++++++++++
> > >  gcc/config/aarch64/aarch64.opt                |   9 +
> > >  gcc/config/aarch64/t-aarch64                  |  10 +
> > >  gcc/doc/invoke.texi                           |  11 +-
> > >  .../aarch64/ldp_ssll_no_overlap_address.c     |  33 ++
> > >  .../aarch64/ldp_ssll_no_overlap_offset.c      |  33 ++
> > >  .../gcc.target/aarch64/ldp_ssll_overlap.c     |  33 ++
> > >  10 files changed, 449 insertions(+), 1 deletion(-)
> > >  create mode 100644 gcc/config/aarch64/aarch64-store-forwarding.cc
> > >  create mode 100644 
> > > gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c
> > >  create mode 100644 
> > > gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c
> > >  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c
> > >
> > > diff --git a/gcc/config.gcc b/gcc/config.gcc
> > > index 6450448f2f0..7c48429eb82 100644
> > > --- a/gcc/config.gcc
> > > +++ b/gcc/config.gcc
> > > @@ -350,6 +350,7 @@ aarch64*-*-*)
> > >         cxx_target_objs="aarch64-c.o"
> > >         d_target_objs="aarch64-d.o"
> > >         extra_objs="aarch64-builtins.o aarch-common.o 
> > > aarch64-sve-builtins.o aarch64-sve-builtins-shapes.o 
> > > aarch64-sve-builtins-base.o aarch64-sve-builtins-sve2.o 
> > > aarch64-sve-builtins-sme.o cortex-a57-fma-steering.o 
> > > aarch64-speculation.o falkor-tag-collision-avoidance.o aarch-bti-insert.o 
> > > aarch64-cc-fusion.o"
> > > +       extra_objs="${extra_objs} aarch64-store-forwarding.o"
> > >         target_gtfiles="\$(srcdir)/config/aarch64/aarch64-builtins.cc 
> > > \$(srcdir)/config/aarch64/aarch64-sve-builtins.h 
> > > \$(srcdir)/config/aarch64/aarch64-sve-builtins.cc"
> > >         target_has_targetm_common=yes
> > >         ;;
> > > diff --git a/gcc/config/aarch64/aarch64-passes.def 
> > > b/gcc/config/aarch64/aarch64-passes.def
> > > index 662a13fd5e6..94ced0aebf6 100644
> > > --- a/gcc/config/aarch64/aarch64-passes.def
> > > +++ b/gcc/config/aarch64/aarch64-passes.def
> > > @@ -24,3 +24,4 @@ INSERT_PASS_BEFORE 
> > > (pass_late_thread_prologue_and_epilogue, 1, pass_switch_pstat
> > >  INSERT_PASS_AFTER (pass_machine_reorg, 1, pass_tag_collision_avoidance);
> > >  INSERT_PASS_BEFORE (pass_shorten_branches, 1, pass_insert_bti);
> > >  INSERT_PASS_AFTER (pass_if_after_combine, 1, pass_cc_fusion);
> > > +INSERT_PASS_AFTER (pass_peephole2, 1, pass_avoid_store_forwarding);
> > > diff --git a/gcc/config/aarch64/aarch64-protos.h 
> > > b/gcc/config/aarch64/aarch64-protos.h
> > > index 60ff61f6d54..8f5f2ca4710 100644
> > > --- a/gcc/config/aarch64/aarch64-protos.h
> > > +++ b/gcc/config/aarch64/aarch64-protos.h
> > > @@ -1069,6 +1069,7 @@ rtl_opt_pass *make_pass_tag_collision_avoidance 
> > > (gcc::context *);
> > >  rtl_opt_pass *make_pass_insert_bti (gcc::context *ctxt);
> > >  rtl_opt_pass *make_pass_cc_fusion (gcc::context *ctxt);
> > >  rtl_opt_pass *make_pass_switch_pstate_sm (gcc::context *ctxt);
> > > +rtl_opt_pass *make_pass_avoid_store_forwarding (gcc::context *ctxt);
> > >
> > >  poly_uint64 aarch64_regmode_natural_size (machine_mode);
> > >
> > > diff --git a/gcc/config/aarch64/aarch64-store-forwarding.cc 
> > > b/gcc/config/aarch64/aarch64-store-forwarding.cc
> > > new file mode 100644
> > > index 00000000000..8a6faefd8c0
> > > --- /dev/null
> > > +++ b/gcc/config/aarch64/aarch64-store-forwarding.cc
> > > @@ -0,0 +1,318 @@
> > > +/* Avoid store forwarding optimization pass.
> > > +   Copyright (C) 2023 Free Software Foundation, Inc.
> > > +   Contributed by VRULL GmbH.
> > > +
> > > +   This file is part of GCC.
> > > +
> > > +   GCC is free software; you can redistribute it and/or modify it
> > > +   under the terms of the GNU General Public License as published by
> > > +   the Free Software Foundation; either version 3, or (at your option)
> > > +   any later version.
> > > +
> > > +   GCC is distributed in the hope that it will be useful, but
> > > +   WITHOUT ANY WARRANTY; without even the implied warranty of
> > > +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > > +   General Public License for more details.
> > > +
> > > +   You should have received a copy of the GNU General Public License
> > > +   along with GCC; see the file COPYING3.  If not see
> > > +   <http://www.gnu.org/licenses/>.  */
> > > +
> > > +#define IN_TARGET_CODE 1
> > > +
> > > +#include "config.h"
> > > +#define INCLUDE_LIST
> > > +#include "system.h"
> > > +#include "coretypes.h"
> > > +#include "backend.h"
> > > +#include "rtl.h"
> > > +#include "alias.h"
> > > +#include "rtlanal.h"
> > > +#include "tree-pass.h"
> > > +#include "cselib.h"
> > > +
> > > +/* This is an RTL pass that detects store forwarding from stores to 
> > > larger
> > > +   loads (load pairs). For example, it can transform cases like
> > > +
> > > +   str  d5, [sp, #320]
> > > +   fmul d5, d31, d29
> > > +   ldp  d31, d17, [sp, #312] # Large load from small store
> > > +
> > > +   to
> > > +
> > > +   str  d5, [sp, #320]
> > > +   fmul d5, d31, d29
> > > +   ldr  d31, [sp, #312]
> > > +   ldr  d17, [sp, #320]
> > > +
> > > +   Design: The pass follows a straightforward design.  It starts by
> > > +   initializing the alias analysis and the cselib.  Both of these are 
> > > used to
> > > +   find stores and larger loads with overlapping addresses, which are
> > > +   candidates for store forwarding optimizations.  It then scans on 
> > > basic block
> > > +   level to find stores that forward to larger loads and handles them
> > > +   accordingly as described in the above example.  Finally, the alias 
> > > analysis
> > > +   and the cselib library are closed.  */
> > > +
> > > +typedef struct
> > > +{
> > > +  rtx_insn *store_insn;
> > > +  rtx store_mem_addr;
> > > +  unsigned int insn_cnt;
> > > +} str_info;
> > > +
> > > +typedef std::list<str_info> list_store_info;
> > > +
> > > +/* Statistics counters.  */
> > > +static unsigned int stats_store_count = 0;
> > > +static unsigned int stats_ldp_count = 0;
> > > +static unsigned int stats_ssll_count = 0;
> > > +static unsigned int stats_transformed_count = 0;
> > > +
> > > +/* Default.  */
> > > +static rtx dummy;
> > > +static bool is_load (rtx expr, rtx &op_1=dummy);
> > > +
> > > +/* Return true if SET expression EXPR is a store; otherwise false.  */
> > > +
> > > +static bool
> > > +is_store (rtx expr)
> > > +{
> > > +  return MEM_P (SET_DEST (expr));
> > > +}
> > > +
> > > +/* Return true if SET expression EXPR is a load; otherwise false.  OP_1 
> > > will
> > > +   contain the MEM operand of the load.  */
> > > +
> > > +static bool
> > > +is_load (rtx expr, rtx &op_1)
> > > +{
> > > +  op_1 = SET_SRC (expr);
> > > +
> > > +  if (GET_CODE (op_1) == ZERO_EXTEND
> > > +      || GET_CODE (op_1) == SIGN_EXTEND)
> > > +    op_1 = XEXP (op_1, 0);
> > > +
> > > +  return MEM_P (op_1);
> > > +}
> > > +
> > > +/* Return true if STORE_MEM_ADDR is forwarding to the address of 
> > > LOAD_MEM;
> > > +   otherwise false.  STORE_MEM_MODE is the mode of the MEM rtx containing
> > > +   STORE_MEM_ADDR.  */
> > > +
> > > +static bool
> > > +is_forwarding (rtx store_mem_addr, rtx load_mem, machine_mode 
> > > store_mem_mode)
> > > +{
> > > +  /* Sometimes we do not have the proper value.  */
> > > +  if (!CSELIB_VAL_PTR (store_mem_addr))
> > > +    return false;
> > > +
> > > +  gcc_checking_assert (MEM_P (load_mem));
> > > +
> > > +  return rtx_equal_for_cselib_1 (store_mem_addr,
> > > +                                get_addr (XEXP (load_mem, 0)),
> > > +                                store_mem_mode, 0);
> > > +}
> > > +
> > > +/* Return true if INSN is a load pair, preceded by a store forwarding to 
> > > it;
> > > +   otherwise false.  STORE_EXPRS contains the stores.  */
> > > +
> > > +static bool
> > > +is_small_store_to_large_load (list_store_info store_exprs, rtx_insn 
> > > *insn)
> > > +{
> > > +  unsigned int load_count = 0;
> > > +  bool forwarding = false;
> > > +  rtx expr = PATTERN (insn);
> > > +
> > > +  if (GET_CODE (expr) != PARALLEL
> > > +      || XVECLEN (expr, 0) != 2)
> > > +    return false;
> > > +
> > > +  for (int i = 0; i < XVECLEN (expr, 0); i++)
> > > +    {
> > > +      rtx op_1;
> > > +      rtx out_exp = XVECEXP (expr, 0, i);
> > > +
> > > +      if (GET_CODE (out_exp) != SET)
> > > +       continue;
> > > +
> > > +      if (!is_load (out_exp, op_1))
> > > +       continue;
> > > +
> > > +      load_count++;
> > > +
> > > +      for (str_info str : store_exprs)
> > > +       {
> > > +         rtx store_insn = str.store_insn;
> > > +
> > > +         if (!is_forwarding (str.store_mem_addr, op_1,
> > > +                             GET_MODE (SET_DEST (PATTERN (store_insn)))))
> > > +           continue;
> > > +
> > > +         if (dump_file)
> > > +           {
> > > +             fprintf (dump_file,
> > > +                      "Store forwarding to PARALLEL with loads:\n");
> > > +             fprintf (dump_file, "  From: ");
> > > +             print_rtl_single (dump_file, store_insn);
> > > +             fprintf (dump_file, "  To: ");
> > > +             print_rtl_single (dump_file, insn);
> > > +           }
> > > +
> > > +         forwarding = true;
> > > +       }
> > > +    }
> > > +
> > > +  if (load_count == 2)
> > > +    stats_ldp_count++;
> > > +
> > > +  return load_count == 2 && forwarding;
> > > +}
> > > +
> > > +/* Break a load pair into its 2 distinct loads, except if the base source
> > > +   address to load from is overwriten in the first load.  INSN should be 
> > > the
> > > +   PARALLEL of the load pair.  */
> > > +
> > > +static void
> > > +break_ldp (rtx_insn *insn)
> > > +{
> > > +  rtx expr = PATTERN (insn);
> > > +
> > > +  gcc_checking_assert (GET_CODE (expr) == PARALLEL && XVECLEN (expr, 0) 
> > > == 2);
> > > +
> > > +  rtx load_0 = XVECEXP (expr, 0, 0);
> > > +  rtx load_1 = XVECEXP (expr, 0, 1);
> > > +
> > > +  gcc_checking_assert (is_load (load_0) && is_load (load_1));
> > > +
> > > +  /* The base address was overwriten in the first load.  */
> > > +  if (reg_mentioned_p (SET_DEST (load_0), SET_SRC (load_1)))
> > > +    return;
> > > +
> > > +  emit_insn_before (load_0, insn);
> > > +  emit_insn_before (load_1, insn);
> > > +  remove_insn (insn);
> > > +
> > > +  stats_transformed_count++;
> > > +}
> > > +
> > > +static void
> > > +scan_and_transform_bb_level ()
> > > +{
> > > +  rtx_insn *insn, *next;
> > > +  basic_block bb;
> > > +  FOR_EACH_BB_FN (bb, cfun)
> > > +    {
> > > +      list_store_info store_exprs;
> > > +      unsigned int insn_cnt = 0;
> > > +      for (insn = BB_HEAD (bb); insn != NEXT_INSN (BB_END (bb)); insn = 
> > > next)
> > > +       {
> > > +         next = NEXT_INSN (insn);
> >
> > You probably want NONDEBUG here, otherwise insn_cnt will depend
> > on -g?
> >
> > > +         /* If we cross a CALL_P insn, clear the list, because the
> > > +            small-store-to-large-load is unlikely to cause performance
> > > +            difference.  */
> > > +         if (CALL_P (insn))
> > > +           store_exprs.clear ();
> > > +
> > > +         if (!NONJUMP_INSN_P (insn))
> > > +           continue;
> > > +
> > > +         cselib_process_insn (insn);
> >
> > is it necessary to process each insn with cselib?  Only loads & stores I 
> > guess?
> >
> > > +         rtx expr = single_set (insn);
> > > +
> > > +         /* If a store is encountered, append it to the store_exprs list 
> > > to
> > > +            check it later.  */
> > > +         if (expr && is_store (expr))
> > > +           {
> > > +             rtx store_mem = SET_DEST (expr);
> > > +             rtx store_mem_addr = get_addr (XEXP (store_mem, 0));
> > > +             machine_mode store_mem_mode = GET_MODE (store_mem);
> > > +             store_mem_addr = cselib_lookup (store_mem_addr,
> > > +                                             store_mem_mode, 1,
> > > +                                             store_mem_mode)->val_rtx;
> > > +             store_exprs.push_back ({ insn, store_mem_addr, insn_cnt });
> > > +             stats_store_count++;
> > > +           }
> > > +
> > > +         /* Check for small-store-to-large-load.  */
> > > +         if (is_small_store_to_large_load (store_exprs, insn))
> > > +           {
> > > +             stats_ssll_count++;
> > > +             break_ldp (insn);
> > > +           }
> > > +
> > > +         /* Pop the first store from the list if it's distance crosses 
> > > the
> > > +            maximum accepted threshold.  The list contains unique values
> > > +            sorted in ascending order, meaning that only one distance 
> > > can be
> > > +            off at a time.  */
> > > +         if (!store_exprs.empty ()
> > > +             && (insn_cnt - store_exprs.front ().insn_cnt
> > > +                > (unsigned int) 
> > > aarch64_store_forwarding_threshold_param))
> > > +           store_exprs.pop_front ();
> > > +
> > > +         insn_cnt++;
> > > +       }
> > > +    }
> > > +}
> > > +
> > > +static void
> > > +execute_avoid_store_forwarding ()
> > > +{
> > > +  init_alias_analysis ();
> > > +  cselib_init (CSELIB_RECORD_MEMORY | CSELIB_PRESERVE_CONSTANTS);
> > > +  scan_and_transform_bb_level ();
> > > +  end_alias_analysis ();
> > > +  cselib_finish ();
> > > +  statistics_counter_event (cfun, "Number of stores identified: ",
> > > +                           stats_store_count);
> > > +  statistics_counter_event (cfun, "Number of load pairs identified: ",
> > > +                           stats_ldp_count);
> > > +  statistics_counter_event (cfun,
> > > +                           "Number of forwarding cases identified: ",
> > > +                           stats_ssll_count);
> > > +  statistics_counter_event (cfun, "Number of trasformed cases: ",
> > > +                           stats_transformed_count);
> > > +}
> > > +
> > > +const pass_data pass_data_avoid_store_forwarding =
> > > +{
> > > +  RTL_PASS, /* type.  */
> > > +  "avoid_store_forwarding", /* name.  */
> > > +  OPTGROUP_NONE, /* optinfo_flags.  */
> > > +  TV_NONE, /* tv_id.  */
> > > +  0, /* properties_required.  */
> > > +  0, /* properties_provided.  */
> > > +  0, /* properties_destroyed.  */
> > > +  0, /* todo_flags_start.  */
> > > +  0 /* todo_flags_finish.  */
> > > +};
> > > +
> > > +class pass_avoid_store_forwarding : public rtl_opt_pass
> > > +{
> > > +public:
> > > +  pass_avoid_store_forwarding (gcc::context *ctxt)
> > > +    : rtl_opt_pass (pass_data_avoid_store_forwarding, ctxt)
> > > +  {}
> > > +
> > > +  /* opt_pass methods: */
> > > +  virtual bool gate (function *)
> > > +    {
> > > +      return aarch64_flag_avoid_store_forwarding;
> > > +    }
> > > +
> > > +  virtual unsigned int execute (function *)
> > > +    {
> > > +      execute_avoid_store_forwarding ();
> > > +      return 0;
> > > +    }
> > > +
> > > +}; // class pass_avoid_store_forwarding
> > > +
> > > +/* Create a new avoid store forwarding pass instance.  */
> > > +
> > > +rtl_opt_pass *
> > > +make_pass_avoid_store_forwarding (gcc::context *ctxt)
> > > +{
> > > +  return new pass_avoid_store_forwarding (ctxt);
> > > +}
> > > diff --git a/gcc/config/aarch64/aarch64.opt 
> > > b/gcc/config/aarch64/aarch64.opt
> > > index f5a518202a1..e4498d53b46 100644
> > > --- a/gcc/config/aarch64/aarch64.opt
> > > +++ b/gcc/config/aarch64/aarch64.opt
> > > @@ -304,6 +304,10 @@ moutline-atomics
> > >  Target Var(aarch64_flag_outline_atomics) Init(2) Save
> > >  Generate local calls to out-of-line atomic operations.
> > >
> > > +mavoid-store-forwarding
> > > +Target Bool Var(aarch64_flag_avoid_store_forwarding) Init(0) Optimization
> > > +Avoid store forwarding to load pairs.
> > > +
> > >  -param=aarch64-sve-compare-costs=
> > >  Target Joined UInteger Var(aarch64_sve_compare_costs) Init(1) 
> > > IntegerRange(0, 1) Param
> > >  When vectorizing for SVE, consider using unpacked vectors for smaller 
> > > elements and use the cost model to pick the cheapest approach.  Also use 
> > > the cost model to choose between SVE and Advanced SIMD vectorization.
> > > @@ -360,3 +364,8 @@ Enum(aarch64_ldp_stp_policy) String(never) 
> > > Value(AARCH64_LDP_STP_POLICY_NEVER)
> > >
> > >  EnumValue
> > >  Enum(aarch64_ldp_stp_policy) String(aligned) 
> > > Value(AARCH64_LDP_STP_POLICY_ALIGNED)
> > > +
> > > +-param=aarch64-store-forwarding-threshold=
> > > +Target Joined UInteger Var(aarch64_store_forwarding_threshold_param) 
> > > Init(20) Param
> > > +Maximum instruction distance allowed between a store and a load pair for 
> > > this to be
> > > +considered a candidate to avoid when using -mavoid-store-forwarding.
> > > diff --git a/gcc/config/aarch64/t-aarch64 b/gcc/config/aarch64/t-aarch64
> > > index 0d96ae3d0b2..5676cdd9585 100644
> > > --- a/gcc/config/aarch64/t-aarch64
> > > +++ b/gcc/config/aarch64/t-aarch64
> > > @@ -194,6 +194,16 @@ aarch64-cc-fusion.o: 
> > > $(srcdir)/config/aarch64/aarch64-cc-fusion.cc \
> > >         $(COMPILER) -c $(ALL_COMPILERFLAGS) $(ALL_CPPFLAGS) $(INCLUDES) \
> > >                 $(srcdir)/config/aarch64/aarch64-cc-fusion.cc
> > >
> > > +aarch64-store-forwarding.o: \
> > > +    $(srcdir)/config/aarch64/aarch64-store-forwarding.cc \
> > > +    $(CONFIG_H) $(SYSTEM_H) $(TM_H) $(REGS_H) insn-config.h 
> > > $(RTL_BASE_H) \
> > > +    dominance.h cfg.h cfganal.h $(BASIC_BLOCK_H) $(INSN_ATTR_H) 
> > > $(RECOG_H) \
> > > +    output.h hash-map.h $(DF_H) $(OBSTACK_H) $(TARGET_H) $(RTL_H) \
> > > +    $(CONTEXT_H) $(TREE_PASS_H) regrename.h \
> > > +    $(srcdir)/config/aarch64/aarch64-protos.h
> > > +       $(COMPILER) -c $(ALL_COMPILERFLAGS) $(ALL_CPPFLAGS) $(INCLUDES) \
> > > +               $(srcdir)/config/aarch64/aarch64-store-forwarding.cc
> > > +
> > >  comma=,
> > >  MULTILIB_OPTIONS    = $(subst $(comma),/, $(patsubst %, mabi=%, $(subst 
> > > $(comma),$(comma)mabi=,$(TM_MULTILIB_CONFIG))))
> > >  MULTILIB_DIRNAMES   = $(subst $(comma), ,$(TM_MULTILIB_CONFIG))
> > > diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> > > index 32f535e1ed4..9bf3a83286a 100644
> > > --- a/gcc/doc/invoke.texi
> > > +++ b/gcc/doc/invoke.texi
> > > @@ -801,7 +801,7 @@ Objective-C and Objective-C++ Dialects}.
> > >  -moverride=@var{string}  -mverbose-cost-dump
> > >  -mstack-protector-guard=@var{guard} 
> > > -mstack-protector-guard-reg=@var{sysreg}
> > >  -mstack-protector-guard-offset=@var{offset} -mtrack-speculation
> > > --moutline-atomics }
> > > +-moutline-atomics -mavoid-store-forwarding}
> > >
> > >  @emph{Adapteva Epiphany Options}
> > >  @gccoptlist{-mhalf-reg-file  -mprefer-short-insn-regs
> > > @@ -16774,6 +16774,11 @@ With @option{--param=aarch64-stp-policy=never}, 
> > > do not emit stp.
> > >  With @option{--param=aarch64-stp-policy=aligned}, emit stp only if the
> > >  source pointer is aligned to at least double the alignment of the type.
> > >
> > > +@item aarch64-store-forwarding-threshold
> > > +Maximum allowed instruction distance between a store and a load pair for
> > > +this to be considered a candidate to avoid when using
> > > +@option{-mavoid-store-forwarding}.
> > > +
> > >  @item aarch64-loop-vect-issue-rate-niters
> > >  The tuning for some AArch64 CPUs tries to take both latencies and issue
> > >  rates into account when deciding whether a loop should be vectorized
> > > @@ -20857,6 +20862,10 @@ Generate code which uses only the 
> > > general-purpose registers.  This will prevent
> > >  the compiler from using floating-point and Advanced SIMD registers but 
> > > will not
> > >  impose any restrictions on the assembler.
> > >
> > > +@item -mavoid-store-forwarding
> > > +@itemx -mno-avoid-store-forwarding
> > > +Avoid store forwarding to load pairs.
> > > +
> > >  @opindex mlittle-endian
> > >  @item -mlittle-endian
> > >  Generate little-endian code.  This is the default when GCC is configured 
> > > for an
> > > diff --git 
> > > a/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c 
> > > b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c
> > > new file mode 100644
> > > index 00000000000..b77de6c64b6
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c
> > > @@ -0,0 +1,33 @@
> > > +/* { dg-options "-O2 -mcpu=generic -mavoid-store-forwarding" } */
> > > +
> > > +#include <stdint.h>
> > > +
> > > +typedef int v4si __attribute__ ((vector_size (16)));
> > > +
> > > +/* Different address, same offset, no overlap  */
> > > +
> > > +#define LDP_SSLL_NO_OVERLAP_ADDRESS(TYPE) \
> > > +TYPE ldp_ssll_no_overlap_address_##TYPE(TYPE *ld_arr, TYPE *st_arr, TYPE 
> > > *st_arr_2, TYPE i, TYPE dummy){ \
> > > +       TYPE r, y; \
> > > +       st_arr[0] = i; \
> > > +       ld_arr[0] = dummy; \
> > > +       r = st_arr_2[0]; \
> > > +       y = st_arr_2[1]; \
> > > +       return r + y; \
> > > +}
> > > +
> > > +LDP_SSLL_NO_OVERLAP_ADDRESS(uint32_t)
> > > +LDP_SSLL_NO_OVERLAP_ADDRESS(uint64_t)
> > > +LDP_SSLL_NO_OVERLAP_ADDRESS(int32_t)
> > > +LDP_SSLL_NO_OVERLAP_ADDRESS(int64_t)
> > > +LDP_SSLL_NO_OVERLAP_ADDRESS(int)
> > > +LDP_SSLL_NO_OVERLAP_ADDRESS(long)
> > > +LDP_SSLL_NO_OVERLAP_ADDRESS(float)
> > > +LDP_SSLL_NO_OVERLAP_ADDRESS(double)
> > > +LDP_SSLL_NO_OVERLAP_ADDRESS(v4si)
> > > +
> > > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 3 } } */
> > > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 3 } } */
> > > +/* { dg-final { scan-assembler-times "ldp\ts\[0-9\]+, s\[0-9\]" 1 } } */
> > > +/* { dg-final { scan-assembler-times "ldp\td\[0-9\]+, d\[0-9\]" 1 } } */
> > > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 1 } } */
> > > diff --git 
> > > a/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c 
> > > b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c
> > > new file mode 100644
> > > index 00000000000..f1b3a66abfd
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c
> > > @@ -0,0 +1,33 @@
> > > +/* { dg-options "-O2 -mcpu=generic -mavoid-store-forwarding" } */
> > > +
> > > +#include <stdint.h>
> > > +
> > > +typedef int v4si __attribute__ ((vector_size (16)));
> > > +
> > > +/* Same address, different offset, no overlap  */
> > > +
> > > +#define LDP_SSLL_NO_OVERLAP_OFFSET(TYPE) \
> > > +TYPE ldp_ssll_no_overlap_offset_##TYPE(TYPE *ld_arr, TYPE *st_arr, TYPE 
> > > i, TYPE dummy){ \
> > > +       TYPE r, y; \
> > > +       st_arr[0] = i; \
> > > +       ld_arr[0] = dummy; \
> > > +       r = st_arr[10]; \
> > > +       y = st_arr[11]; \
> > > +       return r + y; \
> > > +}
> > > +
> > > +LDP_SSLL_NO_OVERLAP_OFFSET(uint32_t)
> > > +LDP_SSLL_NO_OVERLAP_OFFSET(uint64_t)
> > > +LDP_SSLL_NO_OVERLAP_OFFSET(int32_t)
> > > +LDP_SSLL_NO_OVERLAP_OFFSET(int64_t)
> > > +LDP_SSLL_NO_OVERLAP_OFFSET(int)
> > > +LDP_SSLL_NO_OVERLAP_OFFSET(long)
> > > +LDP_SSLL_NO_OVERLAP_OFFSET(float)
> > > +LDP_SSLL_NO_OVERLAP_OFFSET(double)
> > > +LDP_SSLL_NO_OVERLAP_OFFSET(v4si)
> > > +
> > > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 3 } } */
> > > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 3 } } */
> > > +/* { dg-final { scan-assembler-times "ldp\ts\[0-9\]+, s\[0-9\]" 1 } } */
> > > +/* { dg-final { scan-assembler-times "ldp\td\[0-9\]+, d\[0-9\]" 1 } } */
> > > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 1 } } */
> > > diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c 
> > > b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c
> > > new file mode 100644
> > > index 00000000000..8d5ce5cc87e
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c
> > > @@ -0,0 +1,33 @@
> > > +/* { dg-options "-O2 -mcpu=generic -mavoid-store-forwarding" } */
> > > +
> > > +#include <stdint.h>
> > > +
> > > +typedef int v4si __attribute__ ((vector_size (16)));
> > > +
> > > +/* Same address, same offset, overlap  */
> > > +
> > > +#define LDP_SSLL_OVERLAP(TYPE) \
> > > +TYPE ldp_ssll_overlap_##TYPE(TYPE *ld_arr, TYPE *st_arr, TYPE i, TYPE 
> > > dummy){ \
> > > +       TYPE r, y; \
> > > +       st_arr[0] = i; \
> > > +       ld_arr[0] = dummy; \
> > > +       r = st_arr[0]; \
> > > +       y = st_arr[1]; \
> > > +       return r + y; \
> > > +}
> > > +
> > > +LDP_SSLL_OVERLAP(uint32_t)
> > > +LDP_SSLL_OVERLAP(uint64_t)
> > > +LDP_SSLL_OVERLAP(int32_t)
> > > +LDP_SSLL_OVERLAP(int64_t)
> > > +LDP_SSLL_OVERLAP(int)
> > > +LDP_SSLL_OVERLAP(long)
> > > +LDP_SSLL_OVERLAP(float)
> > > +LDP_SSLL_OVERLAP(double)
> > > +LDP_SSLL_OVERLAP(v4si)
> > > +
> > > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 0 } } */
> > > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 0 } } */
> > > +/* { dg-final { scan-assembler-times "ldp\ts\[0-9\]+, s\[0-9\]" 0 } } */
> > > +/* { dg-final { scan-assembler-times "ldp\td\[0-9\]+, d\[0-9\]" 0 } } */
> > > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 0 } } */
> > > --
> > > 2.41.0

Re: [PATCH v6] aarch64: New RTL optimization pass avoid-store-forwarding.

Reply via email to