On Wed, Dec 6, 2023 at 7:44 PM Philipp Tomsich <philipp.toms...@vrull.eu> wrote: > > On Wed, 6 Dec 2023 at 23:32, Richard Biener <richard.guent...@gmail.com> > wrote: > > > > On Wed, Dec 6, 2023 at 2:48 PM Manos Anagnostakis > > <manos.anagnosta...@vrull.eu> wrote: > > > > > > This is an RTL pass that detects store forwarding from stores to larger > > > loads (load pairs). > > > > > > This optimization is SPEC2017-driven and was found to be beneficial for > > > some benchmarks, > > > through testing on ampere1/ampere1a machines. > > > > > > For example, it can transform cases like > > > > > > str d5, [sp, #320] > > > fmul d5, d31, d29 > > > ldp d31, d17, [sp, #312] # Large load from small store > > > > > > to > > > > > > str d5, [sp, #320] > > > fmul d5, d31, d29 > > > ldr d31, [sp, #312] > > > ldr d17, [sp, #320] > > > > > > Currently, the pass is disabled by default on all architectures and > > > enabled by a target-specific option. > > > > > > If deemed beneficial enough for a default, it will be enabled on > > > ampere1/ampere1a, > > > or other architectures as well, without needing to be turned on by this > > > option. > > > > What is aarch64-specific about the pass? > > > > I see an increasingly large number of target specific passes pop up > > (probably > > for the excuse we can generalize them if necessary). But GCC isn't LLVM > > and this feels like getting out of hand? > > We had an OK from Richard Sandiford on the earlier (v5) version with > v6 just fixing an obvious bug... so I was about to merge this earlier > just when you commented. > > Given that this had months of test exposure on our end, I would prefer > to move this forward for GCC14 in its current form. > The project of replacing architecture-specific store-forwarding passes > with a generalized infrastructure could then be addressed in the GCC15 > timeframe (or beyond)?
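The kind of source that produces the str/ldp sequence in the example resembles the patch's overlap testcase; a minimal standalone sketch (names are illustrative, not from the patch) might be:

```cpp
#include <cstdint>
#include <cassert>

// Illustrative sketch (not from the patch): a narrow store followed by a
// wider read of the same region.  On AArch64 at -O2 the two loads below
// are typically combined into an ldp whose second half overlaps the
// preceding str -- exactly the pattern the pass splits back into two
// ldr instructions to avoid the store-forwarding stall.
std::uint64_t
store_then_wider_load (std::uint64_t *arr, std::uint64_t v)
{
  arr[0] = v;                   // str  x?, [arr]
  std::uint64_t r = arr[0];     // ldp  r, y, [arr]  (forwarded half)
  std::uint64_t y = arr[1];
  return r + y;
}
```

The transformation does not change what the function computes; it only replaces the paired load with two single loads so the hardware can forward the store.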
It's up to target maintainers, I just picked this pass (randomly) to make this comment (of course also knowing that STLF fails are a common issue on pipelined uarchs). Richard. > > --Philipp. > > > > > The x86 backend also has its store-forwarding "pass" as part of mdreorg > > in ix86_split_stlf_stall_load. > > > > Richard. > > > > > Bootstrapped and regtested on aarch64-linux. > > > > > > gcc/ChangeLog: > > > > > > * config.gcc: Add aarch64-store-forwarding.o to extra_objs. > > > * config/aarch64/aarch64-passes.def (INSERT_PASS_AFTER): New pass. > > > * config/aarch64/aarch64-protos.h > > > (make_pass_avoid_store_forwarding): Declare. > > > * config/aarch64/aarch64.opt (mavoid-store-forwarding): New > > > option. > > > (aarch64-store-forwarding-threshold): New param. > > > * config/aarch64/t-aarch64: Add aarch64-store-forwarding.o > > > * doc/invoke.texi: Document new option and new param. > > > * config/aarch64/aarch64-store-forwarding.cc: New file. > > > > > > gcc/testsuite/ChangeLog: > > > > > > * gcc.target/aarch64/ldp_ssll_no_overlap_address.c: New test. > > > * gcc.target/aarch64/ldp_ssll_no_overlap_offset.c: New test. > > > * gcc.target/aarch64/ldp_ssll_overlap.c: New test. > > > > > > Signed-off-by: Manos Anagnostakis <manos.anagnosta...@vrull.eu> > > > Co-Authored-By: Manolis Tsamis <manolis.tsa...@vrull.eu> > > > Co-Authored-By: Philipp Tomsich <philipp.toms...@vrull.eu> > > > --- > > > Changes in v6: > > > - An obvious change. insn_cnt was incremented only on > > > stores and not for every insn in the bb. Now restored. 
> > > > > > gcc/config.gcc | 1 + > > > gcc/config/aarch64/aarch64-passes.def | 1 + > > > gcc/config/aarch64/aarch64-protos.h | 1 + > > > .../aarch64/aarch64-store-forwarding.cc | 318 ++++++++++++++++++ > > > gcc/config/aarch64/aarch64.opt | 9 + > > > gcc/config/aarch64/t-aarch64 | 10 + > > > gcc/doc/invoke.texi | 11 +- > > > .../aarch64/ldp_ssll_no_overlap_address.c | 33 ++ > > > .../aarch64/ldp_ssll_no_overlap_offset.c | 33 ++ > > > .../gcc.target/aarch64/ldp_ssll_overlap.c | 33 ++ > > > 10 files changed, 449 insertions(+), 1 deletion(-) > > > create mode 100644 gcc/config/aarch64/aarch64-store-forwarding.cc > > > create mode 100644 > > > gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c > > > create mode 100644 > > > gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c > > > create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c > > > > > > diff --git a/gcc/config.gcc b/gcc/config.gcc > > > index 6450448f2f0..7c48429eb82 100644 > > > --- a/gcc/config.gcc > > > +++ b/gcc/config.gcc > > > @@ -350,6 +350,7 @@ aarch64*-*-*) > > > cxx_target_objs="aarch64-c.o" > > > d_target_objs="aarch64-d.o" > > > extra_objs="aarch64-builtins.o aarch-common.o > > > aarch64-sve-builtins.o aarch64-sve-builtins-shapes.o > > > aarch64-sve-builtins-base.o aarch64-sve-builtins-sve2.o > > > aarch64-sve-builtins-sme.o cortex-a57-fma-steering.o > > > aarch64-speculation.o falkor-tag-collision-avoidance.o aarch-bti-insert.o > > > aarch64-cc-fusion.o" > > > + extra_objs="${extra_objs} aarch64-store-forwarding.o" > > > target_gtfiles="\$(srcdir)/config/aarch64/aarch64-builtins.cc > > > \$(srcdir)/config/aarch64/aarch64-sve-builtins.h > > > \$(srcdir)/config/aarch64/aarch64-sve-builtins.cc" > > > target_has_targetm_common=yes > > > ;; > > > diff --git a/gcc/config/aarch64/aarch64-passes.def > > > b/gcc/config/aarch64/aarch64-passes.def > > > index 662a13fd5e6..94ced0aebf6 100644 > > > --- a/gcc/config/aarch64/aarch64-passes.def > > > +++ 
b/gcc/config/aarch64/aarch64-passes.def > > > @@ -24,3 +24,4 @@ INSERT_PASS_BEFORE > > > (pass_late_thread_prologue_and_epilogue, 1, pass_switch_pstat > > > INSERT_PASS_AFTER (pass_machine_reorg, 1, pass_tag_collision_avoidance); > > > INSERT_PASS_BEFORE (pass_shorten_branches, 1, pass_insert_bti); > > > INSERT_PASS_AFTER (pass_if_after_combine, 1, pass_cc_fusion); > > > +INSERT_PASS_AFTER (pass_peephole2, 1, pass_avoid_store_forwarding); > > > diff --git a/gcc/config/aarch64/aarch64-protos.h > > > b/gcc/config/aarch64/aarch64-protos.h > > > index 60ff61f6d54..8f5f2ca4710 100644 > > > --- a/gcc/config/aarch64/aarch64-protos.h > > > +++ b/gcc/config/aarch64/aarch64-protos.h > > > @@ -1069,6 +1069,7 @@ rtl_opt_pass *make_pass_tag_collision_avoidance > > > (gcc::context *); > > > rtl_opt_pass *make_pass_insert_bti (gcc::context *ctxt); > > > rtl_opt_pass *make_pass_cc_fusion (gcc::context *ctxt); > > > rtl_opt_pass *make_pass_switch_pstate_sm (gcc::context *ctxt); > > > +rtl_opt_pass *make_pass_avoid_store_forwarding (gcc::context *ctxt); > > > > > > poly_uint64 aarch64_regmode_natural_size (machine_mode); > > > > > > diff --git a/gcc/config/aarch64/aarch64-store-forwarding.cc > > > b/gcc/config/aarch64/aarch64-store-forwarding.cc > > > new file mode 100644 > > > index 00000000000..8a6faefd8c0 > > > --- /dev/null > > > +++ b/gcc/config/aarch64/aarch64-store-forwarding.cc > > > @@ -0,0 +1,318 @@ > > > +/* Avoid store forwarding optimization pass. > > > + Copyright (C) 2023 Free Software Foundation, Inc. > > > + Contributed by VRULL GmbH. > > > + > > > + This file is part of GCC. > > > + > > > + GCC is free software; you can redistribute it and/or modify it > > > + under the terms of the GNU General Public License as published by > > > + the Free Software Foundation; either version 3, or (at your option) > > > + any later version. 
> > > + > > > + GCC is distributed in the hope that it will be useful, but > > > + WITHOUT ANY WARRANTY; without even the implied warranty of > > > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > > > + General Public License for more details. > > > + > > > + You should have received a copy of the GNU General Public License > > > + along with GCC; see the file COPYING3. If not see > > > + <http://www.gnu.org/licenses/>. */ > > > + > > > +#define IN_TARGET_CODE 1 > > > + > > > +#include "config.h" > > > +#define INCLUDE_LIST > > > +#include "system.h" > > > +#include "coretypes.h" > > > +#include "backend.h" > > > +#include "rtl.h" > > > +#include "alias.h" > > > +#include "rtlanal.h" > > > +#include "tree-pass.h" > > > +#include "cselib.h" > > > + > > > +/* This is an RTL pass that detects store forwarding from stores to > > > larger > > > + loads (load pairs). For example, it can transform cases like > > > + > > > + str d5, [sp, #320] > > > + fmul d5, d31, d29 > > > + ldp d31, d17, [sp, #312] # Large load from small store > > > + > > > + to > > > + > > > + str d5, [sp, #320] > > > + fmul d5, d31, d29 > > > + ldr d31, [sp, #312] > > > + ldr d17, [sp, #320] > > > + > > > + Design: The pass follows a straightforward design. It starts by > > > + initializing the alias analysis and the cselib. Both of these are > > > used to > > > + find stores and larger loads with overlapping addresses, which are > > > + candidates for store forwarding optimizations. It then scans on > > > basic block > > > + level to find stores that forward to larger loads and handles them > > > + accordingly as described in the above example. Finally, the alias > > > analysis > > > + and the cselib library are closed. */ > > > + > > > +typedef struct > > > +{ > > > + rtx_insn *store_insn; > > > + rtx store_mem_addr; > > > + unsigned int insn_cnt; > > > +} str_info; > > > + > > > +typedef std::list<str_info> list_store_info; > > > + > > > +/* Statistics counters. 
*/ > > > +static unsigned int stats_store_count = 0; > > > +static unsigned int stats_ldp_count = 0; > > > +static unsigned int stats_ssll_count = 0; > > > +static unsigned int stats_transformed_count = 0; > > > + > > > +/* Default. */ > > > +static rtx dummy; > > > +static bool is_load (rtx expr, rtx &op_1=dummy); > > > + > > > +/* Return true if SET expression EXPR is a store; otherwise false. */ > > > + > > > +static bool > > > +is_store (rtx expr) > > > +{ > > > + return MEM_P (SET_DEST (expr)); > > > +} > > > + > > > +/* Return true if SET expression EXPR is a load; otherwise false. OP_1 > > > will > > > + contain the MEM operand of the load. */ > > > + > > > +static bool > > > +is_load (rtx expr, rtx &op_1) > > > +{ > > > + op_1 = SET_SRC (expr); > > > + > > > + if (GET_CODE (op_1) == ZERO_EXTEND > > > + || GET_CODE (op_1) == SIGN_EXTEND) > > > + op_1 = XEXP (op_1, 0); > > > + > > > + return MEM_P (op_1); > > > +} > > > + > > > +/* Return true if STORE_MEM_ADDR is forwarding to the address of > > > LOAD_MEM; > > > + otherwise false. STORE_MEM_MODE is the mode of the MEM rtx containing > > > + STORE_MEM_ADDR. */ > > > + > > > +static bool > > > +is_forwarding (rtx store_mem_addr, rtx load_mem, machine_mode > > > store_mem_mode) > > > +{ > > > + /* Sometimes we do not have the proper value. */ > > > + if (!CSELIB_VAL_PTR (store_mem_addr)) > > > + return false; > > > + > > > + gcc_checking_assert (MEM_P (load_mem)); > > > + > > > + return rtx_equal_for_cselib_1 (store_mem_addr, > > > + get_addr (XEXP (load_mem, 0)), > > > + store_mem_mode, 0); > > > +} > > > + > > > +/* Return true if INSN is a load pair, preceded by a store forwarding to > > > it; > > > + otherwise false. STORE_EXPRS contains the stores. 
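The `is_load (rtx expr, rtx &op_1 = dummy)` declaration above uses a default reference argument bound to a file-scope dummy object, so callers that only want the predicate can omit the out-parameter. A generic sketch of that idiom (hypothetical names, not GCC code):

```cpp
#include <cassert>

// Sketch of the default-reference-out-parameter idiom used by is_load:
// binding the default to a static dummy lets callers that do not care
// about the result simply drop the argument.
static int dummy;

static bool
parse_digit (char c, int &out = dummy)
{
  if (c < '0' || c > '9')
    return false;
  out = c - '0';   // written either to the caller's variable or to dummy
  return true;
}
```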
*/ > > > + > > > +static bool > > > +is_small_store_to_large_load (list_store_info store_exprs, rtx_insn > > > *insn) > > > +{ > > > + unsigned int load_count = 0; > > > + bool forwarding = false; > > > + rtx expr = PATTERN (insn); > > > + > > > + if (GET_CODE (expr) != PARALLEL > > > + || XVECLEN (expr, 0) != 2) > > > + return false; > > > + > > > + for (int i = 0; i < XVECLEN (expr, 0); i++) > > > + { > > > + rtx op_1; > > > + rtx out_exp = XVECEXP (expr, 0, i); > > > + > > > + if (GET_CODE (out_exp) != SET) > > > + continue; > > > + > > > + if (!is_load (out_exp, op_1)) > > > + continue; > > > + > > > + load_count++; > > > + > > > + for (str_info str : store_exprs) > > > + { > > > + rtx store_insn = str.store_insn; > > > + > > > + if (!is_forwarding (str.store_mem_addr, op_1, > > > + GET_MODE (SET_DEST (PATTERN (store_insn))))) > > > + continue; > > > + > > > + if (dump_file) > > > + { > > > + fprintf (dump_file, > > > + "Store forwarding to PARALLEL with loads:\n"); > > > + fprintf (dump_file, " From: "); > > > + print_rtl_single (dump_file, store_insn); > > > + fprintf (dump_file, " To: "); > > > + print_rtl_single (dump_file, insn); > > > + } > > > + > > > + forwarding = true; > > > + } > > > + } > > > + > > > + if (load_count == 2) > > > + stats_ldp_count++; > > > + > > > + return load_count == 2 && forwarding; > > > +} > > > + > > > +/* Break a load pair into its 2 distinct loads, except if the base source > > > + address to load from is overwritten in the first load. INSN should be > > > the > > > + PARALLEL of the load pair. 
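Leaving the cselib machinery aside, the detection in is_small_store_to_large_load boils down to: the insn is a two-element load pair, and the address of either element equals a tracked store's address. A simplified standalone model (plain byte offsets stand in for cselib values; names are illustrative):

```cpp
#include <cassert>
#include <vector>

// Simplified model of is_small_store_to_large_load: addresses are plain
// offsets from a single base rather than cselib values.  A load pair is
// flagged when either of its two element addresses matches one of the
// tracked store addresses.
bool
small_store_to_large_load (const std::vector<long> &store_addrs,
                           long ldp_elem0, long ldp_elem1)
{
  for (long s : store_addrs)
    if (s == ldp_elem0 || s == ldp_elem1)
      return true;
  return false;
}
```

In the motivating example, the str at [sp, #320] matches the second element of the ldp at [sp, #312] (8-byte elements), which is why the pass iterates over each SET inside the PARALLEL rather than over the pair as a whole.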
*/ > > > + > > > +static void > > > +break_ldp (rtx_insn *insn) > > > +{ > > > + rtx expr = PATTERN (insn); > > > + > > > + gcc_checking_assert (GET_CODE (expr) == PARALLEL && XVECLEN (expr, 0) > > > == 2); > > > + > > > + rtx load_0 = XVECEXP (expr, 0, 0); > > > + rtx load_1 = XVECEXP (expr, 0, 1); > > > + > > > + gcc_checking_assert (is_load (load_0) && is_load (load_1)); > > > + > > > + /* The base address was overwritten in the first load. */ > > > + if (reg_mentioned_p (SET_DEST (load_0), SET_SRC (load_1))) > > > + return; > > > + > > > + emit_insn_before (load_0, insn); > > > + emit_insn_before (load_1, insn); > > > + remove_insn (insn); > > > + > > > + stats_transformed_count++; > > > +} > > > + > > > +static void > > > +scan_and_transform_bb_level () > > > +{ > > > + rtx_insn *insn, *next; > > > + basic_block bb; > > > + FOR_EACH_BB_FN (bb, cfun) > > > + { > > > + list_store_info store_exprs; > > > + unsigned int insn_cnt = 0; > > > + for (insn = BB_HEAD (bb); insn != NEXT_INSN (BB_END (bb)); insn = > > > next) > > > + { > > > + next = NEXT_INSN (insn); > > > > You probably want NONDEBUG here, otherwise insn_cnt will depend > > on -g? > > > > > + /* If we cross a CALL_P insn, clear the list, because the > > > + small-store-to-large-load is unlikely to cause performance > > > + difference. */ > > > + if (CALL_P (insn)) > > > + store_exprs.clear (); > > > + > > > + if (!NONJUMP_INSN_P (insn)) > > > + continue; > > > + > > > + cselib_process_insn (insn); > > > > is it necessary to process each insn with cselib? Only loads & stores I > > guess? > > > > > + rtx expr = single_set (insn); > > > + > > > + /* If a store is encountered, append it to the store_exprs list > > > to > > > + check it later. 
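break_ldp's only bail-out is the clobbered-base hazard: if the first load's destination register feeds the second load's address, splitting would destroy the base before the second ldr reads it. A hedged sketch of that check (plain register numbers stand in for rtx operands):

```cpp
#include <cassert>

// Model of the break_ldp guard: splitting "ldp r0, r1, [base]" into two
// ldr's is only safe when the first destination is not the base register
// the second load still reads; e.g. "ldp x0, x1, [x0]" must stay fused.
bool
ldp_split_is_safe (int dest0, int base_of_load1)
{
  return dest0 != base_of_load1;
}
```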
*/ > > > + if (expr && is_store (expr)) > > > + { > > > + rtx store_mem = SET_DEST (expr); > > > + rtx store_mem_addr = get_addr (XEXP (store_mem, 0)); > > > + machine_mode store_mem_mode = GET_MODE (store_mem); > > > + store_mem_addr = cselib_lookup (store_mem_addr, > > > + store_mem_mode, 1, > > > + store_mem_mode)->val_rtx; > > > + store_exprs.push_back ({ insn, store_mem_addr, insn_cnt }); > > > + stats_store_count++; > > > + } > > > + > > > + /* Check for small-store-to-large-load. */ > > > + if (is_small_store_to_large_load (store_exprs, insn)) > > > + { > > > + stats_ssll_count++; > > > + break_ldp (insn); > > > + } > > > + > > > + /* Pop the first store from the list if its distance crosses > > > the > > > + maximum accepted threshold. The list contains unique values > > > + sorted in ascending order, meaning that only one distance > > > can be > > > + off at a time. */ > > > + if (!store_exprs.empty () > > > + && (insn_cnt - store_exprs.front ().insn_cnt > > > + > (unsigned int) > > > aarch64_store_forwarding_threshold_param)) > > > + store_exprs.pop_front (); > > > + > > > + insn_cnt++; > > > + } > > > + } > > > +} > > > + > > > +static void > > > +execute_avoid_store_forwarding () > > > +{ > > > + init_alias_analysis (); > > > + cselib_init (CSELIB_RECORD_MEMORY | CSELIB_PRESERVE_CONSTANTS); > > > + scan_and_transform_bb_level (); > > > + end_alias_analysis (); > > > + cselib_finish (); > > > + statistics_counter_event (cfun, "Number of stores identified: ", > > > + stats_store_count); > > > + statistics_counter_event (cfun, "Number of load pairs identified: ", > > > + stats_ldp_count); > > > + statistics_counter_event (cfun, > > > + "Number of forwarding cases identified: ", > > > + stats_ssll_count); > > > + statistics_counter_event (cfun, "Number of transformed cases: ", > > > + stats_transformed_count); > > > +} > > > + > > > +const pass_data pass_data_avoid_store_forwarding = > > > +{ > > > + RTL_PASS, /* type. 
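The store list behaves as a sliding window keyed by instruction count: entries are appended in strictly increasing order, so at most one entry can cross the distance threshold per step, which is why a single pop_front per iteration suffices. A minimal standalone model (the threshold matches the param's default of 20; this is not GCC code):

```cpp
#include <assert.h>
#include <deque>

// Model of the pass's distance window: stores are pushed with a strictly
// increasing instruction count; once the oldest entry is more than
// THRESHOLD instructions behind the current position it can no longer
// stall a load pair and is dropped.  Sorted counts mean one pop per
// step is enough.
const unsigned THRESHOLD = 20;  // default of aarch64-store-forwarding-threshold

void
expire_old_store (std::deque<unsigned> &store_cnts, unsigned insn_cnt)
{
  if (!store_cnts.empty () && insn_cnt - store_cnts.front () > THRESHOLD)
    store_cnts.pop_front ();
}
```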
*/ > > > + "avoid_store_forwarding", /* name. */ > > > + OPTGROUP_NONE, /* optinfo_flags. */ > > > + TV_NONE, /* tv_id. */ > > > + 0, /* properties_required. */ > > > + 0, /* properties_provided. */ > > > + 0, /* properties_destroyed. */ > > > + 0, /* todo_flags_start. */ > > > + 0 /* todo_flags_finish. */ > > > +}; > > > + > > > +class pass_avoid_store_forwarding : public rtl_opt_pass > > > +{ > > > +public: > > > + pass_avoid_store_forwarding (gcc::context *ctxt) > > > + : rtl_opt_pass (pass_data_avoid_store_forwarding, ctxt) > > > + {} > > > + > > > + /* opt_pass methods: */ > > > + virtual bool gate (function *) > > > + { > > > + return aarch64_flag_avoid_store_forwarding; > > > + } > > > + > > > + virtual unsigned int execute (function *) > > > + { > > > + execute_avoid_store_forwarding (); > > > + return 0; > > > + } > > > + > > > +}; // class pass_avoid_store_forwarding > > > + > > > +/* Create a new avoid store forwarding pass instance. */ > > > + > > > +rtl_opt_pass * > > > +make_pass_avoid_store_forwarding (gcc::context *ctxt) > > > +{ > > > + return new pass_avoid_store_forwarding (ctxt); > > > +} > > > diff --git a/gcc/config/aarch64/aarch64.opt > > > b/gcc/config/aarch64/aarch64.opt > > > index f5a518202a1..e4498d53b46 100644 > > > --- a/gcc/config/aarch64/aarch64.opt > > > +++ b/gcc/config/aarch64/aarch64.opt > > > @@ -304,6 +304,10 @@ moutline-atomics > > > Target Var(aarch64_flag_outline_atomics) Init(2) Save > > > Generate local calls to out-of-line atomic operations. > > > > > > +mavoid-store-forwarding > > > +Target Bool Var(aarch64_flag_avoid_store_forwarding) Init(0) Optimization > > > +Avoid store forwarding to load pairs. > > > + > > > -param=aarch64-sve-compare-costs= > > > Target Joined UInteger Var(aarch64_sve_compare_costs) Init(1) > > > IntegerRange(0, 1) Param > > > When vectorizing for SVE, consider using unpacked vectors for smaller > > > elements and use the cost model to pick the cheapest approach. 
Also use > > > the cost model to choose between SVE and Advanced SIMD vectorization. > > > @@ -360,3 +364,8 @@ Enum(aarch64_ldp_stp_policy) String(never) > > > Value(AARCH64_LDP_STP_POLICY_NEVER) > > > > > > EnumValue > > > Enum(aarch64_ldp_stp_policy) String(aligned) > > > Value(AARCH64_LDP_STP_POLICY_ALIGNED) > > > + > > > +-param=aarch64-store-forwarding-threshold= > > > +Target Joined UInteger Var(aarch64_store_forwarding_threshold_param) > > > Init(20) Param > > > +Maximum instruction distance allowed between a store and a load pair for > > > this to be > > > +considered a candidate to avoid when using -mavoid-store-forwarding. > > > diff --git a/gcc/config/aarch64/t-aarch64 b/gcc/config/aarch64/t-aarch64 > > > index 0d96ae3d0b2..5676cdd9585 100644 > > > --- a/gcc/config/aarch64/t-aarch64 > > > +++ b/gcc/config/aarch64/t-aarch64 > > > @@ -194,6 +194,16 @@ aarch64-cc-fusion.o: > > > $(srcdir)/config/aarch64/aarch64-cc-fusion.cc \ > > > $(COMPILER) -c $(ALL_COMPILERFLAGS) $(ALL_CPPFLAGS) $(INCLUDES) \ > > > $(srcdir)/config/aarch64/aarch64-cc-fusion.cc > > > > > > +aarch64-store-forwarding.o: \ > > > + $(srcdir)/config/aarch64/aarch64-store-forwarding.cc \ > > > + $(CONFIG_H) $(SYSTEM_H) $(TM_H) $(REGS_H) insn-config.h > > > $(RTL_BASE_H) \ > > > + dominance.h cfg.h cfganal.h $(BASIC_BLOCK_H) $(INSN_ATTR_H) > > > $(RECOG_H) \ > > > + output.h hash-map.h $(DF_H) $(OBSTACK_H) $(TARGET_H) $(RTL_H) \ > > > + $(CONTEXT_H) $(TREE_PASS_H) regrename.h \ > > > + $(srcdir)/config/aarch64/aarch64-protos.h > > > + $(COMPILER) -c $(ALL_COMPILERFLAGS) $(ALL_CPPFLAGS) $(INCLUDES) \ > > > + $(srcdir)/config/aarch64/aarch64-store-forwarding.cc > > > + > > > comma=, > > > MULTILIB_OPTIONS = $(subst $(comma),/, $(patsubst %, mabi=%, $(subst > > > $(comma),$(comma)mabi=,$(TM_MULTILIB_CONFIG)))) > > > MULTILIB_DIRNAMES = $(subst $(comma), ,$(TM_MULTILIB_CONFIG)) > > > diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi > > > index 32f535e1ed4..9bf3a83286a 100644 > > > --- 
a/gcc/doc/invoke.texi > > > +++ b/gcc/doc/invoke.texi > > > @@ -801,7 +801,7 @@ Objective-C and Objective-C++ Dialects}. > > > -moverride=@var{string} -mverbose-cost-dump > > > -mstack-protector-guard=@var{guard} > > > -mstack-protector-guard-reg=@var{sysreg} > > > -mstack-protector-guard-offset=@var{offset} -mtrack-speculation > > > --moutline-atomics } > > > +-moutline-atomics -mavoid-store-forwarding} > > > > > > @emph{Adapteva Epiphany Options} > > > @gccoptlist{-mhalf-reg-file -mprefer-short-insn-regs > > > @@ -16774,6 +16774,11 @@ With @option{--param=aarch64-stp-policy=never}, > > > do not emit stp. > > > With @option{--param=aarch64-stp-policy=aligned}, emit stp only if the > > > source pointer is aligned to at least double the alignment of the type. > > > > > > +@item aarch64-store-forwarding-threshold > > > +Maximum allowed instruction distance between a store and a load pair for > > > +this to be considered a candidate to avoid when using > > > +@option{-mavoid-store-forwarding}. > > > + > > > @item aarch64-loop-vect-issue-rate-niters > > > The tuning for some AArch64 CPUs tries to take both latencies and issue > > > rates into account when deciding whether a loop should be vectorized > > > @@ -20857,6 +20862,10 @@ Generate code which uses only the > > > general-purpose registers. This will prevent > > > the compiler from using floating-point and Advanced SIMD registers but > > > will not > > > impose any restrictions on the assembler. > > > > > > +@item -mavoid-store-forwarding > > > +@itemx -mno-avoid-store-forwarding > > > +Avoid store forwarding to load pairs. > > > + > > > @opindex mlittle-endian > > > @item -mlittle-endian > > > Generate little-endian code. 
This is the default when GCC is configured > > > for an > > > diff --git > > > a/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c > > > b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c > > > new file mode 100644 > > > index 00000000000..b77de6c64b6 > > > --- /dev/null > > > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c > > > @@ -0,0 +1,33 @@ > > > +/* { dg-options "-O2 -mcpu=generic -mavoid-store-forwarding" } */ > > > + > > > +#include <stdint.h> > > > + > > > +typedef int v4si __attribute__ ((vector_size (16))); > > > + > > > +/* Different address, same offset, no overlap */ > > > + > > > +#define LDP_SSLL_NO_OVERLAP_ADDRESS(TYPE) \ > > > +TYPE ldp_ssll_no_overlap_address_##TYPE(TYPE *ld_arr, TYPE *st_arr, TYPE > > > *st_arr_2, TYPE i, TYPE dummy){ \ > > > + TYPE r, y; \ > > > + st_arr[0] = i; \ > > > + ld_arr[0] = dummy; \ > > > + r = st_arr_2[0]; \ > > > + y = st_arr_2[1]; \ > > > + return r + y; \ > > > +} > > > + > > > +LDP_SSLL_NO_OVERLAP_ADDRESS(uint32_t) > > > +LDP_SSLL_NO_OVERLAP_ADDRESS(uint64_t) > > > +LDP_SSLL_NO_OVERLAP_ADDRESS(int32_t) > > > +LDP_SSLL_NO_OVERLAP_ADDRESS(int64_t) > > > +LDP_SSLL_NO_OVERLAP_ADDRESS(int) > > > +LDP_SSLL_NO_OVERLAP_ADDRESS(long) > > > +LDP_SSLL_NO_OVERLAP_ADDRESS(float) > > > +LDP_SSLL_NO_OVERLAP_ADDRESS(double) > > > +LDP_SSLL_NO_OVERLAP_ADDRESS(v4si) > > > + > > > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 3 } } */ > > > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 3 } } */ > > > +/* { dg-final { scan-assembler-times "ldp\ts\[0-9\]+, s\[0-9\]" 1 } } */ > > > +/* { dg-final { scan-assembler-times "ldp\td\[0-9\]+, d\[0-9\]" 1 } } */ > > > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 1 } } */ > > > diff --git > > > a/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c > > > b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c > > > new file mode 100644 > > > index 00000000000..f1b3a66abfd 
> > > --- /dev/null > > > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c > > > @@ -0,0 +1,33 @@ > > > +/* { dg-options "-O2 -mcpu=generic -mavoid-store-forwarding" } */ > > > + > > > +#include <stdint.h> > > > + > > > +typedef int v4si __attribute__ ((vector_size (16))); > > > + > > > +/* Same address, different offset, no overlap */ > > > + > > > +#define LDP_SSLL_NO_OVERLAP_OFFSET(TYPE) \ > > > +TYPE ldp_ssll_no_overlap_offset_##TYPE(TYPE *ld_arr, TYPE *st_arr, TYPE > > > i, TYPE dummy){ \ > > > + TYPE r, y; \ > > > + st_arr[0] = i; \ > > > + ld_arr[0] = dummy; \ > > > + r = st_arr[10]; \ > > > + y = st_arr[11]; \ > > > + return r + y; \ > > > +} > > > + > > > +LDP_SSLL_NO_OVERLAP_OFFSET(uint32_t) > > > +LDP_SSLL_NO_OVERLAP_OFFSET(uint64_t) > > > +LDP_SSLL_NO_OVERLAP_OFFSET(int32_t) > > > +LDP_SSLL_NO_OVERLAP_OFFSET(int64_t) > > > +LDP_SSLL_NO_OVERLAP_OFFSET(int) > > > +LDP_SSLL_NO_OVERLAP_OFFSET(long) > > > +LDP_SSLL_NO_OVERLAP_OFFSET(float) > > > +LDP_SSLL_NO_OVERLAP_OFFSET(double) > > > +LDP_SSLL_NO_OVERLAP_OFFSET(v4si) > > > + > > > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 3 } } */ > > > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 3 } } */ > > > +/* { dg-final { scan-assembler-times "ldp\ts\[0-9\]+, s\[0-9\]" 1 } } */ > > > +/* { dg-final { scan-assembler-times "ldp\td\[0-9\]+, d\[0-9\]" 1 } } */ > > > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 1 } } */ > > > diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c > > > b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c > > > new file mode 100644 > > > index 00000000000..8d5ce5cc87e > > > --- /dev/null > > > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c > > > @@ -0,0 +1,33 @@ > > > +/* { dg-options "-O2 -mcpu=generic -mavoid-store-forwarding" } */ > > > + > > > +#include <stdint.h> > > > + > > > +typedef int v4si __attribute__ ((vector_size (16))); > > > + > > > +/* Same address, same offset, 
overlap */ > > > + > > > +#define LDP_SSLL_OVERLAP(TYPE) \ > > > +TYPE ldp_ssll_overlap_##TYPE(TYPE *ld_arr, TYPE *st_arr, TYPE i, TYPE > > > dummy){ \ > > > + TYPE r, y; \ > > > + st_arr[0] = i; \ > > > + ld_arr[0] = dummy; \ > > > + r = st_arr[0]; \ > > > + y = st_arr[1]; \ > > > + return r + y; \ > > > +} > > > + > > > +LDP_SSLL_OVERLAP(uint32_t) > > > +LDP_SSLL_OVERLAP(uint64_t) > > > +LDP_SSLL_OVERLAP(int32_t) > > > +LDP_SSLL_OVERLAP(int64_t) > > > +LDP_SSLL_OVERLAP(int) > > > +LDP_SSLL_OVERLAP(long) > > > +LDP_SSLL_OVERLAP(float) > > > +LDP_SSLL_OVERLAP(double) > > > +LDP_SSLL_OVERLAP(v4si) > > > + > > > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 0 } } */ > > > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 0 } } */ > > > +/* { dg-final { scan-assembler-times "ldp\ts\[0-9\]+, s\[0-9\]" 0 } } */ > > > +/* { dg-final { scan-assembler-times "ldp\td\[0-9\]+, d\[0-9\]" 0 } } */ > > > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 0 } } */ > > > -- > > > 2.41.0