Thanks for the review. I've posted a v2 which addresses this feedback here:
https://gcc.gnu.org/pipermail/gcc-patches/2023-December/639361.html

On 21/11/2023 16:04, Richard Sandiford wrote:
> Alex Coplan <alex.cop...@arm.com> writes:
> > This patch overhauls the load/store pair patterns with two main goals:
> >
> > 1. Fixing a correctness issue (the current patterns are not RA-friendly).
> > 2. Allowing more flexibility in which operand modes are supported, and which
> >    combinations of modes are allowed in the two arms of the load/store pair,
> >    while reducing the number of patterns required both in the source and in
> >    the generated code.
> >
> > The correctness issue (1) is due to the fact that the current patterns have
> > two independent memory operands tied together only by a predicate on the insns.
> > Since LRA only looks at the constraints, one of the memory operands can get
> > reloaded without the other one being changed, leading to the insn becoming
> > unrecognizable after reload.
> >
> > We fix this issue by changing the patterns such that they only ever have one
> > memory operand representing the entire pair.  For the store case, we use an
> > unspec to logically concatenate the register operands before storing them.
> > For the load case, we use unspecs to extract the "lanes" from the pair mem,
> > with the second occurrence of the mem matched using a match_dup (such that there
> > is still really only one memory operand as far as the RA is concerned).
> >
> > In terms of the modes used for the pair memory operands, we canonicalize
> > these to V2x4QImode, V2x8QImode, and V2x16QImode.  These modes have not
> > only the correct size but also correct alignment requirement for a
> > memory operand representing an entire load/store pair.  Unlike the other
> > two, V2x4QImode didn't previously exist, so had to be added with the
> > patch.
> >
> > As with the previous patch generalizing the writeback patterns, this
> > patch aims to be flexible in the combinations of modes supported by the
> > patterns without requiring the large number of generated patterns that
> > using distinct mode iterators would entail.
> >
> > The new scheme means we only need a single (generated) pattern for each
> > load/store operation of a given operand size.  For the 4-byte and 8-byte
> > operand cases, we use the GPI iterator to synthesize the two patterns.
> > The 16-byte case is implemented as a separate pattern in the source (due
> > to only having a single possible alternative).
> >
> > Since the UNSPEC patterns can't be interpreted by the dwarf2cfi code,
> > we add REG_CFA_OFFSET notes to the store pair insns emitted by
> > aarch64_save_callee_saves, so that correct CFI information can still be
> > generated.  Furthermore, we now unconditionally generate these CFA
> > notes on frame-related insns emitted by aarch64_save_callee_saves.
> > This is done in case the load/store pair pass later forms these insns
> > into pairs, in which case the CFA notes are needed.
> >
> > We also adjust the ldp/stp peepholes to generate the new form.  This is
> > done by switching the generation to use the
> > aarch64_gen_{load,store}_pair interface, making it easier to change the
> > form in the future if needed.  (Likewise, the upcoming aarch64
> > load/store pair pass also makes use of this interface).
> >
> > This patch also adds an "ldpstp" attribute to the non-writeback
> > load/store pair patterns, which is used by the post-RA load/store pair
> > pass to identify existing patterns and see if they can be promoted to
> > writeback variants.
> >
> > One potential concern with using unspecs for the patterns is that it can block
> > optimization by the generic RTL passes.  This patch series tries to mitigate
> > this in two ways:
> >  1. The pre-RA load/store pair pass runs very late in the pre-RA pipeline.
> >  2. A later patch in the series adjusts the aarch64 mem{cpy,set} expansion to
> >     emit individual loads/stores instead of ldp/stp.  These should then be
> >     formed back into load/store pairs much later in the RTL pipeline by the
> >     new load/store pair pass.
> >
> > Bootstrapped/regtested on aarch64-linux-gnu, OK for trunk?
> >
> > Thanks,
> > Alex
> >
> > gcc/ChangeLog:
> >
> >     * config/aarch64/aarch64-ldpstp.md: Abstract ldp/stp
> >     representation from peepholes, allowing use of new form.
> >     * config/aarch64/aarch64-modes.def (V2x4QImode): Define.
> >     * config/aarch64/aarch64-protos.h
> >     (aarch64_finish_ldpstp_peephole): Declare.
> >     (aarch64_swap_ldrstr_operands): Delete declaration.
> >     (aarch64_gen_load_pair): Declare.
> >     (aarch64_gen_store_pair): Declare.
> >     * config/aarch64/aarch64-simd.md (load_pair<DREG:mode><DREG2:mode>):
> >     Delete.
> >     (vec_store_pair<DREG:mode><DREG2:mode>): Delete.
> >     (load_pair<VQ:mode><VQ2:mode>): Delete.
> >     (vec_store_pair<VQ:mode><VQ2:mode>): Delete.
> >     * config/aarch64/aarch64.cc (aarch64_pair_mode_for_mode): New.
> >     (aarch64_gen_store_pair): Adjust to use new unspec form of stp.
> >     Drop second mem from parameters.
> >     (aarch64_gen_load_pair): Likewise.
> >     (aarch64_pair_mem_from_base): New.
> >     (aarch64_save_callee_saves): Emit REG_CFA_OFFSET notes for
> >     frame-related saves.  Adjust call to aarch64_gen_store_pair.
> >     (aarch64_restore_callee_saves): Adjust calls to
> >     aarch64_gen_load_pair to account for change in interface.
> >     (aarch64_process_components): Likewise.
> >     (aarch64_classify_address): Handle 32-byte pair mems in
> >     LDP_STP_N case.
> >     (aarch64_print_operand): Likewise.
> >     (aarch64_copy_one_block_and_progress_pointers): Adjust calls to
> >     account for change in aarch64_gen_{load,store}_pair interface.
> >     (aarch64_set_one_block_and_progress_pointer): Likewise.
> >     (aarch64_finish_ldpstp_peephole): New.
> >     (aarch64_gen_adjusted_ldpstp): Adjust to use generation helper.
> >     * config/aarch64/aarch64.md (ldpstp): New attribute.
> >     (load_pair_sw_<SX:mode><SX2:mode>): Delete.
> >     (load_pair_dw_<DX:mode><DX2:mode>): Delete.
> >     (load_pair_dw_<TX:mode><TX2:mode>): Delete.
> >     (*load_pair_<ldst_sz>): New.
> >     (*load_pair_16): New.
> >     (store_pair_sw_<SX:mode><SX2:mode>): Delete.
> >     (store_pair_dw_<DX:mode><DX2:mode>): Delete.
> >     (store_pair_dw_<TX:mode><TX2:mode>): Delete.
> >     (*store_pair_<ldst_sz>): New.
> >     (*store_pair_16): New.
> >     (*load_pair_extendsidi2_aarch64): Adjust to use new form.
> >     (*zero_extendsidi2_aarch64): Likewise.
> >     * config/aarch64/iterators.md (VPAIR): New.
> >     * config/aarch64/predicates.md (aarch64_mem_pair_operand): Change to
> >     a special predicate derived from aarch64_mem_pair_operator.
> > ---
> >  gcc/config/aarch64/aarch64-ldpstp.md |  66 +++----
> >  gcc/config/aarch64/aarch64-modes.def |   6 +-
> >  gcc/config/aarch64/aarch64-protos.h  |   5 +-
> >  gcc/config/aarch64/aarch64-simd.md   |  60 -------
> >  gcc/config/aarch64/aarch64.cc        | 257 +++++++++++++++------------
> >  gcc/config/aarch64/aarch64.md        | 188 +++++++++-----------
> >  gcc/config/aarch64/iterators.md      |   3 +
> >  gcc/config/aarch64/predicates.md     |  10 +-
> >  8 files changed, 270 insertions(+), 325 deletions(-)
> >
> > diff --git a/gcc/config/aarch64/aarch64-ldpstp.md b/gcc/config/aarch64/aarch64-ldpstp.md
> > index 1ee7c73ff0c..dc39af85254 100644
> > --- a/gcc/config/aarch64/aarch64-ldpstp.md
> > +++ b/gcc/config/aarch64/aarch64-ldpstp.md
> > @@ -24,10 +24,10 @@ (define_peephole2
> >     (set (match_operand:GPI 2 "register_operand" "")
> >     (match_operand:GPI 3 "memory_operand" ""))]
> >    "aarch64_operands_ok_for_ldpstp (operands, true, <MODE>mode)"
> > -  [(parallel [(set (match_dup 0) (match_dup 1))
> > -         (set (match_dup 2) (match_dup 3))])]
> > +  [(const_int 0)]
> >  {
> > -  aarch64_swap_ldrstr_operands (operands, true);
> > +  aarch64_finish_ldpstp_peephole (operands, true);
> > +  DONE;
> >  })
> >  
> >  (define_peephole2
> > @@ -36,10 +36,10 @@ (define_peephole2
> >     (set (match_operand:GPI 2 "memory_operand" "")
> >     (match_operand:GPI 3 "aarch64_reg_or_zero" ""))]
> >    "aarch64_operands_ok_for_ldpstp (operands, false, <MODE>mode)"
> > -  [(parallel [(set (match_dup 0) (match_dup 1))
> > -         (set (match_dup 2) (match_dup 3))])]
> > +  [(const_int 0)]
> >  {
> > -  aarch64_swap_ldrstr_operands (operands, false);
> > +  aarch64_finish_ldpstp_peephole (operands, false);
> > +  DONE;
> >  })
> >  
> >  (define_peephole2
> > @@ -48,10 +48,10 @@ (define_peephole2
> >     (set (match_operand:GPF 2 "register_operand" "")
> >     (match_operand:GPF 3 "memory_operand" ""))]
> >    "aarch64_operands_ok_for_ldpstp (operands, true, <MODE>mode)"
> > -  [(parallel [(set (match_dup 0) (match_dup 1))
> > -         (set (match_dup 2) (match_dup 3))])]
> > +  [(const_int 0)]
> >  {
> > -  aarch64_swap_ldrstr_operands (operands, true);
> > +  aarch64_finish_ldpstp_peephole (operands, true);
> > +  DONE;
> >  })
> >  
> >  (define_peephole2
> > @@ -60,10 +60,10 @@ (define_peephole2
> >     (set (match_operand:GPF 2 "memory_operand" "")
> >     (match_operand:GPF 3 "aarch64_reg_or_fp_zero" ""))]
> >    "aarch64_operands_ok_for_ldpstp (operands, false, <MODE>mode)"
> > -  [(parallel [(set (match_dup 0) (match_dup 1))
> > -         (set (match_dup 2) (match_dup 3))])]
> > +  [(const_int 0)]
> >  {
> > -  aarch64_swap_ldrstr_operands (operands, false);
> > +  aarch64_finish_ldpstp_peephole (operands, false);
> > +  DONE;
> >  })
> >  
> >  (define_peephole2
> > @@ -72,10 +72,10 @@ (define_peephole2
> >     (set (match_operand:DREG2 2 "register_operand" "")
> >     (match_operand:DREG2 3 "memory_operand" ""))]
> >    "aarch64_operands_ok_for_ldpstp (operands, true, <DREG:MODE>mode)"
> > -  [(parallel [(set (match_dup 0) (match_dup 1))
> > -         (set (match_dup 2) (match_dup 3))])]
> > +  [(const_int 0)]
> >  {
> > -  aarch64_swap_ldrstr_operands (operands, true);
> > +  aarch64_finish_ldpstp_peephole (operands, true);
> > +  DONE;
> >  })
> >  
> >  (define_peephole2
> > @@ -84,10 +84,10 @@ (define_peephole2
> >     (set (match_operand:DREG2 2 "memory_operand" "")
> >     (match_operand:DREG2 3 "register_operand" ""))]
> >    "aarch64_operands_ok_for_ldpstp (operands, false, <DREG:MODE>mode)"
> > -  [(parallel [(set (match_dup 0) (match_dup 1))
> > -         (set (match_dup 2) (match_dup 3))])]
> > +  [(const_int 0)]
> >  {
> > -  aarch64_swap_ldrstr_operands (operands, false);
> > +  aarch64_finish_ldpstp_peephole (operands, false);
> > +  DONE;
> >  })
> >  
> >  (define_peephole2
> > @@ -99,10 +99,10 @@ (define_peephole2
> >     && aarch64_operands_ok_for_ldpstp (operands, true, <VQ:MODE>mode)
> >     && (aarch64_tune_params.extra_tuning_flags
> >     & AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS) == 0"
> > -  [(parallel [(set (match_dup 0) (match_dup 1))
> > -         (set (match_dup 2) (match_dup 3))])]
> > +  [(const_int 0)]
> >  {
> > -  aarch64_swap_ldrstr_operands (operands, true);
> > +  aarch64_finish_ldpstp_peephole (operands, true);
> > +  DONE;
> >  })
> >  
> >  (define_peephole2
> > @@ -114,10 +114,10 @@ (define_peephole2
> >     && aarch64_operands_ok_for_ldpstp (operands, false, <VQ:MODE>mode)
> >     && (aarch64_tune_params.extra_tuning_flags
> >     & AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS) == 0"
> > -  [(parallel [(set (match_dup 0) (match_dup 1))
> > -         (set (match_dup 2) (match_dup 3))])]
> > +  [(const_int 0)]
> >  {
> > -  aarch64_swap_ldrstr_operands (operands, false);
> > +  aarch64_finish_ldpstp_peephole (operands, false);
> > +  DONE;
> >  })
> >  
> >  
> > @@ -129,10 +129,10 @@ (define_peephole2
> >     (set (match_operand:DI 2 "register_operand" "")
> >     (sign_extend:DI (match_operand:SI 3 "memory_operand" "")))]
> >    "aarch64_operands_ok_for_ldpstp (operands, true, SImode)"
> > -  [(parallel [(set (match_dup 0) (sign_extend:DI (match_dup 1)))
> > -         (set (match_dup 2) (sign_extend:DI (match_dup 3)))])]
> > +  [(const_int 0)]
> >  {
> > -  aarch64_swap_ldrstr_operands (operands, true);
> > +  aarch64_finish_ldpstp_peephole (operands, true, SIGN_EXTEND);
> > +  DONE;
> >  })
> >  
> >  (define_peephole2
> > @@ -141,10 +141,10 @@ (define_peephole2
> >     (set (match_operand:DI 2 "register_operand" "")
> >     (zero_extend:DI (match_operand:SI 3 "memory_operand" "")))]
> >    "aarch64_operands_ok_for_ldpstp (operands, true, SImode)"
> > -  [(parallel [(set (match_dup 0) (zero_extend:DI (match_dup 1)))
> > -         (set (match_dup 2) (zero_extend:DI (match_dup 3)))])]
> > +  [(const_int 0)]
> >  {
> > -  aarch64_swap_ldrstr_operands (operands, true);
> > +  aarch64_finish_ldpstp_peephole (operands, true, ZERO_EXTEND);
> > +  DONE;
> >  })
> >  
> >  ;; Handle storing of a floating point zero with integer data.
> > @@ -163,10 +163,10 @@ (define_peephole2
> >     (set (match_operand:<FCVT_TARGET> 2 "memory_operand" "")
> >     (match_operand:<FCVT_TARGET> 3 "aarch64_reg_zero_or_fp_zero" ""))]
> >    "aarch64_operands_ok_for_ldpstp (operands, false, <V_INT_EQUIV>mode)"
> > -  [(parallel [(set (match_dup 0) (match_dup 1))
> > -         (set (match_dup 2) (match_dup 3))])]
> > +  [(const_int 0)]
> >  {
> > -  aarch64_swap_ldrstr_operands (operands, false);
> > +  aarch64_finish_ldpstp_peephole (operands, false);
> > +  DONE;
> >  })
> >  
> >  ;; Handle consecutive load/store whose offset is out of the range
> > diff --git a/gcc/config/aarch64/aarch64-modes.def b/gcc/config/aarch64/aarch64-modes.def
> > index 6b4f4e17dd5..1e0d770f72f 100644
> > --- a/gcc/config/aarch64/aarch64-modes.def
> > +++ b/gcc/config/aarch64/aarch64-modes.def
> > @@ -93,9 +93,13 @@ INT_MODE (XI, 64);
> >  
> >  /* V8DI mode.  */
> >  VECTOR_MODE_WITH_PREFIX (V, INT, DI, 8, 5);
> > -
> >  ADJUST_ALIGNMENT (V8DI, 8);
> >  
> > +/* V2x4QImode.  Used in load/store pair patterns.  */
> > +VECTOR_MODE_WITH_PREFIX (V2x, INT, QI, 4, 5);
> > +ADJUST_NUNITS (V2x4QI, 8);
> > +ADJUST_ALIGNMENT (V2x4QI, 4);
> > +
> >  /* Define Advanced SIMD modes for structures of 2, 3 and 4 d-registers.  */
> >  #define ADV_SIMD_D_REG_STRUCT_MODES(NVECS, VB, VH, VS, VD) \
> >    VECTOR_MODES_WITH_PREFIX (V##NVECS##x, INT, 8, 3); \
> > diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
> > index e463fd5c817..2ab54f244a7 100644
> > --- a/gcc/config/aarch64/aarch64-protos.h
> > +++ b/gcc/config/aarch64/aarch64-protos.h
> > @@ -967,6 +967,8 @@ void aarch64_split_compare_and_swap (rtx op[]);
> >  void aarch64_split_atomic_op (enum rtx_code, rtx, rtx, rtx, rtx, rtx, rtx);
> >  
> >  bool aarch64_gen_adjusted_ldpstp (rtx *, bool, machine_mode, RTX_CODE);
> > +void aarch64_finish_ldpstp_peephole (rtx *, bool,
> > +                                enum rtx_code = (enum rtx_code)0);
> >  
> >  void aarch64_expand_sve_vec_cmp_int (rtx, rtx_code, rtx, rtx);
> >  bool aarch64_expand_sve_vec_cmp_float (rtx, rtx_code, rtx, rtx, bool);
> > @@ -1022,8 +1024,9 @@ bool aarch64_mergeable_load_pair_p (machine_mode, rtx, rtx);
> >  bool aarch64_operands_ok_for_ldpstp (rtx *, bool, machine_mode);
> >  bool aarch64_operands_adjust_ok_for_ldpstp (rtx *, bool, machine_mode);
> >  bool aarch64_mem_ok_with_ldpstp_policy_model (rtx, bool, machine_mode);
> > -void aarch64_swap_ldrstr_operands (rtx *, bool);
> >  bool aarch64_ldpstp_operand_mode_p (machine_mode);
> > +rtx aarch64_gen_load_pair (rtx, rtx, rtx, enum rtx_code = (enum rtx_code)0);
> > +rtx aarch64_gen_store_pair (rtx, rtx, rtx);
> >  
> >  extern void aarch64_asm_output_pool_epilogue (FILE *, const char *,
> >                                           tree, HOST_WIDE_INT);
> > diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
> > index c6f2d582837..6f5080ab030 100644
> > --- a/gcc/config/aarch64/aarch64-simd.md
> > +++ b/gcc/config/aarch64/aarch64-simd.md
> > @@ -231,38 +231,6 @@ (define_insn "aarch64_store_lane0<mode>"
> >    [(set_attr "type" "neon_store1_1reg<q>")]
> >  )
> >  
> > -(define_insn "load_pair<DREG:mode><DREG2:mode>"
> > -  [(set (match_operand:DREG 0 "register_operand")
> > -   (match_operand:DREG 1 "aarch64_mem_pair_operand"))
> > -   (set (match_operand:DREG2 2 "register_operand")
> > -   (match_operand:DREG2 3 "memory_operand"))]
> > -  "TARGET_FLOAT
> > -   && rtx_equal_p (XEXP (operands[3], 0),
> > -              plus_constant (Pmode,
> > -                             XEXP (operands[1], 0),
> > -                             GET_MODE_SIZE (<DREG:MODE>mode)))"
> > -  {@ [ cons: =0 , 1   , =2 , 3 ; attrs: type ]
> > -     [ w        , Ump , w  , m ; neon_ldp    ] ldp\t%d0, %d2, %z1
> > -     [ r        , Ump , r  , m ; load_16     ] ldp\t%x0, %x2, %z1
> > -  }
> > -)
> > -
> > -(define_insn "vec_store_pair<DREG:mode><DREG2:mode>"
> > -  [(set (match_operand:DREG 0 "aarch64_mem_pair_operand")
> > -   (match_operand:DREG 1 "register_operand"))
> > -   (set (match_operand:DREG2 2 "memory_operand")
> > -   (match_operand:DREG2 3 "register_operand"))]
> > -  "TARGET_FLOAT
> > -   && rtx_equal_p (XEXP (operands[2], 0),
> > -              plus_constant (Pmode,
> > -                             XEXP (operands[0], 0),
> > -                             GET_MODE_SIZE (<DREG:MODE>mode)))"
> > -  {@ [ cons: =0 , 1 , =2 , 3 ; attrs: type ]
> > -     [ Ump      , w , m  , w ; neon_stp    ] stp\t%d1, %d3, %z0
> > -     [ Ump      , r , m  , r ; store_16    ] stp\t%x1, %x3, %z0
> > -  }
> > -)
> > -
> >  (define_insn "aarch64_simd_stp<mode>"
> >    [(set (match_operand:VP_2E 0 "aarch64_mem_pair_lanes_operand")
> >     (vec_duplicate:VP_2E (match_operand:<VEL> 1 "register_operand")))]
> > @@ -273,34 +241,6 @@ (define_insn "aarch64_simd_stp<mode>"
> >    }
> >  )
> >  
> > -(define_insn "load_pair<VQ:mode><VQ2:mode>"
> > -  [(set (match_operand:VQ 0 "register_operand" "=w")
> > -   (match_operand:VQ 1 "aarch64_mem_pair_operand" "Ump"))
> > -   (set (match_operand:VQ2 2 "register_operand" "=w")
> > -   (match_operand:VQ2 3 "memory_operand" "m"))]
> > -  "TARGET_FLOAT
> > -    && rtx_equal_p (XEXP (operands[3], 0),
> > -               plus_constant (Pmode,
> > -                          XEXP (operands[1], 0),
> > -                          GET_MODE_SIZE (<VQ:MODE>mode)))"
> > -  "ldp\\t%q0, %q2, %z1"
> > -  [(set_attr "type" "neon_ldp_q")]
> > -)
> > -
> > -(define_insn "vec_store_pair<VQ:mode><VQ2:mode>"
> > -  [(set (match_operand:VQ 0 "aarch64_mem_pair_operand" "=Ump")
> > -   (match_operand:VQ 1 "register_operand" "w"))
> > -   (set (match_operand:VQ2 2 "memory_operand" "=m")
> > -   (match_operand:VQ2 3 "register_operand" "w"))]
> > -  "TARGET_FLOAT
> > -   && rtx_equal_p (XEXP (operands[2], 0),
> > -              plus_constant (Pmode,
> > -                             XEXP (operands[0], 0),
> > -                             GET_MODE_SIZE (<VQ:MODE>mode)))"
> > -  "stp\\t%q1, %q3, %z0"
> > -  [(set_attr "type" "neon_stp_q")]
> > -)
> > -
> >  (define_expand "@aarch64_split_simd_mov<mode>"
> >    [(set (match_operand:VQMOV 0)
> >     (match_operand:VQMOV 1))]
> > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> > index ccf081d2a16..1f6094bf1bc 100644
> > --- a/gcc/config/aarch64/aarch64.cc
> > +++ b/gcc/config/aarch64/aarch64.cc
> > @@ -9056,59 +9056,81 @@ aarch64_pop_regs (unsigned regno1, unsigned regno2, HOST_WIDE_INT adjustment,
> >      }
> >  }
> >  
> > -/* Generate and return a store pair instruction of mode MODE to store
> > -   register REG1 to MEM1 and register REG2 to MEM2.  */
> > +static machine_mode
> > +aarch64_pair_mode_for_mode (machine_mode mode)
> > +{
> > +  if (known_eq (GET_MODE_SIZE (mode), 4))
> > +    return E_V2x4QImode;
> > +  else if (known_eq (GET_MODE_SIZE (mode), 8))
> > +    return E_V2x8QImode;
> > +  else if (known_eq (GET_MODE_SIZE (mode), 16))
> > +    return E_V2x16QImode;
> > +  else
> > +    gcc_unreachable ();
> > +}
> 
> Missing function comment.  There should be no need to use E_ outside switches.

Fixed, thanks.
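For anyone reading along, the mapping in question can be sketched standalone like the following.  This is only an illustrative analogue, not the v2 code: the stub enum stands in for GCC's machine_mode values (V2x4QImode etc.), and the real function compares poly-int sizes with known_eq rather than plain integers.

```cpp
#include <cassert>
#include <cstdlib>

// Stand-ins for GCC's V2x4QImode/V2x8QImode/V2x16QImode machine modes.
enum pair_mode { V2x4QI, V2x8QI, V2x16QI };

// Given the byte size of one operand of a load/store pair, return the
// canonical mode used for the single memory operand covering the whole
// pair: twice the operand size, aligned to the operand size.
static pair_mode pair_mode_for_size (unsigned operand_size)
{
  switch (operand_size)
    {
    case 4:  return V2x4QI;   // 32-bit operands -> 8-byte pair mem
    case 8:  return V2x8QI;   // 64-bit operands -> 16-byte pair mem
    case 16: return V2x16QI;  // 128-bit operands -> 32-byte pair mem
    default: abort ();        // no other operand sizes form pairs
    }
}
```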

> 
> >  
> >  static rtx
> > -aarch64_gen_store_pair (machine_mode mode, rtx mem1, rtx reg1, rtx mem2,
> > -                   rtx reg2)
> > +aarch64_pair_mem_from_base (rtx mem)
> >  {
> > -  switch (mode)
> > -    {
> > -    case E_DImode:
> > -      return gen_store_pair_dw_didi (mem1, reg1, mem2, reg2);
> > -
> > -    case E_DFmode:
> > -      return gen_store_pair_dw_dfdf (mem1, reg1, mem2, reg2);
> > -
> > -    case E_TFmode:
> > -      return gen_store_pair_dw_tftf (mem1, reg1, mem2, reg2);
> > +  auto pair_mode = aarch64_pair_mode_for_mode (GET_MODE (mem));
> > +  mem = adjust_bitfield_address_nv (mem, pair_mode, 0);
> > +  gcc_assert (aarch64_mem_pair_lanes_operand (mem, pair_mode));
> > +  return mem;
> > +}
> >  
> > -    case E_V4SImode:
> > -      return gen_vec_store_pairv4siv4si (mem1, reg1, mem2, reg2);
> > +/* Generate and return a store pair instruction to store REG1 and REG2
> > +   into memory starting at BASE_MEM.  All three rtxes should have modes of the
> > +   same size.  */
> >  
> > -    case E_V16QImode:
> > -      return gen_vec_store_pairv16qiv16qi (mem1, reg1, mem2, reg2);
> > +rtx
> > +aarch64_gen_store_pair (rtx base_mem, rtx reg1, rtx reg2)
> > +{
> > +  rtx pair_mem = aarch64_pair_mem_from_base (base_mem);
> >  
> > -    default:
> > -      gcc_unreachable ();
> > -    }
> > +  return gen_rtx_SET (pair_mem,
> > +                 gen_rtx_UNSPEC (GET_MODE (pair_mem),
> > +                                 gen_rtvec (2, reg1, reg2),
> > +                                 UNSPEC_STP));
> >  }
> >  
> > -/* Generate and regurn a load pair isntruction of mode MODE to load register
> > -   REG1 from MEM1 and register REG2 from MEM2.  */
> > +/* Generate and return a load pair instruction to load a pair of
> > +   registers starting at BASE_MEM into REG1 and REG2.  If CODE is
> > +   UNKNOWN, all three rtxes should have modes of the same size.
> > +   Otherwise, CODE is {SIGN,ZERO}_EXTEND, base_mem should be in SImode,
> > +   and REG{1,2} should be in DImode.  */
> >  
> > -static rtx
> > -aarch64_gen_load_pair (machine_mode mode, rtx reg1, rtx mem1, rtx reg2,
> > -                  rtx mem2)
> > +rtx
> > +aarch64_gen_load_pair (rtx reg1, rtx reg2, rtx base_mem, enum rtx_code code)
> >  {
> > -  switch (mode)
> > -    {
> > -    case E_DImode:
> > -      return gen_load_pair_dw_didi (reg1, mem1, reg2, mem2);
> > +  rtx pair_mem = aarch64_pair_mem_from_base (base_mem);
> >  
> > -    case E_DFmode:
> > -      return gen_load_pair_dw_dfdf (reg1, mem1, reg2, mem2);
> > -
> > -    case E_TFmode:
> > -      return gen_load_pair_dw_tftf (reg1, mem1, reg2, mem2);
> > +  const bool any_extend_p = (code == ZERO_EXTEND || code == SIGN_EXTEND);
> > +  if (any_extend_p)
> > +    {
> > +      gcc_checking_assert (GET_MODE (base_mem) == SImode);
> > +      gcc_checking_assert (GET_MODE (reg1) == DImode);
> > +      gcc_checking_assert (GET_MODE (reg2) == DImode);
> 
> Not a personal preference, but I think single asserts with && are
> preferred.

Ah, that's a shame.  Different asserts allow you to see which one failed from
the backtrace.  Anyway, I've collapsed these in the latest version.

> 
> > +    }
> > +  else
> > +    gcc_assert (code == UNKNOWN);
> > +
> > +  rtx unspecs[2] = {
> > +    gen_rtx_UNSPEC (any_extend_p ? SImode : GET_MODE (reg1),
> > +               gen_rtvec (1, pair_mem),
> > +               UNSPEC_LDP_FST),
> > +    gen_rtx_UNSPEC (any_extend_p ? SImode : GET_MODE (reg2),
> 
> IIUC, the unspec modes could both be GET_MODE (base_mem)

I don't think so.  In the non-extending case we allow pairs loading to
registers in distinct modes, provided the modes are of the same size.
So I think we should respect the modes of the registers, and allow the
unspec to hide the mode change.  Does that make sense?
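To make the same-size rule concrete: a non-extending pair may load, say, a DFmode value and a V2SImode value (both 8 bytes), with each lane unspec taking its own register's mode so that the mode change is hidden from the single pair mem.  A toy validity check under that rule (the mode names and size table are illustrative, not GCC's actual data):

```cpp
#include <cassert>

// A few illustrative AArch64 modes and their sizes in bytes.
enum mode_stub { SI, DI, DF, V2SI };

static unsigned
mode_size (mode_stub m)
{
  switch (m)
    {
    case SI: return 4;
    case DI: case DF: case V2SI: return 8;
    }
  return 0;
}

// A non-extending pair is fine iff both register modes have the same
// size; the modes themselves need not be identical, since each lane
// unspec carries its own register's mode.
static bool
pair_reg_modes_ok (mode_stub m1, mode_stub m2)
{
  return mode_size (m1) == mode_size (m2);
}
```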

> 
> > +               gen_rtvec (1, copy_rtx (pair_mem)),
> > +               UNSPEC_LDP_SND)
> > +  };
> >  
> > -    case E_V4SImode:
> > -      return gen_load_pairv4siv4si (reg1, mem1, reg2, mem2);
> > +  if (any_extend_p)
> > +    for (int i = 0; i < 2; i++)
> > +      unspecs[i] = gen_rtx_fmt_e (code, DImode, unspecs[i]);
> >  
> > -    default:
> > -      gcc_unreachable ();
> > -    }
> > +  return gen_rtx_PARALLEL (VOIDmode,
> > +                      gen_rtvec (2,
> > +                                 gen_rtx_SET (reg1, unspecs[0]),
> > +                                 gen_rtx_SET (reg2, unspecs[1])));
> >  }
> >  
> >  /* Return TRUE if return address signing should be enabled for the current
> > @@ -9321,8 +9343,19 @@ aarch64_save_callee_saves (poly_int64 bytes_below_sp,
> >       offset -= fp_offset;
> >     }
> >        rtx mem = gen_frame_mem (mode, plus_constant (Pmode, base_rtx, offset));
> > -      bool need_cfa_note_p = (base_rtx != stack_pointer_rtx);
> >  
> > +      rtx cfa_base = stack_pointer_rtx;
> > +      poly_int64 cfa_offset = sp_offset;
> 
> I don't think we need both cfa_offset and sp_offset.  sp_offset in the
> current code only exists for CFI purposes.

Fixed, thanks.

> 
> > +
> > +      if (hard_fp_valid_p && frame_pointer_needed)
> > +   {
> > +     cfa_base = hard_frame_pointer_rtx;
> > +     cfa_offset += (bytes_below_sp - frame.bytes_below_hard_fp);
> > +   }
> > +
> > +      rtx cfa_mem = gen_frame_mem (mode,
> > +                              plus_constant (Pmode,
> > +                                             cfa_base, cfa_offset));
> >        unsigned int regno2;
> >        if (!aarch64_sve_mode_p (mode)
> >       && i + 1 < regs.size ()
> > @@ -9331,45 +9364,37 @@ aarch64_save_callee_saves (poly_int64 bytes_below_sp,
> >                    frame.reg_offset[regno2] - frame.reg_offset[regno]))
> >     {
> >       rtx reg2 = gen_rtx_REG (mode, regno2);
> > -     rtx mem2;
> >  
> >       offset += GET_MODE_SIZE (mode);
> > -     mem2 = gen_frame_mem (mode, plus_constant (Pmode, base_rtx, offset));
> > -     insn = emit_insn (aarch64_gen_store_pair (mode, mem, reg, mem2,
> > -                                               reg2));
> > -
> > -     /* The first part of a frame-related parallel insn is
> > -        always assumed to be relevant to the frame
> > -        calculations; subsequent parts, are only
> > -        frame-related if explicitly marked.  */
> > +     insn = emit_insn (aarch64_gen_store_pair (mem, reg, reg2));
> > +
> >       if (aarch64_emit_cfi_for_reg_p (regno2))
> >         {
> > -         if (need_cfa_note_p)
> > -           aarch64_add_cfa_expression (insn, reg2, stack_pointer_rtx,
> > -                                       sp_offset + GET_MODE_SIZE (mode));
> > -         else
> > -           RTX_FRAME_RELATED_P (XVECEXP (PATTERN (insn), 0, 1)) = 1;
> > +         rtx cfa_mem2 = adjust_address_nv (cfa_mem,
> > +                                           Pmode,
> > +                                           GET_MODE_SIZE (mode));
> 
> Think this should use gen_frame_mem directly, rather than moving beyond
> the bounds of the original mem.

Done.

> 
> > +         add_reg_note (insn, REG_CFA_OFFSET,
> > +                       gen_rtx_SET (cfa_mem2, reg2));
> >         }
> >  
> >       regno = regno2;
> >       ++i;
> >     }
> >        else if (mode == VNx2DImode && BYTES_BIG_ENDIAN)
> > -   {
> > -     insn = emit_insn (gen_aarch64_pred_mov (mode, mem, ptrue, reg));
> > -     need_cfa_note_p = true;
> > -   }
> > +   insn = emit_insn (gen_aarch64_pred_mov (mode, mem, ptrue, reg));
> >        else if (aarch64_sve_mode_p (mode))
> >     insn = emit_insn (gen_rtx_SET (mem, reg));
> >        else
> >     insn = emit_move_insn (mem, reg);
> >  
> >        RTX_FRAME_RELATED_P (insn) = frame_related_p;
> > -      if (frame_related_p && need_cfa_note_p)
> > -   aarch64_add_cfa_expression (insn, reg, stack_pointer_rtx, sp_offset);
> > +
> > +      if (frame_related_p)
> > +   add_reg_note (insn, REG_CFA_OFFSET, gen_rtx_SET (cfa_mem, reg));
> 
> For the record, I might need to add back some CFA_EXPRESSIONs for
> locally-streaming SME functions, to ensure that the CFI code doesn't
> aggregate SVE saves across a change in the VG DWARF register.
> But it's probably easier to do that once the patch is in,
> since having a note on all insns will help to ensure consistency.
> 
> >      }
> >  }
> >  
> > +
> 
> Stray extra whitespace.

Fixed.

> 
> >  /* Emit code to restore the callee registers in REGS, ignoring pop candidates
> >     and any other registers that are handled separately.  Write the appropriate
> >     REG_CFA_RESTORE notes into CFI_OPS.
> > @@ -9425,12 +9450,7 @@ aarch64_restore_callee_saves (poly_int64 bytes_below_sp,
> >                    frame.reg_offset[regno2] - frame.reg_offset[regno]))
> >     {
> >       rtx reg2 = gen_rtx_REG (mode, regno2);
> > -     rtx mem2;
> > -
> > -     offset += GET_MODE_SIZE (mode);
> > -     mem2 = gen_frame_mem (mode, plus_constant (Pmode, base_rtx, offset));
> > -     emit_insn (aarch64_gen_load_pair (mode, reg, mem, reg2, mem2));
> > -
> > +     emit_insn (aarch64_gen_load_pair (reg, reg2, mem));
> >       *cfi_ops = alloc_reg_note (REG_CFA_RESTORE, reg2, *cfi_ops);
> >       regno = regno2;
> >       ++i;
> > @@ -9762,9 +9782,9 @@ aarch64_process_components (sbitmap components, bool prologue_p)
> >                          : gen_rtx_SET (reg2, mem2);
> >  
> >        if (prologue_p)
> > -   insn = emit_insn (aarch64_gen_store_pair (mode, mem, reg, mem2, reg2));
> > +   insn = emit_insn (aarch64_gen_store_pair (mem, reg, reg2));
> >        else
> > -   insn = emit_insn (aarch64_gen_load_pair (mode, reg, mem, reg2, mem2));
> > +   insn = emit_insn (aarch64_gen_load_pair (reg, reg2, mem));
> >  
> >        if (frame_related_p || frame_related2_p)
> >     {
> > @@ -10983,12 +11003,18 @@ aarch64_classify_address (struct aarch64_address_info *info,
> >       mode of the corresponding addressing mode is half of that.  */
> >    if (type == ADDR_QUERY_LDP_STP_N)
> >      {
> > -      if (known_eq (GET_MODE_SIZE (mode), 16))
> > +      if (known_eq (GET_MODE_SIZE (mode), 32))
> > +   mode = V16QImode;
> > +      else if (known_eq (GET_MODE_SIZE (mode), 16))
> >     mode = DFmode;
> >        else if (known_eq (GET_MODE_SIZE (mode), 8))
> >     mode = SFmode;
> >        else
> >     return false;
> > +
> > +      /* This isn't really an Advanced SIMD struct mode, but a mode
> > +    used to represent the complete mem in a load/store pair.  */
> > +      advsimd_struct_p = false;
> >      }
> >  
> >    bool allow_reg_index_p = (!load_store_pair_p
> > @@ -12609,7 +12635,8 @@ aarch64_print_operand (FILE *f, rtx x, int code)
> >     if (!MEM_P (x)
> >         || (code == 'y'
> >             && maybe_ne (GET_MODE_SIZE (mode), 8)
> > -           && maybe_ne (GET_MODE_SIZE (mode), 16)))
> > +           && maybe_ne (GET_MODE_SIZE (mode), 16)
> > +           && maybe_ne (GET_MODE_SIZE (mode), 32)))
> >       {
> >         output_operand_lossage ("invalid operand for '%%%c'", code);
> >         return;
> > @@ -25431,10 +25458,8 @@ aarch64_copy_one_block_and_progress_pointers (rtx *src, rtx *dst,
> >        *src = adjust_address (*src, mode, 0);
> >        *dst = adjust_address (*dst, mode, 0);
> >        /* Emit the memcpy.  */
> > -      emit_insn (aarch64_gen_load_pair (mode, reg1, *src, reg2,
> > -                                   aarch64_progress_pointer (*src)));
> > -      emit_insn (aarch64_gen_store_pair (mode, *dst, reg1,
> > -                                    aarch64_progress_pointer (*dst), reg2));
> > +      emit_insn (aarch64_gen_load_pair (reg1, reg2, *src));
> > +      emit_insn (aarch64_gen_store_pair (*dst, reg1, reg2));
> >        /* Move the pointers forward.  */
> >        *src = aarch64_move_pointer (*src, 32);
> >        *dst = aarch64_move_pointer (*dst, 32);
> > @@ -25613,8 +25638,7 @@ aarch64_set_one_block_and_progress_pointer (rtx src, rtx *dst,
> >        /* "Cast" the *dst to the correct mode.  */
> >        *dst = adjust_address (*dst, mode, 0);
> >        /* Emit the memset.  */
> > -      emit_insn (aarch64_gen_store_pair (mode, *dst, src,
> > -                                    aarch64_progress_pointer (*dst), src));
> > +      emit_insn (aarch64_gen_store_pair (*dst, src, src));
> >  
> >        /* Move the pointers forward.  */
> >        *dst = aarch64_move_pointer (*dst, 32);
> > @@ -26812,6 +26836,22 @@ aarch64_swap_ldrstr_operands (rtx* operands, bool load)
> >      }
> >  }
> >  
> > +void
> > +aarch64_finish_ldpstp_peephole (rtx *operands, bool load_p, enum rtx_code code)
> 
> Missing function comment.

Fixed.

> 
> > +{
> > +  aarch64_swap_ldrstr_operands (operands, load_p);
> > +
> > +  if (load_p)
> > +    emit_insn (aarch64_gen_load_pair (operands[0], operands[2],
> > +                                 operands[1], code));
> > +  else
> > +    {
> > +      gcc_assert (code == UNKNOWN);
> > +      emit_insn (aarch64_gen_store_pair (operands[0], operands[1],
> > +                                    operands[3]));
> > +    }
> > +}
> > +
> >  /* Taking X and Y to be HOST_WIDE_INT pointers, return the result of a
> >     comparison between the two.  */
> >  int
> > @@ -26993,8 +27033,8 @@ bool
> >  aarch64_gen_adjusted_ldpstp (rtx *operands, bool load,
> >                          machine_mode mode, RTX_CODE code)
> >  {
> > -  rtx base, offset_1, offset_3, t1, t2;
> > -  rtx mem_1, mem_2, mem_3, mem_4;
> > +  rtx base, offset_1, offset_3;
> > +  rtx mem_1, mem_2;
> >    rtx temp_operands[8];
> >    HOST_WIDE_INT off_val_1, off_val_3, base_off, new_off_1, new_off_3,
> >             stp_off_upper_limit, stp_off_lower_limit, msize;
> > @@ -27019,21 +27059,17 @@ aarch64_gen_adjusted_ldpstp (rtx *operands, bool load,
> >    if (load)
> >      {
> >        mem_1 = copy_rtx (temp_operands[1]);
> > -      mem_2 = copy_rtx (temp_operands[3]);
> > -      mem_3 = copy_rtx (temp_operands[5]);
> > -      mem_4 = copy_rtx (temp_operands[7]);
> > +      mem_2 = copy_rtx (temp_operands[5]);
> >      }
> >    else
> >      {
> >        mem_1 = copy_rtx (temp_operands[0]);
> > -      mem_2 = copy_rtx (temp_operands[2]);
> > -      mem_3 = copy_rtx (temp_operands[4]);
> > -      mem_4 = copy_rtx (temp_operands[6]);
> > +      mem_2 = copy_rtx (temp_operands[4]);
> >        gcc_assert (code == UNKNOWN);
> >      }
> >  
> >    extract_base_offset_in_addr (mem_1, &base, &offset_1);
> > -  extract_base_offset_in_addr (mem_3, &base, &offset_3);
> > +  extract_base_offset_in_addr (mem_2, &base, &offset_3);
> 
> mem_2 with offset_3 feels a bit awkward.  Might be worth using mem_3 instead,
> so that the memory and register numbers are in sync.

I went with mem_1 and mem_2 for now.  I think it looks fairly consistent with
that change, WDYT?

> 
> I suppose we still need Ump for the extending loads, is that right?
> Are there any other uses left?

There is a use of satisfies_constraint_Ump in aarch64_process_components, but
that's it.
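
For anyone skimming the thread: the remaining Ump uses aside, the core of the rewrite is that the old patterns' two separately-reloadable mems (constrained with Ump + a lax "m") are replaced by a single pair-wide mem plus unspecs. A stripped-down sketch of the new store-pair shape, simplified from the patch above (constraints, alternatives, and attributes elided), looks like:

```lisp
;; Simplified sketch only, not the literal pattern from the patch:
;; one V2x8QI memory operand covers the whole pair, and an unspec
;; concatenates the two DImode registers, so the RA can only ever
;; reload the single pair mem as a unit.
(define_insn "*store_pair_sketch"
  [(set (match_operand:V2x8QI 0 "aarch64_mem_pair_lanes_operand" "=Umn")
	(unspec:V2x8QI
	  [(match_operand:DI 1 "aarch64_stp_reg_operand" "rYZ")
	   (match_operand:DI 2 "aarch64_stp_reg_operand" "rYZ")]
	  UNSPEC_STP))]
  ""
  "stp\t%x1, %x2, %y0"
)
```

Since there is only one match_operand:V2x8QI for the memory, LRA cannot reload half of the pair independently, which is exactly the correctness fix described at the top of the patch.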

How does the new version look?

Thanks,
Alex

> 
> Thanks,
> Richard
> 
> >    gcc_assert (base != NULL_RTX && offset_1 != NULL_RTX
> >           && offset_3 != NULL_RTX);
> >  
> > @@ -27097,63 +27133,48 @@ aarch64_gen_adjusted_ldpstp (rtx *operands, bool load,
> >    replace_equiv_address_nv (mem_1, plus_constant (Pmode, operands[8],
> >                                               new_off_1), true);
> >    replace_equiv_address_nv (mem_2, plus_constant (Pmode, operands[8],
> > -                                             new_off_1 + msize), true);
> > -  replace_equiv_address_nv (mem_3, plus_constant (Pmode, operands[8],
> >                                               new_off_3), true);
> > -  replace_equiv_address_nv (mem_4, plus_constant (Pmode, operands[8],
> > -                                             new_off_3 + msize), true);
> >  
> >    if (!aarch64_mem_pair_operand (mem_1, mode)
> > -      || !aarch64_mem_pair_operand (mem_3, mode))
> > +      || !aarch64_mem_pair_operand (mem_2, mode))
> >      return false;
> >  
> > -  if (code == ZERO_EXTEND)
> > -    {
> > -      mem_1 = gen_rtx_ZERO_EXTEND (DImode, mem_1);
> > -      mem_2 = gen_rtx_ZERO_EXTEND (DImode, mem_2);
> > -      mem_3 = gen_rtx_ZERO_EXTEND (DImode, mem_3);
> > -      mem_4 = gen_rtx_ZERO_EXTEND (DImode, mem_4);
> > -    }
> > -  else if (code == SIGN_EXTEND)
> > -    {
> > -      mem_1 = gen_rtx_SIGN_EXTEND (DImode, mem_1);
> > -      mem_2 = gen_rtx_SIGN_EXTEND (DImode, mem_2);
> > -      mem_3 = gen_rtx_SIGN_EXTEND (DImode, mem_3);
> > -      mem_4 = gen_rtx_SIGN_EXTEND (DImode, mem_4);
> > -    }
> > -
> >    if (load)
> >      {
> >        operands[0] = temp_operands[0];
> >        operands[1] = mem_1;
> >        operands[2] = temp_operands[2];
> > -      operands[3] = mem_2;
> >        operands[4] = temp_operands[4];
> > -      operands[5] = mem_3;
> > +      operands[5] = mem_2;
> >        operands[6] = temp_operands[6];
> > -      operands[7] = mem_4;
> >      }
> >    else
> >      {
> >        operands[0] = mem_1;
> >        operands[1] = temp_operands[1];
> > -      operands[2] = mem_2;
> >        operands[3] = temp_operands[3];
> > -      operands[4] = mem_3;
> > +      operands[4] = mem_2;
> >        operands[5] = temp_operands[5];
> > -      operands[6] = mem_4;
> >        operands[7] = temp_operands[7];
> >      }
> >  
> >    /* Emit adjusting instruction.  */
> >    emit_insn (gen_rtx_SET (operands[8], plus_constant (DImode, base, 
> > base_off)));
> >    /* Emit ldp/stp instructions.  */
> > -  t1 = gen_rtx_SET (operands[0], operands[1]);
> > -  t2 = gen_rtx_SET (operands[2], operands[3]);
> > -  emit_insn (gen_rtx_PARALLEL (VOIDmode, gen_rtvec (2, t1, t2)));
> > -  t1 = gen_rtx_SET (operands[4], operands[5]);
> > -  t2 = gen_rtx_SET (operands[6], operands[7]);
> > -  emit_insn (gen_rtx_PARALLEL (VOIDmode, gen_rtvec (2, t1, t2)));
> > +  if (load)
> > +    {
> > +      emit_insn (aarch64_gen_load_pair (operands[0], operands[2],
> > +                                   operands[1], code));
> > +      emit_insn (aarch64_gen_load_pair (operands[4], operands[6],
> > +                                   operands[5], code));
> > +    }
> > +  else
> > +    {
> > +      emit_insn (aarch64_gen_store_pair (operands[0], operands[1],
> > +                                    operands[3]));
> > +      emit_insn (aarch64_gen_store_pair (operands[4], operands[5],
> > +                                    operands[7]));
> > +    }
> >    return true;
> >  }
> >  
> > diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
> > index c92a51690c5..ffb6b0ba749 100644
> > --- a/gcc/config/aarch64/aarch64.md
> > +++ b/gcc/config/aarch64/aarch64.md
> > @@ -175,6 +175,9 @@ (define_c_enum "unspec" [
> >      UNSPEC_GOTSMALLTLS
> >      UNSPEC_GOTTINYPIC
> >      UNSPEC_GOTTINYTLS
> > +    UNSPEC_STP
> > +    UNSPEC_LDP_FST
> > +    UNSPEC_LDP_SND
> >      UNSPEC_LD1
> >      UNSPEC_LD2
> >      UNSPEC_LD2_DREG
> > @@ -453,6 +456,11 @@ (define_attr "predicated" "yes,no" (const_string "no"))
> >  ;; may chose to hold the tracking state encoded in SP.
> >  (define_attr "speculation_barrier" "true,false" (const_string "false"))
> >  
> > +;; Attribute used to identify load pair and store pair instructions.
> > +;; Currently the attribute is only applied to the non-writeback ldp/stp
> > +;; patterns.
> > +(define_attr "ldpstp" "ldp,stp,none" (const_string "none"))
> > +
> >  ;; -------------------------------------------------------------------
> >  ;; Pipeline descriptions and scheduling
> >  ;; -------------------------------------------------------------------
> > @@ -1735,100 +1743,62 @@ (define_expand "setmemdi"
> >    FAIL;
> >  })
> >  
> > -;; Operands 1 and 3 are tied together by the final condition; so we allow
> > -;; fairly lax checking on the second memory operation.
> > -(define_insn "load_pair_sw_<SX:mode><SX2:mode>"
> > -  [(set (match_operand:SX 0 "register_operand")
> > -   (match_operand:SX 1 "aarch64_mem_pair_operand"))
> > -   (set (match_operand:SX2 2 "register_operand")
> > -   (match_operand:SX2 3 "memory_operand"))]
> > -   "rtx_equal_p (XEXP (operands[3], 0),
> > -            plus_constant (Pmode,
> > -                           XEXP (operands[1], 0),
> > -                           GET_MODE_SIZE (<SX:MODE>mode)))"
> > -  {@ [ cons: =0 , 1   , =2 , 3 ; attrs: type     , arch ]
> > -     [ r        , Ump , r  , m ; load_8          , *    ] ldp\t%w0, %w2, %z1
> > -     [ w        , Ump , w  , m ; neon_load1_2reg , fp   ] ldp\t%s0, %s2, %z1
> > -  }
> > -)
> > -
> > -;; Storing different modes that can still be merged
> > -(define_insn "load_pair_dw_<DX:mode><DX2:mode>"
> > -  [(set (match_operand:DX 0 "register_operand")
> > -   (match_operand:DX 1 "aarch64_mem_pair_operand"))
> > -   (set (match_operand:DX2 2 "register_operand")
> > -   (match_operand:DX2 3 "memory_operand"))]
> > -   "rtx_equal_p (XEXP (operands[3], 0),
> > -            plus_constant (Pmode,
> > -                           XEXP (operands[1], 0),
> > -                           GET_MODE_SIZE (<DX:MODE>mode)))"
> > -  {@ [ cons: =0 , 1   , =2 , 3 ; attrs: type     , arch ]
> > -     [ r        , Ump , r  , m ; load_16         , *    ] ldp\t%x0, %x2, %z1
> > -     [ w        , Ump , w  , m ; neon_load1_2reg , fp   ] ldp\t%d0, %d2, %z1
> > -  }
> > -)
> > -
> > -(define_insn "load_pair_dw_<TX:mode><TX2:mode>"
> > -  [(set (match_operand:TX 0 "register_operand" "=w")
> > -   (match_operand:TX 1 "aarch64_mem_pair_operand" "Ump"))
> > -   (set (match_operand:TX2 2 "register_operand" "=w")
> > -   (match_operand:TX2 3 "memory_operand" "m"))]
> > -   "TARGET_SIMD
> > -    && rtx_equal_p (XEXP (operands[3], 0),
> > -               plus_constant (Pmode,
> > -                              XEXP (operands[1], 0),
> > -                              GET_MODE_SIZE (<TX:MODE>mode)))"
> > -  "ldp\\t%q0, %q2, %z1"
> > +(define_insn "*load_pair_<ldst_sz>"
> > +  [(set (match_operand:GPI 0 "aarch64_ldp_reg_operand")
> > +   (unspec [
> > +     (match_operand:<VPAIR> 1 "aarch64_mem_pair_lanes_operand")
> > +   ] UNSPEC_LDP_FST))
> > +   (set (match_operand:GPI 2 "aarch64_ldp_reg_operand")
> > +   (unspec [
> > +     (match_dup 1)
> > +   ] UNSPEC_LDP_SND))]
> > +  ""
> > +  {@ [cons: =0, 1,   =2; attrs: type,         arch]
> > +     [          r, Umn,  r; load_<ldpstp_sz>, *   ] ldp\t%<w>0, %<w>2, %y1
> > +     [          w, Umn,  w; neon_load1_2reg,  fp  ] ldp\t%<v>0, %<v>2, %y1
> > +  }
> > +  [(set_attr "ldpstp" "ldp")]
> > +)
> > +
> > +(define_insn "*load_pair_16"
> > +  [(set (match_operand:TI 0 "aarch64_ldp_reg_operand" "=w")
> > +   (unspec [
> > +     (match_operand:V2x16QI 1 "aarch64_mem_pair_lanes_operand" "Umn")
> > +   ] UNSPEC_LDP_FST))
> > +   (set (match_operand:TI 2 "aarch64_ldp_reg_operand" "=w")
> > +   (unspec [
> > +     (match_dup 1)
> > +   ] UNSPEC_LDP_SND))]
> > +  "TARGET_FLOAT"
> > +  "ldp\\t%q0, %q2, %y1"
> >    [(set_attr "type" "neon_ldp_q")
> > -   (set_attr "fp" "yes")]
> > -)
> > -
> > -;; Operands 0 and 2 are tied together by the final condition; so we allow
> > -;; fairly lax checking on the second memory operation.
> > -(define_insn "store_pair_sw_<SX:mode><SX2:mode>"
> > -  [(set (match_operand:SX 0 "aarch64_mem_pair_operand")
> > -   (match_operand:SX 1 "aarch64_reg_zero_or_fp_zero"))
> > -   (set (match_operand:SX2 2 "memory_operand")
> > -   (match_operand:SX2 3 "aarch64_reg_zero_or_fp_zero"))]
> > -   "rtx_equal_p (XEXP (operands[2], 0),
> > -            plus_constant (Pmode,
> > -                           XEXP (operands[0], 0),
> > -                           GET_MODE_SIZE (<SX:MODE>mode)))"
> > -  {@ [ cons: =0 , 1   , =2 , 3   ; attrs: type      , arch ]
> > -     [ Ump      , rYZ , m  , rYZ ; store_8          , *    ] stp\t%w1, %w3, %z0
> > -     [ Ump      , w   , m  , w   ; neon_store1_2reg , fp   ] stp\t%s1, %s3, %z0
> > -  }
> > -)
> > -
> > -;; Storing different modes that can still be merged
> > -(define_insn "store_pair_dw_<DX:mode><DX2:mode>"
> > -  [(set (match_operand:DX 0 "aarch64_mem_pair_operand")
> > -   (match_operand:DX 1 "aarch64_reg_zero_or_fp_zero"))
> > -   (set (match_operand:DX2 2 "memory_operand")
> > -   (match_operand:DX2 3 "aarch64_reg_zero_or_fp_zero"))]
> > -   "rtx_equal_p (XEXP (operands[2], 0),
> > -            plus_constant (Pmode,
> > -                           XEXP (operands[0], 0),
> > -                           GET_MODE_SIZE (<DX:MODE>mode)))"
> > -  {@ [ cons: =0 , 1   , =2 , 3   ; attrs: type      , arch ]
> > -     [ Ump      , rYZ , m  , rYZ ; store_16         , *    ] stp\t%x1, %x3, %z0
> > -     [ Ump      , w   , m  , w   ; neon_store1_2reg , fp   ] stp\t%d1, %d3, %z0
> > -  }
> > -)
> > -
> > -(define_insn "store_pair_dw_<TX:mode><TX2:mode>"
> > -  [(set (match_operand:TX 0 "aarch64_mem_pair_operand" "=Ump")
> > -   (match_operand:TX 1 "register_operand" "w"))
> > -   (set (match_operand:TX2 2 "memory_operand" "=m")
> > -   (match_operand:TX2 3 "register_operand" "w"))]
> > -   "TARGET_SIMD &&
> > -    rtx_equal_p (XEXP (operands[2], 0),
> > -            plus_constant (Pmode,
> > -                           XEXP (operands[0], 0),
> > -                           GET_MODE_SIZE (TFmode)))"
> > -  "stp\\t%q1, %q3, %z0"
> > +   (set_attr "fp" "yes")
> > +   (set_attr "ldpstp" "ldp")]
> > +)
> > +
> > +(define_insn "*store_pair_<ldst_sz>"
> > +  [(set (match_operand:<VPAIR> 0 "aarch64_mem_pair_lanes_operand")
> > +   (unspec:<VPAIR>
> > +     [(match_operand:GPI 1 "aarch64_stp_reg_operand")
> > +      (match_operand:GPI 2 "aarch64_stp_reg_operand")] UNSPEC_STP))]
> > +  ""
> > +  {@ [cons:  =0,   1,   2; attrs: type      , arch]
> > +     [         Umn, rYZ, rYZ; store_<ldpstp_sz>, *   ] stp\t%<w>1, %<w>2, %y0
> > +     [         Umn,   w,   w; neon_store1_2reg , fp  ] stp\t%<v>1, %<v>2, %y0
> > +  }
> > +  [(set_attr "ldpstp" "stp")]
> > +)
> > +
> > +(define_insn "*store_pair_16"
> > +  [(set (match_operand:V2x16QI 0 "aarch64_mem_pair_lanes_operand" "=Umn")
> > +   (unspec:V2x16QI
> > +     [(match_operand:TI 1 "aarch64_ldp_reg_operand" "w")
> > +      (match_operand:TI 2 "aarch64_ldp_reg_operand" "w")] UNSPEC_STP))]
> > +  "TARGET_FLOAT"
> > +  "stp\t%q1, %q2, %y0"
> >    [(set_attr "type" "neon_stp_q")
> > -   (set_attr "fp" "yes")]
> > +   (set_attr "fp" "yes")
> > +   (set_attr "ldpstp" "stp")]
> >  )
> >  
> >  ;; Writeback load/store pair patterns.
> > @@ -2074,14 +2044,15 @@ (define_insn "*extendsidi2_aarch64"
> >  
> >  (define_insn "*load_pair_extendsidi2_aarch64"
> >    [(set (match_operand:DI 0 "register_operand" "=r")
> > -   (sign_extend:DI (match_operand:SI 1 "aarch64_mem_pair_operand" "Ump")))
> > +   (sign_extend:DI (unspec:SI [
> > +     (match_operand:V2x4QI 1 "aarch64_mem_pair_lanes_operand" "Umn")
> > +   ] UNSPEC_LDP_FST)))
> >     (set (match_operand:DI 2 "register_operand" "=r")
> > -   (sign_extend:DI (match_operand:SI 3 "memory_operand" "m")))]
> > -  "rtx_equal_p (XEXP (operands[3], 0),
> > -           plus_constant (Pmode,
> > -                          XEXP (operands[1], 0),
> > -                          GET_MODE_SIZE (SImode)))"
> > -  "ldpsw\\t%0, %2, %z1"
> > +   (sign_extend:DI (unspec:SI [
> > +     (match_dup 1)
> > +   ] UNSPEC_LDP_SND)))]
> > +  ""
> > +  "ldpsw\\t%0, %2, %y1"
> >    [(set_attr "type" "load_8")]
> >  )
> >  
> > @@ -2101,16 +2072,17 @@ (define_insn "*zero_extendsidi2_aarch64"
> >  
> >  (define_insn "*load_pair_zero_extendsidi2_aarch64"
> >    [(set (match_operand:DI 0 "register_operand")
> > -   (zero_extend:DI (match_operand:SI 1 "aarch64_mem_pair_operand")))
> > +   (zero_extend:DI (unspec:SI [
> > +     (match_operand:V2x4QI 1 "aarch64_mem_pair_lanes_operand")
> > +   ] UNSPEC_LDP_FST)))
> >     (set (match_operand:DI 2 "register_operand")
> > -   (zero_extend:DI (match_operand:SI 3 "memory_operand")))]
> > -  "rtx_equal_p (XEXP (operands[3], 0),
> > -           plus_constant (Pmode,
> > -                          XEXP (operands[1], 0),
> > -                          GET_MODE_SIZE (SImode)))"
> > -  {@ [ cons: =0 , 1   , =2 , 3 ; attrs: type     , arch ]
> > -     [ r        , Ump , r  , m ; load_8          , *    ] ldp\t%w0, %w2, %z1
> > -     [ w        , Ump , w  , m ; neon_load1_2reg , fp   ] ldp\t%s0, %s2, %z1
> > +   (zero_extend:DI (unspec:SI [
> > +     (match_dup 1)
> > +   ] UNSPEC_LDP_SND)))]
> > +  ""
> > +  {@ [ cons: =0 , 1   , =2; attrs: type    , arch]
> > +     [ r   , Umn , r ; load_8         , *   ] ldp\t%w0, %w2, %y1
> > +     [ w   , Umn , w ; neon_load1_2reg, fp  ] ldp\t%s0, %s2, %y1
> >    }
> >  )
> >  
> > diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md
> > index a920de99ffc..fd8dd6db349 100644
> > --- a/gcc/config/aarch64/iterators.md
> > +++ b/gcc/config/aarch64/iterators.md
> > @@ -1435,6 +1435,9 @@ (define_mode_attr VDBL [(V8QI "V16QI") (V4HI "V8HI")
> >                     (SI   "V2SI")  (SF   "V2SF")
> >                     (DI   "V2DI")  (DF   "V2DF")])
> >  
> > +;; Load/store pair mode.
> > +(define_mode_attr VPAIR [(SI "V2x4QI") (DI "V2x8QI")])
> > +
> >  ;; Register suffix for double-length mode.
> >  (define_mode_attr Vdtype [(V4HF "8h") (V2SF "4s")])
> >  
> > diff --git a/gcc/config/aarch64/predicates.md b/gcc/config/aarch64/predicates.md
> > index b647e5af7c6..80f2e03d8de 100644
> > --- a/gcc/config/aarch64/predicates.md
> > +++ b/gcc/config/aarch64/predicates.md
> > @@ -266,10 +266,12 @@ (define_special_predicate "aarch64_mem_pair_operator"
> >        (match_test "known_eq (GET_MODE_SIZE (mode),
> >                          GET_MODE_SIZE (GET_MODE (op)))"))))
> >  
> > -(define_predicate "aarch64_mem_pair_operand"
> > -  (and (match_code "mem")
> > -       (match_test "aarch64_legitimate_address_p (mode, XEXP (op, 0), false,
> > -                                             ADDR_QUERY_LDP_STP)")))
> > +;; Like aarch64_mem_pair_operator, but additionally check the
> > +;; address is suitable.
> > +(define_special_predicate "aarch64_mem_pair_operand"
> > +  (and (match_operand 0 "aarch64_mem_pair_operator")
> > +       (match_test "aarch64_legitimate_address_p (GET_MODE (op), XEXP (op, 0),
> > +                                             false, ADDR_QUERY_LDP_STP)")))
> >  
> >  (define_predicate "pmode_plus_operator"
> >    (and (match_code "plus")
