Alex Coplan <alex.cop...@arm.com> writes:
> This patch overhauls the load/store pair patterns with two main goals:
>
> 1. Fixing a correctness issue (the current patterns are not RA-friendly).
> 2. Allowing more flexibility in which operand modes are supported, and which
>    combinations of modes are allowed in the two arms of the load/store pair,
>    while reducing the number of patterns required both in the source and in
>    the generated code.
>
> The correctness issue (1) is due to the fact that the current patterns have
> two independent memory operands tied together only by a predicate on
> the insns.
> Since LRA only looks at the constraints, one of the memory operands can get
> reloaded without the other one being changed, leading to the insn becoming
> unrecognizable after reload.
>
> We fix this issue by changing the patterns such that they only ever have one
> memory operand representing the entire pair.  For the store case, we use an
> unspec to logically concatenate the register operands before storing them.
> For the load case, we use unspecs to extract the "lanes" from the pair
> mem, with the second occurrence of the mem matched using a match_dup
> (such that there is still really only one memory operand as far as the
> RA is concerned).
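
For illustration, the store and load shapes described above look roughly like
this for an 8-byte operand (a sketch only, not the exact patterns from the
patch; register numbers and the address are placeholders, but the unspec names
match those used in the patch):

```
;; stp: a single pair mem, with the two source registers logically
;; concatenated by an unspec.
(set (mem:V2x8QI (reg:DI base))
     (unspec:V2x8QI [(reg:DI x0) (reg:DI x1)] UNSPEC_STP))

;; ldp: two sets extracting the "lanes" of a single pair mem; in the
;; pattern, the second occurrence of the mem is a match_dup of the first.
(parallel
  [(set (reg:DI x0) (unspec:DI [(mem:V2x8QI (reg:DI base))] UNSPEC_LDP_FST))
   (set (reg:DI x1) (unspec:DI [(mem:V2x8QI (reg:DI base))] UNSPEC_LDP_SND))])
```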
>
> In terms of the modes used for the pair memory operands, we canonicalize
> these to V2x4QImode, V2x8QImode, and V2x16QImode.  These modes have not
> only the correct size but also correct alignment requirement for a
> memory operand representing an entire load/store pair.  Unlike the other
> two, V2x4QImode didn't previously exist, so had to be added with the
> patch.
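
The canonicalization can be sketched as a small Python model (illustrative
only; it mirrors the aarch64_pair_mode_for_mode helper added by the patch,
and the helper names here are mine):

```python
# Model of aarch64_pair_mode_for_mode: the canonical mode used for the
# single mem operand representing a whole load/store pair, keyed by the
# size in bytes of one operand of the pair.
PAIR_MODES = {4: "V2x4QI", 8: "V2x8QI", 16: "V2x16QI"}

def pair_mode_for_size(nbytes):
    """Map the size of one pair operand to the pair mem mode."""
    if nbytes not in PAIR_MODES:
        raise ValueError("no load/store pair mode for this operand size")
    return PAIR_MODES[nbytes]

def pair_mem_size(nbytes):
    """The pair mem covers both operands, i.e. twice the operand size."""
    return 2 * nbytes
```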
>
> As with the previous patch generalizing the writeback patterns, this
> patch aims to be flexible in the combinations of modes supported by the
> patterns without requiring a large number of generated patterns by using
> distinct mode iterators.
>
> The new scheme means we only need a single (generated) pattern for each
> load/store operation of a given operand size.  For the 4-byte and 8-byte
> operand cases, we use the GPI iterator to synthesize the two patterns.
> The 16-byte case is implemented as a separate pattern in the source (due
> to only having a single possible alternative).
>
> Since the UNSPEC patterns can't be interpreted by the dwarf2cfi code,
> we add REG_CFA_OFFSET notes to the store pair insns emitted by
> aarch64_save_callee_saves, so that correct CFI information can still be
> generated.  Furthermore, we now unconditionally generate these CFA
> notes on frame-related insns emitted by aarch64_save_callee_saves,
> in case the load/store pair pass later forms such saves into pairs,
> where the notes would again be needed.
>
> We also adjust the ldp/stp peepholes to generate the new form.  This is
> done by switching the generation to use the
> aarch64_gen_{load,store}_pair interface, making it easier to change the
> form in the future if needed.  (Likewise, the upcoming aarch64
> load/store pair pass also makes use of this interface).
>
> This patch also adds an "ldpstp" attribute to the non-writeback
> load/store pair patterns, which is used by the post-RA load/store pair
> pass to identify existing patterns and see if they can be promoted to
> writeback variants.
>
> One potential concern with using unspecs for the patterns is that it can block
> optimization by the generic RTL passes.  This patch series tries to mitigate
> this in two ways:
>  1. The pre-RA load/store pair pass runs very late in the pre-RA pipeline.
>  2. A later patch in the series adjusts the aarch64 mem{cpy,set} expansion to
>     emit individual loads/stores instead of ldp/stp.  These should then be
>     formed back into load/store pairs much later in the RTL pipeline by the
>     new load/store pair pass.
>
> Bootstrapped/regtested on aarch64-linux-gnu, OK for trunk?
>
> Thanks,
> Alex
>
> gcc/ChangeLog:
>
>       * config/aarch64/aarch64-ldpstp.md: Abstract ldp/stp
>       representation from peepholes, allowing use of new form.
>       * config/aarch64/aarch64-modes.def (V2x4QImode): Define.
>       * config/aarch64/aarch64-protos.h
>       (aarch64_finish_ldpstp_peephole): Declare.
>       (aarch64_swap_ldrstr_operands): Delete declaration.
>       (aarch64_gen_load_pair): Declare.
>       (aarch64_gen_store_pair): Declare.
>       * config/aarch64/aarch64-simd.md (load_pair<DREG:mode><DREG2:mode>):
>       Delete.
>       (vec_store_pair<DREG:mode><DREG2:mode>): Delete.
>       (load_pair<VQ:mode><VQ2:mode>): Delete.
>       (vec_store_pair<VQ:mode><VQ2:mode>): Delete.
>       * config/aarch64/aarch64.cc (aarch64_pair_mode_for_mode): New.
>       (aarch64_gen_store_pair): Adjust to use new unspec form of stp.
>       Drop second mem from parameters.
>       (aarch64_gen_load_pair): Likewise.
>       (aarch64_pair_mem_from_base): New.
>       (aarch64_save_callee_saves): Emit REG_CFA_OFFSET notes for
>       frame-related saves.  Adjust call to aarch64_gen_store_pair.
>       (aarch64_restore_callee_saves): Adjust calls to
>       aarch64_gen_load_pair to account for change in interface.
>       (aarch64_process_components): Likewise.
>       (aarch64_classify_address): Handle 32-byte pair mems in
>       LDP_STP_N case.
>       (aarch64_print_operand): Likewise.
>       (aarch64_copy_one_block_and_progress_pointers): Adjust calls to
>       account for change in aarch64_gen_{load,store}_pair interface.
>       (aarch64_set_one_block_and_progress_pointer): Likewise.
>       (aarch64_finish_ldpstp_peephole): New.
>       (aarch64_gen_adjusted_ldpstp): Adjust to use generation helper.
>       * config/aarch64/aarch64.md (ldpstp): New attribute.
>       (load_pair_sw_<SX:mode><SX2:mode>): Delete.
>       (load_pair_dw_<DX:mode><DX2:mode>): Delete.
>       (load_pair_dw_<TX:mode><TX2:mode>): Delete.
>       (*load_pair_<ldst_sz>): New.
>       (*load_pair_16): New.
>       (store_pair_sw_<SX:mode><SX2:mode>): Delete.
>       (store_pair_dw_<DX:mode><DX2:mode>): Delete.
>       (store_pair_dw_<TX:mode><TX2:mode>): Delete.
>       (*store_pair_<ldst_sz>): New.
>       (*store_pair_16): New.
>       (*load_pair_extendsidi2_aarch64): Adjust to use new form.
>       (*zero_extendsidi2_aarch64): Likewise.
>       * config/aarch64/iterators.md (VPAIR): New.
>       * config/aarch64/predicates.md (aarch64_mem_pair_operand): Change to
>       a special predicate derived from aarch64_mem_pair_operator.
> ---
>  gcc/config/aarch64/aarch64-ldpstp.md |  66 +++----
>  gcc/config/aarch64/aarch64-modes.def |   6 +-
>  gcc/config/aarch64/aarch64-protos.h  |   5 +-
>  gcc/config/aarch64/aarch64-simd.md   |  60 -------
>  gcc/config/aarch64/aarch64.cc        | 257 +++++++++++++++------------
>  gcc/config/aarch64/aarch64.md        | 188 +++++++++-----------
>  gcc/config/aarch64/iterators.md      |   3 +
>  gcc/config/aarch64/predicates.md     |  10 +-
>  8 files changed, 270 insertions(+), 325 deletions(-)
>
> diff --git a/gcc/config/aarch64/aarch64-ldpstp.md b/gcc/config/aarch64/aarch64-ldpstp.md
> index 1ee7c73ff0c..dc39af85254 100644
> --- a/gcc/config/aarch64/aarch64-ldpstp.md
> +++ b/gcc/config/aarch64/aarch64-ldpstp.md
> @@ -24,10 +24,10 @@ (define_peephole2
>     (set (match_operand:GPI 2 "register_operand" "")
>       (match_operand:GPI 3 "memory_operand" ""))]
>    "aarch64_operands_ok_for_ldpstp (operands, true, <MODE>mode)"
> -  [(parallel [(set (match_dup 0) (match_dup 1))
> -           (set (match_dup 2) (match_dup 3))])]
> +  [(const_int 0)]
>  {
> -  aarch64_swap_ldrstr_operands (operands, true);
> +  aarch64_finish_ldpstp_peephole (operands, true);
> +  DONE;
>  })
>  
>  (define_peephole2
> @@ -36,10 +36,10 @@ (define_peephole2
>     (set (match_operand:GPI 2 "memory_operand" "")
>       (match_operand:GPI 3 "aarch64_reg_or_zero" ""))]
>    "aarch64_operands_ok_for_ldpstp (operands, false, <MODE>mode)"
> -  [(parallel [(set (match_dup 0) (match_dup 1))
> -           (set (match_dup 2) (match_dup 3))])]
> +  [(const_int 0)]
>  {
> -  aarch64_swap_ldrstr_operands (operands, false);
> +  aarch64_finish_ldpstp_peephole (operands, false);
> +  DONE;
>  })
>  
>  (define_peephole2
> @@ -48,10 +48,10 @@ (define_peephole2
>     (set (match_operand:GPF 2 "register_operand" "")
>       (match_operand:GPF 3 "memory_operand" ""))]
>    "aarch64_operands_ok_for_ldpstp (operands, true, <MODE>mode)"
> -  [(parallel [(set (match_dup 0) (match_dup 1))
> -           (set (match_dup 2) (match_dup 3))])]
> +  [(const_int 0)]
>  {
> -  aarch64_swap_ldrstr_operands (operands, true);
> +  aarch64_finish_ldpstp_peephole (operands, true);
> +  DONE;
>  })
>  
>  (define_peephole2
> @@ -60,10 +60,10 @@ (define_peephole2
>     (set (match_operand:GPF 2 "memory_operand" "")
>       (match_operand:GPF 3 "aarch64_reg_or_fp_zero" ""))]
>    "aarch64_operands_ok_for_ldpstp (operands, false, <MODE>mode)"
> -  [(parallel [(set (match_dup 0) (match_dup 1))
> -           (set (match_dup 2) (match_dup 3))])]
> +  [(const_int 0)]
>  {
> -  aarch64_swap_ldrstr_operands (operands, false);
> +  aarch64_finish_ldpstp_peephole (operands, false);
> +  DONE;
>  })
>  
>  (define_peephole2
> @@ -72,10 +72,10 @@ (define_peephole2
>     (set (match_operand:DREG2 2 "register_operand" "")
>       (match_operand:DREG2 3 "memory_operand" ""))]
>    "aarch64_operands_ok_for_ldpstp (operands, true, <DREG:MODE>mode)"
> -  [(parallel [(set (match_dup 0) (match_dup 1))
> -           (set (match_dup 2) (match_dup 3))])]
> +  [(const_int 0)]
>  {
> -  aarch64_swap_ldrstr_operands (operands, true);
> +  aarch64_finish_ldpstp_peephole (operands, true);
> +  DONE;
>  })
>  
>  (define_peephole2
> @@ -84,10 +84,10 @@ (define_peephole2
>     (set (match_operand:DREG2 2 "memory_operand" "")
>       (match_operand:DREG2 3 "register_operand" ""))]
>    "aarch64_operands_ok_for_ldpstp (operands, false, <DREG:MODE>mode)"
> -  [(parallel [(set (match_dup 0) (match_dup 1))
> -           (set (match_dup 2) (match_dup 3))])]
> +  [(const_int 0)]
>  {
> -  aarch64_swap_ldrstr_operands (operands, false);
> +  aarch64_finish_ldpstp_peephole (operands, false);
> +  DONE;
>  })
>  
>  (define_peephole2
> @@ -99,10 +99,10 @@ (define_peephole2
>     && aarch64_operands_ok_for_ldpstp (operands, true, <VQ:MODE>mode)
>     && (aarch64_tune_params.extra_tuning_flags
>       & AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS) == 0"
> -  [(parallel [(set (match_dup 0) (match_dup 1))
> -           (set (match_dup 2) (match_dup 3))])]
> +  [(const_int 0)]
>  {
> -  aarch64_swap_ldrstr_operands (operands, true);
> +  aarch64_finish_ldpstp_peephole (operands, true);
> +  DONE;
>  })
>  
>  (define_peephole2
> @@ -114,10 +114,10 @@ (define_peephole2
>     && aarch64_operands_ok_for_ldpstp (operands, false, <VQ:MODE>mode)
>     && (aarch64_tune_params.extra_tuning_flags
>       & AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS) == 0"
> -  [(parallel [(set (match_dup 0) (match_dup 1))
> -           (set (match_dup 2) (match_dup 3))])]
> +  [(const_int 0)]
>  {
> -  aarch64_swap_ldrstr_operands (operands, false);
> +  aarch64_finish_ldpstp_peephole (operands, false);
> +  DONE;
>  })
>  
>  
> @@ -129,10 +129,10 @@ (define_peephole2
>     (set (match_operand:DI 2 "register_operand" "")
>       (sign_extend:DI (match_operand:SI 3 "memory_operand" "")))]
>    "aarch64_operands_ok_for_ldpstp (operands, true, SImode)"
> -  [(parallel [(set (match_dup 0) (sign_extend:DI (match_dup 1)))
> -           (set (match_dup 2) (sign_extend:DI (match_dup 3)))])]
> +  [(const_int 0)]
>  {
> -  aarch64_swap_ldrstr_operands (operands, true);
> +  aarch64_finish_ldpstp_peephole (operands, true, SIGN_EXTEND);
> +  DONE;
>  })
>  
>  (define_peephole2
> @@ -141,10 +141,10 @@ (define_peephole2
>     (set (match_operand:DI 2 "register_operand" "")
>       (zero_extend:DI (match_operand:SI 3 "memory_operand" "")))]
>    "aarch64_operands_ok_for_ldpstp (operands, true, SImode)"
> -  [(parallel [(set (match_dup 0) (zero_extend:DI (match_dup 1)))
> -           (set (match_dup 2) (zero_extend:DI (match_dup 3)))])]
> +  [(const_int 0)]
>  {
> -  aarch64_swap_ldrstr_operands (operands, true);
> +  aarch64_finish_ldpstp_peephole (operands, true, ZERO_EXTEND);
> +  DONE;
>  })
>  
>  ;; Handle storing of a floating point zero with integer data.
> @@ -163,10 +163,10 @@ (define_peephole2
>     (set (match_operand:<FCVT_TARGET> 2 "memory_operand" "")
>       (match_operand:<FCVT_TARGET> 3 "aarch64_reg_zero_or_fp_zero" ""))]
>    "aarch64_operands_ok_for_ldpstp (operands, false, <V_INT_EQUIV>mode)"
> -  [(parallel [(set (match_dup 0) (match_dup 1))
> -           (set (match_dup 2) (match_dup 3))])]
> +  [(const_int 0)]
>  {
> -  aarch64_swap_ldrstr_operands (operands, false);
> +  aarch64_finish_ldpstp_peephole (operands, false);
> +  DONE;
>  })
>  
>  ;; Handle consecutive load/store whose offset is out of the range
> diff --git a/gcc/config/aarch64/aarch64-modes.def b/gcc/config/aarch64/aarch64-modes.def
> index 6b4f4e17dd5..1e0d770f72f 100644
> --- a/gcc/config/aarch64/aarch64-modes.def
> +++ b/gcc/config/aarch64/aarch64-modes.def
> @@ -93,9 +93,13 @@ INT_MODE (XI, 64);
>  
>  /* V8DI mode.  */
>  VECTOR_MODE_WITH_PREFIX (V, INT, DI, 8, 5);
> -
>  ADJUST_ALIGNMENT (V8DI, 8);
>  
> +/* V2x4QImode.  Used in load/store pair patterns.  */
> +VECTOR_MODE_WITH_PREFIX (V2x, INT, QI, 4, 5);
> +ADJUST_NUNITS (V2x4QI, 8);
> +ADJUST_ALIGNMENT (V2x4QI, 4);
> +
>  /* Define Advanced SIMD modes for structures of 2, 3 and 4 d-registers.  */
>  #define ADV_SIMD_D_REG_STRUCT_MODES(NVECS, VB, VH, VS, VD) \
>    VECTOR_MODES_WITH_PREFIX (V##NVECS##x, INT, 8, 3); \
> diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
> index e463fd5c817..2ab54f244a7 100644
> --- a/gcc/config/aarch64/aarch64-protos.h
> +++ b/gcc/config/aarch64/aarch64-protos.h
> @@ -967,6 +967,8 @@ void aarch64_split_compare_and_swap (rtx op[]);
>  void aarch64_split_atomic_op (enum rtx_code, rtx, rtx, rtx, rtx, rtx, rtx);
>  
>  bool aarch64_gen_adjusted_ldpstp (rtx *, bool, machine_mode, RTX_CODE);
> +void aarch64_finish_ldpstp_peephole (rtx *, bool,
> +                                  enum rtx_code = (enum rtx_code)0);
>  
>  void aarch64_expand_sve_vec_cmp_int (rtx, rtx_code, rtx, rtx);
>  bool aarch64_expand_sve_vec_cmp_float (rtx, rtx_code, rtx, rtx, bool);
> @@ -1022,8 +1024,9 @@ bool aarch64_mergeable_load_pair_p (machine_mode, rtx, rtx);
>  bool aarch64_operands_ok_for_ldpstp (rtx *, bool, machine_mode);
>  bool aarch64_operands_adjust_ok_for_ldpstp (rtx *, bool, machine_mode);
>  bool aarch64_mem_ok_with_ldpstp_policy_model (rtx, bool, machine_mode);
> -void aarch64_swap_ldrstr_operands (rtx *, bool);
>  bool aarch64_ldpstp_operand_mode_p (machine_mode);
> +rtx aarch64_gen_load_pair (rtx, rtx, rtx, enum rtx_code = (enum rtx_code)0);
> +rtx aarch64_gen_store_pair (rtx, rtx, rtx);
>  
>  extern void aarch64_asm_output_pool_epilogue (FILE *, const char *,
>                                             tree, HOST_WIDE_INT);
> diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
> index c6f2d582837..6f5080ab030 100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -231,38 +231,6 @@ (define_insn "aarch64_store_lane0<mode>"
>    [(set_attr "type" "neon_store1_1reg<q>")]
>  )
>  
> -(define_insn "load_pair<DREG:mode><DREG2:mode>"
> -  [(set (match_operand:DREG 0 "register_operand")
> -     (match_operand:DREG 1 "aarch64_mem_pair_operand"))
> -   (set (match_operand:DREG2 2 "register_operand")
> -     (match_operand:DREG2 3 "memory_operand"))]
> -  "TARGET_FLOAT
> -   && rtx_equal_p (XEXP (operands[3], 0),
> -                plus_constant (Pmode,
> -                               XEXP (operands[1], 0),
> -                               GET_MODE_SIZE (<DREG:MODE>mode)))"
> -  {@ [ cons: =0 , 1   , =2 , 3 ; attrs: type ]
> -     [ w        , Ump , w  , m ; neon_ldp    ] ldp\t%d0, %d2, %z1
> -     [ r        , Ump , r  , m ; load_16     ] ldp\t%x0, %x2, %z1
> -  }
> -)
> -
> -(define_insn "vec_store_pair<DREG:mode><DREG2:mode>"
> -  [(set (match_operand:DREG 0 "aarch64_mem_pair_operand")
> -     (match_operand:DREG 1 "register_operand"))
> -   (set (match_operand:DREG2 2 "memory_operand")
> -     (match_operand:DREG2 3 "register_operand"))]
> -  "TARGET_FLOAT
> -   && rtx_equal_p (XEXP (operands[2], 0),
> -                plus_constant (Pmode,
> -                               XEXP (operands[0], 0),
> -                               GET_MODE_SIZE (<DREG:MODE>mode)))"
> -  {@ [ cons: =0 , 1 , =2 , 3 ; attrs: type ]
> -     [ Ump      , w , m  , w ; neon_stp    ] stp\t%d1, %d3, %z0
> -     [ Ump      , r , m  , r ; store_16    ] stp\t%x1, %x3, %z0
> -  }
> -)
> -
>  (define_insn "aarch64_simd_stp<mode>"
>    [(set (match_operand:VP_2E 0 "aarch64_mem_pair_lanes_operand")
>       (vec_duplicate:VP_2E (match_operand:<VEL> 1 "register_operand")))]
> @@ -273,34 +241,6 @@ (define_insn "aarch64_simd_stp<mode>"
>    }
>  )
>  
> -(define_insn "load_pair<VQ:mode><VQ2:mode>"
> -  [(set (match_operand:VQ 0 "register_operand" "=w")
> -     (match_operand:VQ 1 "aarch64_mem_pair_operand" "Ump"))
> -   (set (match_operand:VQ2 2 "register_operand" "=w")
> -     (match_operand:VQ2 3 "memory_operand" "m"))]
> -  "TARGET_FLOAT
> -    && rtx_equal_p (XEXP (operands[3], 0),
> -                 plus_constant (Pmode,
> -                            XEXP (operands[1], 0),
> -                            GET_MODE_SIZE (<VQ:MODE>mode)))"
> -  "ldp\\t%q0, %q2, %z1"
> -  [(set_attr "type" "neon_ldp_q")]
> -)
> -
> -(define_insn "vec_store_pair<VQ:mode><VQ2:mode>"
> -  [(set (match_operand:VQ 0 "aarch64_mem_pair_operand" "=Ump")
> -     (match_operand:VQ 1 "register_operand" "w"))
> -   (set (match_operand:VQ2 2 "memory_operand" "=m")
> -     (match_operand:VQ2 3 "register_operand" "w"))]
> -  "TARGET_FLOAT
> -   && rtx_equal_p (XEXP (operands[2], 0),
> -                plus_constant (Pmode,
> -                               XEXP (operands[0], 0),
> -                               GET_MODE_SIZE (<VQ:MODE>mode)))"
> -  "stp\\t%q1, %q3, %z0"
> -  [(set_attr "type" "neon_stp_q")]
> -)
> -
>  (define_expand "@aarch64_split_simd_mov<mode>"
>    [(set (match_operand:VQMOV 0)
>       (match_operand:VQMOV 1))]
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index ccf081d2a16..1f6094bf1bc 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -9056,59 +9056,81 @@ aarch64_pop_regs (unsigned regno1, unsigned regno2, HOST_WIDE_INT adjustment,
>      }
>  }
>  
> -/* Generate and return a store pair instruction of mode MODE to store
> -   register REG1 to MEM1 and register REG2 to MEM2.  */
> +static machine_mode
> +aarch64_pair_mode_for_mode (machine_mode mode)
> +{
> +  if (known_eq (GET_MODE_SIZE (mode), 4))
> +    return E_V2x4QImode;
> +  else if (known_eq (GET_MODE_SIZE (mode), 8))
> +    return E_V2x8QImode;
> +  else if (known_eq (GET_MODE_SIZE (mode), 16))
> +    return E_V2x16QImode;
> +  else
> +    gcc_unreachable ();
> +}

Missing function comment.  There should be no need to use E_ outside switches.
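
E.g., something like this (untested, purely illustrative; the comment wording
is just an example):

```
/* Return the mode to be used for the single mem operand representing
   a load/store pair whose individual operands have mode MODE.  */
static machine_mode
aarch64_pair_mode_for_mode (machine_mode mode)
{
  if (known_eq (GET_MODE_SIZE (mode), 4))
    return V2x4QImode;
  else if (known_eq (GET_MODE_SIZE (mode), 8))
    return V2x8QImode;
  else if (known_eq (GET_MODE_SIZE (mode), 16))
    return V2x16QImode;
  else
    gcc_unreachable ();
}
```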

>  
>  static rtx
> -aarch64_gen_store_pair (machine_mode mode, rtx mem1, rtx reg1, rtx mem2,
> -                     rtx reg2)
> +aarch64_pair_mem_from_base (rtx mem)
>  {
> -  switch (mode)
> -    {
> -    case E_DImode:
> -      return gen_store_pair_dw_didi (mem1, reg1, mem2, reg2);
> -
> -    case E_DFmode:
> -      return gen_store_pair_dw_dfdf (mem1, reg1, mem2, reg2);
> -
> -    case E_TFmode:
> -      return gen_store_pair_dw_tftf (mem1, reg1, mem2, reg2);
> +  auto pair_mode = aarch64_pair_mode_for_mode (GET_MODE (mem));
> +  mem = adjust_bitfield_address_nv (mem, pair_mode, 0);
> +  gcc_assert (aarch64_mem_pair_lanes_operand (mem, pair_mode));
> +  return mem;
> +}
>  
> -    case E_V4SImode:
> -      return gen_vec_store_pairv4siv4si (mem1, reg1, mem2, reg2);
> +/* Generate and return a store pair instruction to store REG1 and REG2
> +   into memory starting at BASE_MEM.  All three rtxes should have modes of
> +   the same size.  */
>  
> -    case E_V16QImode:
> -      return gen_vec_store_pairv16qiv16qi (mem1, reg1, mem2, reg2);
> +rtx
> +aarch64_gen_store_pair (rtx base_mem, rtx reg1, rtx reg2)
> +{
> +  rtx pair_mem = aarch64_pair_mem_from_base (base_mem);
>  
> -    default:
> -      gcc_unreachable ();
> -    }
> +  return gen_rtx_SET (pair_mem,
> +                   gen_rtx_UNSPEC (GET_MODE (pair_mem),
> +                                   gen_rtvec (2, reg1, reg2),
> +                                   UNSPEC_STP));
>  }
>  
> -/* Generate and regurn a load pair isntruction of mode MODE to load register
> -   REG1 from MEM1 and register REG2 from MEM2.  */
> +/* Generate and return a load pair instruction to load a pair of
> +   registers starting at BASE_MEM into REG1 and REG2.  If CODE is
> +   UNKNOWN, all three rtxes should have modes of the same size.
> +   Otherwise, CODE is {SIGN,ZERO}_EXTEND, base_mem should be in SImode,
> +   and REG{1,2} should be in DImode.  */
>  
> -static rtx
> -aarch64_gen_load_pair (machine_mode mode, rtx reg1, rtx mem1, rtx reg2,
> -                    rtx mem2)
> +rtx
> +aarch64_gen_load_pair (rtx reg1, rtx reg2, rtx base_mem, enum rtx_code code)
>  {
> -  switch (mode)
> -    {
> -    case E_DImode:
> -      return gen_load_pair_dw_didi (reg1, mem1, reg2, mem2);
> +  rtx pair_mem = aarch64_pair_mem_from_base (base_mem);
>  
> -    case E_DFmode:
> -      return gen_load_pair_dw_dfdf (reg1, mem1, reg2, mem2);
> -
> -    case E_TFmode:
> -      return gen_load_pair_dw_tftf (reg1, mem1, reg2, mem2);
> +  const bool any_extend_p = (code == ZERO_EXTEND || code == SIGN_EXTEND);
> +  if (any_extend_p)
> +    {
> +      gcc_checking_assert (GET_MODE (base_mem) == SImode);
> +      gcc_checking_assert (GET_MODE (reg1) == DImode);
> +      gcc_checking_assert (GET_MODE (reg2) == DImode);

Not a strong preference, but I think single asserts with && are
preferred.

> +    }
> +  else
> +    gcc_assert (code == UNKNOWN);
> +
> +  rtx unspecs[2] = {
> +    gen_rtx_UNSPEC (any_extend_p ? SImode : GET_MODE (reg1),
> +                 gen_rtvec (1, pair_mem),
> +                 UNSPEC_LDP_FST),
> +    gen_rtx_UNSPEC (any_extend_p ? SImode : GET_MODE (reg2),

IIUC, the unspec modes could both be GET_MODE (base_mem)

> +                 gen_rtvec (1, copy_rtx (pair_mem)),
> +                 UNSPEC_LDP_SND)
> +  };
>  
> -    case E_V4SImode:
> -      return gen_load_pairv4siv4si (reg1, mem1, reg2, mem2);
> +  if (any_extend_p)
> +    for (int i = 0; i < 2; i++)
> +      unspecs[i] = gen_rtx_fmt_e (code, DImode, unspecs[i]);
>  
> -    default:
> -      gcc_unreachable ();
> -    }
> +  return gen_rtx_PARALLEL (VOIDmode,
> +                        gen_rtvec (2,
> +                                   gen_rtx_SET (reg1, unspecs[0]),
> +                                   gen_rtx_SET (reg2, unspecs[1])));
>  }
>  
>  /* Return TRUE if return address signing should be enabled for the current
> @@ -9321,8 +9343,19 @@ aarch64_save_callee_saves (poly_int64 bytes_below_sp,
>         offset -= fp_offset;
>       }
>       rtx mem = gen_frame_mem (mode, plus_constant (Pmode, base_rtx, offset));
> -      bool need_cfa_note_p = (base_rtx != stack_pointer_rtx);
>  
> +      rtx cfa_base = stack_pointer_rtx;
> +      poly_int64 cfa_offset = sp_offset;

I don't think we need both cfa_offset and sp_offset.  sp_offset in the
current code only exists for CFI purposes.

> +
> +      if (hard_fp_valid_p && frame_pointer_needed)
> +     {
> +       cfa_base = hard_frame_pointer_rtx;
> +       cfa_offset += (bytes_below_sp - frame.bytes_below_hard_fp);
> +     }
> +
> +      rtx cfa_mem = gen_frame_mem (mode,
> +                                plus_constant (Pmode,
> +                                               cfa_base, cfa_offset));
>        unsigned int regno2;
>        if (!aarch64_sve_mode_p (mode)
>         && i + 1 < regs.size ()
> @@ -9331,45 +9364,37 @@ aarch64_save_callee_saves (poly_int64 bytes_below_sp,
>                      frame.reg_offset[regno2] - frame.reg_offset[regno]))
>       {
>         rtx reg2 = gen_rtx_REG (mode, regno2);
> -       rtx mem2;
>  
>         offset += GET_MODE_SIZE (mode);
> -       mem2 = gen_frame_mem (mode, plus_constant (Pmode, base_rtx, offset));
> -       insn = emit_insn (aarch64_gen_store_pair (mode, mem, reg, mem2,
> -                                                 reg2));
> -
> -       /* The first part of a frame-related parallel insn is
> -          always assumed to be relevant to the frame
> -          calculations; subsequent parts, are only
> -          frame-related if explicitly marked.  */
> +       insn = emit_insn (aarch64_gen_store_pair (mem, reg, reg2));
> +
>         if (aarch64_emit_cfi_for_reg_p (regno2))
>           {
> -           if (need_cfa_note_p)
> -             aarch64_add_cfa_expression (insn, reg2, stack_pointer_rtx,
> -                                         sp_offset + GET_MODE_SIZE (mode));
> -           else
> -             RTX_FRAME_RELATED_P (XVECEXP (PATTERN (insn), 0, 1)) = 1;
> +           rtx cfa_mem2 = adjust_address_nv (cfa_mem,
> +                                             Pmode,
> +                                             GET_MODE_SIZE (mode));

I think this should use gen_frame_mem directly, rather than moving beyond
the bounds of the original mem.

> +           add_reg_note (insn, REG_CFA_OFFSET,
> +                         gen_rtx_SET (cfa_mem2, reg2));
>           }
>  
>         regno = regno2;
>         ++i;
>       }
>        else if (mode == VNx2DImode && BYTES_BIG_ENDIAN)
> -     {
> -       insn = emit_insn (gen_aarch64_pred_mov (mode, mem, ptrue, reg));
> -       need_cfa_note_p = true;
> -     }
> +     insn = emit_insn (gen_aarch64_pred_mov (mode, mem, ptrue, reg));
>        else if (aarch64_sve_mode_p (mode))
>       insn = emit_insn (gen_rtx_SET (mem, reg));
>        else
>       insn = emit_move_insn (mem, reg);
>  
>        RTX_FRAME_RELATED_P (insn) = frame_related_p;
> -      if (frame_related_p && need_cfa_note_p)
> -     aarch64_add_cfa_expression (insn, reg, stack_pointer_rtx, sp_offset);
> +
> +      if (frame_related_p)
> +     add_reg_note (insn, REG_CFA_OFFSET, gen_rtx_SET (cfa_mem, reg));

For the record, I might need to add back some CFA_EXPRESSIONs for
locally-streaming SME functions, to ensure that the CFI code doesn't
aggregate SVE saves across a change in the VG DWARF register.
But it's probably easier to do that once the patch is in,
since having a note on all insns will help to ensure consistency.

>      }
>  }
>  
> +

Stray extra whitespace.

>  /* Emit code to restore the callee registers in REGS, ignoring pop candidates
>     and any other registers that are handled separately.  Write the appropriate
>     REG_CFA_RESTORE notes into CFI_OPS.
> @@ -9425,12 +9450,7 @@ aarch64_restore_callee_saves (poly_int64 bytes_below_sp,
>                      frame.reg_offset[regno2] - frame.reg_offset[regno]))
>       {
>         rtx reg2 = gen_rtx_REG (mode, regno2);
> -       rtx mem2;
> -
> -       offset += GET_MODE_SIZE (mode);
> -       mem2 = gen_frame_mem (mode, plus_constant (Pmode, base_rtx, offset));
> -       emit_insn (aarch64_gen_load_pair (mode, reg, mem, reg2, mem2));
> -
> +       emit_insn (aarch64_gen_load_pair (reg, reg2, mem));
>         *cfi_ops = alloc_reg_note (REG_CFA_RESTORE, reg2, *cfi_ops);
>         regno = regno2;
>         ++i;
> @@ -9762,9 +9782,9 @@ aarch64_process_components (sbitmap components, bool prologue_p)
>                            : gen_rtx_SET (reg2, mem2);
>  
>        if (prologue_p)
> -     insn = emit_insn (aarch64_gen_store_pair (mode, mem, reg, mem2, reg2));
> +     insn = emit_insn (aarch64_gen_store_pair (mem, reg, reg2));
>        else
> -     insn = emit_insn (aarch64_gen_load_pair (mode, reg, mem, reg2, mem2));
> +     insn = emit_insn (aarch64_gen_load_pair (reg, reg2, mem));
>  
>        if (frame_related_p || frame_related2_p)
>       {
> @@ -10983,12 +11003,18 @@ aarch64_classify_address (struct aarch64_address_info *info,
>       mode of the corresponding addressing mode is half of that.  */
>    if (type == ADDR_QUERY_LDP_STP_N)
>      {
> -      if (known_eq (GET_MODE_SIZE (mode), 16))
> +      if (known_eq (GET_MODE_SIZE (mode), 32))
> +     mode = V16QImode;
> +      else if (known_eq (GET_MODE_SIZE (mode), 16))
>       mode = DFmode;
>        else if (known_eq (GET_MODE_SIZE (mode), 8))
>       mode = SFmode;
>        else
>       return false;
> +
> +      /* This isn't really an Advanced SIMD struct mode, but a mode
> +      used to represent the complete mem in a load/store pair.  */
> +      advsimd_struct_p = false;
>      }
>  
>    bool allow_reg_index_p = (!load_store_pair_p
> @@ -12609,7 +12635,8 @@ aarch64_print_operand (FILE *f, rtx x, int code)
>       if (!MEM_P (x)
>           || (code == 'y'
>               && maybe_ne (GET_MODE_SIZE (mode), 8)
> -             && maybe_ne (GET_MODE_SIZE (mode), 16)))
> +             && maybe_ne (GET_MODE_SIZE (mode), 16)
> +             && maybe_ne (GET_MODE_SIZE (mode), 32)))
>         {
>           output_operand_lossage ("invalid operand for '%%%c'", code);
>           return;
> @@ -25431,10 +25458,8 @@ aarch64_copy_one_block_and_progress_pointers (rtx *src, rtx *dst,
>        *src = adjust_address (*src, mode, 0);
>        *dst = adjust_address (*dst, mode, 0);
>        /* Emit the memcpy.  */
> -      emit_insn (aarch64_gen_load_pair (mode, reg1, *src, reg2,
> -                                     aarch64_progress_pointer (*src)));
> -      emit_insn (aarch64_gen_store_pair (mode, *dst, reg1,
> -                                      aarch64_progress_pointer (*dst), reg2));
> +      emit_insn (aarch64_gen_load_pair (reg1, reg2, *src));
> +      emit_insn (aarch64_gen_store_pair (*dst, reg1, reg2));
>        /* Move the pointers forward.  */
>        *src = aarch64_move_pointer (*src, 32);
>        *dst = aarch64_move_pointer (*dst, 32);
> @@ -25613,8 +25638,7 @@ aarch64_set_one_block_and_progress_pointer (rtx src, rtx *dst,
>        /* "Cast" the *dst to the correct mode.  */
>        *dst = adjust_address (*dst, mode, 0);
>        /* Emit the memset.  */
> -      emit_insn (aarch64_gen_store_pair (mode, *dst, src,
> -                                      aarch64_progress_pointer (*dst), src));
> +      emit_insn (aarch64_gen_store_pair (*dst, src, src));
>  
>        /* Move the pointers forward.  */
>        *dst = aarch64_move_pointer (*dst, 32);
> @@ -26812,6 +26836,22 @@ aarch64_swap_ldrstr_operands (rtx* operands, bool load)
>      }
>  }
>  
> +void
> +aarch64_finish_ldpstp_peephole (rtx *operands, bool load_p, enum rtx_code code)

Missing function comment.

> +{
> +  aarch64_swap_ldrstr_operands (operands, load_p);
> +
> +  if (load_p)
> +    emit_insn (aarch64_gen_load_pair (operands[0], operands[2],
> +                                   operands[1], code));
> +  else
> +    {
> +      gcc_assert (code == UNKNOWN);
> +      emit_insn (aarch64_gen_store_pair (operands[0], operands[1],
> +                                      operands[3]));
> +    }
> +}
> +
>  /* Taking X and Y to be HOST_WIDE_INT pointers, return the result of a
>     comparison between the two.  */
>  int
> @@ -26993,8 +27033,8 @@ bool
>  aarch64_gen_adjusted_ldpstp (rtx *operands, bool load,
>                            machine_mode mode, RTX_CODE code)
>  {
> -  rtx base, offset_1, offset_3, t1, t2;
> -  rtx mem_1, mem_2, mem_3, mem_4;
> +  rtx base, offset_1, offset_3;
> +  rtx mem_1, mem_2;
>    rtx temp_operands[8];
>    HOST_WIDE_INT off_val_1, off_val_3, base_off, new_off_1, new_off_3,
>               stp_off_upper_limit, stp_off_lower_limit, msize;
> @@ -27019,21 +27059,17 @@ aarch64_gen_adjusted_ldpstp (rtx *operands, bool load,
>    if (load)
>      {
>        mem_1 = copy_rtx (temp_operands[1]);
> -      mem_2 = copy_rtx (temp_operands[3]);
> -      mem_3 = copy_rtx (temp_operands[5]);
> -      mem_4 = copy_rtx (temp_operands[7]);
> +      mem_2 = copy_rtx (temp_operands[5]);
>      }
>    else
>      {
>        mem_1 = copy_rtx (temp_operands[0]);
> -      mem_2 = copy_rtx (temp_operands[2]);
> -      mem_3 = copy_rtx (temp_operands[4]);
> -      mem_4 = copy_rtx (temp_operands[6]);
> +      mem_2 = copy_rtx (temp_operands[4]);
>        gcc_assert (code == UNKNOWN);
>      }
>  
>    extract_base_offset_in_addr (mem_1, &base, &offset_1);
> -  extract_base_offset_in_addr (mem_3, &base, &offset_3);
> +  extract_base_offset_in_addr (mem_2, &base, &offset_3);

mem_2 with offset_3 feels a bit awkward.  Might be worth using mem_3 instead,
so that the memory and register numbers are in sync.
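I.e. purely a renaming, something like (illustrative only, untested):

```
rtx mem_1, mem_3;
...
extract_base_offset_in_addr (mem_1, &base, &offset_1);
extract_base_offset_in_addr (mem_3, &base, &offset_3);
```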

I suppose we still need Ump for the extending loads, is that right?
Are there any other uses left?

Thanks,
Richard

>    gcc_assert (base != NULL_RTX && offset_1 != NULL_RTX
>             && offset_3 != NULL_RTX);
>  
> @@ -27097,63 +27133,48 @@ aarch64_gen_adjusted_ldpstp (rtx *operands, bool load,
>    replace_equiv_address_nv (mem_1, plus_constant (Pmode, operands[8],
>                                                 new_off_1), true);
>    replace_equiv_address_nv (mem_2, plus_constant (Pmode, operands[8],
> -                                               new_off_1 + msize), true);
> -  replace_equiv_address_nv (mem_3, plus_constant (Pmode, operands[8],
>                                                 new_off_3), true);
> -  replace_equiv_address_nv (mem_4, plus_constant (Pmode, operands[8],
> -                                               new_off_3 + msize), true);
>  
>    if (!aarch64_mem_pair_operand (mem_1, mode)
> -      || !aarch64_mem_pair_operand (mem_3, mode))
> +      || !aarch64_mem_pair_operand (mem_2, mode))
>      return false;
>  
> -  if (code == ZERO_EXTEND)
> -    {
> -      mem_1 = gen_rtx_ZERO_EXTEND (DImode, mem_1);
> -      mem_2 = gen_rtx_ZERO_EXTEND (DImode, mem_2);
> -      mem_3 = gen_rtx_ZERO_EXTEND (DImode, mem_3);
> -      mem_4 = gen_rtx_ZERO_EXTEND (DImode, mem_4);
> -    }
> -  else if (code == SIGN_EXTEND)
> -    {
> -      mem_1 = gen_rtx_SIGN_EXTEND (DImode, mem_1);
> -      mem_2 = gen_rtx_SIGN_EXTEND (DImode, mem_2);
> -      mem_3 = gen_rtx_SIGN_EXTEND (DImode, mem_3);
> -      mem_4 = gen_rtx_SIGN_EXTEND (DImode, mem_4);
> -    }
> -
>    if (load)
>      {
>        operands[0] = temp_operands[0];
>        operands[1] = mem_1;
>        operands[2] = temp_operands[2];
> -      operands[3] = mem_2;
>        operands[4] = temp_operands[4];
> -      operands[5] = mem_3;
> +      operands[5] = mem_2;
>        operands[6] = temp_operands[6];
> -      operands[7] = mem_4;
>      }
>    else
>      {
>        operands[0] = mem_1;
>        operands[1] = temp_operands[1];
> -      operands[2] = mem_2;
>        operands[3] = temp_operands[3];
> -      operands[4] = mem_3;
> +      operands[4] = mem_2;
>        operands[5] = temp_operands[5];
> -      operands[6] = mem_4;
>        operands[7] = temp_operands[7];
>      }
>  
>    /* Emit adjusting instruction.  */
>    emit_insn (gen_rtx_SET (operands[8], plus_constant (DImode, base, 
> base_off)));
>    /* Emit ldp/stp instructions.  */
> -  t1 = gen_rtx_SET (operands[0], operands[1]);
> -  t2 = gen_rtx_SET (operands[2], operands[3]);
> -  emit_insn (gen_rtx_PARALLEL (VOIDmode, gen_rtvec (2, t1, t2)));
> -  t1 = gen_rtx_SET (operands[4], operands[5]);
> -  t2 = gen_rtx_SET (operands[6], operands[7]);
> -  emit_insn (gen_rtx_PARALLEL (VOIDmode, gen_rtvec (2, t1, t2)));
> +  if (load)
> +    {
> +      emit_insn (aarch64_gen_load_pair (operands[0], operands[2],
> +                                     operands[1], code));
> +      emit_insn (aarch64_gen_load_pair (operands[4], operands[6],
> +                                     operands[5], code));
> +    }
> +  else
> +    {
> +      emit_insn (aarch64_gen_store_pair (operands[0], operands[1],
> +                                      operands[3]));
> +      emit_insn (aarch64_gen_store_pair (operands[4], operands[5],
> +                                      operands[7]));
> +    }
>    return true;
>  }
>  
> diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
> index c92a51690c5..ffb6b0ba749 100644
> --- a/gcc/config/aarch64/aarch64.md
> +++ b/gcc/config/aarch64/aarch64.md
> @@ -175,6 +175,9 @@ (define_c_enum "unspec" [
>      UNSPEC_GOTSMALLTLS
>      UNSPEC_GOTTINYPIC
>      UNSPEC_GOTTINYTLS
> +    UNSPEC_STP
> +    UNSPEC_LDP_FST
> +    UNSPEC_LDP_SND
>      UNSPEC_LD1
>      UNSPEC_LD2
>      UNSPEC_LD2_DREG
> @@ -453,6 +456,11 @@ (define_attr "predicated" "yes,no" (const_string "no"))
>  ;; may chose to hold the tracking state encoded in SP.
>  (define_attr "speculation_barrier" "true,false" (const_string "false"))
>  
> +;; Attribute used to identify load pair and store pair instructions.
> +;; Currently the attribute is only applied to the non-writeback ldp/stp
> +;; patterns.
> +(define_attr "ldpstp" "ldp,stp,none" (const_string "none"))
> +
>  ;; -------------------------------------------------------------------
>  ;; Pipeline descriptions and scheduling
>  ;; -------------------------------------------------------------------
> @@ -1735,100 +1743,62 @@ (define_expand "setmemdi"
>    FAIL;
>  })
>  
> -;; Operands 1 and 3 are tied together by the final condition; so we allow
> -;; fairly lax checking on the second memory operation.
> -(define_insn "load_pair_sw_<SX:mode><SX2:mode>"
> -  [(set (match_operand:SX 0 "register_operand")
> -     (match_operand:SX 1 "aarch64_mem_pair_operand"))
> -   (set (match_operand:SX2 2 "register_operand")
> -     (match_operand:SX2 3 "memory_operand"))]
> -   "rtx_equal_p (XEXP (operands[3], 0),
> -              plus_constant (Pmode,
> -                             XEXP (operands[1], 0),
> -                             GET_MODE_SIZE (<SX:MODE>mode)))"
> -  {@ [ cons: =0 , 1   , =2 , 3 ; attrs: type     , arch ]
> -     [ r        , Ump , r  , m ; load_8          , *    ] ldp\t%w0, %w2, %z1
> -     [ w        , Ump , w  , m ; neon_load1_2reg , fp   ] ldp\t%s0, %s2, %z1
> -  }
> -)
> -
> -;; Storing different modes that can still be merged
> -(define_insn "load_pair_dw_<DX:mode><DX2:mode>"
> -  [(set (match_operand:DX 0 "register_operand")
> -     (match_operand:DX 1 "aarch64_mem_pair_operand"))
> -   (set (match_operand:DX2 2 "register_operand")
> -     (match_operand:DX2 3 "memory_operand"))]
> -   "rtx_equal_p (XEXP (operands[3], 0),
> -              plus_constant (Pmode,
> -                             XEXP (operands[1], 0),
> -                             GET_MODE_SIZE (<DX:MODE>mode)))"
> -  {@ [ cons: =0 , 1   , =2 , 3 ; attrs: type     , arch ]
> -     [ r        , Ump , r  , m ; load_16         , *    ] ldp\t%x0, %x2, %z1
> -     [ w        , Ump , w  , m ; neon_load1_2reg , fp   ] ldp\t%d0, %d2, %z1
> -  }
> -)
> -
> -(define_insn "load_pair_dw_<TX:mode><TX2:mode>"
> -  [(set (match_operand:TX 0 "register_operand" "=w")
> -     (match_operand:TX 1 "aarch64_mem_pair_operand" "Ump"))
> -   (set (match_operand:TX2 2 "register_operand" "=w")
> -     (match_operand:TX2 3 "memory_operand" "m"))]
> -   "TARGET_SIMD
> -    && rtx_equal_p (XEXP (operands[3], 0),
> -                 plus_constant (Pmode,
> -                                XEXP (operands[1], 0),
> -                                GET_MODE_SIZE (<TX:MODE>mode)))"
> -  "ldp\\t%q0, %q2, %z1"
> +(define_insn "*load_pair_<ldst_sz>"
> +  [(set (match_operand:GPI 0 "aarch64_ldp_reg_operand")
> +     (unspec [
> +       (match_operand:<VPAIR> 1 "aarch64_mem_pair_lanes_operand")
> +     ] UNSPEC_LDP_FST))
> +   (set (match_operand:GPI 2 "aarch64_ldp_reg_operand")
> +     (unspec [
> +       (match_dup 1)
> +     ] UNSPEC_LDP_SND))]
> +  ""
> +  {@ [cons: =0, 1,   =2; attrs: type,           arch]
> +     [            r, Umn,  r; load_<ldpstp_sz>, *   ] ldp\t%<w>0, %<w>2, %y1
> +     [            w, Umn,  w; neon_load1_2reg,  fp  ] ldp\t%<v>0, %<v>2, %y1
> +  }
> +  [(set_attr "ldpstp" "ldp")]
> +)
> +
> +(define_insn "*load_pair_16"
> +  [(set (match_operand:TI 0 "aarch64_ldp_reg_operand" "=w")
> +     (unspec [
> +       (match_operand:V2x16QI 1 "aarch64_mem_pair_lanes_operand" "Umn")
> +     ] UNSPEC_LDP_FST))
> +   (set (match_operand:TI 2 "aarch64_ldp_reg_operand" "=w")
> +     (unspec [
> +       (match_dup 1)
> +     ] UNSPEC_LDP_SND))]
> +  "TARGET_FLOAT"
> +  "ldp\\t%q0, %q2, %y1"
>    [(set_attr "type" "neon_ldp_q")
> -   (set_attr "fp" "yes")]
> -)
> -
> -;; Operands 0 and 2 are tied together by the final condition; so we allow
> -;; fairly lax checking on the second memory operation.
> -(define_insn "store_pair_sw_<SX:mode><SX2:mode>"
> -  [(set (match_operand:SX 0 "aarch64_mem_pair_operand")
> -     (match_operand:SX 1 "aarch64_reg_zero_or_fp_zero"))
> -   (set (match_operand:SX2 2 "memory_operand")
> -     (match_operand:SX2 3 "aarch64_reg_zero_or_fp_zero"))]
> -   "rtx_equal_p (XEXP (operands[2], 0),
> -              plus_constant (Pmode,
> -                             XEXP (operands[0], 0),
> -                             GET_MODE_SIZE (<SX:MODE>mode)))"
> -  {@ [ cons: =0 , 1   , =2 , 3   ; attrs: type      , arch ]
> -     [ Ump      , rYZ , m  , rYZ ; store_8          , *    ] stp\t%w1, %w3, %z0
> -     [ Ump      , w   , m  , w   ; neon_store1_2reg , fp   ] stp\t%s1, %s3, %z0
> -  }
> -)
> -
> -;; Storing different modes that can still be merged
> -(define_insn "store_pair_dw_<DX:mode><DX2:mode>"
> -  [(set (match_operand:DX 0 "aarch64_mem_pair_operand")
> -     (match_operand:DX 1 "aarch64_reg_zero_or_fp_zero"))
> -   (set (match_operand:DX2 2 "memory_operand")
> -     (match_operand:DX2 3 "aarch64_reg_zero_or_fp_zero"))]
> -   "rtx_equal_p (XEXP (operands[2], 0),
> -              plus_constant (Pmode,
> -                             XEXP (operands[0], 0),
> -                             GET_MODE_SIZE (<DX:MODE>mode)))"
> -  {@ [ cons: =0 , 1   , =2 , 3   ; attrs: type      , arch ]
> -     [ Ump      , rYZ , m  , rYZ ; store_16         , *    ] stp\t%x1, %x3, %z0
> -     [ Ump      , w   , m  , w   ; neon_store1_2reg , fp   ] stp\t%d1, %d3, %z0
> -  }
> -)
> -
> -(define_insn "store_pair_dw_<TX:mode><TX2:mode>"
> -  [(set (match_operand:TX 0 "aarch64_mem_pair_operand" "=Ump")
> -     (match_operand:TX 1 "register_operand" "w"))
> -   (set (match_operand:TX2 2 "memory_operand" "=m")
> -     (match_operand:TX2 3 "register_operand" "w"))]
> -   "TARGET_SIMD &&
> -    rtx_equal_p (XEXP (operands[2], 0),
> -              plus_constant (Pmode,
> -                             XEXP (operands[0], 0),
> -                             GET_MODE_SIZE (TFmode)))"
> -  "stp\\t%q1, %q3, %z0"
> +   (set_attr "fp" "yes")
> +   (set_attr "ldpstp" "ldp")]
> +)
> +
> +(define_insn "*store_pair_<ldst_sz>"
> +  [(set (match_operand:<VPAIR> 0 "aarch64_mem_pair_lanes_operand")
> +     (unspec:<VPAIR>
> +       [(match_operand:GPI 1 "aarch64_stp_reg_operand")
> +        (match_operand:GPI 2 "aarch64_stp_reg_operand")] UNSPEC_STP))]
> +  ""
> +  {@ [cons:  =0,   1,   2; attrs: type      , arch]
> +     [           Umn, rYZ, rYZ; store_<ldpstp_sz>, *   ] stp\t%<w>1, %<w>2, %y0
> +     [           Umn,   w,   w; neon_store1_2reg , fp  ] stp\t%<v>1, %<v>2, %y0
> +  }
> +  [(set_attr "ldpstp" "stp")]
> +)
> +
> +(define_insn "*store_pair_16"
> +  [(set (match_operand:V2x16QI 0 "aarch64_mem_pair_lanes_operand" "=Umn")
> +     (unspec:V2x16QI
> +       [(match_operand:TI 1 "aarch64_ldp_reg_operand" "w")
> +        (match_operand:TI 2 "aarch64_ldp_reg_operand" "w")] UNSPEC_STP))]
> +  "TARGET_FLOAT"
> +  "stp\t%q1, %q2, %y0"
>    [(set_attr "type" "neon_stp_q")
> -   (set_attr "fp" "yes")]
> +   (set_attr "fp" "yes")
> +   (set_attr "ldpstp" "stp")]
>  )
>  
>  ;; Writeback load/store pair patterns.
> @@ -2074,14 +2044,15 @@ (define_insn "*extendsidi2_aarch64"
>  
>  (define_insn "*load_pair_extendsidi2_aarch64"
>    [(set (match_operand:DI 0 "register_operand" "=r")
> -     (sign_extend:DI (match_operand:SI 1 "aarch64_mem_pair_operand" "Ump")))
> +     (sign_extend:DI (unspec:SI [
> +       (match_operand:V2x4QI 1 "aarch64_mem_pair_lanes_operand" "Umn")
> +     ] UNSPEC_LDP_FST)))
>     (set (match_operand:DI 2 "register_operand" "=r")
> -     (sign_extend:DI (match_operand:SI 3 "memory_operand" "m")))]
> -  "rtx_equal_p (XEXP (operands[3], 0),
> -             plus_constant (Pmode,
> -                            XEXP (operands[1], 0),
> -                            GET_MODE_SIZE (SImode)))"
> -  "ldpsw\\t%0, %2, %z1"
> +     (sign_extend:DI (unspec:SI [
> +       (match_dup 1)
> +     ] UNSPEC_LDP_SND)))]
> +  ""
> +  "ldpsw\\t%0, %2, %y1"
>    [(set_attr "type" "load_8")]
>  )
>  
> @@ -2101,16 +2072,17 @@ (define_insn "*zero_extendsidi2_aarch64"
>  
>  (define_insn "*load_pair_zero_extendsidi2_aarch64"
>    [(set (match_operand:DI 0 "register_operand")
> -     (zero_extend:DI (match_operand:SI 1 "aarch64_mem_pair_operand")))
> +     (zero_extend:DI (unspec:SI [
> +       (match_operand:V2x4QI 1 "aarch64_mem_pair_lanes_operand")
> +     ] UNSPEC_LDP_FST)))
>     (set (match_operand:DI 2 "register_operand")
> -     (zero_extend:DI (match_operand:SI 3 "memory_operand")))]
> -  "rtx_equal_p (XEXP (operands[3], 0),
> -             plus_constant (Pmode,
> -                            XEXP (operands[1], 0),
> -                            GET_MODE_SIZE (SImode)))"
> -  {@ [ cons: =0 , 1   , =2 , 3 ; attrs: type     , arch ]
> -     [ r        , Ump , r  , m ; load_8          , *    ] ldp\t%w0, %w2, %z1
> -     [ w        , Ump , w  , m ; neon_load1_2reg , fp   ] ldp\t%s0, %s2, %z1
> +     (zero_extend:DI (unspec:SI [
> +       (match_dup 1)
> +     ] UNSPEC_LDP_SND)))]
> +  ""
> +  {@ [ cons: =0 , 1   , =2; attrs: type    , arch]
> +     [ r     , Umn , r ; load_8         , *   ] ldp\t%w0, %w2, %y1
> +     [ w     , Umn , w ; neon_load1_2reg, fp  ] ldp\t%s0, %s2, %y1
>    }
>  )
>  
> diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md
> index a920de99ffc..fd8dd6db349 100644
> --- a/gcc/config/aarch64/iterators.md
> +++ b/gcc/config/aarch64/iterators.md
> @@ -1435,6 +1435,9 @@ (define_mode_attr VDBL [(V8QI "V16QI") (V4HI "V8HI")
>                       (SI   "V2SI")  (SF   "V2SF")
>                       (DI   "V2DI")  (DF   "V2DF")])
>  
> +;; Load/store pair mode.
> +(define_mode_attr VPAIR [(SI "V2x4QI") (DI "V2x8QI")])
> +
>  ;; Register suffix for double-length mode.
>  (define_mode_attr Vdtype [(V4HF "8h") (V2SF "4s")])
>  
> diff --git a/gcc/config/aarch64/predicates.md b/gcc/config/aarch64/predicates.md
> index b647e5af7c6..80f2e03d8de 100644
> --- a/gcc/config/aarch64/predicates.md
> +++ b/gcc/config/aarch64/predicates.md
> @@ -266,10 +266,12 @@ (define_special_predicate "aarch64_mem_pair_operator"
>        (match_test "known_eq (GET_MODE_SIZE (mode),
>                            GET_MODE_SIZE (GET_MODE (op)))"))))
>  
> -(define_predicate "aarch64_mem_pair_operand"
> -  (and (match_code "mem")
> -       (match_test "aarch64_legitimate_address_p (mode, XEXP (op, 0), false,
> -                                               ADDR_QUERY_LDP_STP)")))
> +;; Like aarch64_mem_pair_operator, but additionally check the
> +;; address is suitable.
> +(define_special_predicate "aarch64_mem_pair_operand"
> +  (and (match_operand 0 "aarch64_mem_pair_operator")
> +       (match_test "aarch64_legitimate_address_p (GET_MODE (op), XEXP (op, 0),
> +                                               false, ADDR_QUERY_LDP_STP)")))
>  
>  (define_predicate "pmode_plus_operator"
>    (and (match_code "plus")
