Alex Coplan <[email protected]> writes:
> This patch overhauls the load/store pair patterns with two main goals:
>
> 1. Fixing a correctness issue (the current patterns are not RA-friendly).
> 2. Allowing more flexibility in which operand modes are supported, and which
> combinations of modes are allowed in the two arms of the load/store pair,
> while reducing the number of patterns required both in the source and in
> the generated code.
>
> The correctness issue (1) is that the current patterns have two
> independent memory operands tied together only by a predicate on the
> insns. Since LRA only looks at the constraints, one of the memory
> operands can get reloaded without the other one being changed, leading
> to the insn becoming unrecognizable after reload.
>
> We fix this issue by changing the patterns such that they only ever have one
> memory operand representing the entire pair. For the store case, we use an
> unspec to logically concatenate the register operands before storing them.
> For the load case, we use unspecs to extract the "lanes" from the pair mem,
> with the second occurrence of the mem matched using a match_dup (such
> that there is still really only one memory operand as far as the RA is
> concerned).
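For reference, the two shapes described above come out roughly as follows (illustrative RTL with made-up registers and a made-up address; the unspec names are the ones the patch adds in aarch64.md):

```lisp
;; Store pair: a single pair-mode mem, with the two source registers
;; logically concatenated by an unspec.
(set (mem:V2x8QI (reg:DI base))
     (unspec:V2x8QI [(reg:DI x0) (reg:DI x1)] UNSPEC_STP))

;; Load pair: lanes extracted from one pair-mode mem; in the real
;; pattern the second occurrence of the mem is a match_dup of the first.
(parallel
 [(set (reg:DI x0) (unspec:DI [(mem:V2x8QI (reg:DI base))] UNSPEC_LDP_FST))
  (set (reg:DI x1) (unspec:DI [(mem:V2x8QI (reg:DI base))] UNSPEC_LDP_SND))])
```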
>
> In terms of the modes used for the pair memory operands, we canonicalize
> these to V2x4QImode, V2x8QImode, and V2x16QImode. These modes have not
> only the correct size but also correct alignment requirement for a
> memory operand representing an entire load/store pair. Unlike the other
> two, V2x4QImode didn't previously exist, so had to be added with the
> patch.
>
> As with the previous patch generalizing the writeback patterns, this
> patch aims to be flexible in the combinations of modes supported by the
> patterns without requiring a large number of generated patterns by using
> distinct mode iterators.
>
> The new scheme means we only need a single (generated) pattern for each
> load/store operation of a given operand size. For the 4-byte and 8-byte
> operand cases, we use the GPI iterator to synthesize the two patterns.
> The 16-byte case is implemented as a separate pattern in the source (due
> to only having a single possible alternative).
>
> Since the UNSPEC patterns can't be interpreted by the dwarf2cfi code,
> we add REG_CFA_OFFSET notes to the store pair insns emitted by
> aarch64_save_callee_saves, so that correct CFI information can still be
> generated. Furthermore, we now unconditionally generate these CFA
> notes on frame-related insns emitted by aarch64_save_callee_saves.
> This is done in case the load/store pair pass later forms these insns
> into pairs, at which point the CFA notes would be needed.
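Concretely, each REG_CFA_OFFSET note attached by aarch64_save_callee_saves records a plain set that dwarf2cfi can interpret, roughly of this shape (illustrative, with a made-up register and offset):

```lisp
;; x19 saved at sp + 16: dwarf2cfi reads this note instead of the
;; opaque UNSPEC_STP store pattern on the insn itself.
(set (mem:DI (plus:DI (reg:DI sp) (const_int 16)))
     (reg:DI x19))
```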
>
> We also adjust the ldp/stp peepholes to generate the new form. This is
> done by switching the generation to use the
> aarch64_gen_{load,store}_pair interface, making it easier to change the
> form in the future if needed. (Likewise, the upcoming aarch64
> load/store pair pass also makes use of this interface).
>
> This patch also adds an "ldpstp" attribute to the non-writeback
> load/store pair patterns, which is used by the post-RA load/store pair
> pass to identify existing patterns and see if they can be promoted to
> writeback variants.
>
> One potential concern with using unspecs for the patterns is that it can block
> optimization by the generic RTL passes. This patch series tries to mitigate
> this in two ways:
> 1. The pre-RA load/store pair pass runs very late in the pre-RA pipeline.
> 2. A later patch in the series adjusts the aarch64 mem{cpy,set} expansion to
> emit individual loads/stores instead of ldp/stp. These should then be
> formed back into load/store pairs much later in the RTL pipeline by the
> new load/store pair pass.
>
> Bootstrapped/regtested on aarch64-linux-gnu, OK for trunk?
>
> Thanks,
> Alex
>
> gcc/ChangeLog:
>
> * config/aarch64/aarch64-ldpstp.md: Abstract ldp/stp
> representation from peepholes, allowing use of new form.
> * config/aarch64/aarch64-modes.def (V2x4QImode): Define.
> * config/aarch64/aarch64-protos.h
> (aarch64_finish_ldpstp_peephole): Declare.
> (aarch64_swap_ldrstr_operands): Delete declaration.
> (aarch64_gen_load_pair): Declare.
> (aarch64_gen_store_pair): Declare.
> * config/aarch64/aarch64-simd.md (load_pair<DREG:mode><DREG2:mode>):
> Delete.
> (vec_store_pair<DREG:mode><DREG2:mode>): Delete.
> (load_pair<VQ:mode><VQ2:mode>): Delete.
> (vec_store_pair<VQ:mode><VQ2:mode>): Delete.
> * config/aarch64/aarch64.cc (aarch64_pair_mode_for_mode): New.
> (aarch64_gen_store_pair): Adjust to use new unspec form of stp.
> Drop second mem from parameters.
> (aarch64_gen_load_pair): Likewise.
> (aarch64_pair_mem_from_base): New.
> (aarch64_save_callee_saves): Emit REG_CFA_OFFSET notes for
> frame-related saves. Adjust call to aarch64_gen_store_pair.
> (aarch64_restore_callee_saves): Adjust calls to
> aarch64_gen_load_pair to account for change in interface.
> (aarch64_process_components): Likewise.
> (aarch64_classify_address): Handle 32-byte pair mems in
> LDP_STP_N case.
> (aarch64_print_operand): Likewise.
> (aarch64_copy_one_block_and_progress_pointers): Adjust calls to
> account for change in aarch64_gen_{load,store}_pair interface.
> (aarch64_set_one_block_and_progress_pointer): Likewise.
> (aarch64_finish_ldpstp_peephole): New.
> (aarch64_gen_adjusted_ldpstp): Adjust to use generation helper.
> * config/aarch64/aarch64.md (ldpstp): New attribute.
> (load_pair_sw_<SX:mode><SX2:mode>): Delete.
> (load_pair_dw_<DX:mode><DX2:mode>): Delete.
> (load_pair_dw_<TX:mode><TX2:mode>): Delete.
> (*load_pair_<ldst_sz>): New.
> (*load_pair_16): New.
> (store_pair_sw_<SX:mode><SX2:mode>): Delete.
> (store_pair_dw_<DX:mode><DX2:mode>): Delete.
> (store_pair_dw_<TX:mode><TX2:mode>): Delete.
> (*store_pair_<ldst_sz>): New.
> (*store_pair_16): New.
> (*load_pair_extendsidi2_aarch64): Adjust to use new form.
> (*zero_extendsidi2_aarch64): Likewise.
> * config/aarch64/iterators.md (VPAIR): New.
> * config/aarch64/predicates.md (aarch64_mem_pair_operand): Change to
> a special predicate derived from aarch64_mem_pair_operator.
> ---
> gcc/config/aarch64/aarch64-ldpstp.md | 66 +++----
> gcc/config/aarch64/aarch64-modes.def | 6 +-
> gcc/config/aarch64/aarch64-protos.h | 5 +-
> gcc/config/aarch64/aarch64-simd.md | 60 -------
> gcc/config/aarch64/aarch64.cc | 257 +++++++++++++++------------
> gcc/config/aarch64/aarch64.md | 188 +++++++++-----------
> gcc/config/aarch64/iterators.md | 3 +
> gcc/config/aarch64/predicates.md | 10 +-
> 8 files changed, 270 insertions(+), 325 deletions(-)
>
> diff --git a/gcc/config/aarch64/aarch64-ldpstp.md b/gcc/config/aarch64/aarch64-ldpstp.md
> index 1ee7c73ff0c..dc39af85254 100644
> --- a/gcc/config/aarch64/aarch64-ldpstp.md
> +++ b/gcc/config/aarch64/aarch64-ldpstp.md
> @@ -24,10 +24,10 @@ (define_peephole2
> (set (match_operand:GPI 2 "register_operand" "")
> (match_operand:GPI 3 "memory_operand" ""))]
> "aarch64_operands_ok_for_ldpstp (operands, true, <MODE>mode)"
> - [(parallel [(set (match_dup 0) (match_dup 1))
> - (set (match_dup 2) (match_dup 3))])]
> + [(const_int 0)]
> {
> - aarch64_swap_ldrstr_operands (operands, true);
> + aarch64_finish_ldpstp_peephole (operands, true);
> + DONE;
> })
>
> (define_peephole2
> @@ -36,10 +36,10 @@ (define_peephole2
> (set (match_operand:GPI 2 "memory_operand" "")
> (match_operand:GPI 3 "aarch64_reg_or_zero" ""))]
> "aarch64_operands_ok_for_ldpstp (operands, false, <MODE>mode)"
> - [(parallel [(set (match_dup 0) (match_dup 1))
> - (set (match_dup 2) (match_dup 3))])]
> + [(const_int 0)]
> {
> - aarch64_swap_ldrstr_operands (operands, false);
> + aarch64_finish_ldpstp_peephole (operands, false);
> + DONE;
> })
>
> (define_peephole2
> @@ -48,10 +48,10 @@ (define_peephole2
> (set (match_operand:GPF 2 "register_operand" "")
> (match_operand:GPF 3 "memory_operand" ""))]
> "aarch64_operands_ok_for_ldpstp (operands, true, <MODE>mode)"
> - [(parallel [(set (match_dup 0) (match_dup 1))
> - (set (match_dup 2) (match_dup 3))])]
> + [(const_int 0)]
> {
> - aarch64_swap_ldrstr_operands (operands, true);
> + aarch64_finish_ldpstp_peephole (operands, true);
> + DONE;
> })
>
> (define_peephole2
> @@ -60,10 +60,10 @@ (define_peephole2
> (set (match_operand:GPF 2 "memory_operand" "")
> (match_operand:GPF 3 "aarch64_reg_or_fp_zero" ""))]
> "aarch64_operands_ok_for_ldpstp (operands, false, <MODE>mode)"
> - [(parallel [(set (match_dup 0) (match_dup 1))
> - (set (match_dup 2) (match_dup 3))])]
> + [(const_int 0)]
> {
> - aarch64_swap_ldrstr_operands (operands, false);
> + aarch64_finish_ldpstp_peephole (operands, false);
> + DONE;
> })
>
> (define_peephole2
> @@ -72,10 +72,10 @@ (define_peephole2
> (set (match_operand:DREG2 2 "register_operand" "")
> (match_operand:DREG2 3 "memory_operand" ""))]
> "aarch64_operands_ok_for_ldpstp (operands, true, <DREG:MODE>mode)"
> - [(parallel [(set (match_dup 0) (match_dup 1))
> - (set (match_dup 2) (match_dup 3))])]
> + [(const_int 0)]
> {
> - aarch64_swap_ldrstr_operands (operands, true);
> + aarch64_finish_ldpstp_peephole (operands, true);
> + DONE;
> })
>
> (define_peephole2
> @@ -84,10 +84,10 @@ (define_peephole2
> (set (match_operand:DREG2 2 "memory_operand" "")
> (match_operand:DREG2 3 "register_operand" ""))]
> "aarch64_operands_ok_for_ldpstp (operands, false, <DREG:MODE>mode)"
> - [(parallel [(set (match_dup 0) (match_dup 1))
> - (set (match_dup 2) (match_dup 3))])]
> + [(const_int 0)]
> {
> - aarch64_swap_ldrstr_operands (operands, false);
> + aarch64_finish_ldpstp_peephole (operands, false);
> + DONE;
> })
>
> (define_peephole2
> @@ -99,10 +99,10 @@ (define_peephole2
> && aarch64_operands_ok_for_ldpstp (operands, true, <VQ:MODE>mode)
> && (aarch64_tune_params.extra_tuning_flags
> & AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS) == 0"
> - [(parallel [(set (match_dup 0) (match_dup 1))
> - (set (match_dup 2) (match_dup 3))])]
> + [(const_int 0)]
> {
> - aarch64_swap_ldrstr_operands (operands, true);
> + aarch64_finish_ldpstp_peephole (operands, true);
> + DONE;
> })
>
> (define_peephole2
> @@ -114,10 +114,10 @@ (define_peephole2
> && aarch64_operands_ok_for_ldpstp (operands, false, <VQ:MODE>mode)
> && (aarch64_tune_params.extra_tuning_flags
> & AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS) == 0"
> - [(parallel [(set (match_dup 0) (match_dup 1))
> - (set (match_dup 2) (match_dup 3))])]
> + [(const_int 0)]
> {
> - aarch64_swap_ldrstr_operands (operands, false);
> + aarch64_finish_ldpstp_peephole (operands, false);
> + DONE;
> })
>
>
> @@ -129,10 +129,10 @@ (define_peephole2
> (set (match_operand:DI 2 "register_operand" "")
> (sign_extend:DI (match_operand:SI 3 "memory_operand" "")))]
> "aarch64_operands_ok_for_ldpstp (operands, true, SImode)"
> - [(parallel [(set (match_dup 0) (sign_extend:DI (match_dup 1)))
> - (set (match_dup 2) (sign_extend:DI (match_dup 3)))])]
> + [(const_int 0)]
> {
> - aarch64_swap_ldrstr_operands (operands, true);
> + aarch64_finish_ldpstp_peephole (operands, true, SIGN_EXTEND);
> + DONE;
> })
>
> (define_peephole2
> @@ -141,10 +141,10 @@ (define_peephole2
> (set (match_operand:DI 2 "register_operand" "")
> (zero_extend:DI (match_operand:SI 3 "memory_operand" "")))]
> "aarch64_operands_ok_for_ldpstp (operands, true, SImode)"
> - [(parallel [(set (match_dup 0) (zero_extend:DI (match_dup 1)))
> - (set (match_dup 2) (zero_extend:DI (match_dup 3)))])]
> + [(const_int 0)]
> {
> - aarch64_swap_ldrstr_operands (operands, true);
> + aarch64_finish_ldpstp_peephole (operands, true, ZERO_EXTEND);
> + DONE;
> })
>
> ;; Handle storing of a floating point zero with integer data.
> @@ -163,10 +163,10 @@ (define_peephole2
> (set (match_operand:<FCVT_TARGET> 2 "memory_operand" "")
> (match_operand:<FCVT_TARGET> 3 "aarch64_reg_zero_or_fp_zero" ""))]
> "aarch64_operands_ok_for_ldpstp (operands, false, <V_INT_EQUIV>mode)"
> - [(parallel [(set (match_dup 0) (match_dup 1))
> - (set (match_dup 2) (match_dup 3))])]
> + [(const_int 0)]
> {
> - aarch64_swap_ldrstr_operands (operands, false);
> + aarch64_finish_ldpstp_peephole (operands, false);
> + DONE;
> })
>
> ;; Handle consecutive load/store whose offset is out of the range
> diff --git a/gcc/config/aarch64/aarch64-modes.def b/gcc/config/aarch64/aarch64-modes.def
> index 6b4f4e17dd5..1e0d770f72f 100644
> --- a/gcc/config/aarch64/aarch64-modes.def
> +++ b/gcc/config/aarch64/aarch64-modes.def
> @@ -93,9 +93,13 @@ INT_MODE (XI, 64);
>
> /* V8DI mode. */
> VECTOR_MODE_WITH_PREFIX (V, INT, DI, 8, 5);
> -
> ADJUST_ALIGNMENT (V8DI, 8);
>
> +/* V2x4QImode. Used in load/store pair patterns. */
> +VECTOR_MODE_WITH_PREFIX (V2x, INT, QI, 4, 5);
> +ADJUST_NUNITS (V2x4QI, 8);
> +ADJUST_ALIGNMENT (V2x4QI, 4);
> +
> /* Define Advanced SIMD modes for structures of 2, 3 and 4 d-registers. */
> #define ADV_SIMD_D_REG_STRUCT_MODES(NVECS, VB, VH, VS, VD) \
> VECTOR_MODES_WITH_PREFIX (V##NVECS##x, INT, 8, 3); \
> diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
> index e463fd5c817..2ab54f244a7 100644
> --- a/gcc/config/aarch64/aarch64-protos.h
> +++ b/gcc/config/aarch64/aarch64-protos.h
> @@ -967,6 +967,8 @@ void aarch64_split_compare_and_swap (rtx op[]);
> void aarch64_split_atomic_op (enum rtx_code, rtx, rtx, rtx, rtx, rtx, rtx);
>
> bool aarch64_gen_adjusted_ldpstp (rtx *, bool, machine_mode, RTX_CODE);
> +void aarch64_finish_ldpstp_peephole (rtx *, bool,
> + enum rtx_code = (enum rtx_code)0);
>
> void aarch64_expand_sve_vec_cmp_int (rtx, rtx_code, rtx, rtx);
> bool aarch64_expand_sve_vec_cmp_float (rtx, rtx_code, rtx, rtx, bool);
> @@ -1022,8 +1024,9 @@ bool aarch64_mergeable_load_pair_p (machine_mode, rtx, rtx);
> bool aarch64_operands_ok_for_ldpstp (rtx *, bool, machine_mode);
> bool aarch64_operands_adjust_ok_for_ldpstp (rtx *, bool, machine_mode);
> bool aarch64_mem_ok_with_ldpstp_policy_model (rtx, bool, machine_mode);
> -void aarch64_swap_ldrstr_operands (rtx *, bool);
> bool aarch64_ldpstp_operand_mode_p (machine_mode);
> +rtx aarch64_gen_load_pair (rtx, rtx, rtx, enum rtx_code = (enum rtx_code)0);
> +rtx aarch64_gen_store_pair (rtx, rtx, rtx);
>
> extern void aarch64_asm_output_pool_epilogue (FILE *, const char *,
> tree, HOST_WIDE_INT);
> diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
> index c6f2d582837..6f5080ab030 100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -231,38 +231,6 @@ (define_insn "aarch64_store_lane0<mode>"
> [(set_attr "type" "neon_store1_1reg<q>")]
> )
>
> -(define_insn "load_pair<DREG:mode><DREG2:mode>"
> - [(set (match_operand:DREG 0 "register_operand")
> - (match_operand:DREG 1 "aarch64_mem_pair_operand"))
> - (set (match_operand:DREG2 2 "register_operand")
> - (match_operand:DREG2 3 "memory_operand"))]
> - "TARGET_FLOAT
> - && rtx_equal_p (XEXP (operands[3], 0),
> - plus_constant (Pmode,
> - XEXP (operands[1], 0),
> - GET_MODE_SIZE (<DREG:MODE>mode)))"
> - {@ [ cons: =0 , 1 , =2 , 3 ; attrs: type ]
> - [ w , Ump , w , m ; neon_ldp ] ldp\t%d0, %d2, %z1
> - [ r , Ump , r , m ; load_16 ] ldp\t%x0, %x2, %z1
> - }
> -)
> -
> -(define_insn "vec_store_pair<DREG:mode><DREG2:mode>"
> - [(set (match_operand:DREG 0 "aarch64_mem_pair_operand")
> - (match_operand:DREG 1 "register_operand"))
> - (set (match_operand:DREG2 2 "memory_operand")
> - (match_operand:DREG2 3 "register_operand"))]
> - "TARGET_FLOAT
> - && rtx_equal_p (XEXP (operands[2], 0),
> - plus_constant (Pmode,
> - XEXP (operands[0], 0),
> - GET_MODE_SIZE (<DREG:MODE>mode)))"
> - {@ [ cons: =0 , 1 , =2 , 3 ; attrs: type ]
> - [ Ump , w , m , w ; neon_stp ] stp\t%d1, %d3, %z0
> - [ Ump , r , m , r ; store_16 ] stp\t%x1, %x3, %z0
> - }
> -)
> -
> (define_insn "aarch64_simd_stp<mode>"
> [(set (match_operand:VP_2E 0 "aarch64_mem_pair_lanes_operand")
> (vec_duplicate:VP_2E (match_operand:<VEL> 1 "register_operand")))]
> @@ -273,34 +241,6 @@ (define_insn "aarch64_simd_stp<mode>"
> }
> )
>
> -(define_insn "load_pair<VQ:mode><VQ2:mode>"
> - [(set (match_operand:VQ 0 "register_operand" "=w")
> - (match_operand:VQ 1 "aarch64_mem_pair_operand" "Ump"))
> - (set (match_operand:VQ2 2 "register_operand" "=w")
> - (match_operand:VQ2 3 "memory_operand" "m"))]
> - "TARGET_FLOAT
> - && rtx_equal_p (XEXP (operands[3], 0),
> - plus_constant (Pmode,
> - XEXP (operands[1], 0),
> - GET_MODE_SIZE (<VQ:MODE>mode)))"
> - "ldp\\t%q0, %q2, %z1"
> - [(set_attr "type" "neon_ldp_q")]
> -)
> -
> -(define_insn "vec_store_pair<VQ:mode><VQ2:mode>"
> - [(set (match_operand:VQ 0 "aarch64_mem_pair_operand" "=Ump")
> - (match_operand:VQ 1 "register_operand" "w"))
> - (set (match_operand:VQ2 2 "memory_operand" "=m")
> - (match_operand:VQ2 3 "register_operand" "w"))]
> - "TARGET_FLOAT
> - && rtx_equal_p (XEXP (operands[2], 0),
> - plus_constant (Pmode,
> - XEXP (operands[0], 0),
> - GET_MODE_SIZE (<VQ:MODE>mode)))"
> - "stp\\t%q1, %q3, %z0"
> - [(set_attr "type" "neon_stp_q")]
> -)
> -
> (define_expand "@aarch64_split_simd_mov<mode>"
> [(set (match_operand:VQMOV 0)
> (match_operand:VQMOV 1))]
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index ccf081d2a16..1f6094bf1bc 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -9056,59 +9056,81 @@ aarch64_pop_regs (unsigned regno1, unsigned regno2, HOST_WIDE_INT adjustment,
> }
> }
>
> -/* Generate and return a store pair instruction of mode MODE to store
> - register REG1 to MEM1 and register REG2 to MEM2. */
> +static machine_mode
> +aarch64_pair_mode_for_mode (machine_mode mode)
> +{
> + if (known_eq (GET_MODE_SIZE (mode), 4))
> + return E_V2x4QImode;
> + else if (known_eq (GET_MODE_SIZE (mode), 8))
> + return E_V2x8QImode;
> + else if (known_eq (GET_MODE_SIZE (mode), 16))
> + return E_V2x16QImode;
> + else
> + gcc_unreachable ();
> +}
Missing function comment. There should be no need to use E_ outside switches.
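For the record, the E_-free if-chain shape being suggested looks something like the following. This is a self-contained mimic with stand-in enum values for illustration only, not actual GCC code (in GCC the modes come from aarch64-modes.def, sizes are poly_ints tested with known_eq, and gcc_unreachable () takes the place of abort):

```cpp
#include <cassert>
#include <cstdlib>

// Stand-ins for GCC's machine_mode values, purely for illustration.
enum machine_mode { V2x4QImode, V2x8QImode, V2x16QImode };

/* Return the mode used to represent a whole load/store pair whose
   individual operands are SIZE bytes each.  */
static machine_mode
pair_mode_for_size (int size)
{
  // Outside a switch, the plain mode names can be returned directly,
  // with no need for the E_ prefix.
  if (size == 4)
    return V2x4QImode;
  else if (size == 8)
    return V2x8QImode;
  else if (size == 16)
    return V2x16QImode;
  std::abort ();  // stands in for gcc_unreachable ()
}
```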
>
> static rtx
> -aarch64_gen_store_pair (machine_mode mode, rtx mem1, rtx reg1, rtx mem2,
> - rtx reg2)
> +aarch64_pair_mem_from_base (rtx mem)
> {
> - switch (mode)
> - {
> - case E_DImode:
> - return gen_store_pair_dw_didi (mem1, reg1, mem2, reg2);
> -
> - case E_DFmode:
> - return gen_store_pair_dw_dfdf (mem1, reg1, mem2, reg2);
> -
> - case E_TFmode:
> - return gen_store_pair_dw_tftf (mem1, reg1, mem2, reg2);
> + auto pair_mode = aarch64_pair_mode_for_mode (GET_MODE (mem));
> + mem = adjust_bitfield_address_nv (mem, pair_mode, 0);
> + gcc_assert (aarch64_mem_pair_lanes_operand (mem, pair_mode));
> + return mem;
> +}
>
> - case E_V4SImode:
> - return gen_vec_store_pairv4siv4si (mem1, reg1, mem2, reg2);
> +/* Generate and return a store pair instruction to store REG1 and REG2
> + into memory starting at BASE_MEM. All three rtxes should have modes of
> + the same size. */
>
> - case E_V16QImode:
> - return gen_vec_store_pairv16qiv16qi (mem1, reg1, mem2, reg2);
> +rtx
> +aarch64_gen_store_pair (rtx base_mem, rtx reg1, rtx reg2)
> +{
> + rtx pair_mem = aarch64_pair_mem_from_base (base_mem);
>
> - default:
> - gcc_unreachable ();
> - }
> + return gen_rtx_SET (pair_mem,
> + gen_rtx_UNSPEC (GET_MODE (pair_mem),
> + gen_rtvec (2, reg1, reg2),
> + UNSPEC_STP));
> }
>
> -/* Generate and regurn a load pair isntruction of mode MODE to load register
> - REG1 from MEM1 and register REG2 from MEM2. */
> +/* Generate and return a load pair instruction to load a pair of
> + registers starting at BASE_MEM into REG1 and REG2. If CODE is
> + UNKNOWN, all three rtxes should have modes of the same size.
> + Otherwise, CODE is {SIGN,ZERO}_EXTEND, base_mem should be in SImode,
> + and REG{1,2} should be in DImode. */
>
> -static rtx
> -aarch64_gen_load_pair (machine_mode mode, rtx reg1, rtx mem1, rtx reg2,
> - rtx mem2)
> +rtx
> +aarch64_gen_load_pair (rtx reg1, rtx reg2, rtx base_mem, enum rtx_code code)
> {
> - switch (mode)
> - {
> - case E_DImode:
> - return gen_load_pair_dw_didi (reg1, mem1, reg2, mem2);
> + rtx pair_mem = aarch64_pair_mem_from_base (base_mem);
>
> - case E_DFmode:
> - return gen_load_pair_dw_dfdf (reg1, mem1, reg2, mem2);
> -
> - case E_TFmode:
> - return gen_load_pair_dw_tftf (reg1, mem1, reg2, mem2);
> + const bool any_extend_p = (code == ZERO_EXTEND || code == SIGN_EXTEND);
> + if (any_extend_p)
> + {
> + gcc_checking_assert (GET_MODE (base_mem) == SImode);
> + gcc_checking_assert (GET_MODE (reg1) == DImode);
> + gcc_checking_assert (GET_MODE (reg2) == DImode);
Not just a personal preference: I think single asserts combined with &&
are generally preferred.
> + }
> + else
> + gcc_assert (code == UNKNOWN);
> +
> + rtx unspecs[2] = {
> + gen_rtx_UNSPEC (any_extend_p ? SImode : GET_MODE (reg1),
> + gen_rtvec (1, pair_mem),
> + UNSPEC_LDP_FST),
> + gen_rtx_UNSPEC (any_extend_p ? SImode : GET_MODE (reg2),
IIUC, the unspec modes could both be GET_MODE (base_mem)
> + gen_rtvec (1, copy_rtx (pair_mem)),
> + UNSPEC_LDP_SND)
> + };
>
> - case E_V4SImode:
> - return gen_load_pairv4siv4si (reg1, mem1, reg2, mem2);
> + if (any_extend_p)
> + for (int i = 0; i < 2; i++)
> + unspecs[i] = gen_rtx_fmt_e (code, DImode, unspecs[i]);
>
> - default:
> - gcc_unreachable ();
> - }
> + return gen_rtx_PARALLEL (VOIDmode,
> + gen_rtvec (2,
> + gen_rtx_SET (reg1, unspecs[0]),
> + gen_rtx_SET (reg2, unspecs[1])));
> }
>
> /* Return TRUE if return address signing should be enabled for the current
> @@ -9321,8 +9343,19 @@ aarch64_save_callee_saves (poly_int64 bytes_below_sp,
> offset -= fp_offset;
> }
> rtx mem = gen_frame_mem (mode, plus_constant (Pmode, base_rtx,
> offset));
> - bool need_cfa_note_p = (base_rtx != stack_pointer_rtx);
>
> + rtx cfa_base = stack_pointer_rtx;
> + poly_int64 cfa_offset = sp_offset;
I don't think we need both cfa_offset and sp_offset. sp_offset in the
current code only exists for CFI purposes.
> +
> + if (hard_fp_valid_p && frame_pointer_needed)
> + {
> + cfa_base = hard_frame_pointer_rtx;
> + cfa_offset += (bytes_below_sp - frame.bytes_below_hard_fp);
> + }
> +
> + rtx cfa_mem = gen_frame_mem (mode,
> + plus_constant (Pmode,
> + cfa_base, cfa_offset));
> unsigned int regno2;
> if (!aarch64_sve_mode_p (mode)
> && i + 1 < regs.size ()
> @@ -9331,45 +9364,37 @@ aarch64_save_callee_saves (poly_int64 bytes_below_sp,
> frame.reg_offset[regno2] - frame.reg_offset[regno]))
> {
> rtx reg2 = gen_rtx_REG (mode, regno2);
> - rtx mem2;
>
> offset += GET_MODE_SIZE (mode);
> - mem2 = gen_frame_mem (mode, plus_constant (Pmode, base_rtx, offset));
> - insn = emit_insn (aarch64_gen_store_pair (mode, mem, reg, mem2,
> - reg2));
> -
> - /* The first part of a frame-related parallel insn is
> - always assumed to be relevant to the frame
> - calculations; subsequent parts, are only
> - frame-related if explicitly marked. */
> + insn = emit_insn (aarch64_gen_store_pair (mem, reg, reg2));
> +
> if (aarch64_emit_cfi_for_reg_p (regno2))
> {
> - if (need_cfa_note_p)
> - aarch64_add_cfa_expression (insn, reg2, stack_pointer_rtx,
> - sp_offset + GET_MODE_SIZE (mode));
> - else
> - RTX_FRAME_RELATED_P (XVECEXP (PATTERN (insn), 0, 1)) = 1;
> + rtx cfa_mem2 = adjust_address_nv (cfa_mem,
> + Pmode,
> + GET_MODE_SIZE (mode));
I think this should use gen_frame_mem directly, rather than moving beyond
the bounds of the original mem.
> + add_reg_note (insn, REG_CFA_OFFSET,
> + gen_rtx_SET (cfa_mem2, reg2));
> }
>
> regno = regno2;
> ++i;
> }
> else if (mode == VNx2DImode && BYTES_BIG_ENDIAN)
> - {
> - insn = emit_insn (gen_aarch64_pred_mov (mode, mem, ptrue, reg));
> - need_cfa_note_p = true;
> - }
> + insn = emit_insn (gen_aarch64_pred_mov (mode, mem, ptrue, reg));
> else if (aarch64_sve_mode_p (mode))
> insn = emit_insn (gen_rtx_SET (mem, reg));
> else
> insn = emit_move_insn (mem, reg);
>
> RTX_FRAME_RELATED_P (insn) = frame_related_p;
> - if (frame_related_p && need_cfa_note_p)
> - aarch64_add_cfa_expression (insn, reg, stack_pointer_rtx, sp_offset);
> +
> + if (frame_related_p)
> + add_reg_note (insn, REG_CFA_OFFSET, gen_rtx_SET (cfa_mem, reg));
For the record, I might need to add back some CFA_EXPRESSIONs for
locally-streaming SME functions, to ensure that the CFI code doesn't
aggregate SVE saves across a change in the VG DWARF register.
But it's probably easier to do that once the patch is in,
since having a note on all insns will help to ensure consistency.
> }
> }
>
> +
Stray extra whitespace.
> /* Emit code to restore the callee registers in REGS, ignoring pop candidates
> and any other registers that are handled separately. Write the
> appropriate REG_CFA_RESTORE notes into CFI_OPS.
> @@ -9425,12 +9450,7 @@ aarch64_restore_callee_saves (poly_int64 bytes_below_sp,
> frame.reg_offset[regno2] - frame.reg_offset[regno]))
> {
> rtx reg2 = gen_rtx_REG (mode, regno2);
> - rtx mem2;
> -
> - offset += GET_MODE_SIZE (mode);
> - mem2 = gen_frame_mem (mode, plus_constant (Pmode, base_rtx, offset));
> - emit_insn (aarch64_gen_load_pair (mode, reg, mem, reg2, mem2));
> -
> + emit_insn (aarch64_gen_load_pair (reg, reg2, mem));
> *cfi_ops = alloc_reg_note (REG_CFA_RESTORE, reg2, *cfi_ops);
> regno = regno2;
> ++i;
> @@ -9762,9 +9782,9 @@ aarch64_process_components (sbitmap components, bool prologue_p)
> : gen_rtx_SET (reg2, mem2);
>
> if (prologue_p)
> - insn = emit_insn (aarch64_gen_store_pair (mode, mem, reg, mem2, reg2));
> + insn = emit_insn (aarch64_gen_store_pair (mem, reg, reg2));
> else
> - insn = emit_insn (aarch64_gen_load_pair (mode, reg, mem, reg2, mem2));
> + insn = emit_insn (aarch64_gen_load_pair (reg, reg2, mem));
>
> if (frame_related_p || frame_related2_p)
> {
> @@ -10983,12 +11003,18 @@ aarch64_classify_address (struct aarch64_address_info *info,
> mode of the corresponding addressing mode is half of that. */
> if (type == ADDR_QUERY_LDP_STP_N)
> {
> - if (known_eq (GET_MODE_SIZE (mode), 16))
> + if (known_eq (GET_MODE_SIZE (mode), 32))
> + mode = V16QImode;
> + else if (known_eq (GET_MODE_SIZE (mode), 16))
> mode = DFmode;
> else if (known_eq (GET_MODE_SIZE (mode), 8))
> mode = SFmode;
> else
> return false;
> +
> + /* This isn't really an Advanced SIMD struct mode, but a mode
> + used to represent the complete mem in a load/store pair. */
> + advsimd_struct_p = false;
> }
>
> bool allow_reg_index_p = (!load_store_pair_p
> @@ -12609,7 +12635,8 @@ aarch64_print_operand (FILE *f, rtx x, int code)
> if (!MEM_P (x)
> || (code == 'y'
> && maybe_ne (GET_MODE_SIZE (mode), 8)
> - && maybe_ne (GET_MODE_SIZE (mode), 16)))
> + && maybe_ne (GET_MODE_SIZE (mode), 16)
> + && maybe_ne (GET_MODE_SIZE (mode), 32)))
> {
> output_operand_lossage ("invalid operand for '%%%c'", code);
> return;
> @@ -25431,10 +25458,8 @@ aarch64_copy_one_block_and_progress_pointers (rtx *src, rtx *dst,
> *src = adjust_address (*src, mode, 0);
> *dst = adjust_address (*dst, mode, 0);
> /* Emit the memcpy. */
> - emit_insn (aarch64_gen_load_pair (mode, reg1, *src, reg2,
> - aarch64_progress_pointer (*src)));
> - emit_insn (aarch64_gen_store_pair (mode, *dst, reg1,
> - aarch64_progress_pointer (*dst), reg2));
> + emit_insn (aarch64_gen_load_pair (reg1, reg2, *src));
> + emit_insn (aarch64_gen_store_pair (*dst, reg1, reg2));
> /* Move the pointers forward. */
> *src = aarch64_move_pointer (*src, 32);
> *dst = aarch64_move_pointer (*dst, 32);
> @@ -25613,8 +25638,7 @@ aarch64_set_one_block_and_progress_pointer (rtx src, rtx *dst,
> /* "Cast" the *dst to the correct mode. */
> *dst = adjust_address (*dst, mode, 0);
> /* Emit the memset. */
> - emit_insn (aarch64_gen_store_pair (mode, *dst, src,
> - aarch64_progress_pointer (*dst), src));
> + emit_insn (aarch64_gen_store_pair (*dst, src, src));
>
> /* Move the pointers forward. */
> *dst = aarch64_move_pointer (*dst, 32);
> @@ -26812,6 +26836,22 @@ aarch64_swap_ldrstr_operands (rtx* operands, bool load)
> }
> }
>
> +void
> +aarch64_finish_ldpstp_peephole (rtx *operands, bool load_p, enum rtx_code code)
Missing function comment.
> +{
> + aarch64_swap_ldrstr_operands (operands, load_p);
> +
> + if (load_p)
> + emit_insn (aarch64_gen_load_pair (operands[0], operands[2],
> + operands[1], code));
> + else
> + {
> + gcc_assert (code == UNKNOWN);
> + emit_insn (aarch64_gen_store_pair (operands[0], operands[1],
> + operands[3]));
> + }
> +}
> +
> /* Taking X and Y to be HOST_WIDE_INT pointers, return the result of a
> comparison between the two. */
> int
> @@ -26993,8 +27033,8 @@ bool
> aarch64_gen_adjusted_ldpstp (rtx *operands, bool load,
> machine_mode mode, RTX_CODE code)
> {
> - rtx base, offset_1, offset_3, t1, t2;
> - rtx mem_1, mem_2, mem_3, mem_4;
> + rtx base, offset_1, offset_3;
> + rtx mem_1, mem_2;
> rtx temp_operands[8];
> HOST_WIDE_INT off_val_1, off_val_3, base_off, new_off_1, new_off_3,
> stp_off_upper_limit, stp_off_lower_limit, msize;
> @@ -27019,21 +27059,17 @@ aarch64_gen_adjusted_ldpstp (rtx *operands, bool load,
> if (load)
> {
> mem_1 = copy_rtx (temp_operands[1]);
> - mem_2 = copy_rtx (temp_operands[3]);
> - mem_3 = copy_rtx (temp_operands[5]);
> - mem_4 = copy_rtx (temp_operands[7]);
> + mem_2 = copy_rtx (temp_operands[5]);
> }
> else
> {
> mem_1 = copy_rtx (temp_operands[0]);
> - mem_2 = copy_rtx (temp_operands[2]);
> - mem_3 = copy_rtx (temp_operands[4]);
> - mem_4 = copy_rtx (temp_operands[6]);
> + mem_2 = copy_rtx (temp_operands[4]);
> gcc_assert (code == UNKNOWN);
> }
>
> extract_base_offset_in_addr (mem_1, &base, &offset_1);
> - extract_base_offset_in_addr (mem_3, &base, &offset_3);
> + extract_base_offset_in_addr (mem_2, &base, &offset_3);
mem_2 with offset_3 feels a bit awkward. Might be worth using mem_3 instead,
so that the memory and register numbers are in sync.

I suppose we still need Ump for the extending loads, is that right?
Are there any other uses left?
Thanks,
Richard
> gcc_assert (base != NULL_RTX && offset_1 != NULL_RTX
> && offset_3 != NULL_RTX);
>
> @@ -27097,63 +27133,48 @@ aarch64_gen_adjusted_ldpstp (rtx *operands, bool load,
> replace_equiv_address_nv (mem_1, plus_constant (Pmode, operands[8],
> new_off_1), true);
> replace_equiv_address_nv (mem_2, plus_constant (Pmode, operands[8],
> - new_off_1 + msize), true);
> - replace_equiv_address_nv (mem_3, plus_constant (Pmode, operands[8],
> new_off_3), true);
> - replace_equiv_address_nv (mem_4, plus_constant (Pmode, operands[8],
> - new_off_3 + msize), true);
>
> if (!aarch64_mem_pair_operand (mem_1, mode)
> - || !aarch64_mem_pair_operand (mem_3, mode))
> + || !aarch64_mem_pair_operand (mem_2, mode))
> return false;
>
> - if (code == ZERO_EXTEND)
> - {
> - mem_1 = gen_rtx_ZERO_EXTEND (DImode, mem_1);
> - mem_2 = gen_rtx_ZERO_EXTEND (DImode, mem_2);
> - mem_3 = gen_rtx_ZERO_EXTEND (DImode, mem_3);
> - mem_4 = gen_rtx_ZERO_EXTEND (DImode, mem_4);
> - }
> - else if (code == SIGN_EXTEND)
> - {
> - mem_1 = gen_rtx_SIGN_EXTEND (DImode, mem_1);
> - mem_2 = gen_rtx_SIGN_EXTEND (DImode, mem_2);
> - mem_3 = gen_rtx_SIGN_EXTEND (DImode, mem_3);
> - mem_4 = gen_rtx_SIGN_EXTEND (DImode, mem_4);
> - }
> -
> if (load)
> {
> operands[0] = temp_operands[0];
> operands[1] = mem_1;
> operands[2] = temp_operands[2];
> - operands[3] = mem_2;
> operands[4] = temp_operands[4];
> - operands[5] = mem_3;
> + operands[5] = mem_2;
> operands[6] = temp_operands[6];
> - operands[7] = mem_4;
> }
> else
> {
> operands[0] = mem_1;
> operands[1] = temp_operands[1];
> - operands[2] = mem_2;
> operands[3] = temp_operands[3];
> - operands[4] = mem_3;
> + operands[4] = mem_2;
> operands[5] = temp_operands[5];
> - operands[6] = mem_4;
> operands[7] = temp_operands[7];
> }
>
> /* Emit adjusting instruction. */
> emit_insn (gen_rtx_SET (operands[8], plus_constant (DImode, base,
> base_off)));
> /* Emit ldp/stp instructions. */
> - t1 = gen_rtx_SET (operands[0], operands[1]);
> - t2 = gen_rtx_SET (operands[2], operands[3]);
> - emit_insn (gen_rtx_PARALLEL (VOIDmode, gen_rtvec (2, t1, t2)));
> - t1 = gen_rtx_SET (operands[4], operands[5]);
> - t2 = gen_rtx_SET (operands[6], operands[7]);
> - emit_insn (gen_rtx_PARALLEL (VOIDmode, gen_rtvec (2, t1, t2)));
> + if (load)
> + {
> + emit_insn (aarch64_gen_load_pair (operands[0], operands[2],
> + operands[1], code));
> + emit_insn (aarch64_gen_load_pair (operands[4], operands[6],
> + operands[5], code));
> + }
> + else
> + {
> + emit_insn (aarch64_gen_store_pair (operands[0], operands[1],
> + operands[3]));
> + emit_insn (aarch64_gen_store_pair (operands[4], operands[5],
> + operands[7]));
> + }
> return true;
> }
>
> diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
> index c92a51690c5..ffb6b0ba749 100644
> --- a/gcc/config/aarch64/aarch64.md
> +++ b/gcc/config/aarch64/aarch64.md
> @@ -175,6 +175,9 @@ (define_c_enum "unspec" [
> UNSPEC_GOTSMALLTLS
> UNSPEC_GOTTINYPIC
> UNSPEC_GOTTINYTLS
> + UNSPEC_STP
> + UNSPEC_LDP_FST
> + UNSPEC_LDP_SND
> UNSPEC_LD1
> UNSPEC_LD2
> UNSPEC_LD2_DREG
> @@ -453,6 +456,11 @@ (define_attr "predicated" "yes,no" (const_string "no"))
> ;; may chose to hold the tracking state encoded in SP.
> (define_attr "speculation_barrier" "true,false" (const_string "false"))
>
> +;; Attribute used to identify load pair and store pair instructions.
> +;; Currently the attribute is only applied to the non-writeback ldp/stp
> +;; patterns.
> +(define_attr "ldpstp" "ldp,stp,none" (const_string "none"))
> +
> ;; -------------------------------------------------------------------
> ;; Pipeline descriptions and scheduling
> ;; -------------------------------------------------------------------
> @@ -1735,100 +1743,62 @@ (define_expand "setmemdi"
> FAIL;
> })
>
> -;; Operands 1 and 3 are tied together by the final condition; so we allow
> -;; fairly lax checking on the second memory operation.
> -(define_insn "load_pair_sw_<SX:mode><SX2:mode>"
> - [(set (match_operand:SX 0 "register_operand")
> - (match_operand:SX 1 "aarch64_mem_pair_operand"))
> - (set (match_operand:SX2 2 "register_operand")
> - (match_operand:SX2 3 "memory_operand"))]
> - "rtx_equal_p (XEXP (operands[3], 0),
> - plus_constant (Pmode,
> - XEXP (operands[1], 0),
> - GET_MODE_SIZE (<SX:MODE>mode)))"
> - {@ [ cons: =0 , 1 , =2 , 3 ; attrs: type , arch ]
> - [ r , Ump , r , m ; load_8 , * ] ldp\t%w0, %w2, %z1
> - [ w , Ump , w , m ; neon_load1_2reg , fp ] ldp\t%s0, %s2, %z1
> - }
> -)
> -
> -;; Storing different modes that can still be merged
> -(define_insn "load_pair_dw_<DX:mode><DX2:mode>"
> - [(set (match_operand:DX 0 "register_operand")
> - (match_operand:DX 1 "aarch64_mem_pair_operand"))
> - (set (match_operand:DX2 2 "register_operand")
> - (match_operand:DX2 3 "memory_operand"))]
> - "rtx_equal_p (XEXP (operands[3], 0),
> - plus_constant (Pmode,
> - XEXP (operands[1], 0),
> - GET_MODE_SIZE (<DX:MODE>mode)))"
> - {@ [ cons: =0 , 1 , =2 , 3 ; attrs: type , arch ]
> - [ r , Ump , r , m ; load_16 , * ] ldp\t%x0, %x2, %z1
> - [ w , Ump , w , m ; neon_load1_2reg , fp ] ldp\t%d0, %d2, %z1
> - }
> -)
> -
> -(define_insn "load_pair_dw_<TX:mode><TX2:mode>"
> - [(set (match_operand:TX 0 "register_operand" "=w")
> - (match_operand:TX 1 "aarch64_mem_pair_operand" "Ump"))
> - (set (match_operand:TX2 2 "register_operand" "=w")
> - (match_operand:TX2 3 "memory_operand" "m"))]
> - "TARGET_SIMD
> - && rtx_equal_p (XEXP (operands[3], 0),
> - plus_constant (Pmode,
> - XEXP (operands[1], 0),
> - GET_MODE_SIZE (<TX:MODE>mode)))"
> - "ldp\\t%q0, %q2, %z1"
> +(define_insn "*load_pair_<ldst_sz>"
> + [(set (match_operand:GPI 0 "aarch64_ldp_reg_operand")
> + (unspec [
> + (match_operand:<VPAIR> 1 "aarch64_mem_pair_lanes_operand")
> + ] UNSPEC_LDP_FST))
> + (set (match_operand:GPI 2 "aarch64_ldp_reg_operand")
> + (unspec [
> + (match_dup 1)
> + ] UNSPEC_LDP_SND))]
> + ""
> + {@ [cons: =0, 1, =2; attrs: type, arch]
> + [ r, Umn, r; load_<ldpstp_sz>, * ] ldp\t%<w>0, %<w>2, %y1
> + [ w, Umn, w; neon_load1_2reg, fp ] ldp\t%<v>0, %<v>2, %y1
> + }
> + [(set_attr "ldpstp" "ldp")]
> +)
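
As an aside for readers following the RTL shape here: with this scheme an
x-register load pair has a single pair-mode memory operand, with the two
lanes extracted by the unspecs. Roughly (a sketch, with register numbers
invented for illustration):

  (parallel
    [(set (reg:DI <dest0>)
          (unspec [(mem:V2x8QI (reg:DI <base>))] UNSPEC_LDP_FST))
     (set (reg:DI <dest1>)
          (unspec [(mem:V2x8QI (reg:DI <base>))] UNSPEC_LDP_SND))])

Because the second occurrence of the mem is a match_dup, the RA sees only
one memory operand, so it can only ever reload the pair address as a whole.
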
> +
> +(define_insn "*load_pair_16"
> + [(set (match_operand:TI 0 "aarch64_ldp_reg_operand" "=w")
> + (unspec [
> + (match_operand:V2x16QI 1 "aarch64_mem_pair_lanes_operand" "Umn")
> + ] UNSPEC_LDP_FST))
> + (set (match_operand:TI 2 "aarch64_ldp_reg_operand" "=w")
> + (unspec [
> + (match_dup 1)
> + ] UNSPEC_LDP_SND))]
> + "TARGET_FLOAT"
> + "ldp\\t%q0, %q2, %y1"
> [(set_attr "type" "neon_ldp_q")
> - (set_attr "fp" "yes")]
> -)
> -
> -;; Operands 0 and 2 are tied together by the final condition; so we allow
> -;; fairly lax checking on the second memory operation.
> -(define_insn "store_pair_sw_<SX:mode><SX2:mode>"
> - [(set (match_operand:SX 0 "aarch64_mem_pair_operand")
> - (match_operand:SX 1 "aarch64_reg_zero_or_fp_zero"))
> - (set (match_operand:SX2 2 "memory_operand")
> - (match_operand:SX2 3 "aarch64_reg_zero_or_fp_zero"))]
> - "rtx_equal_p (XEXP (operands[2], 0),
> - plus_constant (Pmode,
> - XEXP (operands[0], 0),
> - GET_MODE_SIZE (<SX:MODE>mode)))"
> - {@ [ cons: =0 , 1 , =2 , 3 ; attrs: type , arch ]
> -   [ Ump , rYZ , m , rYZ ; store_8          , *  ] stp\t%w1, %w3, %z0
> -   [ Ump , w   , m , w   ; neon_store1_2reg , fp ] stp\t%s1, %s3, %z0
> - }
> -)
> -
> -;; Storing different modes that can still be merged
> -(define_insn "store_pair_dw_<DX:mode><DX2:mode>"
> - [(set (match_operand:DX 0 "aarch64_mem_pair_operand")
> - (match_operand:DX 1 "aarch64_reg_zero_or_fp_zero"))
> - (set (match_operand:DX2 2 "memory_operand")
> - (match_operand:DX2 3 "aarch64_reg_zero_or_fp_zero"))]
> - "rtx_equal_p (XEXP (operands[2], 0),
> - plus_constant (Pmode,
> - XEXP (operands[0], 0),
> - GET_MODE_SIZE (<DX:MODE>mode)))"
> - {@ [ cons: =0 , 1 , =2 , 3 ; attrs: type , arch ]
> -   [ Ump , rYZ , m , rYZ ; store_16         , *  ] stp\t%x1, %x3, %z0
> -   [ Ump , w   , m , w   ; neon_store1_2reg , fp ] stp\t%d1, %d3, %z0
> - }
> -)
> -
> -(define_insn "store_pair_dw_<TX:mode><TX2:mode>"
> - [(set (match_operand:TX 0 "aarch64_mem_pair_operand" "=Ump")
> - (match_operand:TX 1 "register_operand" "w"))
> - (set (match_operand:TX2 2 "memory_operand" "=m")
> - (match_operand:TX2 3 "register_operand" "w"))]
> - "TARGET_SIMD &&
> - rtx_equal_p (XEXP (operands[2], 0),
> - plus_constant (Pmode,
> - XEXP (operands[0], 0),
> - GET_MODE_SIZE (TFmode)))"
> - "stp\\t%q1, %q3, %z0"
> + (set_attr "fp" "yes")
> + (set_attr "ldpstp" "ldp")]
> +)
> +
> +(define_insn "*store_pair_<ldst_sz>"
> + [(set (match_operand:<VPAIR> 0 "aarch64_mem_pair_lanes_operand")
> + (unspec:<VPAIR>
> + [(match_operand:GPI 1 "aarch64_stp_reg_operand")
> + (match_operand:GPI 2 "aarch64_stp_reg_operand")] UNSPEC_STP))]
> + ""
> + {@ [cons: =0, 1, 2; attrs: type , arch]
> +     [ Umn, rYZ, rYZ; store_<ldpstp_sz>, *  ] stp\t%<w>1, %<w>2, %y0
> +     [ Umn, w,   w;   neon_store1_2reg , fp ] stp\t%<v>1, %<v>2, %y0
> + }
> + [(set_attr "ldpstp" "stp")]
> +)
> +
> +(define_insn "*store_pair_16"
> + [(set (match_operand:V2x16QI 0 "aarch64_mem_pair_lanes_operand" "=Umn")
> + (unspec:V2x16QI
> + [(match_operand:TI 1 "aarch64_ldp_reg_operand" "w")
> + (match_operand:TI 2 "aarch64_ldp_reg_operand" "w")] UNSPEC_STP))]
> + "TARGET_FLOAT"
> + "stp\t%q1, %q2, %y0"
> [(set_attr "type" "neon_stp_q")
> - (set_attr "fp" "yes")]
> + (set_attr "fp" "yes")
> + (set_attr "ldpstp" "stp")]
> )
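
The store side is the dual arrangement: the pair mem is the single set
destination and the unspec logically concatenates the two source
registers. Roughly (again a sketch with invented register numbers):

  (set (mem:V2x8QI (reg:DI <base>))
       (unspec:V2x8QI [(reg:DI <src0>) (reg:DI <src1>)] UNSPEC_STP))

So from the RA's point of view a store pair is just an ordinary
single-set insn with one memory destination.
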
>
> ;; Writeback load/store pair patterns.
> @@ -2074,14 +2044,15 @@ (define_insn "*extendsidi2_aarch64"
>
> (define_insn "*load_pair_extendsidi2_aarch64"
> [(set (match_operand:DI 0 "register_operand" "=r")
> - (sign_extend:DI (match_operand:SI 1 "aarch64_mem_pair_operand" "Ump")))
> + (sign_extend:DI (unspec:SI [
> + (match_operand:V2x4QI 1 "aarch64_mem_pair_lanes_operand" "Umn")
> + ] UNSPEC_LDP_FST)))
> (set (match_operand:DI 2 "register_operand" "=r")
> - (sign_extend:DI (match_operand:SI 3 "memory_operand" "m")))]
> - "rtx_equal_p (XEXP (operands[3], 0),
> - plus_constant (Pmode,
> - XEXP (operands[1], 0),
> - GET_MODE_SIZE (SImode)))"
> - "ldpsw\\t%0, %2, %z1"
> + (sign_extend:DI (unspec:SI [
> + (match_dup 1)
> + ] UNSPEC_LDP_SND)))]
> + ""
> + "ldpsw\\t%0, %2, %y1"
> [(set_attr "type" "load_8")]
> )
>
> @@ -2101,16 +2072,17 @@ (define_insn "*zero_extendsidi2_aarch64"
>
> (define_insn "*load_pair_zero_extendsidi2_aarch64"
> [(set (match_operand:DI 0 "register_operand")
> - (zero_extend:DI (match_operand:SI 1 "aarch64_mem_pair_operand")))
> + (zero_extend:DI (unspec:SI [
> + (match_operand:V2x4QI 1 "aarch64_mem_pair_lanes_operand")
> + ] UNSPEC_LDP_FST)))
> (set (match_operand:DI 2 "register_operand")
> - (zero_extend:DI (match_operand:SI 3 "memory_operand")))]
> - "rtx_equal_p (XEXP (operands[3], 0),
> - plus_constant (Pmode,
> - XEXP (operands[1], 0),
> - GET_MODE_SIZE (SImode)))"
> - {@ [ cons: =0 , 1 , =2 , 3 ; attrs: type , arch ]
> - [ r , Ump , r , m ; load_8 , * ] ldp\t%w0, %w2, %z1
> - [ w , Ump , w , m ; neon_load1_2reg , fp ] ldp\t%s0, %s2, %z1
> + (zero_extend:DI (unspec:SI [
> + (match_dup 1)
> + ] UNSPEC_LDP_SND)))]
> + ""
> + {@ [ cons: =0 , 1 , =2; attrs: type , arch]
> + [ r , Umn , r ; load_8 , * ] ldp\t%w0, %w2, %y1
> + [ w , Umn , w ; neon_load1_2reg, fp ] ldp\t%s0, %s2, %y1
> }
> )
>
> diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md
> index a920de99ffc..fd8dd6db349 100644
> --- a/gcc/config/aarch64/iterators.md
> +++ b/gcc/config/aarch64/iterators.md
> @@ -1435,6 +1435,9 @@ (define_mode_attr VDBL [(V8QI "V16QI") (V4HI "V8HI")
> (SI "V2SI") (SF "V2SF")
> (DI "V2DI") (DF "V2DF")])
>
> +;; Load/store pair mode.
> +(define_mode_attr VPAIR [(SI "V2x4QI") (DI "V2x8QI")])
> +
> ;; Register suffix for double-length mode.
> (define_mode_attr Vdtype [(V4HF "8h") (V2SF "4s")])
>
> diff --git a/gcc/config/aarch64/predicates.md b/gcc/config/aarch64/predicates.md
> index b647e5af7c6..80f2e03d8de 100644
> --- a/gcc/config/aarch64/predicates.md
> +++ b/gcc/config/aarch64/predicates.md
> @@ -266,10 +266,12 @@ (define_special_predicate "aarch64_mem_pair_operator"
> (match_test "known_eq (GET_MODE_SIZE (mode),
> GET_MODE_SIZE (GET_MODE (op)))"))))
>
> -(define_predicate "aarch64_mem_pair_operand"
> - (and (match_code "mem")
> - (match_test "aarch64_legitimate_address_p (mode, XEXP (op, 0), false,
> - ADDR_QUERY_LDP_STP)")))
> +;; Like aarch64_mem_pair_operator, but additionally check the
> +;; address is suitable.
> +(define_special_predicate "aarch64_mem_pair_operand"
> + (and (match_operand 0 "aarch64_mem_pair_operator")
> +  (match_test "aarch64_legitimate_address_p (GET_MODE (op), XEXP (op, 0),
> + false, ADDR_QUERY_LDP_STP)")))
>
> (define_predicate "pmode_plus_operator"
> (and (match_code "plus")