Alex Coplan <alex.cop...@arm.com> writes: > On 12/12/2023 15:58, Richard Sandiford wrote: >> Alex Coplan <alex.cop...@arm.com> writes: >> > Hi, >> > >> > This is a v2 version which addresses feedback from Richard's review >> > here: >> > >> > https://gcc.gnu.org/pipermail/gcc-patches/2023-November/637648.html >> > >> > I'll reply inline to address specific comments. >> > >> > Bootstrapped/regtested on aarch64-linux-gnu, OK for trunk? >> > >> > Thanks, >> > Alex >> > >> > -- >8 -- >> > >> > This patch overhauls the load/store pair patterns with two main goals: >> > >> > 1. Fixing a correctness issue (the current patterns are not RA-friendly). >> > 2. Allowing more flexibility in which operand modes are supported, and >> > which >> > combinations of modes are allowed in the two arms of the load/store >> > pair, >> > while reducing the number of patterns required both in the source and in >> > the generated code. >> > >> > The correctness issue (1) is due to the fact that the current patterns have >> > two independent memory operands tied together only by a predicate on the >> > insns. >> > Since LRA only looks at the constraints, one of the memory operands can get >> > reloaded without the other one being changed, leading to the insn becoming >> > unrecognizable after reload. >> > >> > We fix this issue by changing the patterns such that they only ever have >> > one >> > memory operand representing the entire pair. For the store case, we use an >> > unspec to logically concatenate the register operands before storing them. >> > For the load case, we use unspecs to extract the "lanes" from the pair mem, >> > with the second occurrence of the mem matched using a match_dup (such that >> > there >> > is still really only one memory operand as far as the RA is concerned). >> > >> > In terms of the modes used for the pair memory operands, we canonicalize >> > these to V2x4QImode, V2x8QImode, and V2x16QImode. These modes have not >> > only the correct size but also correct alignment requirement for a >> > memory operand representing an entire load/store pair. Unlike the other >> > two, V2x4QImode didn't previously exist, so had to be added with the >> > patch. >> > >> > As with the previous patch generalizing the writeback patterns, this >> > patch aims to be flexible in the combinations of modes supported by the >> > patterns without requiring a large number of generated patterns by using >> > distinct mode iterators. >> > >> > The new scheme means we only need a single (generated) pattern for each >> > load/store operation of a given operand size. For the 4-byte and 8-byte >> > operand cases, we use the GPI iterator to synthesize the two patterns. >> > The 16-byte case is implemented as a separate pattern in the source (due >> > to only having a single possible alternative). >> > >> > Since the UNSPEC patterns can't be interpreted by the dwarf2cfi code, >> > we add REG_CFA_OFFSET notes to the store pair insns emitted by >> > aarch64_save_callee_saves, so that correct CFI information can still be >> > generated. Furthermore, we now unconditionally generate these CFA >> > notes on frame-related insns emitted by aarch64_save_callee_saves. >> > This is done in case that the load/store pair pass forms these into >> > pairs, in which case the CFA notes would be needed. >> > >> > We also adjust the ldp/stp peepholes to generate the new form. This is >> > done by switching the generation to use the >> > aarch64_gen_{load,store}_pair interface, making it easier to change the >> > form in the future if needed. (Likewise, the upcoming aarch64 >> > load/store pair pass also makes use of this interface). >> > >> > This patch also adds an "ldpstp" attribute to the non-writeback >> > load/store pair patterns, which is used by the post-RA load/store pair >> > pass to identify existing patterns and see if they can be promoted to >> > writeback variants. >> > >> > One potential concern with using unspecs for the patterns is that it can >> > block >> > optimization by the generic RTL passes. This patch series tries to >> > mitigate >> > this in two ways: >> > 1. The pre-RA load/store pair pass runs very late in the pre-RA pipeline. >> > 2. A later patch in the series adjusts the aarch64 mem{cpy,set} expansion >> > to >> > emit individual loads/stores instead of ldp/stp. These should then be >> > formed back into load/store pairs much later in the RTL pipeline by the >> > new load/store pair pass. >> > >> > gcc/ChangeLog: >> > >> > * config/aarch64/aarch64-ldpstp.md: Abstract ldp/stp >> > representation from peepholes, allowing use of new form. >> > * config/aarch64/aarch64-modes.def (V2x4QImode): Define. >> > * config/aarch64/aarch64-protos.h >> > (aarch64_finish_ldpstp_peephole): Declare. >> > (aarch64_swap_ldrstr_operands): Delete declaration. >> > (aarch64_gen_load_pair): Adjust parameters. >> > (aarch64_gen_store_pair): Likewise. >> > * config/aarch64/aarch64-simd.md >> > (load_pair<DREG:mode><DREG2:mode>): >> > Delete. >> > (vec_store_pair<DREG:mode><DREG2:mode>): Delete. >> > (load_pair<VQ:mode><VQ2:mode>): Delete. >> > (vec_store_pair<VQ:mode><VQ2:mode>): Delete. >> > * config/aarch64/aarch64.cc (aarch64_pair_mode_for_mode): New. >> > (aarch64_gen_store_pair): Adjust to use new unspec form of stp. >> > Drop second mem from parameters. >> > (aarch64_gen_load_pair): Likewise. >> > (aarch64_pair_mem_from_base): New. >> > (aarch64_save_callee_saves): Emit REG_CFA_OFFSET notes for >> > frame-related saves. Adjust call to aarch64_gen_store_pair >> > (aarch64_restore_callee_saves): Adjust calls to >> > aarch64_gen_load_pair to account for change in interface. >> > (aarch64_process_components): Likewise. >> > (aarch64_classify_address): Handle 32-byte pair mems in >> > LDP_STP_N case. >> > (aarch64_print_operand): Likewise. >> > (aarch64_copy_one_block_and_progress_pointers): Adjust calls to >> > account for change in aarch64_gen_{load,store}_pair interface. >> > (aarch64_set_one_block_and_progress_pointer): Likewise. >> > (aarch64_finish_ldpstp_peephole): New. >> > (aarch64_gen_adjusted_ldpstp): Adjust to use generation helper. >> > * config/aarch64/aarch64.md (ldpstp): New attribute. >> > (load_pair_sw_<SX:mode><SX2:mode>): Delete. >> > (load_pair_dw_<DX:mode><DX2:mode>): Delete. >> > (load_pair_dw_<TX:mode><TX2:mode>): Delete. >> > (*load_pair_<ldst_sz>): New. >> > (*load_pair_16): New. >> > (store_pair_sw_<SX:mode><SX2:mode>): Delete. >> > (store_pair_dw_<DX:mode><DX2:mode>): Delete. >> > (store_pair_dw_<TX:mode><TX2:mode>): Delete. >> > (*store_pair_<ldst_sz>): New. >> > (*store_pair_16): New. >> > (*load_pair_extendsidi2_aarch64): Adjust to use new form. >> > (*zero_extendsidi2_aarch64): Likewise. >> > * config/aarch64/iterators.md (VPAIR): New. >> > * config/aarch64/predicates.md (aarch64_mem_pair_operand): Change >> > to >> > a special predicate derived from aarch64_mem_pair_operator. >> >> Thanks, looks great. OK for trunk with a couple of trivial changes: > > Thanks a lot for the review, but I posted a v3 which has further changes > needed when rebasing on top of SME (as e.g. the SME code directly forms > the PARALLEL pair representation and invokes the old named patterns): > https://gcc.gnu.org/pipermail/gcc-patches/2023-December/639767.html
Yeah, sorry, only saw that there was a v3 after sending the review. > is the v3 OK with those changes? Yes, thanks. Makes for a nice simplification in aarch64_init_tpidr2_block. Richard > Thanks, > Alex > >> >> > [...] >> > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc >> > index 8faaa748a05..c2e4f531254 100644 >> > --- a/gcc/config/aarch64/aarch64.cc >> > +++ b/gcc/config/aarch64/aarch64.cc >> > @@ -6680,59 +6680,87 @@ aarch64_pop_regs (unsigned regno1, unsigned >> > regno2, HOST_WIDE_INT adjustment, >> > } >> > } >> > >> > -/* Generate and return a store pair instruction of mode MODE to store >> > - register REG1 to MEM1 and register REG2 to MEM2. */ >> > +/* Given an ldp/stp register operand mode MODE, return a suitable mode to >> > use >> > + for a mem rtx representing the entire pair. */ >> > >> > -static rtx >> > -aarch64_gen_store_pair (machine_mode mode, rtx mem1, rtx reg1, rtx mem2, >> > - rtx reg2) >> > -{ >> > - switch (mode) >> > - { >> > - case E_DImode: >> > - return gen_store_pair_dw_didi (mem1, reg1, mem2, reg2); >> > +static machine_mode >> > +aarch64_pair_mode_for_mode (machine_mode mode) >> > +{ >> > + if (known_eq (GET_MODE_SIZE (mode), 4)) >> > + return V2x4QImode; >> > + else if (known_eq (GET_MODE_SIZE (mode), 8)) >> > + return V2x8QImode; >> > + else if (known_eq (GET_MODE_SIZE (mode), 16)) >> > + return V2x16QImode; >> > + else >> > + gcc_unreachable (); >> > +} >> > >> > - case E_DFmode: >> > - return gen_store_pair_dw_dfdf (mem1, reg1, mem2, reg2); >> > +/* Given a base mem MEM with a mode suitable for an ldp/stp register >> > operand, >> > + return an rtx like MEM which instead represents the entire pair. */ >> >> Might be worth saying "mode and address suitable...", given the assert >> for aarch64_mem_pair_lanes_operand. >> >> > >> > - case E_TFmode: >> > - return gen_store_pair_dw_tftf (mem1, reg1, mem2, reg2); >> > +static rtx >> > +aarch64_pair_mem_from_base (rtx mem) >> > +{ >> > + auto pair_mode = aarch64_pair_mode_for_mode (GET_MODE (mem)); >> > + mem = adjust_bitfield_address_nv (mem, pair_mode, 0); >> > + gcc_assert (aarch64_mem_pair_lanes_operand (mem, pair_mode)); >> > + return mem; >> > +} >> > >> > - case E_V4SImode: >> > - return gen_vec_store_pairv4siv4si (mem1, reg1, mem2, reg2); >> > +/* Generate and return a store pair instruction to store REG1 and REG2 >> > + into memory starting at BASE_MEM. All three rtxes should have modes >> > of the >> > + same size. */ >> > >> > - case E_V16QImode: >> > - return gen_vec_store_pairv16qiv16qi (mem1, reg1, mem2, reg2); >> > +rtx >> > +aarch64_gen_store_pair (rtx base_mem, rtx reg1, rtx reg2) >> > +{ >> > + rtx pair_mem = aarch64_pair_mem_from_base (base_mem); >> > >> > - default: >> > - gcc_unreachable (); >> > - } >> > + return gen_rtx_SET (pair_mem, >> > + gen_rtx_UNSPEC (GET_MODE (pair_mem), >> > + gen_rtvec (2, reg1, reg2), >> > + UNSPEC_STP)); >> > } >> > >> > -/* Generate and regurn a load pair isntruction of mode MODE to load >> > register >> > - REG1 from MEM1 and register REG2 from MEM2. */ >> > +/* Generate and return a load pair instruction to load a pair of >> > + registers starting at BASE_MEM into REG1 and REG2. If CODE is >> > + UNKNOWN, all three rtxes should have modes of the same size. >> > + Otherwise, CODE is {SIGN,ZERO}_EXTEND, base_mem should be in SImode, >> > + and REG{1,2} should be in DImode. */ >> > >> > -static rtx >> > -aarch64_gen_load_pair (machine_mode mode, rtx reg1, rtx mem1, rtx reg2, >> > - rtx mem2) >> > +rtx >> > +aarch64_gen_load_pair (rtx reg1, rtx reg2, rtx base_mem, enum rtx_code >> > code) >> > { >> > - switch (mode) >> > - { >> > - case E_DImode: >> > - return gen_load_pair_dw_didi (reg1, mem1, reg2, mem2); >> > - >> > - case E_DFmode: >> > - return gen_load_pair_dw_dfdf (reg1, mem1, reg2, mem2); >> > + rtx pair_mem = aarch64_pair_mem_from_base (base_mem); >> > >> > - case E_TFmode: >> > - return gen_load_pair_dw_tftf (reg1, mem1, reg2, mem2); >> > + const bool any_extend_p = (code == ZERO_EXTEND || code == SIGN_EXTEND); >> > + if (any_extend_p) >> > + { >> > + gcc_checking_assert (GET_MODE (base_mem) == SImode >> > + && GET_MODE (reg1) == DImode >> > + && GET_MODE (reg2) == DImode); >> > + } >> >> No need for the braces. >> >> > + else >> > + gcc_assert (code == UNKNOWN); >> > + >> > + rtx unspecs[2] = { >> > + gen_rtx_UNSPEC (any_extend_p ? SImode : GET_MODE (reg1), >> > + gen_rtvec (1, pair_mem), >> > + UNSPEC_LDP_FST), >> > + gen_rtx_UNSPEC (any_extend_p ? SImode : GET_MODE (reg2), >> > + gen_rtvec (1, copy_rtx (pair_mem)), >> > + UNSPEC_LDP_SND) >> > + }; >> > >> > - case E_V4SImode: >> > - return gen_load_pairv4siv4si (reg1, mem1, reg2, mem2); >> > + if (any_extend_p) >> > + for (int i = 0; i < 2; i++) >> > + unspecs[i] = gen_rtx_fmt_e (code, DImode, unspecs[i]); >> > >> > - default: >> > - gcc_unreachable (); >> > - } >> > + return gen_rtx_PARALLEL (VOIDmode, >> > + gen_rtvec (2, >> > + gen_rtx_SET (reg1, unspecs[0]), >> > + gen_rtx_SET (reg2, unspecs[1]))); >> > } >> > >> > /* Return TRUE if return address signing should be enabled for the current >> > [...] >> > @@ -453,6 +456,11 @@ (define_attr "predicated" "yes,no" (const_string >> > "no")) >> > ;; may chose to hold the tracking state encoded in SP. >> > (define_attr "speculation_barrier" "true,false" (const_string "false")) >> > >> > +;; Attribute use to identify load pair and store pair instructions. >> >> used >> >> > +;; Currently the attribute is only applied to the non-writeback ldp/stp >> > +;; patterns. >> > +(define_attr "ldpstp" "ldp,stp,none" (const_string "none")) >> > + >> > ;; ------------------------------------------------------------------- >> > ;; Pipeline descriptions and scheduling >> > ;; -------------------------------------------------------------------