Alex Coplan <alex.cop...@arm.com> writes:
> On 12/12/2023 15:58, Richard Sandiford wrote:
>> Alex Coplan <alex.cop...@arm.com> writes:
>> > Hi,
>> >
>> > This is a v2 version which addresses feedback from Richard's review
>> > here:
>> >
>> > https://gcc.gnu.org/pipermail/gcc-patches/2023-November/637648.html
>> >
>> > I'll reply inline to address specific comments.
>> >
>> > Bootstrapped/regtested on aarch64-linux-gnu, OK for trunk?
>> >
>> > Thanks,
>> > Alex
>> >
>> > -- >8 --
>> >
>> > This patch overhauls the load/store pair patterns with two main goals:
>> >
>> > 1. Fixing a correctness issue (the current patterns are not RA-friendly).
>> > 2. Allowing more flexibility in which operand modes are supported, and which
>> >    combinations of modes are allowed in the two arms of the load/store pair,
>> >    while reducing the number of patterns required both in the source and in
>> >    the generated code.
>> >
>> > The correctness issue (1) is due to the fact that the current patterns have
>> > two independent memory operands tied together only by a predicate on the insns.
>> > Since LRA only looks at the constraints, one of the memory operands can get
>> > reloaded without the other one being changed, leading to the insn becoming
>> > unrecognizable after reload.
>> >
>> > We fix this issue by changing the patterns such that they only ever have one
>> > memory operand representing the entire pair.  For the store case, we use an
>> > unspec to logically concatenate the register operands before storing them.
>> > For the load case, we use unspecs to extract the "lanes" from the pair mem,
>> > with the second occurrence of the mem matched using a match_dup (such that
>> > there is still really only one memory operand as far as the RA is concerned).
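Schematically, the new shapes look as follows.  This is an illustrative sketch inferred from the generation code in the diff below (register and base names are placeholders), not the exact .md source:

```lisp
;; Store pair: a single pair-mode mem; the two registers are logically
;; concatenated by an unspec before being stored.
(set (mem:V2x8QI (reg:DI base))
     (unspec:V2x8QI [(reg:DI x0) (reg:DI x1)] UNSPEC_STP))

;; Load pair: unspecs extract the two "lanes" of the pair mem; the
;; second occurrence of the mem is a match_dup of the first, so the RA
;; only ever sees one memory operand.
(parallel
  [(set (reg:DI x0) (unspec:DI [(mem:V2x8QI (reg:DI base))] UNSPEC_LDP_FST))
   (set (reg:DI x1) (unspec:DI [(mem:V2x8QI (reg:DI base))] UNSPEC_LDP_SND))])
```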
>> >
>> > In terms of the modes used for the pair memory operands, we canonicalize
>> > these to V2x4QImode, V2x8QImode, and V2x16QImode.  These modes have not
>> > only the correct size but also correct alignment requirement for a
>> > memory operand representing an entire load/store pair.  Unlike the other
>> > two, V2x4QImode didn't previously exist, so had to be added with the
>> > patch.
>> >
>> > As with the previous patch generalizing the writeback patterns, this
>> > patch aims to be flexible in the combinations of modes supported by the
>> > patterns without requiring a large number of generated patterns by using
>> > distinct mode iterators.
>> >
>> > The new scheme means we only need a single (generated) pattern for each
>> > load/store operation of a given operand size.  For the 4-byte and 8-byte
>> > operand cases, we use the GPI iterator to synthesize the two patterns.
>> > The 16-byte case is implemented as a separate pattern in the source (due
>> > to only having a single possible alternative).
>> >
>> > Since the UNSPEC patterns can't be interpreted by the dwarf2cfi code,
>> > we add REG_CFA_OFFSET notes to the store pair insns emitted by
>> > aarch64_save_callee_saves, so that correct CFI information can still be
>> > generated.  Furthermore, we now unconditionally generate these CFA
>> > notes on frame-related insns emitted by aarch64_save_callee_saves.
>> > This is done in case the load/store pair pass forms these into
>> > pairs, in which case the CFA notes would be needed.
>> >
>> > We also adjust the ldp/stp peepholes to generate the new form.  This is
>> > done by switching the generation to use the
>> > aarch64_gen_{load,store}_pair interface, making it easier to change the
>> > form in the future if needed.  (Likewise, the upcoming aarch64
>> > load/store pair pass also makes use of this interface).
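For reference, a sketch of how a generation call changes under the new interface (variable names are illustrative; the signatures are as in the diff below):

```c
/* Before the patch each arm had its own mem:
     aarch64_gen_store_pair (mode, mem1, reg1, mem2, reg2);
   Afterwards a single base mem describes the whole pair, and the pair
   mem is derived internally via aarch64_pair_mem_from_base.  */
rtx store = aarch64_gen_store_pair (base_mem, reg1, reg2);
rtx load = aarch64_gen_load_pair (reg1, reg2, base_mem, UNKNOWN);
```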
>> >
>> > This patch also adds an "ldpstp" attribute to the non-writeback
>> > load/store pair patterns, which is used by the post-RA load/store pair
>> > pass to identify existing patterns and see if they can be promoted to
>> > writeback variants.
>> >
>> > One potential concern with using unspecs for the patterns is that it can
>> > block optimization by the generic RTL passes.  This patch series tries to
>> > mitigate this in two ways:
>> >  1. The pre-RA load/store pair pass runs very late in the pre-RA pipeline.
>> >  2. A later patch in the series adjusts the aarch64 mem{cpy,set} expansion to
>> >     emit individual loads/stores instead of ldp/stp.  These should then be
>> >     formed back into load/store pairs much later in the RTL pipeline by the
>> >     new load/store pair pass.
>> >
>> > gcc/ChangeLog:
>> >
>> >         * config/aarch64/aarch64-ldpstp.md: Abstract ldp/stp
>> >         representation from peepholes, allowing use of new form.
>> >         * config/aarch64/aarch64-modes.def (V2x4QImode): Define.
>> >         * config/aarch64/aarch64-protos.h
>> >         (aarch64_finish_ldpstp_peephole): Declare.
>> >         (aarch64_swap_ldrstr_operands): Delete declaration.
>> >         (aarch64_gen_load_pair): Adjust parameters.
>> >         (aarch64_gen_store_pair): Likewise.
>> >         * config/aarch64/aarch64-simd.md (load_pair<DREG:mode><DREG2:mode>):
>> >         Delete.
>> >         (vec_store_pair<DREG:mode><DREG2:mode>): Delete.
>> >         (load_pair<VQ:mode><VQ2:mode>): Delete.
>> >         (vec_store_pair<VQ:mode><VQ2:mode>): Delete.
>> >         * config/aarch64/aarch64.cc (aarch64_pair_mode_for_mode): New.
>> >         (aarch64_gen_store_pair): Adjust to use new unspec form of stp.
>> >         Drop second mem from parameters.
>> >         (aarch64_gen_load_pair): Likewise.
>> >         (aarch64_pair_mem_from_base): New.
>> >         (aarch64_save_callee_saves): Emit REG_CFA_OFFSET notes for
>> >         frame-related saves.  Adjust call to aarch64_gen_store_pair.
>> >         (aarch64_restore_callee_saves): Adjust calls to
>> >         aarch64_gen_load_pair to account for change in interface.
>> >         (aarch64_process_components): Likewise.
>> >         (aarch64_classify_address): Handle 32-byte pair mems in
>> >         LDP_STP_N case.
>> >         (aarch64_print_operand): Likewise.
>> >         (aarch64_copy_one_block_and_progress_pointers): Adjust calls to
>> >         account for change in aarch64_gen_{load,store}_pair interface.
>> >         (aarch64_set_one_block_and_progress_pointer): Likewise.
>> >         (aarch64_finish_ldpstp_peephole): New.
>> >         (aarch64_gen_adjusted_ldpstp): Adjust to use generation helper.
>> >         * config/aarch64/aarch64.md (ldpstp): New attribute.
>> >         (load_pair_sw_<SX:mode><SX2:mode>): Delete.
>> >         (load_pair_dw_<DX:mode><DX2:mode>): Delete.
>> >         (load_pair_dw_<TX:mode><TX2:mode>): Delete.
>> >         (*load_pair_<ldst_sz>): New.
>> >         (*load_pair_16): New.
>> >         (store_pair_sw_<SX:mode><SX2:mode>): Delete.
>> >         (store_pair_dw_<DX:mode><DX2:mode>): Delete.
>> >         (store_pair_dw_<TX:mode><TX2:mode>): Delete.
>> >         (*store_pair_<ldst_sz>): New.
>> >         (*store_pair_16): New.
>> >         (*load_pair_extendsidi2_aarch64): Adjust to use new form.
>> >         (*zero_extendsidi2_aarch64): Likewise.
>> >         * config/aarch64/iterators.md (VPAIR): New.
>> >         * config/aarch64/predicates.md (aarch64_mem_pair_operand): Change to
>> >         a special predicate derived from aarch64_mem_pair_operator.
>> 
>> Thanks, looks great.  OK for trunk with a couple of trivial changes:
>
> Thanks a lot for the review, but I posted a v3 which has further changes
> needed when rebasing on top of SME (as e.g. the SME code directly forms
> the PARALLEL pair representation and invokes the old named patterns):
> https://gcc.gnu.org/pipermail/gcc-patches/2023-December/639767.html

Yeah, sorry, only saw that there was a v3 after sending the review.

> Is the v3 OK with those changes?

Yes, thanks.  Makes for a nice simplification in aarch64_init_tpidr2_block.

Richard

> Thanks,
> Alex
>
>> 
>> > [...]
>> > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
>> > index 8faaa748a05..c2e4f531254 100644
>> > --- a/gcc/config/aarch64/aarch64.cc
>> > +++ b/gcc/config/aarch64/aarch64.cc
>> > @@ -6680,59 +6680,87 @@ aarch64_pop_regs (unsigned regno1, unsigned regno2, HOST_WIDE_INT adjustment,
>> >      }
>> >  }
>> >  
>> > -/* Generate and return a store pair instruction of mode MODE to store
>> > -   register REG1 to MEM1 and register REG2 to MEM2.  */
>> > +/* Given an ldp/stp register operand mode MODE, return a suitable mode to use
>> > +   for a mem rtx representing the entire pair.  */
>> >  
>> > -static rtx
>> > -aarch64_gen_store_pair (machine_mode mode, rtx mem1, rtx reg1, rtx mem2,
>> > -                  rtx reg2)
>> > -{
>> > -  switch (mode)
>> > -    {
>> > -    case E_DImode:
>> > -      return gen_store_pair_dw_didi (mem1, reg1, mem2, reg2);
>> > +static machine_mode
>> > +aarch64_pair_mode_for_mode (machine_mode mode)
>> > +{
>> > +  if (known_eq (GET_MODE_SIZE (mode), 4))
>> > +    return V2x4QImode;
>> > +  else if (known_eq (GET_MODE_SIZE (mode), 8))
>> > +    return V2x8QImode;
>> > +  else if (known_eq (GET_MODE_SIZE (mode), 16))
>> > +    return V2x16QImode;
>> > +  else
>> > +    gcc_unreachable ();
>> > +}
>> >  
>> > -    case E_DFmode:
>> > -      return gen_store_pair_dw_dfdf (mem1, reg1, mem2, reg2);
>> > +/* Given a base mem MEM with a mode suitable for an ldp/stp register operand,
>> > +   return an rtx like MEM which instead represents the entire pair.  */
>> 
>> Might be worth saying "mode and address suitable...", given the assert
>> for aarch64_mem_pair_lanes_operand.
>> 
>> >  
>> > -    case E_TFmode:
>> > -      return gen_store_pair_dw_tftf (mem1, reg1, mem2, reg2);
>> > +static rtx
>> > +aarch64_pair_mem_from_base (rtx mem)
>> > +{
>> > +  auto pair_mode = aarch64_pair_mode_for_mode (GET_MODE (mem));
>> > +  mem = adjust_bitfield_address_nv (mem, pair_mode, 0);
>> > +  gcc_assert (aarch64_mem_pair_lanes_operand (mem, pair_mode));
>> > +  return mem;
>> > +}
>> >  
>> > -    case E_V4SImode:
>> > -      return gen_vec_store_pairv4siv4si (mem1, reg1, mem2, reg2);
>> > +/* Generate and return a store pair instruction to store REG1 and REG2
>> > +   into memory starting at BASE_MEM.  All three rtxes should have modes of the
>> > +   same size.  */
>> >  
>> > -    case E_V16QImode:
>> > -      return gen_vec_store_pairv16qiv16qi (mem1, reg1, mem2, reg2);
>> > +rtx
>> > +aarch64_gen_store_pair (rtx base_mem, rtx reg1, rtx reg2)
>> > +{
>> > +  rtx pair_mem = aarch64_pair_mem_from_base (base_mem);
>> >  
>> > -    default:
>> > -      gcc_unreachable ();
>> > -    }
>> > +  return gen_rtx_SET (pair_mem,
>> > +                gen_rtx_UNSPEC (GET_MODE (pair_mem),
>> > +                                gen_rtvec (2, reg1, reg2),
>> > +                                UNSPEC_STP));
>> >  }
>> >  
>> > -/* Generate and regurn a load pair isntruction of mode MODE to load register
>> > -   REG1 from MEM1 and register REG2 from MEM2.  */
>> > +/* Generate and return a load pair instruction to load a pair of
>> > +   registers starting at BASE_MEM into REG1 and REG2.  If CODE is
>> > +   UNKNOWN, all three rtxes should have modes of the same size.
>> > +   Otherwise, CODE is {SIGN,ZERO}_EXTEND, base_mem should be in SImode,
>> > +   and REG{1,2} should be in DImode.  */
>> >  
>> > -static rtx
>> > -aarch64_gen_load_pair (machine_mode mode, rtx reg1, rtx mem1, rtx reg2,
>> > -                 rtx mem2)
>> > +rtx
>> > +aarch64_gen_load_pair (rtx reg1, rtx reg2, rtx base_mem, enum rtx_code code)
>> >  {
>> > -  switch (mode)
>> > -    {
>> > -    case E_DImode:
>> > -      return gen_load_pair_dw_didi (reg1, mem1, reg2, mem2);
>> > -
>> > -    case E_DFmode:
>> > -      return gen_load_pair_dw_dfdf (reg1, mem1, reg2, mem2);
>> > +  rtx pair_mem = aarch64_pair_mem_from_base (base_mem);
>> >  
>> > -    case E_TFmode:
>> > -      return gen_load_pair_dw_tftf (reg1, mem1, reg2, mem2);
>> > +  const bool any_extend_p = (code == ZERO_EXTEND || code == SIGN_EXTEND);
>> > +  if (any_extend_p)
>> > +    {
>> > +      gcc_checking_assert (GET_MODE (base_mem) == SImode
>> > +                     && GET_MODE (reg1) == DImode
>> > +                     && GET_MODE (reg2) == DImode);
>> > +    }
>> 
>> No need for the braces.
>> 
>> > +  else
>> > +    gcc_assert (code == UNKNOWN);
>> > +
>> > +  rtx unspecs[2] = {
>> > +    gen_rtx_UNSPEC (any_extend_p ? SImode : GET_MODE (reg1),
>> > +              gen_rtvec (1, pair_mem),
>> > +              UNSPEC_LDP_FST),
>> > +    gen_rtx_UNSPEC (any_extend_p ? SImode : GET_MODE (reg2),
>> > +              gen_rtvec (1, copy_rtx (pair_mem)),
>> > +              UNSPEC_LDP_SND)
>> > +  };
>> >  
>> > -    case E_V4SImode:
>> > -      return gen_load_pairv4siv4si (reg1, mem1, reg2, mem2);
>> > +  if (any_extend_p)
>> > +    for (int i = 0; i < 2; i++)
>> > +      unspecs[i] = gen_rtx_fmt_e (code, DImode, unspecs[i]);
>> >  
>> > -    default:
>> > -      gcc_unreachable ();
>> > -    }
>> > +  return gen_rtx_PARALLEL (VOIDmode,
>> > +                     gen_rtvec (2,
>> > +                                gen_rtx_SET (reg1, unspecs[0]),
>> > +                                gen_rtx_SET (reg2, unspecs[1])));
>> >  }
>> >  
>> >  /* Return TRUE if return address signing should be enabled for the current
>> > [...]
>> > @@ -453,6 +456,11 @@ (define_attr "predicated" "yes,no" (const_string "no"))
>> >  ;; may chose to hold the tracking state encoded in SP.
>> >  (define_attr "speculation_barrier" "true,false" (const_string "false"))
>> >  
>> > +;; Attribute use to identify load pair and store pair instructions.
>> 
>> used
>> 
>> > +;; Currently the attribute is only applied to the non-writeback ldp/stp
>> > +;; patterns.
>> > +(define_attr "ldpstp" "ldp,stp,none" (const_string "none"))
>> > +
>> >  ;; -------------------------------------------------------------------
>> >  ;; Pipeline descriptions and scheduling
>> >  ;; -------------------------------------------------------------------
