On 05/04/24 10:03 pm, Alex Coplan wrote:
> On 05/04/2024 13:53, Ajit Agarwal wrote:
>> Hello Alex/Richard:
>>
>> All review comments are incorporated.
>
> Thanks, I was kind-of expecting you to also send the renaming patch as a
> preparatory patch as we discussed.
>
> Sorry for another meta comment, but: I think the reason that the Linaro
> CI isn't running tests on your patches is actually because you're
> sending 1/3 of a series but not sending the rest of the series.
>
> So please can you either send this as an individual preparatory patch
> (not marked as a series) or if you're going to send a series (e.g. with
> a preparatory rename patch as 1/2 and this as 2/2) then send the entire
> series when you make updates. That way the CI should test your patches,
> which would be helpful.
>
Addressed.
>>
>> Common infrastructure of the load/store pair fusion pass is divided into
>> target-independent and target-dependent code.
>>
>> The target-independent code is generic code, using pure virtual functions
>> as the interface between the target-independent and target-dependent parts.
>>
>> The target-dependent code implements these pure virtual functions for the
>> aarch64 target and calls into the target-independent code.
>>
>> Thanks & Regards
>> Ajit
>>
>>
>> aarch64: Place target independent and dependent changed code in one file
>>
>> Common infrastructure of the load/store pair fusion pass is divided into
>> target-independent and target-dependent code.
>>
>> The target-independent code is generic code, using pure virtual functions
>> as the interface between the target-independent and target-dependent parts.
>>
>> The target-dependent code implements these pure virtual functions for the
>> aarch64 target and calls into the target-independent code.
>>
>> 2024-04-06 Ajit Kumar Agarwal <aagar...@linux.ibm.com>
>>
>> gcc/ChangeLog:
>>
>> * config/aarch64/aarch64-ldp-fusion.cc: Place target
>> independent and dependent changed code.
>
> You're going to need a proper ChangeLog eventually, but I guess there's
> no need for that right now.
>
>> ---
>> gcc/config/aarch64/aarch64-ldp-fusion.cc | 371 +++++++++++++++--------
>> 1 file changed, 249 insertions(+), 122 deletions(-)
>>
>> diff --git a/gcc/config/aarch64/aarch64-ldp-fusion.cc b/gcc/config/aarch64/aarch64-ldp-fusion.cc
>> index 22ed95eb743..cb21b514ef7 100644
>> --- a/gcc/config/aarch64/aarch64-ldp-fusion.cc
>> +++ b/gcc/config/aarch64/aarch64-ldp-fusion.cc
>> @@ -138,8 +138,122 @@ struct alt_base
>> poly_int64 offset;
>> };
>>
>> +// Virtual base class for load/store walkers used in alias analysis.
>> +struct alias_walker
>> +{
>> + virtual bool conflict_p (int &budget) const = 0;
>> + virtual insn_info *insn () const = 0;
>> + virtual bool valid () const = 0;
>
> Heh, looking at this made me realise there is a whitespace bug here in
> the existing code (double space after const). Sorry about that! I'll
> push an obvious fix for that.
>
>> + virtual void advance () = 0;
>> +};
>> +
>> +struct pair_fusion {
>> +
>> + pair_fusion () {};
>
> This ctor looks pointless at the moment. Perhaps instead we could put
> the contents of ldp_fusion_init in here and then delete that function?
>
Addressed.
>> + virtual bool fpsimd_op_p (rtx reg_op, machine_mode mem_mode,
>> + bool load_p) = 0;
>
> Please can we have comments above each of these virtual functions
> describing any parameters, what the purpose of the hook is, and the
> interpretation of the return value? This will serve as the
> documentation for other targets that want to make use of the pass.
>
> It might make sense to have a default-false implementation for
> fpsimd_op_p, especially if you don't want to make use of this bit for
> rs6000.
>
Addressed.
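To check my understanding of the suggestion, here is a minimal standalone sketch of a documented hook with a default-false implementation; the type and struct names below are stand-ins for the real GCC ones, not the actual pass code:

```cpp
#include <cassert>

// Stand-in for the real machine_mode; illustrative only.
enum machine_mode_t { MODE_INT_T, MODE_FLOAT_T };

struct pair_fusion_sketch
{
  // Return true if the access described by REG_OP/MEM_MODE should be
  // tracked as an FP/SIMD operation (LOAD_P says whether it is a load).
  // Default: targets that don't segregate FP/SIMD accesses (e.g. rs6000)
  // can simply inherit this and return false.
  virtual bool fpsimd_op_p (int /*reg_op*/, machine_mode_t /*mem_mode*/,
			    bool /*load_p*/)
  {
    return false;
  }
  virtual ~pair_fusion_sketch () {}
};

struct aarch64_like_sketch : pair_fusion_sketch
{
  bool fpsimd_op_p (int, machine_mode_t mem_mode, bool) override
  {
    // Placeholder for the real pre/post-RA register-class check.
    return mem_mode != MODE_INT_T;
  }
};
```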
>> +
>> + virtual bool pair_operand_mode_ok_p (machine_mode mode) = 0;
>> + virtual bool pair_trailing_writeback_p () = 0;
>
> Sorry for the run-around, but: I think this and
> handle_writeback_opportunities () should be the same function, either
> returning an enum or taking an enum and returning a boolean.
>
> At a minimum they should have more similar sounding names.
>
Addressed.
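Here is a rough standalone sketch of the single-hook shape I understood from this; the enum name and its values are invented for illustration, and the two wrappers just show how the old queries map onto the one hook:

```cpp
#include <cassert>

// Hypothetical merged hook: one enum answer instead of two bool hooks.
enum class writeback_policy { NEVER, EXISTING_ONLY, ALL };

struct pair_fusion_wb_sketch
{
  virtual writeback_policy writeback_opportunities () = 0;
  virtual ~pair_fusion_wb_sketch () {}

  // The two previous queries become thin non-virtual wrappers.
  bool handle_writeback_opportunities ()
  { return writeback_opportunities () != writeback_policy::NEVER; }
  bool pair_trailing_writeback_p ()
  { return writeback_opportunities () == writeback_policy::ALL; }
};

struct aarch64_wb_sketch : pair_fusion_wb_sketch
{
  int ldp_writeback;  // stands in for the aarch64_ldp_writeback param
  aarch64_wb_sketch (int wb) : ldp_writeback (wb) {}

  writeback_policy writeback_opportunities () override
  {
    if (ldp_writeback > 1)
      return writeback_policy::ALL;
    return ldp_writeback ? writeback_policy::EXISTING_ONLY
			 : writeback_policy::NEVER;
  }
};
```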
>> + virtual bool pair_reg_operand_ok_p (bool load_p, rtx reg_op,
>> + machine_mode mem_mode) = 0;
>> + virtual int pair_mem_alias_check_limit () = 0;
>> + virtual bool handle_writeback_opportunities () = 0 ;
>> + virtual bool pair_mem_ok_with_policy (rtx first_mem, bool load_p,
>> + machine_mode mode) = 0;
>> + virtual rtx gen_mem_pair (rtx *pats, rtx writeback,
>
> Nit: excess whitespace after pats,
>
>> + bool load_p) = 0;
>> + virtual bool pair_mem_promote_writeback_p (rtx pat) = 0;
>> + virtual bool track_load_p () = 0;
>> + virtual bool track_store_p () = 0;
>
> I think it would probably make more sense for these two to have
> default-true implementations rather than being pure virtual functions.
>
> Also, sorry for the bikeshedding, but please can we keep the plural
> names (so track_loads_p and track_stores_p).
>
>> + virtual bool cand_insns_empty_p (std::list<insn_info *> &insns) = 0;
>
> Why does this need to be virtualised? I would expect it to
> just be insns.empty () on all targets.
>
>> + virtual bool pair_mem_in_range_p (HOST_WIDE_INT off) = 0;
>> + void ldp_fusion_bb (bb_info *bb);
>
> Just to check, are you planning to rename this function in a separate
> patch?
>
Yes.
>> +
>> + ~pair_fusion() { }
>
> As with the ctor, perhaps we could put the contents of ldp_fusion_destroy
> in here and then delete that function?
>
Addressed.
>> +};
>> +
>> +struct aarch64_pair_fusion : public pair_fusion
>> +{
>> +public:
>
Addressed.
> Nit: there shouldn't be a need for this as public is already the default
> for structs.
>
>> + aarch64_pair_fusion () : pair_fusion () {};
>
> This ctor looks pointless, the default one should do, so can we remove
> this?
>
Addressed.
>> + bool fpsimd_op_p (rtx reg_op, machine_mode mem_mode,
>> + bool load_p) override final
>> + {
>> + const bool fpsimd_op_p
>> + = reload_completed
>> + ? (REG_P (reg_op) && FP_REGNUM_P (REGNO (reg_op)))
>> + : (GET_MODE_CLASS (mem_mode) != MODE_INT
>> + && (load_p || !aarch64_const_zero_rtx_p (reg_op)));
>> + return fpsimd_op_p;
>
> Can this just directly return the boolean expression instead of storing
> it into a variable first?
>
Addressed.
>> + }
>> +
>> + bool pair_mem_promote_writeback_p (rtx pat)
>> + {
>> + if (reload_completed
>> + && aarch64_ldp_writeback > 1
>> + && GET_CODE (pat) == PARALLEL
>> + && XVECLEN (pat, 0) == 2)
>> + return true;
>> +
>> + return false;
>> + }
>> +
>> + bool pair_mem_ok_with_policy (rtx first_mem, bool load_p,
>> + machine_mode mode)
>> + {
>> + return aarch64_mem_ok_with_ldpstp_policy_model (first_mem,
>> + load_p,
>> + mode);
>> + }
>> + bool pair_operand_mode_ok_p (machine_mode mode);
>
> Why have you implemented this (one-line function) out-of-line but the
> others are implemented inline? Please be consistent.
>
Addressed.
>> +
>> + rtx gen_mem_pair (rtx *pats, rtx writeback, bool load_p);
>> +
>> + bool pair_trailing_writeback_p ()
>> + {
>> + return aarch64_ldp_writeback > 1;
>> + }
>> + bool pair_reg_operand_ok_p (bool load_p, rtx reg_op,
>> + machine_mode mem_mode)
>> + {
>> + return (load_p
>> + ? aarch64_ldp_reg_operand (reg_op, mem_mode)
>> + : aarch64_stp_reg_operand (reg_op, mem_mode));
>> + }
>> + int pair_mem_alias_check_limit ()
>> + {
>> + return aarch64_ldp_alias_check_limit;
>> + }
>> + bool handle_writeback_opportunities ()
>> + {
>> + return aarch64_ldp_writeback;
>> + }
>> + bool track_load_p ()
>> + {
>> + const bool track_loads
>> = aarch64_tune_params.ldp_policy_model != AARCH64_LDP_STP_POLICY_NEVER;
>> + return track_loads;
>> + }
>
Addressed.
> As above, can this just return the boolean expression directly without
> the temporary?
>
>> + bool track_store_p ()
>> + {
>> + const bool track_stores
>> = aarch64_tune_params.stp_policy_model != AARCH64_LDP_STP_POLICY_NEVER;
>> + return track_stores;
>> + }
>
Addressed.
> Likewise.
>
>> + bool cand_insns_empty_p (std::list<insn_info *> &insns)
>> + {
>> + return insns.empty();
>> + }
>
> As mentioned on the declaration, I don't see why this needs to be
> virtualised.
>
>> + bool pair_mem_in_range_p (HOST_WIDE_INT off)
>> + {
>> + return (off < LDP_MIN_IMM || off > LDP_MAX_IMM);
>> + }
>
> Hmm, this looks to test exactly the opposite condition from what the
> name suggests, i.e. the implementation looks like what I would expect
> for a function called pair_mem_out_of_range_p.
>
> Please can you invert the condition and fix up callers appropriately?
>
> This suggests you're not really thinking about the code you're writing,
> please try to think a bit more carefully before posting patches.
>
Sorry, that was my mistake; I overlooked it.
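For the record, a standalone sketch of the inverted predicate; the bounds below are what I understand LDP_MIN_IMM/LDP_MAX_IMM to be in the pass (-64 and 63, the signed 7-bit scaled-immediate range), so please correct me if I have misread them:

```cpp
#include <cassert>

// Signed 7-bit scaled-immediate range used by the ldp/stp pass
// (values assumed from the LDP_MIN_IMM/LDP_MAX_IMM definitions).
const long LDP_MIN_IMM = -64;
const long LDP_MAX_IMM = 63;

// Now true exactly when OFF *is* representable, matching the name.
bool pair_mem_in_range_p (long off)
{
  return off >= LDP_MIN_IMM && off <= LDP_MAX_IMM;
}
```

Callers such as find_trailing_add would then punt on `!pair_mem_in_range_p (off_hwi)` rather than on the current positive test.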
>> +};
>> +
>> +
>> // State used by the pass for a given basic block.
>> -struct ldp_bb_info
>> +struct pair_fusion_bb_info
>> {
>> using def_hash = nofree_ptr_hash<def_info>;
>> using expr_key_t = pair_hash<tree_operand_hash, int_hash<int, -1, -2>>;
>> @@ -160,14 +274,17 @@ struct ldp_bb_info
>>
>> static const size_t obstack_alignment = sizeof (void *);
>> bb_info *m_bb;
>> + pair_fusion *bb_state;
>
Addressed.
> Can we call this m_pass instead of bb_state? The pair_fusion class is
> explicitly at the pass level rather than the bb level.
>
> Also I think this (and indeed m_bb, although that's pre-existing) could
> be private members. Not sure there's a good reason for making them
> public.
>
Addressed.
> It might be slightly nicer to use a reference for m_pass unless there's
> a good reason to use a pointer, although I don't feel strongly about
> this.
>
>>
>> - ldp_bb_info (bb_info *bb) : m_bb (bb), m_emitted_tombstone (false)
>> + pair_fusion_bb_info (bb_info *bb,
>> + aarch64_pair_fusion *d) : m_bb (bb),
>
> This generic class needs to take the generic pass type, i.e. just a
> pair_fusion * rather than an aarch64_pair_fusion *.
>
>> + bb_state (d), m_emitted_tombstone (false)
>> {
>> obstack_specify_allocation (&m_obstack, OBSTACK_CHUNK_SIZE,
>> obstack_alignment, obstack_chunk_alloc,
>> obstack_chunk_free);
>> }
>> - ~ldp_bb_info ()
>> + ~pair_fusion_bb_info ()
>> {
>> obstack_free (&m_obstack, nullptr);
>>
>> @@ -177,10 +294,32 @@ struct ldp_bb_info
>> bitmap_obstack_release (&m_bitmap_obstack);
>> }
>> }
>> + void track_access (insn_info *, bool load, rtx mem);
>> + void transform ();
>> + void cleanup_tombstones ();
>
> I think most (or indeed all?) of the changes here shouldn't be needed.
> AFAICS, there shouldn't be a need to:
> - drop inline from these functions.
> - change the accessibility of any members.
>
Addressed.
> Please can you drop these changes or explain why they're needed (from
> here until the end of the class).
>
>> + void merge_pairs (insn_list_t &, insn_list_t &,
>> + bool load_p, unsigned access_size);
>> + void transform_for_base (int load_size, access_group &group);
>> +
>> + bool try_fuse_pair (bool load_p, unsigned access_size,
>> + insn_info *i1, insn_info *i2);
>> +
>> + bool fuse_pair (bool load_p, unsigned access_size,
>> + int writeback,
>> + insn_info *i1, insn_info *i2,
>> + base_cand &base,
>> + const insn_range_info &move_range);
>>
>> - inline void track_access (insn_info *, bool load, rtx mem);
>> - inline void transform ();
>> - inline void cleanup_tombstones ();
>> + void do_alias_analysis (insn_info *alias_hazards[4],
>> + alias_walker *walkers[4],
>> + bool load_p);
>> +
>> + void track_tombstone (int uid);
>> +
>> + bool track_via_mem_expr (insn_info *, rtx mem, lfs_fields lfs);
>> +
>> + template<typename Map>
>> + void traverse_base_map (Map &map);
>>
>> private:
>> obstack m_obstack;
>> @@ -191,30 +330,32 @@ private:
>> bool m_emitted_tombstone;
>>
>> inline splay_tree_node<access_record *> *node_alloc (access_record *);
>> +};
>>
>> - template<typename Map>
>> - inline void traverse_base_map (Map &map);
>> - inline void transform_for_base (int load_size, access_group &group);
>> -
>> - inline void merge_pairs (insn_list_t &, insn_list_t &,
>> - bool load_p, unsigned access_size);
>> -
>> - inline bool try_fuse_pair (bool load_p, unsigned access_size,
>> - insn_info *i1, insn_info *i2);
>> -
>> - inline bool fuse_pair (bool load_p, unsigned access_size,
>> - int writeback,
>> - insn_info *i1, insn_info *i2,
>> - base_cand &base,
>> - const insn_range_info &move_range);
>> -
>> - inline void track_tombstone (int uid);
>> +rtx aarch64_pair_fusion::gen_mem_pair (rtx *pats,
>
> Style nit: newline after rtx.
>
Addressed.
>> + rtx writeback,
>> + bool load_p)
>> + {
>> + rtx pair_pat;
>>
>> - inline bool track_via_mem_expr (insn_info *, rtx mem, lfs_fields lfs);
>> -};
>> + if (writeback)
>> + {
>> + auto patvec = gen_rtvec (3, writeback, pats[0], pats[1]);
>> + pair_pat = gen_rtx_PARALLEL (VOIDmode, patvec);
>> + }
>> + else if (load_p)
>> + pair_pat = aarch64_gen_load_pair (XEXP (pats[0], 0),
>> + XEXP (pats[1], 0),
>> + XEXP (pats[0], 1));
>> + else
>> + pair_pat = aarch64_gen_store_pair (XEXP (pats[0], 0),
>> + XEXP (pats[0], 1),
>> + XEXP (pats[1], 1));
>> + return pair_pat;
>
> Can the assignments just be direct returns here, please? Then
> get rid of the temporary.
>
Addressed.
>> + }
>>
>> splay_tree_node<access_record *> *
>> -ldp_bb_info::node_alloc (access_record *access)
>> +pair_fusion_bb_info::node_alloc (access_record *access)
>> {
>> using T = splay_tree_node<access_record *>;
>> void *addr = obstack_alloc (&m_obstack, sizeof (T));
>> @@ -262,7 +403,7 @@ drop_writeback (rtx mem)
>> // RTX_AUTOINC addresses. The interface is like strip_offset except we take a
>> // MEM so that we know the mode of the access.
>> static rtx
>> -ldp_strip_offset (rtx mem, poly_int64 *offset)
>> +pair_mem_strip_offset (rtx mem, poly_int64 *offset)
>
> I thought we discussed splitting the renaming into a separate patch?
>
>> {
>> rtx addr = XEXP (mem, 0);
>>
>> @@ -332,6 +473,12 @@ ldp_operand_mode_ok_p (machine_mode mode)
>> return reload_completed || mode != TImode;
>> }
>>
>> +bool
>> +aarch64_pair_fusion::pair_operand_mode_ok_p (machine_mode mode)
>> +{
>> + return ldp_operand_mode_ok_p (mode);
>> +}
>> +
>> // Given LFS (load_p, fpsimd_p, size) fields in FIELDS, encode these
>> // into an integer for use as a hash table key.
>> static int
>> @@ -396,7 +543,7 @@ access_group::track (Alloc alloc_node, poly_int64 offset, insn_info *insn)
>> // MEM_EXPR base (i.e. a tree decl) relative to which we can track the access.
>> // LFS is used as part of the key to the hash table, see track_access.
>> bool
>> -ldp_bb_info::track_via_mem_expr (insn_info *insn, rtx mem, lfs_fields lfs)
>> +pair_fusion_bb_info::track_via_mem_expr (insn_info *insn, rtx mem, lfs_fields lfs)
>
> Again, please do any renaming in a separate patch.
>
Addressed.
>> {
>> if (!MEM_EXPR (mem) || !MEM_OFFSET_KNOWN_P (mem))
>> return false;
>> @@ -412,9 +559,10 @@ ldp_bb_info::track_via_mem_expr (insn_info *insn, rtx mem, lfs_fields lfs)
>> const machine_mode mem_mode = GET_MODE (mem);
>> const HOST_WIDE_INT mem_size = GET_MODE_SIZE (mem_mode).to_constant ();
>>
>> - // Punt on misaligned offsets. LDP/STP instructions require offsets to be a
>> - // multiple of the access size, and we believe that misaligned offsets on
>> - // MEM_EXPR bases are likely to lead to misaligned offsets w.r.t. RTL bases.
>> + // Punt on misaligned offsets. Paired memory access instructions require
>> + // offsets to be a multiple of the access size, and we believe that
>> + // misaligned offsets on MEM_EXPR bases are likely to lead to misaligned
>> + // offsets w.r.t. RTL bases.
>> if (!multiple_p (offset, mem_size))
>> return false;
>>
>> @@ -438,46 +586,38 @@ ldp_bb_info::track_via_mem_expr (insn_info *insn, rtx mem, lfs_fields lfs)
>> }
>>
>> // Main function to begin pair discovery. Given a memory access INSN,
>> -// determine whether it could be a candidate for fusing into an ldp/stp,
>> +// determine whether it could be a candidate for fusing into an pair mem,
>
> s/an pair mem/a paired access/
>
>> // and if so, track it in the appropriate data structure for this basic
>> // block. LOAD_P is true if the access is a load, and MEM is the mem
>> // rtx that occurs in INSN.
>> void
>> -ldp_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
>> +pair_fusion_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
>> {
>> // We can't combine volatile MEMs, so punt on these.
>> if (MEM_VOLATILE_P (mem))
>> return;
>>
>> - // Ignore writeback accesses if the param says to do so.
>> - if (!aarch64_ldp_writeback
>> + // Ignore writeback accesses if the param says to do so
>
Addressed.
> This comment needs updating, I guess something like "if the hook says to
> do so". But also please don't delete the full stop.
>
>> + if (!bb_state->handle_writeback_opportunities ()
>> && GET_RTX_CLASS (GET_CODE (XEXP (mem, 0))) == RTX_AUTOINC)
>> return;
>>
>> const machine_mode mem_mode = GET_MODE (mem);
>> - if (!ldp_operand_mode_ok_p (mem_mode))
>> +
>
> I don't think we need the extra whitespace here.
>
>> + if (!bb_state->pair_operand_mode_ok_p (mem_mode))
>> return;
>>
>> rtx reg_op = XEXP (PATTERN (insn->rtl ()), !load_p);
>>
>> - // Ignore the access if the register operand isn't suitable for ldp/stp.
>> - if (load_p
>> - ? !aarch64_ldp_reg_operand (reg_op, mem_mode)
>> - : !aarch64_stp_reg_operand (reg_op, mem_mode))
>> + if (!bb_state->pair_reg_operand_ok_p (load_p, reg_op, mem_mode))
>> return;
>> -
>> // We want to segregate FP/SIMD accesses from GPR accesses.
>> //
>> // Before RA, we use the modes, noting that stores of constant zero
>> // operands use GPRs (even in non-integer modes). After RA, we use
>> // the hard register numbers.
>> - const bool fpsimd_op_p
>> - = reload_completed
>> - ? (REG_P (reg_op) && FP_REGNUM_P (REGNO (reg_op)))
>> - : (GET_MODE_CLASS (mem_mode) != MODE_INT
>> - && (load_p || !aarch64_const_zero_rtx_p (reg_op)));
>> -
>> - // Note ldp_operand_mode_ok_p already rejected VL modes.
>> + const bool fpsimd_op_p = bb_state->fpsimd_op_p (reg_op, mem_mode, load_p);
>> + // Note pair_operand_mode_ok_p already rejected VL modes.
>> const HOST_WIDE_INT mem_size = GET_MODE_SIZE (mem_mode).to_constant ();
>> const lfs_fields lfs = { load_p, fpsimd_op_p, mem_size };
>>
>> @@ -487,7 +627,7 @@ ldp_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
>> poly_int64 mem_off;
>> rtx addr = XEXP (mem, 0);
>> const bool autoinc_p = GET_RTX_CLASS (GET_CODE (addr)) == RTX_AUTOINC;
>> - rtx base = ldp_strip_offset (mem, &mem_off);
>> + rtx base = pair_mem_strip_offset (mem, &mem_off);
>
> Again, let's split the renaming off into a separate patch.
>
Addressed.
>> if (!REG_P (base))
>> return;
>>
>> @@ -506,8 +646,8 @@ ldp_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
>> // elimination offset pre-RA, we should postpone forming pairs on such
>> // accesses until after RA.
>> //
>> - // As it stands, addresses with offsets in range for LDR but not
>> - // in range for LDP/STP are currently reloaded inefficiently,
>> + // As it stands, addresses in range for an individual load/store but not
>> + // for a paired access are currently reloaded inefficiently,
>> // ending up with a separate base register for each pair.
>> //
>> // In theory LRA should make use of
>> @@ -519,8 +659,8 @@ ldp_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
>> // that calls targetm.legitimize_address_displacement.
>> //
>> // So for now, it's better to punt when we can't be sure that the
>> - // offset is in range for LDP/STP. Out-of-range cases can then be
>> - // handled after RA by the out-of-range LDP/STP peepholes. Eventually, it
>> + // offset is in range for paired access. Out-of-range cases can then be
>> + // handled after RA by the out-of-range PAIR MEM peepholes. Eventually, it
>> // would be nice to handle known out-of-range opportunities in the
>> // pass itself (for stack accesses, this would be in the post-RA pass).
>> if (!reload_completed
>> @@ -573,8 +713,8 @@ ldp_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
>> gcc_unreachable (); // Base defs should be unique.
>> }
>>
>> - // Punt on misaligned offsets. LDP/STP require offsets to be a multiple
>> of
>> - // the access size.
>> + // Punt on misaligned offsets. Paired memory accesses require offsets
>> + // to be a multiple of the access size.
>> if (!multiple_p (mem_off, mem_size))
>> return;
>>
>> @@ -614,7 +754,7 @@ static bool no_ignore (insn_info *) { return false; }
>> // making use of alias disambiguation.
>> static insn_info *
>> latest_hazard_before (insn_info *insn, rtx *ignore,
>> - insn_info *ignore_insn = nullptr)
>> + insn_info *ignore_insn = 0)
>
> Is this change needed?
>
Addressed.
>> {
>> insn_info *result = nullptr;
>>
>> @@ -1150,7 +1290,7 @@ extract_writebacks (bool load_p, rtx pats[2], int changed)
>> const bool autoinc_p = GET_RTX_CLASS (GET_CODE (addr)) == RTX_AUTOINC;
>>
>> poly_int64 offset;
>> - rtx this_base = ldp_strip_offset (mem, &offset);
>> + rtx this_base = pair_mem_strip_offset (mem, &offset);
>> gcc_assert (REG_P (this_base));
>> if (base_reg)
>> gcc_assert (rtx_equal_p (base_reg, this_base));
>> @@ -1286,7 +1426,11 @@ find_trailing_add (insn_info *insns[2],
>>
>> off_hwi /= access_size;
>>
>> - if (off_hwi < LDP_MIN_IMM || off_hwi > LDP_MAX_IMM)
>> + pair_fusion *pfuse;
>> + aarch64_pair_fusion derived;
>> + pfuse = &derived;
>
> Huh? We'll want to use the same instance of the pass structure (pair_fusion)
> that gets created at the top level, so that should get passed around
> everywhere.
>
> In the case of find_trailing_add we probably don't want to add any more
> parameters (as Segher noted there are already too many), so perhaps the
> easiest thing would be to make find_trailing_add itself a (non-virtual)
> member function of pair_fusion.
>
> Then here we can just query pair_mem_in_range_p directly.
>
> For try_promote_writeback itself you can either make it a non-virtual
> member function of pair_fusion or just pass a reference to the pass
> object in when you call it.
>
> Note also that you don't have to create a pointer of the parent type in
> order to call a virtual function on a derived object, you can just call
> it directly.
>
Addressed.
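Here is a minimal sketch of the shape I understood: the helper becomes a non-virtual member of pair_fusion that queries the virtual hook on whatever instance the pass created, instead of instantiating a fresh derived object locally (all names and the scaling logic are illustrative, not the real find_trailing_add):

```cpp
#include <cassert>

struct pair_fusion_sk
{
  virtual bool pair_mem_in_range_p (long off) = 0;
  virtual ~pair_fusion_sk () {}

  // Non-virtual shared logic: scale the byte offset and ask the target
  // hook on *this*, so no extra parameters and no ad-hoc derived objects.
  bool trailing_add_offset_ok_p (long off_hwi, long access_size)
  {
    return pair_mem_in_range_p (off_hwi / access_size);
  }
};

struct aarch64_sk : pair_fusion_sk
{
  // Assumed ldp/stp signed 7-bit scaled range.
  bool pair_mem_in_range_p (long off) override
  { return off >= -64 && off <= 63; }
};
```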
>> +
>> + if (pfuse->pair_mem_in_range_p (off_hwi))
>> return nullptr;
>
> Note of course this condition needs inverting when you invert the
> implementation.
>
Addressed.
>>
>> auto dump_prefix = [&]()
>> @@ -1328,7 +1472,7 @@ find_trailing_add (insn_info *insns[2],
>> // We just emitted a tombstone with uid UID, track it in a bitmap for
>> // this BB so we can easily identify it later when cleaning up tombstones.
>> void
>> -ldp_bb_info::track_tombstone (int uid)
>> +pair_fusion_bb_info::track_tombstone (int uid)
>
> Again, please split the renaming off. Same with other instances below.
>
>> {
>> if (!m_emitted_tombstone)
>> {
>> @@ -1528,7 +1672,7 @@ fixup_debug_uses (obstack_watermark &attempt,
>> gcc_checking_assert (GET_RTX_CLASS (GET_CODE (XEXP (mem, 0)))
>> == RTX_AUTOINC);
>>
>> - base = ldp_strip_offset (mem, &offset);
>> + base = pair_mem_strip_offset (mem, &offset);
>> gcc_checking_assert (REG_P (base) && REGNO (base) == base_regno);
>> }
>> fixup_debug_use (attempt, use, def, base, offset);
>> @@ -1664,7 +1808,7 @@ fixup_debug_uses (obstack_watermark &attempt,
>> // BASE gives the chosen base candidate for the pair and MOVE_RANGE is
>> // a singleton range which says where to place the pair.
>> bool
>> -ldp_bb_info::fuse_pair (bool load_p,
>> +pair_fusion_bb_info::fuse_pair (bool load_p,
>> unsigned access_size,
>> int writeback,
>> insn_info *i1, insn_info *i2,
>> @@ -1800,7 +1944,7 @@ ldp_bb_info::fuse_pair (bool load_p,
>> {
>> if (dump_file)
>> fprintf (dump_file,
>> - " ldp: i%d has wb but subsequent i%d has non-wb "
>> + " load pair: i%d has wb but subsequent i%d has non-wb "
>> "update of base (r%d), dropping wb\n",
>> insns[0]->uid (), insns[1]->uid (), base_regno);
>> gcc_assert (writeback_effect);
>> @@ -1823,7 +1967,7 @@ ldp_bb_info::fuse_pair (bool load_p,
>> }
>>
>> // If either of the original insns had writeback, but the resulting pair insn
>> - // does not (can happen e.g. in the ldp edge case above, or if the writeback
>> + // does not (can happen e.g. in the load pair edge case above, or if the writeback
>> // effects cancel out), then drop the def(s) of the base register as
>> // appropriate.
>> //
>> @@ -1842,7 +1986,7 @@ ldp_bb_info::fuse_pair (bool load_p,
>> // update of the base register and try and fold it in to make this into a
>> // writeback pair.
>> insn_info *trailing_add = nullptr;
>> - if (aarch64_ldp_writeback > 1
>> + if (bb_state->pair_trailing_writeback_p ()
>> && !writeback_effect
>> && (!load_p || (!refers_to_regno_p (base_regno, base_regno + 1,
>> XEXP (pats[0], 0), nullptr)
>> @@ -1863,14 +2007,14 @@ ldp_bb_info::fuse_pair (bool load_p,
>> }
>>
>> // Now that we know what base mem we're going to use, check if it's OK
>> - // with the ldp/stp policy.
>> + // with the pair mem policy.
>> rtx first_mem = XEXP (pats[0], load_p);
>> - if (!aarch64_mem_ok_with_ldpstp_policy_model (first_mem,
>> - load_p,
>> - GET_MODE (first_mem)))
>> + if (!bb_state->pair_mem_ok_with_policy (first_mem,
>> + load_p,
>> + GET_MODE (first_mem)))
>> {
>> if (dump_file)
>> - fprintf (dump_file, "punting on pair (%d,%d), ldp/stp policy says no\n",
>> + fprintf (dump_file, "punting on pair (%d,%d), pair mem policy says no\n",
>> i1->uid (), i2->uid ());
>> return false;
>> }
>> @@ -1878,21 +2022,10 @@ ldp_bb_info::fuse_pair (bool load_p,
>> rtx reg_notes = combine_reg_notes (first, second, load_p);
>>
>> rtx pair_pat;
>> - if (writeback_effect)
>> - {
>> - auto patvec = gen_rtvec (3, writeback_effect, pats[0], pats[1]);
>> - pair_pat = gen_rtx_PARALLEL (VOIDmode, patvec);
>> - }
>> - else if (load_p)
>> - pair_pat = aarch64_gen_load_pair (XEXP (pats[0], 0),
>> - XEXP (pats[1], 0),
>> - XEXP (pats[0], 1));
>> - else
>> - pair_pat = aarch64_gen_store_pair (XEXP (pats[0], 0),
>> - XEXP (pats[0], 1),
>> - XEXP (pats[1], 1));
>>
>> + pair_pat = bb_state->gen_mem_pair (pats, writeback_effect, load_p);
>> insn_change *pair_change = nullptr;
>> +
>> auto set_pair_pat = [pair_pat,reg_notes](insn_change *change) {
>> rtx_insn *rti = change->insn ()->rtl ();
>> validate_unshare_change (rti, &PATTERN (rti), pair_pat, true);
>> @@ -2133,15 +2266,6 @@ load_modified_by_store_p (insn_info *load,
>> return false;
>> }
>>
>> -// Virtual base class for load/store walkers used in alias analysis.
>> -struct alias_walker
>> -{
>> - virtual bool conflict_p (int &budget) const = 0;
>> - virtual insn_info *insn () const = 0;
>> - virtual bool valid () const = 0;
>> - virtual void advance () = 0;
>> -};
>> -
>> // Implement some common functionality used by both store_walker
>> // and load_walker.
>> template<bool reverse>
>> @@ -2259,13 +2383,13 @@ public:
>> //
>> // We try to maintain the invariant that if a walker becomes invalid, we
>> // set its pointer to null.
>> -static void
>> -do_alias_analysis (insn_info *alias_hazards[4],
>> +void
>> +pair_fusion_bb_info::do_alias_analysis (insn_info *alias_hazards[4],
>
> It would probably make more sense for this to be a (non-virtual) member
> function of the pair_fusion class, since it doesn't need to access the
> BB-specific state.
>
Addressed.
>> alias_walker *walkers[4],
>> bool load_p)
>> {
>> const int n_walkers = 2 + (2 * !load_p);
>> - int budget = aarch64_ldp_alias_check_limit;
>> + int budget = bb_state->pair_mem_alias_check_limit ();
>>
>> auto next_walker = [walkers,n_walkers](int current) -> int {
>> for (int j = 1; j <= n_walkers; j++)
>> @@ -2365,7 +2489,7 @@ get_viable_bases (insn_info *insns[2],
>> {
>> const bool is_lower = (i == reversed);
>> poly_int64 poly_off;
>> - rtx base = ldp_strip_offset (cand_mems[i], &poly_off);
>> + rtx base = pair_mem_strip_offset (cand_mems[i], &poly_off);
>> if (GET_RTX_CLASS (GET_CODE (XEXP (cand_mems[i], 0))) == RTX_AUTOINC)
>> writeback |= (1 << i);
>>
>> @@ -2373,7 +2497,7 @@ get_viable_bases (insn_info *insns[2],
>> continue;
>>
>> // Punt on accesses relative to eliminable regs. See the comment in
>> - // ldp_bb_info::track_access for a detailed explanation of this.
>> + // pair_fusion_bb_info::track_access for a detailed explanation of this.
>> if (!reload_completed
>> && (REGNO (base) == FRAME_POINTER_REGNUM
>> || REGNO (base) == ARG_POINTER_REGNUM))
>> @@ -2397,7 +2521,11 @@ get_viable_bases (insn_info *insns[2],
>> if (!is_lower)
>> base_off--;
>>
>> - if (base_off < LDP_MIN_IMM || base_off > LDP_MAX_IMM)
>> + pair_fusion *pfuse;
>> + aarch64_pair_fusion derived;
>> + pfuse = &derived;
>
> Again, please don't do this but instead use the same instance
> everywhere.
>
Addressed.
>> +
>> + if (pfuse->pair_mem_in_range_p (base_off))
>> continue;
>>
>> use_info *use = find_access (insns[i]->uses (), REGNO (base));
>> @@ -2454,12 +2582,12 @@ get_viable_bases (insn_info *insns[2],
>> }
>>
>> // Given two adjacent memory accesses of the same size, I1 and I2, try
>> -// and see if we can merge them into a ldp or stp.
>> +// and see if we can merge them into a paired accesses load and store.
>
> "into a load pair or store pair" or "into a paired access" if you
> prefer.
>
Addressed.
>> //
>> // ACCESS_SIZE gives the (common) size of a single access, LOAD_P is true
>> // if the accesses are both loads, otherwise they are both stores.
>> bool
>> -ldp_bb_info::try_fuse_pair (bool load_p, unsigned access_size,
>> +pair_fusion_bb_info::try_fuse_pair (bool load_p, unsigned access_size,
>> insn_info *i1, insn_info *i2)
>> {
>> if (dump_file)
>> @@ -2494,7 +2622,7 @@ ldp_bb_info::try_fuse_pair (bool load_p, unsigned access_size,
>> {
>> if (dump_file)
>> fprintf (dump_file,
>> - "punting on ldp due to reg conflcits (%d,%d)\n",
>> + "punting on pair mem load due to reg conflcits (%d,%d)\n",
>> insns[0]->uid (), insns[1]->uid ());
>> return false;
>> }
>> @@ -2843,7 +2971,7 @@ debug (const insn_list_t &l)
>> // we can't re-order them anyway, so provided earlier passes have cleaned up
>> // redundant loads, we shouldn't miss opportunities by doing this.
>> void
>> -ldp_bb_info::merge_pairs (insn_list_t &left_list,
>> +pair_fusion_bb_info::merge_pairs (insn_list_t &left_list,
>> insn_list_t &right_list,
>> bool load_p,
>> unsigned access_size)
>> @@ -2890,8 +3018,8 @@ ldp_bb_info::merge_pairs (insn_list_t &left_list,
>> // of accesses. If we find two sets of adjacent accesses, call
>> // merge_pairs.
>> void
>> -ldp_bb_info::transform_for_base (int encoded_lfs,
>> - access_group &group)
>> +pair_fusion_bb_info::transform_for_base (int encoded_lfs,
>> + access_group &group)
>> {
>> const auto lfs = decode_lfs (encoded_lfs);
>> const unsigned access_size = lfs.size;
>> @@ -2909,7 +3037,7 @@ ldp_bb_info::transform_for_base (int encoded_lfs,
>> access.cand_insns,
>> lfs.load_p,
>> access_size);
>> - skip_next = access.cand_insns.empty ();
>> + skip_next = bb_state->cand_insns_empty_p (access.cand_insns);
>
> As above, why is this needed?
For rs6000 we want this to always return true: on rs6000, candidate
load/store pairs to be merged occur with access sizes of 8/16, 16/32 and
32/64, but we only want to form pairs for the 8/16 and 32/64 cases.
Always returning true for rs6000 lets us skip such pairs as above.
>
>> }
>> prev_access = &access;
>> }
>> @@ -2919,7 +3047,7 @@ ldp_bb_info::transform_for_base (int encoded_lfs,
>> // and remove all the tombstone insns, being sure to reparent any uses
>> // of mem to previous defs when we do this.
>> void
>> -ldp_bb_info::cleanup_tombstones ()
>> +pair_fusion_bb_info::cleanup_tombstones ()
>> {
>> // No need to do anything if we didn't emit a tombstone insn for this BB.
>> if (!m_emitted_tombstone)
>> @@ -2947,7 +3075,7 @@ ldp_bb_info::cleanup_tombstones ()
>>
>> template<typename Map>
>> void
>> -ldp_bb_info::traverse_base_map (Map &map)
>> +pair_fusion_bb_info::traverse_base_map (Map &map)
>> {
>> for (auto kv : map)
>> {
>> @@ -2958,7 +3086,7 @@ ldp_bb_info::traverse_base_map (Map &map)
>> }
>>
>> void
>> -ldp_bb_info::transform ()
>> +pair_fusion_bb_info::transform ()
>> {
>> traverse_base_map (expr_map);
>> traverse_base_map (def_map);
>> @@ -3167,14 +3295,13 @@ try_promote_writeback (insn_info *insn)
>> // for load/store candidates. If running after RA, also try and promote
>> // non-writeback pairs to use writeback addressing. Then try to fuse
>> // candidates into pairs.
>> -void ldp_fusion_bb (bb_info *bb)
>> +void pair_fusion::ldp_fusion_bb (bb_info *bb)
>> {
>> - const bool track_loads
>> - = aarch64_tune_params.ldp_policy_model != AARCH64_LDP_STP_POLICY_NEVER;
>> - const bool track_stores
>> - = aarch64_tune_params.stp_policy_model != AARCH64_LDP_STP_POLICY_NEVER;
>> + const bool track_loads = track_load_p ();
>> + const bool track_stores = track_store_p ();
>>
>> - ldp_bb_info bb_state (bb);
>
> This:
>
>> + aarch64_pair_fusion derived;
>
> can be deleted and then:
>
>> + pair_fusion_bb_info bb_info (bb, &derived);
>
> can just be:
>
> pair_fusion_bb_info bb_info (bb, this);
>
> (or you can pass *this if you make bb_info take a reference).
>
> I don't think there's a particular need to change the variable name
> (bb_state -> bb_info). I chose the former because it doesn't clash
> with the RTL-SSA structure of the same name as the latter.
>
Addressed.
>>
>> for (auto insn : bb->nondebug_insns ())
>> {
>> @@ -3184,31 +3311,31 @@ void ldp_fusion_bb (bb_info *bb)
>> continue;
>>
>> rtx pat = PATTERN (rti);
>> - if (reload_completed
>> - && aarch64_ldp_writeback > 1
>> - && GET_CODE (pat) == PARALLEL
>> - && XVECLEN (pat, 0) == 2)
>> + if (pair_mem_promote_writeback_p (pat))
>> try_promote_writeback (insn);
>
> It looks like try_promote_writeback itself will need some further work
> to make it target-independent. I suppose this check:
>
> auto rti = insn->rtl ();
> const auto attr = get_attr_ldpstp (rti);
> if (attr == LDPSTP_NONE)
> return;
>
> bool load_p = (attr == LDPSTP_LDP);
> gcc_checking_assert (load_p || attr == LDPSTP_STP);
>
> will need to become part of the pair_mem_promote_writeback_p hook that you
> added, potentially changing it to return a boolean for load_p.
>
> Then I guess we will need hooks for destructuring the pair insn and
> another hook to wrap aarch64_gen_writeback_pair.
>
Addressed.
>>
>> if (GET_CODE (pat) != SET)
>> continue;
>>
>> if (track_stores && MEM_P (XEXP (pat, 0)))
>> - bb_state.track_access (insn, false, XEXP (pat, 0));
>> + bb_info.track_access (insn, false, XEXP (pat, 0));
>> else if (track_loads && MEM_P (XEXP (pat, 1)))
>> - bb_state.track_access (insn, true, XEXP (pat, 1));
>> + bb_info.track_access (insn, true, XEXP (pat, 1));
>> }
>>
>> - bb_state.transform ();
>> - bb_state.cleanup_tombstones ();
>> + bb_info.transform ();
>> + bb_info.cleanup_tombstones ();
>> }
>>
>> void ldp_fusion ()
>> {
>> ldp_fusion_init ();
>> + pair_fusion *pfuse;
>> + aarch64_pair_fusion derived;
>> + pfuse = &derived;
>
> This is indeed the one place where I think it is acceptable to
> instantiate aarch64_pair_fusion. But again there's no need to create a
> pointer to the parent class, just call any function you like directly.
>
Addressed.
>>
>> for (auto bb : crtl->ssa->bbs ())
>> - ldp_fusion_bb (bb);
>> + pfuse->ldp_fusion_bb (bb);
>
> I think even the code to iterate over bbs should itself be a member
> function of pair_fusion (say "run") and then that becomes part of the
> generic code.
>
> So this function would just become:
>
> aarch64_pair_fusion pass;
> pass.run ();
>
> and could be inlined into the caller.
>
Addressed.
> Perhaps you could also add an early return like:
>
> if (!track_loads_p () && !track_stores_p ())
> return;
>
> in pair_fusion::run () and then remove the corresponding code from
> pass_ldp_fusion::gate?
>
Addressed.
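Putting the run () and early-return suggestions together, here is a self-contained sketch of the generic driver; basic blocks are reduced to a plain count so the example stands alone, and the names mirror (but are not) the real pass:

```cpp
#include <cassert>

struct pair_fusion_run_sk
{
  virtual bool track_loads_p () { return true; }
  virtual bool track_stores_p () { return true; }
  virtual ~pair_fusion_run_sk () {}

  int bbs_seen = 0;          // stands in for per-BB fusion work
  void fuse_bb (int /*bb*/) { ++bbs_seen; }

  // Generic entry point: bail out early if the target tracks neither
  // loads nor stores, then walk every basic block (here just 0..n-1,
  // instead of crtl->ssa->bbs ()).
  void run (int n_bbs)
  {
    if (!track_loads_p () && !track_stores_p ())
      return;
    for (int bb = 0; bb < n_bbs; ++bb)
      fuse_bb (bb);
  }
};

// A target that opts out entirely exercises the early return.
struct never_track_sk : pair_fusion_run_sk
{
  bool track_loads_p () override { return false; }
  bool track_stores_p () override { return false; }
};
```

The caller then reduces to constructing the derived pass object and calling run () on it directly.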
>>
>> ldp_fusion_destroy ();
>> }
>> --
>> 2.39.3
>>
>
> Thanks,
> Alex
Thanks & Regards
Ajit