On 22/04/2024 13:01, Ajit Agarwal wrote:
> Hello Alex:
>
> On 14/04/24 10:29 pm, Ajit Agarwal wrote:
> > Hello Alex:
> >
> > On 12/04/24 11:02 pm, Ajit Agarwal wrote:
> >> Hello Alex:
> >>
> >> On 12/04/24 8:15 pm, Alex Coplan wrote:
> >>> On 12/04/2024 20:02, Ajit Agarwal wrote:
> >>>> Hello Alex:
> >>>>
> >>>> On 11/04/24 7:55 pm, Alex Coplan wrote:
> >>>>> On 10/04/2024 23:48, Ajit Agarwal wrote:
> >>>>>> Hello Alex:
> >>>>>>
> >>>>>> On 10/04/24 7:52 pm, Alex Coplan wrote:
> >>>>>>> Hi Ajit,
> >>>>>>>
> >>>>>>> On 10/04/2024 15:31, Ajit Agarwal wrote:
> >>>>>>>> Hello Alex:
> >>>>>>>>
> >>>>>>>> On 10/04/24 1:42 pm, Alex Coplan wrote:
> >>>>>>>>> Hi Ajit,
> >>>>>>>>>
> >>>>>>>>> On 09/04/2024 20:59, Ajit Agarwal wrote:
> >>>>>>>>>> Hello Alex:
> >>>>>>>>>>
> >>>>>>>>>> On 09/04/24 8:39 pm, Alex Coplan wrote:
> >>>>>>>>>>> On 09/04/2024 20:01, Ajit Agarwal wrote:
> >>>>>>>>>>>> Hello Alex:
> >>>>>>>>>>>>
> >>>>>>>>>>>> On 09/04/24 7:29 pm, Alex Coplan wrote:
> >>>>>>>>>>>>> On 09/04/2024 17:30, Ajit Agarwal wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On 05/04/24 10:03 pm, Alex Coplan wrote:
> >>>>>>>>>>>>>>> On 05/04/2024 13:53, Ajit Agarwal wrote:
> >>>>>>>>>>>>>>>> Hello Alex/Richard:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> All review comments are incorporated.
> >>>>>>> <snip>
> >>>>>>>>>>>>>>>> @@ -2890,8 +3018,8 @@ ldp_bb_info::merge_pairs (insn_list_t &left_list,
> >>>>>>>>>>>>>>>>  // of accesses.  If we find two sets of adjacent accesses, call
> >>>>>>>>>>>>>>>>  // merge_pairs.
> >>>>>>>>>>>>>>>>  void
> >>>>>>>>>>>>>>>> -ldp_bb_info::transform_for_base (int encoded_lfs,
> >>>>>>>>>>>>>>>> -                                 access_group &group)
> >>>>>>>>>>>>>>>> +pair_fusion_bb_info::transform_for_base (int encoded_lfs,
> >>>>>>>>>>>>>>>> +                                          access_group &group)
> >>>>>>>>>>>>>>>>  {
> >>>>>>>>>>>>>>>>    const auto lfs = decode_lfs (encoded_lfs);
> >>>>>>>>>>>>>>>>    const unsigned access_size = lfs.size;
> >>>>>>>>>>>>>>>> @@ -2909,7 +3037,7 @@ ldp_bb_info::transform_for_base (int encoded_lfs,
> >>>>>>>>>>>>>>>>                          access.cand_insns,
> >>>>>>>>>>>>>>>>                          lfs.load_p,
> >>>>>>>>>>>>>>>>                          access_size);
> >>>>>>>>>>>>>>>> -         skip_next = access.cand_insns.empty ();
> >>>>>>>>>>>>>>>> +         skip_next = bb_state->cand_insns_empty_p (access.cand_insns);
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> As above, why is this needed?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> For rs6000 we want to return always true, as load store pairs
> >>>>>>>>>>>>>> that are to be merged with 8/16 16/32 32/64 are occurring for
> >>>>>>>>>>>>>> rs6000, and we want the load store pairs to be 8/16 32/64.
> >>>>>>>>>>>>>> That's why we want to generate always true for rs6000 to skip
> >>>>>>>>>>>>>> pairs as above.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Hmm, sorry, I'm not sure I follow.  Are you saying that for
> >>>>>>>>>>>>> rs6000 you have load/store pair instructions where the two arms
> >>>>>>>>>>>>> of the access are storing operands of different sizes?  Or
> >>>>>>>>>>>>> something else?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> As it stands the logic is to skip the next iteration only if we
> >>>>>>>>>>>>> exhausted all the candidate insns for the current access.
> >>>>>>>>>>>>> In the case that we didn't exhaust all such candidates, then
> >>>>>>>>>>>>> the idea is that when access becomes prev_access, we can
> >>>>>>>>>>>>> attempt to use those candidates as the "left-hand side" of a
> >>>>>>>>>>>>> pair in the next iteration since we failed to use them as the
> >>>>>>>>>>>>> "right-hand side" of a pair in the current iteration.  I don't
> >>>>>>>>>>>>> see why you wouldn't want that behaviour.  Please can you
> >>>>>>>>>>>>> explain?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> In merge_pair we get the 2 load candidates, one load from offset
> >>>>>>>>>>>> 0 and the other load from offset 16.  Then in the next iteration
> >>>>>>>>>>>> we get a load from offset 16 and the other load from offset 32.
> >>>>>>>>>>>> In the next iteration we get a load from offset 32 and the other
> >>>>>>>>>>>> load from offset 48.
> >>>>>>>>>>>>
> >>>>>>>>>>>> For example:
> >>>>>>>>>>>>
> >>>>>>>>>>>> Currently we get the load candidates as follows.
> >>>>>>>>>>>>
> >>>>>>>>>>>> pairs:
> >>>>>>>>>>>>
> >>>>>>>>>>>> load from offset 0.
> >>>>>>>>>>>> load from offset 16.
> >>>>>>>>>>>>
> >>>>>>>>>>>> next pairs:
> >>>>>>>>>>>>
> >>>>>>>>>>>> load from offset 16.
> >>>>>>>>>>>> load from offset 32.
> >>>>>>>>>>>>
> >>>>>>>>>>>> next pairs:
> >>>>>>>>>>>>
> >>>>>>>>>>>> load from offset 32.
> >>>>>>>>>>>> load from offset 48.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Instead in rs6000 we should get:
> >>>>>>>>>>>>
> >>>>>>>>>>>> pairs:
> >>>>>>>>>>>>
> >>>>>>>>>>>> load from offset 0.
> >>>>>>>>>>>> load from offset 16.
> >>>>>>>>>>>>
> >>>>>>>>>>>> next pairs:
> >>>>>>>>>>>>
> >>>>>>>>>>>> load from offset 32.
> >>>>>>>>>>>> load from offset 48.
> >>>>>>>>>>>
> >>>>>>>>>>> Hmm, so then I guess my question is: why wouldn't you consider
> >>>>>>>>>>> merging the pair with offsets (16,32) for rs6000?  Is it because
> >>>>>>>>>>> you have a stricter alignment requirement on the base pair
> >>>>>>>>>>> offsets (i.e. they have to be a multiple of 32 when the operand
> >>>>>>>>>>> size is 16)?  So the pair offsets have to be a multiple of the
> >>>>>>>>>>> entire pair size rather than a single operand size?
> >>>>>>>>>>
> >>>>>>>>>> We get a load pair at a certain point with (0,16) and at another
> >>>>>>>>>> program point we get a load pair (32, 48).
> >>>>>>>>>>
> >>>>>>>>>> In the current implementation it takes the offset loads as
> >>>>>>>>>> (0, 16), (16, 32), (32, 48).
> >>>>>>>>>>
> >>>>>>>>>> But in rs6000 we want the load pairs to be merged at different
> >>>>>>>>>> points, as (0,16) and (32, 48).  For (0,16) we want to replace the
> >>>>>>>>>> loads with an lxvp with offset 0, and the other loads (32, 48)
> >>>>>>>>>> with an lxvp with offset 32.
> >>>>>>>>>>
> >>>>>>>>>> In the current case it will merge with an lxvp with offset 0 and
> >>>>>>>>>> an lxvp with offset 16, then an lxvp with offset 32 and an lxvp
> >>>>>>>>>> with offset 48, which is incorrect in our case, as in the (16,32)
> >>>>>>>>>> case the load at offset 16 will not load from an even register and
> >>>>>>>>>> breaks for rs6000.
> >>>>>>>>>
> >>>>>>>>> Sorry, I think I'm still missing something here.  Why does the
> >>>>>>>>> address offset affect the parity of the transfer register?  ISTM
> >>>>>>>>> they needn't be related at all (and indeed we can't even know the
> >>>>>>>>> parity of the transfer register before RA, but perhaps you're only
> >>>>>>>>> intending on running the pass after RA?)
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> We have a load pair with (0,16) wherein these loads are adjacent and
> >>>>>>>> replaced with lxvp.
> >>>>>>>>
> >>>>>>>> The semantic of the lxvp instruction is that it loads the adjacent
> >>>>>>>> load pair into an even register and even_register + 1.
> >>>>>>>>
> >>>>>>>> We replace the above load pair with an lxvp instruction and then we
> >>>>>>>> don't need to merge (16,32) as (0, 16) is already merged, and
> >>>>>>>> instead we merge (32,48).
> >>>>>>>
> >>>>>>> Ok, but the existing logic should already account for this.  I.e. if
> >>>>>>> we successfully merge (0,16), then we don't attempt to merge (16,32).
> >>>>>>> We'd only attempt to merge (16,32) if the merge of (0,16) failed (for
> >>>>>>> whatever reason).  So I don't see that there's anything specific to
> >>>>>>> lxvp that requires this logic to change, _unless_ you have a stricter
> >>>>>>> alignment requirement on the offsets as I mentioned before.
> >>>>>>>
> >>>>>>
> >>>>>> Thanks for the suggestion.  It worked for rs6000 also with the current
> >>>>>> changes.  Sorry for the confusion.
> >>>>>
> >>>>> Alright, glad we got to the bottom of this!
> >>>>
> >>>> Thanks.
> >>>>>
> >>>>>>
> >>>>>>>>
> >>>>>>>> Yes you are correct, the address offset doesn't affect the parity of
> >>>>>>>> the register transfer as we are doing the fusion pass before RA.
> >>>>>>>>>>> If that is the case then I think it would be better to introduce a
> >>>>>>>>>>> virtual function (say pair_offset_alignment_ok_p) that vets the base
> >>>>>>>>>>> offset of the pair (prev_access->offset in transform_for_base).  I guess
> >>>>>>>>>>> it would also take access_size as a parameter and for aarch64 it should
> >>>>>>>>>>> check:
> >>>>>>>>>>>
> >>>>>>>>>>>   multiple_p (offset, access_size)
> >>>>>>>>>>>
> >>>>>>>>>>> and for rs6000 it could check:
> >>>>>>>>>>>
> >>>>>>>>>>>   multiple_p (offset, access_size * 2)
> >>>>>>>>>>>
> >>>>>>>>>>> and we would then incorporate a call to that predicate in the else if
> >>>>>>>>>>> condition of transform_for_base.
> >>>>>>>>>>>
> >>>>>>>>>>> It would have the same limitation whereby we assume that MEM_EXPR offset
> >>>>>>>>>>> alignment is a good proxy for RTL offset alignment, but we already
> >>>>>>>>>>> depend on that with the multiple_p check in track_via_mem_expr.
> >>>>>>>>
> >>>>>>>> I have addressed the above hooks and it worked fine with both rs6000
> >>>>>>>> and aarch64.  I am sending a subsequent patch in some time that
> >>>>>>>> addresses the above.
> >>>>>>>>
> >>>>>>>>> How do you plan on handling this even-odd requirement for rs6000?
> >>>>>>>>
> >>>>>>>> We plan to handle it with V16QI subreg:0 and V16QI subreg:16 to
> >>>>>>>> generate the register pair, and that's what we generate and
> >>>>>>>> implement in the rs6000 target code.
> >>>>>>>
> >>>>>>> Ah, this is coming back to me now.  Sorry, I should have remembered
> >>>>>>> this from the previous discussion with Richard S.
> >>>>>>>
> >>>>>>> Apologies for going on a slight tangent here, but if you're running
> >>>>>>> before RA are you planning to create a new OImode pseudo register for
> >>>>>>> the lxvp insn and then somehow update uses of the old transfer
> >>>>>>> registers to replace them with subregs of that OImode pseudo?
> >>>>>>
> >>>>>> Yes I do the same as you have mentioned.
> >>>>>> We generate register pairs with a 256 bit mode with two subregs of
> >>>>>> 128 bit modes at offsets 0 and 16.
> >>>>>>
> >>>>>>> Or do you just plan on replacing the individual loads with moves
> >>>>>>> (instead of deleting them)?  I guess the latter would be simpler and
> >>>>>>> might work better in the presence of hard regs.
> >>>>>>>
> >>>>>>
> >>>>>> Would you mind explaining how to generate register pairs with lxvp by
> >>>>>> replacing loads with moves?
> >>>>>
> >>>>> Yeah, so suppose you have something like:
> >>>>>
> >>>>> (set (reg:V16QI v1) (mem:V16QI addr))
> >>>>> (set (reg:V16QI v2) (mem:V16QI addr+16))
> >>>>>
> >>>>> then when you insert the lxvp you can then (logically) replace the
> >>>>> original load instructions with moves from the appropriate subreg, as
> >>>>> follows:
> >>>>>
> >>>>> (set (reg:OI pair-pseudo) (mem:OI addr)) ; lxvp
> >>>>> (set (reg:V16QI v1) (subreg:V16QI (reg:OI pair-pseudo) 0))
> >>>>> (set (reg:V16QI v2) (subreg:V16QI (reg:OI pair-pseudo) 16))
> >>>>>
> >>>>
> >>>> Any pseudo created with gen_rtx_REG, like
> >>>> gen_rtx_REG (OOmode, REGNO (dest_exp)), errors out with an
> >>>> unrecognized insn in LRA.
> >>>
> >>> I'm not surprised that goes wrong: you can't just create a new REG
> >>> rtx in a different mode and reuse the regno of an existing pseudo.
> >>>
> >
> > Thanks for the suggestion.
> >>>>
> >>>> If I create a pseudo with gen_reg_rtx (OOmode), it errors out because
> >>>> the new pseudo register is not found in change->new_defs.
> >>>
> >>> Yeah, I suppose you'd need to add an RTL-SSA def for the new pseudo.
> >>>
> >>
> >> Would you mind explaining how I can add an RTL-SSA def for the
> >> new pseudo?
> >
> > I have added an RTL-SSA def for the new pseudo.  With that I could
> > get register pairs correctly.
> >>
> >>>>
> >>>> Also the sequential register pairs are not generated by the
> >>>> Register Allocator.
> >>>
> >>> So how do you get the sequential pairs with your current approach?  My
> >>> understanding was that what I suggested above doesn't really change what
> >>> you're doing w.r.t. the lxvp insn itself, but maybe I didn't properly
> >>> understand the approach taken in your initial patchset.
> >>>
> >>
> >> I generate (set (reg:OO pair-pseudo) (mem:OI addr)) ; lxvp
> >> and then at the use points of pair_pseudo generate the following:
> >>
> >> (subreg:V16QI (reg:OI pair-pseudo) 0)
> >> (subreg:V16QI (reg:OI pair-pseudo) 16)
> >>
> >
> > I get register pairs correctly generating the RTL as you have
> > suggested, but we get extra moves that would impact the performance.
> >
> > Please let me know what do you think.
> >
> > For the below testcase:
>
> #include <altivec.h>
>
> void
> foo (__vector_quad *dst, vector unsigned char *ptr, vector unsigned char src)
> {
>   __vector_quad acc;
>   __builtin_mma_xvf32ger(&acc, src, ptr[0]);
>   __builtin_mma_xvf32gerpp(&acc, src, ptr[1]);
>   *dst = acc;
> }
>
> My earlier implementation without moves generates the following assembly
> for the above testcase:
>
> .LFB0:
>         .cfi_startproc
>         .localentry     _Z3fooPu13__vector_quadPDv16_hS1_,1
>         lxvp %vs32,0(%r4)
>         xvf32ger 0,%vs34,%vs33
>         xvf32gerpp 0,%vs34,%vs32
>         xxmfacc 0
>         stxvp %vs2,0(%r3)
>         stxvp %vs0,32(%r3)
>         blr
>
> With the moves you have suggested I get the below generated assembly:
>
> .LFB0:
>         .cfi_startproc
>         .localentry     _Z3fooPu13__vector_quadPDv16_hS1_,1
>         lxvp %vs0,0(%r4)
>         xxlor %vs33,%vs1,%vs1
>         xxlor %vs32,%vs0,%vs0
>         xvf32ger 0,%vs34,%vs33
>         xvf32gerpp 0,%vs34,%vs32
>         xxmfacc 0
>         stxvp %vs2,0(%r3)
>         stxvp %vs0,32(%r3)
>         blr
>
> As you see, with register moves two extra xxlor instructions are
> generated.
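[A rough sketch of the move-based lowering being compared above.  The helper
name and the plain start_sequence/emit_insn usage (outside the pass's RTL-SSA
bookkeeping, which is also where the new pseudo's def would need to be
recorded) are illustrative assumptions, not code from the posted patch:]

  /* Illustrative sketch only: given the two original V16QI destinations
     DEST0/DEST1 and the address ADDR of the lower half, build the paired
     load (which becomes lxvp on rs6000) plus the two register moves.  */
  static rtx_insn *
  build_pair_load_with_moves (rtx dest0, rtx dest1, rtx addr)
  {
    start_sequence ();
    /* Fresh 256-bit pseudo holding both halves of the pair.  */
    rtx pair = gen_reg_rtx (OOmode);
    emit_insn (gen_rtx_SET (pair, gen_rtx_MEM (OOmode, addr)));
    /* The original loads become moves from the two 128-bit halves.  */
    emit_insn (gen_rtx_SET (dest0,
			    simplify_gen_subreg (V16QImode, pair, OOmode, 0)));
    emit_insn (gen_rtx_SET (dest1,
			    simplify_gen_subreg (V16QImode, pair, OOmode, 16)));
    rtx_insn *insns = get_insns ();
    end_sequence ();
    return insns;
  }

It is this pair of subreg moves that shows up as the two extra xxlor
instructions in the second listing when the register allocator does not
manage to coalesce them away.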
I see.  There's no need to change your approach for now, then, it was
just a suggestion.  I was hoping the RA would be able to eliminate the
redundant moves, but it looks like that isn't happening here.  Richard S
might have some pointers for you w.r.t. this approach when he gets back.

Sorry for the delay in getting back to you, I was busy with other things
last week.  I'll try to look at the latest version of the patch you sent
for the pass.

Thanks,
Alex

>
> Please let me know what do you think.
>
> Thanks & Regards
> Ajit
> > Thanks & Regards
> > Ajit
> >> Thanks & Regards
> >> Ajit
> >>> Thanks,
> >>> Alex
> >>>
> >>>>
> >>>> That's why I haven't used the above method, as I also thought it
> >>>> through.
> >>>>
> >>>> Please let me know what do you think.
> >>>>
> >>>> Thanks & Regards
> >>>> Ajit
> >>>>> now I'm not sure off-hand if this exact combination of subregs and mode
> >>>>> changes is valid, but hopefully you get the idea.  The benefit of this
> >>>>> approach is that it keeps the transformation local and is more
> >>>>> compile-time efficient (we don't have to look through all mentions of
> >>>>> v1/v2 and replace them with subregs).  I'd hope that the RA can then
> >>>>> clean up any redundant moves (especially in the case that v1/v2 are
> >>>>> pseudos).
> >>>>>
> >>>>> That would mean that you don't need all the grubbing around with DF
> >>>>> looking for occurrences of the transfer regs.  Instead we'd just need
> >>>>> some logic to insert the extra moves after the lxvp for rs6000, and I
> >>>>> think this might fit more cleanly into the current pass.
> >>>>>
> >>>>> Does that make sense?  In any case, this shouldn't really affect the
> >>>>> preparatory aarch64 patch because we were planning to defer adding any
> >>>>> hooks that are only needed for rs6000 from the initial aarch64/generic
> >>>>> split.
> >>>>>
> >>>>> Thanks,
> >>>>> Alex
> >>>>>
> >>>>>>
> >>>>>> Thanks & Regards
> >>>>>> Ajit
> >>>>>>
> >>>>>>> Thanks,
> >>>>>>> Alex
> >>>>>>>
> >>>>>>>>
> >>>>>>>> Thanks & Regards
> >>>>>>>> Ajit
> >>>>>>>>> Thanks,
> >>>>>>>>> Alex
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> lxvp should load from even registers and then the loaded value
> >>>>>>>>>> will be in the even register and even register + 1 (which is odd).
> >>>>>>>>>>
> >>>>>>>>>> Thanks & Regards
> >>>>>>>>>> Ajit
> >>>>>>>>>>> If that is the case then I think it would be better to introduce a
> >>>>>>>>>>> virtual function (say pair_offset_alignment_ok_p) that vets the base
> >>>>>>>>>>> offset of the pair (prev_access->offset in transform_for_base).  I guess
> >>>>>>>>>>> it would also take access_size as a parameter and for aarch64 it should
> >>>>>>>>>>> check:
> >>>>>>>>>>>
> >>>>>>>>>>>   multiple_p (offset, access_size)
> >>>>>>>>>>>
> >>>>>>>>>>> and for rs6000 it could check:
> >>>>>>>>>>>
> >>>>>>>>>>>   multiple_p (offset, access_size * 2)
> >>>>>>>>>>>
> >>>>>>>>>>> and we would then incorporate a call to that predicate in the else if
> >>>>>>>>>>> condition of transform_for_base.
> >>>>>>>>>>>
> >>>>>>>>>>> It would have the same limitation whereby we assume that MEM_EXPR offset
> >>>>>>>>>>> alignment is a good proxy for RTL offset alignment, but we already
> >>>>>>>>>>> depend on that with the multiple_p check in track_via_mem_expr.
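[A minimal sketch of the hook suggested in the quoted text above.  The hook
and class names follow the thread (pair_offset_alignment_ok_p, the generic
pair_fusion interface, aarch64_pair_fusion); the rs6000_pair_fusion name and
the exact signatures are assumptions, not the final patch:]

  // In the generic pair_fusion interface: vet the base offset of a
  // candidate pair (prev_access->offset in transform_for_base).
  virtual bool pair_offset_alignment_ok_p (poly_int64 offset,
					   unsigned access_size) = 0;

  // aarch64: the pair offset only needs to be a multiple of one operand.
  bool
  aarch64_pair_fusion::pair_offset_alignment_ok_p (poly_int64 offset,
						   unsigned access_size)
  {
    return multiple_p (offset, access_size);
  }

  // rs6000: lxvp/stxvp want the offset aligned to the whole pair, so
  // (0,16) and (32,48) merge but (16,32) does not.
  bool
  rs6000_pair_fusion::pair_offset_alignment_ok_p (poly_int64 offset,
						  unsigned access_size)
  {
    return multiple_p (offset, access_size * 2);
  }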
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks,
> >>>>>>>>>>> Alex
> >>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thanks & Regards
> >>>>>>>>>>>> Ajit
> >>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>> Alex
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>      }
> >>>>>>>>>>>>>>>>    prev_access = &access;
> >>>>>>>>>>>>>>>>  }
> >>>>>>>>>>>>>>>> @@ -2919,7 +3047,7 @@ ldp_bb_info::transform_for_base (int encoded_lfs,
> >>>>>>>>>>>>>>>>  // and remove all the tombstone insns, being sure to reparent any uses
> >>>>>>>>>>>>>>>>  // of mem to previous defs when we do this.
> >>>>>>>>>>>>>>>>  void
> >>>>>>>>>>>>>>>> -ldp_bb_info::cleanup_tombstones ()
> >>>>>>>>>>>>>>>> +pair_fusion_bb_info::cleanup_tombstones ()
> >>>>>>>>>>>>>>>>  {
> >>>>>>>>>>>>>>>>    // No need to do anything if we didn't emit a tombstone insn for this BB.
> >>>>>>>>>>>>>>>>    if (!m_emitted_tombstone)
> >>>>>>>>>>>>>>>> @@ -2947,7 +3075,7 @@ ldp_bb_info::cleanup_tombstones ()
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>  template<typename Map>
> >>>>>>>>>>>>>>>>  void
> >>>>>>>>>>>>>>>> -ldp_bb_info::traverse_base_map (Map &map)
> >>>>>>>>>>>>>>>> +pair_fusion_bb_info::traverse_base_map (Map &map)
> >>>>>>>>>>>>>>>>  {
> >>>>>>>>>>>>>>>>    for (auto kv : map)
> >>>>>>>>>>>>>>>>      {
> >>>>>>>>>>>>>>>> @@ -2958,7 +3086,7 @@ ldp_bb_info::traverse_base_map (Map &map)
> >>>>>>>>>>>>>>>>  }
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>  void
> >>>>>>>>>>>>>>>> -ldp_bb_info::transform ()
> >>>>>>>>>>>>>>>> +pair_fusion_bb_info::transform ()
> >>>>>>>>>>>>>>>>  {
> >>>>>>>>>>>>>>>>    traverse_base_map (expr_map);
> >>>>>>>>>>>>>>>>    traverse_base_map (def_map);
> >>>>>>>>>>>>>>>> @@ -3167,14 +3295,13 @@ try_promote_writeback (insn_info *insn)
> >>>>>>>>>>>>>>>>  // for load/store candidates.  If running after RA, also try and promote
> >>>>>>>>>>>>>>>>  // non-writeback pairs to use writeback addressing.  Then try to fuse
> >>>>>>>>>>>>>>>>  // candidates into pairs.
> >>>>>>>>>>>>>>>> -void ldp_fusion_bb (bb_info *bb)
> >>>>>>>>>>>>>>>> +void pair_fusion::ldp_fusion_bb (bb_info *bb)
> >>>>>>>>>>>>>>>>  {
> >>>>>>>>>>>>>>>> -  const bool track_loads
> >>>>>>>>>>>>>>>> -    = aarch64_tune_params.ldp_policy_model != AARCH64_LDP_STP_POLICY_NEVER;
> >>>>>>>>>>>>>>>> -  const bool track_stores
> >>>>>>>>>>>>>>>> -    = aarch64_tune_params.stp_policy_model != AARCH64_LDP_STP_POLICY_NEVER;
> >>>>>>>>>>>>>>>> +  const bool track_loads = track_load_p ();
> >>>>>>>>>>>>>>>> +  const bool track_stores = track_store_p ();
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> -  ldp_bb_info bb_state (bb);
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> This:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> +  aarch64_pair_fusion derived;
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> can be deleted and then:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> +  pair_fusion_bb_info bb_info (bb, &derived);
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> can just be:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>   pair_fusion_bb_info bb_info (bb, this);
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> (or you can pass *this if you make bb_info take a reference).
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I don't think there's a particular need to change the variable name
> >>>>>>>>>>>>>>> (bb_state -> bb_info).  I chose the former because it doesn't clash
> >>>>>>>>>>>>>>> with the RTL-SSA structure of the same name as the latter.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Addressed.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>    for (auto insn : bb->nondebug_insns ())
> >>>>>>>>>>>>>>>>      {
> >>>>>>>>>>>>>>>> @@ -3184,31 +3311,31 @@ void ldp_fusion_bb (bb_info *bb)
> >>>>>>>>>>>>>>>>          continue;
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>        rtx pat = PATTERN (rti);
> >>>>>>>>>>>>>>>> -      if (reload_completed
> >>>>>>>>>>>>>>>> -          && aarch64_ldp_writeback > 1
> >>>>>>>>>>>>>>>> -          && GET_CODE (pat) == PARALLEL
> >>>>>>>>>>>>>>>> -          && XVECLEN (pat, 0) == 2)
> >>>>>>>>>>>>>>>> +      if (pair_mem_promote_writeback_p (pat))
> >>>>>>>>>>>>>>>>          try_promote_writeback (insn);
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> It looks like try_promote_writeback itself will need some further work
> >>>>>>>>>>>>>>> to make it target-independent.  I suppose this check:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>   auto rti = insn->rtl ();
> >>>>>>>>>>>>>>>   const auto attr = get_attr_ldpstp (rti);
> >>>>>>>>>>>>>>>   if (attr == LDPSTP_NONE)
> >>>>>>>>>>>>>>>     return;
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>   bool load_p = (attr == LDPSTP_LDP);
> >>>>>>>>>>>>>>>   gcc_checking_assert (load_p || attr == LDPSTP_STP);
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> will need to become part of the pair_mem_promote_writeback_p hook that you
> >>>>>>>>>>>>>>> added, potentially changing it to return a boolean for load_p.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Then I guess we will need hooks for destructuring the pair insn and
> >>>>>>>>>>>>>>> another hook to wrap aarch64_gen_writeback_pair.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Addressed.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>        if (GET_CODE (pat) != SET)
> >>>>>>>>>>>>>>>>          continue;
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>        if (track_stores && MEM_P (XEXP (pat, 0)))
> >>>>>>>>>>>>>>>> -        bb_state.track_access (insn, false, XEXP (pat, 0));
> >>>>>>>>>>>>>>>> +        bb_info.track_access (insn, false, XEXP (pat, 0));
> >>>>>>>>>>>>>>>>        else if (track_loads && MEM_P (XEXP (pat, 1)))
> >>>>>>>>>>>>>>>> -        bb_state.track_access (insn, true, XEXP (pat, 1));
> >>>>>>>>>>>>>>>> +        bb_info.track_access (insn, true, XEXP (pat, 1));
> >>>>>>>>>>>>>>>>      }
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> -  bb_state.transform ();
> >>>>>>>>>>>>>>>> -  bb_state.cleanup_tombstones ();
> >>>>>>>>>>>>>>>> +  bb_info.transform ();
> >>>>>>>>>>>>>>>> +  bb_info.cleanup_tombstones ();
> >>>>>>>>>>>>>>>>  }
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>  void ldp_fusion ()
> >>>>>>>>>>>>>>>>  {
> >>>>>>>>>>>>>>>>    ldp_fusion_init ();
> >>>>>>>>>>>>>>>> +  pair_fusion *pfuse;
> >>>>>>>>>>>>>>>> +  aarch64_pair_fusion derived;
> >>>>>>>>>>>>>>>> +  pfuse = &derived;
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> This is indeed the one place where I think it is acceptable to
> >>>>>>>>>>>>>>> instantiate aarch64_pair_fusion.  But again there's no need to create a
> >>>>>>>>>>>>>>> pointer to the parent class, just call any function you like directly.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Addressed.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>    for (auto bb : crtl->ssa->bbs ())
> >>>>>>>>>>>>>>>> -    ldp_fusion_bb (bb);
> >>>>>>>>>>>>>>>> +    pfuse->ldp_fusion_bb (bb);
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I think even the code to iterate over bbs should itself be a member
> >>>>>>>>>>>>>>> function of pair_fusion (say "run") and then that becomes part of the
> >>>>>>>>>>>>>>> generic code.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> So this function would just become:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>   aarch64_pair_fusion pass;
> >>>>>>>>>>>>>>>   pass.run ();
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> and could be inlined into the caller.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Addressed.
> >>>>>>>>>>>>>>> Perhaps you could also add an early return like:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>   if (!track_loads_p () && !track_stores_p ())
> >>>>>>>>>>>>>>>     return;
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> in pair_fusion::run () and then remove the corresponding code from
> >>>>>>>>>>>>>>> pass_ldp_fusion::gate?
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Addressed.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>    ldp_fusion_destroy ();
> >>>>>>>>>>>>>>>>  }
> >>>>>>>>>>>>>>>> --
> >>>>>>>>>>>>>>>> 2.39.3
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>> Alex
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Thanks & Regards
> >>>>>>>>>>>>>> Ajit
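[For reference, combining the run () and early-return suggestions quoted
above, the generic driver could end up looking roughly like this; the names
follow the suggestions in the thread and the exact shape is an assumption,
not the committed code:]

  void
  pair_fusion::run ()
  {
    // Nothing to do if the target tracks neither loads nor stores.
    if (!track_loads_p () && !track_stores_p ())
      return;

    ldp_fusion_init ();

    // Walk each basic block of the function in RTL-SSA form.
    for (auto bb : crtl->ssa->bbs ())
      ldp_fusion_bb (bb);

    ldp_fusion_destroy ();
  }

so that the target entry point reduces to "aarch64_pair_fusion pass;
pass.run ();" as suggested above.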