Re: [PATCH V4 1/3] aarch64: Place target independent and dependent changed code in one file

Ajit Agarwal Mon, 22 Apr 2024 00:33:10 -0700

Hello Alex:

On 14/04/24 10:29 pm, Ajit Agarwal wrote:
> Hello Alex:
> 
> On 12/04/24 11:02 pm, Ajit Agarwal wrote:
>> Hello Alex:
>>
>> On 12/04/24 8:15 pm, Alex Coplan wrote:
>>> On 12/04/2024 20:02, Ajit Agarwal wrote:
>>>> Hello Alex:
>>>>
>>>> On 11/04/24 7:55 pm, Alex Coplan wrote:
>>>>> On 10/04/2024 23:48, Ajit Agarwal wrote:
>>>>>> Hello Alex:
>>>>>>
>>>>>> On 10/04/24 7:52 pm, Alex Coplan wrote:
>>>>>>> Hi Ajit,
>>>>>>>
>>>>>>> On 10/04/2024 15:31, Ajit Agarwal wrote:
>>>>>>>> Hello Alex:
>>>>>>>>
>>>>>>>> On 10/04/24 1:42 pm, Alex Coplan wrote:
>>>>>>>>> Hi Ajit,
>>>>>>>>>
>>>>>>>>> On 09/04/2024 20:59, Ajit Agarwal wrote:
>>>>>>>>>> Hello Alex:
>>>>>>>>>>
>>>>>>>>>> On 09/04/24 8:39 pm, Alex Coplan wrote:
>>>>>>>>>>> On 09/04/2024 20:01, Ajit Agarwal wrote:
>>>>>>>>>>>> Hello Alex:
>>>>>>>>>>>>
>>>>>>>>>>>> On 09/04/24 7:29 pm, Alex Coplan wrote:
>>>>>>>>>>>>> On 09/04/2024 17:30, Ajit Agarwal wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 05/04/24 10:03 pm, Alex Coplan wrote:
>>>>>>>>>>>>>>> On 05/04/2024 13:53, Ajit Agarwal wrote:
>>>>>>>>>>>>>>>> Hello Alex/Richard:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> All review comments are incorporated.
>>>>>>> <snip>
>>>>>>>>>>>>>>>> @@ -2890,8 +3018,8 @@ ldp_bb_info::merge_pairs (insn_list_t 
>>>>>>>>>>>>>>>> &left_list,
>>>>>>>>>>>>>>>>  // of accesses.  If we find two sets of adjacent accesses, 
>>>>>>>>>>>>>>>> call
>>>>>>>>>>>>>>>>  // merge_pairs.
>>>>>>>>>>>>>>>>  void
>>>>>>>>>>>>>>>> -ldp_bb_info::transform_for_base (int encoded_lfs,
>>>>>>>>>>>>>>>> -                               access_group &group)
>>>>>>>>>>>>>>>> +pair_fusion_bb_info::transform_for_base (int encoded_lfs,
>>>>>>>>>>>>>>>> +                                       access_group &group)
>>>>>>>>>>>>>>>>  {
>>>>>>>>>>>>>>>>    const auto lfs = decode_lfs (encoded_lfs);
>>>>>>>>>>>>>>>>    const unsigned access_size = lfs.size;
>>>>>>>>>>>>>>>> @@ -2909,7 +3037,7 @@ ldp_bb_info::transform_for_base (int 
>>>>>>>>>>>>>>>> encoded_lfs,
>>>>>>>>>>>>>>>>                       access.cand_insns,
>>>>>>>>>>>>>>>>                       lfs.load_p,
>>>>>>>>>>>>>>>>                       access_size);
>>>>>>>>>>>>>>>> -        skip_next = access.cand_insns.empty ();
>>>>>>>>>>>>>>>> +        skip_next = bb_state->cand_insns_empty_p 
>>>>>>>>>>>>>>>> (access.cand_insns);
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> As above, why is this needed?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> For rs6000 we want to return always true. as load store pair
>>>>>>>>>>>>>> that are to be merged with 8/16 16/32 32/64 is occuring for 
>>>>>>>>>>>>>> rs6000.
>>>>>>>>>>>>>> And we want load store pair to 8/16 32/64. Thats why we want
>>>>>>>>>>>>>> to generate always true for rs6000 to skip pairs as above.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hmm, sorry, I'm not sure I follow.  Are you saying that for 
>>>>>>>>>>>>> rs6000 you have
>>>>>>>>>>>>> load/store pair instructions where the two arms of the access are 
>>>>>>>>>>>>> storing
>>>>>>>>>>>>> operands of different sizes?  Or something else?
>>>>>>>>>>>>>
>>>>>>>>>>>>> As it stands the logic is to skip the next iteration only if we
>>>>>>>>>>>>> exhausted all the candidate insns for the current access.  In the 
>>>>>>>>>>>>> case
>>>>>>>>>>>>> that we didn't exhaust all such candidates, then the idea is that 
>>>>>>>>>>>>> when
>>>>>>>>>>>>> access becomes prev_access, we can attempt to use those 
>>>>>>>>>>>>> candidates as
>>>>>>>>>>>>> the "left-hand side" of a pair in the next iteration since we 
>>>>>>>>>>>>> failed to
>>>>>>>>>>>>> use them as the "right-hand side" of a pair in the current 
>>>>>>>>>>>>> iteration.
>>>>>>>>>>>>> I don't see why you wouldn't want that behaviour.  Please can you
>>>>>>>>>>>>> explain?
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> In merge_pair we get the 2 load candiates one load from 0 offset 
>>>>>>>>>>>> and
>>>>>>>>>>>> other load is from 16th offset. Then in next iteration we get load
>>>>>>>>>>>> from 16th offset and other load from 32 offset. In next iteration
>>>>>>>>>>>> we get load from 32 offset and other load from 48 offset.
>>>>>>>>>>>>
>>>>>>>>>>>> For example:
>>>>>>>>>>>>
>>>>>>>>>>>> Currently we get the load candiates as follows.
>>>>>>>>>>>>
>>>>>>>>>>>> pairs:
>>>>>>>>>>>>
>>>>>>>>>>>> load from 0th offset.
>>>>>>>>>>>> load from 16th offset.
>>>>>>>>>>>>
>>>>>>>>>>>> next pairs:
>>>>>>>>>>>>
>>>>>>>>>>>> load from 16th offset.
>>>>>>>>>>>> load from 32th offset.
>>>>>>>>>>>>
>>>>>>>>>>>> next pairs:
>>>>>>>>>>>>
>>>>>>>>>>>> load from 32th offset
>>>>>>>>>>>> load from 48th offset.
>>>>>>>>>>>>
>>>>>>>>>>>> Instead in rs6000 we should get:
>>>>>>>>>>>>
>>>>>>>>>>>> pairs:
>>>>>>>>>>>>
>>>>>>>>>>>> load from 0th offset
>>>>>>>>>>>> load from 16th offset.
>>>>>>>>>>>>
>>>>>>>>>>>> next pairs:
>>>>>>>>>>>>
>>>>>>>>>>>> load from 32th offset
>>>>>>>>>>>> load from 48th offset.
>>>>>>>>>>>
>>>>>>>>>>> Hmm, so then I guess my question is: why wouldn't you consider 
>>>>>>>>>>> merging
>>>>>>>>>>> the pair with offsets (16,32) for rs6000?  Is it because you have a
>>>>>>>>>>> stricter alignment requirement on the base pair offsets (i.e. they 
>>>>>>>>>>> have
>>>>>>>>>>> to be a multiple of 32 when the operand size is 16)?  So the pair
>>>>>>>>>>> offsets have to be a multiple of the entire pair size rather than a
>>>>>>>>>>> single operand size> 
>>>>>>>>>>
>>>>>>>>>> We get load pair at a certain point with (0,16) and other program
>>>>>>>>>> point we get load pair (32, 48).
>>>>>>>>>>
>>>>>>>>>> In current implementation it takes offsets loads as (0, 16),
>>>>>>>>>> (16, 32), (32, 48).
>>>>>>>>>>
>>>>>>>>>> But In rs6000 we want  the load pair to be merged at different points
>>>>>>>>>> as (0,16) and (32, 48). for (0,16) we want to replace load lxvp with
>>>>>>>>>> 0 offset and other load (32, 48) with lxvp with 32 offset.
>>>>>>>>>>
>>>>>>>>>> In current case it will merge with lxvp with 0 offset and lxvp with
>>>>>>>>>> 16 offset, then lxvp with 32 offset and lxvp with 48 offset which
>>>>>>>>>> is incorrect in our case as the (16-32) case 16 offset will not
>>>>>>>>>> load from even register and break for rs6000.
>>>>>>>>>
>>>>>>>>> Sorry, I think I'm still missing something here.  Why does the 
>>>>>>>>> address offset
>>>>>>>>> affect the parity of the tranfser register?  ISTM they needn't be 
>>>>>>>>> related at
>>>>>>>>> all (and indeed we can't even know the parity of the transfer 
>>>>>>>>> register before
>>>>>>>>> RA, but perhaps you're only intending on running the pass after RA?)
>>>>>>>>>
>>>>>>>>
>>>>>>>> We have load pair with (0,16) wherein these loads are adjacent and
>>>>>>>> replaced with lxvp.
>>>>>>>>
>>>>>>>> Semantic of lxvp instruction is that it loads adjacent load pair in
>>>>>>>> even register and even_register + 1.
>>>>>>>>
>>>>>>>> We replace the above load pair with lxvp instruction and then we
>>>>>>>> dont need to merge (16,32) as (0, 16) is already merged and instead
>>>>>>>> we merge (32,48).
>>>>>>>
>>>>>>> Ok, but the existing logic should already account for this.  I.e. if we
>>>>>>> successfully merge (0,16), then we don't attempt to merge (16,32).  
>>>>>>> We'd only
>>>>>>> attempt to merge (16,32) if the merge of (0,16) failed (for whatever 
>>>>>>> reason).
>>>>>>> So I don't see that there's anything specific to lxvp that requires 
>>>>>>> this logic
>>>>>>> to change, _unless_ you have a stricter alignment requirement on the 
>>>>>>> offsets as
>>>>>>> I mentioned before.
>>>>>>>
>>>>>>
>>>>>> Thanks for the suggestion. It worked for rs6000 also with current 
>>>>>> changes.
>>>>>> Sorry for the confusion.
>>>>>
>>>>> Alright, glad we got to the bottom of this!
>>>>
>>>> Thanks.
>>>>>
>>>>>>
>>>>>>>>
>>>>>>>> Yes you are correct, the addresss offset doesn't affect the parity of
>>>>>>>> the register transfer as we are doing fusion pass before RA.
>>>>>>>> If that is the case then I think it would be better to introduce a
>>>>>>>>>>> virtual function (say pair_offset_alignment_ok_p) that vets the base
>>>>>>>>>>> offset of the pair (prev_access->offset in transform_for_base).  I 
>>>>>>>>>>> guess
>>>>>>>>>>> it would also take access_size as a parameter and for aarch64 it 
>>>>>>>>>>> should
>>>>>>>>>>> check:
>>>>>>>>>>>
>>>>>>>>>>>   multiple_p (offset, access_size)
>>>>>>>>>>>
>>>>>>>>>>> and for rs6000 it could check:
>>>>>>>>>>>
>>>>>>>>>>>   multiple_p (offset, access_size * 2)
>>>>>>>>>>>
>>>>>>>>>>> and we would then incorporate a call to that predicate in the else 
>>>>>>>>>>> if
>>>>>>>>>>> condition of tranform_for_base.
>>>>>>>>>>>
>>>>>>>>>>> It would have the same limitation whereby we assume that MEM_EXPR 
>>>>>>>>>>> offset
>>>>>>>>>>> alignment is a good proxy for RTL offset alignment, but we already
>>>>>>>>>>> depend on that with the multiple_p check in track_via_mem_expr.
>>>>>>>>>>>
>>>>>>>> I have addressed the above hooks and it worked fine with both rs6000
>>>>>>>> and aarch64. I am sending subsequent patch in some time that address
>>>>>>>> above.
>>>>>>>>
>>>>>>>>> How do you plan on handling this even-odd requirement for rs6000?
>>>>>>>>>
>>>>>>>>
>>>>>>>> We plan to handle with V16QI subreg: 0 and V16QI subreg : 16 to
>>>>>>>> generate register pair and thats what we generate and implement
>>>>>>>> in rs6000 target
>>>>>>>> code.
>>>>>>>
>>>>>>> Ah, this is coming back to me now.  Sorry, I should have remembered 
>>>>>>> this from
>>>>>>> the previous discussion with Richard S.
>>>>>>>
>>>>>>> Apologies for going on a slight tangent here, but if you're running
>>>>>>> before RA are you planning to create a new OImode pseudo register for
>>>>>>> the lxvp insn and then somehow update uses of the old transfer registers
>>>>>>> to replace them with subregs of that OImode pseudo?  
>>>>>>
>>>>>> Yes I do the same as you have mentioned. We generate register pairs
>>>>>> with 256 bit mode with two subregs of 128 bit modes with 0 and
>>>>>> 16 offset.
>>>>>>
>>>>>> Or do you just plan
>>>>>>> or replacing the individual loads with moves (instead of deleting them)?
>>>>>>> I guess the latter would be simpler and might work better in the
>>>>>>> presence of hard regs.
>>>>>>>
>>>>>>
>>>>>> Would you mind explaining how to generate register pairs with lxvp by
>>>>>> replacing loads with moves.
>>>>>
>>>>> Yeah, so suppose you have something like:
>>>>>
>>>>> (set (reg:V16QI v1) (mem:V16QI addr))
>>>>> (set (reg:V16QI v2) (mem:V16QI addr+16))
>>>>>
>>>>> then when you insert the lxvp you can then (logically) replace the
>>>>> original load instructions with moves from the appropriate subreg, as
>>>>> follows:
>>>>>
>>>>> (set (reg:OI pair-pseudo) (mem:OI addr)) ; lxvp
>>>>> (set (reg:V16QI v1) (subreg:V16QI (reg:OI pair-pseudo) 0))
>>>>> (set (reg:V16QI v2) (subreg:V16QI (reg:OI pair-pseudo) 16))
>>>>>
>>>>
>>>> Any Pseudo created with gen_rtx_REG like 
>>>> gen_rtx_REG (OOmode, REGNO (dest_exp) will error'ed out 
>>>> with unrecognize insn by LRA.
>>>
>>> I'm not surprised that goes wrong: you can't just create a new REG
>>> rtx in a different mode and reuse the regno of an existing pseudo.
>>>
> 
> Thanks for the suggestion.
>>>>
>>>> If I create pseudo with gen_reg_rtx (OOmode) will error'ed
>>>> out with new_defs Pseudo register is not found in
>>>> change->new_defs.
>>>
>>> Yeah, I suppose you'd need to add an RTL-SSA def for the new pseudo.
>>>
>>
>> Would you mind explaining how can I add and RTL-SSA def for the
>> new pseudo.
> 
> I have added and RTL-SSA def for the new pseudo. With that I could
> get register oairs correctly.
>>
>>>>
>>>> Also the sequential register pairs are not generated by
>>>> Register Allocator.
>>>
>>> So how do you get the sequential pairs with your current approach?  My
>>> understanding was that what I suggested above doesn't really change what
>>> you're doing w.r.t the lxvp insn itself, but maybe I didn't properly
>>> understand the approach taken in your initial patchset.
>>>
>>
>> I generate (set (reg:OO pair-pseudo) (mem:OI addr)) ; lxvp
>> and then at the use point of pair_pseudo generate the following.
>>
>> (subreg:V16QI (reg:OI pair-pseudo) 0))
>> (subreg:V16QI (reg:OI pair-pseudo) 16))
>>
> 
> I get register pairs correctly generating RTL as you have 
> suggested but we get extra moves that would impact the performance.
> 
> Please let me know what do you think.
>


For the below testcase:

#include <altivec.h>

void
foo (__vector_quad *dst, vector unsigned char *ptr, vector unsigned char src)
{
  __vector_quad acc;
  __builtin_mma_xvf32ger(&acc, src, ptr[0]);
  __builtin_mma_xvf32gerpp(&acc, src, ptr[1]);
  *dst = acc;
}


My earlier implementation without moves generates the following assembly
for the above testcase:

.LFB0:
        .cfi_startproc
        .localentry     _Z3fooPu13__vector_quadPDv16_hS1_,1
        lxvp %vs32,0(%r4)
        xvf32ger 0,%vs34,%vs33
        xvf32gerpp 0,%vs34,%vs32
        xxmfacc 0
        stxvp %vs2,0(%r3)
        stxvp %vs0,32(%r3)
        blr


With moves you have suggested I get the below generated assembly:
.LFB0:
        .cfi_startproc
        .localentry     _Z3fooPu13__vector_quadPDv16_hS1_,1
        lxvp %vs0,0(%r4)
        xxlor %vs33,%vs1,%vs1
        xxlor %vs32,%vs0,%vs0
        xvf32ger 0,%vs34,%vs33
        xvf32gerpp 0,%vs34,%vs32
        xxmfacc 0
        stxvp %vs2,0(%r3)
        stxvp %vs0,32(%r3)
        blr


As you see with register moves two extra xxlor instructions are 
generated.

Please let me know what do you think.

Thanks & Regards
Ajit
> Thanks & Regards
> Ajit
>> Thanks & Regards
>> Ajit
>>> Thanks,
>>> Alex
>>>
>>>>
>>>> Thats why I haven't used above method as I also thought
>>>> through.
>>>>
>>>> Please let me know what do you think.
>>>>
>>>> Thanks & Regards
>>>> Ajit 
>>>>> now I'm not sure off-hand if this exact combination of subregs and mode
>>>>> changes is valid, but hopefully you get the idea.  The benefit of this
>>>>> approach is that it keeps the transformation local and is more
>>>>> compile-time efficient (we don't have to look through all mentions of
>>>>> v1/v2 and replace them with subregs).  I'd hope that the RA can then
>>>>> clean up any redundant moves (especially in the case that v1/v2 are
>>>>> pseudos).
>>>>>
>>>>> That would mean that you don't need all the grubbing around with DF
>>>>> looking for occurrences of the transfer regs.  Instead we'd just need
>>>>> some logic to insert the extra moves after the lxvp for rs6000, and I
>>>>> think this might fit more cleanly into the current pass.
>>>>>
>>>>> Does that make sense?  In any case, this shouldn't really affect the
>>>>> preparatory aarch64 patch because we were planning to defer adding any
>>>>> hooks that are only needed for rs6000 from the initial aarch64/generic
>>>>> split.
>>>>>
>>>>> Thanks,
>>>>> Alex
>>>>>
>>>>>>
>>>>>> Thanks & Regards
>>>>>> Ajit
>>>>>>  
>>>>>>> Thanks,
>>>>>>> Alex
>>>>>>>
>>>>>>>>
>>>>>>>> Thanks & Regards
>>>>>>>> Ajit 
>>>>>>>>> Thanks,
>>>>>>>>> Alex
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> lxvp should load from even registers and then loaded value will
>>>>>>>>>> be in even register and even register +1 (which is odd).
>>>>>>>>>>
>>>>>>>>>> Thanks & Regards
>>>>>>>>>> Ajit
>>>>>>>>>>> If that is the case then I think it would be better to introduce a
>>>>>>>>>>> virtual function (say pair_offset_alignment_ok_p) that vets the base
>>>>>>>>>>> offset of the pair (prev_access->offset in transform_for_base).  I 
>>>>>>>>>>> guess
>>>>>>>>>>> it would also take access_size as a parameter and for aarch64 it 
>>>>>>>>>>> should
>>>>>>>>>>> check:
>>>>>>>>>>>
>>>>>>>>>>>   multiple_p (offset, access_size)
>>>>>>>>>>>
>>>>>>>>>>> and for rs6000 it could check:
>>>>>>>>>>>
>>>>>>>>>>>   multiple_p (offset, access_size * 2)
>>>>>>>>>>>
>>>>>>>>>>> and we would then incorporate a call to that predicate in the else 
>>>>>>>>>>> if
>>>>>>>>>>> condition of tranform_for_base.
>>>>>>>>>>>
>>>>>>>>>>> It would have the same limitation whereby we assume that MEM_EXPR 
>>>>>>>>>>> offset
>>>>>>>>>>> alignment is a good proxy for RTL offset alignment, but we already
>>>>>>>>>>> depend on that with the multiple_p check in track_via_mem_expr.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Alex
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks & Regards
>>>>>>>>>>>> Ajit
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Alex
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>        }
>>>>>>>>>>>>>>>>        prev_access = &access;
>>>>>>>>>>>>>>>>      }
>>>>>>>>>>>>>>>> @@ -2919,7 +3047,7 @@ ldp_bb_info::transform_for_base (int 
>>>>>>>>>>>>>>>> encoded_lfs,
>>>>>>>>>>>>>>>>  // and remove all the tombstone insns, being sure to reparent 
>>>>>>>>>>>>>>>> any uses
>>>>>>>>>>>>>>>>  // of mem to previous defs when we do this.
>>>>>>>>>>>>>>>>  void
>>>>>>>>>>>>>>>> -ldp_bb_info::cleanup_tombstones ()
>>>>>>>>>>>>>>>> +pair_fusion_bb_info::cleanup_tombstones ()
>>>>>>>>>>>>>>>>  {
>>>>>>>>>>>>>>>>    // No need to do anything if we didn't emit a tombstone 
>>>>>>>>>>>>>>>> insn for this BB.
>>>>>>>>>>>>>>>>    if (!m_emitted_tombstone)
>>>>>>>>>>>>>>>> @@ -2947,7 +3075,7 @@ ldp_bb_info::cleanup_tombstones ()
>>>>>>>>>>>>>>>>  
>>>>>>>>>>>>>>>>  template<typename Map>
>>>>>>>>>>>>>>>>  void
>>>>>>>>>>>>>>>> -ldp_bb_info::traverse_base_map (Map &map)
>>>>>>>>>>>>>>>> +pair_fusion_bb_info::traverse_base_map (Map &map)
>>>>>>>>>>>>>>>>  {
>>>>>>>>>>>>>>>>    for (auto kv : map)
>>>>>>>>>>>>>>>>      {
>>>>>>>>>>>>>>>> @@ -2958,7 +3086,7 @@ ldp_bb_info::traverse_base_map (Map &map)
>>>>>>>>>>>>>>>>  }
>>>>>>>>>>>>>>>>  
>>>>>>>>>>>>>>>>  void
>>>>>>>>>>>>>>>> -ldp_bb_info::transform ()
>>>>>>>>>>>>>>>> +pair_fusion_bb_info::transform ()
>>>>>>>>>>>>>>>>  {
>>>>>>>>>>>>>>>>    traverse_base_map (expr_map);
>>>>>>>>>>>>>>>>    traverse_base_map (def_map);
>>>>>>>>>>>>>>>> @@ -3167,14 +3295,13 @@ try_promote_writeback (insn_info *insn)
>>>>>>>>>>>>>>>>  // for load/store candidates.  If running after RA, also try 
>>>>>>>>>>>>>>>> and promote
>>>>>>>>>>>>>>>>  // non-writeback pairs to use writeback addressing.  Then try 
>>>>>>>>>>>>>>>> to fuse
>>>>>>>>>>>>>>>>  // candidates into pairs.
>>>>>>>>>>>>>>>> -void ldp_fusion_bb (bb_info *bb)
>>>>>>>>>>>>>>>> +void pair_fusion::ldp_fusion_bb (bb_info *bb)
>>>>>>>>>>>>>>>>  {
>>>>>>>>>>>>>>>> -  const bool track_loads
>>>>>>>>>>>>>>>> -    = aarch64_tune_params.ldp_policy_model != 
>>>>>>>>>>>>>>>> AARCH64_LDP_STP_POLICY_NEVER;
>>>>>>>>>>>>>>>> -  const bool track_stores
>>>>>>>>>>>>>>>> -    = aarch64_tune_params.stp_policy_model != 
>>>>>>>>>>>>>>>> AARCH64_LDP_STP_POLICY_NEVER;
>>>>>>>>>>>>>>>> +  const bool track_loads = track_load_p ();
>>>>>>>>>>>>>>>> +  const bool track_stores = track_store_p ();
>>>>>>>>>>>>>>>>  
>>>>>>>>>>>>>>>> -  ldp_bb_info bb_state (bb);
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> +  aarch64_pair_fusion derived;
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> can be deleted and then:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> +  pair_fusion_bb_info bb_info (bb, &derived);
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> can just be:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>   pair_fusion_bb_info bb_info (bb, this);
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> (or you can pass *this if you make bb_info take a reference).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I don't think there's a particular need to change the variable 
>>>>>>>>>>>>>>> name
>>>>>>>>>>>>>>> (bb_state -> bb_info).  I chose the former because it doens't 
>>>>>>>>>>>>>>> clash
>>>>>>>>>>>>>>> with the RTL-SSA structure of the same name as the latter.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Addressed.
>>>>>>>>>>>>>>>>  
>>>>>>>>>>>>>>>>    for (auto insn : bb->nondebug_insns ())
>>>>>>>>>>>>>>>>      {
>>>>>>>>>>>>>>>> @@ -3184,31 +3311,31 @@ void ldp_fusion_bb (bb_info *bb)
>>>>>>>>>>>>>>>>        continue;
>>>>>>>>>>>>>>>>  
>>>>>>>>>>>>>>>>        rtx pat = PATTERN (rti);
>>>>>>>>>>>>>>>> -      if (reload_completed
>>>>>>>>>>>>>>>> -        && aarch64_ldp_writeback > 1
>>>>>>>>>>>>>>>> -        && GET_CODE (pat) == PARALLEL
>>>>>>>>>>>>>>>> -        && XVECLEN (pat, 0) == 2)
>>>>>>>>>>>>>>>> +      if (pair_mem_promote_writeback_p (pat))
>>>>>>>>>>>>>>>>        try_promote_writeback (insn);
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> It looks like try_promote_writeback itself will need some 
>>>>>>>>>>>>>>> further work
>>>>>>>>>>>>>>> to make it target-independent.  I suppose this check:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>   auto rti = insn->rtl ();
>>>>>>>>>>>>>>>   const auto attr = get_attr_ldpstp (rti);
>>>>>>>>>>>>>>>   if (attr == LDPSTP_NONE)
>>>>>>>>>>>>>>>     return;
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>   bool load_p = (attr == LDPSTP_LDP);
>>>>>>>>>>>>>>>   gcc_checking_assert (load_p || attr == LDPSTP_STP);
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> will need to become part of the pair_mem_promote_writeback_p 
>>>>>>>>>>>>>>> hook that you
>>>>>>>>>>>>>>> added, potentially changing it to return a boolean for load_p.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Then I guess we will need hooks for destructuring the pair insn 
>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>> another hook to wrap aarch64_gen_writeback_pair.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Addressed.
>>>>>>>>>>>>>>>>  
>>>>>>>>>>>>>>>>        if (GET_CODE (pat) != SET)
>>>>>>>>>>>>>>>>        continue;
>>>>>>>>>>>>>>>>  
>>>>>>>>>>>>>>>>        if (track_stores && MEM_P (XEXP (pat, 0)))
>>>>>>>>>>>>>>>> -      bb_state.track_access (insn, false, XEXP (pat, 0));
>>>>>>>>>>>>>>>> +      bb_info.track_access (insn, false, XEXP (pat, 0));
>>>>>>>>>>>>>>>>        else if (track_loads && MEM_P (XEXP (pat, 1)))
>>>>>>>>>>>>>>>> -      bb_state.track_access (insn, true, XEXP (pat, 1));
>>>>>>>>>>>>>>>> +      bb_info.track_access (insn, true, XEXP (pat, 1));
>>>>>>>>>>>>>>>>      }
>>>>>>>>>>>>>>>>  
>>>>>>>>>>>>>>>> -  bb_state.transform ();
>>>>>>>>>>>>>>>> -  bb_state.cleanup_tombstones ();
>>>>>>>>>>>>>>>> +  bb_info.transform ();
>>>>>>>>>>>>>>>> +  bb_info.cleanup_tombstones ();
>>>>>>>>>>>>>>>>  }
>>>>>>>>>>>>>>>>  
>>>>>>>>>>>>>>>>  void ldp_fusion ()
>>>>>>>>>>>>>>>>  {
>>>>>>>>>>>>>>>>    ldp_fusion_init ();
>>>>>>>>>>>>>>>> +  pair_fusion *pfuse;
>>>>>>>>>>>>>>>> +  aarch64_pair_fusion derived;
>>>>>>>>>>>>>>>> +  pfuse = &derived;
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This is indeed the one place where I think it is acceptable to
>>>>>>>>>>>>>>> instantiate aarch64_pair_fusion.  But again there's no need to 
>>>>>>>>>>>>>>> create a
>>>>>>>>>>>>>>> pointer to the parent class, just call any function you like 
>>>>>>>>>>>>>>> directly.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Addressed.
>>>>>>>>>>>>>>>>  
>>>>>>>>>>>>>>>>    for (auto bb : crtl->ssa->bbs ())
>>>>>>>>>>>>>>>> -    ldp_fusion_bb (bb);
>>>>>>>>>>>>>>>> +    pfuse->ldp_fusion_bb (bb);
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I think even the code to iterate over bbs should itself be a 
>>>>>>>>>>>>>>> member
>>>>>>>>>>>>>>> function of pair_fusion (say "run") and then that becomes part 
>>>>>>>>>>>>>>> of the
>>>>>>>>>>>>>>> generic code.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> So this function would just become:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> aarch64_pair_fusion pass;
>>>>>>>>>>>>>>> pass.run ();
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> and could be inlined into the caller.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Addressed.
>>>>>>>>>>>>>>> Perhaps you could also add an early return like:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>   if (!track_loads_p () && !track_stores_p ())
>>>>>>>>>>>>>>>     return;
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> in pair_fusion::run () and then remove the corresponding code 
>>>>>>>>>>>>>>> from
>>>>>>>>>>>>>>> pass_ldp_fusion::gate?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Addressed.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>  
>>>>>>>>>>>>>>>>    ldp_fusion_destroy ();
>>>>>>>>>>>>>>>>  }
>>>>>>>>>>>>>>>> -- 
>>>>>>>>>>>>>>>> 2.39.3
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Alex
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks & Regards
>>>>>>>>>>>>>> Ajit

Re: [PATCH V4 1/3] aarch64: Place target independent and dependent changed code in one file

Reply via email to