On 22/04/2024 13:01, Ajit Agarwal wrote:
> Hello Alex:
>
> On 14/04/24 10:29 pm, Ajit Agarwal wrote:
> > Hello Alex:
> >
> > On 12/04/24 11:02 pm, Ajit Agarwal wrote:
> >> Hello Alex:
> >>
> >> On 12/04/24 8:15 pm, Alex Coplan wrote:
> >>> On 12/04/2024 20:02, Ajit Agarwal wrote:
> >>>> Hello Alex:
> >>>>
> >>>> On 11/04/24 7:55 pm, Alex Coplan wrote:
> >>>>> On 10/04/2024 23:48, Ajit Agarwal wrote:
> >>>>>> Hello Alex:
> >>>>>>
> >>>>>> On 10/04/24 7:52 pm, Alex Coplan wrote:
> >>>>>>> Hi Ajit,
> >>>>>>>
> >>>>>>> On 10/04/2024 15:31, Ajit Agarwal wrote:
> >>>>>>>> Hello Alex:
> >>>>>>>>
> >>>>>>>> On 10/04/24 1:42 pm, Alex Coplan wrote:
> >>>>>>>>> Hi Ajit,
> >>>>>>>>>
> >>>>>>>>> On 09/04/2024 20:59, Ajit Agarwal wrote:
> >>>>>>>>>> Hello Alex:
> >>>>>>>>>>
> >>>>>>>>>> On 09/04/24 8:39 pm, Alex Coplan wrote:
> >>>>>>>>>>> On 09/04/2024 20:01, Ajit Agarwal wrote:
> >>>>>>>>>>>> Hello Alex:
> >>>>>>>>>>>>
> >>>>>>>>>>>> On 09/04/24 7:29 pm, Alex Coplan wrote:
> >>>>>>>>>>>>> On 09/04/2024 17:30, Ajit Agarwal wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On 05/04/24 10:03 pm, Alex Coplan wrote:
> >>>>>>>>>>>>>>> On 05/04/2024 13:53, Ajit Agarwal wrote:
> >>>>>>>>>>>>>>>> Hello Alex/Richard:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> All review comments are incorporated.
> >>>>>>> <snip>
> >>>>>>>>>>>>>>>> @@ -2890,8 +3018,8 @@ ldp_bb_info::merge_pairs (insn_list_t &left_list,
> >>>>>>>>>>>>>>>>  // of accesses.  If we find two sets of adjacent accesses, call
> >>>>>>>>>>>>>>>>  // merge_pairs.
> >>>>>>>>>>>>>>>>  void
> >>>>>>>>>>>>>>>> -ldp_bb_info::transform_for_base (int encoded_lfs,
> >>>>>>>>>>>>>>>> -                                 access_group &group)
> >>>>>>>>>>>>>>>> +pair_fusion_bb_info::transform_for_base (int encoded_lfs,
> >>>>>>>>>>>>>>>> +                                          access_group &group)
> >>>>>>>>>>>>>>>>  {
> >>>>>>>>>>>>>>>>    const auto lfs = decode_lfs (encoded_lfs);
> >>>>>>>>>>>>>>>>    const unsigned access_size = lfs.size;
> >>>>>>>>>>>>>>>> @@ -2909,7 +3037,7 @@ ldp_bb_info::transform_for_base (int encoded_lfs,
> >>>>>>>>>>>>>>>>                          access.cand_insns,
> >>>>>>>>>>>>>>>>                          lfs.load_p,
> >>>>>>>>>>>>>>>>                          access_size);
> >>>>>>>>>>>>>>>> -         skip_next = access.cand_insns.empty ();
> >>>>>>>>>>>>>>>> +         skip_next = bb_state->cand_insns_empty_p (access.cand_insns);
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> As above, why is this needed?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> For rs6000 we want to return always true, as load store pairs
> >>>>>>>>>>>>>> that are to be merged with 8/16 16/32 32/64 are occurring for
> >>>>>>>>>>>>>> rs6000, and we want the load store pairs to be 8/16 32/64.
> >>>>>>>>>>>>>> That's why we want to generate always true for rs6000 to skip
> >>>>>>>>>>>>>> pairs as above.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Hmm, sorry, I'm not sure I follow.  Are you saying that for
> >>>>>>>>>>>>> rs6000 you have load/store pair instructions where the two arms
> >>>>>>>>>>>>> of the access are storing operands of different sizes?  Or
> >>>>>>>>>>>>> something else?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> As it stands the logic is to skip the next iteration only if we
> >>>>>>>>>>>>> exhausted all the candidate insns for the current access.
> >>>>>>>>>>>>> In the case that we didn't exhaust all such candidates, then
> >>>>>>>>>>>>> the idea is that when access becomes prev_access, we can
> >>>>>>>>>>>>> attempt to use those candidates as the "left-hand side" of a
> >>>>>>>>>>>>> pair in the next iteration since we failed to use them as the
> >>>>>>>>>>>>> "right-hand side" of a pair in the current iteration.  I don't
> >>>>>>>>>>>>> see why you wouldn't want that behaviour.  Please can you
> >>>>>>>>>>>>> explain?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> In merge_pair we get the 2 load candidates, one load from offset
> >>>>>>>>>>>> 0 and the other load from offset 16.  Then in the next iteration
> >>>>>>>>>>>> we get a load from offset 16 and the other load from offset 32.
> >>>>>>>>>>>> In the next iteration we get a load from offset 32 and the other
> >>>>>>>>>>>> load from offset 48.
> >>>>>>>>>>>>
> >>>>>>>>>>>> For example:
> >>>>>>>>>>>>
> >>>>>>>>>>>> Currently we get the load candidates as follows.
> >>>>>>>>>>>>
> >>>>>>>>>>>> pairs:
> >>>>>>>>>>>>
> >>>>>>>>>>>> load from offset 0.
> >>>>>>>>>>>> load from offset 16.
> >>>>>>>>>>>>
> >>>>>>>>>>>> next pairs:
> >>>>>>>>>>>>
> >>>>>>>>>>>> load from offset 16.
> >>>>>>>>>>>> load from offset 32.
> >>>>>>>>>>>>
> >>>>>>>>>>>> next pairs:
> >>>>>>>>>>>>
> >>>>>>>>>>>> load from offset 32.
> >>>>>>>>>>>> load from offset 48.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Instead in rs6000 we should get:
> >>>>>>>>>>>>
> >>>>>>>>>>>> pairs:
> >>>>>>>>>>>>
> >>>>>>>>>>>> load from offset 0.
> >>>>>>>>>>>> load from offset 16.
> >>>>>>>>>>>>
> >>>>>>>>>>>> next pairs:
> >>>>>>>>>>>>
> >>>>>>>>>>>> load from offset 32.
> >>>>>>>>>>>> load from offset 48.
> >>>>>>>>>>>
> >>>>>>>>>>> Hmm, so then I guess my question is: why wouldn't you consider
> >>>>>>>>>>> merging the pair with offsets (16,32) for rs6000?  Is it because
> >>>>>>>>>>> you have a stricter alignment requirement on the base pair
> >>>>>>>>>>> offsets (i.e. they have to be a multiple of 32 when the operand
> >>>>>>>>>>> size is 16)?  So the pair offsets have to be a multiple of the
> >>>>>>>>>>> entire pair size rather than a single operand size?
> >>>>>>>>>>
> >>>>>>>>>> We get a load pair at a certain point with (0,16) and at another
> >>>>>>>>>> program point we get a load pair (32, 48).
> >>>>>>>>>>
> >>>>>>>>>> In the current implementation it takes the offset loads as
> >>>>>>>>>> (0, 16), (16, 32), (32, 48).
> >>>>>>>>>>
> >>>>>>>>>> But in rs6000 we want the load pairs to be merged at different
> >>>>>>>>>> points, as (0,16) and (32, 48).  For (0,16) we want to replace the
> >>>>>>>>>> loads with an lxvp with offset 0, and the other loads (32, 48)
> >>>>>>>>>> with an lxvp with offset 32.
> >>>>>>>>>>
> >>>>>>>>>> In the current case it will merge with an lxvp with offset 0 and
> >>>>>>>>>> an lxvp with offset 16, then an lxvp with offset 32 and an lxvp
> >>>>>>>>>> with offset 48, which is incorrect in our case, as in the (16,32)
> >>>>>>>>>> case the load at offset 16 will not load from an even register and
> >>>>>>>>>> breaks for rs6000.
> >>>>>>>>>
> >>>>>>>>> Sorry, I think I'm still missing something here.  Why does the
> >>>>>>>>> address offset affect the parity of the transfer register?  ISTM
> >>>>>>>>> they needn't be related at all (and indeed we can't even know the
> >>>>>>>>> parity of the transfer register before RA, but perhaps you're only
> >>>>>>>>> intending on running the pass after RA?)
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> We have a load pair with (0,16) wherein these loads are adjacent and
> >>>>>>>> replaced with lxvp.
> >>>>>>>>
> >>>>>>>> The semantic of the lxvp instruction is that it loads the adjacent
> >>>>>>>> load pair into an even register and even_register + 1.
> >>>>>>>>
> >>>>>>>> We replace the above load pair with an lxvp instruction and then we
> >>>>>>>> don't need to merge (16,32) as (0, 16) is already merged, and
> >>>>>>>> instead we merge (32,48).
> >>>>>>>
> >>>>>>> Ok, but the existing logic should already account for this.  I.e. if
> >>>>>>> we successfully merge (0,16), then we don't attempt to merge (16,32).
> >>>>>>> We'd only attempt to merge (16,32) if the merge of (0,16) failed (for
> >>>>>>> whatever reason).  So I don't see that there's anything specific to
> >>>>>>> lxvp that requires this logic to change, _unless_ you have a stricter
> >>>>>>> alignment requirement on the offsets as I mentioned before.
> >>>>>>>
> >>>>>>
> >>>>>> Thanks for the suggestion.  It worked for rs6000 also with the current
> >>>>>> changes.  Sorry for the confusion.
> >>>>>
> >>>>> Alright, glad we got to the bottom of this!
> >>>>
> >>>> Thanks.
> >>>>>
> >>>>>>
> >>>>>>>>
> >>>>>>>> Yes you are correct, the address offset doesn't affect the parity of
> >>>>>>>> the register transfer as we are doing the fusion pass before RA.
> >>>>>>>>>>> If that is the case then I think it would be better to introduce a
> >>>>>>>>>>> virtual function (say pair_offset_alignment_ok_p) that vets the base
> >>>>>>>>>>> offset of the pair (prev_access->offset in transform_for_base).  I guess
> >>>>>>>>>>> it would also take access_size as a parameter and for aarch64 it should
> >>>>>>>>>>> check:
> >>>>>>>>>>>
> >>>>>>>>>>>   multiple_p (offset, access_size)
> >>>>>>>>>>>
> >>>>>>>>>>> and for rs6000 it could check:
> >>>>>>>>>>>
> >>>>>>>>>>>   multiple_p (offset, access_size * 2)
> >>>>>>>>>>>
> >>>>>>>>>>> and we would then incorporate a call to that predicate in the else if
> >>>>>>>>>>> condition of transform_for_base.
> >>>>>>>>>>>
> >>>>>>>>>>> It would have the same limitation whereby we assume that MEM_EXPR offset
> >>>>>>>>>>> alignment is a good proxy for RTL offset alignment, but we already
> >>>>>>>>>>> depend on that with the multiple_p check in track_via_mem_expr.
> >>>>>>>>
> >>>>>>>> I have addressed the above hooks and it worked fine with both rs6000
> >>>>>>>> and aarch64.  I am sending a subsequent patch in some time that
> >>>>>>>> addresses the above.
> >>>>>>>>
> >>>>>>>>> How do you plan on handling this even-odd requirement for rs6000?
> >>>>>>>>
> >>>>>>>> We plan to handle it with V16QI subreg:0 and V16QI subreg:16 to
> >>>>>>>> generate the register pair, and that's what we generate and
> >>>>>>>> implement in the rs6000 target code.
> >>>>>>>
> >>>>>>> Ah, this is coming back to me now.  Sorry, I should have remembered
> >>>>>>> this from the previous discussion with Richard S.
> >>>>>>>
> >>>>>>> Apologies for going on a slight tangent here, but if you're running
> >>>>>>> before RA are you planning to create a new OImode pseudo register for
> >>>>>>> the lxvp insn and then somehow update uses of the old transfer
> >>>>>>> registers to replace them with subregs of that OImode pseudo?
> >>>>>>
> >>>>>> Yes I do the same as you have mentioned.
> >>>>>> We generate register pairs with a 256 bit mode with two subregs of
> >>>>>> 128 bit modes at offsets 0 and 16.
> >>>>>>
> >>>>>>> Or do you just plan on replacing the individual loads with moves
> >>>>>>> (instead of deleting them)?  I guess the latter would be simpler and
> >>>>>>> might work better in the presence of hard regs.
> >>>>>>>
> >>>>>>
> >>>>>> Would you mind explaining how to generate register pairs with lxvp by
> >>>>>> replacing loads with moves?
> >>>>>
> >>>>> Yeah, so suppose you have something like:
> >>>>>
> >>>>> (set (reg:V16QI v1) (mem:V16QI addr))
> >>>>> (set (reg:V16QI v2) (mem:V16QI addr+16))
> >>>>>
> >>>>> then when you insert the lxvp you can then (logically) replace the
> >>>>> original load instructions with moves from the appropriate subreg, as
> >>>>> follows:
> >>>>>
> >>>>> (set (reg:OI pair-pseudo) (mem:OI addr)) ; lxvp
> >>>>> (set (reg:V16QI v1) (subreg:V16QI (reg:OI pair-pseudo) 0))
> >>>>> (set (reg:V16QI v2) (subreg:V16QI (reg:OI pair-pseudo) 16))
> >>>>>
> >>>>
> >>>> Any pseudo created with gen_rtx_REG, like
> >>>> gen_rtx_REG (OOmode, REGNO (dest_exp)), errors out with an
> >>>> unrecognized insn in LRA.
> >>>
> >>> I'm not surprised that goes wrong: you can't just create a new REG
> >>> rtx in a different mode and reuse the regno of an existing pseudo.
> >>>
> >
> > Thanks for the suggestion.
> >>>>
> >>>> If I create a pseudo with gen_reg_rtx (OOmode), it errors out because
> >>>> the new pseudo register is not found in change->new_defs.
> >>>
> >>> Yeah, I suppose you'd need to add an RTL-SSA def for the new pseudo.
> >>>
> >>
> >> Would you mind explaining how I can add an RTL-SSA def for the
> >> new pseudo?
> >
> > I have added an RTL-SSA def for the new pseudo.  With that I could
> > get register pairs correctly.
> >>
> >>>>
> >>>> Also the sequential register pairs are not generated by the
> >>>> Register Allocator.
> >>>
> >>> So how do you get the sequential pairs with your current approach?  My
> >>> understanding was that what I suggested above doesn't really change what
> >>> you're doing w.r.t. the lxvp insn itself, but maybe I didn't properly
> >>> understand the approach taken in your initial patchset.
> >>>
> >>
> >> I generate (set (reg:OO pair-pseudo) (mem:OI addr)) ; lxvp
> >> and then at the use points of pair_pseudo generate the following:
> >>
> >> (subreg:V16QI (reg:OI pair-pseudo) 0)
> >> (subreg:V16QI (reg:OI pair-pseudo) 16)
> >>
> >
> > I get register pairs correctly generating the RTL as you have
> > suggested, but we get extra moves that would impact the performance.
> >
> > Please let me know what do you think.
> >
> > For the below testcase:
>
> #include <altivec.h>
>
> void
> foo (__vector_quad *dst, vector unsigned char *ptr, vector unsigned char src)
> {
>   __vector_quad acc;
>   __builtin_mma_xvf32ger(&acc, src, ptr[0]);
>   __builtin_mma_xvf32gerpp(&acc, src, ptr[1]);
>   *dst = acc;
> }
>
> My earlier implementation without moves generates the following assembly
> for the above testcase:
>
> .LFB0:
>         .cfi_startproc
>         .localentry     _Z3fooPu13__vector_quadPDv16_hS1_,1
>         lxvp %vs32,0(%r4)
>         xvf32ger 0,%vs34,%vs33
>         xvf32gerpp 0,%vs34,%vs32
>         xxmfacc 0
>         stxvp %vs2,0(%r3)
>         stxvp %vs0,32(%r3)
>         blr
>
> With the moves you have suggested I get the below generated assembly:
>
> .LFB0:
>         .cfi_startproc
>         .localentry     _Z3fooPu13__vector_quadPDv16_hS1_,1
>         lxvp %vs0,0(%r4)
>         xxlor %vs33,%vs1,%vs1
>         xxlor %vs32,%vs0,%vs0
>         xvf32ger 0,%vs34,%vs33
>         xvf32gerpp 0,%vs34,%vs32
>         xxmfacc 0
>         stxvp %vs2,0(%r3)
>         stxvp %vs0,32(%r3)
>         blr
>
> As you see, with register moves two extra xxlor instructions are
> generated.
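[A rough sketch of the move-based lowering being compared above.  The helper
name and the plain start_sequence/emit_insn usage (outside the pass's RTL-SSA
bookkeeping, which is also where the new pseudo's def would need to be
recorded) are illustrative assumptions, not code from the posted patch:]

  /* Illustrative sketch only: given the two original V16QI destinations
     DEST0/DEST1 and the address ADDR of the lower half, build the paired
     load (which becomes lxvp on rs6000) plus the two register moves.  */
  static rtx_insn *
  build_pair_load_with_moves (rtx dest0, rtx dest1, rtx addr)
  {
    start_sequence ();
    /* Fresh 256-bit pseudo holding both halves of the pair.  */
    rtx pair = gen_reg_rtx (OOmode);
    emit_insn (gen_rtx_SET (pair, gen_rtx_MEM (OOmode, addr)));
    /* The original loads become moves from the two 128-bit halves.  */
    emit_insn (gen_rtx_SET (dest0,
			    simplify_gen_subreg (V16QImode, pair, OOmode, 0)));
    emit_insn (gen_rtx_SET (dest1,
			    simplify_gen_subreg (V16QImode, pair, OOmode, 16)));
    rtx_insn *insns = get_insns ();
    end_sequence ();
    return insns;
  }

It is this pair of subreg moves that shows up as the two extra xxlor
instructions in the second listing when the register allocator does not
manage to coalesce them away.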
I see.  There's no need to change your approach for now, then, it was
just a suggestion.  I was hoping the RA would be able to eliminate the
redundant moves, but it looks like that isn't happening here.  Richard S
might have some pointers for you w.r.t. this approach when he gets back.

Sorry for the delay in getting back to you, I was busy with other things
last week.  I'll try to look at the latest version of the patch you sent
for the pass.

Thanks,
Alex

>
> Please let me know what do you think.
>
> Thanks & Regards
> Ajit
> > Thanks & Regards
> > Ajit
> >> Thanks & Regards
> >> Ajit
> >>> Thanks,
> >>> Alex
> >>>
> >>>>
> >>>> That's why I haven't used the above method, as I also thought it
> >>>> through.
> >>>>
> >>>> Please let me know what do you think.
> >>>>
> >>>> Thanks & Regards
> >>>> Ajit
> >>>>> now I'm not sure off-hand if this exact combination of subregs and mode
> >>>>> changes is valid, but hopefully you get the idea.  The benefit of this
> >>>>> approach is that it keeps the transformation local and is more
> >>>>> compile-time efficient (we don't have to look through all mentions of
> >>>>> v1/v2 and replace them with subregs).  I'd hope that the RA can then
> >>>>> clean up any redundant moves (especially in the case that v1/v2 are
> >>>>> pseudos).
> >>>>>
> >>>>> That would mean that you don't need all the grubbing around with DF
> >>>>> looking for occurrences of the transfer regs.  Instead we'd just need
> >>>>> some logic to insert the extra moves after the lxvp for rs6000, and I
> >>>>> think this might fit more cleanly into the current pass.
> >>>>>
> >>>>> Does that make sense?  In any case, this shouldn't really affect the
> >>>>> preparatory aarch64 patch because we were planning to defer adding any
> >>>>> hooks that are only needed for rs6000 from the initial aarch64/generic
> >>>>> split.
> >>>>>
> >>>>> Thanks,
> >>>>> Alex
> >>>>>
> >>>>>>
> >>>>>> Thanks & Regards
> >>>>>> Ajit
> >>>>>>
> >>>>>>> Thanks,
> >>>>>>> Alex
> >>>>>>>
> >>>>>>>>
> >>>>>>>> Thanks & Regards
> >>>>>>>> Ajit
> >>>>>>>>> Thanks,
> >>>>>>>>> Alex
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> lxvp should load from even registers and then the loaded value
> >>>>>>>>>> will be in the even register and even register + 1 (which is odd).
> >>>>>>>>>>
> >>>>>>>>>> Thanks & Regards
> >>>>>>>>>> Ajit
> >>>>>>>>>>> If that is the case then I think it would be better to introduce a
> >>>>>>>>>>> virtual function (say pair_offset_alignment_ok_p) that vets the base
> >>>>>>>>>>> offset of the pair (prev_access->offset in transform_for_base).  I guess
> >>>>>>>>>>> it would also take access_size as a parameter and for aarch64 it should
> >>>>>>>>>>> check:
> >>>>>>>>>>>
> >>>>>>>>>>>   multiple_p (offset, access_size)
> >>>>>>>>>>>
> >>>>>>>>>>> and for rs6000 it could check:
> >>>>>>>>>>>
> >>>>>>>>>>>   multiple_p (offset, access_size * 2)
> >>>>>>>>>>>
> >>>>>>>>>>> and we would then incorporate a call to that predicate in the else if
> >>>>>>>>>>> condition of transform_for_base.
> >>>>>>>>>>>
> >>>>>>>>>>> It would have the same limitation whereby we assume that MEM_EXPR offset
> >>>>>>>>>>> alignment is a good proxy for RTL offset alignment, but we already
> >>>>>>>>>>> depend on that with the multiple_p check in track_via_mem_expr.
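[A minimal sketch of the hook suggested in the quoted text above.  The hook
and class names follow the thread (pair_offset_alignment_ok_p, the generic
pair_fusion interface, aarch64_pair_fusion); the rs6000_pair_fusion name and
the exact signatures are assumptions, not the final patch:]

  // In the generic pair_fusion interface: vet the base offset of a
  // candidate pair (prev_access->offset in transform_for_base).
  virtual bool pair_offset_alignment_ok_p (poly_int64 offset,
					   unsigned access_size) = 0;

  // aarch64: the pair offset only needs to be a multiple of one operand.
  bool
  aarch64_pair_fusion::pair_offset_alignment_ok_p (poly_int64 offset,
						   unsigned access_size)
  {
    return multiple_p (offset, access_size);
  }

  // rs6000: lxvp/stxvp want the offset aligned to the whole pair, so
  // (0,16) and (32,48) merge but (16,32) does not.
  bool
  rs6000_pair_fusion::pair_offset_alignment_ok_p (poly_int64 offset,
						  unsigned access_size)
  {
    return multiple_p (offset, access_size * 2);
  }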
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks,
> >>>>>>>>>>> Alex
> >>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thanks & Regards
> >>>>>>>>>>>> Ajit
> >>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>> Alex
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>      }
> >>>>>>>>>>>>>>>>    prev_access = &access;
> >>>>>>>>>>>>>>>>  }
> >>>>>>>>>>>>>>>> @@ -2919,7 +3047,7 @@ ldp_bb_info::transform_for_base (int encoded_lfs,
> >>>>>>>>>>>>>>>>  // and remove all the tombstone insns, being sure to reparent any uses
> >>>>>>>>>>>>>>>>  // of mem to previous defs when we do this.
> >>>>>>>>>>>>>>>>  void
> >>>>>>>>>>>>>>>> -ldp_bb_info::cleanup_tombstones ()
> >>>>>>>>>>>>>>>> +pair_fusion_bb_info::cleanup_tombstones ()
> >>>>>>>>>>>>>>>>  {
> >>>>>>>>>>>>>>>>    // No need to do anything if we didn't emit a tombstone insn for this BB.
> >>>>>>>>>>>>>>>>    if (!m_emitted_tombstone)
> >>>>>>>>>>>>>>>> @@ -2947,7 +3075,7 @@ ldp_bb_info::cleanup_tombstones ()
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>  template<typename Map>
> >>>>>>>>>>>>>>>>  void
> >>>>>>>>>>>>>>>> -ldp_bb_info::traverse_base_map (Map &map)
> >>>>>>>>>>>>>>>> +pair_fusion_bb_info::traverse_base_map (Map &map)
> >>>>>>>>>>>>>>>>  {
> >>>>>>>>>>>>>>>>    for (auto kv : map)
> >>>>>>>>>>>>>>>>      {
> >>>>>>>>>>>>>>>> @@ -2958,7 +3086,7 @@ ldp_bb_info::traverse_base_map (Map &map)
> >>>>>>>>>>>>>>>>  }
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>  void
> >>>>>>>>>>>>>>>> -ldp_bb_info::transform ()
> >>>>>>>>>>>>>>>> +pair_fusion_bb_info::transform ()
> >>>>>>>>>>>>>>>>  {
> >>>>>>>>>>>>>>>>    traverse_base_map (expr_map);
> >>>>>>>>>>>>>>>>    traverse_base_map (def_map);
> >>>>>>>>>>>>>>>> @@ -3167,14 +3295,13 @@ try_promote_writeback (insn_info *insn)
> >>>>>>>>>>>>>>>>  // for load/store candidates.  If running after RA, also try and promote
> >>>>>>>>>>>>>>>>  // non-writeback pairs to use writeback addressing.  Then try to fuse
> >>>>>>>>>>>>>>>>  // candidates into pairs.
> >>>>>>>>>>>>>>>> -void ldp_fusion_bb (bb_info *bb)
> >>>>>>>>>>>>>>>> +void pair_fusion::ldp_fusion_bb (bb_info *bb)
> >>>>>>>>>>>>>>>>  {
> >>>>>>>>>>>>>>>> -  const bool track_loads
> >>>>>>>>>>>>>>>> -    = aarch64_tune_params.ldp_policy_model != AARCH64_LDP_STP_POLICY_NEVER;
> >>>>>>>>>>>>>>>> -  const bool track_stores
> >>>>>>>>>>>>>>>> -    = aarch64_tune_params.stp_policy_model != AARCH64_LDP_STP_POLICY_NEVER;
> >>>>>>>>>>>>>>>> +  const bool track_loads = track_load_p ();
> >>>>>>>>>>>>>>>> +  const bool track_stores = track_store_p ();
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> -  ldp_bb_info bb_state (bb);
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> This:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> +  aarch64_pair_fusion derived;
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> can be deleted and then:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> +  pair_fusion_bb_info bb_info (bb, &derived);
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> can just be:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>   pair_fusion_bb_info bb_info (bb, this);
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> (or you can pass *this if you make bb_info take a reference).
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I don't think there's a particular need to change the variable name
> >>>>>>>>>>>>>>> (bb_state -> bb_info).  I chose the former because it doesn't clash
> >>>>>>>>>>>>>>> with the RTL-SSA structure of the same name as the latter.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Addressed.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>    for (auto insn : bb->nondebug_insns ())
> >>>>>>>>>>>>>>>>      {
> >>>>>>>>>>>>>>>> @@ -3184,31 +3311,31 @@ void ldp_fusion_bb (bb_info *bb)
> >>>>>>>>>>>>>>>>          continue;
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>        rtx pat = PATTERN (rti);
> >>>>>>>>>>>>>>>> -      if (reload_completed
> >>>>>>>>>>>>>>>> -          && aarch64_ldp_writeback > 1
> >>>>>>>>>>>>>>>> -          && GET_CODE (pat) == PARALLEL
> >>>>>>>>>>>>>>>> -          && XVECLEN (pat, 0) == 2)
> >>>>>>>>>>>>>>>> +      if (pair_mem_promote_writeback_p (pat))
> >>>>>>>>>>>>>>>>          try_promote_writeback (insn);
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> It looks like try_promote_writeback itself will need some further work
> >>>>>>>>>>>>>>> to make it target-independent.  I suppose this check:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>   auto rti = insn->rtl ();
> >>>>>>>>>>>>>>>   const auto attr = get_attr_ldpstp (rti);
> >>>>>>>>>>>>>>>   if (attr == LDPSTP_NONE)
> >>>>>>>>>>>>>>>     return;
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>   bool load_p = (attr == LDPSTP_LDP);
> >>>>>>>>>>>>>>>   gcc_checking_assert (load_p || attr == LDPSTP_STP);
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> will need to become part of the pair_mem_promote_writeback_p hook that you
> >>>>>>>>>>>>>>> added, potentially changing it to return a boolean for load_p.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Then I guess we will need hooks for destructuring the pair insn and
> >>>>>>>>>>>>>>> another hook to wrap aarch64_gen_writeback_pair.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Addressed.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>        if (GET_CODE (pat) != SET)
> >>>>>>>>>>>>>>>>          continue;
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>        if (track_stores && MEM_P (XEXP (pat, 0)))
> >>>>>>>>>>>>>>>> -        bb_state.track_access (insn, false, XEXP (pat, 0));
> >>>>>>>>>>>>>>>> +        bb_info.track_access (insn, false, XEXP (pat, 0));
> >>>>>>>>>>>>>>>>        else if (track_loads && MEM_P (XEXP (pat, 1)))
> >>>>>>>>>>>>>>>> -        bb_state.track_access (insn, true, XEXP (pat, 1));
> >>>>>>>>>>>>>>>> +        bb_info.track_access (insn, true, XEXP (pat, 1));
> >>>>>>>>>>>>>>>>      }
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> -  bb_state.transform ();
> >>>>>>>>>>>>>>>> -  bb_state.cleanup_tombstones ();
> >>>>>>>>>>>>>>>> +  bb_info.transform ();
> >>>>>>>>>>>>>>>> +  bb_info.cleanup_tombstones ();
> >>>>>>>>>>>>>>>>  }
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>  void ldp_fusion ()
> >>>>>>>>>>>>>>>>  {
> >>>>>>>>>>>>>>>>    ldp_fusion_init ();
> >>>>>>>>>>>>>>>> +  pair_fusion *pfuse;
> >>>>>>>>>>>>>>>> +  aarch64_pair_fusion derived;
> >>>>>>>>>>>>>>>> +  pfuse = &derived;
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> This is indeed the one place where I think it is acceptable to
> >>>>>>>>>>>>>>> instantiate aarch64_pair_fusion.  But again there's no need to create a
> >>>>>>>>>>>>>>> pointer to the parent class, just call any function you like directly.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Addressed.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>    for (auto bb : crtl->ssa->bbs ())
> >>>>>>>>>>>>>>>> -    ldp_fusion_bb (bb);
> >>>>>>>>>>>>>>>> +    pfuse->ldp_fusion_bb (bb);
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I think even the code to iterate over bbs should itself be a member
> >>>>>>>>>>>>>>> function of pair_fusion (say "run") and then that becomes part of the
> >>>>>>>>>>>>>>> generic code.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> So this function would just become:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>   aarch64_pair_fusion pass;
> >>>>>>>>>>>>>>>   pass.run ();
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> and could be inlined into the caller.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Addressed.
> >>>>>>>>>>>>>>> Perhaps you could also add an early return like:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>   if (!track_loads_p () && !track_stores_p ())
> >>>>>>>>>>>>>>>     return;
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> in pair_fusion::run () and then remove the corresponding code from
> >>>>>>>>>>>>>>> pass_ldp_fusion::gate?
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Addressed.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>    ldp_fusion_destroy ();
> >>>>>>>>>>>>>>>>  }
> >>>>>>>>>>>>>>>> --
> >>>>>>>>>>>>>>>> 2.39.3
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>> Alex
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Thanks & Regards
> >>>>>>>>>>>>>> Ajit
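[For reference, combining the run () and early-return suggestions quoted
above, the generic driver could end up looking roughly like this; the names
follow the suggestions in the thread and the exact shape is an assumption,
not the committed code:]

  void
  pair_fusion::run ()
  {
    // Nothing to do if the target tracks neither loads nor stores.
    if (!track_loads_p () && !track_stores_p ())
      return;

    ldp_fusion_init ();

    // Walk each basic block of the function in RTL-SSA form.
    for (auto bb : crtl->ssa->bbs ())
      ldp_fusion_bb (bb);

    ldp_fusion_destroy ();
  }

so that the target entry point reduces to "aarch64_pair_fusion pass;
pass.run ();" as suggested above.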