Hello Alex: On 14/04/24 10:29 pm, Ajit Agarwal wrote: > Hello Alex: > > On 12/04/24 11:02 pm, Ajit Agarwal wrote: >> Hello Alex: >> >> On 12/04/24 8:15 pm, Alex Coplan wrote: >>> On 12/04/2024 20:02, Ajit Agarwal wrote: >>>> Hello Alex: >>>> >>>> On 11/04/24 7:55 pm, Alex Coplan wrote: >>>>> On 10/04/2024 23:48, Ajit Agarwal wrote: >>>>>> Hello Alex: >>>>>> >>>>>> On 10/04/24 7:52 pm, Alex Coplan wrote: >>>>>>> Hi Ajit, >>>>>>> >>>>>>> On 10/04/2024 15:31, Ajit Agarwal wrote: >>>>>>>> Hello Alex: >>>>>>>> >>>>>>>> On 10/04/24 1:42 pm, Alex Coplan wrote: >>>>>>>>> Hi Ajit, >>>>>>>>> >>>>>>>>> On 09/04/2024 20:59, Ajit Agarwal wrote: >>>>>>>>>> Hello Alex: >>>>>>>>>> >>>>>>>>>> On 09/04/24 8:39 pm, Alex Coplan wrote: >>>>>>>>>>> On 09/04/2024 20:01, Ajit Agarwal wrote: >>>>>>>>>>>> Hello Alex: >>>>>>>>>>>> >>>>>>>>>>>> On 09/04/24 7:29 pm, Alex Coplan wrote: >>>>>>>>>>>>> On 09/04/2024 17:30, Ajit Agarwal wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> On 05/04/24 10:03 pm, Alex Coplan wrote: >>>>>>>>>>>>>>> On 05/04/2024 13:53, Ajit Agarwal wrote: >>>>>>>>>>>>>>>> Hello Alex/Richard: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> All review comments are incorporated. >>>>>>> <snip> >>>>>>>>>>>>>>>> @@ -2890,8 +3018,8 @@ ldp_bb_info::merge_pairs (insn_list_t >>>>>>>>>>>>>>>> &left_list, >>>>>>>>>>>>>>>> // of accesses. If we find two sets of adjacent accesses, >>>>>>>>>>>>>>>> call >>>>>>>>>>>>>>>> // merge_pairs. >>>>>>>>>>>>>>>> void >>>>>>>>>>>>>>>> -ldp_bb_info::transform_for_base (int encoded_lfs, >>>>>>>>>>>>>>>> - access_group &group) >>>>>>>>>>>>>>>> +pair_fusion_bb_info::transform_for_base (int encoded_lfs, >>>>>>>>>>>>>>>> + access_group &group) >>>>>>>>>>>>>>>> { >>>>>>>>>>>>>>>> const auto lfs = decode_lfs (encoded_lfs); >>>>>>>>>>>>>>>> const unsigned access_size = lfs.size; >>>>>>>>>>>>>>>> @@ -2909,7 +3037,7 @@ ldp_bb_info::transform_for_base (int >>>>>>>>>>>>>>>> encoded_lfs, >>>>>>>>>>>>>>>> access.cand_insns, >>>>>>>>>>>>>>>> lfs.load_p, >>>>>>>>>>>>>>>> access_size); >>>>>>>>>>>>>>>> - skip_next = access.cand_insns.empty (); >>>>>>>>>>>>>>>> + skip_next = bb_state->cand_insns_empty_p >>>>>>>>>>>>>>>> (access.cand_insns); >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> As above, why is this needed? >>>>>>>>>>>>>> >>>>>>>>>>>>>> For rs6000 we want to return always true. as load store pair >>>>>>>>>>>>>> that are to be merged with 8/16 16/32 32/64 is occuring for >>>>>>>>>>>>>> rs6000. >>>>>>>>>>>>>> And we want load store pair to 8/16 32/64. Thats why we want >>>>>>>>>>>>>> to generate always true for rs6000 to skip pairs as above. >>>>>>>>>>>>> >>>>>>>>>>>>> Hmm, sorry, I'm not sure I follow. Are you saying that for >>>>>>>>>>>>> rs6000 you have >>>>>>>>>>>>> load/store pair instructions where the two arms of the access are >>>>>>>>>>>>> storing >>>>>>>>>>>>> operands of different sizes? Or something else? >>>>>>>>>>>>> >>>>>>>>>>>>> As it stands the logic is to skip the next iteration only if we >>>>>>>>>>>>> exhausted all the candidate insns for the current access. In the >>>>>>>>>>>>> case >>>>>>>>>>>>> that we didn't exhaust all such candidates, then the idea is that >>>>>>>>>>>>> when >>>>>>>>>>>>> access becomes prev_access, we can attempt to use those >>>>>>>>>>>>> candidates as >>>>>>>>>>>>> the "left-hand side" of a pair in the next iteration since we >>>>>>>>>>>>> failed to >>>>>>>>>>>>> use them as the "right-hand side" of a pair in the current >>>>>>>>>>>>> iteration. >>>>>>>>>>>>> I don't see why you wouldn't want that behaviour. Please can you >>>>>>>>>>>>> explain? >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> In merge_pair we get the 2 load candiates one load from 0 offset >>>>>>>>>>>> and >>>>>>>>>>>> other load is from 16th offset. Then in next iteration we get load >>>>>>>>>>>> from 16th offset and other load from 32 offset. In next iteration >>>>>>>>>>>> we get load from 32 offset and other load from 48 offset. >>>>>>>>>>>> >>>>>>>>>>>> For example: >>>>>>>>>>>> >>>>>>>>>>>> Currently we get the load candiates as follows. >>>>>>>>>>>> >>>>>>>>>>>> pairs: >>>>>>>>>>>> >>>>>>>>>>>> load from 0th offset. >>>>>>>>>>>> load from 16th offset. >>>>>>>>>>>> >>>>>>>>>>>> next pairs: >>>>>>>>>>>> >>>>>>>>>>>> load from 16th offset. >>>>>>>>>>>> load from 32th offset. >>>>>>>>>>>> >>>>>>>>>>>> next pairs: >>>>>>>>>>>> >>>>>>>>>>>> load from 32th offset >>>>>>>>>>>> load from 48th offset. >>>>>>>>>>>> >>>>>>>>>>>> Instead in rs6000 we should get: >>>>>>>>>>>> >>>>>>>>>>>> pairs: >>>>>>>>>>>> >>>>>>>>>>>> load from 0th offset >>>>>>>>>>>> load from 16th offset. >>>>>>>>>>>> >>>>>>>>>>>> next pairs: >>>>>>>>>>>> >>>>>>>>>>>> load from 32th offset >>>>>>>>>>>> load from 48th offset. >>>>>>>>>>> >>>>>>>>>>> Hmm, so then I guess my question is: why wouldn't you consider >>>>>>>>>>> merging >>>>>>>>>>> the pair with offsets (16,32) for rs6000? Is it because you have a >>>>>>>>>>> stricter alignment requirement on the base pair offsets (i.e. they >>>>>>>>>>> have >>>>>>>>>>> to be a multiple of 32 when the operand size is 16)? So the pair >>>>>>>>>>> offsets have to be a multiple of the entire pair size rather than a >>>>>>>>>>> single operand size> >>>>>>>>>> >>>>>>>>>> We get load pair at a certain point with (0,16) and other program >>>>>>>>>> point we get load pair (32, 48). >>>>>>>>>> >>>>>>>>>> In current implementation it takes offsets loads as (0, 16), >>>>>>>>>> (16, 32), (32, 48). >>>>>>>>>> >>>>>>>>>> But In rs6000 we want the load pair to be merged at different points >>>>>>>>>> as (0,16) and (32, 48). for (0,16) we want to replace load lxvp with >>>>>>>>>> 0 offset and other load (32, 48) with lxvp with 32 offset. >>>>>>>>>> >>>>>>>>>> In current case it will merge with lxvp with 0 offset and lxvp with >>>>>>>>>> 16 offset, then lxvp with 32 offset and lxvp with 48 offset which >>>>>>>>>> is incorrect in our case as the (16-32) case 16 offset will not >>>>>>>>>> load from even register and break for rs6000. >>>>>>>>> >>>>>>>>> Sorry, I think I'm still missing something here. Why does the >>>>>>>>> address offset >>>>>>>>> affect the parity of the tranfser register? ISTM they needn't be >>>>>>>>> related at >>>>>>>>> all (and indeed we can't even know the parity of the transfer >>>>>>>>> register before >>>>>>>>> RA, but perhaps you're only intending on running the pass after RA?) >>>>>>>>> >>>>>>>> >>>>>>>> We have load pair with (0,16) wherein these loads are adjacent and >>>>>>>> replaced with lxvp. >>>>>>>> >>>>>>>> Semantic of lxvp instruction is that it loads adjacent load pair in >>>>>>>> even register and even_register + 1. >>>>>>>> >>>>>>>> We replace the above load pair with lxvp instruction and then we >>>>>>>> dont need to merge (16,32) as (0, 16) is already merged and instead >>>>>>>> we merge (32,48). >>>>>>> >>>>>>> Ok, but the existing logic should already account for this. I.e. if we >>>>>>> successfully merge (0,16), then we don't attempt to merge (16,32). >>>>>>> We'd only >>>>>>> attempt to merge (16,32) if the merge of (0,16) failed (for whatever >>>>>>> reason). >>>>>>> So I don't see that there's anything specific to lxvp that requires >>>>>>> this logic >>>>>>> to change, _unless_ you have a stricter alignment requirement on the >>>>>>> offsets as >>>>>>> I mentioned before. >>>>>>> >>>>>> >>>>>> Thanks for the suggestion. It worked for rs6000 also with current >>>>>> changes. >>>>>> Sorry for the confusion. >>>>> >>>>> Alright, glad we got to the bottom of this! >>>> >>>> Thanks. >>>>> >>>>>> >>>>>>>> >>>>>>>> Yes you are correct, the addresss offset doesn't affect the parity of >>>>>>>> the register transfer as we are doing fusion pass before RA. >>>>>>>> If that is the case then I think it would be better to introduce a >>>>>>>>>>> virtual function (say pair_offset_alignment_ok_p) that vets the base >>>>>>>>>>> offset of the pair (prev_access->offset in transform_for_base). I >>>>>>>>>>> guess >>>>>>>>>>> it would also take access_size as a parameter and for aarch64 it >>>>>>>>>>> should >>>>>>>>>>> check: >>>>>>>>>>> >>>>>>>>>>> multiple_p (offset, access_size) >>>>>>>>>>> >>>>>>>>>>> and for rs6000 it could check: >>>>>>>>>>> >>>>>>>>>>> multiple_p (offset, access_size * 2) >>>>>>>>>>> >>>>>>>>>>> and we would then incorporate a call to that predicate in the else >>>>>>>>>>> if >>>>>>>>>>> condition of tranform_for_base. >>>>>>>>>>> >>>>>>>>>>> It would have the same limitation whereby we assume that MEM_EXPR >>>>>>>>>>> offset >>>>>>>>>>> alignment is a good proxy for RTL offset alignment, but we already >>>>>>>>>>> depend on that with the multiple_p check in track_via_mem_expr. >>>>>>>>>>> >>>>>>>> I have addressed the above hooks and it worked fine with both rs6000 >>>>>>>> and aarch64. I am sending subsequent patch in some time that address >>>>>>>> above. >>>>>>>> >>>>>>>>> How do you plan on handling this even-odd requirement for rs6000? >>>>>>>>> >>>>>>>> >>>>>>>> We plan to handle with V16QI subreg: 0 and V16QI subreg : 16 to >>>>>>>> generate register pair and thats what we generate and implement >>>>>>>> in rs6000 target >>>>>>>> code. >>>>>>> >>>>>>> Ah, this is coming back to me now. Sorry, I should have remembered >>>>>>> this from >>>>>>> the previous discussion with Richard S. >>>>>>> >>>>>>> Apologies for going on a slight tangent here, but if you're running >>>>>>> before RA are you planning to create a new OImode pseudo register for >>>>>>> the lxvp insn and then somehow update uses of the old transfer registers >>>>>>> to replace them with subregs of that OImode pseudo? >>>>>> >>>>>> Yes I do the same as you have mentioned. We generate register pairs >>>>>> with 256 bit mode with two subregs of 128 bit modes with 0 and >>>>>> 16 offset. >>>>>> >>>>>> Or do you just plan >>>>>>> or replacing the individual loads with moves (instead of deleting them)? >>>>>>> I guess the latter would be simpler and might work better in the >>>>>>> presence of hard regs. >>>>>>> >>>>>> >>>>>> Would you mind explaining how to generate register pairs with lxvp by >>>>>> replacing loads with moves. >>>>> >>>>> Yeah, so suppose you have something like: >>>>> >>>>> (set (reg:V16QI v1) (mem:V16QI addr)) >>>>> (set (reg:V16QI v2) (mem:V16QI addr+16)) >>>>> >>>>> then when you insert the lxvp you can then (logically) replace the >>>>> original load instructions with moves from the appropriate subreg, as >>>>> follows: >>>>> >>>>> (set (reg:OI pair-pseudo) (mem:OI addr)) ; lxvp >>>>> (set (reg:V16QI v1) (subreg:V16QI (reg:OI pair-pseudo) 0)) >>>>> (set (reg:V16QI v2) (subreg:V16QI (reg:OI pair-pseudo) 16)) >>>>> >>>> >>>> Any Pseudo created with gen_rtx_REG like >>>> gen_rtx_REG (OOmode, REGNO (dest_exp) will error'ed out >>>> with unrecognize insn by LRA. >>> >>> I'm not surprised that goes wrong: you can't just create a new REG >>> rtx in a different mode and reuse the regno of an existing pseudo. >>> > > Thanks for the suggestion. >>>> >>>> If I create pseudo with gen_reg_rtx (OOmode) will error'ed >>>> out with new_defs Pseudo register is not found in >>>> change->new_defs. >>> >>> Yeah, I suppose you'd need to add an RTL-SSA def for the new pseudo. >>> >> >> Would you mind explaining how can I add and RTL-SSA def for the >> new pseudo. > > I have added and RTL-SSA def for the new pseudo. With that I could > get register oairs correctly. >> >>>> >>>> Also the sequential register pairs are not generated by >>>> Register Allocator. >>> >>> So how do you get the sequential pairs with your current approach? My >>> understanding was that what I suggested above doesn't really change what >>> you're doing w.r.t the lxvp insn itself, but maybe I didn't properly >>> understand the approach taken in your initial patchset. >>> >> >> I generate (set (reg:OO pair-pseudo) (mem:OI addr)) ; lxvp >> and then at the use point of pair_pseudo generate the following. >> >> (subreg:V16QI (reg:OI pair-pseudo) 0)) >> (subreg:V16QI (reg:OI pair-pseudo) 16)) >> > > I get register pairs correctly generating RTL as you have > suggested but we get extra moves that would impact the performance. > > Please let me know what do you think. >
For the below testcase: #include <altivec.h> void foo (__vector_quad *dst, vector unsigned char *ptr, vector unsigned char src) { __vector_quad acc; __builtin_mma_xvf32ger(&acc, src, ptr[0]); __builtin_mma_xvf32gerpp(&acc, src, ptr[1]); *dst = acc; } My earlier implementation without moves generates the following assembly for the above testcase: .LFB0: .cfi_startproc .localentry _Z3fooPu13__vector_quadPDv16_hS1_,1 lxvp %vs32,0(%r4) xvf32ger 0,%vs34,%vs33 xvf32gerpp 0,%vs34,%vs32 xxmfacc 0 stxvp %vs2,0(%r3) stxvp %vs0,32(%r3) blr With moves you have suggested I get the below generated assembly: .LFB0: .cfi_startproc .localentry _Z3fooPu13__vector_quadPDv16_hS1_,1 lxvp %vs0,0(%r4) xxlor %vs33,%vs1,%vs1 xxlor %vs32,%vs0,%vs0 xvf32ger 0,%vs34,%vs33 xvf32gerpp 0,%vs34,%vs32 xxmfacc 0 stxvp %vs2,0(%r3) stxvp %vs0,32(%r3) blr As you see with register moves two extra xxlor instructions are generated. Please let me know what do you think. Thanks & Regards Ajit > Thanks & Regards > Ajit >> Thanks & Regards >> Ajit >>> Thanks, >>> Alex >>> >>>> >>>> Thats why I haven't used above method as I also thought >>>> through. >>>> >>>> Please let me know what do you think. >>>> >>>> Thanks & Regards >>>> Ajit >>>>> now I'm not sure off-hand if this exact combination of subregs and mode >>>>> changes is valid, but hopefully you get the idea. The benefit of this >>>>> approach is that it keeps the transformation local and is more >>>>> compile-time efficient (we don't have to look through all mentions of >>>>> v1/v2 and replace them with subregs). I'd hope that the RA can then >>>>> clean up any redundant moves (especially in the case that v1/v2 are >>>>> pseudos). >>>>> >>>>> That would mean that you don't need all the grubbing around with DF >>>>> looking for occurrences of the transfer regs. Instead we'd just need >>>>> some logic to insert the extra moves after the lxvp for rs6000, and I >>>>> think this might fit more cleanly into the current pass. >>>>> >>>>> Does that make sense? In any case, this shouldn't really affect the >>>>> preparatory aarch64 patch because we were planning to defer adding any >>>>> hooks that are only needed for rs6000 from the initial aarch64/generic >>>>> split. >>>>> >>>>> Thanks, >>>>> Alex >>>>> >>>>>> >>>>>> Thanks & Regards >>>>>> Ajit >>>>>> >>>>>>> Thanks, >>>>>>> Alex >>>>>>> >>>>>>>> >>>>>>>> Thanks & Regards >>>>>>>> Ajit >>>>>>>>> Thanks, >>>>>>>>> Alex >>>>>>>>> >>>>>>>>>> >>>>>>>>>> lxvp should load from even registers and then loaded value will >>>>>>>>>> be in even register and even register +1 (which is odd). >>>>>>>>>> >>>>>>>>>> Thanks & Regards >>>>>>>>>> Ajit >>>>>>>>>>> If that is the case then I think it would be better to introduce a >>>>>>>>>>> virtual function (say pair_offset_alignment_ok_p) that vets the base >>>>>>>>>>> offset of the pair (prev_access->offset in transform_for_base). I >>>>>>>>>>> guess >>>>>>>>>>> it would also take access_size as a parameter and for aarch64 it >>>>>>>>>>> should >>>>>>>>>>> check: >>>>>>>>>>> >>>>>>>>>>> multiple_p (offset, access_size) >>>>>>>>>>> >>>>>>>>>>> and for rs6000 it could check: >>>>>>>>>>> >>>>>>>>>>> multiple_p (offset, access_size * 2) >>>>>>>>>>> >>>>>>>>>>> and we would then incorporate a call to that predicate in the else >>>>>>>>>>> if >>>>>>>>>>> condition of tranform_for_base. >>>>>>>>>>> >>>>>>>>>>> It would have the same limitation whereby we assume that MEM_EXPR >>>>>>>>>>> offset >>>>>>>>>>> alignment is a good proxy for RTL offset alignment, but we already >>>>>>>>>>> depend on that with the multiple_p check in track_via_mem_expr. >>>>>>>>>>> >>>>>>>>>>> Thanks, >>>>>>>>>>> Alex >>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> Thanks & Regards >>>>>>>>>>>> Ajit >>>>>>>>>>>>> Thanks, >>>>>>>>>>>>> Alex >>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> } >>>>>>>>>>>>>>>> prev_access = &access; >>>>>>>>>>>>>>>> } >>>>>>>>>>>>>>>> @@ -2919,7 +3047,7 @@ ldp_bb_info::transform_for_base (int >>>>>>>>>>>>>>>> encoded_lfs, >>>>>>>>>>>>>>>> // and remove all the tombstone insns, being sure to reparent >>>>>>>>>>>>>>>> any uses >>>>>>>>>>>>>>>> // of mem to previous defs when we do this. >>>>>>>>>>>>>>>> void >>>>>>>>>>>>>>>> -ldp_bb_info::cleanup_tombstones () >>>>>>>>>>>>>>>> +pair_fusion_bb_info::cleanup_tombstones () >>>>>>>>>>>>>>>> { >>>>>>>>>>>>>>>> // No need to do anything if we didn't emit a tombstone >>>>>>>>>>>>>>>> insn for this BB. >>>>>>>>>>>>>>>> if (!m_emitted_tombstone) >>>>>>>>>>>>>>>> @@ -2947,7 +3075,7 @@ ldp_bb_info::cleanup_tombstones () >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> template<typename Map> >>>>>>>>>>>>>>>> void >>>>>>>>>>>>>>>> -ldp_bb_info::traverse_base_map (Map &map) >>>>>>>>>>>>>>>> +pair_fusion_bb_info::traverse_base_map (Map &map) >>>>>>>>>>>>>>>> { >>>>>>>>>>>>>>>> for (auto kv : map) >>>>>>>>>>>>>>>> { >>>>>>>>>>>>>>>> @@ -2958,7 +3086,7 @@ ldp_bb_info::traverse_base_map (Map &map) >>>>>>>>>>>>>>>> } >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> void >>>>>>>>>>>>>>>> -ldp_bb_info::transform () >>>>>>>>>>>>>>>> +pair_fusion_bb_info::transform () >>>>>>>>>>>>>>>> { >>>>>>>>>>>>>>>> traverse_base_map (expr_map); >>>>>>>>>>>>>>>> traverse_base_map (def_map); >>>>>>>>>>>>>>>> @@ -3167,14 +3295,13 @@ try_promote_writeback (insn_info *insn) >>>>>>>>>>>>>>>> // for load/store candidates. If running after RA, also try >>>>>>>>>>>>>>>> and promote >>>>>>>>>>>>>>>> // non-writeback pairs to use writeback addressing. Then try >>>>>>>>>>>>>>>> to fuse >>>>>>>>>>>>>>>> // candidates into pairs. >>>>>>>>>>>>>>>> -void ldp_fusion_bb (bb_info *bb) >>>>>>>>>>>>>>>> +void pair_fusion::ldp_fusion_bb (bb_info *bb) >>>>>>>>>>>>>>>> { >>>>>>>>>>>>>>>> - const bool track_loads >>>>>>>>>>>>>>>> - = aarch64_tune_params.ldp_policy_model != >>>>>>>>>>>>>>>> AARCH64_LDP_STP_POLICY_NEVER; >>>>>>>>>>>>>>>> - const bool track_stores >>>>>>>>>>>>>>>> - = aarch64_tune_params.stp_policy_model != >>>>>>>>>>>>>>>> AARCH64_LDP_STP_POLICY_NEVER; >>>>>>>>>>>>>>>> + const bool track_loads = track_load_p (); >>>>>>>>>>>>>>>> + const bool track_stores = track_store_p (); >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> - ldp_bb_info bb_state (bb); >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> This: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> + aarch64_pair_fusion derived; >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> can be deleted and then: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> + pair_fusion_bb_info bb_info (bb, &derived); >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> can just be: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> pair_fusion_bb_info bb_info (bb, this); >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> (or you can pass *this if you make bb_info take a reference). >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> I don't think there's a particular need to change the variable >>>>>>>>>>>>>>> name >>>>>>>>>>>>>>> (bb_state -> bb_info). I chose the former because it doens't >>>>>>>>>>>>>>> clash >>>>>>>>>>>>>>> with the RTL-SSA structure of the same name as the latter. >>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> Addressed. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> for (auto insn : bb->nondebug_insns ()) >>>>>>>>>>>>>>>> { >>>>>>>>>>>>>>>> @@ -3184,31 +3311,31 @@ void ldp_fusion_bb (bb_info *bb) >>>>>>>>>>>>>>>> continue; >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> rtx pat = PATTERN (rti); >>>>>>>>>>>>>>>> - if (reload_completed >>>>>>>>>>>>>>>> - && aarch64_ldp_writeback > 1 >>>>>>>>>>>>>>>> - && GET_CODE (pat) == PARALLEL >>>>>>>>>>>>>>>> - && XVECLEN (pat, 0) == 2) >>>>>>>>>>>>>>>> + if (pair_mem_promote_writeback_p (pat)) >>>>>>>>>>>>>>>> try_promote_writeback (insn); >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> It looks like try_promote_writeback itself will need some >>>>>>>>>>>>>>> further work >>>>>>>>>>>>>>> to make it target-independent. I suppose this check: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> auto rti = insn->rtl (); >>>>>>>>>>>>>>> const auto attr = get_attr_ldpstp (rti); >>>>>>>>>>>>>>> if (attr == LDPSTP_NONE) >>>>>>>>>>>>>>> return; >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> bool load_p = (attr == LDPSTP_LDP); >>>>>>>>>>>>>>> gcc_checking_assert (load_p || attr == LDPSTP_STP); >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> will need to become part of the pair_mem_promote_writeback_p >>>>>>>>>>>>>>> hook that you >>>>>>>>>>>>>>> added, potentially changing it to return a boolean for load_p. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Then I guess we will need hooks for destructuring the pair insn >>>>>>>>>>>>>>> and >>>>>>>>>>>>>>> another hook to wrap aarch64_gen_writeback_pair. >>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> Addressed. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> if (GET_CODE (pat) != SET) >>>>>>>>>>>>>>>> continue; >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> if (track_stores && MEM_P (XEXP (pat, 0))) >>>>>>>>>>>>>>>> - bb_state.track_access (insn, false, XEXP (pat, 0)); >>>>>>>>>>>>>>>> + bb_info.track_access (insn, false, XEXP (pat, 0)); >>>>>>>>>>>>>>>> else if (track_loads && MEM_P (XEXP (pat, 1))) >>>>>>>>>>>>>>>> - bb_state.track_access (insn, true, XEXP (pat, 1)); >>>>>>>>>>>>>>>> + bb_info.track_access (insn, true, XEXP (pat, 1)); >>>>>>>>>>>>>>>> } >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> - bb_state.transform (); >>>>>>>>>>>>>>>> - bb_state.cleanup_tombstones (); >>>>>>>>>>>>>>>> + bb_info.transform (); >>>>>>>>>>>>>>>> + bb_info.cleanup_tombstones (); >>>>>>>>>>>>>>>> } >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> void ldp_fusion () >>>>>>>>>>>>>>>> { >>>>>>>>>>>>>>>> ldp_fusion_init (); >>>>>>>>>>>>>>>> + pair_fusion *pfuse; >>>>>>>>>>>>>>>> + aarch64_pair_fusion derived; >>>>>>>>>>>>>>>> + pfuse = &derived; >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> This is indeed the one place where I think it is acceptable to >>>>>>>>>>>>>>> instantiate aarch64_pair_fusion. But again there's no need to >>>>>>>>>>>>>>> create a >>>>>>>>>>>>>>> pointer to the parent class, just call any function you like >>>>>>>>>>>>>>> directly. >>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> Addressed. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> for (auto bb : crtl->ssa->bbs ()) >>>>>>>>>>>>>>>> - ldp_fusion_bb (bb); >>>>>>>>>>>>>>>> + pfuse->ldp_fusion_bb (bb); >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> I think even the code to iterate over bbs should itself be a >>>>>>>>>>>>>>> member >>>>>>>>>>>>>>> function of pair_fusion (say "run") and then that becomes part >>>>>>>>>>>>>>> of the >>>>>>>>>>>>>>> generic code. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> So this function would just become: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> aarch64_pair_fusion pass; >>>>>>>>>>>>>>> pass.run (); >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> and could be inlined into the caller. >>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> Addressed. >>>>>>>>>>>>>>> Perhaps you could also add an early return like: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> if (!track_loads_p () && !track_stores_p ()) >>>>>>>>>>>>>>> return; >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> in pair_fusion::run () and then remove the corresponding code >>>>>>>>>>>>>>> from >>>>>>>>>>>>>>> pass_ldp_fusion::gate? >>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> Addressed. >>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> ldp_fusion_destroy (); >>>>>>>>>>>>>>>> } >>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>> 2.39.3 >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>> Alex >>>>>>>>>>>>>> >>>>>>>>>>>>>> Thanks & Regards >>>>>>>>>>>>>> Ajit