Hello Richard: On 31/05/24 10:29 pm, Ajit Agarwal wrote: > Hello Richard: > > On 31/05/24 8:08 pm, Richard Sandiford wrote: >> Ajit Agarwal <aagar...@linux.ibm.com> writes: >>> On 31/05/24 3:23 pm, Richard Sandiford wrote: >>>> Ajit Agarwal <aagar...@linux.ibm.com> writes: >>>>> Hello All: >>>>> >>>>> Common infrastructure using generic code for pair mem fusion of different >>>>> targets. >>>>> >>>>> rs6000 target specific specific code implements virtual functions defined >>>>> by generic code. >>>>> >>>>> Code is implemented with pure virtual functions to interface with target >>>>> code. >>>>> >>>>> Target specific code are added in rs6000-mem-fusion.cc and additional >>>>> virtual >>>>> function implementation required for rs6000 are added in >>>>> aarch64-ldp-fusion.cc. >>>>> >>>>> Bootstrapped and regtested for aarch64-linux-gnu and powerpc64-linux-gnu. >>>>> >>>>> Thanks & Regards >>>>> Ajit >>>>> >>>>> >>>>> aarch64, rs6000, middle-end: Add implementation for different targets for >>>>> pair mem fusion >>>>> >>>>> Common infrastructure using generic code for pair mem fusion of different >>>>> targets. >>>>> >>>>> rs6000 target specific specific code implements virtual functions defined >>>>> by generic code. >>>>> >>>>> Code is implemented with pure virtual functions to interface with target >>>>> code. >>>>> >>>>> Target specific code are added in rs6000-mem-fusion.cc and additional >>>>> virtual >>>>> function implementation required for rs6000 are added in >>>>> aarch64-ldp-fusion.cc. >>>>> >>>>> 2024-05-31 Ajit Kumar Agarwal <aagar...@linux.ibm.com> >>>>> >>>>> gcc/ChangeLog: >>>>> >>>>> * config/aarch64/aarch64-ldp-fusion.cc: Add target specific >>>>> implementation of additional virtual functions added in pair_fusion >>>>> struct. >>>>> * config/rs6000/rs6000-passes.def: New mem fusion pass >>>>> before pass_early_remat. >>>>> * config/rs6000/rs6000-mem-fusion.cc: Add new pass. >>>>> Add target specific implementation using pure virtual >>>>> functions. >>>>> * config.gcc: Add new object file. >>>>> * config/rs6000/rs6000-protos.h: Add new prototype for mem >>>>> fusion pass. >>>>> * config/rs6000/t-rs6000: Add new rule. >>>>> * rtl-ssa/accesses.h: Moved set_is_live_out_use as public >>>>> from private. >>>>> >>>>> gcc/testsuite/ChangeLog: >>>>> >>>>> * g++.target/powerpc/me-fusion.C: New test. >>>>> * g++.target/powerpc/mem-fusion-1.C: New test. >>>>> * gcc.target/powerpc/mma-builtin-1.c: Modify test. >>>>> --- >>>> >>>> This isn't a complete review, just some initial questions & comments >>>> about selected parts. >>>> >>>>> [...] >>>>> +/* Check whether load can be fusable or not. >>>>> + Return true if dependent use is UNSPEC otherwise false. */ >>>>> +bool >>>>> +rs6000_pair_fusion::fuseable_load_p (insn_info *info) >>>>> +{ >>>>> + rtx_insn *insn = info->rtl (); >>>>> + >>>>> + for (rtx note = REG_NOTES (insn); note; note = XEXP (note, 1)) >>>>> + if (REG_NOTE_KIND (note) == REG_EQUAL >>>>> + || REG_NOTE_KIND (note) == REG_EQUIV) >>>>> + return false; >>>> >>>> It's unusual to punt on an optimisation because of a REG_EQUAL/EQUIV >>>> note. What's the reason for doing this? Are you trying to avoid >>>> fusing pairs before reload that are equivalent to a MEM (i.e. have >>>> a natural spill slot)? I think Alex hit a similar situation. >>>> >>> >>> We have used the above check because of some SPEC benchmarks failing with >>> with MEM pairs having REG_EQUAL/EQUIV notes. >>> >>> By adding the checks the benchmarks passes and also it improves the >>> performance. >>> >>> This checks were added during initial implementation of pair fusion >>> pass. >>> >>> I will investigate further if this check is still required or not. >> >> Thanks. If it does affect SPEC results, it would be good to look >> at the underlying reason, as a justification for the check. >> >> AIUI, the case Alex was due to the way that the RA recognises: >> >> (set (reg R) (mem address-of-a-stack-vare ciable)) >> REG_EQUIV: (mem address-of-a-stack-variable) >> >> where the REG_EQUIV is either explicit or detected by the RA. >> If R needs to be spilled, it can then be spilled to its existing >> location on the stack. And if R needs to be spilled in the >> instruction above (because of register pressure before the first >> use of R), the RA is able to delete the instruction. >> >> But if that is the reason, the condition should be restricted >> to cases in which the note is a memory. >> >> I think Alex had tried something similar and found that it wasn't >> always effective. >> > > Sure I will check.
Spec looks good without the above check of REG_EQUIV/REG_EQUAL. I dont see the above condition in Spec benchmarks any more for MEM pairs. Hence I will remove the above check from the patch and send new patch without the above check. Thanks & Regards Ajit >>> [...] >>>>> + && info->is_real ()) >>>>> + { >>>>> + rtx_insn *rtl_insn = info->rtl (); >>>>> + rtx set = single_set (rtl_insn); >>>>> + >>>>> + if (set == NULL_RTX) >>>>> + return false; >>>>> + >>>>> + rtx op0 = SET_SRC (set); >>>>> + if (GET_CODE (op0) != UNSPEC) >>>>> + return false; >>>> >>>> What's the motivation for rejecting unspecs? It's unsual to treat >>>> all unspecs as a distinct group. >>>> >>>> Also, using single_set means that the function still lets through >>>> parallels of two sets in which the sources are unspecs. Is that >>>> intentional? >>>> >>>> The reasons behind things like the REG_EQUAL/EQUIV and UNSPEC decisions >>>> need to be described in comments, so that other people coming to this >>>> code later can understand the motivation. The same thing applies to >>>> other decisions in the patch. >>>> >>> >>> Adjacent load pair fusion with 256 bit OOmode is seen and valid with use of >>> load >>> in UNSPEC. Thats why this check is added. >> >> Can you give an example (in terms of rtl)? > > > (insn 23 72 25 4 (set (reg:OO 124 [ vect__1.24 ]) > (mem:OO (reg:DI 118 [ ivtmp.73 ]) [0 MEM <vector(16) unsigned char> > [(unsigned char *)_65]+0 S16 A8])) {*movoo} > (nil)) > (insn 25 23 103 4 (set (reg:OO 174 [ vect__1.26 ]) > (mem:OO (plus:DI (reg:DI 118 [ ivtmp.73 ]) > (const_int 32 [0x20])) [0 MEM <vector(16) unsigned char> > [(unsigned char *)_65 + 32B]+0 S16 A8])) {*movoo} > (nil)) > (insn 103 25 27 4 (set (reg:DI 118 [ ivtmp.73 ]) > (plus:DI (reg:DI 118 [ ivtmp.73 ]) > (const_int 64 [0x40]))) 66 {*adddi3} > (nil)) > (insn 27 103 29 4 (set (reg:V16QI 176 [ vect_perm_even_123 ]) > (unspec:V16QI [ > (subreg:V16QI (reg:OO 124 [ vect__1.24 ]) 0) > (subreg:V16QI (reg:OO 124 [ vect__1.24 ]) 16) > (reg:V16QI 300) > ] UNSPEC_VPERM)) {altivec_vperm_v16qi_direct} > (nil)) > > This is the example where UNSPEC is used with 256 bit OOmode. > >> >>> [...] >>>>> [...] >>>>> + // Given insn_info pair I1 and I2, return true if offsets are in order. >>>>> + virtual bool should_handle_unordered_insns (rtl_ssa::insn_info *i1, >>>>> + rtl_ssa::insn_info *i2) = 0; >>>>> + >>>> >>>> This name seems a bit misleading. The function is used in: >>>> >>>> @@ -2401,6 +2405,9 @@ pair_fusion_bb_info::try_fuse_pair (bool load_p, >>>> unsigned access_size, >>>> reversed = true; >>>> } >>>> >>>> + if (!m_pass->should_handle_unordered_insns (i1, i2)) >>>> + return false; >>>> + >>>> rtx cand_mems[2]; >>>> rtx reg_ops[2]; >>>> rtx pats[2]; >>>> >>>> and so it acts as a general opt-out. The insns aren't known to be >>>> unordered. >>>> >>>> It looks like the rs6000 override requires the original insns to be >>>> in offset order. Could you say why that's necessary? (Both in email >>>> and as a comment in the code.) >>>> >>> >>> Yes rs6000 requires the original load insns to be in offset order. >>> Some regression tests like vect-outer-4f fails if we do load pair >>> fusion with load offsets are not in offset order as this breaks lxvp >>> semantics. >> >> How does it break the semantics though? In principle, the generic code >> only fuses if it has "proved" that the loads can happen in either order. >> So it shouldn't matter which order the hardware does things in. >> >> Could you give an example of the kind of situation that you want >> to avoid, and why it generates the wrong result? >> > > (insn 31 62 32 2 (set (reg:V16QI 177 [ MEM <vector(8) short unsigned int> > [(short unsigned int *)vectp.62_36 + 64B] ]) > (mem:V16QI (plus:DI (reg/f:DI 121 [ vectp.62 ]) > (const_int 64 [0x40])) [1 MEM <vector(8) short unsigned int> > [(short unsigned int *)vectp.62_36 + 64B]+0 S16 A16])) {vsx_movv16qi_64bit} > (nil)) > (insn 32 31 16 2 (set (reg:V16QI 178 [ MEM <vector(8) short unsigned int> > [(short unsigned int *)vectp.62_36 + 80B] ]) > (mem:V16QI (plus:DI (reg/f:DI 121 [ vectp.62 ]) > (const_int 80 [0x50])) [1 MEM <vector(8) short unsigned int> > [(short unsigned int *)vectp.62_36 + 80B]+0 S16 A16])) {vsx_movv16qi_64bit} > (nil)) > (insn 16 32 21 2 (set (reg:V16QI 159 [ MEM <vector(8) short unsigned int> > [(short unsigned int *)vectp.62_36 + 16B] ]) > (mem:V16QI (plus:DI (reg/f:DI 121 [ vectp.62 ]) > (const_int 16 [0x10])) [1 MEM <vector(8) short unsigned int> > [(short unsigned int *)vectp.62_36 + 16B]+0 S16 A16])) {vsx_movv16qi_64bit} > (nil)) > (insn 21 16 22 2 (set (reg:V16QI 165 [ MEM <vector(8) short unsigned int> > [(short unsigned int *)vectp.62_36 + 32B] ]) > (mem:V16QI (plus:DI (reg/f:DI 121 [ vectp.62 ]) > (const_int 32 [0x20])) [1 MEM <vector(8) short unsigned int> > [(short unsigned int *)vectp.62_36 + 32B]+0 S16 A16])) {vsx_movv16qi_64bit} > (nil)) > (insn 22 21 37 2 (set (reg:V16QI 166 [ MEM <vector(8) short unsigned int> > [(short unsigned int *)vectp.62_36 + 48B] ]) > (mem:V16QI (plus:DI (reg/f:DI 121 [ vectp.62 ]) > (const_int 48 [0x30])) [1 MEM <vector(8) short unsigned int> > [(short unsigned int *)vectp.62_36 + 48B]+0 S16 A16])) {vsx_movv16qi_64bit} > (nil)) > > insn 22 and insn 31 is merged in the failure case and breaks the code. > > >> Thanks, >> Richard >> > > Thanks & Regards > Ajit >