Re: [Patch, rs6000, aarch64, middle-end] Add implementation for different targets for pair mem fusion

Ajit Agarwal Sat, 01 Jun 2024 22:52:52 -0700

Hello Richard:

On 31/05/24 10:29 pm, Ajit Agarwal wrote:
> Hello Richard:
> 
> On 31/05/24 8:08 pm, Richard Sandiford wrote:
>> Ajit Agarwal <aagar...@linux.ibm.com> writes:
>>> On 31/05/24 3:23 pm, Richard Sandiford wrote:
>>>> Ajit Agarwal <aagar...@linux.ibm.com> writes:
>>>>> Hello All:
>>>>>
>>>>> Common infrastructure using generic code for pair mem fusion of different
>>>>> targets.
>>>>>
>>>>> rs6000 target specific specific code implements virtual functions defined
>>>>> by generic code.
>>>>>
>>>>> Code is implemented with pure virtual functions to interface with target
>>>>> code.
>>>>>
>>>>> Target specific code are added in rs6000-mem-fusion.cc and additional 
>>>>> virtual
>>>>> function implementation required for rs6000 are added in 
>>>>> aarch64-ldp-fusion.cc.
>>>>>
>>>>> Bootstrapped and regtested for aarch64-linux-gnu and powerpc64-linux-gnu.
>>>>>
>>>>> Thanks & Regards
>>>>> Ajit
>>>>>
>>>>>
>>>>> aarch64, rs6000, middle-end: Add implementation for different targets for 
>>>>> pair mem fusion
>>>>>
>>>>> Common infrastructure using generic code for pair mem fusion of different
>>>>> targets.
>>>>>
>>>>> rs6000 target specific specific code implements virtual functions defined
>>>>> by generic code.
>>>>>
>>>>> Code is implemented with pure virtual functions to interface with target
>>>>> code.
>>>>>
>>>>> Target specific code are added in rs6000-mem-fusion.cc and additional 
>>>>> virtual
>>>>> function implementation required for rs6000 are added in 
>>>>> aarch64-ldp-fusion.cc.
>>>>>
>>>>> 2024-05-31  Ajit Kumar Agarwal  <aagar...@linux.ibm.com>
>>>>>
>>>>> gcc/ChangeLog:
>>>>>
>>>>>   * config/aarch64/aarch64-ldp-fusion.cc: Add target specific
>>>>>   implementation of additional virtual functions added in pair_fusion
>>>>>   struct.
>>>>>   * config/rs6000/rs6000-passes.def: New mem fusion pass
>>>>>   before pass_early_remat.
>>>>>   * config/rs6000/rs6000-mem-fusion.cc: Add new pass.
>>>>>   Add target specific implementation using pure virtual
>>>>>   functions.
>>>>>   * config.gcc: Add new object file.
>>>>>   * config/rs6000/rs6000-protos.h: Add new prototype for mem
>>>>>   fusion pass.
>>>>>   * config/rs6000/t-rs6000: Add new rule.
>>>>>   * rtl-ssa/accesses.h: Moved set_is_live_out_use as public
>>>>>   from private.
>>>>>
>>>>> gcc/testsuite/ChangeLog:
>>>>>
>>>>>   * g++.target/powerpc/me-fusion.C: New test.
>>>>>   * g++.target/powerpc/mem-fusion-1.C: New test.
>>>>>   * gcc.target/powerpc/mma-builtin-1.c: Modify test.
>>>>> ---
>>>>
>>>> This isn't a complete review, just some initial questions & comments
>>>> about selected parts.
>>>>
>>>>> [...]
>>>>> +/* Check whether load can be fusable or not.
>>>>> +   Return true if dependent use is UNSPEC otherwise false.  */
>>>>> +bool
>>>>> +rs6000_pair_fusion::fuseable_load_p (insn_info *info)
>>>>> +{
>>>>> +  rtx_insn *insn = info->rtl ();
>>>>> +
>>>>> +  for (rtx note = REG_NOTES (insn); note; note = XEXP (note, 1))
>>>>> +    if (REG_NOTE_KIND (note) == REG_EQUAL
>>>>> + || REG_NOTE_KIND (note) == REG_EQUIV)
>>>>> +      return false;
>>>>
>>>> It's unusual to punt on an optimisation because of a REG_EQUAL/EQUIV
>>>> note.  What's the reason for doing this?  Are you trying to avoid
>>>> fusing pairs before reload that are equivalent to a MEM (i.e. have
>>>> a natural spill slot)?  I think Alex hit a similar situation.
>>>>
>>>
>>> We have used the above check because of some SPEC benchmarks failing with
>>> with MEM pairs having REG_EQUAL/EQUIV notes.
>>>
>>> By adding the checks the benchmarks passes and also it improves the
>>> performance.
>>>
>>> This checks were added during initial implementation of pair fusion
>>> pass.
>>>
>>> I will investigate further if this check is still required or not.
>>
>> Thanks.  If it does affect SPEC results, it would be good to look
>> at the underlying reason, as a justification for the check.
>>
>> AIUI, the case Alex was due to the way that the RA recognises:
>>
>>   (set (reg R) (mem address-of-a-stack-vare ciable))
>>     REG_EQUIV: (mem address-of-a-stack-variable)
>>
>> where the REG_EQUIV is either explicit or detected by the RA.
>> If R needs to be spilled, it can then be spilled to its existing
>> location on the stack.  And if R needs to be spilled in the
>> instruction above (because of register pressure before the first
>> use of R), the RA is able to delete the instruction.
>>
>> But if that is the reason, the condition should be restricted
>> to cases in which the note is a memory.
>>
>> I think Alex had tried something similar and found that it wasn't
>> always effective.
>>
> 
> Sure I will check.


Spec looks good without the above check of REG_EQUIV/REG_EQUAL.
I dont see the above condition in Spec benchmarks any more for 
MEM pairs.
Hence I will remove the above check from the patch and send
new patch without the above check.

Thanks & Regards
Ajit
>>> [...]
>>>>> +            && info->is_real ())
>>>>> +           {
>>>>> +             rtx_insn *rtl_insn = info->rtl ();
>>>>> +             rtx set = single_set (rtl_insn);
>>>>> +
>>>>> +             if (set == NULL_RTX)
>>>>> +               return false;
>>>>> +
>>>>> +             rtx op0 = SET_SRC (set);
>>>>> +             if (GET_CODE (op0) != UNSPEC)
>>>>> +               return false;
>>>>
>>>> What's the motivation for rejecting unspecs?  It's unsual to treat
>>>> all unspecs as a distinct group.
>>>>
>>>> Also, using single_set means that the function still lets through
>>>> parallels of two sets in which the sources are unspecs.  Is that
>>>> intentional?
>>>>
>>>> The reasons behind things like the REG_EQUAL/EQUIV and UNSPEC decisions
>>>> need to be described in comments, so that other people coming to this
>>>> code later can understand the motivation.  The same thing applies to
>>>> other decisions in the patch.
>>>>
>>>
>>> Adjacent load pair fusion with 256 bit OOmode is seen and valid with use of 
>>> load
>>> in UNSPEC. Thats why this check is added.
>>
>> Can you give an example (in terms of rtl)?
> 
> 
> (insn 23 72 25 4 (set (reg:OO 124 [ vect__1.24 ])
>         (mem:OO (reg:DI 118 [ ivtmp.73 ]) [0 MEM <vector(16) unsigned char> 
> [(unsigned char *)_65]+0 S16 A8]))  {*movoo}
>      (nil))
> (insn 25 23 103 4 (set (reg:OO 174 [ vect__1.26 ])
>         (mem:OO (plus:DI (reg:DI 118 [ ivtmp.73 ])
>                 (const_int 32 [0x20])) [0 MEM <vector(16) unsigned char> 
> [(unsigned char *)_65 + 32B]+0 S16 A8]))  {*movoo}
>      (nil))
> (insn 103 25 27 4 (set (reg:DI 118 [ ivtmp.73 ])
>         (plus:DI (reg:DI 118 [ ivtmp.73 ])
>             (const_int 64 [0x40]))) 66 {*adddi3}
>      (nil))
> (insn 27 103 29 4 (set (reg:V16QI 176 [ vect_perm_even_123 ])
>         (unspec:V16QI [
>                 (subreg:V16QI (reg:OO 124 [ vect__1.24 ]) 0)
>                 (subreg:V16QI (reg:OO 124 [ vect__1.24 ]) 16)
>                 (reg:V16QI 300)
>             ] UNSPEC_VPERM))  {altivec_vperm_v16qi_direct}
>      (nil))
> 
> This is the example where UNSPEC is used with 256 bit OOmode.
> 
>>
>>> [...]
>>>>> [...]
>>>>> +  // Given insn_info pair I1 and I2, return true if offsets are in order.
>>>>> +  virtual bool should_handle_unordered_insns (rtl_ssa::insn_info *i1,
>>>>> +                                       rtl_ssa::insn_info *i2) = 0;
>>>>> +
>>>>
>>>> This name seems a bit misleading.  The function is used in:
>>>>
>>>> @@ -2401,6 +2405,9 @@ pair_fusion_bb_info::try_fuse_pair (bool load_p, 
>>>> unsigned access_size,
>>>>        reversed = true;
>>>>      }
>>>>  
>>>> +  if (!m_pass->should_handle_unordered_insns (i1, i2))
>>>> +    return false;
>>>> +
>>>>    rtx cand_mems[2];
>>>>    rtx reg_ops[2];
>>>>    rtx pats[2];
>>>>
>>>> and so it acts as a general opt-out.  The insns aren't known to be 
>>>> unordered.
>>>>
>>>> It looks like the rs6000 override requires the original insns to be
>>>> in offset order.  Could you say why that's necessary?  (Both in email
>>>> and as a comment in the code.)
>>>>
>>>
>>> Yes rs6000 requires the original load insns to be in offset order.
>>> Some regression tests like vect-outer-4f fails if we do load pair
>>> fusion with load offsets are not in offset order as this breaks lxvp 
>>> semantics.
>>
>> How does it break the semantics though?  In principle, the generic code
>> only fuses if it has "proved" that the loads can happen in either order.
>> So it shouldn't matter which order the hardware does things in.
>>
>> Could you give an example of the kind of situation that you want
>> to avoid, and why it generates the wrong result?
>>
> 
> (insn 31 62 32 2 (set (reg:V16QI 177 [ MEM <vector(8) short unsigned int> 
> [(short unsigned int *)vectp.62_36 + 64B] ])
>         (mem:V16QI (plus:DI (reg/f:DI 121 [ vectp.62 ])
>                 (const_int 64 [0x40])) [1 MEM <vector(8) short unsigned int> 
> [(short unsigned int *)vectp.62_36 + 64B]+0 S16 A16]))  {vsx_movv16qi_64bit}
>      (nil))
> (insn 32 31 16 2 (set (reg:V16QI 178 [ MEM <vector(8) short unsigned int> 
> [(short unsigned int *)vectp.62_36 + 80B] ])
>         (mem:V16QI (plus:DI (reg/f:DI 121 [ vectp.62 ])
>                 (const_int 80 [0x50])) [1 MEM <vector(8) short unsigned int> 
> [(short unsigned int *)vectp.62_36 + 80B]+0 S16 A16]))  {vsx_movv16qi_64bit}
>      (nil))
> (insn 16 32 21 2 (set (reg:V16QI 159 [ MEM <vector(8) short unsigned int> 
> [(short unsigned int *)vectp.62_36 + 16B] ])
>         (mem:V16QI (plus:DI (reg/f:DI 121 [ vectp.62 ])
>                 (const_int 16 [0x10])) [1 MEM <vector(8) short unsigned int> 
> [(short unsigned int *)vectp.62_36 + 16B]+0 S16 A16]))  {vsx_movv16qi_64bit}
>      (nil))
> (insn 21 16 22 2 (set (reg:V16QI 165 [ MEM <vector(8) short unsigned int> 
> [(short unsigned int *)vectp.62_36 + 32B] ])
>         (mem:V16QI (plus:DI (reg/f:DI 121 [ vectp.62 ])
>                 (const_int 32 [0x20])) [1 MEM <vector(8) short unsigned int> 
> [(short unsigned int *)vectp.62_36 + 32B]+0 S16 A16])) {vsx_movv16qi_64bit}
>      (nil))
> (insn 22 21 37 2 (set (reg:V16QI 166 [ MEM <vector(8) short unsigned int> 
> [(short unsigned int *)vectp.62_36 + 48B] ])
>         (mem:V16QI (plus:DI (reg/f:DI 121 [ vectp.62 ])
>                 (const_int 48 [0x30])) [1 MEM <vector(8) short unsigned int> 
> [(short unsigned int *)vectp.62_36 + 48B]+0 S16 A16])) {vsx_movv16qi_64bit}
>      (nil))
> 
> insn 22 and insn 31 is merged in the failure case and breaks the code.
> 
>  
>> Thanks,
>> Richard
>>
> 
> Thanks & Regards
> Ajit
>

Re: [Patch, rs6000, aarch64, middle-end] Add implementation for different targets for pair mem fusion

Reply via email to