Re: [Patch, rs6000, aarch64, middle-end] Add implementation for different targets for pair mem fusion

Ajit Agarwal Fri, 31 May 2024 09:59:47 -0700

Hello Richard:

On 31/05/24 8:08 pm, Richard Sandiford wrote:
> Ajit Agarwal <aagar...@linux.ibm.com> writes:
>> On 31/05/24 3:23 pm, Richard Sandiford wrote:
>>> Ajit Agarwal <aagar...@linux.ibm.com> writes:
>>>> Hello All:
>>>>
>>>> Common infrastructure using generic code for pair mem fusion of different
>>>> targets.
>>>>
>>>> rs6000 target specific specific code implements virtual functions defined
>>>> by generic code.
>>>>
>>>> Code is implemented with pure virtual functions to interface with target
>>>> code.
>>>>
>>>> Target specific code are added in rs6000-mem-fusion.cc and additional 
>>>> virtual
>>>> function implementation required for rs6000 are added in 
>>>> aarch64-ldp-fusion.cc.
>>>>
>>>> Bootstrapped and regtested for aarch64-linux-gnu and powerpc64-linux-gnu.
>>>>
>>>> Thanks & Regards
>>>> Ajit
>>>>
>>>>
>>>> aarch64, rs6000, middle-end: Add implementation for different targets for 
>>>> pair mem fusion
>>>>
>>>> Common infrastructure using generic code for pair mem fusion of different
>>>> targets.
>>>>
>>>> rs6000 target specific specific code implements virtual functions defined
>>>> by generic code.
>>>>
>>>> Code is implemented with pure virtual functions to interface with target
>>>> code.
>>>>
>>>> Target specific code are added in rs6000-mem-fusion.cc and additional 
>>>> virtual
>>>> function implementation required for rs6000 are added in 
>>>> aarch64-ldp-fusion.cc.
>>>>
>>>> 2024-05-31  Ajit Kumar Agarwal  <aagar...@linux.ibm.com>
>>>>
>>>> gcc/ChangeLog:
>>>>
>>>>    * config/aarch64/aarch64-ldp-fusion.cc: Add target specific
>>>>    implementation of additional virtual functions added in pair_fusion
>>>>    struct.
>>>>    * config/rs6000/rs6000-passes.def: New mem fusion pass
>>>>    before pass_early_remat.
>>>>    * config/rs6000/rs6000-mem-fusion.cc: Add new pass.
>>>>    Add target specific implementation using pure virtual
>>>>    functions.
>>>>    * config.gcc: Add new object file.
>>>>    * config/rs6000/rs6000-protos.h: Add new prototype for mem
>>>>    fusion pass.
>>>>    * config/rs6000/t-rs6000: Add new rule.
>>>>    * rtl-ssa/accesses.h: Moved set_is_live_out_use as public
>>>>    from private.
>>>>
>>>> gcc/testsuite/ChangeLog:
>>>>
>>>>    * g++.target/powerpc/me-fusion.C: New test.
>>>>    * g++.target/powerpc/mem-fusion-1.C: New test.
>>>>    * gcc.target/powerpc/mma-builtin-1.c: Modify test.
>>>> ---
>>>
>>> This isn't a complete review, just some initial questions & comments
>>> about selected parts.
>>>
>>>> [...]
>>>> +/* Check whether load can be fusable or not.
>>>> +   Return true if dependent use is UNSPEC otherwise false.  */
>>>> +bool
>>>> +rs6000_pair_fusion::fuseable_load_p (insn_info *info)
>>>> +{
>>>> +  rtx_insn *insn = info->rtl ();
>>>> +
>>>> +  for (rtx note = REG_NOTES (insn); note; note = XEXP (note, 1))
>>>> +    if (REG_NOTE_KIND (note) == REG_EQUAL
>>>> +  || REG_NOTE_KIND (note) == REG_EQUIV)
>>>> +      return false;
>>>
>>> It's unusual to punt on an optimisation because of a REG_EQUAL/EQUIV
>>> note.  What's the reason for doing this?  Are you trying to avoid
>>> fusing pairs before reload that are equivalent to a MEM (i.e. have
>>> a natural spill slot)?  I think Alex hit a similar situation.
>>>
>>
>> We have used the above check because of some SPEC benchmarks failing with
>> with MEM pairs having REG_EQUAL/EQUIV notes.
>>
>> By adding the checks the benchmarks passes and also it improves the
>> performance.
>>
>> This checks were added during initial implementation of pair fusion
>> pass.
>>
>> I will investigate further if this check is still required or not.
> 
> Thanks.  If it does affect SPEC results, it would be good to look
> at the underlying reason, as a justification for the check.
> 
> AIUI, the case Alex was due to the way that the RA recognises:
> 
>   (set (reg R) (mem address-of-a-stack-variable))
>     REG_EQUIV: (mem address-of-a-stack-variable)
> 
> where the REG_EQUIV is either explicit or detected by the RA.
> If R needs to be spilled, it can then be spilled to its existing
> location on the stack.  And if R needs to be spilled in the
> instruction above (because of register pressure before the first
> use of R), the RA is able to delete the instruction.
> 
> But if that is the reason, the condition should be restricted
> to cases in which the note is a memory.
> 
> I think Alex had tried something similar and found that it wasn't
> always effective.
>


Sure I will check.
>> [...]
>>>> +             && info->is_real ())
>>>> +            {
>>>> +              rtx_insn *rtl_insn = info->rtl ();
>>>> +              rtx set = single_set (rtl_insn);
>>>> +
>>>> +              if (set == NULL_RTX)
>>>> +                return false;
>>>> +
>>>> +              rtx op0 = SET_SRC (set);
>>>> +              if (GET_CODE (op0) != UNSPEC)
>>>> +                return false;
>>>
>>> What's the motivation for rejecting unspecs?  It's unsual to treat
>>> all unspecs as a distinct group.
>>>
>>> Also, using single_set means that the function still lets through
>>> parallels of two sets in which the sources are unspecs.  Is that
>>> intentional?
>>>
>>> The reasons behind things like the REG_EQUAL/EQUIV and UNSPEC decisions
>>> need to be described in comments, so that other people coming to this
>>> code later can understand the motivation.  The same thing applies to
>>> other decisions in the patch.
>>>
>>
>> Adjacent load pair fusion with 256 bit OOmode is seen and valid with use of 
>> load
>> in UNSPEC. Thats why this check is added.
> 
> Can you give an example (in terms of rtl)?


(insn 23 72 25 4 (set (reg:OO 124 [ vect__1.24 ])
        (mem:OO (reg:DI 118 [ ivtmp.73 ]) [0 MEM <vector(16) unsigned char> 
[(unsigned char *)_65]+0 S16 A8]))  {*movoo}
     (nil))
(insn 25 23 103 4 (set (reg:OO 174 [ vect__1.26 ])
        (mem:OO (plus:DI (reg:DI 118 [ ivtmp.73 ])
                (const_int 32 [0x20])) [0 MEM <vector(16) unsigned char> 
[(unsigned char *)_65 + 32B]+0 S16 A8]))  {*movoo}
     (nil))
(insn 103 25 27 4 (set (reg:DI 118 [ ivtmp.73 ])
        (plus:DI (reg:DI 118 [ ivtmp.73 ])
            (const_int 64 [0x40]))) 66 {*adddi3}
     (nil))
(insn 27 103 29 4 (set (reg:V16QI 176 [ vect_perm_even_123 ])
        (unspec:V16QI [
                (subreg:V16QI (reg:OO 124 [ vect__1.24 ]) 0)
                (subreg:V16QI (reg:OO 124 [ vect__1.24 ]) 16)
                (reg:V16QI 300)
            ] UNSPEC_VPERM))  {altivec_vperm_v16qi_direct}
     (nil))

This is the example where UNSPEC is used with 256 bit OOmode.

> 
>> [...]
>>>> [...]
>>>> +  // Given insn_info pair I1 and I2, return true if offsets are in order.
>>>> +  virtual bool should_handle_unordered_insns (rtl_ssa::insn_info *i1,
>>>> +                                        rtl_ssa::insn_info *i2) = 0;
>>>> +
>>>
>>> This name seems a bit misleading.  The function is used in:
>>>
>>> @@ -2401,6 +2405,9 @@ pair_fusion_bb_info::try_fuse_pair (bool load_p, 
>>> unsigned access_size,
>>>        reversed = true;
>>>      }
>>>  
>>> +  if (!m_pass->should_handle_unordered_insns (i1, i2))
>>> +    return false;
>>> +
>>>    rtx cand_mems[2];
>>>    rtx reg_ops[2];
>>>    rtx pats[2];
>>>
>>> and so it acts as a general opt-out.  The insns aren't known to be 
>>> unordered.
>>>
>>> It looks like the rs6000 override requires the original insns to be
>>> in offset order.  Could you say why that's necessary?  (Both in email
>>> and as a comment in the code.)
>>>
>>
>> Yes rs6000 requires the original load insns to be in offset order.
>> Some regression tests like vect-outer-4f fails if we do load pair
>> fusion with load offsets are not in offset order as this breaks lxvp 
>> semantics.
> 
> How does it break the semantics though?  In principle, the generic code
> only fuses if it has "proved" that the loads can happen in either order.
> So it shouldn't matter which order the hardware does things in.
> 
> Could you give an example of the kind of situation that you want
> to avoid, and why it generates the wrong result?
>

(insn 31 62 32 2 (set (reg:V16QI 177 [ MEM <vector(8) short unsigned int> 
[(short unsigned int *)vectp.62_36 + 64B] ])
        (mem:V16QI (plus:DI (reg/f:DI 121 [ vectp.62 ])
                (const_int 64 [0x40])) [1 MEM <vector(8) short unsigned int> 
[(short unsigned int *)vectp.62_36 + 64B]+0 S16 A16]))  {vsx_movv16qi_64bit}
     (nil))
(insn 32 31 16 2 (set (reg:V16QI 178 [ MEM <vector(8) short unsigned int> 
[(short unsigned int *)vectp.62_36 + 80B] ])
        (mem:V16QI (plus:DI (reg/f:DI 121 [ vectp.62 ])
                (const_int 80 [0x50])) [1 MEM <vector(8) short unsigned int> 
[(short unsigned int *)vectp.62_36 + 80B]+0 S16 A16]))  {vsx_movv16qi_64bit}
     (nil))
(insn 16 32 21 2 (set (reg:V16QI 159 [ MEM <vector(8) short unsigned int> 
[(short unsigned int *)vectp.62_36 + 16B] ])
        (mem:V16QI (plus:DI (reg/f:DI 121 [ vectp.62 ])
                (const_int 16 [0x10])) [1 MEM <vector(8) short unsigned int> 
[(short unsigned int *)vectp.62_36 + 16B]+0 S16 A16]))  {vsx_movv16qi_64bit}
     (nil))
(insn 21 16 22 2 (set (reg:V16QI 165 [ MEM <vector(8) short unsigned int> 
[(short unsigned int *)vectp.62_36 + 32B] ])
        (mem:V16QI (plus:DI (reg/f:DI 121 [ vectp.62 ])
                (const_int 32 [0x20])) [1 MEM <vector(8) short unsigned int> 
[(short unsigned int *)vectp.62_36 + 32B]+0 S16 A16])) {vsx_movv16qi_64bit}
     (nil))
(insn 22 21 37 2 (set (reg:V16QI 166 [ MEM <vector(8) short unsigned int> 
[(short unsigned int *)vectp.62_36 + 48B] ])
        (mem:V16QI (plus:DI (reg/f:DI 121 [ vectp.62 ])
                (const_int 48 [0x30])) [1 MEM <vector(8) short unsigned int> 
[(short unsigned int *)vectp.62_36 + 48B]+0 S16 A16])) {vsx_movv16qi_64bit}
     (nil))

insn 22 and insn 31 is merged in the failure case and breaks the code.

 
> Thanks,
> Richard
>

Thanks & Regards
Ajit

Re: [Patch, rs6000, aarch64, middle-end] Add implementation for different targets for pair mem fusion

Reply via email to