Ajit Agarwal <aagar...@linux.ibm.com> writes:
> On 31/05/24 3:23 pm, Richard Sandiford wrote:
>> Ajit Agarwal <aagar...@linux.ibm.com> writes:
>>> Hello All:
>>>
>>> Common infrastructure using generic code for pair mem fusion of different
>>> targets.
>>>
>>> rs6000 target specific specific code implements virtual functions defined
>>> by generic code.
>>>
>>> Code is implemented with pure virtual functions to interface with target
>>> code.
>>>
>>> Target specific code are added in rs6000-mem-fusion.cc and additional 
>>> virtual
>>> function implementation required for rs6000 are added in 
>>> aarch64-ldp-fusion.cc.
>>>
>>> Bootstrapped and regtested for aarch64-linux-gnu and powerpc64-linux-gnu.
>>>
>>> Thanks & Regards
>>> Ajit
>>>
>>>
>>> aarch64, rs6000, middle-end: Add implementation for different targets for 
>>> pair mem fusion
>>>
>>> Common infrastructure using generic code for pair mem fusion of different
>>> targets.
>>>
>>> rs6000 target specific specific code implements virtual functions defined
>>> by generic code.
>>>
>>> Code is implemented with pure virtual functions to interface with target
>>> code.
>>>
>>> Target specific code are added in rs6000-mem-fusion.cc and additional 
>>> virtual
>>> function implementation required for rs6000 are added in 
>>> aarch64-ldp-fusion.cc.
>>>
>>> 2024-05-31  Ajit Kumar Agarwal  <aagar...@linux.ibm.com>
>>>
>>> gcc/ChangeLog:
>>>
>>>     * config/aarch64/aarch64-ldp-fusion.cc: Add target specific
>>>     implementation of additional virtual functions added in pair_fusion
>>>     struct.
>>>     * config/rs6000/rs6000-passes.def: New mem fusion pass
>>>     before pass_early_remat.
>>>     * config/rs6000/rs6000-mem-fusion.cc: Add new pass.
>>>     Add target specific implementation using pure virtual
>>>     functions.
>>>     * config.gcc: Add new object file.
>>>     * config/rs6000/rs6000-protos.h: Add new prototype for mem
>>>     fusion pass.
>>>     * config/rs6000/t-rs6000: Add new rule.
>>>     * rtl-ssa/accesses.h: Moved set_is_live_out_use as public
>>>     from private.
>>>
>>> gcc/testsuite/ChangeLog:
>>>
>>>     * g++.target/powerpc/me-fusion.C: New test.
>>>     * g++.target/powerpc/mem-fusion-1.C: New test.
>>>     * gcc.target/powerpc/mma-builtin-1.c: Modify test.
>>> ---
>> 
>> This isn't a complete review, just some initial questions & comments
>> about selected parts.
>> 
>>> [...]
>>> +/* Check whether load can be fusable or not.
>>> +   Return true if dependent use is UNSPEC otherwise false.  */
>>> +bool
>>> +rs6000_pair_fusion::fuseable_load_p (insn_info *info)
>>> +{
>>> +  rtx_insn *insn = info->rtl ();
>>> +
>>> +  for (rtx note = REG_NOTES (insn); note; note = XEXP (note, 1))
>>> +    if (REG_NOTE_KIND (note) == REG_EQUAL
>>> +   || REG_NOTE_KIND (note) == REG_EQUIV)
>>> +      return false;
>> 
>> It's unusual to punt on an optimisation because of a REG_EQUAL/EQUIV
>> note.  What's the reason for doing this?  Are you trying to avoid
>> fusing pairs before reload that are equivalent to a MEM (i.e. have
>> a natural spill slot)?  I think Alex hit a similar situation.
>> 
>
> We have used the above check because of some SPEC benchmarks failing with
> with MEM pairs having REG_EQUAL/EQUIV notes.
>
> By adding the checks the benchmarks passes and also it improves the
> performance.
>
> This checks were added during initial implementation of pair fusion
> pass.
>
> I will investigate further if this check is still required or not.

Thanks.  If it does affect SPEC results, it would be good to look
at the underlying reason, as a justification for the check.

AIUI, the case Alex was due to the way that the RA recognises:

  (set (reg R) (mem address-of-a-stack-variable))
    REG_EQUIV: (mem address-of-a-stack-variable)

where the REG_EQUIV is either explicit or detected by the RA.
If R needs to be spilled, it can then be spilled to its existing
location on the stack.  And if R needs to be spilled in the
instruction above (because of register pressure before the first
use of R), the RA is able to delete the instruction.

But if that is the reason, the condition should be restricted
to cases in which the note is a memory.

I think Alex had tried something similar and found that it wasn't
always effective.

> [...]
>>> +              && info->is_real ())
>>> +             {
>>> +               rtx_insn *rtl_insn = info->rtl ();
>>> +               rtx set = single_set (rtl_insn);
>>> +
>>> +               if (set == NULL_RTX)
>>> +                 return false;
>>> +
>>> +               rtx op0 = SET_SRC (set);
>>> +               if (GET_CODE (op0) != UNSPEC)
>>> +                 return false;
>> 
>> What's the motivation for rejecting unspecs?  It's unsual to treat
>> all unspecs as a distinct group.
>> 
>> Also, using single_set means that the function still lets through
>> parallels of two sets in which the sources are unspecs.  Is that
>> intentional?
>> 
>> The reasons behind things like the REG_EQUAL/EQUIV and UNSPEC decisions
>> need to be described in comments, so that other people coming to this
>> code later can understand the motivation.  The same thing applies to
>> other decisions in the patch.
>> 
>
> Adjacent load pair fusion with 256 bit OOmode is seen and valid with use of 
> load
> in UNSPEC. Thats why this check is added.

Can you give an example (in terms of rtl)?

> [...]
>>> [...]
>>> +  // Given insn_info pair I1 and I2, return true if offsets are in order.
>>> +  virtual bool should_handle_unordered_insns (rtl_ssa::insn_info *i1,
>>> +                                         rtl_ssa::insn_info *i2) = 0;
>>> +
>> 
>> This name seems a bit misleading.  The function is used in:
>> 
>> @@ -2401,6 +2405,9 @@ pair_fusion_bb_info::try_fuse_pair (bool load_p, 
>> unsigned access_size,
>>        reversed = true;
>>      }
>>  
>> +  if (!m_pass->should_handle_unordered_insns (i1, i2))
>> +    return false;
>> +
>>    rtx cand_mems[2];
>>    rtx reg_ops[2];
>>    rtx pats[2];
>> 
>> and so it acts as a general opt-out.  The insns aren't known to be unordered.
>> 
>> It looks like the rs6000 override requires the original insns to be
>> in offset order.  Could you say why that's necessary?  (Both in email
>> and as a comment in the code.)
>>
>
> Yes rs6000 requires the original load insns to be in offset order.
> Some regression tests like vect-outer-4f fails if we do load pair
> fusion with load offsets are not in offset order as this breaks lxvp 
> semantics.

How does it break the semantics though?  In principle, the generic code
only fuses if it has "proved" that the loads can happen in either order.
So it shouldn't matter which order the hardware does things in.

Could you give an example of the kind of situation that you want
to avoid, and why it generates the wrong result?

Thanks,
Richard

Reply via email to