Re: [patch, rs6000, middle-end 0/1] v1: Add implementation for different targets for pair mem fusion

Richard Sandiford Mon, 10 Jun 2024 02:50:26 -0700

Ajit Agarwal <aagar...@linux.ibm.com> writes:
> Hello Richard:
>
> On 10/06/24 2:52 pm, Richard Sandiford wrote:
>> Ajit Agarwal <aagar...@linux.ibm.com> writes:
>>> On 10/06/24 2:12 pm, Richard Sandiford wrote:
>>>> Ajit Agarwal <aagar...@linux.ibm.com> writes:
>>>>>>>>>>>>> +
>>>>>>>>>>>>> +   rtx set = single_set (insn);
>>>>>>>>>>>>> +   if (set == NULL_RTX)
>>>>>>>>>>>>> +     return false;
>>>>>>>>>>>>> +
>>>>>>>>>>>>> +   rtx op0 = SET_SRC (set);
>>>>>>>>>>>>> +   rtx_code code = GET_CODE (op0);
>>>>>>>>>>>>> +
>>>>>>>>>>>>> +   // This check is added as register pairs are not generated
>>>>>>>>>>>>> +   // by RA for neg:V2DF (fma: V2DF (reg1)
>>>>>>>>>>>>> +   //                  (reg2)
>>>>>>>>>>>>> +   //                  (neg:V2DF (reg3)))
>>>>>>>>>>>>> +   if (GET_RTX_CLASS (code) == RTX_UNARY)
>>>>>>>>>>>>> +     return false;
>>>>>>>>>>>>
>>>>>>>>>>>> What's special about (neg (fma ...))?
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I am not sure why register allocator fails allocating register 
>>>>>>>>>>> pairs with
>>>>>>>>>>> NEG Unary operation with fma operand. I have not debugged register 
>>>>>>>>>>> allocator why the NEG
>>>>>>>>>>> Unary operation with fma operand. 
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> For neg (fma ...) cases because of subreg 128 bits from OOmode 256 
>>>>>>>>> bits are
>>>>>>>>> set correctly. 
>>>>>>>>> IRA marked them spill candidates as spill priority is zero.
>>>>>>>>>
>>>>>>>>> Due to this LRA reload pass couldn't allocate register pairs.
>>>>>>>>
>>>>>>>> I think this is just restating the symptom though.  I suppose the same
>>>>>>>> kind of questions apply here too: what was the instruction before the
>>>>>>>> pass runs, what was the instruction after the pass runs, and why is
>>>>>>>> the rtl change incorrect (by the meaning above)?
>>>>>>>>
>>>>>>>
>>>>>>> Original case where we dont do load fusion, spill happens, in that
>>>>>>> case we dont require sequential register pairs to be generated for 2 
>>>>>>> loads
>>>>>>> for. Hence it worked.
>>>>>>>
>>>>>>> rtl change is correct and there is no error.
>>>>>>>
>>>>>>> for load fusion spill happens and we dont generate sequential register 
>>>>>>> pairs
>>>>>>> because pf spill candidate and lxvp gives incorrect results as 
>>>>>>> sequential register
>>>>>>> pairs are required for lxvp.
>>>>>>
>>>>>> Can you go into more detail?  How is the lxvp represented?  And how do
>>>>>> we end up not getting a sequential register pair?  What does the rtl
>>>>>> look like (before and after things have gone wrong)?
>>>>>>
>>>>>> It seems like either the rtl is not describing the result of the fusion
>>>>>> correctly or there is some problem in the .md description of lxvp.
>>>>>>
>>>>>
>>>>> After fusion pass:
>>>>>
>>>>> (insn 9299 2472 2412 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] 
>>>>> [240])
>>>>>         (mem:V2DF (plus:DI (reg:DI 8 8 [orig:1285 ivtmp.886 ] [1285])
>>>>>                 (const_int 16 [0x10])) [1 MEM <vector(2) real(kind=8)> 
>>>>> [(real(kind=8) *)_4188]+16 S16 A64])) "shell_lam.fppized.f":238:72 1190 
>>>>> {vsx_movv2df_64bit}
>>>>>      (nil))
>>>>> (insn 2412 9299 2477 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] 
>>>>> [240])
>>>>>         (neg:V2DF (fma:V2DF (reg:V2DF 39 7 [ MEM <vector(2) real(kind=8)> 
>>>>> [(real(kind=8) *)_4050]+16 ])
>>>>>                 (reg:V2DF 44 12 [3119])
>>>>>                 (neg:V2DF (reg:V2DF 51 19 [orig:240 vect__302.545 ] 
>>>>> [240]))))) {*vsx_nfmsv2df4}
>>>>>      (nil))
>>>>>
>>>>> In LRA reload.
>>>>>
>>>>> (insn 2472 2461 2412 161 (set (reg:OO 2572 [ vect__300.543_236 ])
>>>>>         (mem:OO (reg:DI 4260 [orig:1285 ivtmp.886 ] [1285]) [1 MEM 
>>>>> <vector(2) real(kind=8)> [(real(kind=8) *)_4188]+0 S16 A64])) 
>>>>> "shell_lam.fppized.f":238:72 2187 {*movoo}
>>>>>      (expr_list:REG_EQUIV (mem:OO (reg:DI 4260 [orig:1285 ivtmp.886 ] 
>>>>> [1285]) [1 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4188]+0 S16 
>>>>> A64])
>>>>>         (nil)))
>>>>> (insn 2412 2472 2477 161 (set (reg:V2DF 240 [ vect__302.545 ])
>>>>>         (neg:V2DF (fma:V2DF (subreg:V2DF (reg:OO 2561 [ MEM <vector(2) 
>>>>> real(kind=8)> [(real(kind=8) *)_4050] ]) 16)
>>>>>                 (reg:V2DF 4283 [3119])
>>>>>                 (neg:V2DF (subreg:V2DF (reg:OO 2572 [ vect__300.543_236 
>>>>> ]) 16)))))  {*vsx_nfmsv2df4}
>>>>>      (nil))
>>>>>
>>>>>
>>>>> In LRA reload sequential registers are not generated as r2572 is splled 
>>>>> and move to spill location
>>>>> in stack and subsequent uses loads from stack. Hence sequential registers 
>>>>> pairs are not generated.
>>>>>
>>>>> lxvp vsx0, 0(r1).
>>>>>
>>>>> It loads from from r1+0 into vsx0 and vsx1 and appropriate uses use 
>>>>> sequential register pairs.
>>>>>
>>>>> Without load fusion since 2 loads exists and 2 loads need not require 
>>>>> sequential registers
>>>>> hence it worked but with load fusion and using lxvp it requires 
>>>>> sequential register pairs.
>>>>
>>>> Do you mean that this is a performance regression?  I.e. the fact that
>>>> lxvp requires sequential registers causes extra spilling, due to having
>>>> less allocation freedom?
>>>>
>>>> Or is it a correctness problem?  If so, what is it?  Nothing in the rtl
>>>> above looks wrong in principle (although I've no idea if the REG_EQUIV
>>>> is correct in this context).  What does the allocated code look like,
>>>> and why is it wrong?
>>>>
>>>> If (reg:OO 2561) is spilled and then one half of it used, only that half
>>>> needs to be loaded from the spill slot.  E.g. if (reg:OO 2561) is reloaded
>>>> for insn 2412 on its own, only the second half of the register needs to be
>>>> loaded from memory.
>>>>
>>>
>>> This is bwaves spec 2017 benchmark. Spill happening in register allocator
>>> could be because of less registers available in order to generate
>>> sequential registers for lxvp.
>>>
>>> Because of spill sequential registers are not generated and breaks the
>>> correctness. REG_EQUIV is generated by IRA.
>>>
>>> Allocated code because of spill doesn't generate sequential registers.
>>> In LRA reload because of spill marked for IRA adjust to another
>>> register causing not to generate sequential registers.
>>>
>>> reg:OO 2572 is spilled not reg:OO 2561. Because of spilled its
>>> loaded from memory instead of generating sequential registers.
>>>
>>> Other reasons for spilling because of long live range for reg:OO 2572.
>>>
>>> We can add heuristics in fusion code not to fuse for longer live
>>> ranges that should solve the problem.
>> 
>> This doesn't describe the real problem though.  It's natural for
>> registers to be spilled sometimes.  That would happen for OOmode
>> registers even if we didn't form them in the fusion pass.  And it's
>> normal GCC semantics that the two hard registers allocated to an OOmode
>> pseudo are consecutive.
>> 
>> Like I asked above, please show the allocated code (including spills
>> and reloads) and explain why it's wrong.
>> 
>
> After LRA reload:
>
> (insn 9299 2472 2412 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240])
>         (mem:V2DF (plus:DI (reg:DI 8 8 [orig:1285 ivtmp.886 ] [1285])
>                 (const_int 16 [0x10])) [1 MEM <vector(2) real(kind=8)> 
> [(real(kind=8) *)_4188]+16 S16 A64])) "shell_lam.fppized.f":238:72 1190 
> {vsx_movv2df_64bit}
>      (nil))
> (insn 2412 9299 2477 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240])
>         (neg:V2DF (fma:V2DF (reg:V2DF 39 7 [ MEM <vector(2) real(kind=8)> 
> [(real(kind=8) *)_4050]+16 ])
>                 (reg:V2DF 44 12 [3119])
>                 (neg:V2DF (reg:V2DF 51 19 [orig:240 vect__302.545 ] 
> [240]))))) {*vsx_nfmsv2df4}
>      (nil))
>
> (insn 2473 9311 9312 187 (set (reg:V2DF 38 6 [orig:905 vect__302.545 ] [905])
>         (neg:V2DF (fma:V2DF (reg:V2DF 44 12 [3119])
>                 (reg:V2DF 38 6 [orig:2561 MEM <vector(2) real(kind=8)> 
> [(real(kind=8) *)_4050] ] [2561])
>                 (neg:V2DF (reg:V2DF 47 15 [5266]))))) {*vsx_nfmsv2df4}
>      (nil))
>
> In the above allocated code it assign registers 51 and 47 and they are not 
> sequential.


The reload for 2412 looks valid.  What was the original pre-reload
version of insn 2473?  Also, what happened to insn 2472?  Was it deleted?

Richard

Re: [patch, rs6000, middle-end 0/1] v1: Add implementation for different targets for pair mem fusion

Reply via email to