Re: [patch, rs6000, middle-end 0/1] v1: Add implementation for different targets for pair mem fusion

Ajit Agarwal Mon, 10 Jun 2024 02:43:47 -0700

Hello Richard:

On 10/06/24 2:52 pm, Richard Sandiford wrote:
> Ajit Agarwal <aagar...@linux.ibm.com> writes:
>> On 10/06/24 2:12 pm, Richard Sandiford wrote:
>>> Ajit Agarwal <aagar...@linux.ibm.com> writes:
>>>>>>>>>>>> +
>>>>>>>>>>>> +    rtx set = single_set (insn);
>>>>>>>>>>>> +    if (set == NULL_RTX)
>>>>>>>>>>>> +      return false;
>>>>>>>>>>>> +
>>>>>>>>>>>> +    rtx op0 = SET_SRC (set);
>>>>>>>>>>>> +    rtx_code code = GET_CODE (op0);
>>>>>>>>>>>> +
>>>>>>>>>>>> +    // This check is added as register pairs are not generated
>>>>>>>>>>>> +    // by RA for neg:V2DF (fma: V2DF (reg1)
>>>>>>>>>>>> +    //                  (reg2)
>>>>>>>>>>>> +    //                  (neg:V2DF (reg3)))
>>>>>>>>>>>> +    if (GET_RTX_CLASS (code) == RTX_UNARY)
>>>>>>>>>>>> +      return false;
>>>>>>>>>>>
>>>>>>>>>>> What's special about (neg (fma ...))?
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I am not sure why register allocator fails allocating register pairs 
>>>>>>>>>> with
>>>>>>>>>> NEG Unary operation with fma operand. I have not debugged register 
>>>>>>>>>> allocator why the NEG
>>>>>>>>>> Unary operation with fma operand. 
>>>>>>>>>
>>>>>>>>
>>>>>>>> For neg (fma ...) cases because of subreg 128 bits from OOmode 256 
>>>>>>>> bits are
>>>>>>>> set correctly. 
>>>>>>>> IRA marked them spill candidates as spill priority is zero.
>>>>>>>>
>>>>>>>> Due to this LRA reload pass couldn't allocate register pairs.
>>>>>>>
>>>>>>> I think this is just restating the symptom though.  I suppose the same
>>>>>>> kind of questions apply here too: what was the instruction before the
>>>>>>> pass runs, what was the instruction after the pass runs, and why is
>>>>>>> the rtl change incorrect (by the meaning above)?
>>>>>>>
>>>>>>
>>>>>> Original case where we dont do load fusion, spill happens, in that
>>>>>> case we dont require sequential register pairs to be generated for 2 
>>>>>> loads
>>>>>> for. Hence it worked.
>>>>>>
>>>>>> rtl change is correct and there is no error.
>>>>>>
>>>>>> for load fusion spill happens and we dont generate sequential register 
>>>>>> pairs
>>>>>> because pf spill candidate and lxvp gives incorrect results as 
>>>>>> sequential register
>>>>>> pairs are required for lxvp.
>>>>>
>>>>> Can you go into more detail?  How is the lxvp represented?  And how do
>>>>> we end up not getting a sequential register pair?  What does the rtl
>>>>> look like (before and after things have gone wrong)?
>>>>>
>>>>> It seems like either the rtl is not describing the result of the fusion
>>>>> correctly or there is some problem in the .md description of lxvp.
>>>>>
>>>>
>>>> After fusion pass:
>>>>
>>>> (insn 9299 2472 2412 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] 
>>>> [240])
>>>>         (mem:V2DF (plus:DI (reg:DI 8 8 [orig:1285 ivtmp.886 ] [1285])
>>>>                 (const_int 16 [0x10])) [1 MEM <vector(2) real(kind=8)> 
>>>> [(real(kind=8) *)_4188]+16 S16 A64])) "shell_lam.fppized.f":238:72 1190 
>>>> {vsx_movv2df_64bit}
>>>>      (nil))
>>>> (insn 2412 9299 2477 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] 
>>>> [240])
>>>>         (neg:V2DF (fma:V2DF (reg:V2DF 39 7 [ MEM <vector(2) real(kind=8)> 
>>>> [(real(kind=8) *)_4050]+16 ])
>>>>                 (reg:V2DF 44 12 [3119])
>>>>                 (neg:V2DF (reg:V2DF 51 19 [orig:240 vect__302.545 ] 
>>>> [240]))))) {*vsx_nfmsv2df4}
>>>>      (nil))
>>>>
>>>> In LRA reload.
>>>>
>>>> (insn 2472 2461 2412 161 (set (reg:OO 2572 [ vect__300.543_236 ])
>>>>         (mem:OO (reg:DI 4260 [orig:1285 ivtmp.886 ] [1285]) [1 MEM 
>>>> <vector(2) real(kind=8)> [(real(kind=8) *)_4188]+0 S16 A64])) 
>>>> "shell_lam.fppized.f":238:72 2187 {*movoo}
>>>>      (expr_list:REG_EQUIV (mem:OO (reg:DI 4260 [orig:1285 ivtmp.886 ] 
>>>> [1285]) [1 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4188]+0 S16 A64])
>>>>         (nil)))
>>>> (insn 2412 2472 2477 161 (set (reg:V2DF 240 [ vect__302.545 ])
>>>>         (neg:V2DF (fma:V2DF (subreg:V2DF (reg:OO 2561 [ MEM <vector(2) 
>>>> real(kind=8)> [(real(kind=8) *)_4050] ]) 16)
>>>>                 (reg:V2DF 4283 [3119])
>>>>                 (neg:V2DF (subreg:V2DF (reg:OO 2572 [ vect__300.543_236 ]) 
>>>> 16)))))  {*vsx_nfmsv2df4}
>>>>      (nil))
>>>>
>>>>
>>>> In LRA reload sequential registers are not generated as r2572 is splled 
>>>> and move to spill location
>>>> in stack and subsequent uses loads from stack. Hence sequential registers 
>>>> pairs are not generated.
>>>>
>>>> lxvp vsx0, 0(r1).
>>>>
>>>> It loads from from r1+0 into vsx0 and vsx1 and appropriate uses use 
>>>> sequential register pairs.
>>>>
>>>> Without load fusion since 2 loads exists and 2 loads need not require 
>>>> sequential registers
>>>> hence it worked but with load fusion and using lxvp it requires sequential 
>>>> register pairs.
>>>
>>> Do you mean that this is a performance regression?  I.e. the fact that
>>> lxvp requires sequential registers causes extra spilling, due to having
>>> less allocation freedom?
>>>
>>> Or is it a correctness problem?  If so, what is it?  Nothing in the rtl
>>> above looks wrong in principle (although I've no idea if the REG_EQUIV
>>> is correct in this context).  What does the allocated code look like,
>>> and why is it wrong?
>>>
>>> If (reg:OO 2561) is spilled and then one half of it used, only that half
>>> needs to be loaded from the spill slot.  E.g. if (reg:OO 2561) is reloaded
>>> for insn 2412 on its own, only the second half of the register needs to be
>>> loaded from memory.
>>>
>>
>> This is bwaves spec 2017 benchmark. Spill happening in register allocator
>> could be because of less registers available in order to generate
>> sequential registers for lxvp.
>>
>> Because of spill sequential registers are not generated and breaks the
>> correctness. REG_EQUIV is generated by IRA.
>>
>> Allocated code because of spill doesn't generate sequential registers.
>> In LRA reload because of spill marked for IRA adjust to another
>> register causing not to generate sequential registers.
>>
>> reg:OO 2572 is spilled not reg:OO 2561. Because of spilled its
>> loaded from memory instead of generating sequential registers.
>>
>> Other reasons for spilling because of long live range for reg:OO 2572.
>>
>> We can add heuristics in fusion code not to fuse for longer live
>> ranges that should solve the problem.
> 
> This doesn't describe the real problem though.  It's natural for
> registers to be spilled sometimes.  That would happen for OOmode
> registers even if we didn't form them in the fusion pass.  And it's
> normal GCC semantics that the two hard registers allocated to an OOmode
> pseudo are consecutive.
> 
> Like I asked above, please show the allocated code (including spills
> and reloads) and explain why it's wrong.
>


After LRA reload:

(insn 9299 2472 2412 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240])
        (mem:V2DF (plus:DI (reg:DI 8 8 [orig:1285 ivtmp.886 ] [1285])
                (const_int 16 [0x10])) [1 MEM <vector(2) real(kind=8)> 
[(real(kind=8) *)_4188]+16 S16 A64])) "shell_lam.fppized.f":238:72 1190 
{vsx_movv2df_64bit}
     (nil))
(insn 2412 9299 2477 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240])
        (neg:V2DF (fma:V2DF (reg:V2DF 39 7 [ MEM <vector(2) real(kind=8)> 
[(real(kind=8) *)_4050]+16 ])
                (reg:V2DF 44 12 [3119])
                (neg:V2DF (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240]))))) 
{*vsx_nfmsv2df4}
     (nil))

(insn 2473 9311 9312 187 (set (reg:V2DF 38 6 [orig:905 vect__302.545 ] [905])
        (neg:V2DF (fma:V2DF (reg:V2DF 44 12 [3119])
                (reg:V2DF 38 6 [orig:2561 MEM <vector(2) real(kind=8)> 
[(real(kind=8) *)_4050] ] [2561])
                (neg:V2DF (reg:V2DF 47 15 [5266]))))) {*vsx_nfmsv2df4}
     (nil))

In the above allocated code it assign registers 51 and 47 and they are not 
sequential.

> Richard

Thanks & Regards
Ajit

Re: [patch, rs6000, middle-end 0/1] v1: Add implementation for different targets for pair mem fusion

Reply via email to