Re: [patch, rs6000, middle-end 0/1] v1: Add implementation for different targets for pair mem fusion

Ajit Agarwal Mon, 10 Jun 2024 03:14:03 -0700

Hello Richard:

On 10/06/24 3:20 pm, Richard Sandiford wrote:
> Ajit Agarwal <aagar...@linux.ibm.com> writes:
>> Hello Richard:
>>
>> On 10/06/24 2:52 pm, Richard Sandiford wrote:
>>> Ajit Agarwal <aagar...@linux.ibm.com> writes:
>>>> On 10/06/24 2:12 pm, Richard Sandiford wrote:
>>>>> Ajit Agarwal <aagar...@linux.ibm.com> writes:
>>>>>>>>>>>>>> +
>>>>>>>>>>>>>> +          rtx set = single_set (insn);
>>>>>>>>>>>>>> +          if (set == NULL_RTX)
>>>>>>>>>>>>>> +            return false;
>>>>>>>>>>>>>> +
>>>>>>>>>>>>>> +          rtx op0 = SET_SRC (set);
>>>>>>>>>>>>>> +          rtx_code code = GET_CODE (op0);
>>>>>>>>>>>>>> +
>>>>>>>>>>>>>> +          // This check is added as register pairs are not 
>>>>>>>>>>>>>> generated
>>>>>>>>>>>>>> +          // by RA for neg:V2DF (fma: V2DF (reg1)
>>>>>>>>>>>>>> +          //                  (reg2)
>>>>>>>>>>>>>> +          //                  (neg:V2DF (reg3)))
>>>>>>>>>>>>>> +          if (GET_RTX_CLASS (code) == RTX_UNARY)
>>>>>>>>>>>>>> +            return false;
>>>>>>>>>>>>>
>>>>>>>>>>>>> What's special about (neg (fma ...))?
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I am not sure why register allocator fails allocating register 
>>>>>>>>>>>> pairs with
>>>>>>>>>>>> NEG Unary operation with fma operand. I have not debugged register 
>>>>>>>>>>>> allocator why the NEG
>>>>>>>>>>>> Unary operation with fma operand. 
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> For neg (fma ...) cases because of subreg 128 bits from OOmode 256 
>>>>>>>>>> bits are
>>>>>>>>>> set correctly. 
>>>>>>>>>> IRA marked them spill candidates as spill priority is zero.
>>>>>>>>>>
>>>>>>>>>> Due to this LRA reload pass couldn't allocate register pairs.
>>>>>>>>>
>>>>>>>>> I think this is just restating the symptom though.  I suppose the same
>>>>>>>>> kind of questions apply here too: what was the instruction before the
>>>>>>>>> pass runs, what was the instruction after the pass runs, and why is
>>>>>>>>> the rtl change incorrect (by the meaning above)?
>>>>>>>>>
>>>>>>>>
>>>>>>>> Original case where we dont do load fusion, spill happens, in that
>>>>>>>> case we dont require sequential register pairs to be generated for 2 
>>>>>>>> loads
>>>>>>>> for. Hence it worked.
>>>>>>>>
>>>>>>>> rtl change is correct and there is no error.
>>>>>>>>
>>>>>>>> for load fusion spill happens and we dont generate sequential register 
>>>>>>>> pairs
>>>>>>>> because pf spill candidate and lxvp gives incorrect results as 
>>>>>>>> sequential register
>>>>>>>> pairs are required for lxvp.
>>>>>>>
>>>>>>> Can you go into more detail?  How is the lxvp represented?  And how do
>>>>>>> we end up not getting a sequential register pair?  What does the rtl
>>>>>>> look like (before and after things have gone wrong)?
>>>>>>>
>>>>>>> It seems like either the rtl is not describing the result of the fusion
>>>>>>> correctly or there is some problem in the .md description of lxvp.
>>>>>>>
>>>>>>
>>>>>> After fusion pass:
>>>>>>
>>>>>> (insn 9299 2472 2412 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] 
>>>>>> [240])
>>>>>>         (mem:V2DF (plus:DI (reg:DI 8 8 [orig:1285 ivtmp.886 ] [1285])
>>>>>>                 (const_int 16 [0x10])) [1 MEM <vector(2) real(kind=8)> 
>>>>>> [(real(kind=8) *)_4188]+16 S16 A64])) "shell_lam.fppized.f":238:72 1190 
>>>>>> {vsx_movv2df_64bit}
>>>>>>      (nil))
>>>>>> (insn 2412 9299 2477 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] 
>>>>>> [240])
>>>>>>         (neg:V2DF (fma:V2DF (reg:V2DF 39 7 [ MEM <vector(2) 
>>>>>> real(kind=8)> [(real(kind=8) *)_4050]+16 ])
>>>>>>                 (reg:V2DF 44 12 [3119])
>>>>>>                 (neg:V2DF (reg:V2DF 51 19 [orig:240 vect__302.545 ] 
>>>>>> [240]))))) {*vsx_nfmsv2df4}
>>>>>>      (nil))
>>>>>>
>>>>>> In LRA reload.
>>>>>>
>>>>>> (insn 2472 2461 2412 161 (set (reg:OO 2572 [ vect__300.543_236 ])
>>>>>>         (mem:OO (reg:DI 4260 [orig:1285 ivtmp.886 ] [1285]) [1 MEM 
>>>>>> <vector(2) real(kind=8)> [(real(kind=8) *)_4188]+0 S16 A64])) 
>>>>>> "shell_lam.fppized.f":238:72 2187 {*movoo}
>>>>>>      (expr_list:REG_EQUIV (mem:OO (reg:DI 4260 [orig:1285 ivtmp.886 ] 
>>>>>> [1285]) [1 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4188]+0 S16 
>>>>>> A64])
>>>>>>         (nil)))
>>>>>> (insn 2412 2472 2477 161 (set (reg:V2DF 240 [ vect__302.545 ])
>>>>>>         (neg:V2DF (fma:V2DF (subreg:V2DF (reg:OO 2561 [ MEM <vector(2) 
>>>>>> real(kind=8)> [(real(kind=8) *)_4050] ]) 16)
>>>>>>                 (reg:V2DF 4283 [3119])
>>>>>>                 (neg:V2DF (subreg:V2DF (reg:OO 2572 [ vect__300.543_236 
>>>>>> ]) 16)))))  {*vsx_nfmsv2df4}
>>>>>>      (nil))
>>>>>>
>>>>>>
>>>>>> In LRA reload sequential registers are not generated as r2572 is splled 
>>>>>> and move to spill location
>>>>>> in stack and subsequent uses loads from stack. Hence sequential 
>>>>>> registers pairs are not generated.
>>>>>>
>>>>>> lxvp vsx0, 0(r1).
>>>>>>
>>>>>> It loads from from r1+0 into vsx0 and vsx1 and appropriate uses use 
>>>>>> sequential register pairs.
>>>>>>
>>>>>> Without load fusion since 2 loads exists and 2 loads need not require 
>>>>>> sequential registers
>>>>>> hence it worked but with load fusion and using lxvp it requires 
>>>>>> sequential register pairs.
>>>>>
>>>>> Do you mean that this is a performance regression?  I.e. the fact that
>>>>> lxvp requires sequential registers causes extra spilling, due to having
>>>>> less allocation freedom?
>>>>>
>>>>> Or is it a correctness problem?  If so, what is it?  Nothing in the rtl
>>>>> above looks wrong in principle (although I've no idea if the REG_EQUIV
>>>>> is correct in this context).  What does the allocated code look like,
>>>>> and why is it wrong?
>>>>>
>>>>> If (reg:OO 2561) is spilled and then one half of it used, only that half
>>>>> needs to be loaded from the spill slot.  E.g. if (reg:OO 2561) is reloaded
>>>>> for insn 2412 on its own, only the second half of the register needs to be
>>>>> loaded from memory.
>>>>>
>>>>
>>>> This is bwaves spec 2017 benchmark. Spill happening in register allocator
>>>> could be because of less registers available in order to generate
>>>> sequential registers for lxvp.
>>>>
>>>> Because of spill sequential registers are not generated and breaks the
>>>> correctness. REG_EQUIV is generated by IRA.
>>>>
>>>> Allocated code because of spill doesn't generate sequential registers.
>>>> In LRA reload because of spill marked for IRA adjust to another
>>>> register causing not to generate sequential registers.
>>>>
>>>> reg:OO 2572 is spilled not reg:OO 2561. Because of spilled its
>>>> loaded from memory instead of generating sequential registers.
>>>>
>>>> Other reasons for spilling because of long live range for reg:OO 2572.
>>>>
>>>> We can add heuristics in fusion code not to fuse for longer live
>>>> ranges that should solve the problem.
>>>
>>> This doesn't describe the real problem though.  It's natural for
>>> registers to be spilled sometimes.  That would happen for OOmode
>>> registers even if we didn't form them in the fusion pass.  And it's
>>> normal GCC semantics that the two hard registers allocated to an OOmode
>>> pseudo are consecutive.
>>>
>>> Like I asked above, please show the allocated code (including spills
>>> and reloads) and explain why it's wrong.
>>>
>>
>> After LRA reload:
>>
>> (insn 9299 2472 2412 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] 
>> [240])
>>         (mem:V2DF (plus:DI (reg:DI 8 8 [orig:1285 ivtmp.886 ] [1285])
>>                 (const_int 16 [0x10])) [1 MEM <vector(2) real(kind=8)> 
>> [(real(kind=8) *)_4188]+16 S16 A64])) "shell_lam.fppized.f":238:72 1190 
>> {vsx_movv2df_64bit}
>>      (nil))
>> (insn 2412 9299 2477 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] 
>> [240])
>>         (neg:V2DF (fma:V2DF (reg:V2DF 39 7 [ MEM <vector(2) real(kind=8)> 
>> [(real(kind=8) *)_4050]+16 ])
>>                 (reg:V2DF 44 12 [3119])
>>                 (neg:V2DF (reg:V2DF 51 19 [orig:240 vect__302.545 ] 
>> [240]))))) {*vsx_nfmsv2df4}
>>      (nil))
>>
>> (insn 2473 9311 9312 187 (set (reg:V2DF 38 6 [orig:905 vect__302.545 ] [905])
>>         (neg:V2DF (fma:V2DF (reg:V2DF 44 12 [3119])
>>                 (reg:V2DF 38 6 [orig:2561 MEM <vector(2) real(kind=8)> 
>> [(real(kind=8) *)_4050] ] [2561])
>>                 (neg:V2DF (reg:V2DF 47 15 [5266]))))) {*vsx_nfmsv2df4}
>>      (nil))
>>
>> In the above allocated code it assign registers 51 and 47 and they are not 
>> sequential.
> 
> The reload for 2412 looks valid.  What was the original pre-reload
> version of insn 2473?  Also, what happened to insn 2472?  Was it deleted?
>


This is preload version of 2473:

(insn 2473 2396 2478 161 (set (reg:V2DF 905 [ vect__302.545 ])
        (neg:V2DF (fma:V2DF (reg:V2DF 4283 [3119])
                (subreg:V2DF (reg:OO 2561 [ MEM <vector(2) real(kind=8)> 
[(real(kind=8) *)_4050] ]) 0)
                (neg:V2DF (subreg:V2DF (reg:OO 2572 [ vect__300.543_236 ]) 
0))))) {*vsx_nfmsv2df4}
     (expr_list:REG_DEAD (reg:OO 2572 [ vect__300.543_236 ])
        (expr_list:REG_DEAD (reg:OO 2561 [ MEM <vector(2) real(kind=8)> 
[(real(kind=8) *)_4050] ])
            (nil))))

insn 2472 is replaced with 9299 after reload.

> Richard

Thanks & Regards
Ajit

Re: [patch, rs6000, middle-end 0/1] v1: Add implementation for different targets for pair mem fusion

Reply via email to