Re: [patch, rs6000, middle-end 0/1] v1: Add implementation for different targets for pair mem fusion

Richard Sandiford Mon, 10 Jun 2024 01:43:05 -0700

Ajit Agarwal <aagar...@linux.ibm.com> writes:
>>>>>>>>> +
>>>>>>>>> +       rtx set = single_set (insn);
>>>>>>>>> +       if (set == NULL_RTX)
>>>>>>>>> +         return false;
>>>>>>>>> +
>>>>>>>>> +       rtx op0 = SET_SRC (set);
>>>>>>>>> +       rtx_code code = GET_CODE (op0);
>>>>>>>>> +
>>>>>>>>> +       // This check is added as register pairs are not generated
>>>>>>>>> +       // by RA for neg:V2DF (fma: V2DF (reg1)
>>>>>>>>> +       //                  (reg2)
>>>>>>>>> +       //                  (neg:V2DF (reg3)))
>>>>>>>>> +       if (GET_RTX_CLASS (code) == RTX_UNARY)
>>>>>>>>> +         return false;
>>>>>>>>
>>>>>>>> What's special about (neg (fma ...))?
>>>>>>>>
>>>>>>>
>>>>>>> I am not sure why the register allocator fails to allocate register
>>>>>>> pairs for a NEG unary operation with an fma operand. I have not
>>>>>>> debugged the register allocator to find out why.
>>>>>>
>>>>>
>>>>> For neg (fma ...) cases, the 128-bit subregs of the 256-bit OOmode
>>>>> register are set correctly.
>>>>> IRA marked them as spill candidates because their spill priority is zero.
>>>>>
>>>>> Due to this LRA reload pass couldn't allocate register pairs.
>>>>
>>>> I think this is just restating the symptom though.  I suppose the same
>>>> kind of questions apply here too: what was the instruction before the
>>>> pass runs, what was the instruction after the pass runs, and why is
>>>> the rtl change incorrect (by the meaning above)?
>>>>
>>>
>>> In the original case, where we don't do load fusion, a spill happens, but
>>> sequential register pairs are not required for the 2 loads.
>>> Hence it worked.
>>>
>>> rtl change is correct and there is no error.
>>>
>>> With load fusion, a spill happens and we don't generate sequential register
>>> pairs because of the spill candidate, and lxvp gives incorrect results
>>> because sequential register pairs are required for lxvp.
>> 
>> Can you go into more detail?  How is the lxvp represented?  And how do
>> we end up not getting a sequential register pair?  What does the rtl
>> look like (before and after things have gone wrong)?
>> 
>> It seems like either the rtl is not describing the result of the fusion
>> correctly or there is some problem in the .md description of lxvp.
>> 
>
> After fusion pass:
>
> (insn 9299 2472 2412 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240])
>         (mem:V2DF (plus:DI (reg:DI 8 8 [orig:1285 ivtmp.886 ] [1285])
>                 (const_int 16 [0x10])) [1 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4188]+16 S16 A64])) "shell_lam.fppized.f":238:72 1190 {vsx_movv2df_64bit}
>      (nil))
> (insn 2412 9299 2477 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240])
>         (neg:V2DF (fma:V2DF (reg:V2DF 39 7 [ MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050]+16 ])
>                 (reg:V2DF 44 12 [3119])
>                 (neg:V2DF (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240]))))) {*vsx_nfmsv2df4}
>      (nil))
>
> In LRA reload.
>
> (insn 2472 2461 2412 161 (set (reg:OO 2572 [ vect__300.543_236 ])
>         (mem:OO (reg:DI 4260 [orig:1285 ivtmp.886 ] [1285]) [1 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4188]+0 S16 A64])) "shell_lam.fppized.f":238:72 2187 {*movoo}
>      (expr_list:REG_EQUIV (mem:OO (reg:DI 4260 [orig:1285 ivtmp.886 ] [1285]) [1 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4188]+0 S16 A64])
>         (nil)))
> (insn 2412 2472 2477 161 (set (reg:V2DF 240 [ vect__302.545 ])
>         (neg:V2DF (fma:V2DF (subreg:V2DF (reg:OO 2561 [ MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050] ]) 16)
>                 (reg:V2DF 4283 [3119])
>                 (neg:V2DF (subreg:V2DF (reg:OO 2572 [ vect__300.543_236 ]) 16)))))  {*vsx_nfmsv2df4}
>      (nil))
>
>
> In LRA reload, sequential registers are not generated because r2572 is
> spilled and moved to a spill location on the stack, and subsequent uses load
> from the stack. Hence sequential register pairs are not generated.
>
> lxvp vsx0, 0(r1).
>
> It loads from r1+0 into vsx0 and vsx1, and the appropriate uses use
> sequential register pairs.
>
> Without load fusion there are 2 loads, and 2 loads do not require sequential
> registers, hence it worked. With load fusion, using lxvp requires sequential
> register pairs.


Do you mean that this is a performance regression?  I.e. the fact that
lxvp requires sequential registers causes extra spilling, due to having
less allocation freedom?

Or is it a correctness problem?  If so, what is it?  Nothing in the rtl
above looks wrong in principle (although I've no idea if the REG_EQUIV
is correct in this context).  What does the allocated code look like,
and why is it wrong?

If (reg:OO 2561) is spilled and then one half of it used, only that half
needs to be loaded from the spill slot.  E.g. if (reg:OO 2561) is reloaded
for insn 2412 on its own, only the second half of the register needs to be
loaded from memory.
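
To make the point about half-register reloads concrete, here is a rough sketch of what I mean. It is a toy model, not GCC code: the names (SpillSlot, VecHalf, reload_half, make_demo_slot) are made up for illustration, and I am assuming a 32-byte spill slot for the OOmode value with each V2DF half occupying 16 bytes, so that a subreg at byte offset 16 selects the second half.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical model, not GCC internals: a 256-bit OOmode value occupies
// 32 bytes in its spill slot, and each V2DF half is 16 bytes.
constexpr std::size_t kPairBytes = 32;
constexpr std::size_t kHalfBytes = 16;

using SpillSlot = std::array<std::uint8_t, kPairBytes>;
using VecHalf = std::array<std::uint8_t, kHalfBytes>;

// Reload the half that (subreg:V2DF (reg:OO N) byte_offset) refers to.
// Only kHalfBytes are read from the slot; the other half is not touched,
// so no register pair is needed for this use.
VecHalf reload_half(const SpillSlot &slot, std::size_t byte_offset)
{
  assert(byte_offset == 0 || byte_offset == kHalfBytes);
  VecHalf half{};
  std::memcpy(half.data(), slot.data() + byte_offset, kHalfBytes);
  return half;
}

// Fill a slot with the byte pattern 0, 1, ..., 31 for demonstration.
SpillSlot make_demo_slot()
{
  SpillSlot slot{};
  for (std::size_t i = 0; i < kPairBytes; ++i)
    slot[i] = static_cast<std::uint8_t>(i);
  return slot;
}
```

In this model, reload_half (slot, 16) corresponds to the use in insn 2412: it reads bytes 16 to 31 of the slot into a single vector register and never needs the first half.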

Richard
