Re: [patch, rs6000, middle-end 0/1] v1: Add implementation for different targets for pair mem fusion

Richard Sandiford Tue, 11 Jun 2024 06:38:06 -0700

Ajit Agarwal <aagar...@linux.ibm.com> writes:
> Hello Richard:
> On 11/06/24 6:12 pm, Richard Sandiford wrote:
>> Ajit Agarwal <aagar...@linux.ibm.com> writes:
>>> Hello Richard:
>>>
>>> On 11/06/24 5:15 pm, Richard Sandiford wrote:
>>>> Ajit Agarwal <aagar...@linux.ibm.com> writes:
>>>>> Hello Richard:
>>>>> On 11/06/24 4:56 pm, Ajit Agarwal wrote:
>>>>>> Hello Richard:
>>>>>>
>>>>>> On 11/06/24 4:36 pm, Richard Sandiford wrote:
>>>>>>> Ajit Agarwal <aagar...@linux.ibm.com> writes:
>>>>>>>>>>>> After LRA reload:
>>>>>>>>>>>>
>>>>>>>>>>>> (insn 9299 2472 2412 187 (set (reg:V2DF 51 19 [orig:240 
>>>>>>>>>>>> vect__302.545 ] [240])
>>>>>>>>>>>>         (mem:V2DF (plus:DI (reg:DI 8 8 [orig:1285 ivtmp.886 ] 
>>>>>>>>>>>> [1285])
>>>>>>>>>>>>                 (const_int 16 [0x10])) [1 MEM <vector(2) 
>>>>>>>>>>>> real(kind=8)> [(real(kind=8) *)_4188]+16 S16 A64])) 
>>>>>>>>>>>> "shell_lam.fppized.f":238:72 1190 {vsx_movv2df_64bit}
>>>>>>>>>>>>      (nil))
>>>>>>>>>>>> (insn 2412 9299 2477 187 (set (reg:V2DF 51 19 [orig:240 
>>>>>>>>>>>> vect__302.545 ] [240])
>>>>>>>>>>>>         (neg:V2DF (fma:V2DF (reg:V2DF 39 7 [ MEM <vector(2) 
>>>>>>>>>>>> real(kind=8)> [(real(kind=8) *)_4050]+16 ])
>>>>>>>>>>>>                 (reg:V2DF 44 12 [3119])
>>>>>>>>>>>>                 (neg:V2DF (reg:V2DF 51 19 [orig:240 vect__302.545 
>>>>>>>>>>>> ] [240]))))) {*vsx_nfmsv2df4}
>>>>>>>>>>>>      (nil))
>>>>>>>>>>>>
>>>>>>>>>>>> (insn 2473 9311 9312 187 (set (reg:V2DF 38 6 [orig:905 
>>>>>>>>>>>> vect__302.545 ] [905])
>>>>>>>>>>>>         (neg:V2DF (fma:V2DF (reg:V2DF 44 12 [3119])
>>>>>>>>>>>>                 (reg:V2DF 38 6 [orig:2561 MEM <vector(2) 
>>>>>>>>>>>> real(kind=8)> [(real(kind=8) *)_4050] ] [2561])
>>>>>>>>>>>>                 (neg:V2DF (reg:V2DF 47 15 [5266]))))) 
>>>>>>>>>>>> {*vsx_nfmsv2df4}
>>>>>>>>>>>>      (nil))
>>>>>>>>>>>>
>>>>>>>>>>>> In the above allocated code it assign registers 51 and 47 and they 
>>>>>>>>>>>> are not sequential.
>>>>>>>>>>>
>>>>>>>>>>> The reload for 2412 looks valid.  What was the original pre-reload
>>>>>>>>>>> version of insn 2473?  Also, what happened to insn 2472?  Was it 
>>>>>>>>>>> deleted?
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> This is preload version of 2473:
>>>>>>>>>>
>>>>>>>>>> (insn 2473 2396 2478 161 (set (reg:V2DF 905 [ vect__302.545 ])
>>>>>>>>>>         (neg:V2DF (fma:V2DF (reg:V2DF 4283 [3119])
>>>>>>>>>>                 (subreg:V2DF (reg:OO 2561 [ MEM <vector(2) 
>>>>>>>>>> real(kind=8)> [(real(kind=8) *)_4050] ]) 0)
>>>>>>>>>>                 (neg:V2DF (subreg:V2DF (reg:OO 2572 [ 
>>>>>>>>>> vect__300.543_236 ]) 0))))) {*vsx_nfmsv2df4}
>>>>>>>>>>      (expr_list:REG_DEAD (reg:OO 2572 [ vect__300.543_236 ])
>>>>>>>>>>         (expr_list:REG_DEAD (reg:OO 2561 [ MEM <vector(2) 
>>>>>>>>>> real(kind=8)> [(real(kind=8) *)_4050] ])
>>>>>>>>>>             (nil))))
>>>>>>>>>>
>>>>>>>>>> insn 2472 is replaced with 9299 after reload.
>>>>>>>>>
>>>>>>>>> You'd have to check the dumps to be sure, but I think 9299 is instead
>>>>>>>>> generated as an input reload of 2412, rather than being a replacement
>>>>>>>>> of insn 2472.  T
>>>>>>>>
>>>>>>>> Yes it is generated for 2412. The predecessor of 2412 is load from
>>>>>>>> plus offset as in 2412 we have subreg:V2DF (reg OO 2572) 16).
>>>>>>>>
>>>>>>>> This is not correct as we are not generating lxvp and it is 
>>>>>>>> normal load lxv.
>>>>>>>> As normal load is generated in predecessor insn of 2412 with
>>>>>>>> plus constant offset it breaks the correctness.
>>>>>>>
>>>>>>> Not using lxvp is a deliberate choice though.
>>>>>>>
>>>>>>> If a (reg:OO R) has been spilled, there's no requirement for LRA
>>>>>>> to load both halves of R when only one half is needed.  LRA just
>>>>>>> loads what it needs into whichever registers happen to be free.
>>>>>>>
>>>>>>> If the reload of R instead used lxvp, LRA would be forced to free
>>>>>>> up another register for the other half of R, even though that value
>>>>>>> would never be used.
>>>>>>>
>>>>>>
>>>>>> If a (reg:OO R ) 16 is loaded when it is spilled then loaded value
>>>>>> will be from plus offset 16 instead it should be loaded value 
>>>>>> from zero offset. As in load fusion pass we are replacing
>>>>>> (reg:V2DI R) with subreg (reg:OO R) 16 and hence loaded value
>>>>>> is from plus 16 offsets and thats why its breaking the correctness.
>>>>>>
>>>>>> Similarly we are replacing (reg:V2DI R) 16 with subreg (reg:OO R) 0
>>>>>> and loaded value is from 16 offset instead its loading from zero
>>>>>> offset and thats why we are breaking the correctness.
>>>>>>
>>>>>
>>>>> If a (reg:OO R ) 16 is loaded when it is spilled then loaded value
>>>>> will be from plus offset 16 instead it should be loaded value 
>>>>> from zero offset. As in load fusion pass we are replacing
>>>>> (reg:V2DI R) with subreg (reg:OO R) 16 and hence loaded value
>>>>> is from plus 16 offsets instead it should load from zero offset.
>>>>> Thats why its breaking the correctness.
>>>>>  
>>>>> Similarly we are replacing (reg:V2DI R) 16 with subreg (reg:OO R) 0
>>>>> and loaded value is from 16 offset instead its loading from zero
>>>>> offset and thats why we are breaking the correctness.
>>>>
>>>> I don't understand, sorry.  (subreg:V2DI (reg:OO R) 0) is always
>>>>
>>>> (a) the first hard register in (reg:OO R), when the whole of R
>>>>     is stored in hard registers
>>>> (b) at address offset 0 from the start of (reg:OO R), when R is
>>>>     spilled to memory
>>>>
>>>> Similarly, (subreg:V2DI (reg:OO R) 16) is always
>>>>
>>>> (c) the second hard register in (reg:OO R), when the whole of R
>>>>     is stored in hard registers
>>>> (d) at address offset 16 from the start of (reg:OO R), when R is
>>>>     spilled to memory
>>>>
>>>
>>> Yes but we are replacing use of loaded value from plus 16 offset
>>> with subreg (reg OO ) 0 and similarly we are replacing use of loaded
>>> value from 0 offset with subreg (reg OO ) 16 as we are swapping
>>> the use operand.
>>>
>>> When it is spilled its vice versa subreg (reg OO ) 16 should be
>>> loaded from 0 offset and subreg (reg OO) 0 should be loaded
>>> from 16 offset as we are swapping the use operand.
>>>
>>> This is the semantics of lxvp.
>> 
>> Hmm, OK.  Does that mean that:
>> 
>>    lxvp A,B
>> 
>> loads A+1 from B and A from B+16?  (I couldn't find an online
>> description of the instruction btw -- is there one?)
>> 
>
> Yes thats correct. Even I didn't find online document that
> describes the same.


Thanks, I think I get it now.

>> If lxvp does behave like that, it shouldn't be described as a normal move:
>> 
>>   [(set (match_operand:OO 0 "nonimmediate_operand" "=wa,ZwO,wa")
>>      (match_operand:OO 1 "input_operand" "ZwO,wa,wa"))]
>> 
>> since the rules I listed above are target-independent.
>>
>
> Rules for load for lxvp should as normal, but how the use
> of lxvp defines should be different for lxvp than normal
> load.

I don't think that's true though.  Because of the rules above,
GCC expects that:

  (set (reg:OO R) (mem:OO ADDR))

will load ADDR+0 into R and ADDR+16 into R+1.  Anything else will
cause exactly the kind of mix-up that you're seeing here.

In the worst case, the load should be described as:

  (set (reg:OO R)
       (unspec:OO [(mem:OO ADDR)] UNSPEC_LXVP))

although there are other alternatives that could be used if OOmode
weren't defined as an OPAQUE_MODE.

>> Alternatively, if lxvp is described as a normal move,
>> with the 128-bit pieces swapped compared to GCC's normal order,
>> TARGET_CAN_CHANGE_MODE_CLASS must prevent subregs that select
>> one half of the register (or less).  But that would defeat exactly
>> the optimisation that you're trying to do.
>> 
>
> Sorry, I didn't get how TARGET_CAN_CHANGE_MODE_CLASS prevents
> subreg that selects one half of the registers. 

It doesn't prevent the subregs at such, but it stops the mode
change from happening in certain registers.

For example, if we have (subreg:V2DI (reg:OO R) 0), and if
TARGET_CAN_CHANGE_MODE_CLASS disallows changes between V2DI
and OO in all register classes that can hold OO, then the register
allocator would handle the subreg by spilling register R to memory
and loading a V2DI piece of it from there.

This would allow the port to store OOmode in an unusual order.
But like I say, it would also defeat exactly what you're trying
to do here.

Thanks,
Richard

Re: [patch, rs6000, middle-end 0/1] v1: Add implementation for different targets for pair mem fusion

Reply via email to