Ajit Agarwal <aagar...@linux.ibm.com> writes: > Hello Richard: > > On 11/06/24 5:15 pm, Richard Sandiford wrote: >> Ajit Agarwal <aagar...@linux.ibm.com> writes: >>> Hello Richard: >>> On 11/06/24 4:56 pm, Ajit Agarwal wrote: >>>> Hello Richard: >>>> >>>> On 11/06/24 4:36 pm, Richard Sandiford wrote: >>>>> Ajit Agarwal <aagar...@linux.ibm.com> writes: >>>>>>>>>> After LRA reload: >>>>>>>>>> >>>>>>>>>> (insn 9299 2472 2412 187 (set (reg:V2DF 51 19 [orig:240 >>>>>>>>>> vect__302.545 ] [240]) >>>>>>>>>> (mem:V2DF (plus:DI (reg:DI 8 8 [orig:1285 ivtmp.886 ] [1285]) >>>>>>>>>> (const_int 16 [0x10])) [1 MEM <vector(2) >>>>>>>>>> real(kind=8)> [(real(kind=8) *)_4188]+16 S16 A64])) >>>>>>>>>> "shell_lam.fppized.f":238:72 1190 {vsx_movv2df_64bit} >>>>>>>>>> (nil)) >>>>>>>>>> (insn 2412 9299 2477 187 (set (reg:V2DF 51 19 [orig:240 >>>>>>>>>> vect__302.545 ] [240]) >>>>>>>>>> (neg:V2DF (fma:V2DF (reg:V2DF 39 7 [ MEM <vector(2) >>>>>>>>>> real(kind=8)> [(real(kind=8) *)_4050]+16 ]) >>>>>>>>>> (reg:V2DF 44 12 [3119]) >>>>>>>>>> (neg:V2DF (reg:V2DF 51 19 [orig:240 vect__302.545 ] >>>>>>>>>> [240]))))) {*vsx_nfmsv2df4} >>>>>>>>>> (nil)) >>>>>>>>>> >>>>>>>>>> (insn 2473 9311 9312 187 (set (reg:V2DF 38 6 [orig:905 vect__302.545 >>>>>>>>>> ] [905]) >>>>>>>>>> (neg:V2DF (fma:V2DF (reg:V2DF 44 12 [3119]) >>>>>>>>>> (reg:V2DF 38 6 [orig:2561 MEM <vector(2) >>>>>>>>>> real(kind=8)> [(real(kind=8) *)_4050] ] [2561]) >>>>>>>>>> (neg:V2DF (reg:V2DF 47 15 [5266]))))) >>>>>>>>>> {*vsx_nfmsv2df4} >>>>>>>>>> (nil)) >>>>>>>>>> >>>>>>>>>> In the above allocated code it assign registers 51 and 47 and they >>>>>>>>>> are not sequential. >>>>>>>>> >>>>>>>>> The reload for 2412 looks valid. What was the original pre-reload >>>>>>>>> version of insn 2473? Also, what happened to insn 2472? Was it >>>>>>>>> deleted? >>>>>>>>> >>>>>>>> >>>>>>>> This is preload version of 2473: >>>>>>>> >>>>>>>> (insn 2473 2396 2478 161 (set (reg:V2DF 905 [ vect__302.545 ]) >>>>>>>> (neg:V2DF (fma:V2DF (reg:V2DF 4283 [3119]) >>>>>>>> (subreg:V2DF (reg:OO 2561 [ MEM <vector(2) >>>>>>>> real(kind=8)> [(real(kind=8) *)_4050] ]) 0) >>>>>>>> (neg:V2DF (subreg:V2DF (reg:OO 2572 [ >>>>>>>> vect__300.543_236 ]) 0))))) {*vsx_nfmsv2df4} >>>>>>>> (expr_list:REG_DEAD (reg:OO 2572 [ vect__300.543_236 ]) >>>>>>>> (expr_list:REG_DEAD (reg:OO 2561 [ MEM <vector(2) >>>>>>>> real(kind=8)> [(real(kind=8) *)_4050] ]) >>>>>>>> (nil)))) >>>>>>>> >>>>>>>> insn 2472 is replaced with 9299 after reload. >>>>>>> >>>>>>> You'd have to check the dumps to be sure, but I think 9299 is instead >>>>>>> generated as an input reload of 2412, rather than being a replacement >>>>>>> of insn 2472. T >>>>>> >>>>>> Yes it is generated for 2412. The predecessor of 2412 is load from >>>>>> plus offset as in 2412 we have subreg:V2DF (reg OO 2572) 16). >>>>>> >>>>>> This is not correct as we are not generating lxvp and it is >>>>>> normal load lxv. >>>>>> As normal load is generated in predecessor insn of 2412 with >>>>>> plus constant offset it breaks the correctness. >>>>> >>>>> Not using lxvp is a deliberate choice though. >>>>> >>>>> If a (reg:OO R) has been spilled, there's no requirement for LRA >>>>> to load both halves of R when only one half is needed. LRA just >>>>> loads what it needs into whichever registers happen to be free. >>>>> >>>>> If the reload of R instead used lxvp, LRA would be forced to free >>>>> up another register for the other half of R, even though that value >>>>> would never be used. >>>>> >>>> >>>> If a (reg:OO R ) 16 is loaded when it is spilled then loaded value >>>> will be from plus offset 16 instead it should be loaded value >>>> from zero offset. As in load fusion pass we are replacing >>>> (reg:V2DI R) with subreg (reg:OO R) 16 and hence loaded value >>>> is from plus 16 offsets and thats why its breaking the correctness. >>>> >>>> Similarly we are replacing (reg:V2DI R) 16 with subreg (reg:OO R) 0 >>>> and loaded value is from 16 offset instead its loading from zero >>>> offset and thats why we are breaking the correctness. >>>> >>> >>> If a (reg:OO R ) 16 is loaded when it is spilled then loaded value >>> will be from plus offset 16 instead it should be loaded value >>> from zero offset. As in load fusion pass we are replacing >>> (reg:V2DI R) with subreg (reg:OO R) 16 and hence loaded value >>> is from plus 16 offsets instead it should load from zero offset. >>> Thats why its breaking the correctness. >>> >>> Similarly we are replacing (reg:V2DI R) 16 with subreg (reg:OO R) 0 >>> and loaded value is from 16 offset instead its loading from zero >>> offset and thats why we are breaking the correctness. >> >> I don't understand, sorry. (subreg:V2DI (reg:OO R) 0) is always >> >> (a) the first hard register in (reg:OO R), when the whole of R >> is stored in hard registers >> (b) at address offset 0 from the start of (reg:OO R), when R is >> spilled to memory >> >> Similarly, (subreg:V2DI (reg:OO R) 16) is always >> >> (c) the second hard register in (reg:OO R), when the whole of R >> is stored in hard registers >> (d) at address offset 16 from the start of (reg:OO R), when R is >> spilled to memory >> > > Yes but we are replacing use of loaded value from plus 16 offset > with subreg (reg OO ) 0 and similarly we are replacing use of loaded > value from 0 offset with subreg (reg OO ) 16 as we are swapping > the use operand. > > When it is spilled its vice versa subreg (reg OO ) 16 should be > loaded from 0 offset and subreg (reg OO) 0 should be loaded > from 16 offset as we are swapping the use operand. > > This is the semantics of lxvp.
Hmm, OK. Does that mean that: lxvp A,B loads A+1 from B and A from B+16? (I couldn't find an online description of the instruction btw -- is there one?) If lxvp does behave like that, it shouldn't be described as a normal move: [(set (match_operand:OO 0 "nonimmediate_operand" "=wa,ZwO,wa") (match_operand:OO 1 "input_operand" "ZwO,wa,wa"))] since the rules I listed above are target-independent. Alternatively, if lxvp is described as a normal move, with the 128-bit pieces swapped compared to GCC's normal order, TARGET_CAN_CHANGE_MODE_CLASS must prevent subregs that select one half of the register (or less). But that would defeat exactly the optimisation that you're trying to do. Thanks, Richard