Ajit Agarwal <aagar...@linux.ibm.com> writes:
> Hello Richard:
>
> On 10/06/24 2:52 pm, Richard Sandiford wrote:
>> Ajit Agarwal <aagar...@linux.ibm.com> writes:
>>> On 10/06/24 2:12 pm, Richard Sandiford wrote:
>>>> Ajit Agarwal <aagar...@linux.ibm.com> writes:
>>>>>>>>>>>>> +
>>>>>>>>>>>>> +  rtx set = single_set (insn);
>>>>>>>>>>>>> +  if (set == NULL_RTX)
>>>>>>>>>>>>> +    return false;
>>>>>>>>>>>>> +
>>>>>>>>>>>>> +  rtx op0 = SET_SRC (set);
>>>>>>>>>>>>> +  rtx_code code = GET_CODE (op0);
>>>>>>>>>>>>> +
>>>>>>>>>>>>> +  // This check is added as register pairs are not generated
>>>>>>>>>>>>> +  // by RA for neg:V2DF (fma: V2DF (reg1)
>>>>>>>>>>>>> +  //                              (reg2)
>>>>>>>>>>>>> +  //                              (neg:V2DF (reg3)))
>>>>>>>>>>>>> +  if (GET_RTX_CLASS (code) == RTX_UNARY)
>>>>>>>>>>>>> +    return false;
>>>>>>>>>>>>
>>>>>>>>>>>> What's special about (neg (fma ...))?
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I am not sure why register allocator fails allocating register
>>>>>>>>>>> pairs with
>>>>>>>>>>> NEG Unary operation with fma operand. I have not debugged register
>>>>>>>>>>> allocator why the NEG
>>>>>>>>>>> Unary operation with fma operand.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> For neg (fma ...) cases because of subreg 128 bits from OOmode 256
>>>>>>>>> bits are
>>>>>>>>> set correctly.
>>>>>>>>> IRA marked them spill candidates as spill priority is zero.
>>>>>>>>>
>>>>>>>>> Due to this LRA reload pass couldn't allocate register pairs.
>>>>>>>>
>>>>>>>> I think this is just restating the symptom though. I suppose the same
>>>>>>>> kind of questions apply here too: what was the instruction before the
>>>>>>>> pass runs, what was the instruction after the pass runs, and why is
>>>>>>>> the rtl change incorrect (by the meaning above)?
>>>>>>>>
>>>>>>>
>>>>>>> Original case where we dont do load fusion, spill happens, in that
>>>>>>> case we dont require sequential register pairs to be generated for 2
>>>>>>> loads
>>>>>>> for. Hence it worked.
>>>>>>>
>>>>>>> rtl change is correct and there is no error.
>>>>>>>
>>>>>>> for load fusion spill happens and we dont generate sequential register
>>>>>>> pairs
>>>>>>> because pf spill candidate and lxvp gives incorrect results as
>>>>>>> sequential register
>>>>>>> pairs are required for lxvp.
>>>>>>
>>>>>> Can you go into more detail? How is the lxvp represented? And how do
>>>>>> we end up not getting a sequential register pair? What does the rtl
>>>>>> look like (before and after things have gone wrong)?
>>>>>>
>>>>>> It seems like either the rtl is not describing the result of the fusion
>>>>>> correctly or there is some problem in the .md description of lxvp.
>>>>>>
>>>>>
>>>>> After fusion pass:
>>>>>
>>>>> (insn 9299 2472 2412 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ]
>>>>> [240])
>>>>> (mem:V2DF (plus:DI (reg:DI 8 8 [orig:1285 ivtmp.886 ] [1285])
>>>>> (const_int 16 [0x10])) [1 MEM <vector(2) real(kind=8)>
>>>>> [(real(kind=8) *)_4188]+16 S16 A64])) "shell_lam.fppized.f":238:72 1190
>>>>> {vsx_movv2df_64bit}
>>>>> (nil))
>>>>> (insn 2412 9299 2477 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ]
>>>>> [240])
>>>>> (neg:V2DF (fma:V2DF (reg:V2DF 39 7 [ MEM <vector(2) real(kind=8)>
>>>>> [(real(kind=8) *)_4050]+16 ])
>>>>> (reg:V2DF 44 12 [3119])
>>>>> (neg:V2DF (reg:V2DF 51 19 [orig:240 vect__302.545 ]
>>>>> [240]))))) {*vsx_nfmsv2df4}
>>>>> (nil))
>>>>>
>>>>> In LRA reload.
>>>>>
>>>>> (insn 2472 2461 2412 161 (set (reg:OO 2572 [ vect__300.543_236 ])
>>>>> (mem:OO (reg:DI 4260 [orig:1285 ivtmp.886 ] [1285]) [1 MEM
>>>>> <vector(2) real(kind=8)> [(real(kind=8) *)_4188]+0 S16 A64]))
>>>>> "shell_lam.fppized.f":238:72 2187 {*movoo}
>>>>> (expr_list:REG_EQUIV (mem:OO (reg:DI 4260 [orig:1285 ivtmp.886 ]
>>>>> [1285]) [1 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4188]+0 S16
>>>>> A64])
>>>>> (nil)))
>>>>> (insn 2412 2472 2477 161 (set (reg:V2DF 240 [ vect__302.545 ])
>>>>> (neg:V2DF (fma:V2DF (subreg:V2DF (reg:OO 2561 [ MEM <vector(2)
>>>>> real(kind=8)> [(real(kind=8) *)_4050] ]) 16)
>>>>> (reg:V2DF 4283 [3119])
>>>>> (neg:V2DF (subreg:V2DF (reg:OO 2572 [ vect__300.543_236
>>>>> ]) 16))))) {*vsx_nfmsv2df4}
>>>>> (nil))
>>>>>
>>>>>
>>>>> In LRA reload sequential registers are not generated as r2572 is splled
>>>>> and move to spill location
>>>>> in stack and subsequent uses loads from stack. Hence sequential registers
>>>>> pairs are not generated.
>>>>>
>>>>> lxvp vsx0, 0(r1).
>>>>>
>>>>> It loads from from r1+0 into vsx0 and vsx1 and appropriate uses use
>>>>> sequential register pairs.
>>>>>
>>>>> Without load fusion since 2 loads exists and 2 loads need not require
>>>>> sequential registers
>>>>> hence it worked but with load fusion and using lxvp it requires
>>>>> sequential register pairs.
>>>>
>>>> Do you mean that this is a performance regression? I.e. the fact that
>>>> lxvp requires sequential registers causes extra spilling, due to having
>>>> less allocation freedom?
>>>>
>>>> Or is it a correctness problem? If so, what is it? Nothing in the rtl
>>>> above looks wrong in principle (although I've no idea if the REG_EQUIV
>>>> is correct in this context). What does the allocated code look like,
>>>> and why is it wrong?
>>>>
>>>> If (reg:OO 2561) is spilled and then one half of it used, only that half
>>>> needs to be loaded from the spill slot. E.g. if (reg:OO 2561) is reloaded
>>>> for insn 2412 on its own, only the second half of the register needs to be
>>>> loaded from memory.
>>>>
>>>
>>> This is bwaves spec 2017 benchmark. Spill happening in register allocator
>>> could be because of less registers available in order to generate
>>> sequential registers for lxvp.
>>>
>>> Because of spill sequential registers are not generated and breaks the
>>> correctness. REG_EQUIV is generated by IRA.
>>>
>>> Allocated code because of spill doesn't generate sequential registers.
>>> In LRA reload because of spill marked for IRA adjust to another
>>> register causing not to generate sequential registers.
>>>
>>> reg:OO 2572 is spilled not reg:OO 2561. Because of spilled its
>>> loaded from memory instead of generating sequential registers.
>>>
>>> Other reasons for spilling because of long live range for reg:OO 2572.
>>>
>>> We can add heuristics in fusion code not to fuse for longer live
>>> ranges that should solve the problem.
>>
>> This doesn't describe the real problem though. It's natural for
>> registers to be spilled sometimes. That would happen for OOmode
>> registers even if we didn't form them in the fusion pass. And it's
>> normal GCC semantics that the two hard registers allocated to an OOmode
>> pseudo are consecutive.
>>
>> Like I asked above, please show the allocated code (including spills
>> and reloads) and explain why it's wrong.
>>
>
> After LRA reload:
>
> (insn 9299 2472 2412 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240])
> (mem:V2DF (plus:DI (reg:DI 8 8 [orig:1285 ivtmp.886 ] [1285])
> (const_int 16 [0x10])) [1 MEM <vector(2) real(kind=8)>
> [(real(kind=8) *)_4188]+16 S16 A64])) "shell_lam.fppized.f":238:72 1190
> {vsx_movv2df_64bit}
> (nil))
> (insn 2412 9299 2477 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240])
> (neg:V2DF (fma:V2DF (reg:V2DF 39 7 [ MEM <vector(2) real(kind=8)>
> [(real(kind=8) *)_4050]+16 ])
> (reg:V2DF 44 12 [3119])
> (neg:V2DF (reg:V2DF 51 19 [orig:240 vect__302.545 ]
> [240]))))) {*vsx_nfmsv2df4}
> (nil))
>
> (insn 2473 9311 9312 187 (set (reg:V2DF 38 6 [orig:905 vect__302.545 ] [905])
> (neg:V2DF (fma:V2DF (reg:V2DF 44 12 [3119])
> (reg:V2DF 38 6 [orig:2561 MEM <vector(2) real(kind=8)>
> [(real(kind=8) *)_4050] ] [2561])
> (neg:V2DF (reg:V2DF 47 15 [5266]))))) {*vsx_nfmsv2df4}
> (nil))
>
> In the above allocated code it assign registers 51 and 47 and they are not
> sequential.

The reload for 2412 looks valid. What was the original pre-reload
version of insn 2473? Also, what happened to insn 2472? Was it deleted?

Richard