Hello Richard: On 10/06/24 2:52 pm, Richard Sandiford wrote: > Ajit Agarwal <aagar...@linux.ibm.com> writes: >> On 10/06/24 2:12 pm, Richard Sandiford wrote: >>> Ajit Agarwal <aagar...@linux.ibm.com> writes: >>>>>>>>>>>> + >>>>>>>>>>>> + rtx set = single_set (insn); >>>>>>>>>>>> + if (set == NULL_RTX) >>>>>>>>>>>> + return false; >>>>>>>>>>>> + >>>>>>>>>>>> + rtx op0 = SET_SRC (set); >>>>>>>>>>>> + rtx_code code = GET_CODE (op0); >>>>>>>>>>>> + >>>>>>>>>>>> + // This check is added as register pairs are not generated >>>>>>>>>>>> + // by RA for neg:V2DF (fma: V2DF (reg1) >>>>>>>>>>>> + // (reg2) >>>>>>>>>>>> + // (neg:V2DF (reg3))) >>>>>>>>>>>> + if (GET_RTX_CLASS (code) == RTX_UNARY) >>>>>>>>>>>> + return false; >>>>>>>>>>> >>>>>>>>>>> What's special about (neg (fma ...))? >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> I am not sure why register allocator fails allocating register pairs >>>>>>>>>> with >>>>>>>>>> NEG Unary operation with fma operand. I have not debugged register >>>>>>>>>> allocator why the NEG >>>>>>>>>> Unary operation with fma operand. >>>>>>>>> >>>>>>>> >>>>>>>> For neg (fma ...) cases because of subreg 128 bits from OOmode 256 >>>>>>>> bits are >>>>>>>> set correctly. >>>>>>>> IRA marked them spill candidates as spill priority is zero. >>>>>>>> >>>>>>>> Due to this LRA reload pass couldn't allocate register pairs. >>>>>>> >>>>>>> I think this is just restating the symptom though. I suppose the same >>>>>>> kind of questions apply here too: what was the instruction before the >>>>>>> pass runs, what was the instruction after the pass runs, and why is >>>>>>> the rtl change incorrect (by the meaning above)? >>>>>>> >>>>>> >>>>>> Original case where we dont do load fusion, spill happens, in that >>>>>> case we dont require sequential register pairs to be generated for 2 >>>>>> loads >>>>>> for. Hence it worked. >>>>>> >>>>>> rtl change is correct and there is no error. >>>>>> >>>>>> for load fusion spill happens and we dont generate sequential register >>>>>> pairs >>>>>> because pf spill candidate and lxvp gives incorrect results as >>>>>> sequential register >>>>>> pairs are required for lxvp. >>>>> >>>>> Can you go into more detail? How is the lxvp represented? And how do >>>>> we end up not getting a sequential register pair? What does the rtl >>>>> look like (before and after things have gone wrong)? >>>>> >>>>> It seems like either the rtl is not describing the result of the fusion >>>>> correctly or there is some problem in the .md description of lxvp. >>>>> >>>> >>>> After fusion pass: >>>> >>>> (insn 9299 2472 2412 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] >>>> [240]) >>>> (mem:V2DF (plus:DI (reg:DI 8 8 [orig:1285 ivtmp.886 ] [1285]) >>>> (const_int 16 [0x10])) [1 MEM <vector(2) real(kind=8)> >>>> [(real(kind=8) *)_4188]+16 S16 A64])) "shell_lam.fppized.f":238:72 1190 >>>> {vsx_movv2df_64bit} >>>> (nil)) >>>> (insn 2412 9299 2477 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] >>>> [240]) >>>> (neg:V2DF (fma:V2DF (reg:V2DF 39 7 [ MEM <vector(2) real(kind=8)> >>>> [(real(kind=8) *)_4050]+16 ]) >>>> (reg:V2DF 44 12 [3119]) >>>> (neg:V2DF (reg:V2DF 51 19 [orig:240 vect__302.545 ] >>>> [240]))))) {*vsx_nfmsv2df4} >>>> (nil)) >>>> >>>> In LRA reload. >>>> >>>> (insn 2472 2461 2412 161 (set (reg:OO 2572 [ vect__300.543_236 ]) >>>> (mem:OO (reg:DI 4260 [orig:1285 ivtmp.886 ] [1285]) [1 MEM >>>> <vector(2) real(kind=8)> [(real(kind=8) *)_4188]+0 S16 A64])) >>>> "shell_lam.fppized.f":238:72 2187 {*movoo} >>>> (expr_list:REG_EQUIV (mem:OO (reg:DI 4260 [orig:1285 ivtmp.886 ] >>>> [1285]) [1 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4188]+0 S16 A64]) >>>> (nil))) >>>> (insn 2412 2472 2477 161 (set (reg:V2DF 240 [ vect__302.545 ]) >>>> (neg:V2DF (fma:V2DF (subreg:V2DF (reg:OO 2561 [ MEM <vector(2) >>>> real(kind=8)> [(real(kind=8) *)_4050] ]) 16) >>>> (reg:V2DF 4283 [3119]) >>>> (neg:V2DF (subreg:V2DF (reg:OO 2572 [ vect__300.543_236 ]) >>>> 16))))) {*vsx_nfmsv2df4} >>>> (nil)) >>>> >>>> >>>> In LRA reload sequential registers are not generated as r2572 is splled >>>> and move to spill location >>>> in stack and subsequent uses loads from stack. Hence sequential registers >>>> pairs are not generated. >>>> >>>> lxvp vsx0, 0(r1). >>>> >>>> It loads from from r1+0 into vsx0 and vsx1 and appropriate uses use >>>> sequential register pairs. >>>> >>>> Without load fusion since 2 loads exists and 2 loads need not require >>>> sequential registers >>>> hence it worked but with load fusion and using lxvp it requires sequential >>>> register pairs. >>> >>> Do you mean that this is a performance regression? I.e. the fact that >>> lxvp requires sequential registers causes extra spilling, due to having >>> less allocation freedom? >>> >>> Or is it a correctness problem? If so, what is it? Nothing in the rtl >>> above looks wrong in principle (although I've no idea if the REG_EQUIV >>> is correct in this context). What does the allocated code look like, >>> and why is it wrong? >>> >>> If (reg:OO 2561) is spilled and then one half of it used, only that half >>> needs to be loaded from the spill slot. E.g. if (reg:OO 2561) is reloaded >>> for insn 2412 on its own, only the second half of the register needs to be >>> loaded from memory. >>> >> >> This is bwaves spec 2017 benchmark. Spill happening in register allocator >> could be because of less registers available in order to generate >> sequential registers for lxvp. >> >> Because of spill sequential registers are not generated and breaks the >> correctness. REG_EQUIV is generated by IRA. >> >> Allocated code because of spill doesn't generate sequential registers. >> In LRA reload because of spill marked for IRA adjust to another >> register causing not to generate sequential registers. >> >> reg:OO 2572 is spilled not reg:OO 2561. Because of spilled its >> loaded from memory instead of generating sequential registers. >> >> Other reasons for spilling because of long live range for reg:OO 2572. >> >> We can add heuristics in fusion code not to fuse for longer live >> ranges that should solve the problem. > > This doesn't describe the real problem though. It's natural for > registers to be spilled sometimes. That would happen for OOmode > registers even if we didn't form them in the fusion pass. And it's > normal GCC semantics that the two hard registers allocated to an OOmode > pseudo are consecutive. > > Like I asked above, please show the allocated code (including spills > and reloads) and explain why it's wrong. >
After LRA reload: (insn 9299 2472 2412 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240]) (mem:V2DF (plus:DI (reg:DI 8 8 [orig:1285 ivtmp.886 ] [1285]) (const_int 16 [0x10])) [1 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4188]+16 S16 A64])) "shell_lam.fppized.f":238:72 1190 {vsx_movv2df_64bit} (nil)) (insn 2412 9299 2477 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240]) (neg:V2DF (fma:V2DF (reg:V2DF 39 7 [ MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050]+16 ]) (reg:V2DF 44 12 [3119]) (neg:V2DF (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240]))))) {*vsx_nfmsv2df4} (nil)) (insn 2473 9311 9312 187 (set (reg:V2DF 38 6 [orig:905 vect__302.545 ] [905]) (neg:V2DF (fma:V2DF (reg:V2DF 44 12 [3119]) (reg:V2DF 38 6 [orig:2561 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050] ] [2561]) (neg:V2DF (reg:V2DF 47 15 [5266]))))) {*vsx_nfmsv2df4} (nil)) In the above allocated code it assign registers 51 and 47 and they are not sequential. > Richard Thanks & Regards Ajit