On Wed, Nov 28, 2018 at 12:17 PM Jeff Law <l...@redhat.com> wrote: > > On 11/28/18 12:48 PM, H.J. Lu wrote: > > On Mon, Nov 5, 2018 at 7:29 AM Jan Hubicka <hubi...@ucw.cz> wrote: > >> > >>> On 11/5/18 7:21 AM, Jan Hubicka wrote: > >>>>> > >>>>> Did you mean "the nearest common dominator"? > >>>> > >>>> If the nearest common dominator appears in the loop while all uses are > >>>> out of loops, this will result in suboptimal xor placement. > >>>> In this case you want to split edges out of the loop. > >>>> > >>>> In general this is what the LCM framework will do for you if the problem > >>>> is modelled siimlar way as in mode_swtiching. At entry function mode is > >>>> "no zero register needed" and all conversions need mode "zero register > >>>> needed". Mode switching should then do the correct placement decisions > >>>> (reaching minimal number of executions of xor). > >>>> > >>>> Jeff, whan is your optinion on the approach taken by the patch? > >>>> It seems like a special case of more general issue, but I do not see > >>>> very elegant way to solve it at least in the GCC 9 horisont, so if > >>>> the placement is correct we can probalby go either with new pass or > >>>> making this part of mode swithcing (which is anyway run by x86 backend) > >>> So I haven't followed this discussion at all, but did touch on this > >>> issue with some patch a month or two ago with a target patch that was > >>> trying to avoid the partial stalls. > >>> > >>> My assumption is that we're trying to find one or more places to > >>> initialize the upper half of an avx register so as to avoid partial > >>> register stall at existing sites that set the upper half. > >>> > >>> This sounds like a classic PRE/LCM style problem (of which mode > >>> switching is just another variant). A common-dominator approach is > >>> closer to a classic GCSE and is going to result is more initializations > >>> at sub-optimal points than a PRE/LCM style. > >> > >> yes, it is usual code placement problem. It is special case because the > >> zero register is not modified by the conversion (just we need to have > >> zero somewhere). So basically we do not have kills to the zero except > >> for entry block. > >> > > > > Do you have testcase to show thatf the nearest common dominator > > in the loop, while all uses areout of loops, leads to suboptimal xor > > placement? > I don't have a testcase, but it's all but certain nearest common > dominator is going to be a suboptimal placement. That's going to create > paths where you're going to emit the xor when it's not used. > > The whole point of the LCM algorithms is they are optimal in terms of > expression evaluations.
We tried LCM and it didn't work well for this case. LCM places a single VXOR close to the location where it is needed, which can be inside a loop. There is nothing wrong with the LCM algorithms. But this doesn't solve https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87007 where VXOR is executed multiple times inside of a function, instead of just once. We are investigating to generate a single VXOR at entry of the nearest dominator for basic blocks with SF/DF conversions, which is in the the fake loop that contains the whole function: bb = nearest_common_dominator_for_set (CDI_DOMINATORS, convert_bbs); while (bb->loop_father->latch != EXIT_BLOCK_PTR_FOR_FN (cfun)) bb = get_immediate_dominator (CDI_DOMINATORS, bb->loop_father->header); insn = BB_HEAD (bb); if (!NONDEBUG_INSN_P (insn)) insn = next_nonnote_nondebug_insn (insn); set = gen_rtx_SET (v4sf_const0, CONST0_RTX (V4SFmode)); set_insn = emit_insn_before (set, insn); -- H.J.