Re: V2 [PATCH] i386: Add pass_remove_partial_avx_dependency

H.J. Lu Sun, 30 Dec 2018 08:41:51 -0800

On Wed, Nov 28, 2018 at 12:17 PM Jeff Law <l...@redhat.com> wrote:
>
> On 11/28/18 12:48 PM, H.J. Lu wrote:
> > On Mon, Nov 5, 2018 at 7:29 AM Jan Hubicka <hubi...@ucw.cz> wrote:
> >>
> >>> On 11/5/18 7:21 AM, Jan Hubicka wrote:
> >>>>>
> >>>>> Did you mean "the nearest common dominator"?
> >>>>
> >>>> If the nearest common dominator appears in the loop while all uses are
> >>>> out of loops, this will result in suboptimal xor placement.
> >>>> In this case you want to split edges out of the loop.
> >>>>
> >>>> In general this is what the LCM framework will do for you if the problem
> >>>> is modelled siimlar way as in mode_swtiching.  At entry function mode is
> >>>> "no zero register needed" and all conversions need mode "zero register
> >>>> needed".  Mode switching should then do the correct placement decisions
> >>>> (reaching minimal number of executions of xor).
> >>>>
> >>>> Jeff, whan is your optinion on the approach taken by the patch?
> >>>> It seems like a special case of more general issue, but I do not see
> >>>> very elegant way to solve it at least in the GCC 9 horisont, so if
> >>>> the placement is correct we can probalby go either with new pass or
> >>>> making this part of mode swithcing (which is anyway run by x86 backend)
> >>> So I haven't followed this discussion at all, but did touch on this
> >>> issue with some patch a month or two ago with a target patch that was
> >>> trying to avoid the partial stalls.
> >>>
> >>> My assumption is that we're trying to find one or more places to
> >>> initialize the upper half of an avx register so as to avoid partial
> >>> register stall at existing sites that set the upper half.
> >>>
> >>> This sounds like a classic PRE/LCM style problem (of which mode
> >>> switching is just another variant).   A common-dominator approach is
> >>> closer to a classic GCSE and is going to result is more initializations
> >>> at sub-optimal points than a PRE/LCM style.
> >>
> >> yes, it is usual code placement problem. It is special case because the
> >> zero register is not modified by the conversion (just we need to have
> >> zero somewhere).  So basically we do not have kills to the zero except
> >> for entry block.
> >>
> >
> > Do you have  testcase to show thatf the nearest common dominator
> > in the loop, while all uses areout of loops, leads to suboptimal xor
> > placement?
> I don't have a testcase, but it's all but certain nearest common
> dominator is going to be a suboptimal placement.  That's going to create
> paths where you're going to emit the xor when it's not used.
>
> The whole point of the LCM algorithms is they are optimal in terms of
> expression evaluations.


We tried LCM and it didn't work well for this case.  LCM places a single
VXOR close to the location where it is needed, which can be inside a
loop.  There is nothing wrong with the LCM algorithms.   But this doesn't
solve

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87007

where VXOR is executed multiple times inside of a function, instead of
just once.   We are investigating to generate a single VXOR at entry of the
nearest dominator for basic blocks with SF/DF conversions, which is in
the the fake loop that contains the whole function:

      bb = nearest_common_dominator_for_set (CDI_DOMINATORS,
                                             convert_bbs);
      while (bb->loop_father->latch
             != EXIT_BLOCK_PTR_FOR_FN (cfun))
        bb = get_immediate_dominator (CDI_DOMINATORS,
                                      bb->loop_father->header);

      insn = BB_HEAD (bb);
      if (!NONDEBUG_INSN_P (insn))
        insn = next_nonnote_nondebug_insn (insn);
      set = gen_rtx_SET (v4sf_const0, CONST0_RTX (V4SFmode));
      set_insn = emit_insn_before (set, insn);


-- 
H.J.

Re: V2 [PATCH] i386: Add pass_remove_partial_avx_dependency

Reply via email to