Re: V3 [PATCH] i386: Add pass_remove_partial_avx_dependency

Richard Biener Tue, 22 Jan 2019 04:09:50 -0800

On Mon, Jan 21, 2019 at 10:27 PM H.J. Lu <hjl.to...@gmail.com> wrote:
>
> On Mon, Jan 21, 2019 at 10:54 AM Jeff Law <l...@redhat.com> wrote:
> >
> > On 1/7/19 6:55 AM, H.J. Lu wrote:
> > > On Sun, Dec 30, 2018 at 8:40 AM H.J. Lu <hjl.to...@gmail.com> wrote:
> > >> On Wed, Nov 28, 2018 at 12:17 PM Jeff Law <l...@redhat.com> wrote:
> > >>> On 11/28/18 12:48 PM, H.J. Lu wrote:
> > >>>> On Mon, Nov 5, 2018 at 7:29 AM Jan Hubicka <hubi...@ucw.cz> wrote:
> > >>>>>> On 11/5/18 7:21 AM, Jan Hubicka wrote:
> > >>>>>>>> Did you mean "the nearest common dominator"?
> > >>>>>>> If the nearest common dominator appears in the loop while all uses 
> > >>>>>>> are
> > >>>>>>> out of loops, this will result in suboptimal xor placement.
> > >>>>>>> In this case you want to split edges out of the loop.
> > >>>>>>>
> > >>>>>>> In general this is what the LCM framework will do for you if the 
> > >>>>>>> problem
> > >>>>>>> is modelled siimlar way as in mode_swtiching.  At entry function 
> > >>>>>>> mode is
> > >>>>>>> "no zero register needed" and all conversions need mode "zero 
> > >>>>>>> register
> > >>>>>>> needed".  Mode switching should then do the correct placement 
> > >>>>>>> decisions
> > >>>>>>> (reaching minimal number of executions of xor).
> > >>>>>>>
> > >>>>>>> Jeff, whan is your optinion on the approach taken by the patch?
> > >>>>>>> It seems like a special case of more general issue, but I do not see
> > >>>>>>> very elegant way to solve it at least in the GCC 9 horisont, so if
> > >>>>>>> the placement is correct we can probalby go either with new pass or
> > >>>>>>> making this part of mode swithcing (which is anyway run by x86 
> > >>>>>>> backend)
> > >>>>>> So I haven't followed this discussion at all, but did touch on this
> > >>>>>> issue with some patch a month or two ago with a target patch that was
> > >>>>>> trying to avoid the partial stalls.
> > >>>>>>
> > >>>>>> My assumption is that we're trying to find one or more places to
> > >>>>>> initialize the upper half of an avx register so as to avoid partial
> > >>>>>> register stall at existing sites that set the upper half.
> > >>>>>>
> > >>>>>> This sounds like a classic PRE/LCM style problem (of which mode
> > >>>>>> switching is just another variant).   A common-dominator approach is
> > >>>>>> closer to a classic GCSE and is going to result is more 
> > >>>>>> initializations
> > >>>>>> at sub-optimal points than a PRE/LCM style.
> > >>>>> yes, it is usual code placement problem. It is special case because 
> > >>>>> the
> > >>>>> zero register is not modified by the conversion (just we need to have
> > >>>>> zero somewhere).  So basically we do not have kills to the zero except
> > >>>>> for entry block.
> > >>>>>
> > >>>> Do you have  testcase to show thatf the nearest common dominator
> > >>>> in the loop, while all uses areout of loops, leads to suboptimal xor
> > >>>> placement?
> > >>> I don't have a testcase, but it's all but certain nearest common
> > >>> dominator is going to be a suboptimal placement.  That's going to create
> > >>> paths where you're going to emit the xor when it's not used.
> > >>>
> > >>> The whole point of the LCM algorithms is they are optimal in terms of
> > >>> expression evaluations.
> > >> We tried LCM and it didn't work well for this case.  LCM places a single
> > >> VXOR close to the location where it is needed, which can be inside a
> > >> loop.  There is nothing wrong with the LCM algorithms.   But this doesn't
> > >> solve
> > >>
> > >> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87007
> > >>
> > >> where VXOR is executed multiple times inside of a function, instead of
> > >> just once.   We are investigating to generate a single VXOR at entry of 
> > >> the
> > >> nearest dominator for basic blocks with SF/DF conversions, which is in
> > >> the the fake loop that contains the whole function:
> > >>
> > >>       bb = nearest_common_dominator_for_set (CDI_DOMINATORS,
> > >>                                              convert_bbs);
> > >>       while (bb->loop_father->latch
> > >>              != EXIT_BLOCK_PTR_FOR_FN (cfun))
> > >>         bb = get_immediate_dominator (CDI_DOMINATORS,
> > >>                                       bb->loop_father->header);
> > >>
> > >>       insn = BB_HEAD (bb);
> > >>       if (!NONDEBUG_INSN_P (insn))
> > >>         insn = next_nonnote_nondebug_insn (insn);
> > >>       set = gen_rtx_SET (v4sf_const0, CONST0_RTX (V4SFmode));
> > >>       set_insn = emit_insn_before (set, insn);
> > >>
> > > Here is the updated patch.  OK for trunk?
> > >
> > > Thanks.
> > >
> > > -- H.J.
> > >
> > >
> > > 0001-i386-Add-pass_remove_partial_avx_dependency.patch
> > >
> > > From 6eca7dbf282d7e2a5cde41bffeca66195d72d48e Mon Sep 17 00:00:00 2001
> > > From: "H.J. Lu" <hjl.to...@gmail.com>
> > > Date: Mon, 7 Jan 2019 05:44:59 -0800
> > > Subject: [PATCH] i386: Add pass_remove_partial_avx_dependency
> > >
> > > With -mavx, for
> > >
> > > $ cat foo.i
> > > extern float f;
> > > extern double d;
> > > extern int i;
> > >
> > > void
> > > foo (void)
> > > {
> > >   d = f;
> > >   f = i;
> > > }
> > >
> > > we need to generate
> > >
> > >       vxorp[ds]       %xmmN, %xmmN, %xmmN
> > >       ...
> > >       vcvtss2sd       f(%rip), %xmmN, %xmmX
> > >       ...
> > >       vcvtsi2ss       i(%rip), %xmmN, %xmmY
> > >
> > > to avoid partial XMM register stall.  This patch adds a pass to generate
> > > a single
> > >
> > >       vxorps          %xmmN, %xmmN, %xmmN
> > >
> > > at entry of the nearest dominator for basic blocks with SF/DF conversions,
> > > which is in the fake loop that contains the whole function, instead of
> > > generating one
> > >
> > >       vxorp[ds]       %xmmN, %xmmN, %xmmN
> > >
> > > for each SF/DF conversion.
> > >
> > > NB: The LCM algorithm isn't appropriate here since it may place a vxorps
> > > inside the loop.  Simple testcase show this:
> > >
> > > $ cat badcase.c
> > >
> > > extern float f;
> > > extern double d;
> > >
> > > void
> > > foo (int n, int k)
> > > {
> > >   for (int j = 0; j != n; j++)
> > >     if (j < k)
> > >       d = f;
> > > }
> > >
> > > It generates
> > >
> > >     ...
> > >     loop:
> > >       if(j < k)
> > >         vxorps  %xmm0, %xmm0, %xmm0
> > >         vcvtss2sd  %xmm1, %xmm0, %xmm0
> > >       ...
> > >     loopend
> > >     ...
> > >
> > > This is because LCM only works when there is a certain benifit.  But for
> > > conditional branch, LCM wouldn't move
> > >
> > >    vxorps  %xmm0, %xmm0, %xmm0
> > It works this way for a reason.  There are obviously paths through the
> > loop where the conversion does not happen and thus the vxor is not
> > needed or desirable on those paths.
> >
> > That's a fundamental property of the LCM algorithm -- it never inserts
> > an evaluation on a path through the CFG where it will not be used.
> >
> > Your algorithm of inserting into the dominator block will introduce
> > runtime executions of the vxor on paths where it is not needed.
> >
> > It's well known that relaxing that property of LCM can result in better
> > code generation in some circumstances.  Block copying and loop
> > restructuring are the gold standard for dealing with this kind of problem.
> >
> > In this case you could split the iteration space so that you have two
> > loops.  one for 0..k and the other for k..n.  Note that GCC has support
> > for this kind of loop restructuring.  This has the advantage that the j
> > < k test does not happen each iteration of the loop and the vxor stuff
> > via LCM would be optimal.
> >
> > There's many other cases where copying and restructuring results in
> > better common subexpression elimination (which is what you're doing).
> > Probably the best work I've seen in this space is Bodik's thesis.
> > Click's work from '95 touches on some of this as well, but isn't as
> > relevant to this specific instance.
> >
> > Anyway, whether or not the patch should move forward is really up to Jan
> > (and Uros if he wants to be involved) I think.  I'm not fundamentally
> > opposed to HJ's approach as I'm aware of the different tradeoffs.
> >
> > HJ's approach of pulling into the dominator block can result in
> > unnecessary evaluations.  But it can also reduce the number of
> > evaluations in other cases.  It really depends on the runtime behavior
> > of the code.   One could argue that the vxor stuff we're talking about
> > is most likely happening in loops, and probably not in deeply nested
> > control structures within those loops.  Thus pulling them out more
> > aggressively ala LICM may be better than LCM.
>
> True, there is a trade-off.   My approach inserts a vxorps at the last 
> possible
> position.  Yes, vxorps will always be executed even if it may not be required.
> Since it is executed only once in all cases, it is a win overall.


Hopefully a simple vpxor won't end up powering up the other AVX512
unit if it lay dormant ...

And if we ever get to the state of having two separate ISAs in the same
function then you'd need to make sure you can execute vpxor in the
place you are inserting since it may now be executed when it wasn't
before (and I assume you already check that you do not zero the
reg if there's a value life in it if the conditional def you are instrumenting
is not executed).

> > But where to throttle that aggressiveness isn't obvious.  A hybrid of
> > LICM + LCM would probably be better than either individually.
> >
>
> Thanks.
>
> --
> H.J.

Re: V3 [PATCH] i386: Add pass_remove_partial_avx_dependency

Reply via email to