Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs

Uros Bizjak Mon, 05 Aug 2019 05:51:31 -0700

On Mon, Aug 5, 2019 at 2:43 PM Uros Bizjak <ubiz...@gmail.com> wrote:
>
> On Mon, Aug 5, 2019 at 1:50 PM Richard Biener <rguent...@suse.de> wrote:
> >
> > On Sun, 4 Aug 2019, Uros Bizjak wrote:
> >
> > > On Sat, Aug 3, 2019 at 7:26 PM Richard Biener <rguent...@suse.de> wrote:
> > > >
> > > > On Thu, 1 Aug 2019, Uros Bizjak wrote:
> > > >
> > > > > > On Thu, Aug 1, 2019 at 11:28 AM Richard Biener <rguent...@suse.de> wrote:
> > > > >
> > > > >>>> So you unconditionally add a smaxdi3 pattern - indeed this looks
> > > > >>>> necessary even when going the STV route.  The actual regression
> > > > > >>>> for the testcase could also be solved by turning the smaxsi3
> > > > > >>>> back into a compare and jump rather than a conditional move sequence.
> > > > > >>>> So I wonder how you'd do that given that there's pass_if_after_reload
> > > > >>>> after pass_split_after_reload and I'm not sure we can split
> > > > >>>> as late as pass_split_before_sched2 (there's also a split _after_
> > > > >>>> sched2 on x86 it seems).
> > > > >>>>
> > > > > >>>> So how would you go about implementing {s,u}{min,max}{si,di}3 for
> > > > > >>>> the case where STV doesn't end up doing any transform?
> > > > >>>
> > > > >>> If STV doesn't transform the insn, then a pre-reload splitter splits
> > > > >>> the insn back to compare+cmove.
> > > > >>
> > > > > >> OK, that would work.  But there's no way to force a jumpy sequence then,
> > > > > >> which we know is faster than compare+cmove, because later RTL
> > > > > >> if-conversion passes happily re-discover the smax (or conditional move)
> > > > > >> sequence.
> > > > >>
> > > > > >>> However, considering the SImode move
> > > > > >>> from/to int/xmm register is relatively cheap, the cost function should
> > > > > >>> be tuned so that STV always converts the smaxsi3 pattern.
> > > > >>
> > > > >> Note that on both Zen and even more so bdverN the int/xmm transition
> > > > >> makes it no longer profitable but a _lot_ slower than the cmp/cmov
> > > > > >> sequence... (for the loop in hmmer, which is the only one where I see
> > > > > >> any effect from any of my patches).  So identifying chains that
> > > > >> start/end in memory is important for cost reasons.
> > > > >
> > > > > Please note that the cost function also considers the cost of move
> > > > > from/to xmm. So, the cost of the whole chain would disable the
> > > > > transformation.
> > > > >
> > > > >> So I think the splitting has to happen after the last if-conversion
> > > > >> pass (and thus we may need to allocate a scratch register for this
> > > > >> purpose?)
> > > > >
> > > > > > I really hope that the underlying issue will be solved by a
> > > > > > machine-dependent pass inserted somewhere after the pre-reload split. This
> > > > > way, we can split unconverted smax to the cmove, and this later pass
> > > > > would handle jcc and cmove instructions. Until then... yes your
> > > > > proposed approach is one of the ways to avoid unwanted if-conversion,
> > > > > although sometimes we would like to split to cmove instead.
> > > >
> > > > So the following makes STV also consider SImode chains, re-using the
> > > > DImode chain code.  I've kept a simple incomplete smaxsi3 pattern
> > > > and also did not alter the {SI,DI}mode chain cost function - it's
> > > > quite off for TARGET_64BIT.  With this I get the expected conversion
> > > > for the testcase derived from hmmer.
> > > >
> > > > No further testing so far.
> > > >
> > > > Is it OK to re-use the DImode chain code this way?  I'll clean things
> > > > up some more of course.
> > >
> > > Yes, the approach looks OK to me. It makes chain building mode
> > > agnostic, and the chain building can be used for
> > > a) DImode x86_32 (as is now), but maybe 64bit minmax operation can be added.
> > > b) SImode x86_32 and x86_64 (this will be mainly used for SImode
> > > minmax and surrounding SImode operations)
> > > c) DImode x86_64 (also, mainly used for DImode minmax and surrounding
> > > DImode operations)
> > >
> > > > Still need help with the actual patterns for minmax and what the
> > > > splitters should look like.
> > >
> > > Please look at the attached patch. Maybe we can add memory_operand as
> > > operand 1 and operand 2 predicate, but let's keep things simple for
> > > now.
> >
> > Thanks.  The attached patch makes the patch cleaner and it survives
> > "some" barebone testing.  It also touches the cost function to
> > avoid being overly trigger-happy.  I've also ended up using
> > ix86_cost->sse_op instead of COSTS_N_INSNS-based magic.  In
> > particular we estimated GPR reg-reg move as COSTS_N_INSNS(2) while
> > move costs shouldn't be wrapped in COSTS_N_INSNS.
> > IMHO we should probably disregard any reg-reg moves for costing pre-RA.
> > At least with the current code every reg-reg move biases in favor of
> > SSE...
> >
> > And we're simply adding move and non-move costs in 'gain', somewhat
> > mixing apples and oranges?  We could separate those and require
> > both to be a net positive win?
> >
> > Still using -mtune=bdverN exposes that some cost tables have xmm and gpr
> > costs as apples and oranges... (so it never triggers for Bulldozer)
> >
> > I now run into
> >
> > /space/rguenther/src/svn/trunk-bisect/libgcc/libgcov-driver.c:509:1:
> > error: unrecognizable insn:
> > (insn 116 115 1511 8 (set (subreg:V2DI (reg/v:DI 87 [ run_max ]) 0)
> >         (smax:V2DI (subreg:V2DI (reg/v:DI 87 [ run_max ]) 0)
> >             (subreg:V2DI (reg:DI 349 [ MEM[base: _261, offset: 0B] ]) 0))) -1
> >      (expr_list:REG_DEAD (reg:DI 349 [ MEM[base: _261, offset: 0B] ])
> >         (expr_list:REG_UNUSED (reg:CC 17 flags)
> >             (nil))))
> > during RTL pass: stv
> >
> > where even with -mavx2 we do not have s{min,max}v2di3.  We do have
> > an expander here but it seems only AVX512F has the DImode min/max
> > ops.  I have adjusted dimode_scalar_to_vector_candidate_p
> > accordingly.
>
> Uh, you need to use some other mode iterator than SWI48 then, like:
>
> (define_mode_iterator MAXMIN_IMODE [(SI "TARGET_SSE4_1") (DI "TARGET_AVX512F")])
>
> and then we need to split DImode for 32bits, too.
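
On the costing question quoted above (move and non-move costs being summed into a single 'gain'), a standalone C sketch of the two policies may make the difference concrete. This is not GCC code: the struct, its field names, and both helper functions are hypothetical, and reading "net positive" as strictly greater than zero is an assumption.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-chain cost summary. The field names are illustrative
   only and are not taken from the GCC sources. */
struct chain_cost {
    int scalar_op_cost;   /* operations if the chain stays in GPRs */
    int vector_op_cost;   /* the same operations done in SSE registers */
    int scalar_move_cost; /* moves needed if the chain stays in GPRs */
    int vector_move_cost; /* moves, including GPR<->XMM, if converted */
};

/* Policy 1: a single gain that adds op savings and move savings. */
static int combined_gain(const struct chain_cost *c)
{
    assert(c);
    return (c->scalar_op_cost - c->vector_op_cost)
         + (c->scalar_move_cost - c->vector_move_cost);
}

static bool convert_combined(const struct chain_cost *c)
{
    return combined_gain(c) > 0;
}

/* Policy 2: keep the two kinds of cost apart and require each one
   to be a win on its own (assumed here to mean strictly positive). */
static bool convert_separate(const struct chain_cost *c)
{
    assert(c);
    int op_gain = c->scalar_op_cost - c->vector_op_cost;
    int move_gain = c->scalar_move_cost - c->vector_move_cost;
    return op_gain > 0 && move_gain > 0;
}
```

With made-up costs {10, 4, 2, 6} the op savings (6) outweigh the move loss (-4), so the combined policy converts the chain while the separated policy does not.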


For now, please add "TARGET_64BIT && TARGET_AVX512F" as the DImode
condition; I'll provide a _doubleword splitter later.
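
As a rough illustration of the conditions discussed here (SImode min/max needing SSE4.1, DImode min/max needing a 64-bit target with AVX512F for now), the gating could be modelled as below. This is a standalone C sketch, not GCC code: struct target_flags, enum scalar_mode and minmax_mode_ok are hypothetical names that only mirror the TARGET_* macros mentioned above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the target flags named in the thread;
   none of these names exist in GCC. */
struct target_flags {
    bool is_64bit;  /* models TARGET_64BIT */
    bool sse4_1;    /* models TARGET_SSE4_1 */
    bool avx512f;   /* models TARGET_AVX512F */
};

enum scalar_mode { MODE_SI, MODE_DI };

/* Return true if a scalar min/max in MODE could be mapped onto a vector
   min/max instruction under the conditions discussed above. */
static bool minmax_mode_ok(enum scalar_mode mode, const struct target_flags *t)
{
    assert(t);
    switch (mode) {
    case MODE_SI:
        return t->sse4_1;                  /* SImode: SSE4.1 is enough */
    case MODE_DI:
        return t->is_64bit && t->avx512f;  /* DImode: 64-bit and AVX512F */
    }
    return false;
}
```

Under this model a 64-bit AVX2-only target accepts SImode but rejects DImode, which matches the unrecognizable V2DI smax insn reported above.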

Uros.
