On Wed, 7 Aug 2019, Richard Biener wrote:

> On Mon, 5 Aug 2019, Uros Bizjak wrote:
> 
> > On Mon, Aug 5, 2019 at 3:29 PM Richard Biener <rguent...@suse.de> wrote:
> > 
> > > > > > > > > (define_mode_iterator MAXMIN_IMODE [SI "TARGET_SSE4_1"] [DI 
> > > > > > > > > "TARGET_AVX512F"])
> > > > > > > > >
> > > > > > > > > and then we need to split DImode for 32bits, too.
> > > > > > > >
> > > > > > > > For now, please add "TARGET_64BIT && TARGET_AVX512F" for DImode
> > > > > > > > condition, I'll provide _doubleword splitter later.
> > > > > > >
> > > > > > > Shouldn't that be TARGET_AVX512VL instead?  Or does the insn use 
> > > > > > > %g0 etc.
> > > > > > > to force use of %zmmN?
> > > > > >
> > > > > > It generates V4SI mode, so - yes, AVX512VL.
> > > > >
> > > > >     case SMAX:
> > > > >     case SMIN:
> > > > >     case UMAX:
> > > > >     case UMIN:
> > > > >       if ((mode == DImode && (!TARGET_64BIT || !TARGET_AVX512VL))
> > > > >           || (mode == SImode && !TARGET_SSE4_1))
> > > > >         return false;
> > > > >
> > > > > so there's no way to use AVX512VL for 32bit?
> > > >
> > > > There is a way, but on 32bit targets, we need to split DImode
> > > > operation to a sequence of SImode operations for unconverted pattern.
> > > > This is of course doable, but somehow more complex than simply
> > > > emitting a DImode compare + DImode cmove, which is what current
> > > > splitter does. So, a follow-up task.
> > >
> > > Ah, OK.  So for the above condition we can elide the !TARGET_64BIT
> > > check we just need to properly split if we enable the scalar minmax
> > > pattern for DImode on 32bits, the STV conversion would go fine.
> > 
> > Yes, that is correct.
> 
> So I tested the patch below (now with appropriate ChangeLog) on
> x86_64-unknown-linux-gnu.  I've thrown it at SPEC CPU 2006 with
> the obvious hmmer improvement, now checking for off-noise results
> with a 3-run on those that may have one (with more than +-1 second
> differences in the 1-run).

Update on this one.  On Haswell I see (besides hmmer and the
ones +-1 second in the 1-run); base is unpatched, peak is patched:

401.bzip2        9650        382       25.3 S    9650        380       
25.4 S
401.bzip2        9650        381       25.3 *    9650        377       
25.6 *
401.bzip2        9650        381       25.3 S    9650        376       
25.7 S

458.sjeng       12100        433       28.0 S   12100        433       
28.0 S
458.sjeng       12100        428       28.3 S   12100        424       
28.5 *
458.sjeng       12100        432       28.0 *   12100        424       
28.6 S

464.h264ref     22130        413       53.6 S   22130        422       
52.5 S
464.h264ref     22130        413       53.6 *   22130        421       
52.5 S
464.h264ref     22130        413       53.6 S   22130        421       
52.5 *

473.astar        7020        328       21.4 S    7020        316       
22.2 S
473.astar        7020        322       21.8 S    7020        314       
22.4 *
473.astar        7020        322       21.8 *    7020        311       
22.6 S

416.gamess      19580        593       33.0 S   19580        601       
32.6 S
416.gamess      19580        593       33.0 S   19580        601       
32.6 *
416.gamess      19580        593       33.0 *   19580        601       
32.6 S

so it's a loss for 464.h264ref and 416.gamess from the above numbers
and a slight win for the others (and a big one for 456.hmmer).

I plan to have a look at the two as followup only, possibly adding
a debug couter to be able to bisect to a specific chain.

Richard.

Reply via email to