On Fri, Aug 9, 2019 at 3:00 PM Richard Biener <rguent...@suse.de> wrote:

> > > > > > > > > > (define_mode_iterator MAXMIN_IMODE [SI "TARGET_SSE4_1"] [DI 
> > > > > > > > > > "TARGET_AVX512F"])
> > > > > > > > > >
> > > > > > > > > > and then we need to split DImode for 32bits, too.
> > > > > > > > >
> > > > > > > > > For now, please add "TARGET_64BIT && TARGET_AVX512F" for 
> > > > > > > > > DImode
> > > > > > > > > condition, I'll provide _doubleword splitter later.
> > > > > > > >
> > > > > > > > Shouldn't that be TARGET_AVX512VL instead?  Or does the insn 
> > > > > > > > use %g0 etc.
> > > > > > > > to force use of %zmmN?
> > > > > > >
> > > > > > > It generates V4SI mode, so - yes, AVX512VL.
> > > > > >
> > > > > >     case SMAX:
> > > > > >     case SMIN:
> > > > > >     case UMAX:
> > > > > >     case UMIN:
> > > > > >       if ((mode == DImode && (!TARGET_64BIT || !TARGET_AVX512VL))
> > > > > >           || (mode == SImode && !TARGET_SSE4_1))
> > > > > >         return false;
> > > > > >
> > > > > > so there's no way to use AVX512VL for 32bit?
> > > > >
> > > > > There is a way, but on 32bit targets, we need to split DImode
> > > > > operation to a sequence of SImode operations for unconverted pattern.
> > > > > This is of course doable, but somehow more complex than simply
> > > > > emitting a DImode compare + DImode cmove, which is what current
> > > > > splitter does. So, a follow-up task.
> > > >
> > > > Please find attached the complete .md part that enables SImode for
> > > > TARGET_SSE4_1 and DImode for TARGET_AVX512VL for both, 32bit and 64bit
> > > > targets. The patterns also allows for memory operand 2, so STV has
> > > > chance to create the vector pattern with implicit load. In case STV
> > > > fails, the memory operand 2 is loaded to the register first;  operand
> > > > 2 is used in compare and cmove instruction, so pre-loading of the
> > > > operand should be beneficial.
> > >
> > > Thanks.
> > >
> > > > Also note, that splitting should happen rarely. Due to the cost
> > > > function, STV should effectively always convert minmax to a vector
> > > > insn.
> > >
> > > I've analyzed the 464.h264ref slowdown on Haswell and it is due to
> > > this kind of "simple" conversion:
> > >
> > >   5.50 │1d0:   test   %esi,%es
> > >   0.07 │       mov    $0x0,%ex
> > >        │       cmovs  %eax,%es
> > >   5.84 │       imul   %r8d,%es
> > >
> > > to
> > >
> > >   0.65 │1e0:   vpxor  %xmm0,%xmm0,%xmm0
> > >   0.32 │       vpmaxs -0x10(%rsp),%xmm0,%xmm0
> > >  40.45 │       vmovd  %xmm0,%eax
> > >   2.45 │       imul   %r8d,%eax
> > >
> > > which looks like a RA artifact in the end.  We spill %esi only
> > > with -mstv here as STV introduces a (subreg:V4SI ...) use
> > > of a pseudo ultimatively set from di.  STV creates an additional
> > > pseudo for this (copy-in) but it places that copy next to the
> > > original def rather than next to the start of the chain it
> > > converts which is probably the issue why we spill.  And this
> > > is because it inserts those at each definition of the pseudo
> > > rather than just at the reaching definition(s) or at the
> > > uses of the pseudo in the chain (that because there may be
> > > defs of that pseudo in the chain itself).  Note that STV emits
> > > such "conversion" copies as simple reg-reg moves:
> > >
> > > (insn 1094 3 4 2 (set (reg:SI 777)
> > >         (reg/v:SI 438 [ y ])) "refbuf.i":4:1 -1
> > >      (nil))
> > >
> > > but those do not prevail very long (this one gets removed by CSE2).
> > > So IRA just sees the (subreg:V4SI (reg/v:SI 438 [ y ]) 0) use
> > > and computes
> > >
> > >     r438: preferred NO_REGS, alternative NO_REGS, allocno NO_REGS
> > >     a297(r438,l0) costs: SSE_REGS:5628,5628 MEM:3618,3618
> > >
> > > so I wonder if STV shouldn't instead emit gpr->xmm moves
> > > here (but I guess nothing again prevents RTL optimizers from
> > > combining that with the single-use in the max instruction...).
> > >
> > > So this boils down to STV splitting live-ranges but other
> > > passes undoing that and then RA not considering splitting
> > > live-ranges here, arriving at unoptimal allocation.
> > >
> > > A testcase showing this issue is (simplified from 464.h264ref
> > > UMVLine16Y_11):
> > >
> > > unsigned short
> > > UMVLine16Y_11 (short unsigned int * Pic, int y, int width)
> > > {
> > >   if (y != width)
> > >     {
> > >       y = y < 0 ? 0 : y;
> > >       return Pic[y * width];
> > >     }
> > >   return Pic[y];
> > > }
> > >
> > > where the condition and the Pic[y] load mimics the other use of y.
> > > Different, even worse spilling is generated by
> > >
> > > unsigned short
> > > UMVLine16Y_11 (short unsigned int * Pic, int y, int width)
> > > {
> > >   y = y < 0 ? 0 : y;
> > >   return Pic[y * width] + y;
> > > }
> > >
> > > I guess this all shows that STVs "trick" of simply wrapping
> > > integer mode pseudos in (subreg:vector-mode ...) is bad?
> > >
> > > I've added a (failing) testcase to reflect the above.
> >
> > Experimenting a bit with just for the conversion insns using
> > V4SImode pseudos we end up preserving those moves (but I
> > do have to use a lowpart set, using reg:V4SI = subreg:V4SI Simode-reg
> > ends up using movv4si_internal which only leaves us with
> > memory for the SImode operand) _plus_ moving the move next
> > to the actual use has an effect.  Not necssarily a good one
> > though:
> >
> >         vpxor   %xmm0, %xmm0, %xmm0
> >         vmovaps %xmm0, -16(%rsp)
> >         movl    %esi, -16(%rsp)
> >         vpmaxsd -16(%rsp), %xmm0, %xmm0
> >         vmovd   %xmm0, %eax
> >
> > eh?  I guess the lowpart set is not good (my patch has this
> > as well, but I got saved by never having vector modes to subset...).
> > Using
> >
> >     (vec_merge:V4SI (vec_duplicate:V4SI (reg/v:SI 83 [ i ]))
> >             (const_vector:V4SI [
> >                     (const_int 0 [0]) repeated x4
> >                 ])
> >             (const_int 1 [0x1]))) "t3.c":5:10 -1
> >
> > for the move ends up with
> >
> >         vpxor   %xmm1, %xmm1, %xmm1
> >         vpinsrd $0, %esi, %xmm1, %xmm0
> >
> > eh?  LRA chooses the correct alternative here but somehow
> > postreload CSE CSEs the zero with the xmm1 clearing, leading
> > to the vpinsrd...  (I guess a general issue, not sure if really
> > worse - definitely a larger instruction).  Unfortunately
> > postreload-cse doesn't add a reg-equal note.  This happens only
> > when emitting the reg move before the use, not doing that emits
> > a vmovd as expected.
> >
> > At least the spilling is gone here.
> >
> > I am re-testing as follows, the main change is that
> > general_scalar_chain::make_vector_copies now generates a
> > vector pseudo as destination (and I've fixed up the code
> > to not generate (subreg:V4SI (reg:V4SI 1234) 0)).
> >
> > Hope this fixes the observed slowdowns (it fixes the new testcase).
>
> It fixes the slowdown observed in 416.gamess and 464.h264ref.
>
> Bootstrapped on x86_64-unknown-linux-gnu, testing still in progress.
>
> CCing Jeff who "knows RTL".
>
> OK?

Please add -mno-stv to gcc.target/i386/minmax-{1,2}.c to avoid
spurious test failures on SSE4.1 targets.

Uros.

> Thanks,
> Richard.
>
> > Richard.
> >
> > mccas.F:twotff_ for 416.gamess
> > refbuf.c:UMVLine16Y_11 for 464.h264ref
> >
> > 2019-08-07  Richard Biener  <rguent...@suse.de>
> >
> >       PR target/91154
> >       * config/i386/i386-features.h (scalar_chain::scalar_chain): Add
> >       mode arguments.
> >       (scalar_chain::smode): New member.
> >       (scalar_chain::vmode): Likewise.
> >       (dimode_scalar_chain): Rename to...
> >       (general_scalar_chain): ... this.
> >       (general_scalar_chain::general_scalar_chain): Take mode arguments.
> >       (timode_scalar_chain::timode_scalar_chain): Initialize scalar_chain
> >       base with TImode and V1TImode.
> >       * config/i386/i386-features.c (scalar_chain::scalar_chain): Adjust.
> >       (general_scalar_chain::vector_const_cost): Adjust for SImode
> >       chains.
> >       (general_scalar_chain::compute_convert_gain): Likewise.  Fix
> >       reg-reg move cost gain, use ix86_cost->sse_op cost and adjust
> >       scalar costs.  Add {S,U}{MIN,MAX} support.  Dump per-instruction
> >       gain if not zero.
> >       (general_scalar_chain::replace_with_subreg): Use vmode/smode.
> >       Elide the subreg if the reg is already vector.
> >       (general_scalar_chain::make_vector_copies): Likewise.  Handle
> >       non-DImode chains appropriately.  Use a vector-mode pseudo as
> >       destination.
> >       (general_scalar_chain::convert_reg): Likewise.
> >       (general_scalar_chain::convert_op): Likewise.  Elide the
> >       subreg if the reg is already vector.
> >       (general_scalar_chain::convert_insn): Likewise.  Add
> >       fatal_insn_not_found if the result is not recognized.
> >       (convertible_comparison_p): Pass in the scalar mode and use that.
> >       (general_scalar_to_vector_candidate_p): Likewise.  Rename from
> >       dimode_scalar_to_vector_candidate_p.  Add {S,U}{MIN,MAX} support.
> >       (scalar_to_vector_candidate_p): Remove by inlining into single
> >       caller.
> >       (general_remove_non_convertible_regs): Rename from
> >       dimode_remove_non_convertible_regs.
> >       (remove_non_convertible_regs): Remove by inlining into single caller.
> >       (convert_scalars_to_vector): Handle SImode and DImode chains
> >       in addition to TImode chains.
> >       * config/i386/i386.md (<maxmin><SWI48>3): New insn split after STV.
> >
> >       * gcc.target/i386/pr91154.c: New testcase.
> >       * gcc.target/i386/minmax-3.c: Likewise.
> >       * gcc.target/i386/minmax-4.c: Likewise.
> >       * gcc.target/i386/minmax-5.c: Likewise.
> >       * gcc.target/i386/minmax-6.c: Likewise.

Reply via email to