On Mon, 5 Aug 2019, Uros Bizjak wrote:

> On Mon, Aug 5, 2019 at 3:29 PM Richard Biener <rguent...@suse.de> wrote:
> 
> > > > > > > > (define_mode_iterator MAXMIN_IMODE
> > > > > > > >   [(SI "TARGET_SSE4_1") (DI "TARGET_AVX512F")])
> > > > > > > >
> > > > > > > > and then we need to split DImode for 32bits, too.
> > > > > > >
> > > > > > > For now, please add "TARGET_64BIT && TARGET_AVX512F" for DImode
> > > > > > > condition, I'll provide _doubleword splitter later.
> > > > > >
> > > > > > Shouldn't that be TARGET_AVX512VL instead?  Or does the insn use 
> > > > > > %g0 etc.
> > > > > > to force use of %zmmN?
> > > > >
> > > > > It generates V4SI mode, so - yes, AVX512VL.
> > > >
> > > >     case SMAX:
> > > >     case SMIN:
> > > >     case UMAX:
> > > >     case UMIN:
> > > >       if ((mode == DImode && (!TARGET_64BIT || !TARGET_AVX512VL))
> > > >           || (mode == SImode && !TARGET_SSE4_1))
> > > >         return false;
> > > >
> > > > so there's no way to use AVX512VL for 32bit?
> > >
> > > There is a way, but on 32bit targets, we need to split DImode
> > > operation to a sequence of SImode operations for unconverted pattern.
> > > This is of course doable, but somehow more complex than simply
> > > emitting a DImode compare + DImode cmove, which is what current
> > > splitter does. So, a follow-up task.
> >
> > Ah, OK.  So for the above condition we can elide the !TARGET_64BIT
> > check; we just need to properly split if we enable the scalar minmax
> > pattern for DImode on 32 bits, and the STV conversion would go fine.
> 
> Yes, that is correct.

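For the record, the splitting Uros refers to amounts to something like
the following C-level sketch for a DImode umin on ia32 (illustrative
only, the helper name is made up): operate on the two SImode halves
instead of emitting one doubleword compare plus cmove.

/* Illustration only: a DImode umin open-coded as SImode operations
   on the high and low halves, roughly what a _doubleword splitter
   would have to emit on a 32-bit target.  */
unsigned long long
umin64 (unsigned long long x, unsigned long long y)
{
  unsigned int xh = x >> 32, yh = y >> 32;  /* high halves */
  unsigned int xl = x, yl = y;              /* low halves */
  if (xh < yh || (xh == yh && xl < yl))
    return x;
  return y;
}
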
So I tested the patch below (now with an appropriate ChangeLog) on
x86_64-unknown-linux-gnu.  I've thrown it at SPEC CPU 2006, where it
shows the expected hmmer improvement; I'm now re-checking the
apparently off-noise results (those with more than +-1 second
difference in the 1-run) with a 3-run.

As-is the patch likely runs into the splitting issue for DImode on
i?86, and it still lacks functional testcases.  I'll do the hmmer
loop with both DImode and SImode, plus testcases that trigger all
pattern variants with the different ISAs we have.

Some of the patch could be split out (for example the cost changes,
which are also effective for DImode).

AFAICS we could go with adding only SImode, avoiding the DImode
splitting issue, and that alone would solve the hmmer regression.
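
For reference, the kind of (reduced, made-up) testcase I have in mind
for the SImode variant is a hmmer-style max reduction like the
following, where with -O2 -msse4.1 the compare + cmove chain should
now be caught by STV:

int
foo (int *a, int *b, int n)
{
  int m = 0;
  for (int i = 0; i < n; ++i)
    {
      int t = a[i] + b[i];
      /* If-converted to m = MAX (m, t), i.e. an SImode smax.  */
      if (t > m)
        m = t;
    }
  return m;
}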

Thanks,
Richard.

2019-08-07  Richard Biener  <rguent...@suse.de>

        PR target/91154
        * config/i386/i386-features.h (scalar_chain::scalar_chain): Add
        mode arguments.
        (scalar_chain::smode): New member.
        (scalar_chain::vmode): Likewise.
        (dimode_scalar_chain): Rename to...
        (general_scalar_chain): ... this.
        (general_scalar_chain::general_scalar_chain): Take mode arguments.
        (timode_scalar_chain::timode_scalar_chain): Initialize scalar_chain
        base with TImode and V1TImode.
        * config/i386/i386-features.c (scalar_chain::scalar_chain): Adjust.
        (general_scalar_chain::vector_const_cost): Adjust for SImode
        chains.
        (general_scalar_chain::compute_convert_gain): Likewise.  Fix
        reg-reg move cost gain, use ix86_cost->sse_op cost and adjust
        scalar costs.  Add {S,U}{MIN,MAX} support.  Dump per-instruction
        gain if not zero.
        (general_scalar_chain::replace_with_subreg): Use vmode/smode.
        (general_scalar_chain::make_vector_copies): Likewise.  Handle
        non-DImode chains appropriately.
        (general_scalar_chain::convert_reg): Likewise.
        (general_scalar_chain::convert_op): Likewise.
        (general_scalar_chain::convert_insn): Likewise.  Add
        fatal_insn_not_found if the result is not recognized.
        (convertible_comparison_p): Pass in the scalar mode and use that.
        (general_scalar_to_vector_candidate_p): Likewise.  Rename from
        dimode_scalar_to_vector_candidate_p.  Add {S,U}{MIN,MAX} support.
        (scalar_to_vector_candidate_p): Remove by inlining into single
        caller.
        (general_remove_non_convertible_regs): Rename from
        dimode_remove_non_convertible_regs.
        (remove_non_convertible_regs): Remove by inlining into single caller.
        (convert_scalars_to_vector): Handle SImode and DImode chains
        in addition to TImode chains.
        * config/i386/i386.md (<maxmin><SWI48>3): New insn split after STV.

Index: gcc/config/i386/i386-features.c
===================================================================
--- gcc/config/i386/i386-features.c     (revision 274111)
+++ gcc/config/i386/i386-features.c     (working copy)
@@ -276,8 +276,11 @@ unsigned scalar_chain::max_id = 0;
 
 /* Initialize new chain.  */
 
-scalar_chain::scalar_chain ()
+scalar_chain::scalar_chain (enum machine_mode smode_, enum machine_mode vmode_)
 {
+  smode = smode_;
+  vmode = vmode_;
+
   chain_id = ++max_id;
 
    if (dump_file)
@@ -319,7 +322,7 @@ scalar_chain::add_to_queue (unsigned ins
    conversion.  */
 
 void
-dimode_scalar_chain::mark_dual_mode_def (df_ref def)
+general_scalar_chain::mark_dual_mode_def (df_ref def)
 {
   gcc_assert (DF_REF_REG_DEF_P (def));
 
@@ -409,6 +412,9 @@ scalar_chain::add_insn (bitmap candidate
       && !HARD_REGISTER_P (SET_DEST (def_set)))
     bitmap_set_bit (defs, REGNO (SET_DEST (def_set)));
 
+  /* ???  The following is quadratic since analyze_register_chain
+     iterates over all refs to look for dual-mode regs.  Instead this
+     should be done separately for all regs mentioned in the chain once.  */
   df_ref ref;
   df_ref def;
   for (ref = DF_INSN_UID_DEFS (insn_uid); ref; ref = DF_REF_NEXT_LOC (ref))
@@ -469,19 +475,21 @@ scalar_chain::build (bitmap candidates,
    instead of using a scalar one.  */
 
 int
-dimode_scalar_chain::vector_const_cost (rtx exp)
+general_scalar_chain::vector_const_cost (rtx exp)
 {
   gcc_assert (CONST_INT_P (exp));
 
-  if (standard_sse_constant_p (exp, V2DImode))
-    return COSTS_N_INSNS (1);
-  return ix86_cost->sse_load[1];
+  if (standard_sse_constant_p (exp, vmode))
+    return ix86_cost->sse_op;
+  /* We have separate costs for SImode and DImode; use SImode costs
+     for smaller modes.  */
+  return ix86_cost->sse_load[smode == DImode ? 1 : 0];
 }
 
 /* Compute a gain for chain conversion.  */
 
 int
-dimode_scalar_chain::compute_convert_gain ()
+general_scalar_chain::compute_convert_gain ()
 {
   bitmap_iterator bi;
   unsigned insn_uid;
@@ -491,28 +499,37 @@ dimode_scalar_chain::compute_convert_gai
   if (dump_file)
     fprintf (dump_file, "Computing gain for chain #%d...\n", chain_id);
 
+  /* SSE costs distinguish between SImode and DImode loads/stores, for
+     int costs factor in the number of GPRs involved.  When supporting
+     smaller modes than SImode the int load/store costs need to be
+     adjusted as well.  */
+  unsigned sse_cost_idx = smode == DImode ? 1 : 0;
+  unsigned m = smode == DImode ? (TARGET_64BIT ? 1 : 2) : 1;
+
   EXECUTE_IF_SET_IN_BITMAP (insns, 0, insn_uid, bi)
     {
       rtx_insn *insn = DF_INSN_UID_GET (insn_uid)->insn;
       rtx def_set = single_set (insn);
       rtx src = SET_SRC (def_set);
       rtx dst = SET_DEST (def_set);
+      int igain = 0;
 
       if (REG_P (src) && REG_P (dst))
-       gain += COSTS_N_INSNS (2) - ix86_cost->xmm_move;
+       igain += 2 * m - ix86_cost->xmm_move;
       else if (REG_P (src) && MEM_P (dst))
-       gain += 2 * ix86_cost->int_store[2] - ix86_cost->sse_store[1];
+       igain
+         += m * ix86_cost->int_store[2] - ix86_cost->sse_store[sse_cost_idx];
       else if (MEM_P (src) && REG_P (dst))
-       gain += 2 * ix86_cost->int_load[2] - ix86_cost->sse_load[1];
+       igain += m * ix86_cost->int_load[2] - ix86_cost->sse_load[sse_cost_idx];
       else if (GET_CODE (src) == ASHIFT
               || GET_CODE (src) == ASHIFTRT
               || GET_CODE (src) == LSHIFTRT)
        {
          if (CONST_INT_P (XEXP (src, 0)))
-           gain -= vector_const_cost (XEXP (src, 0));
-         gain += ix86_cost->shift_const;
+           igain -= vector_const_cost (XEXP (src, 0));
+         igain += m * ix86_cost->shift_const - ix86_cost->sse_op;
          if (INTVAL (XEXP (src, 1)) >= 32)
-           gain -= COSTS_N_INSNS (1);
+           igain -= COSTS_N_INSNS (1);
        }
       else if (GET_CODE (src) == PLUS
               || GET_CODE (src) == MINUS
@@ -520,20 +537,31 @@ dimode_scalar_chain::compute_convert_gai
               || GET_CODE (src) == XOR
               || GET_CODE (src) == AND)
        {
-         gain += ix86_cost->add;
+         igain += m * ix86_cost->add - ix86_cost->sse_op;
          /* Additional gain for andnot for targets without BMI.  */
          if (GET_CODE (XEXP (src, 0)) == NOT
              && !TARGET_BMI)
-           gain += 2 * ix86_cost->add;
+           igain += m * ix86_cost->add;
 
          if (CONST_INT_P (XEXP (src, 0)))
-           gain -= vector_const_cost (XEXP (src, 0));
+           igain -= vector_const_cost (XEXP (src, 0));
          if (CONST_INT_P (XEXP (src, 1)))
-           gain -= vector_const_cost (XEXP (src, 1));
+           igain -= vector_const_cost (XEXP (src, 1));
        }
       else if (GET_CODE (src) == NEG
               || GET_CODE (src) == NOT)
-       gain += ix86_cost->add - COSTS_N_INSNS (1);
+       igain += m * ix86_cost->add - ix86_cost->sse_op;
+      else if (GET_CODE (src) == SMAX
+              || GET_CODE (src) == SMIN
+              || GET_CODE (src) == UMAX
+              || GET_CODE (src) == UMIN)
+       {
+         /* We do not have any conditional move cost; estimate it as a
+            reg-reg move.  Comparisons are costed as adds.  */
+         igain += m * (COSTS_N_INSNS (2) + ix86_cost->add);
+         /* Integer SSE ops are all costed the same.  */
+         igain -= ix86_cost->sse_op;
+       }
       else if (GET_CODE (src) == COMPARE)
        {
          /* Assume comparison cost is the same.  */
@@ -541,18 +569,28 @@ dimode_scalar_chain::compute_convert_gai
       else if (CONST_INT_P (src))
        {
          if (REG_P (dst))
-           gain += COSTS_N_INSNS (2);
+           /* DImode can be an immediate for TARGET_64BIT, SImode always.  */
+           igain += COSTS_N_INSNS (m);
          else if (MEM_P (dst))
-           gain += 2 * ix86_cost->int_store[2] - ix86_cost->sse_store[1];
-         gain -= vector_const_cost (src);
+           igain += (m * ix86_cost->int_store[2]
+                    - ix86_cost->sse_store[sse_cost_idx]);
+         igain -= vector_const_cost (src);
        }
       else
        gcc_unreachable ();
+
+      if (igain != 0 && dump_file)
+       {
+         fprintf (dump_file, "  Instruction gain %d for ", igain);
+         dump_insn_slim (dump_file, insn);
+       }
+      gain += igain;
     }
 
   if (dump_file)
     fprintf (dump_file, "  Instruction conversion gain: %d\n", gain);
 
+  /* ???  What about integer to SSE?  */
   EXECUTE_IF_SET_IN_BITMAP (defs_conv, 0, insn_uid, bi)
     cost += DF_REG_DEF_COUNT (insn_uid) * ix86_cost->sse_to_integer;
 
@@ -570,10 +608,10 @@ dimode_scalar_chain::compute_convert_gai
 /* Replace REG in X with a V2DI subreg of NEW_REG.  */
 
 rtx
-dimode_scalar_chain::replace_with_subreg (rtx x, rtx reg, rtx new_reg)
+general_scalar_chain::replace_with_subreg (rtx x, rtx reg, rtx new_reg)
 {
   if (x == reg)
-    return gen_rtx_SUBREG (V2DImode, new_reg, 0);
+    return gen_rtx_SUBREG (vmode, new_reg, 0);
 
   const char *fmt = GET_RTX_FORMAT (GET_CODE (x));
   int i, j;
@@ -593,7 +631,7 @@ dimode_scalar_chain::replace_with_subreg
 /* Replace REG in INSN with a V2DI subreg of NEW_REG.  */
 
 void
-dimode_scalar_chain::replace_with_subreg_in_insn (rtx_insn *insn,
+general_scalar_chain::replace_with_subreg_in_insn (rtx_insn *insn,
                                                  rtx reg, rtx new_reg)
 {
   replace_with_subreg (single_set (insn), reg, new_reg);
@@ -624,10 +662,10 @@ scalar_chain::emit_conversion_insns (rtx
    and replace its uses in a chain.  */
 
 void
-dimode_scalar_chain::make_vector_copies (unsigned regno)
+general_scalar_chain::make_vector_copies (unsigned regno)
 {
   rtx reg = regno_reg_rtx[regno];
-  rtx vreg = gen_reg_rtx (DImode);
+  rtx vreg = gen_reg_rtx (smode);
   df_ref ref;
 
   for (ref = DF_REG_DEF_CHAIN (regno); ref; ref = DF_REF_NEXT_REG (ref))
@@ -636,37 +674,47 @@ dimode_scalar_chain::make_vector_copies
        start_sequence ();
        if (!TARGET_INTER_UNIT_MOVES_TO_VEC)
          {
-           rtx tmp = assign_386_stack_local (DImode, SLOT_STV_TEMP);
-           emit_move_insn (adjust_address (tmp, SImode, 0),
-                           gen_rtx_SUBREG (SImode, reg, 0));
-           emit_move_insn (adjust_address (tmp, SImode, 4),
-                           gen_rtx_SUBREG (SImode, reg, 4));
+           rtx tmp = assign_386_stack_local (smode, SLOT_STV_TEMP);
+           if (smode == DImode && !TARGET_64BIT)
+             {
+               emit_move_insn (adjust_address (tmp, SImode, 0),
+                               gen_rtx_SUBREG (SImode, reg, 0));
+               emit_move_insn (adjust_address (tmp, SImode, 4),
+                               gen_rtx_SUBREG (SImode, reg, 4));
+             }
+           else
+             emit_move_insn (tmp, reg);
            emit_move_insn (vreg, tmp);
          }
-       else if (TARGET_SSE4_1)
+       else if (!TARGET_64BIT && smode == DImode)
          {
-           emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0),
-                                       CONST0_RTX (V4SImode),
-                                       gen_rtx_SUBREG (SImode, reg, 0)));
-           emit_insn (gen_sse4_1_pinsrd (gen_rtx_SUBREG (V4SImode, vreg, 0),
-                                         gen_rtx_SUBREG (V4SImode, vreg, 0),
-                                         gen_rtx_SUBREG (SImode, reg, 4),
-                                         GEN_INT (2)));
+           if (TARGET_SSE4_1)
+             {
+               emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0),
+                                           CONST0_RTX (V4SImode),
+                                           gen_rtx_SUBREG (SImode, reg, 0)));
+               emit_insn (gen_sse4_1_pinsrd (gen_rtx_SUBREG (V4SImode, vreg, 0),
+                                             gen_rtx_SUBREG (V4SImode, vreg, 0),
+                                             gen_rtx_SUBREG (SImode, reg, 4),
+                                             GEN_INT (2)));
+             }
+           else
+             {
+               rtx tmp = gen_reg_rtx (DImode);
+               emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0),
+                                           CONST0_RTX (V4SImode),
+                                           gen_rtx_SUBREG (SImode, reg, 0)));
+               emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, tmp, 0),
+                                           CONST0_RTX (V4SImode),
+                                           gen_rtx_SUBREG (SImode, reg, 4)));
+               emit_insn (gen_vec_interleave_lowv4si
+                          (gen_rtx_SUBREG (V4SImode, vreg, 0),
+                           gen_rtx_SUBREG (V4SImode, vreg, 0),
+                           gen_rtx_SUBREG (V4SImode, tmp, 0)));
+             }
          }
        else
-         {
-           rtx tmp = gen_reg_rtx (DImode);
-           emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0),
-                                       CONST0_RTX (V4SImode),
-                                       gen_rtx_SUBREG (SImode, reg, 0)));
-           emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, tmp, 0),
-                                       CONST0_RTX (V4SImode),
-                                       gen_rtx_SUBREG (SImode, reg, 4)));
-           emit_insn (gen_vec_interleave_lowv4si
-                      (gen_rtx_SUBREG (V4SImode, vreg, 0),
-                       gen_rtx_SUBREG (V4SImode, vreg, 0),
-                       gen_rtx_SUBREG (V4SImode, tmp, 0)));
-         }
+         emit_move_insn (gen_lowpart (smode, vreg), reg);
        rtx_insn *seq = get_insns ();
        end_sequence ();
        rtx_insn *insn = DF_REF_INSN (ref);
@@ -695,7 +743,7 @@ dimode_scalar_chain::make_vector_copies
    in case register is used in not convertible insn.  */
 
 void
-dimode_scalar_chain::convert_reg (unsigned regno)
+general_scalar_chain::convert_reg (unsigned regno)
 {
   bool scalar_copy = bitmap_bit_p (defs_conv, regno);
   rtx reg = regno_reg_rtx[regno];
@@ -707,7 +755,7 @@ dimode_scalar_chain::convert_reg (unsign
   bitmap_copy (conv, insns);
 
   if (scalar_copy)
-    scopy = gen_reg_rtx (DImode);
+    scopy = gen_reg_rtx (smode);
 
   for (ref = DF_REG_DEF_CHAIN (regno); ref; ref = DF_REF_NEXT_REG (ref))
     {
@@ -727,40 +775,55 @@ dimode_scalar_chain::convert_reg (unsign
          start_sequence ();
          if (!TARGET_INTER_UNIT_MOVES_FROM_VEC)
            {
-             rtx tmp = assign_386_stack_local (DImode, SLOT_STV_TEMP);
+             rtx tmp = assign_386_stack_local (smode, SLOT_STV_TEMP);
              emit_move_insn (tmp, reg);
-             emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0),
-                             adjust_address (tmp, SImode, 0));
-             emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4),
-                             adjust_address (tmp, SImode, 4));
+             if (!TARGET_64BIT && smode == DImode)
+               {
+                 emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0),
+                                 adjust_address (tmp, SImode, 0));
+                 emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4),
+                                 adjust_address (tmp, SImode, 4));
+               }
+             else
+               emit_move_insn (scopy, tmp);
            }
-         else if (TARGET_SSE4_1)
+         else if (!TARGET_64BIT && smode == DImode)
            {
-             rtx tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const0_rtx));
-             emit_insn
-               (gen_rtx_SET
-                (gen_rtx_SUBREG (SImode, scopy, 0),
-                 gen_rtx_VEC_SELECT (SImode,
-                                     gen_rtx_SUBREG (V4SImode, reg, 0), tmp)));
-
-             tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const1_rtx));
-             emit_insn
-               (gen_rtx_SET
-                (gen_rtx_SUBREG (SImode, scopy, 4),
-                 gen_rtx_VEC_SELECT (SImode,
-                                     gen_rtx_SUBREG (V4SImode, reg, 0), tmp)));
+             if (TARGET_SSE4_1)
+               {
+                 rtx tmp = gen_rtx_PARALLEL (VOIDmode,
+                                             gen_rtvec (1, const0_rtx));
+                 emit_insn
+                   (gen_rtx_SET
+                      (gen_rtx_SUBREG (SImode, scopy, 0),
+                       gen_rtx_VEC_SELECT (SImode,
+                                           gen_rtx_SUBREG (V4SImode, reg, 0),
+                                           tmp)));
+
+                 tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const1_rtx));
+                 emit_insn
+                   (gen_rtx_SET
+                      (gen_rtx_SUBREG (SImode, scopy, 4),
+                       gen_rtx_VEC_SELECT (SImode,
+                                           gen_rtx_SUBREG (V4SImode, reg, 0),
+                                           tmp)));
+               }
+             else
+               {
+                 rtx vcopy = gen_reg_rtx (V2DImode);
+                 emit_move_insn (vcopy, gen_rtx_SUBREG (V2DImode, reg, 0));
+                 emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0),
+                                 gen_rtx_SUBREG (SImode, vcopy, 0));
+                 emit_move_insn (vcopy,
+                                 gen_rtx_LSHIFTRT (V2DImode,
+                                                   vcopy, GEN_INT (32)));
+                 emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4),
+                                 gen_rtx_SUBREG (SImode, vcopy, 0));
+               }
            }
          else
-           {
-             rtx vcopy = gen_reg_rtx (V2DImode);
-             emit_move_insn (vcopy, gen_rtx_SUBREG (V2DImode, reg, 0));
-             emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0),
-                             gen_rtx_SUBREG (SImode, vcopy, 0));
-             emit_move_insn (vcopy,
-                             gen_rtx_LSHIFTRT (V2DImode, vcopy, GEN_INT (32)));
-             emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4),
-                             gen_rtx_SUBREG (SImode, vcopy, 0));
-           }
+           emit_move_insn (scopy, reg);
+
          rtx_insn *seq = get_insns ();
          end_sequence ();
          emit_conversion_insns (seq, insn);
@@ -809,21 +872,21 @@ dimode_scalar_chain::convert_reg (unsign
    registers conversion.  */
 
 void
-dimode_scalar_chain::convert_op (rtx *op, rtx_insn *insn)
+general_scalar_chain::convert_op (rtx *op, rtx_insn *insn)
 {
   *op = copy_rtx_if_shared (*op);
 
   if (GET_CODE (*op) == NOT)
     {
       convert_op (&XEXP (*op, 0), insn);
-      PUT_MODE (*op, V2DImode);
+      PUT_MODE (*op, vmode);
     }
   else if (MEM_P (*op))
     {
-      rtx tmp = gen_reg_rtx (DImode);
+      rtx tmp = gen_reg_rtx (GET_MODE (*op));
 
       emit_insn_before (gen_move_insn (tmp, *op), insn);
-      *op = gen_rtx_SUBREG (V2DImode, tmp, 0);
+      *op = gen_rtx_SUBREG (vmode, tmp, 0);
 
       if (dump_file)
        fprintf (dump_file, "  Preloading operand for insn %d into r%d\n",
@@ -841,24 +904,30 @@ dimode_scalar_chain::convert_op (rtx *op
            gcc_assert (!DF_REF_CHAIN (ref));
            break;
          }
-      *op = gen_rtx_SUBREG (V2DImode, *op, 0);
+      *op = gen_rtx_SUBREG (vmode, *op, 0);
     }
   else if (CONST_INT_P (*op))
     {
       rtx vec_cst;
-      rtx tmp = gen_rtx_SUBREG (V2DImode, gen_reg_rtx (DImode), 0);
+      rtx tmp = gen_rtx_SUBREG (vmode, gen_reg_rtx (smode), 0);
 
       /* Prefer all ones vector in case of -1.  */
       if (constm1_operand (*op, GET_MODE (*op)))
-       vec_cst = CONSTM1_RTX (V2DImode);
+       vec_cst = CONSTM1_RTX (vmode);
       else
-       vec_cst = gen_rtx_CONST_VECTOR (V2DImode,
-                                       gen_rtvec (2, *op, const0_rtx));
+       {
+         unsigned n = GET_MODE_NUNITS (vmode);
+         rtx *v = XALLOCAVEC (rtx, n);
+         v[0] = *op;
+         for (unsigned i = 1; i < n; ++i)
+           v[i] = const0_rtx;
+         vec_cst = gen_rtx_CONST_VECTOR (vmode, gen_rtvec_v (n, v));
+       }
 
-      if (!standard_sse_constant_p (vec_cst, V2DImode))
+      if (!standard_sse_constant_p (vec_cst, vmode))
        {
          start_sequence ();
-         vec_cst = validize_mem (force_const_mem (V2DImode, vec_cst));
+         vec_cst = validize_mem (force_const_mem (vmode, vec_cst));
          rtx_insn *seq = get_insns ();
          end_sequence ();
          emit_insn_before (seq, insn);
@@ -870,14 +939,14 @@ dimode_scalar_chain::convert_op (rtx *op
   else
     {
       gcc_assert (SUBREG_P (*op));
-      gcc_assert (GET_MODE (*op) == V2DImode);
+      gcc_assert (GET_MODE (*op) == vmode);
     }
 }
 
 /* Convert INSN to vector mode.  */
 
 void
-dimode_scalar_chain::convert_insn (rtx_insn *insn)
+general_scalar_chain::convert_insn (rtx_insn *insn)
 {
   rtx def_set = single_set (insn);
   rtx src = SET_SRC (def_set);
@@ -888,9 +957,9 @@ dimode_scalar_chain::convert_insn (rtx_i
     {
       /* There are no scalar integer instructions and therefore
         temporary register usage is required.  */
-      rtx tmp = gen_reg_rtx (DImode);
+      rtx tmp = gen_reg_rtx (GET_MODE (dst));
       emit_conversion_insns (gen_move_insn (dst, tmp), insn);
-      dst = gen_rtx_SUBREG (V2DImode, tmp, 0);
+      dst = gen_rtx_SUBREG (vmode, tmp, 0);
     }
 
   switch (GET_CODE (src))
@@ -899,7 +968,7 @@ dimode_scalar_chain::convert_insn (rtx_i
     case ASHIFTRT:
     case LSHIFTRT:
       convert_op (&XEXP (src, 0), insn);
-      PUT_MODE (src, V2DImode);
+      PUT_MODE (src, vmode);
       break;
 
     case PLUS:
@@ -907,25 +976,29 @@ dimode_scalar_chain::convert_insn (rtx_i
     case IOR:
     case XOR:
     case AND:
+    case SMAX:
+    case SMIN:
+    case UMAX:
+    case UMIN:
       convert_op (&XEXP (src, 0), insn);
       convert_op (&XEXP (src, 1), insn);
-      PUT_MODE (src, V2DImode);
+      PUT_MODE (src, vmode);
       break;
 
     case NEG:
       src = XEXP (src, 0);
       convert_op (&src, insn);
-      subreg = gen_reg_rtx (V2DImode);
-      emit_insn_before (gen_move_insn (subreg, CONST0_RTX (V2DImode)), insn);
-      src = gen_rtx_MINUS (V2DImode, subreg, src);
+      subreg = gen_reg_rtx (vmode);
+      emit_insn_before (gen_move_insn (subreg, CONST0_RTX (vmode)), insn);
+      src = gen_rtx_MINUS (vmode, subreg, src);
       break;
 
     case NOT:
       src = XEXP (src, 0);
       convert_op (&src, insn);
-      subreg = gen_reg_rtx (V2DImode);
-      emit_insn_before (gen_move_insn (subreg, CONSTM1_RTX (V2DImode)), insn);
-      src = gen_rtx_XOR (V2DImode, src, subreg);
+      subreg = gen_reg_rtx (vmode);
+      emit_insn_before (gen_move_insn (subreg, CONSTM1_RTX (vmode)), insn);
+      src = gen_rtx_XOR (vmode, src, subreg);
       break;
 
     case MEM:
@@ -939,17 +1012,17 @@ dimode_scalar_chain::convert_insn (rtx_i
       break;
 
     case SUBREG:
-      gcc_assert (GET_MODE (src) == V2DImode);
+      gcc_assert (GET_MODE (src) == vmode);
       break;
 
     case COMPARE:
       src = SUBREG_REG (XEXP (XEXP (src, 0), 0));
 
-      gcc_assert ((REG_P (src) && GET_MODE (src) == DImode)
-                 || (SUBREG_P (src) && GET_MODE (src) == V2DImode));
+      gcc_assert ((REG_P (src) && GET_MODE (src) == GET_MODE_INNER (vmode))
+                 || (SUBREG_P (src) && GET_MODE (src) == vmode));
 
       if (REG_P (src))
-       subreg = gen_rtx_SUBREG (V2DImode, src, 0);
+       subreg = gen_rtx_SUBREG (vmode, src, 0);
       else
        subreg = copy_rtx_if_shared (src);
       emit_insn_before (gen_vec_interleave_lowv2di (copy_rtx_if_shared (subreg),
@@ -977,7 +1050,9 @@ dimode_scalar_chain::convert_insn (rtx_i
   PATTERN (insn) = def_set;
 
   INSN_CODE (insn) = -1;
-  recog_memoized (insn);
+  int patt = recog_memoized (insn);
+  if (patt == -1)
+    fatal_insn_not_found (insn);
   df_insn_rescan (insn);
 }
 
@@ -1116,7 +1191,7 @@ timode_scalar_chain::convert_insn (rtx_i
 }
 
 void
-dimode_scalar_chain::convert_registers ()
+general_scalar_chain::convert_registers ()
 {
   bitmap_iterator bi;
   unsigned id;
@@ -1186,7 +1261,7 @@ has_non_address_hard_reg (rtx_insn *insn
                     (const_int 0 [0])))  */
 
 static bool
-convertible_comparison_p (rtx_insn *insn)
+convertible_comparison_p (rtx_insn *insn, enum machine_mode mode)
 {
   if (!TARGET_SSE4_1)
     return false;
@@ -1219,12 +1294,12 @@ convertible_comparison_p (rtx_insn *insn
 
   if (!SUBREG_P (op1)
       || !SUBREG_P (op2)
-      || GET_MODE (op1) != SImode
-      || GET_MODE (op2) != SImode
+      || GET_MODE (op1) != mode
+      || GET_MODE (op2) != mode
       || ((SUBREG_BYTE (op1) != 0
-          || SUBREG_BYTE (op2) != GET_MODE_SIZE (SImode))
+          || SUBREG_BYTE (op2) != GET_MODE_SIZE (mode))
          && (SUBREG_BYTE (op2) != 0
-             || SUBREG_BYTE (op1) != GET_MODE_SIZE (SImode))))
+             || SUBREG_BYTE (op1) != GET_MODE_SIZE (mode))))
     return false;
 
   op1 = SUBREG_REG (op1);
@@ -1232,7 +1307,7 @@ convertible_comparison_p (rtx_insn *insn
 
   if (op1 != op2
       || !REG_P (op1)
-      || GET_MODE (op1) != DImode)
+      || GET_MODE (op1) != GET_MODE_WIDER_MODE (mode).else_blk ())
     return false;
 
   return true;
@@ -1241,7 +1316,7 @@ convertible_comparison_p (rtx_insn *insn
 /* The DImode version of scalar_to_vector_candidate_p.  */
 
 static bool
-dimode_scalar_to_vector_candidate_p (rtx_insn *insn)
+general_scalar_to_vector_candidate_p (rtx_insn *insn, enum machine_mode mode)
 {
   rtx def_set = single_set (insn);
 
@@ -1255,12 +1330,12 @@ dimode_scalar_to_vector_candidate_p (rtx
   rtx dst = SET_DEST (def_set);
 
   if (GET_CODE (src) == COMPARE)
-    return convertible_comparison_p (insn);
+    return convertible_comparison_p (insn, mode);
 
   /* We are interested in DImode promotion only.  */
-  if ((GET_MODE (src) != DImode
+  if ((GET_MODE (src) != mode
        && !CONST_INT_P (src))
-      || GET_MODE (dst) != DImode)
+      || GET_MODE (dst) != mode)
     return false;
 
   if (!REG_P (dst) && !MEM_P (dst))
@@ -1280,6 +1355,15 @@ dimode_scalar_to_vector_candidate_p (rtx
        return false;
       break;
 
+    case SMAX:
+    case SMIN:
+    case UMAX:
+    case UMIN:
+      if ((mode == DImode && !TARGET_AVX512VL)
+         || (mode == SImode && !TARGET_SSE4_1))
+       return false;
+      /* Fallthru.  */
+
     case PLUS:
     case MINUS:
     case IOR:
@@ -1290,7 +1374,7 @@ dimode_scalar_to_vector_candidate_p (rtx
          && !CONST_INT_P (XEXP (src, 1)))
        return false;
 
-      if (GET_MODE (XEXP (src, 1)) != DImode
+      if (GET_MODE (XEXP (src, 1)) != mode
          && !CONST_INT_P (XEXP (src, 1)))
        return false;
       break;
@@ -1319,7 +1403,7 @@ dimode_scalar_to_vector_candidate_p (rtx
          || !REG_P (XEXP (XEXP (src, 0), 0))))
       return false;
 
-  if (GET_MODE (XEXP (src, 0)) != DImode
+  if (GET_MODE (XEXP (src, 0)) != mode
       && !CONST_INT_P (XEXP (src, 0)))
     return false;
 
@@ -1383,22 +1467,16 @@ timode_scalar_to_vector_candidate_p (rtx
   return false;
 }
 
-/* Return 1 if INSN may be converted into vector
-   instruction.  */
-
-static bool
-scalar_to_vector_candidate_p (rtx_insn *insn)
-{
-  if (TARGET_64BIT)
-    return timode_scalar_to_vector_candidate_p (insn);
-  else
-    return dimode_scalar_to_vector_candidate_p (insn);
-}
+/* For a given bitmap of insn UIDs scan all instructions and
+   remove an insn from CANDIDATES if it has both convertible
+   and not convertible definitions.
 
-/* The DImode version of remove_non_convertible_regs.  */
+   All insns in a bitmap are conversion candidates according to
+   the respective *_scalar_to_vector_candidate_p.  Currently it
+   implies all insns are single_set.  */
 
 static void
-dimode_remove_non_convertible_regs (bitmap candidates)
+general_remove_non_convertible_regs (bitmap candidates)
 {
   bitmap_iterator bi;
   unsigned id;
@@ -1553,23 +1631,6 @@ timode_remove_non_convertible_regs (bitm
   BITMAP_FREE (regs);
 }
 
-/* For a given bitmap of insn UIDs scans all instruction and
-   remove insn from CANDIDATES in case it has both convertible
-   and not convertible definitions.
-
-   All insns in a bitmap are conversion candidates according to
-   scalar_to_vector_candidate_p.  Currently it implies all insns
-   are single_set.  */
-
-static void
-remove_non_convertible_regs (bitmap candidates)
-{
-  if (TARGET_64BIT)
-    timode_remove_non_convertible_regs (candidates);
-  else
-    dimode_remove_non_convertible_regs (candidates);
-}
-
 /* Main STV pass function.  Find and convert scalar
    instructions into vector mode when profitable.  */
 
@@ -1577,11 +1638,14 @@ static unsigned int
 convert_scalars_to_vector ()
 {
   basic_block bb;
-  bitmap candidates;
   int converted_insns = 0;
 
   bitmap_obstack_initialize (NULL);
-  candidates = BITMAP_ALLOC (NULL);
+  const machine_mode cand_mode[3] = { SImode, DImode, TImode };
+  const machine_mode cand_vmode[3] = { V4SImode, V2DImode, V1TImode };
+  bitmap_head candidates[3];  /* { SImode, DImode, TImode } */
+  for (unsigned i = 0; i < 3; ++i)
+    bitmap_initialize (&candidates[i], &bitmap_default_obstack);
 
   calculate_dominance_info (CDI_DOMINATORS);
   df_set_flags (DF_DEFER_INSN_RESCAN);
@@ -1597,51 +1661,73 @@ convert_scalars_to_vector ()
     {
       rtx_insn *insn;
       FOR_BB_INSNS (bb, insn)
-       if (scalar_to_vector_candidate_p (insn))
+       if (TARGET_64BIT
+           && timode_scalar_to_vector_candidate_p (insn))
          {
            if (dump_file)
-             fprintf (dump_file, "  insn %d is marked as a candidate\n",
+             fprintf (dump_file, "  insn %d is marked as a TImode candidate\n",
                       INSN_UID (insn));
 
-           bitmap_set_bit (candidates, INSN_UID (insn));
+           bitmap_set_bit (&candidates[2], INSN_UID (insn));
+         }
+       else
+         {
+           /* Check {SI,DI}mode.  */
+           for (unsigned i = 0; i <= 1; ++i)
+             if (general_scalar_to_vector_candidate_p (insn, cand_mode[i]))
+               {
+                 if (dump_file)
+                   fprintf (dump_file, "  insn %d is marked as a %s candidate\n",
+                            INSN_UID (insn), i == 0 ? "SImode" : "DImode");
+
+                 bitmap_set_bit (&candidates[i], INSN_UID (insn));
+                 break;
+               }
          }
     }
 
-  remove_non_convertible_regs (candidates);
+  if (TARGET_64BIT)
+    timode_remove_non_convertible_regs (&candidates[2]);
+  for (unsigned i = 0; i <= 1; ++i)
+    general_remove_non_convertible_regs (&candidates[i]);
 
-  if (bitmap_empty_p (candidates))
-    if (dump_file)
+  for (unsigned i = 0; i <= 2; ++i)
+    if (!bitmap_empty_p (&candidates[i]))
+      break;
+    else if (i == 2 && dump_file)
       fprintf (dump_file, "There are no candidates for optimization.\n");
 
-  while (!bitmap_empty_p (candidates))
-    {
-      unsigned uid = bitmap_first_set_bit (candidates);
-      scalar_chain *chain;
+  for (unsigned i = 0; i <= 2; ++i)
+    while (!bitmap_empty_p (&candidates[i]))
+      {
+       unsigned uid = bitmap_first_set_bit (&candidates[i]);
+       scalar_chain *chain;
 
-      if (TARGET_64BIT)
-       chain = new timode_scalar_chain;
-      else
-       chain = new dimode_scalar_chain;
+       if (cand_mode[i] == TImode)
+         chain = new timode_scalar_chain;
+       else
+         chain = new general_scalar_chain (cand_mode[i], cand_vmode[i]);
 
-      /* Find instructions chain we want to convert to vector mode.
-        Check all uses and definitions to estimate all required
-        conversions.  */
-      chain->build (candidates, uid);
+       /* Find instructions chain we want to convert to vector mode.
+          Check all uses and definitions to estimate all required
+          conversions.  */
+       chain->build (&candidates[i], uid);
 
-      if (chain->compute_convert_gain () > 0)
-       converted_insns += chain->convert ();
-      else
-       if (dump_file)
-         fprintf (dump_file, "Chain #%d conversion is not profitable\n",
-                  chain->chain_id);
+       if (chain->compute_convert_gain () > 0)
+         converted_insns += chain->convert ();
+       else
+         if (dump_file)
+           fprintf (dump_file, "Chain #%d conversion is not profitable\n",
+                    chain->chain_id);
 
-      delete chain;
-    }
+       delete chain;
+      }
 
   if (dump_file)
     fprintf (dump_file, "Total insns converted: %d\n", converted_insns);
 
-  BITMAP_FREE (candidates);
+  for (unsigned i = 0; i <= 2; ++i)
+    bitmap_release (&candidates[i]);
   bitmap_obstack_release (NULL);
   df_process_deferred_rescans ();
 
Index: gcc/config/i386/i386-features.h
===================================================================
--- gcc/config/i386/i386-features.h     (revision 274111)
+++ gcc/config/i386/i386-features.h     (working copy)
@@ -127,11 +127,16 @@ namespace {
 class scalar_chain
 {
  public:
-  scalar_chain ();
+  scalar_chain (enum machine_mode, enum machine_mode);
   virtual ~scalar_chain ();
 
   static unsigned max_id;
 
+  /* Scalar mode.  */
+  enum machine_mode smode;
+  /* Vector mode.  */
+  enum machine_mode vmode;
+
   /* ID of a chain.  */
   unsigned int chain_id;
   /* A queue of instructions to be included into a chain.  */
@@ -159,9 +164,11 @@ class scalar_chain
   virtual void convert_registers () = 0;
 };
 
-class dimode_scalar_chain : public scalar_chain
+class general_scalar_chain : public scalar_chain
 {
  public:
+  general_scalar_chain (enum machine_mode smode_, enum machine_mode vmode_)
+    : scalar_chain (smode_, vmode_) {}
   int compute_convert_gain ();
  private:
   void mark_dual_mode_def (df_ref def);
@@ -178,6 +185,8 @@ class dimode_scalar_chain : public scala
 class timode_scalar_chain : public scalar_chain
 {
  public:
+  timode_scalar_chain () : scalar_chain (TImode, V1TImode) {}
+
   /* Convert from TImode to V1TImode is always faster.  */
   int compute_convert_gain () { return 1; }
 
Index: gcc/config/i386/i386.md
===================================================================
--- gcc/config/i386/i386.md     (revision 274111)
+++ gcc/config/i386/i386.md     (working copy)
@@ -17721,6 +17721,30 @@ (define_peephole2
     std::swap (operands[4], operands[5]);
 })
 
+;; min/max patterns
+
+(define_code_attr maxmin_rel
+  [(smax "ge") (smin "le") (umax "geu") (umin "leu")])
+(define_code_attr maxmin_cmpmode
+  [(smax "CCGC") (smin "CCGC") (umax "CC") (umin "CC")])
+
+(define_insn_and_split "<code><mode>3"
+  [(set (match_operand:SWI48 0 "register_operand")
+       (maxmin:SWI48 (match_operand:SWI48 1 "register_operand")
+                     (match_operand:SWI48 2 "register_operand")))
+   (clobber (reg:CC FLAGS_REG))]
+  "TARGET_STV && TARGET_SSE4_1
+   && can_create_pseudo_p ()"
+  "#"
+  "&& 1"
+  [(set (reg:<maxmin_cmpmode> FLAGS_REG)
+       (compare:<maxmin_cmpmode> (match_dup 1) (match_dup 2)))
+   (set (match_dup 0)
+       (if_then_else:SWI48
+         (<maxmin_rel> (reg:<maxmin_cmpmode> FLAGS_REG) (const_int 0))
+         (match_dup 1)
+         (match_dup 2)))])
+
 ;; Conditional addition patterns
 (define_expand "add<mode>cc"
   [(match_operand:SWI 0 "register_operand")
