On Mon, 5 Aug 2019, Uros Bizjak wrote:

> On Mon, Aug 5, 2019 at 3:29 PM Richard Biener <rguent...@suse.de> wrote:
> >
> > > > > > > > (define_mode_iterator MAXMIN_IMODE [SI "TARGET_SSE4_1"] [DI
> > > > > > > > "TARGET_AVX512F"])
> > > > > > > >
> > > > > > > > and then we need to split DImode for 32bits, too.
> > > > > > >
> > > > > > > For now, please add "TARGET_64BIT && TARGET_AVX512F" for DImode
> > > > > > > condition, I'll provide _doubleword splitter later.
> > > > > >
> > > > > > Shouldn't that be TARGET_AVX512VL instead?  Or does the insn use
> > > > > > %g0 etc. to force use of %zmmN?
> > > > >
> > > > > It generates V4SI mode, so - yes, AVX512VL.
> > > >
> > > >     case SMAX:
> > > >     case SMIN:
> > > >     case UMAX:
> > > >     case UMIN:
> > > >       if ((mode == DImode && (!TARGET_64BIT || !TARGET_AVX512VL))
> > > >           || (mode == SImode && !TARGET_SSE4_1))
> > > >         return false;
> > > >
> > > > so there's no way to use AVX512VL for 32bit?
> > >
> > > There is a way, but on 32bit targets we need to split the DImode
> > > operation into a sequence of SImode operations for the unconverted
> > > pattern.  This is of course doable, but somewhat more complex than
> > > simply emitting a DImode compare + DImode cmove, which is what the
> > > current splitter does.  So, a follow-up task.
> >
> > Ah, OK.  So for the above condition we can elide the !TARGET_64BIT
> > check; we just need to properly split if we enable the scalar minmax
> > pattern for DImode on 32bits, and then the STV conversion would go
> > fine.
>
> Yes, that is correct.
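
For reference, the MAXMIN_IMODE iterator quoted at the top is shorthand;
in a machine description the per-mode conditions would go inside a single
bracketed list, roughly along the lines of the following sketch
(hypothetical, with the TARGET_AVX512VL condition settled on above rather
than TARGET_AVX512F):

  ;; Hypothetical sketch only, not part of the patch below.
  (define_mode_iterator MAXMIN_IMODE
    [(SI "TARGET_SSE4_1") (DI "TARGET_AVX512VL")])

The patch below does not add such an iterator; it guards the modes in
general_scalar_to_vector_candidate_p instead.
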
So I tested the patch below (now with an appropriate ChangeLog) on
x86_64-unknown-linux-gnu.  I've thrown it at SPEC CPU 2006, with the
obvious hmmer improvement; I'm now re-checking with a 3-run those
benchmarks that may have off-noise results (more than +-1 second of
difference in the 1-run).

As-is the patch likely runs into the splitting issue for DImode on
i?86, and it still misses functional testcases.  I'll do the hmmer
loop with both DImode and SImode, plus testcases to trigger all
pattern variants with the different ISAs we have.

Some of the patch could be split out (for example the cost changes,
which are also effective for DImode).  AFAICS we could go with adding
only SImode, avoiding the DImode splitting issue, and that would
already solve the hmmer regression.

Thanks,
Richard.

2019-08-07  Richard Biener  <rguent...@suse.de>

	PR target/91154
	* config/i386/i386-features.h (scalar_chain::scalar_chain): Add
	mode arguments.
	(scalar_chain::smode): New member.
	(scalar_chain::vmode): Likewise.
	(dimode_scalar_chain): Rename to...
	(general_scalar_chain): ... this.
	(general_scalar_chain::general_scalar_chain): Take mode arguments.
	(timode_scalar_chain::timode_scalar_chain): Initialize scalar_chain
	base with TImode and V1TImode.
	* config/i386/i386-features.c (scalar_chain::scalar_chain): Adjust.
	(general_scalar_chain::vector_const_cost): Adjust for SImode chains.
	(general_scalar_chain::compute_convert_gain): Likewise.  Fix
	reg-reg move cost gain, use ix86_cost->sse_op cost and adjust
	scalar costs.  Add {S,U}{MIN,MAX} support.  Dump per-instruction
	gain if not zero.
	(general_scalar_chain::replace_with_subreg): Use vmode/smode.
	(general_scalar_chain::make_vector_copies): Likewise.  Handle
	non-DImode chains appropriately.
	(general_scalar_chain::convert_reg): Likewise.
	(general_scalar_chain::convert_op): Likewise.
	(general_scalar_chain::convert_insn): Likewise.  Add
	fatal_insn_not_found if the result is not recognized.
	(convertible_comparison_p): Pass in the scalar mode and use that.
	(general_scalar_to_vector_candidate_p): Likewise.  Rename from
	dimode_scalar_to_vector_candidate_p.  Add {S,U}{MIN,MAX} support.
	(scalar_to_vector_candidate_p): Remove by inlining into single
	caller.
	(general_remove_non_convertible_regs): Rename from
	dimode_remove_non_convertible_regs.
	(remove_non_convertible_regs): Remove by inlining into single
	caller.
	(convert_scalars_to_vector): Handle SImode and DImode chains in
	addition to TImode chains.
	* config/i386/i386.md (<maxmin><SWI48>3): New insn split after STV.

Index: gcc/config/i386/i386-features.c
===================================================================
--- gcc/config/i386/i386-features.c	(revision 274111)
+++ gcc/config/i386/i386-features.c	(working copy)
@@ -276,8 +276,11 @@ unsigned scalar_chain::max_id = 0;
 
 /* Initialize new chain.  */
 
-scalar_chain::scalar_chain ()
+scalar_chain::scalar_chain (enum machine_mode smode_, enum machine_mode vmode_)
 {
+  smode = smode_;
+  vmode = vmode_;
+
   chain_id = ++max_id;
 
   if (dump_file)
@@ -319,7 +322,7 @@ scalar_chain::add_to_queue (unsigned ins
    conversion.  */
 
 void
-dimode_scalar_chain::mark_dual_mode_def (df_ref def)
+general_scalar_chain::mark_dual_mode_def (df_ref def)
 {
   gcc_assert (DF_REF_REG_DEF_P (def));
 
@@ -409,6 +412,9 @@ scalar_chain::add_insn (bitmap candidate
       && !HARD_REGISTER_P (SET_DEST (def_set)))
     bitmap_set_bit (defs, REGNO (SET_DEST (def_set)));
 
+  /* ??? The following is quadratic since analyze_register_chain
+     iterates over all refs to look for dual-mode regs.  Instead this
+     should be done separately for all regs mentioned in the chain once.  */
   df_ref ref;
   df_ref def;
   for (ref = DF_INSN_UID_DEFS (insn_uid); ref; ref = DF_REF_NEXT_LOC (ref))
@@ -469,19 +475,21 @@ scalar_chain::build (bitmap candidates,
    instead of using a scalar one.  */
 
 int
-dimode_scalar_chain::vector_const_cost (rtx exp)
+general_scalar_chain::vector_const_cost (rtx exp)
 {
   gcc_assert (CONST_INT_P (exp));
 
-  if (standard_sse_constant_p (exp, V2DImode))
-    return COSTS_N_INSNS (1);
-  return ix86_cost->sse_load[1];
+  if (standard_sse_constant_p (exp, vmode))
+    return ix86_cost->sse_op;
+  /* We have separate costs for SImode and DImode, use SImode costs
+     for smaller modes.  */
+  return ix86_cost->sse_load[smode == DImode ? 1 : 0];
 }
 
 /* Compute a gain for chain conversion.  */
 
 int
-dimode_scalar_chain::compute_convert_gain ()
+general_scalar_chain::compute_convert_gain ()
 {
   bitmap_iterator bi;
   unsigned insn_uid;
@@ -491,28 +499,37 @@ dimode_scalar_chain::compute_convert_gai
   if (dump_file)
     fprintf (dump_file, "Computing gain for chain #%d...\n", chain_id);
 
+  /* SSE costs distinguish between SImode and DImode loads/stores, for
+     int costs factor in the number of GPRs involved.  When supporting
+     smaller modes than SImode the int load/store costs need to be
+     adjusted as well.  */
+  unsigned sse_cost_idx = smode == DImode ? 1 : 0;
+  unsigned m = smode == DImode ? (TARGET_64BIT ? 1 : 2) : 1;
+
   EXECUTE_IF_SET_IN_BITMAP (insns, 0, insn_uid, bi)
     {
       rtx_insn *insn = DF_INSN_UID_GET (insn_uid)->insn;
       rtx def_set = single_set (insn);
       rtx src = SET_SRC (def_set);
       rtx dst = SET_DEST (def_set);
+      int igain = 0;
 
       if (REG_P (src) && REG_P (dst))
-	gain += COSTS_N_INSNS (2) - ix86_cost->xmm_move;
+	igain += 2 * m - ix86_cost->xmm_move;
       else if (REG_P (src) && MEM_P (dst))
-	gain += 2 * ix86_cost->int_store[2] - ix86_cost->sse_store[1];
+	igain
+	  += m * ix86_cost->int_store[2] - ix86_cost->sse_store[sse_cost_idx];
      else if (MEM_P (src) && REG_P (dst))
-	gain += 2 * ix86_cost->int_load[2] - ix86_cost->sse_load[1];
+	igain += m * ix86_cost->int_load[2] - ix86_cost->sse_load[sse_cost_idx];
       else if (GET_CODE (src) == ASHIFT
	       || GET_CODE (src) == ASHIFTRT
	       || GET_CODE (src) == LSHIFTRT)
	{
	  if (CONST_INT_P (XEXP (src, 0)))
-	    gain -= vector_const_cost (XEXP (src, 0));
-	  gain += ix86_cost->shift_const;
+	    igain -= vector_const_cost (XEXP (src, 0));
+	  igain += m * ix86_cost->shift_const - ix86_cost->sse_op;
	  if (INTVAL (XEXP (src, 1)) >= 32)
-	    gain -= COSTS_N_INSNS (1);
+	    igain -= COSTS_N_INSNS (1);
	}
       else if (GET_CODE (src) == PLUS
	       || GET_CODE (src) == MINUS
@@ -520,20 +537,31 @@ dimode_scalar_chain::compute_convert_gai
	       || GET_CODE (src) == XOR
	       || GET_CODE (src) == AND)
	{
-	  gain += ix86_cost->add;
+	  igain += m * ix86_cost->add - ix86_cost->sse_op;
	  /* Additional gain for andnot for targets without BMI.  */
	  if (GET_CODE (XEXP (src, 0)) == NOT
	      && !TARGET_BMI)
-	    gain += 2 * ix86_cost->add;
+	    igain += m * ix86_cost->add;
 
	  if (CONST_INT_P (XEXP (src, 0)))
-	    gain -= vector_const_cost (XEXP (src, 0));
+	    igain -= vector_const_cost (XEXP (src, 0));
	  if (CONST_INT_P (XEXP (src, 1)))
-	    gain -= vector_const_cost (XEXP (src, 1));
+	    igain -= vector_const_cost (XEXP (src, 1));
	}
       else if (GET_CODE (src) == NEG
	       || GET_CODE (src) == NOT)
-	gain += ix86_cost->add - COSTS_N_INSNS (1);
+	igain += m * ix86_cost->add - ix86_cost->sse_op;
+      else if (GET_CODE (src) == SMAX
+	       || GET_CODE (src) == SMIN
+	       || GET_CODE (src) == UMAX
+	       || GET_CODE (src) == UMIN)
+	{
+	  /* We do not have any conditional move cost, estimate it as a
+	     reg-reg move.  Comparisons are costed as adds.  */
+	  igain += m * (COSTS_N_INSNS (2) + ix86_cost->add);
+	  /* Integer SSE ops are all costed the same.  */
+	  igain -= ix86_cost->sse_op;
+	}
       else if (GET_CODE (src) == COMPARE)
	{
	  /* Assume comparison cost is the same.  */
	}
       else if (CONST_INT_P (src))
	{
	  if (REG_P (dst))
-	    gain += COSTS_N_INSNS (2);
+	    /* DImode can be immediate for TARGET_64BIT and SImode always.  */
+	    igain += COSTS_N_INSNS (m);
	  else if (MEM_P (dst))
-	    gain += 2 * ix86_cost->int_store[2] - ix86_cost->sse_store[1];
-	  gain -= vector_const_cost (src);
+	    igain += (m * ix86_cost->int_store[2]
+		      - ix86_cost->sse_store[sse_cost_idx]);
+	  igain -= vector_const_cost (src);
	}
       else
	gcc_unreachable ();
+
+      if (igain != 0 && dump_file)
+	{
+	  fprintf (dump_file, "  Instruction gain %d for ", igain);
+	  dump_insn_slim (dump_file, insn);
+	}
+      gain += igain;
     }
 
   if (dump_file)
     fprintf (dump_file, "  Instruction conversion gain: %d\n", gain);
 
+  /* ??? What about integer to SSE?  */
   EXECUTE_IF_SET_IN_BITMAP (defs_conv, 0, insn_uid, bi)
     cost += DF_REG_DEF_COUNT (insn_uid) * ix86_cost->sse_to_integer;
 
@@ -570,10 +608,10 @@ dimode_scalar_chain::compute_convert_gai
 /* Replace REG in X with a V2DI subreg of NEW_REG.  */
 
 rtx
-dimode_scalar_chain::replace_with_subreg (rtx x, rtx reg, rtx new_reg)
+general_scalar_chain::replace_with_subreg (rtx x, rtx reg, rtx new_reg)
 {
   if (x == reg)
-    return gen_rtx_SUBREG (V2DImode, new_reg, 0);
+    return gen_rtx_SUBREG (vmode, new_reg, 0);
 
   const char *fmt = GET_RTX_FORMAT (GET_CODE (x));
   int i, j;
@@ -593,7 +631,7 @@ dimode_scalar_chain::replace_with_subreg
 /* Replace REG in INSN with a V2DI subreg of NEW_REG.  */
 
 void
-dimode_scalar_chain::replace_with_subreg_in_insn (rtx_insn *insn,
+general_scalar_chain::replace_with_subreg_in_insn (rtx_insn *insn,
						   rtx reg, rtx new_reg)
 {
   replace_with_subreg (single_set (insn), reg, new_reg);
@@ -624,10 +662,10 @@ scalar_chain::emit_conversion_insns (rtx
    and replace its uses in a chain.  */
 
 void
-dimode_scalar_chain::make_vector_copies (unsigned regno)
+general_scalar_chain::make_vector_copies (unsigned regno)
 {
   rtx reg = regno_reg_rtx[regno];
-  rtx vreg = gen_reg_rtx (DImode);
+  rtx vreg = gen_reg_rtx (smode);
 
   df_ref ref;
   for (ref = DF_REG_DEF_CHAIN (regno); ref; ref = DF_REF_NEXT_REG (ref))
@@ -636,37 +674,47 @@ dimode_scalar_chain::make_vector_copies
	start_sequence ();
	if (!TARGET_INTER_UNIT_MOVES_TO_VEC)
	  {
-	    rtx tmp = assign_386_stack_local (DImode, SLOT_STV_TEMP);
-	    emit_move_insn (adjust_address (tmp, SImode, 0),
-			    gen_rtx_SUBREG (SImode, reg, 0));
-	    emit_move_insn (adjust_address (tmp, SImode, 4),
-			    gen_rtx_SUBREG (SImode, reg, 4));
+	    rtx tmp = assign_386_stack_local (smode, SLOT_STV_TEMP);
+	    if (smode == DImode && !TARGET_64BIT)
+	      {
+		emit_move_insn (adjust_address (tmp, SImode, 0),
+				gen_rtx_SUBREG (SImode, reg, 0));
+		emit_move_insn (adjust_address (tmp, SImode, 4),
+				gen_rtx_SUBREG (SImode, reg, 4));
+	      }
+	    else
+	      emit_move_insn (tmp, reg);
	    emit_move_insn (vreg, tmp);
	  }
-	else if (TARGET_SSE4_1)
+	else if (!TARGET_64BIT && smode == DImode)
	  {
-	    emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0),
-					CONST0_RTX (V4SImode),
-					gen_rtx_SUBREG (SImode, reg, 0)));
-	    emit_insn (gen_sse4_1_pinsrd (gen_rtx_SUBREG (V4SImode, vreg, 0),
-					  gen_rtx_SUBREG (V4SImode, vreg, 0),
-					  gen_rtx_SUBREG (SImode, reg, 4),
-					  GEN_INT (2)));
+	    if (TARGET_SSE4_1)
+	      {
+		emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0),
					    CONST0_RTX (V4SImode),
					    gen_rtx_SUBREG (SImode, reg, 0)));
+		emit_insn (gen_sse4_1_pinsrd (gen_rtx_SUBREG (V4SImode, vreg, 0),
					      gen_rtx_SUBREG (V4SImode, vreg, 0),
					      gen_rtx_SUBREG (SImode, reg, 4),
					      GEN_INT (2)));
+	      }
+	    else
+	      {
+		rtx tmp = gen_reg_rtx (DImode);
+		emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0),
					    CONST0_RTX (V4SImode),
					    gen_rtx_SUBREG (SImode, reg, 0)));
+		emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, tmp, 0),
					    CONST0_RTX (V4SImode),
					    gen_rtx_SUBREG (SImode, reg, 4)));
+		emit_insn (gen_vec_interleave_lowv4si
			   (gen_rtx_SUBREG (V4SImode, vreg, 0),
			    gen_rtx_SUBREG (V4SImode, vreg, 0),
			    gen_rtx_SUBREG (V4SImode, tmp, 0)));
+	      }
	  }
	else
-	  {
-	    rtx tmp = gen_reg_rtx (DImode);
-	    emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0),
-					CONST0_RTX (V4SImode),
-					gen_rtx_SUBREG (SImode, reg, 0)));
-	    emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, tmp, 0),
-					CONST0_RTX (V4SImode),
-					gen_rtx_SUBREG (SImode, reg, 4)));
-	    emit_insn (gen_vec_interleave_lowv4si
-		       (gen_rtx_SUBREG (V4SImode, vreg, 0),
-			gen_rtx_SUBREG (V4SImode, vreg, 0),
-			gen_rtx_SUBREG (V4SImode, tmp, 0)));
-	  }
+	  emit_move_insn (gen_lowpart (smode, vreg), reg);
	rtx_insn *seq = get_insns ();
	end_sequence ();
	rtx_insn *insn = DF_REF_INSN (ref);
@@ -695,7 +743,7 @@ dimode_scalar_chain::make_vector_copies
    in case register is used in not convertible insn.  */
 
 void
-dimode_scalar_chain::convert_reg (unsigned regno)
+general_scalar_chain::convert_reg (unsigned regno)
 {
   bool scalar_copy = bitmap_bit_p (defs_conv, regno);
   rtx reg = regno_reg_rtx[regno];
@@ -707,7 +755,7 @@ dimode_scalar_chain::convert_reg (unsign
   bitmap_copy (conv, insns);
 
   if (scalar_copy)
-    scopy = gen_reg_rtx (DImode);
+    scopy = gen_reg_rtx (smode);
 
   for (ref = DF_REG_DEF_CHAIN (regno); ref; ref = DF_REF_NEXT_REG (ref))
     {
@@ -727,40 +775,55 @@ dimode_scalar_chain::convert_reg (unsign
	  start_sequence ();
	  if (!TARGET_INTER_UNIT_MOVES_FROM_VEC)
	    {
-	      rtx tmp = assign_386_stack_local (DImode, SLOT_STV_TEMP);
+	      rtx tmp = assign_386_stack_local (smode, SLOT_STV_TEMP);
	      emit_move_insn (tmp, reg);
-	      emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0),
-			      adjust_address (tmp, SImode, 0));
-	      emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4),
-			      adjust_address (tmp, SImode, 4));
+	      if (!TARGET_64BIT && smode == DImode)
+		{
+		  emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0),
+				  adjust_address (tmp, SImode, 0));
+		  emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4),
+				  adjust_address (tmp, SImode, 4));
+		}
+	      else
+		emit_move_insn (scopy, tmp);
	    }
-	  else if (TARGET_SSE4_1)
+	  else if (!TARGET_64BIT && smode == DImode)
	    {
-	      rtx tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const0_rtx));
-	      emit_insn
-		(gen_rtx_SET
-		 (gen_rtx_SUBREG (SImode, scopy, 0),
-		  gen_rtx_VEC_SELECT (SImode,
-				      gen_rtx_SUBREG (V4SImode, reg, 0), tmp)));
-
-	      tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const1_rtx));
-	      emit_insn
-		(gen_rtx_SET
-		 (gen_rtx_SUBREG (SImode, scopy, 4),
-		  gen_rtx_VEC_SELECT (SImode,
-				      gen_rtx_SUBREG (V4SImode, reg, 0), tmp)));
+	      if (TARGET_SSE4_1)
+		{
+		  rtx tmp = gen_rtx_PARALLEL (VOIDmode,
					      gen_rtvec (1, const0_rtx));
+		  emit_insn
+		    (gen_rtx_SET
+		     (gen_rtx_SUBREG (SImode, scopy, 0),
+		      gen_rtx_VEC_SELECT (SImode,
					  gen_rtx_SUBREG (V4SImode, reg, 0),
					  tmp)));
+
+		  tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const1_rtx));
+		  emit_insn
+		    (gen_rtx_SET
+		     (gen_rtx_SUBREG (SImode, scopy, 4),
+		      gen_rtx_VEC_SELECT (SImode,
					  gen_rtx_SUBREG (V4SImode, reg, 0),
					  tmp)));
+		}
+	      else
+		{
+		  rtx vcopy = gen_reg_rtx (V2DImode);
+		  emit_move_insn (vcopy, gen_rtx_SUBREG (V2DImode, reg, 0));
+		  emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0),
				  gen_rtx_SUBREG (SImode, vcopy, 0));
+		  emit_move_insn (vcopy,
				  gen_rtx_LSHIFTRT (V2DImode,
						    vcopy, GEN_INT (32)));
+		  emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4),
				  gen_rtx_SUBREG (SImode, vcopy, 0));
+		}
	    }
	  else
-	    {
-	      rtx vcopy = gen_reg_rtx (V2DImode);
-	      emit_move_insn (vcopy, gen_rtx_SUBREG (V2DImode, reg, 0));
-	      emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0),
-			      gen_rtx_SUBREG (SImode, vcopy, 0));
-	      emit_move_insn (vcopy,
-			      gen_rtx_LSHIFTRT (V2DImode, vcopy, GEN_INT (32)));
-	      emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4),
-			      gen_rtx_SUBREG (SImode, vcopy, 0));
-	    }
+	    emit_move_insn (scopy, reg);
+
	  rtx_insn *seq = get_insns ();
	  end_sequence ();
	  emit_conversion_insns (seq, insn);
@@ -809,21 +872,21 @@ dimode_scalar_chain::convert_reg (unsign
    registers conversion.  */
 
 void
-dimode_scalar_chain::convert_op (rtx *op, rtx_insn *insn)
+general_scalar_chain::convert_op (rtx *op, rtx_insn *insn)
 {
   *op = copy_rtx_if_shared (*op);
 
   if (GET_CODE (*op) == NOT)
     {
       convert_op (&XEXP (*op, 0), insn);
-      PUT_MODE (*op, V2DImode);
+      PUT_MODE (*op, vmode);
     }
   else if (MEM_P (*op))
     {
-      rtx tmp = gen_reg_rtx (DImode);
+      rtx tmp = gen_reg_rtx (GET_MODE (*op));
 
       emit_insn_before (gen_move_insn (tmp, *op), insn);
-      *op = gen_rtx_SUBREG (V2DImode, tmp, 0);
+      *op = gen_rtx_SUBREG (vmode, tmp, 0);
 
       if (dump_file)
	fprintf (dump_file, "  Preloading operand for insn %d into r%d\n",
@@ -841,24 +904,30 @@ dimode_scalar_chain::convert_op (rtx *op
	    gcc_assert (!DF_REF_CHAIN (ref));
	    break;
	  }
-      *op = gen_rtx_SUBREG (V2DImode, *op, 0);
+      *op = gen_rtx_SUBREG (vmode, *op, 0);
     }
   else if (CONST_INT_P (*op))
     {
       rtx vec_cst;
-      rtx tmp = gen_rtx_SUBREG (V2DImode, gen_reg_rtx (DImode), 0);
+      rtx tmp = gen_rtx_SUBREG (vmode, gen_reg_rtx (smode), 0);
 
       /* Prefer all ones vector in case of -1.  */
       if (constm1_operand (*op, GET_MODE (*op)))
-	vec_cst = CONSTM1_RTX (V2DImode);
+	vec_cst = CONSTM1_RTX (vmode);
       else
-	vec_cst = gen_rtx_CONST_VECTOR (V2DImode,
-					gen_rtvec (2, *op, const0_rtx));
+	{
+	  unsigned n = GET_MODE_NUNITS (vmode);
+	  rtx *v = XALLOCAVEC (rtx, n);
+	  v[0] = *op;
+	  for (unsigned i = 1; i < n; ++i)
+	    v[i] = const0_rtx;
+	  vec_cst = gen_rtx_CONST_VECTOR (vmode, gen_rtvec_v (n, v));
+	}
 
-      if (!standard_sse_constant_p (vec_cst, V2DImode))
+      if (!standard_sse_constant_p (vec_cst, vmode))
	{
	  start_sequence ();
-	  vec_cst = validize_mem (force_const_mem (V2DImode, vec_cst));
+	  vec_cst = validize_mem (force_const_mem (vmode, vec_cst));
	  rtx_insn *seq = get_insns ();
	  end_sequence ();
	  emit_insn_before (seq, insn);
@@ -870,14 +939,14 @@ dimode_scalar_chain::convert_op (rtx *op
   else
     {
       gcc_assert (SUBREG_P (*op));
-      gcc_assert (GET_MODE (*op) == V2DImode);
+      gcc_assert (GET_MODE (*op) == vmode);
     }
 }
 
 /* Convert INSN to vector mode.  */
 
 void
-dimode_scalar_chain::convert_insn (rtx_insn *insn)
+general_scalar_chain::convert_insn (rtx_insn *insn)
 {
   rtx def_set = single_set (insn);
   rtx src = SET_SRC (def_set);
@@ -888,9 +957,9 @@ dimode_scalar_chain::convert_insn (rtx_i
     {
       /* There are no scalar integer instructions and therefore
	 temporary register usage is required.  */
-      rtx tmp = gen_reg_rtx (DImode);
+      rtx tmp = gen_reg_rtx (GET_MODE (dst));
       emit_conversion_insns (gen_move_insn (dst, tmp), insn);
-      dst = gen_rtx_SUBREG (V2DImode, tmp, 0);
+      dst = gen_rtx_SUBREG (vmode, tmp, 0);
     }
 
   switch (GET_CODE (src))
     {
     case ASHIFT:
     case ASHIFTRT:
     case LSHIFTRT:
       convert_op (&XEXP (src, 0), insn);
-      PUT_MODE (src, V2DImode);
+      PUT_MODE (src, vmode);
       break;
 
     case PLUS:
@@ -907,25 +976,29 @@ dimode_scalar_chain::convert_insn (rtx_i
     case IOR:
     case XOR:
     case AND:
+    case SMAX:
+    case SMIN:
+    case UMAX:
+    case UMIN:
       convert_op (&XEXP (src, 0), insn);
       convert_op (&XEXP (src, 1), insn);
-      PUT_MODE (src, V2DImode);
+      PUT_MODE (src, vmode);
       break;
 
     case NEG:
       src = XEXP (src, 0);
       convert_op (&src, insn);
-      subreg = gen_reg_rtx (V2DImode);
-      emit_insn_before (gen_move_insn (subreg, CONST0_RTX (V2DImode)), insn);
-      src = gen_rtx_MINUS (V2DImode, subreg, src);
+      subreg = gen_reg_rtx (vmode);
+      emit_insn_before (gen_move_insn (subreg, CONST0_RTX (vmode)), insn);
+      src = gen_rtx_MINUS (vmode, subreg, src);
       break;
 
     case NOT:
       src = XEXP (src, 0);
       convert_op (&src, insn);
-      subreg = gen_reg_rtx (V2DImode);
-      emit_insn_before (gen_move_insn (subreg, CONSTM1_RTX (V2DImode)), insn);
-      src = gen_rtx_XOR (V2DImode, src, subreg);
+      subreg = gen_reg_rtx (vmode);
+      emit_insn_before (gen_move_insn (subreg, CONSTM1_RTX (vmode)), insn);
+      src = gen_rtx_XOR (vmode, src, subreg);
       break;
 
     case MEM:
@@ -939,17 +1012,17 @@ dimode_scalar_chain::convert_insn (rtx_i
       break;
 
     case SUBREG:
-      gcc_assert (GET_MODE (src) == V2DImode);
+      gcc_assert (GET_MODE (src) == vmode);
       break;
 
     case COMPARE:
       src = SUBREG_REG (XEXP (XEXP (src, 0), 0));
 
-      gcc_assert ((REG_P (src) && GET_MODE (src) == DImode)
-		  || (SUBREG_P (src) && GET_MODE (src) == V2DImode));
+      gcc_assert ((REG_P (src) && GET_MODE (src) == GET_MODE_INNER (vmode))
+		  || (SUBREG_P (src) && GET_MODE (src) == vmode));
 
       if (REG_P (src))
-	subreg = gen_rtx_SUBREG (V2DImode, src, 0);
+	subreg = gen_rtx_SUBREG (vmode, src, 0);
       else
	subreg = copy_rtx_if_shared (src);
       emit_insn_before (gen_vec_interleave_lowv2di (copy_rtx_if_shared (subreg),
@@ -977,7 +1050,9 @@ dimode_scalar_chain::convert_insn (rtx_i
   PATTERN (insn) = def_set;
 
   INSN_CODE (insn) = -1;
-  recog_memoized (insn);
+  int patt = recog_memoized (insn);
+  if (patt == -1)
+    fatal_insn_not_found (insn);
   df_insn_rescan (insn);
 }
 
@@ -1116,7 +1191,7 @@ timode_scalar_chain::convert_insn (rtx_i
 }
 
 void
-dimode_scalar_chain::convert_registers ()
+general_scalar_chain::convert_registers ()
 {
   bitmap_iterator bi;
   unsigned id;
 
@@ -1186,7 +1261,7 @@ has_non_address_hard_reg (rtx_insn *insn
	 (const_int 0 [0])))  */
 
 static bool
-convertible_comparison_p (rtx_insn *insn)
+convertible_comparison_p (rtx_insn *insn, enum machine_mode mode)
 {
   if (!TARGET_SSE4_1)
     return false;
 
@@ -1219,12 +1294,12 @@ convertible_comparison_p (rtx_insn *insn
 
   if (!SUBREG_P (op1)
       || !SUBREG_P (op2)
-      || GET_MODE (op1) != SImode
-      || GET_MODE (op2) != SImode
+      || GET_MODE (op1) != mode
+      || GET_MODE (op2) != mode
       || ((SUBREG_BYTE (op1) != 0
-	   || SUBREG_BYTE (op2) != GET_MODE_SIZE (SImode))
+	   || SUBREG_BYTE (op2) != GET_MODE_SIZE (mode))
	  && (SUBREG_BYTE (op2) != 0
-	      || SUBREG_BYTE (op1) != GET_MODE_SIZE (SImode))))
+	      || SUBREG_BYTE (op1) != GET_MODE_SIZE (mode))))
     return false;
 
   op1 = SUBREG_REG (op1);
@@ -1232,7 +1307,7 @@ convertible_comparison_p (rtx_insn *insn
 
   if (op1 != op2
       || !REG_P (op1)
-      || GET_MODE (op1) != DImode)
+      || GET_MODE (op1) != GET_MODE_WIDER_MODE (mode).else_blk ())
     return false;
 
   return true;
@@ -1241,7 +1316,7 @@ convertible_comparison_p (rtx_insn *insn
 /* The DImode version of scalar_to_vector_candidate_p.  */
 
 static bool
-dimode_scalar_to_vector_candidate_p (rtx_insn *insn)
+general_scalar_to_vector_candidate_p (rtx_insn *insn, enum machine_mode mode)
 {
   rtx def_set = single_set (insn);
 
@@ -1255,12 +1330,12 @@ dimode_scalar_to_vector_candidate_p (rtx
   rtx dst = SET_DEST (def_set);
 
   if (GET_CODE (src) == COMPARE)
-    return convertible_comparison_p (insn);
+    return convertible_comparison_p (insn, mode);
 
   /* We are interested in DImode promotion only.  */
-  if ((GET_MODE (src) != DImode
+  if ((GET_MODE (src) != mode
        && !CONST_INT_P (src))
-      || GET_MODE (dst) != DImode)
+      || GET_MODE (dst) != mode)
     return false;
 
   if (!REG_P (dst) && !MEM_P (dst))
@@ -1280,6 +1355,15 @@ dimode_scalar_to_vector_candidate_p (rtx
	  return false;
       break;
 
+    case SMAX:
+    case SMIN:
+    case UMAX:
+    case UMIN:
+      if ((mode == DImode && !TARGET_AVX512VL)
+	  || (mode == SImode && !TARGET_SSE4_1))
+	return false;
+      /* Fallthru.  */
+
     case PLUS:
     case MINUS:
     case IOR:
     case XOR:
     case AND:
@@ -1290,7 +1374,7 @@ dimode_scalar_to_vector_candidate_p (rtx
	  && !CONST_INT_P (XEXP (src, 1)))
	return false;
 
-      if (GET_MODE (XEXP (src, 1)) != DImode
+      if (GET_MODE (XEXP (src, 1)) != mode
	  && !CONST_INT_P (XEXP (src, 1)))
	return false;
       break;
 
@@ -1319,7 +1403,7 @@ dimode_scalar_to_vector_candidate_p (rtx
	      || !REG_P (XEXP (XEXP (src, 0), 0))))
	  return false;
 
-      if (GET_MODE (XEXP (src, 0)) != DImode
+      if (GET_MODE (XEXP (src, 0)) != mode
	  && !CONST_INT_P (XEXP (src, 0)))
	return false;
 
@@ -1383,22 +1467,16 @@ timode_scalar_to_vector_candidate_p (rtx
   return false;
 }
 
-/* Return 1 if INSN may be converted into vector
-   instruction.  */
-
-static bool
-scalar_to_vector_candidate_p (rtx_insn *insn)
-{
-  if (TARGET_64BIT)
-    return timode_scalar_to_vector_candidate_p (insn);
-  else
-    return dimode_scalar_to_vector_candidate_p (insn);
-}
+/* For a given bitmap of insn UIDs scans all instruction and
+   remove insn from CANDIDATES in case it has both convertible
+   and not convertible definitions.
 
-/* The DImode version of remove_non_convertible_regs.  */
+   All insns in a bitmap are conversion candidates according to
+   scalar_to_vector_candidate_p.  Currently it implies all insns
+   are single_set.  */
 
 static void
-dimode_remove_non_convertible_regs (bitmap candidates)
+general_remove_non_convertible_regs (bitmap candidates)
 {
   bitmap_iterator bi;
   unsigned id;
@@ -1553,23 +1631,6 @@ timode_remove_non_convertible_regs (bitm
   BITMAP_FREE (regs);
 }
 
-/* For a given bitmap of insn UIDs scans all instruction and
-   remove insn from CANDIDATES in case it has both convertible
-   and not convertible definitions.
-
-   All insns in a bitmap are conversion candidates according to
-   scalar_to_vector_candidate_p.  Currently it implies all insns
-   are single_set.  */
-
-static void
-remove_non_convertible_regs (bitmap candidates)
-{
-  if (TARGET_64BIT)
-    timode_remove_non_convertible_regs (candidates);
-  else
-    dimode_remove_non_convertible_regs (candidates);
-}
-
 /* Main STV pass function.  Find and convert scalar
    instructions into vector mode when profitable.  */
 
@@ -1577,11 +1638,14 @@ static unsigned int
 convert_scalars_to_vector ()
 {
   basic_block bb;
-  bitmap candidates;
   int converted_insns = 0;
 
   bitmap_obstack_initialize (NULL);
-  candidates = BITMAP_ALLOC (NULL);
+  const machine_mode cand_mode[3] = { SImode, DImode, TImode };
+  const machine_mode cand_vmode[3] = { V4SImode, V2DImode, V1TImode };
+  bitmap_head candidates[3];  /* { SImode, DImode, TImode } */
+  for (unsigned i = 0; i < 3; ++i)
+    bitmap_initialize (&candidates[i], &bitmap_default_obstack);
 
   calculate_dominance_info (CDI_DOMINATORS);
 
   df_set_flags (DF_DEFER_INSN_RESCAN);
@@ -1597,51 +1661,73 @@ convert_scalars_to_vector ()
     {
       rtx_insn *insn;
       FOR_BB_INSNS (bb, insn)
-	if (scalar_to_vector_candidate_p (insn))
+	if (TARGET_64BIT
+	    && timode_scalar_to_vector_candidate_p (insn))
	  {
	    if (dump_file)
-	      fprintf (dump_file, "  insn %d is marked as a candidate\n",
+	      fprintf (dump_file, "  insn %d is marked as a TImode candidate\n",
		       INSN_UID (insn));
 
-	    bitmap_set_bit (candidates, INSN_UID (insn));
+	    bitmap_set_bit (&candidates[2], INSN_UID (insn));
+	  }
+	else
+	  {
+	    /* Check {SI,DI}mode.  */
+	    for (unsigned i = 0; i <= 1; ++i)
+	      if (general_scalar_to_vector_candidate_p (insn, cand_mode[i]))
+		{
+		  if (dump_file)
+		    fprintf (dump_file, "  insn %d is marked as a %s candidate\n",
+			     INSN_UID (insn), i == 0 ? "SImode" : "DImode");
+
+		  bitmap_set_bit (&candidates[i], INSN_UID (insn));
+		  break;
+		}
	  }
     }
 
-  remove_non_convertible_regs (candidates);
+  if (TARGET_64BIT)
+    timode_remove_non_convertible_regs (&candidates[2]);
+  for (unsigned i = 0; i <= 1; ++i)
+    general_remove_non_convertible_regs (&candidates[i]);
 
-  if (bitmap_empty_p (candidates))
-    if (dump_file)
+  for (unsigned i = 0; i <= 2; ++i)
+    if (!bitmap_empty_p (&candidates[i]))
+      break;
+    else if (i == 2 && dump_file)
       fprintf (dump_file, "There are no candidates for optimization.\n");
 
-  while (!bitmap_empty_p (candidates))
-    {
-      unsigned uid = bitmap_first_set_bit (candidates);
-      scalar_chain *chain;
+  for (unsigned i = 0; i <= 2; ++i)
+    while (!bitmap_empty_p (&candidates[i]))
+      {
+	unsigned uid = bitmap_first_set_bit (&candidates[i]);
+	scalar_chain *chain;
 
-      if (TARGET_64BIT)
-	chain = new timode_scalar_chain;
-      else
-	chain = new dimode_scalar_chain;
+	if (cand_mode[i] == TImode)
+	  chain = new timode_scalar_chain;
+	else
+	  chain = new general_scalar_chain (cand_mode[i], cand_vmode[i]);
 
-      /* Find instructions chain we want to convert to vector mode.
-	 Check all uses and definitions to estimate all required
-	 conversions.  */
-      chain->build (candidates, uid);
+	/* Find instructions chain we want to convert to vector mode.
+	   Check all uses and definitions to estimate all required
+	   conversions.  */
+	chain->build (&candidates[i], uid);
 
-      if (chain->compute_convert_gain () > 0)
-	converted_insns += chain->convert ();
-      else
-	if (dump_file)
-	  fprintf (dump_file, "Chain #%d conversion is not profitable\n",
-		   chain->chain_id);
+	if (chain->compute_convert_gain () > 0)
+	  converted_insns += chain->convert ();
+	else
+	  if (dump_file)
+	    fprintf (dump_file, "Chain #%d conversion is not profitable\n",
+		     chain->chain_id);
 
-      delete chain;
-    }
+	delete chain;
+      }
 
   if (dump_file)
     fprintf (dump_file, "Total insns converted: %d\n", converted_insns);
 
-  BITMAP_FREE (candidates);
+  for (unsigned i = 0; i <= 2; ++i)
+    bitmap_release (&candidates[i]);
   bitmap_obstack_release (NULL);
 
   df_process_deferred_rescans ();
Index: gcc/config/i386/i386-features.h
===================================================================
--- gcc/config/i386/i386-features.h	(revision 274111)
+++ gcc/config/i386/i386-features.h	(working copy)
@@ -127,11 +127,16 @@ namespace {
 class scalar_chain
 {
  public:
-  scalar_chain ();
+  scalar_chain (enum machine_mode, enum machine_mode);
   virtual ~scalar_chain ();
 
   static unsigned max_id;
 
+  /* Scalar mode.  */
+  enum machine_mode smode;
+  /* Vector mode.  */
+  enum machine_mode vmode;
+
   /* ID of a chain.  */
   unsigned int chain_id;
   /* A queue of instructions to be included into a chain.  */
@@ -159,9 +164,11 @@ class scalar_chain
   virtual void convert_registers () = 0;
 };
 
-class dimode_scalar_chain : public scalar_chain
+class general_scalar_chain : public scalar_chain
 {
  public:
+  general_scalar_chain (enum machine_mode smode_, enum machine_mode vmode_)
+    : scalar_chain (smode_, vmode_) {}
   int compute_convert_gain ();
 private:
   void mark_dual_mode_def (df_ref def);
@@ -178,6 +185,8 @@ class dimode_scalar_chain : public scala
 class timode_scalar_chain : public scalar_chain
 {
  public:
+  timode_scalar_chain () : scalar_chain (TImode, V1TImode) {}
+
   /* Convert from TImode to V1TImode is always faster.  */
   int compute_convert_gain () { return 1; }
Index: gcc/config/i386/i386.md
===================================================================
--- gcc/config/i386/i386.md	(revision 274111)
+++ gcc/config/i386/i386.md	(working copy)
@@ -17721,6 +17721,30 @@ (define_peephole2
   std::swap (operands[4], operands[5]);
 })
 
+;; min/max patterns
+
+(define_code_attr maxmin_rel
+  [(smax "ge") (smin "le") (umax "geu") (umin "leu")])
+(define_code_attr maxmin_cmpmode
+  [(smax "CCGC") (smin "CCGC") (umax "CC") (umin "CC")])
+
+(define_insn_and_split "<code><mode>3"
+  [(set (match_operand:SWI48 0 "register_operand")
+	(maxmin:SWI48 (match_operand:SWI48 1 "register_operand")
+		      (match_operand:SWI48 2 "register_operand")))
+   (clobber (reg:CC FLAGS_REG))]
+  "TARGET_STV && TARGET_SSE4_1
+   && can_create_pseudo_p ()"
+  "#"
+  "&& 1"
+  [(set (reg:<maxmin_cmpmode> FLAGS_REG)
+	(compare:<maxmin_cmpmode> (match_dup 1) (match_dup 2)))
+   (set (match_dup 0)
+	(if_then_else:SWI48
+	  (<maxmin_rel> (reg:<maxmin_cmpmode> FLAGS_REG) (const_int 0))
+	  (match_dup 1)
+	  (match_dup 2)))])
+
 ;; Conditional addition patterns
 (define_expand "add<mode>cc"
   [(match_operand:SWI 0 "register_operand")
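
For illustration, the functional testcases still to be written would be
reduced forms of the hmmer kernel from PR target/91154; a sketch of what
the SImode variant might look like (names made up, only a sketch, not
part of the patch):

  /* Hypothetical reduced testcase, not part of the patch.  The two
     conditional stores should become SImode MAX_EXPRs, expand via the
     new smaxsi3 insn_and_split above, and then be converted to pmaxsd
     by STV with -O2 -msse4.1.  */
  void
  foo (int *dc, int *mc, int *tpdd, int *tpmd, int M)
  {
    for (int k = 1; k <= M; k++)
      {
	dc[k] = dc[k - 1] + tpdd[k - 1];
	int sc = mc[k - 1] + tpmd[k - 1];
	if (sc > dc[k])
	  dc[k] = sc;			/* dc[k] = max (dc[k], sc) */
	if (dc[k] < -987654321)
	  dc[k] = -987654321;		/* dc[k] = max (dc[k], -987654321) */
      }
  }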