On Sun, 4 Aug 2019, Uros Bizjak wrote:

> On Sat, Aug 3, 2019 at 7:26 PM Richard Biener <rguent...@suse.de> wrote:
> >
> > On Thu, 1 Aug 2019, Uros Bizjak wrote:
> >
> > > On Thu, Aug 1, 2019 at 11:28 AM Richard Biener <rguent...@suse.de> wrote:
> > >
> > >>>> So you unconditionally add a smaxdi3 pattern - indeed this looks
> > >>>> necessary even when going the STV route.  The actual regression
> > >>>> for the testcase could also be solved by turning the smaxsi3
> > >>>> back into a compare and jump rather than a conditional move
> > >>>> sequence.  So I wonder how you'd do that given that there's
> > >>>> pass_if_after_reload after pass_split_after_reload and I'm not
> > >>>> sure we can split as late as pass_split_before_sched2 (there's
> > >>>> also a split _after_ sched2 on x86 it seems).
> > >>>>
> > >>>> So how would you implement {s,u}{min,max}{si,di}3 for the
> > >>>> case where STV doesn't end up doing any transform?
> > >>>
> > >>> If STV doesn't transform the insn, then a pre-reload splitter splits
> > >>> the insn back to compare+cmove.
> > >>
> > >> OK, that would work.  But there's no way to force a jumpy sequence
> > >> then, which we know is faster than compare+cmove, because later RTL
> > >> if-conversion passes happily re-discover the smax (or conditional
> > >> move) sequence.
> > >>
> > >>> However, considering the SImode move from/to int/xmm register is
> > >>> relatively cheap, the cost function should be tuned so that STV
> > >>> always converts the smaxsi3 pattern.
> > >>
> > >> Note that on both Zen and even more so bdverN the int/xmm transition
> > >> makes it no longer profitable but a _lot_ slower than the cmp/cmov
> > >> sequence... (for the loop in hmmer, which is the only one where I see
> > >> any effect from any of my patches).  So identifying chains that
> > >> start/end in memory is important for cost reasons.
> > >
> > > Please note that the cost function also considers the cost of moves
> > > from/to xmm.  So, the cost of the whole chain would disable the
> > > transformation.
> > >
> > >> So I think the splitting has to happen after the last if-conversion
> > >> pass (and thus we may need to allocate a scratch register for this
> > >> purpose?)
> > >
> > > I really hope that the underlying issue will be solved by a
> > > machine-dependent pass inserted somewhere after the pre-reload split.
> > > This way, we can split unconverted smax to the cmove, and this later
> > > pass would handle jcc and cmove instructions.  Until then... yes, your
> > > proposed approach is one of the ways to avoid unwanted if-conversion,
> > > although sometimes we would like to split to cmove instead.
> >
> > So the following makes STV also consider SImode chains, re-using the
> > DImode chain code.  I've kept a simple incomplete smaxsi3 pattern
> > and also did not alter the {SI,DI}mode chain cost function - it's
> > quite off for TARGET_64BIT.  With this I get the expected conversion
> > for the testcase derived from hmmer.
> >
> > No further testing so far.
> >
> > Is it OK to re-use the DImode chain code this way?  I'll clean things
> > up some more of course.
>
> Yes, the approach looks OK to me.  It makes chain building mode
> agnostic, and the chain building can be used for
> a) DImode x86_32 (as is now), but maybe a 64-bit minmax operation can
> be added.
> b) SImode x86_32 and x86_64 (this will be mainly used for SImode
> minmax and surrounding SImode operations)
> c) DImode x86_64 (also mainly used for DImode minmax and surrounding
> DImode operations)
>
> > Still need help with the actual patterns for minmax and what the
> > splitters should look like.
>
> Please look at the attached patch.  Maybe we can add memory_operand as
> the operand 1 and operand 2 predicate, but let's keep things simple for
> now.
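[For concreteness, the two scalar lowerings being compared in the
discussion above look roughly like this - an illustrative sketch only,
not part of the patch; the asm assumes the x86-64 SysV ABI:

/* The scalar signed max STV looks for.  */
int
smax (int a, int b)
{
  return a > b ? a : b;
}

/* compare+cmove form (what the proposed pre-reload splitter emits):
	movl	%esi, %eax
	cmpl	%esi, %edi
	cmovge	%edi, %eax

   jumpy form (faster for the hmmer loop, but later RTL if-conversion
   happily turns it back into the cmove sequence):
	movl	%edi, %eax
	cmpl	%esi, %edi
	jge	.L1
	movl	%esi, %eax
   .L1:  */
]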
Thanks.  The attached patch cleans things up and survives "some"
bare-bones testing.  It also touches the cost function to avoid being
overly trigger-happy.  I've also ended up using ix86_cost->sse_op
instead of COSTS_N_INSNS-based magic.  In particular we estimated a
GPR reg-reg move as COSTS_N_INSNS (2), while move costs shouldn't be
wrapped in COSTS_N_INSNS at all.

IMHO we should probably disregard any reg-reg moves for costing
pre-RA.  At least with the current code every reg-reg move biases in
favor of SSE...

And we're simply adding move and non-move costs in 'gain', somewhat
mixing apples and oranges?  We could separate those and require both
to be a net positive win?

Still, using -mtune=bdverN exposes that some cost tables have xmm and
gpr costs as apples and oranges... (so it never triggers for
Bulldozer).

I now run into

/space/rguenther/src/svn/trunk-bisect/libgcc/libgcov-driver.c:509:1: error: unrecognizable insn:
(insn 116 115 1511 8 (set (subreg:V2DI (reg/v:DI 87 [ run_max ]) 0)
        (smax:V2DI (subreg:V2DI (reg/v:DI 87 [ run_max ]) 0)
            (subreg:V2DI (reg:DI 349 [ MEM[base: _261, offset: 0B] ]) 0))) -1
     (expr_list:REG_DEAD (reg:DI 349 [ MEM[base: _261, offset: 0B] ])
        (expr_list:REG_UNUSED (reg:CC 17 flags)
            (nil))))
during RTL pass: stv

where even with -mavx2 we do not have s{min,max}v2di3.  We do have an
expander here, but it seems only AVX512F has the DImode min/max ops.
I have adjusted dimode_scalar_to_vector_candidate_p accordingly.

I'm considering renaming the
dimode_{scalar_to_vector_candidate_p,remove_non_convertible_regs}
functions to drop the dimode_ prefix - is that OK, or do you prefer
some other prefix?

So - bootstrap with --with-arch=skylake is in progress.  It detects
quite a few chains (unsurprisingly), so I guess we need to address
compile-time issues in the pass before enabling this enhancement
(maybe as a followup?).

Further comments on the actual patch are welcome; I consider it
"finished" if testing reveals no issues.  The ChangeLog still needs
to be written and testcases added.

Thanks,
Richard.

Index: gcc/config/i386/i386-features.c
===================================================================
--- gcc/config/i386/i386-features.c	(revision 274111)
+++ gcc/config/i386/i386-features.c	(working copy)
@@ -276,8 +276,11 @@ unsigned scalar_chain::max_id = 0;
 
 /* Initialize new chain.  */
 
-scalar_chain::scalar_chain ()
+scalar_chain::scalar_chain (enum machine_mode smode_, enum machine_mode vmode_)
 {
+  smode = smode_;
+  vmode = vmode_;
+
   chain_id = ++max_id;
 
   if (dump_file)
@@ -409,6 +412,9 @@ scalar_chain::add_insn (bitmap candidate
       && !HARD_REGISTER_P (SET_DEST (def_set)))
     bitmap_set_bit (defs, REGNO (SET_DEST (def_set)));
 
+  /* ??? The following is quadratic since analyze_register_chain
+     iterates over all refs to look for dual-mode regs.  Instead this
+     should be done separately for all regs mentioned in the chain once.  */
  df_ref ref;
  df_ref def;
  for (ref = DF_INSN_UID_DEFS (insn_uid); ref; ref = DF_REF_NEXT_LOC (ref))
@@ -473,9 +479,11 @@ dimode_scalar_chain::vector_const_cost (
 {
   gcc_assert (CONST_INT_P (exp));
 
-  if (standard_sse_constant_p (exp, V2DImode))
-    return COSTS_N_INSNS (1);
-  return ix86_cost->sse_load[1];
+  if (standard_sse_constant_p (exp, vmode))
+    return ix86_cost->sse_op;
+  /* We have separate costs for SImode and DImode, use SImode costs
+     for smaller modes.  */
+  return ix86_cost->sse_load[smode == DImode ? 1 : 0];
 }
 
 /* Compute a gain for chain conversion.  */
@@ -491,28 +499,37 @@ dimode_scalar_chain::compute_convert_gai
   if (dump_file)
     fprintf (dump_file, "Computing gain for chain #%d...\n", chain_id);
 
+  /* SSE costs distinguish between SImode and DImode loads/stores, for
+     int costs factor in the number of GPRs involved.  When supporting
+     smaller modes than SImode the int load/store costs need to be
+     adjusted as well.  */
+  unsigned sse_cost_idx = smode == DImode ? 1 : 0;
+  unsigned m = smode == DImode ? (TARGET_64BIT ? 1 : 2) : 1;
+
   EXECUTE_IF_SET_IN_BITMAP (insns, 0, insn_uid, bi)
     {
       rtx_insn *insn = DF_INSN_UID_GET (insn_uid)->insn;
       rtx def_set = single_set (insn);
       rtx src = SET_SRC (def_set);
       rtx dst = SET_DEST (def_set);
+      int igain = 0;
 
       if (REG_P (src) && REG_P (dst))
-	gain += COSTS_N_INSNS (2) - ix86_cost->xmm_move;
+	igain += 2 * m - ix86_cost->xmm_move;
       else if (REG_P (src) && MEM_P (dst))
-	gain += 2 * ix86_cost->int_store[2] - ix86_cost->sse_store[1];
+	igain
+	  += m * ix86_cost->int_store[2] - ix86_cost->sse_store[sse_cost_idx];
       else if (MEM_P (src) && REG_P (dst))
-	gain += 2 * ix86_cost->int_load[2] - ix86_cost->sse_load[1];
+	igain += m * ix86_cost->int_load[2] - ix86_cost->sse_load[sse_cost_idx];
       else if (GET_CODE (src) == ASHIFT
	       || GET_CODE (src) == ASHIFTRT
	       || GET_CODE (src) == LSHIFTRT)
	{
	  if (CONST_INT_P (XEXP (src, 0)))
-	    gain -= vector_const_cost (XEXP (src, 0));
-	  gain += ix86_cost->shift_const;
+	    igain -= vector_const_cost (XEXP (src, 0));
+	  igain += m * ix86_cost->shift_const - ix86_cost->sse_op;
	  if (INTVAL (XEXP (src, 1)) >= 32)
-	    gain -= COSTS_N_INSNS (1);
+	    igain -= COSTS_N_INSNS (1);
	}
       else if (GET_CODE (src) == PLUS
	       || GET_CODE (src) == MINUS
@@ -520,20 +537,31 @@ dimode_scalar_chain::compute_convert_gai
	       || GET_CODE (src) == XOR
	       || GET_CODE (src) == AND)
	{
-	  gain += ix86_cost->add;
+	  igain += m * ix86_cost->add - ix86_cost->sse_op;
	  /* Additional gain for andnot for targets without BMI.  */
	  if (GET_CODE (XEXP (src, 0)) == NOT
	      && !TARGET_BMI)
-	    gain += 2 * ix86_cost->add;
+	    igain += m * ix86_cost->add;
 
	  if (CONST_INT_P (XEXP (src, 0)))
-	    gain -= vector_const_cost (XEXP (src, 0));
+	    igain -= vector_const_cost (XEXP (src, 0));
	  if (CONST_INT_P (XEXP (src, 1)))
-	    gain -= vector_const_cost (XEXP (src, 1));
+	    igain -= vector_const_cost (XEXP (src, 1));
	}
       else if (GET_CODE (src) == NEG
	       || GET_CODE (src) == NOT)
-	gain += ix86_cost->add - COSTS_N_INSNS (1);
+	igain += m * ix86_cost->add - ix86_cost->sse_op;
+      else if (GET_CODE (src) == SMAX
+	       || GET_CODE (src) == SMIN
+	       || GET_CODE (src) == UMAX
+	       || GET_CODE (src) == UMIN)
+	{
+	  /* We do not have any conditional move cost, estimate it as a
+	     reg-reg move.  Comparisons are costed as adds.  */
+	  igain += m * (COSTS_N_INSNS (2) + ix86_cost->add);
+	  /* Integer SSE ops are all costed the same.  */
+	  igain -= ix86_cost->sse_op;
+	}
       else if (GET_CODE (src) == COMPARE)
	{
	  /* Assume comparison cost is the same.  */
@@ -541,18 +569,28 @@ dimode_scalar_chain::compute_convert_gai
       else if (CONST_INT_P (src))
	{
	  if (REG_P (dst))
-	    gain += COSTS_N_INSNS (2);
+	    /* DImode can be immediate for TARGET_64BIT and SImode always.  */
+	    igain += COSTS_N_INSNS (m);
	  else if (MEM_P (dst))
-	    gain += 2 * ix86_cost->int_store[2] - ix86_cost->sse_store[1];
-	  gain -= vector_const_cost (src);
+	    igain += (m * ix86_cost->int_store[2]
+		      - ix86_cost->sse_store[sse_cost_idx]);
+	  igain -= vector_const_cost (src);
	}
       else
	gcc_unreachable ();
+
+      if (igain != 0 && dump_file)
+	{
+	  fprintf (dump_file, "  Instruction gain %d for ", igain);
+	  dump_insn_slim (dump_file, insn);
+	}
+      gain += igain;
     }
 
   if (dump_file)
     fprintf (dump_file, "  Instruction conversion gain: %d\n", gain);
 
+  /* ??? What about integer to SSE?  */
   EXECUTE_IF_SET_IN_BITMAP (defs_conv, 0, insn_uid, bi)
     cost += DF_REG_DEF_COUNT (insn_uid) * ix86_cost->sse_to_integer;
 
@@ -573,7 +611,7 @@ rtx
 dimode_scalar_chain::replace_with_subreg (rtx x, rtx reg, rtx new_reg)
 {
   if (x == reg)
-    return gen_rtx_SUBREG (V2DImode, new_reg, 0);
+    return gen_rtx_SUBREG (vmode, new_reg, 0);
 
   const char *fmt = GET_RTX_FORMAT (GET_CODE (x));
   int i, j;
@@ -636,37 +674,47 @@ dimode_scalar_chain::make_vector_copies
	start_sequence ();
	if (!TARGET_INTER_UNIT_MOVES_TO_VEC)
	  {
-	    rtx tmp = assign_386_stack_local (DImode, SLOT_STV_TEMP);
-	    emit_move_insn (adjust_address (tmp, SImode, 0),
-			    gen_rtx_SUBREG (SImode, reg, 0));
-	    emit_move_insn (adjust_address (tmp, SImode, 4),
-			    gen_rtx_SUBREG (SImode, reg, 4));
+	    rtx tmp = assign_386_stack_local (smode, SLOT_STV_TEMP);
+	    if (smode == DImode && !TARGET_64BIT)
+	      {
+		emit_move_insn (adjust_address (tmp, SImode, 0),
+				gen_rtx_SUBREG (SImode, reg, 0));
+		emit_move_insn (adjust_address (tmp, SImode, 4),
+				gen_rtx_SUBREG (SImode, reg, 4));
+	      }
+	    else
+	      emit_move_insn (tmp, reg);
	    emit_move_insn (vreg, tmp);
	  }
-	else if (TARGET_SSE4_1)
+	else if (!TARGET_64BIT && smode == DImode)
	  {
-	    emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0),
-					CONST0_RTX (V4SImode),
-					gen_rtx_SUBREG (SImode, reg, 0)));
-	    emit_insn (gen_sse4_1_pinsrd (gen_rtx_SUBREG (V4SImode, vreg, 0),
-					  gen_rtx_SUBREG (V4SImode, vreg, 0),
-					  gen_rtx_SUBREG (SImode, reg, 4),
-					  GEN_INT (2)));
+	    if (TARGET_SSE4_1)
+	      {
+		emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0),
+					    CONST0_RTX (V4SImode),
+					    gen_rtx_SUBREG (SImode, reg, 0)));
+		emit_insn (gen_sse4_1_pinsrd (gen_rtx_SUBREG (V4SImode, vreg, 0),
+					      gen_rtx_SUBREG (V4SImode, vreg, 0),
+					      gen_rtx_SUBREG (SImode, reg, 4),
+					      GEN_INT (2)));
+	      }
+	    else
+	      {
+		rtx tmp = gen_reg_rtx (DImode);
+		emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0),
+					    CONST0_RTX (V4SImode),
+					    gen_rtx_SUBREG (SImode, reg, 0)));
+		emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, tmp, 0),
+					    CONST0_RTX (V4SImode),
+					    gen_rtx_SUBREG (SImode, reg, 4)));
+		emit_insn (gen_vec_interleave_lowv4si
+			   (gen_rtx_SUBREG (V4SImode, vreg, 0),
+			    gen_rtx_SUBREG (V4SImode, vreg, 0),
+			    gen_rtx_SUBREG (V4SImode, tmp, 0)));
+	      }
	  }
	else
-	  {
-	    rtx tmp = gen_reg_rtx (DImode);
-	    emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0),
-					CONST0_RTX (V4SImode),
-					gen_rtx_SUBREG (SImode, reg, 0)));
-	    emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, tmp, 0),
-					CONST0_RTX (V4SImode),
-					gen_rtx_SUBREG (SImode, reg, 4)));
-	    emit_insn (gen_vec_interleave_lowv4si
-		       (gen_rtx_SUBREG (V4SImode, vreg, 0),
-			gen_rtx_SUBREG (V4SImode, vreg, 0),
-			gen_rtx_SUBREG (V4SImode, tmp, 0)));
-	  }
+	  emit_move_insn (gen_lowpart (smode, vreg), reg);
	rtx_insn *seq = get_insns ();
	end_sequence ();
	rtx_insn *insn = DF_REF_INSN (ref);
@@ -707,7 +755,7 @@ dimode_scalar_chain::convert_reg (unsign
   bitmap_copy (conv, insns);
 
   if (scalar_copy)
-    scopy = gen_reg_rtx (DImode);
+    scopy = gen_reg_rtx (smode);
 
   for (ref = DF_REG_DEF_CHAIN (regno); ref; ref = DF_REF_NEXT_REG (ref))
     {
@@ -727,40 +775,55 @@ dimode_scalar_chain::convert_reg (unsign
	  start_sequence ();
	  if (!TARGET_INTER_UNIT_MOVES_FROM_VEC)
	    {
-	      rtx tmp = assign_386_stack_local (DImode, SLOT_STV_TEMP);
+	      rtx tmp = assign_386_stack_local (smode, SLOT_STV_TEMP);
	      emit_move_insn (tmp, reg);
-	      emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0),
-			      adjust_address (tmp, SImode, 0));
-	      emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4),
-			      adjust_address (tmp, SImode, 4));
+	      if (!TARGET_64BIT && smode == DImode)
+		{
+		  emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0),
+				  adjust_address (tmp, SImode, 0));
+		  emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4),
+				  adjust_address (tmp, SImode, 4));
+		}
+	      else
+		emit_move_insn (scopy, tmp);
	    }
-	  else if (TARGET_SSE4_1)
+	  else if (!TARGET_64BIT && smode == DImode)
	    {
-	      rtx tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const0_rtx));
-	      emit_insn
-		(gen_rtx_SET
-		 (gen_rtx_SUBREG (SImode, scopy, 0),
-		  gen_rtx_VEC_SELECT (SImode,
-				      gen_rtx_SUBREG (V4SImode, reg, 0), tmp)));
-
-	      tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const1_rtx));
-	      emit_insn
-		(gen_rtx_SET
-		 (gen_rtx_SUBREG (SImode, scopy, 4),
-		  gen_rtx_VEC_SELECT (SImode,
-				      gen_rtx_SUBREG (V4SImode, reg, 0), tmp)));
+	      if (TARGET_SSE4_1)
+		{
+		  rtx tmp = gen_rtx_PARALLEL (VOIDmode,
+					      gen_rtvec (1, const0_rtx));
+		  emit_insn
+		    (gen_rtx_SET
+		     (gen_rtx_SUBREG (SImode, scopy, 0),
+		      gen_rtx_VEC_SELECT (SImode,
+					  gen_rtx_SUBREG (V4SImode, reg, 0),
+					  tmp)));
+
+		  tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const1_rtx));
+		  emit_insn
+		    (gen_rtx_SET
+		     (gen_rtx_SUBREG (SImode, scopy, 4),
+		      gen_rtx_VEC_SELECT (SImode,
+					  gen_rtx_SUBREG (V4SImode, reg, 0),
+					  tmp)));
+		}
+	      else
+		{
+		  rtx vcopy = gen_reg_rtx (V2DImode);
+		  emit_move_insn (vcopy, gen_rtx_SUBREG (V2DImode, reg, 0));
+		  emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0),
+				  gen_rtx_SUBREG (SImode, vcopy, 0));
+		  emit_move_insn (vcopy,
+				  gen_rtx_LSHIFTRT (V2DImode,
+						    vcopy, GEN_INT (32)));
+		  emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4),
+				  gen_rtx_SUBREG (SImode, vcopy, 0));
+		}
	    }
	  else
-	    {
-	      rtx vcopy = gen_reg_rtx (V2DImode);
-	      emit_move_insn (vcopy, gen_rtx_SUBREG (V2DImode, reg, 0));
-	      emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0),
-			      gen_rtx_SUBREG (SImode, vcopy, 0));
-	      emit_move_insn (vcopy,
-			      gen_rtx_LSHIFTRT (V2DImode, vcopy, GEN_INT (32)));
-	      emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4),
-			      gen_rtx_SUBREG (SImode, vcopy, 0));
-	    }
+	    emit_move_insn (scopy, reg);
+
	  rtx_insn *seq = get_insns ();
	  end_sequence ();
	  emit_conversion_insns (seq, insn);
@@ -816,14 +879,14 @@ dimode_scalar_chain::convert_op (rtx *op
   if (GET_CODE (*op) == NOT)
     {
       convert_op (&XEXP (*op, 0), insn);
-      PUT_MODE (*op, V2DImode);
+      PUT_MODE (*op, vmode);
     }
   else if (MEM_P (*op))
     {
-      rtx tmp = gen_reg_rtx (DImode);
+      rtx tmp = gen_reg_rtx (GET_MODE (*op));
 
       emit_insn_before (gen_move_insn (tmp, *op), insn);
-      *op = gen_rtx_SUBREG (V2DImode, tmp, 0);
+      *op = gen_rtx_SUBREG (vmode, tmp, 0);
 
       if (dump_file)
	fprintf (dump_file, "  Preloading operand for insn %d into r%d\n",
@@ -841,24 +904,30 @@ dimode_scalar_chain::convert_op (rtx *op
	    gcc_assert (!DF_REF_CHAIN (ref));
	    break;
	  }
-      *op = gen_rtx_SUBREG (V2DImode, *op, 0);
+      *op = gen_rtx_SUBREG (vmode, *op, 0);
     }
   else if (CONST_INT_P (*op))
     {
       rtx vec_cst;
-      rtx tmp = gen_rtx_SUBREG (V2DImode, gen_reg_rtx (DImode), 0);
+      rtx tmp = gen_rtx_SUBREG (vmode, gen_reg_rtx (smode), 0);
 
       /* Prefer all ones vector in case of -1.  */
       if (constm1_operand (*op, GET_MODE (*op)))
-	vec_cst = CONSTM1_RTX (V2DImode);
+	vec_cst = CONSTM1_RTX (vmode);
       else
-	vec_cst = gen_rtx_CONST_VECTOR (V2DImode,
-					gen_rtvec (2, *op, const0_rtx));
+	{
+	  unsigned n = GET_MODE_NUNITS (vmode);
+	  rtx *v = XALLOCAVEC (rtx, n);
+	  v[0] = *op;
+	  for (unsigned i = 1; i < n; ++i)
+	    v[i] = const0_rtx;
+	  vec_cst = gen_rtx_CONST_VECTOR (vmode, gen_rtvec_v (n, v));
+	}
 
-      if (!standard_sse_constant_p (vec_cst, V2DImode))
+      if (!standard_sse_constant_p (vec_cst, vmode))
	{
	  start_sequence ();
-	  vec_cst = validize_mem (force_const_mem (V2DImode, vec_cst));
+	  vec_cst = validize_mem (force_const_mem (vmode, vec_cst));
	  rtx_insn *seq = get_insns ();
	  end_sequence ();
	  emit_insn_before (seq, insn);
@@ -870,7 +939,7 @@ dimode_scalar_chain::convert_op (rtx *op
   else
     {
       gcc_assert (SUBREG_P (*op));
-      gcc_assert (GET_MODE (*op) == V2DImode);
+      gcc_assert (GET_MODE (*op) == vmode);
     }
 }
 
@@ -888,9 +957,9 @@ dimode_scalar_chain::convert_insn (rtx_i
     {
       /* There are no scalar integer instructions and therefore
	 temporary register usage is required.  */
-      rtx tmp = gen_reg_rtx (DImode);
+      rtx tmp = gen_reg_rtx (GET_MODE (dst));
       emit_conversion_insns (gen_move_insn (dst, tmp), insn);
-      dst = gen_rtx_SUBREG (V2DImode, tmp, 0);
+      dst = gen_rtx_SUBREG (vmode, tmp, 0);
     }
 
   switch (GET_CODE (src))
@@ -899,7 +968,7 @@ dimode_scalar_chain::convert_insn (rtx_i
     case ASHIFTRT:
     case LSHIFTRT:
       convert_op (&XEXP (src, 0), insn);
-      PUT_MODE (src, V2DImode);
+      PUT_MODE (src, vmode);
       break;
 
     case PLUS:
@@ -907,25 +976,29 @@ dimode_scalar_chain::convert_insn (rtx_i
     case IOR:
     case XOR:
     case AND:
+    case SMAX:
+    case SMIN:
+    case UMAX:
+    case UMIN:
       convert_op (&XEXP (src, 0), insn);
       convert_op (&XEXP (src, 1), insn);
-      PUT_MODE (src, V2DImode);
+      PUT_MODE (src, vmode);
       break;
 
     case NEG:
       src = XEXP (src, 0);
       convert_op (&src, insn);
-      subreg = gen_reg_rtx (V2DImode);
-      emit_insn_before (gen_move_insn (subreg, CONST0_RTX (V2DImode)), insn);
-      src = gen_rtx_MINUS (V2DImode, subreg, src);
+      subreg = gen_reg_rtx (vmode);
+      emit_insn_before (gen_move_insn (subreg, CONST0_RTX (vmode)), insn);
+      src = gen_rtx_MINUS (vmode, subreg, src);
       break;
 
     case NOT:
       src = XEXP (src, 0);
       convert_op (&src, insn);
-      subreg = gen_reg_rtx (V2DImode);
-      emit_insn_before (gen_move_insn (subreg, CONSTM1_RTX (V2DImode)), insn);
-      src = gen_rtx_XOR (V2DImode, src, subreg);
+      subreg = gen_reg_rtx (vmode);
+      emit_insn_before (gen_move_insn (subreg, CONSTM1_RTX (vmode)), insn);
+      src = gen_rtx_XOR (vmode, src, subreg);
       break;
 
     case MEM:
@@ -939,17 +1012,17 @@ dimode_scalar_chain::convert_insn (rtx_i
       break;
 
     case SUBREG:
-      gcc_assert (GET_MODE (src) == V2DImode);
+      gcc_assert (GET_MODE (src) == vmode);
       break;
 
     case COMPARE:
       src = SUBREG_REG (XEXP (XEXP (src, 0), 0));
 
-      gcc_assert ((REG_P (src) && GET_MODE (src) == DImode)
-		  || (SUBREG_P (src) && GET_MODE (src) == V2DImode));
+      gcc_assert ((REG_P (src) && GET_MODE (src) == GET_MODE_INNER (vmode))
+		  || (SUBREG_P (src) && GET_MODE (src) == vmode));
 
       if (REG_P (src))
-	subreg = gen_rtx_SUBREG (V2DImode, src, 0);
+	subreg = gen_rtx_SUBREG (vmode, src, 0);
       else
	subreg = copy_rtx_if_shared (src);
       emit_insn_before (gen_vec_interleave_lowv2di (copy_rtx_if_shared (subreg),
@@ -977,7 +1050,9 @@ dimode_scalar_chain::convert_insn (rtx_i
   PATTERN (insn) = def_set;
 
   INSN_CODE (insn) = -1;
-  recog_memoized (insn);
+  int patt = recog_memoized (insn);
+  if (patt == -1)
+    fatal_insn_not_found (insn);
   df_insn_rescan (insn);
 }
 
@@ -1186,7 +1261,7 @@ has_non_address_hard_reg (rtx_insn *insn
	 (const_int 0 [0])))  */
 
 static bool
-convertible_comparison_p (rtx_insn *insn)
+convertible_comparison_p (rtx_insn *insn, enum machine_mode mode)
 {
   if (!TARGET_SSE4_1)
     return false;
@@ -1219,12 +1294,12 @@ convertible_comparison_p (rtx_insn *insn
 
   if (!SUBREG_P (op1)
       || !SUBREG_P (op2)
-      || GET_MODE (op1) != SImode
-      || GET_MODE (op2) != SImode
+      || GET_MODE (op1) != mode
+      || GET_MODE (op2) != mode
       || ((SUBREG_BYTE (op1) != 0
-	   || SUBREG_BYTE (op2) != GET_MODE_SIZE (SImode))
+	   || SUBREG_BYTE (op2) != GET_MODE_SIZE (mode))
	  && (SUBREG_BYTE (op2) != 0
-	      || SUBREG_BYTE (op1) != GET_MODE_SIZE (SImode))))
+	      || SUBREG_BYTE (op1) != GET_MODE_SIZE (mode))))
     return false;
 
   op1 = SUBREG_REG (op1);
@@ -1232,7 +1307,7 @@ convertible_comparison_p (rtx_insn *insn
 
   if (op1 != op2
       || !REG_P (op1)
-      || GET_MODE (op1) != DImode)
+      || GET_MODE (op1) != GET_MODE_WIDER_MODE (mode).else_blk ())
     return false;
 
   return true;
@@ -1241,7 +1316,7 @@ convertible_comparison_p (rtx_insn *insn
 /* The DImode version of scalar_to_vector_candidate_p.  */
 
 static bool
-dimode_scalar_to_vector_candidate_p (rtx_insn *insn)
+dimode_scalar_to_vector_candidate_p (rtx_insn *insn, enum machine_mode mode)
 {
   rtx def_set = single_set (insn);
 
@@ -1255,12 +1330,12 @@ dimode_scalar_to_vector_candidate_p (rtx
   rtx dst = SET_DEST (def_set);
 
   if (GET_CODE (src) == COMPARE)
-    return convertible_comparison_p (insn);
+    return convertible_comparison_p (insn, mode);
 
   /* We are interested in DImode promotion only.  */
-  if ((GET_MODE (src) != DImode
+  if ((GET_MODE (src) != mode
        && !CONST_INT_P (src))
-      || GET_MODE (dst) != DImode)
+      || GET_MODE (dst) != mode)
     return false;
 
   if (!REG_P (dst) && !MEM_P (dst))
@@ -1280,6 +1355,15 @@ dimode_scalar_to_vector_candidate_p (rtx
	return false;
       break;
 
+    case SMAX:
+    case SMIN:
+    case UMAX:
+    case UMIN:
+      if ((mode == DImode && !TARGET_AVX512F)
+	  || (mode == SImode && !TARGET_SSE4_1))
+	return false;
+      /* Fallthru.  */
+
     case PLUS:
     case MINUS:
     case IOR:
@@ -1290,7 +1374,7 @@ dimode_scalar_to_vector_candidate_p (rtx
	  && !CONST_INT_P (XEXP (src, 1)))
	return false;
 
-      if (GET_MODE (XEXP (src, 1)) != DImode
+      if (GET_MODE (XEXP (src, 1)) != mode
	  && !CONST_INT_P (XEXP (src, 1)))
	return false;
       break;
@@ -1319,7 +1403,7 @@ dimode_scalar_to_vector_candidate_p (rtx
	      || !REG_P (XEXP (XEXP (src, 0), 0))))
	return false;
 
-  if (GET_MODE (XEXP (src, 0)) != DImode
+  if (GET_MODE (XEXP (src, 0)) != mode
       && !CONST_INT_P (XEXP (src, 0)))
     return false;
 
@@ -1383,19 +1467,13 @@ timode_scalar_to_vector_candidate_p (rtx
       return false;
     }
 
-/* Return 1 if INSN may be converted into vector
-   instruction.  */
-
-static bool
-scalar_to_vector_candidate_p (rtx_insn *insn)
-{
-  if (TARGET_64BIT)
-    return timode_scalar_to_vector_candidate_p (insn);
-  else
-    return dimode_scalar_to_vector_candidate_p (insn);
-}
+/* For a given bitmap of insn UIDs scans all instruction and
+   remove insn from CANDIDATES in case it has both convertible
+   and not convertible definitions.
 
-/* The DImode version of remove_non_convertible_regs.  */
+   All insns in a bitmap are conversion candidates according to
+   scalar_to_vector_candidate_p.  Currently it implies all insns
+   are single_set.  */
 
 static void
 dimode_remove_non_convertible_regs (bitmap candidates)
@@ -1553,23 +1631,6 @@ timode_remove_non_convertible_regs (bitm
   BITMAP_FREE (regs);
 }
 
-/* For a given bitmap of insn UIDs scans all instruction and
-   remove insn from CANDIDATES in case it has both convertible
-   and not convertible definitions.
-
-   All insns in a bitmap are conversion candidates according to
-   scalar_to_vector_candidate_p.  Currently it implies all insns
-   are single_set.  */
-
-static void
-remove_non_convertible_regs (bitmap candidates)
-{
-  if (TARGET_64BIT)
-    timode_remove_non_convertible_regs (candidates);
-  else
-    dimode_remove_non_convertible_regs (candidates);
-}
-
 /* Main STV pass function.  Find and convert scalar
    instructions into vector mode when profitable.  */
 
@@ -1577,11 +1638,14 @@ static unsigned int
 convert_scalars_to_vector ()
 {
   basic_block bb;
-  bitmap candidates;
   int converted_insns = 0;
 
   bitmap_obstack_initialize (NULL);
-  candidates = BITMAP_ALLOC (NULL);
+  const machine_mode cand_mode[3] = { SImode, DImode, TImode };
+  const machine_mode cand_vmode[3] = { V4SImode, V2DImode, V1TImode };
+  bitmap_head candidates[3];  /* { SImode, DImode, TImode } */
+  for (unsigned i = 0; i < 3; ++i)
+    bitmap_initialize (&candidates[i], &bitmap_default_obstack);
 
   calculate_dominance_info (CDI_DOMINATORS);
   df_set_flags (DF_DEFER_INSN_RESCAN);
@@ -1597,51 +1661,73 @@ convert_scalars_to_vector ()
     {
       rtx_insn *insn;
       FOR_BB_INSNS (bb, insn)
-	if (scalar_to_vector_candidate_p (insn))
+	if (TARGET_64BIT
+	    && timode_scalar_to_vector_candidate_p (insn))
	  {
	    if (dump_file)
-	      fprintf (dump_file, "  insn %d is marked as a candidate\n",
+	      fprintf (dump_file, "  insn %d is marked as a TImode candidate\n",
		       INSN_UID (insn));
 
-	    bitmap_set_bit (candidates, INSN_UID (insn));
+	    bitmap_set_bit (&candidates[2], INSN_UID (insn));
+	  }
+	else
+	  {
+	    /* Check {SI,DI}mode.  */
+	    for (unsigned i = 0; i <= 1; ++i)
+	      if (dimode_scalar_to_vector_candidate_p (insn, cand_mode[i]))
+		{
+		  if (dump_file)
+		    fprintf (dump_file, "  insn %d is marked as a %s candidate\n",
			     INSN_UID (insn), i == 0 ? "SImode" : "DImode");
 
+		  bitmap_set_bit (&candidates[i], INSN_UID (insn));
+		  break;
+		}
	  }
     }
 
-  remove_non_convertible_regs (candidates);
+  if (TARGET_64BIT)
+    timode_remove_non_convertible_regs (&candidates[2]);
+  for (unsigned i = 0; i <= 1; ++i)
+    dimode_remove_non_convertible_regs (&candidates[i]);
 
-  if (bitmap_empty_p (candidates))
-    if (dump_file)
+  for (unsigned i = 0; i <= 2; ++i)
+    if (!bitmap_empty_p (&candidates[i]))
+      break;
+    else if (i == 2 && dump_file)
       fprintf (dump_file, "There are no candidates for optimization.\n");
 
-  while (!bitmap_empty_p (candidates))
-    {
-      unsigned uid = bitmap_first_set_bit (candidates);
-      scalar_chain *chain;
+  for (unsigned i = 0; i <= 2; ++i)
+    while (!bitmap_empty_p (&candidates[i]))
+      {
+	unsigned uid = bitmap_first_set_bit (&candidates[i]);
+	scalar_chain *chain;
 
-      if (TARGET_64BIT)
-	chain = new timode_scalar_chain;
-      else
-	chain = new dimode_scalar_chain;
+	if (cand_mode[i] == TImode)
+	  chain = new timode_scalar_chain;
+	else
+	  chain = new dimode_scalar_chain (cand_mode[i], cand_vmode[i]);
 
-      /* Find instructions chain we want to convert to vector mode.
-	 Check all uses and definitions to estimate all required
-	 conversions.  */
-      chain->build (candidates, uid);
+	/* Find instructions chain we want to convert to vector mode.
	   Check all uses and definitions to estimate all required
	   conversions.  */
+	chain->build (&candidates[i], uid);
 
-      if (chain->compute_convert_gain () > 0)
-	converted_insns += chain->convert ();
-      else
-	if (dump_file)
-	  fprintf (dump_file, "Chain #%d conversion is not profitable\n",
-		   chain->chain_id);
+	if (chain->compute_convert_gain () > 0)
+	  converted_insns += chain->convert ();
+	else
+	  if (dump_file)
+	    fprintf (dump_file, "Chain #%d conversion is not profitable\n",
+		     chain->chain_id);
 
-      delete chain;
-    }
+	delete chain;
+      }
 
   if (dump_file)
     fprintf (dump_file, "Total insns converted: %d\n", converted_insns);
 
-  BITMAP_FREE (candidates);
+  for (unsigned i = 0; i <= 2; ++i)
+    bitmap_release (&candidates[i]);
   bitmap_obstack_release (NULL);
   df_process_deferred_rescans ();
 
Index: gcc/config/i386/i386-features.h
===================================================================
--- gcc/config/i386/i386-features.h	(revision 274111)
+++ gcc/config/i386/i386-features.h	(working copy)
@@ -127,11 +127,16 @@ namespace {
 class scalar_chain
 {
  public:
-  scalar_chain ();
+  scalar_chain (enum machine_mode, enum machine_mode);
   virtual ~scalar_chain ();
 
   static unsigned max_id;
 
+  /* Scalar mode.  */
+  enum machine_mode smode;
+  /* Vector mode.  */
+  enum machine_mode vmode;
+
   /* ID of a chain.  */
   unsigned int chain_id;
   /* A queue of instructions to be included into a chain.  */
@@ -162,6 +167,8 @@ class scalar_chain
 class dimode_scalar_chain : public scalar_chain
 {
  public:
+  dimode_scalar_chain (enum machine_mode smode_, enum machine_mode vmode_)
+    : scalar_chain (smode_, vmode_) {}
   int compute_convert_gain ();
 private:
   void mark_dual_mode_def (df_ref def);
@@ -178,6 +185,8 @@ class dimode_scalar_chain : public scala
 class timode_scalar_chain : public scalar_chain
 {
  public:
+  timode_scalar_chain () : scalar_chain (TImode, V1TImode) {}
+
   /* Convert from TImode to V1TImode is always faster.  */
   int compute_convert_gain () { return 1; }
 
Index: gcc/config/i386/i386.md
===================================================================
--- gcc/config/i386/i386.md	(revision 274111)
+++ gcc/config/i386/i386.md	(working copy)
@@ -17721,6 +17721,27 @@ (define_peephole2
    std::swap (operands[4], operands[5]);
 })
 
+;; min/max patterns
+
+(define_code_attr smaxmin_rel [(smax "ge") (smin "le")])
+
+(define_insn_and_split "<code><mode>3"
+  [(set (match_operand:SWI48 0 "register_operand")
+	(smaxmin:SWI48 (match_operand:SWI48 1 "register_operand")
+		       (match_operand:SWI48 2 "register_operand")))
+   (clobber (reg:CC FLAGS_REG))]
+  "TARGET_STV && TARGET_SSE4_1
+   && can_create_pseudo_p ()"
+  "#"
+  "&& 1"
+  [(set (reg:CCGC FLAGS_REG)
+	(compare:CCGC (match_dup 1) (match_dup 2)))
+   (set (match_dup 0)
+	(if_then_else:SWI48
+	  (<smaxmin_rel> (reg:CCGC FLAGS_REG) (const_int 0))
+	  (match_dup 1)
+	  (match_dup 2)))])
+
 ;; Conditional addition patterns
 (define_expand "add<mode>cc"
   [(match_operand:SWI 0 "register_operand")
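[For the testcases still to be added, a minimal SImode candidate in the
spirit of the hmmer loop could look like the sketch below.  The dg-
directives, file placement, and the expectation that STV turns the
smaxsi3 chain into pmaxsd under -O2 -msse4.1 are illustrative
assumptions, not verified against this patch:

/* { dg-do compile } */
/* { dg-options "-O2 -msse4.1" } */

int
foo (int *a, int n)
{
  /* Loads feed a signed max chain that starts and ends in memory,
     which is the profitable case identified in the discussion.  */
  int max = a[0];
  for (int i = 1; i < n; ++i)
    if (a[i] > max)
      max = a[i];
  return max;
}

/* { dg-final { scan-assembler "pmaxsd" } } */
]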