On Mon, Aug 5, 2019 at 1:50 PM Richard Biener <rguent...@suse.de> wrote:
>
> On Sun, 4 Aug 2019, Uros Bizjak wrote:
>
> > On Sat, Aug 3, 2019 at 7:26 PM Richard Biener <rguent...@suse.de> wrote:
> > >
> > > On Thu, 1 Aug 2019, Uros Bizjak wrote:
> > >
> > > > On Thu, Aug 1, 2019 at 11:28 AM Richard Biener <rguent...@suse.de> wrote:
> > > >
> > > >>>> So you unconditionally add a smaxdi3 pattern - indeed this looks
> > > >>>> necessary even when going the STV route.  The actual regression
> > > >>>> for the testcase could also be solved by turning the smaxsi3
> > > >>>> back into a compare and jump rather than a conditional move sequence.
> > > >>>> So I wonder how you'd do that given that there's pass_if_after_reload
> > > >>>> after pass_split_after_reload and I'm not sure we can split
> > > >>>> as late as pass_split_before_sched2 (there's also a split _after_
> > > >>>> sched2 on x86 it seems).
> > > >>>>
> > > >>>> So how would you go implement {s,u}{min,max}{si,di}3 for the
> > > >>>> case STV doesn't end up doing any transform?
> > > >>>
> > > >>> If STV doesn't transform the insn, then a pre-reload splitter splits
> > > >>> the insn back to compare+cmove.
> > > >>
> > > >> OK, that would work.  But there's no way to force a jumpy sequence then
> > > >> which we know is faster than compare+cmove because later RTL
> > > >> if-conversion passes happily re-discover the smax (or conditional move)
> > > >> sequence.
> > > >>
> > > >>> However, considering the SImode move
> > > >>> from/to int/xmm register is relatively cheap, the cost function should
> > > >>> be tuned so that STV always converts smaxsi3 pattern.
> > > >>
> > > >> Note that on both Zen and even more so bdverN the int/xmm transition
> > > >> makes it no longer profitable but a _lot_ slower than the cmp/cmov
> > > >> sequence... (for the loop in hmmer which is the only one I see
> > > >> any effect of any of my patches).  So identifying chains that
> > > >> start/end in memory is important for cost reasons.
> > > >
> > > > Please note that the cost function also considers the cost of move
> > > > from/to xmm.  So, the cost of the whole chain would disable the
> > > > transformation.
> > > >
> > > >> So I think the splitting has to happen after the last if-conversion
> > > >> pass (and thus we may need to allocate a scratch register for this
> > > >> purpose?)
> > > >
> > > > I really hope that the underlying issue will be solved by a machine
> > > > dependent pass inserted somewhere after the pre-reload split.  This
> > > > way, we can split unconverted smax to the cmove, and this later pass
> > > > would handle jcc and cmove instructions.  Until then... yes your
> > > > proposed approach is one of the ways to avoid unwanted if-conversion,
> > > > although sometimes we would like to split to cmove instead.
> > >
> > > So the following makes STV also consider SImode chains, re-using the
> > > DImode chain code.  I've kept a simple incomplete smaxsi3 pattern
> > > and also did not alter the {SI,DI}mode chain cost function - it's
> > > quite off for TARGET_64BIT.  With this I get the expected conversion
> > > for the testcase derived from hmmer.
> > >
> > > No further testing so far.
> > >
> > > Is it OK to re-use the DImode chain code this way?  I'll clean things
> > > up some more of course.
> >
> > Yes, the approach looks OK to me.  It makes chain building mode
> > agnostic, and the chain building can be used for
> > a) DImode x86_32 (as is now), but maybe 64bit minmax operation can be added.
> > b) SImode x86_32 and x86_64 (this will be mainly used for SImode
> > minmax and surrounding SImode operations)
> > c) DImode x86_64 (also, mainly used for DImode minmax and surrounding
> > DImode operations)
> >
> > > Still need help with the actual patterns for minmax and how the splitters
> > > should look like.
> >
> > Please look at the attached patch.  Maybe we can add memory_operand as
> > operand 1 and operand 2 predicate, but let's keep things simple for
> > now.
>
> Thanks.  The attached patch makes the patch cleaner and it survives
> "some" barebone testing.  It also touches the cost function to
> avoid being overly trigger-happy.  I've also ended up using
> ix86_cost->sse_op instead of COSTS_N_INSNS-based magic.  In
> particular we estimated GPR reg-reg move as COSTS_N_INSNS(2) while
> move costs shouldn't be wrapped in COSTS_N_INSNS.
> IMHO we should probably disregard any reg-reg moves for costing pre-RA.
> At least with the current code every reg-reg move biases in favor of
> SSE...

This is currently a somewhat mixed-up area of the x86 target support.
HJ is looking into this [1] and I hope Honza can review the patch.

> And we're simply adding move and non-move costs in 'gain', somewhat
> mixing apples and oranges?  We could separate those and require
> both to be a net positive win?
>
> Still using -mtune=bdverN exposes that some cost tables have xmm and gpr
> costs as apples and oranges... (so it never triggers for Bulldozer)
>
> I now run into
>
> /space/rguenther/src/svn/trunk-bisect/libgcc/libgcov-driver.c:509:1:
> error: unrecognizable insn:
> (insn 116 115 1511 8 (set (subreg:V2DI (reg/v:DI 87 [ run_max ]) 0)
>         (smax:V2DI (subreg:V2DI (reg/v:DI 87 [ run_max ]) 0)
>             (subreg:V2DI (reg:DI 349 [ MEM[base: _261, offset: 0B] ]) 0)))
> -1
>      (expr_list:REG_DEAD (reg:DI 349 [ MEM[base: _261, offset: 0B] ])
>         (expr_list:REG_UNUSED (reg:CC 17 flags)
>             (nil))))
> during RTL pass: stv
>
> where even with -mavx2 we do not have s{min,max}v2di3.  We do have
> an expander here but it seems only AVX512F has the DImode min/max
> ops.  I have adjusted dimode_scalar_to_vector_candidate_p
> accordingly.
>
> I'm considering renaming the
> dimode_{scalar_to_vector_candidate_p,remove_non_convertible_regs}
> functions to drop the dimode_ prefix - is that OK or do you
> prefer some other prefix?
>
> So - bootstrap with --with-arch=skylake in progress.
>
> It detects quite a few chains (unsurprisingly) so I guess we need
> to address compile-time issues in the pass before enabling this
> enhancement (maybe as followup?).
>
> Further comments on the actual patch welcome, I consider it
> "finished" if testing reveals no issues.  ChangeLog still needs
> to be written and testcases to be added.
> +;; min/max patterns
> +
> +(define_code_attr smaxmin_rel [(smax "ge") (smin "le")])
> +
> +(define_insn_and_split "<code><mode>3"
> +  [(set (match_operand:SWI48 0 "register_operand")
> +        (smaxmin:SWI48 (match_operand:SWI48 1 "register_operand")
> +                       (match_operand:SWI48 2 "register_operand")))
> +   (clobber (reg:CC FLAGS_REG))]
> +  "TARGET_STV && TARGET_SSE4_1
> +   && can_create_pseudo_p ()"
> +  "#"
> +  "&& 1"
> +  [(set (reg:CCGC FLAGS_REG)
> +        (compare:CCGC (match_dup 1)(match_dup 2)))
> +   (set (match_dup 0)
> +        (if_then_else:SWI48
> +          (<smaxmin_rel> (reg:CCGC FLAGS_REG)(const_int 0))
> +          (match_dup 1)
> +          (match_dup 2)))])
> +
>  ;; Conditional addition patterns
>  (define_expand "add<mode>cc"
>    [(match_operand:SWI 0 "register_operand")

Please find attached an (untested) i386.md patch that defines signed and
unsigned min/max patterns.

[1] https://gcc.gnu.org/ml/gcc-patches/2019-07/msg01542.html

Uros.
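
As a rough illustration of the suggestion quoted above (keep move and non-move costs apart and require both to be a net win), here is a minimal C sketch. The struct and function names are hypothetical and are not taken from the STV pass; the numbers a caller would feed in are likewise assumptions, not values from any GCC cost table.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-chain tallies; these names are NOT GCC's.  */
struct chain_gain {
    int op_gain;    /* scalar op cost minus vector op cost, summed over the chain */
    int move_gain;  /* move cost saved minus int<->xmm move cost added */
};

/* One combined sum, as the thread describes the current 'gain'.  */
static bool convert_combined(struct chain_gain g)
{
    return g.op_gain + g.move_gain > 0;
}

/* The alternative floated in the thread: both parts must be a net win.  */
static bool convert_separate(struct chain_gain g)
{
    return g.op_gain > 0 && g.move_gain > 0;
}

/* Small self-check showing where the two policies disagree.  */
static void chain_gain_selfcheck(void)
{
    struct chain_gain g = { 10, -4 };   /* ops win, moves lose */
    assert(convert_combined(g));
    assert(!convert_separate(g));
}
```

With the combined sum, a large gain on the operations can hide a loss on the moves; the separate test rejects such a chain. Whether that is too conservative in practice is exactly the open question raised above.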

diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index e19a591fa9d..8a492626103 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -17721,6 +17721,30 @@
   std::swap (operands[4], operands[5]);
 })
 
+;; min/max patterns
+
+(define_code_attr maxmin_rel
+  [(smax "ge") (smin "le") (umax "geu") (umin "leu")])
+(define_code_attr maxmin_cmpmode
+  [(smax "CCGC") (smin "CCGC") (umax "CC") (umin "CC")])
+
+(define_insn_and_split "<code><mode>3"
+  [(set (match_operand:SWI48 0 "register_operand")
+        (maxmin:SWI48 (match_operand:SWI48 1 "register_operand")
+                      (match_operand:SWI48 2 "register_operand")))
+   (clobber (reg:CC FLAGS_REG))]
+  "TARGET_STV && TARGET_SSE4_1
+   && can_create_pseudo_p ()"
+  "#"
+  "&& 1"
+  [(set (reg:<maxmin_cmpmode> FLAGS_REG)
+        (compare:<maxmin_cmpmode> (match_dup 1)(match_dup 2)))
+   (set (match_dup 0)
+        (if_then_else:SWI48
+          (<maxmin_rel> (reg:<maxmin_cmpmode> FLAGS_REG)(const_int 0))
+          (match_dup 1)
+          (match_dup 2)))])
+
 ;; Conditional addition patterns
 (define_expand "add<mode>cc"
   [(match_operand:SWI 0 "register_operand")
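
For readers less used to RTL, the split in the patch above can be read as "compare operand 1 with operand 2, then select operand 1 if the relation holds, else operand 2". The C sketch below spells that out for the SImode case; the function names are hypothetical and only mirror the maxmin_rel table (ge, le, geu, leu), they are not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Each helper mirrors compare + if_then_else from the splitter:
   result = (op1 REL op2) ? op1 : op2.  Names are illustrative only.  */
static int32_t smax_si(int32_t a, int32_t b) { return a >= b ? a : b; }    /* ge  */
static int32_t smin_si(int32_t a, int32_t b) { return a <= b ? a : b; }    /* le  */
static uint32_t umax_si(uint32_t a, uint32_t b) { return a >= b ? a : b; } /* geu */
static uint32_t umin_si(uint32_t a, uint32_t b) { return a <= b ? a : b; } /* leu */

/* Signed and unsigned variants differ once the sign bit is set.  */
static void minmax_selfcheck(void)
{
    assert(smax_si(-1, 2) == 2);
    assert(umax_si((uint32_t)-1, 2u) == UINT32_MAX);
}
```

The same bit pattern (all ones) is the smaller value under the signed relation and the larger one under the unsigned relation, which is why the unsigned variants get their own relation codes and compare mode in the patch.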