Uros Bizjak <ubiz...@gmail.com> writes:
> On Mon, Aug 5, 2019 at 12:12 PM Richard Sandiford
> <richard.sandif...@arm.com> wrote:
>>
>> Uros Bizjak <ubiz...@gmail.com> writes:
>> > On Mon, Aug 5, 2019 at 11:13 AM Richard Sandiford
>> > <richard.sandif...@arm.com> wrote:
>> >>
>> >> Uros Bizjak <ubiz...@gmail.com> writes:
>> >> > On Sat, Aug 3, 2019 at 7:26 PM Richard Biener <rguent...@suse.de> wrote:
>> >> >>
>> >> >> On Thu, 1 Aug 2019, Uros Bizjak wrote:
>> >> >>
>> >> >> > On Thu, Aug 1, 2019 at 11:28 AM Richard Biener <rguent...@suse.de>
>> >> >> > wrote:
>> >> >> >
>> >> >> >>>> So you unconditionally add a smaxdi3 pattern - indeed this looks
>> >> >> >>>> necessary even when going the STV route. The actual regression
>> >> >> >>>> for the testcase could also be solved by turing the smaxsi3
>> >> >> >>>> back into a compare and jump rather than a conditional move
>> >> >> >>>> sequence.
>> >> >> >>>> So I wonder how you'd do that given that there's
>> >> >> >>>> pass_if_after_reload
>> >> >> >>>> after pass_split_after_reload and I'm not sure we can split
>> >> >> >>>> as late as pass_split_before_sched2 (there's also a split _after_
>> >> >> >>>> sched2 on x86 it seems).
>> >> >> >>>>
>> >> >> >>>> So how would you go implement {s,u}{min,max}{si,di}3 for the
>> >> >> >>>> case STV doesn't end up doing any transform?
>> >> >> >>>
>> >> >> >>> If STV doesn't transform the insn, then a pre-reload splitter
>> >> >> >>> splits
>> >> >> >>> the insn back to compare+cmove.
>> >> >> >>
>> >> >> >> OK, that would work. But there's no way to force a jumpy sequence
>> >> >> >> then
>> >> >> >> which we know is faster than compare+cmove because later RTL
>> >> >> >> if-conversion passes happily re-discover the smax (or conditional
>> >> >> >> move)
>> >> >> >> sequence.
>> >> >> >>
>> >> >> >>> However, considering the SImode move
>> >> >> >>> from/to int/xmm register is relatively cheap, the cost function
>> >> >> >>> should
>> >> >> >>> be tuned so that STV always converts smaxsi3 pattern.
>> >> >> >>
>> >> >> >> Note that on both Zen and even more so bdverN the int/xmm transition
>> >> >> >> makes it no longer profitable but a _lot_ slower than the cmp/cmov
>> >> >> >> sequence... (for the loop in hmmer which is the only one I see
>> >> >> >> any effect of any of my patches). So identifying chains that
>> >> >> >> start/end in memory is important for cost reasons.
>> >> >> >
>> >> >> > Please note that the cost function also considers the cost of move
>> >> >> > from/to xmm. So, the cost of the whole chain would disable the
>> >> >> > transformation.
>> >> >> >
>> >> >> >> So I think the splitting has to happen after the last if-conversion
>> >> >> >> pass (and thus we may need to allocate a scratch register for this
>> >> >> >> purpose?)
>> >> >> >
>> >> >> > I really hope that the underlying issue will be solved by a machine
>> >> >> > dependant pass inserted somewhere after the pre-reload split. This
>> >> >> > way, we can split unconverted smax to the cmove, and this later pass
>> >> >> > would handle jcc and cmove instructions. Until then... yes your
>> >> >> > proposed approach is one of the ways to avoid unwanted if-conversion,
>> >> >> > although sometimes we would like to split to cmove instead.
>> >> >>
>> >> >> So the following makes STV also consider SImode chains, re-using the
>> >> >> DImode chain code. I've kept a simple incomplete smaxsi3 pattern
>> >> >> and also did not alter the {SI,DI}mode chain cost function - it's
>> >> >> quite off for TARGET_64BIT. With this I get the expected conversion
>> >> >> for the testcase derived from hmmer.
>> >> >>
>> >> >> No further testing sofar.
>> >> >>
>> >> >> Is it OK to re-use the DImode chain code this way? I'll clean things
>> >> >> up some more of course.
>> >> >
>> >> > Yes, the approach looks OK to me. It makes chain building mode
>> >> > agnostic, and the chain building can be used for
>> >> > a) DImode x86_32 (as is now), but maybe 64bit minmax operation can be
>> >> > added.
>> >> > b) SImode x86_32 and x86_64 (this will be mainly used for SImode
>> >> > minmax and surrounding SImode operations)
>> >> > c) DImode x86_64 (also, mainly used for DImode minmax and surrounding
>> >> > DImode operations)
>> >> >
>> >> >> Still need help with the actual patterns for minmax and how the
>> >> >> splitters
>> >> >> should look like.
>> >> >
>> >> > Please look at the attached patch. Maybe we can add memory_operand as
>> >> > operand 1 and operand 2 predicate, but let's keep things simple for
>> >> > now.
>> >> >
>> >> > Uros.
>> >> >
>> >> > Index: i386.md
>> >> > ===================================================================
>> >> > --- i386.md (revision 274008)
>> >> > +++ i386.md (working copy)
>> >> > @@ -17721,6 +17721,27 @@
>> >> >      std::swap (operands[4], operands[5]);
>> >> >  })
>> >> >
>> >> > +;; min/max patterns
>> >> > +
>> >> > +(define_code_attr smaxmin_rel [(smax "ge") (smin "le")])
>> >> > +
>> >> > +(define_insn_and_split "<code><mode>3"
>> >> > +  [(set (match_operand:SWI48 0 "register_operand")
>> >> > +        (smaxmin:SWI48 (match_operand:SWI48 1 "register_operand")
>> >> > +                       (match_operand:SWI48 2 "register_operand")))
>> >> > +   (clobber (reg:CC FLAGS_REG))]
>> >> > +  "TARGET_STV && TARGET_SSE4_1
>> >> > +   && can_create_pseudo_p ()"
>> >> > +  "#"
>> >> > +  "&& 1"
>> >> > +  [(set (reg:CCGC FLAGS_REG)
>> >> > +        (compare:CCGC (match_dup 1)(match_dup 2)))
>> >> > +   (set (match_dup 0)
>> >> > +        (if_then_else:SWI48
>> >> > +          (<smaxmin_rel> (reg:CCGC FLAGS_REG)(const_int 0))
>> >> > +          (match_dup 1)
>> >> > +          (match_dup 2)))])
>> >> > +
>> >>
>> >> The pattern could in theory be matched after the last pre-RA split pass
>> >> has run, so I think the pattern still needs to have constraints and be
>> >> matchable
>> >> even without can_create_pseudo_p. It looks like the split
>> >> above should work post-RA.
>> >>
>> >> A bit pedantic, because the pattern's probably fine in practice...
>> >
>> > Currently, all unmatched STV patterns split before reload, and there
>> > were no problems. If the pattern matches after last pre-RA split, then
>> > the post-reload splitter will fail, since can_create_pseudo_p also
>> > applies to the part that splits the insn.
>>
>> But what I meant was: you should be able to remove the
>> can_create_pseudo_p () and add constraints. (You'd have to remove
>> can_create_pseudo_p () with constraints anyway, since the insn
>> wouldn't match after RA otherwise.)
>
> I was under impression that it is better to split pseudo->pseudo, so
> reload has some more freedom on what register to choose, especially
> with matched and earlyclobbered DImode regs in x86_32 DImode patterns.
> There were some complications with andn pattern (that needed
> earlyclobber on a register to avoid clobbering registers in a memory
> address), and it was necessary to clobber the whole DImode register
> pair, wasting a SImode register. We can avoid all these complications
> by splitting before the RA, where also a pseudo can be allocated.

Yeah, splitting before RA is fine. All I meant was that:

(define_insn_and_split "<code><mode>3"
  [(set (match_operand:SWI48 0 "register_operand" "=r")
        (smaxmin:SWI48 (match_operand:SWI48 1 "register_operand" "r")
                       (match_operand:SWI48 2 "register_operand" "r")))
   (clobber (reg:CC FLAGS_REG))]
  "TARGET_STV && TARGET_SSE4_1"
  "#"
  "&& 1"
  [(set (reg:CCGC FLAGS_REG)
        (compare:CCGC (match_dup 1) (match_dup 2)))
   (set (match_dup 0)
        (if_then_else:SWI48
          (<smaxmin_rel> (reg:CCGC FLAGS_REG) (const_int 0))
          (match_dup 1)
          (match_dup 2)))])

seems like it should be correct too and avoids the theoretical problem
I mentioned. If the instruction does survive until RA then the split
should work correctly on the reloaded instruction.

Thanks,
Richard
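
For readers less used to RTL, the split discussed in this thread amounts
to a signed compare followed by a conditional select, with "ge" chosen
for smax and "le" for smin (the smaxmin_rel mapping in the quoted patch).
The C below is only a rough sketch of those semantics. The function names
are hypothetical and are not GCC functions, and nothing here claims to
show the instructions the compiler actually emits. The last function
sketches the compare-and-jump form mentioned earlier in the thread; a
compiler is free to turn either form into the other.

```c
#include <stdint.h>

/* Hypothetical illustration only: none of these names exist in GCC. */

/* smax as compare + select: op0 = (op1 >= op2) ? op1 : op2,
   mirroring the "ge" entry of smaxmin_rel. */
static int32_t smax_select(int32_t a, int32_t b)
{
    return (a >= b) ? a : b;
}

/* smin as compare + select: op0 = (op1 <= op2) ? op1 : op2,
   mirroring the "le" entry of smaxmin_rel. */
static int32_t smin_select(int32_t a, int32_t b)
{
    return (a <= b) ? a : b;
}

/* The same smax written as a compare followed by a branch. */
static int32_t smax_branch(int32_t a, int32_t b)
{
    int32_t result = b;
    if (a >= b)
        result = a;
    return result;
}
```

Both smax forms compute the same value for every input; the thread is
about which form is cheaper on a given target, not about a difference in
results.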