On Mon, Aug 5, 2019 at 2:43 PM Uros Bizjak <ubiz...@gmail.com> wrote: > > On Mon, Aug 5, 2019 at 1:50 PM Richard Biener <rguent...@suse.de> wrote: > > > > On Sun, 4 Aug 2019, Uros Bizjak wrote: > > > > > On Sat, Aug 3, 2019 at 7:26 PM Richard Biener <rguent...@suse.de> wrote: > > > > > > > > On Thu, 1 Aug 2019, Uros Bizjak wrote: > > > > > > > > > On Thu, Aug 1, 2019 at 11:28 AM Richard Biener <rguent...@suse.de> > > > > > wrote: > > > > > > > > > >>>> So you unconditionally add a smaxdi3 pattern - indeed this looks > > > > >>>> necessary even when going the STV route. The actual regression > > > > >>>> for the testcase could also be solved by turing the smaxsi3 > > > > >>>> back into a compare and jump rather than a conditional move > > > > >>>> sequence. > > > > >>>> So I wonder how you'd do that given that there's > > > > >>>> pass_if_after_reload > > > > >>>> after pass_split_after_reload and I'm not sure we can split > > > > >>>> as late as pass_split_before_sched2 (there's also a split _after_ > > > > >>>> sched2 on x86 it seems). > > > > >>>> > > > > >>>> So how would you go implement {s,u}{min,max}{si,di}3 for the > > > > >>>> case STV doesn't end up doing any transform? > > > > >>> > > > > >>> If STV doesn't transform the insn, then a pre-reload splitter splits > > > > >>> the insn back to compare+cmove. > > > > >> > > > > >> OK, that would work. But there's no way to force a jumpy sequence > > > > >> then > > > > >> which we know is faster than compare+cmove because later RTL > > > > >> if-conversion passes happily re-discover the smax (or conditional > > > > >> move) > > > > >> sequence. > > > > >> > > > > >>> However, considering the SImode move > > > > >>> from/to int/xmm register is relatively cheap, the cost function > > > > >>> should > > > > >>> be tuned so that STV always converts smaxsi3 pattern. > > > > >> > > > > >> Note that on both Zen and even more so bdverN the int/xmm transition > > > > >> makes it no longer profitable but a _lot_ slower than the cmp/cmov > > > > >> sequence... (for the loop in hmmer which is the only one I see > > > > >> any effect of any of my patches). So identifying chains that > > > > >> start/end in memory is important for cost reasons. > > > > > > > > > > Please note that the cost function also considers the cost of move > > > > > from/to xmm. So, the cost of the whole chain would disable the > > > > > transformation. > > > > > > > > > >> So I think the splitting has to happen after the last if-conversion > > > > >> pass (and thus we may need to allocate a scratch register for this > > > > >> purpose?) > > > > > > > > > > I really hope that the underlying issue will be solved by a machine > > > > > dependant pass inserted somewhere after the pre-reload split. This > > > > > way, we can split unconverted smax to the cmove, and this later pass > > > > > would handle jcc and cmove instructions. Until then... yes your > > > > > proposed approach is one of the ways to avoid unwanted if-conversion, > > > > > although sometimes we would like to split to cmove instead. > > > > > > > > So the following makes STV also consider SImode chains, re-using the > > > > DImode chain code. I've kept a simple incomplete smaxsi3 pattern > > > > and also did not alter the {SI,DI}mode chain cost function - it's > > > > quite off for TARGET_64BIT. With this I get the expected conversion > > > > for the testcase derived from hmmer. > > > > > > > > No further testing sofar. > > > > > > > > Is it OK to re-use the DImode chain code this way? I'll clean things > > > > up some more of course. > > > > > > Yes, the approach looks OK to me. It makes chain building mode > > > agnostic, and the chain building can be used for > > > a) DImode x86_32 (as is now), but maybe 64bit minmax operation can be > > > added. > > > b) SImode x86_32 and x86_64 (this will be mainly used for SImode > > > minmax and surrounding SImode operations) > > > c) DImode x86_64 (also, mainly used for DImode minmax and surrounding > > > DImode operations) > > > > > > > Still need help with the actual patterns for minmax and how the > > > > splitters > > > > should look like. > > > > > > Please look at the attached patch. Maybe we can add memory_operand as > > > operand 1 and operand 2 predicate, but let's keep things simple for > > > now. > > > > Thanks. The attached patch makes the patch cleaner and it survives > > "some" barebone testing. It also touches the cost function to > > avoid being too overly trigger-happy. I've also ended up using > > ix86_cost->sse_op instead of COSTS_N_INSN-based magic. In > > particular we estimated GPR reg-reg move as COST_N_INSNS(2) while > > move costs shouldn't be wrapped in COST_N_INSNS. > > IMHO we should probably disregard any reg-reg moves for costing pre-RA. > > At least with the current code every reg-reg move biases in favor of > > SSE... > > > > And we're simply adding move and non-move costs in 'gain', somewhat > > mixing apples and oranges? We could separate those and require > > both to be a net positive win? > > > > Still using -mtune=bdverN exposes that some cost tables have xmm and gpr > > costs as apples and oranges... (so it never triggers for Bulldozer) > > > > I now run into > > > > /space/rguenther/src/svn/trunk-bisect/libgcc/libgcov-driver.c:509:1: > > error: unrecognizable insn: > > (insn 116 115 1511 8 (set (subreg:V2DI (reg/v:DI 87 [ run_max ]) 0) > > (smax:V2DI (subreg:V2DI (reg/v:DI 87 [ run_max ]) 0) > > (subreg:V2DI (reg:DI 349 [ MEM[base: _261, offset: 0B] ]) 0))) > > -1 > > (expr_list:REG_DEAD (reg:DI 349 [ MEM[base: _261, offset: 0B] ]) > > (expr_list:REG_UNUSED (reg:CC 17 flags) > > (nil)))) > > during RTL pass: stv > > > > where even with -mavx2 we do not have s{min,max}v2di3. We do have > > an expander here but it seems only AVX512F has the DImode min/max > > ops. I have adjusted dimode_scalar_to_vector_candidate_p > > accordingly. > > Uh, you need to use some other mode iterator that SWI48 then, like: > > (define_mode_iterator MAXMIN_IMODE [SI "TARGET_SSE4_1"] [DI "TARGET_AVX512F"]) > > and then we need to split DImode for 32bits, too.
For now, please add "TARGET_64BIT && TARGET_AVX512F" for DImode condition, I'll provide _doubleword splitter later. Uros.