On 8/5/19 6:32 AM, Uros Bizjak wrote: > On Mon, Aug 5, 2019 at 1:50 PM Richard Biener <rguent...@suse.de> wrote: >> >> On Sun, 4 Aug 2019, Uros Bizjak wrote: >> >>> On Sat, Aug 3, 2019 at 7:26 PM Richard Biener <rguent...@suse.de> wrote: >>>> >>>> On Thu, 1 Aug 2019, Uros Bizjak wrote: >>>> >>>>> On Thu, Aug 1, 2019 at 11:28 AM Richard Biener <rguent...@suse.de> wrote: >>>>> >>>>>>>> So you unconditionally add a smaxdi3 pattern - indeed this looks >>>>>>>> necessary even when going the STV route. The actual regression >>>>>>>> for the testcase could also be solved by turing the smaxsi3 >>>>>>>> back into a compare and jump rather than a conditional move sequence. >>>>>>>> So I wonder how you'd do that given that there's pass_if_after_reload >>>>>>>> after pass_split_after_reload and I'm not sure we can split >>>>>>>> as late as pass_split_before_sched2 (there's also a split _after_ >>>>>>>> sched2 on x86 it seems). >>>>>>>> >>>>>>>> So how would you go implement {s,u}{min,max}{si,di}3 for the >>>>>>>> case STV doesn't end up doing any transform? >>>>>>> >>>>>>> If STV doesn't transform the insn, then a pre-reload splitter splits >>>>>>> the insn back to compare+cmove. >>>>>> >>>>>> OK, that would work. But there's no way to force a jumpy sequence then >>>>>> which we know is faster than compare+cmove because later RTL >>>>>> if-conversion passes happily re-discover the smax (or conditional move) >>>>>> sequence. >>>>>> >>>>>>> However, considering the SImode move >>>>>>> from/to int/xmm register is relatively cheap, the cost function should >>>>>>> be tuned so that STV always converts smaxsi3 pattern. >>>>>> >>>>>> Note that on both Zen and even more so bdverN the int/xmm transition >>>>>> makes it no longer profitable but a _lot_ slower than the cmp/cmov >>>>>> sequence... (for the loop in hmmer which is the only one I see >>>>>> any effect of any of my patches). So identifying chains that >>>>>> start/end in memory is important for cost reasons. >>>>> >>>>> Please note that the cost function also considers the cost of move >>>>> from/to xmm. So, the cost of the whole chain would disable the >>>>> transformation. >>>>> >>>>>> So I think the splitting has to happen after the last if-conversion >>>>>> pass (and thus we may need to allocate a scratch register for this >>>>>> purpose?) >>>>> >>>>> I really hope that the underlying issue will be solved by a machine >>>>> dependant pass inserted somewhere after the pre-reload split. This >>>>> way, we can split unconverted smax to the cmove, and this later pass >>>>> would handle jcc and cmove instructions. Until then... yes your >>>>> proposed approach is one of the ways to avoid unwanted if-conversion, >>>>> although sometimes we would like to split to cmove instead. >>>> >>>> So the following makes STV also consider SImode chains, re-using the >>>> DImode chain code. I've kept a simple incomplete smaxsi3 pattern >>>> and also did not alter the {SI,DI}mode chain cost function - it's >>>> quite off for TARGET_64BIT. With this I get the expected conversion >>>> for the testcase derived from hmmer. >>>> >>>> No further testing sofar. >>>> >>>> Is it OK to re-use the DImode chain code this way? I'll clean things >>>> up some more of course. >>> >>> Yes, the approach looks OK to me. It makes chain building mode >>> agnostic, and the chain building can be used for >>> a) DImode x86_32 (as is now), but maybe 64bit minmax operation can be added. >>> b) SImode x86_32 and x86_64 (this will be mainly used for SImode >>> minmax and surrounding SImode operations) >>> c) DImode x86_64 (also, mainly used for DImode minmax and surrounding >>> DImode operations) >>> >>>> Still need help with the actual patterns for minmax and how the splitters >>>> should look like. >>> >>> Please look at the attached patch. Maybe we can add memory_operand as >>> operand 1 and operand 2 predicate, but let's keep things simple for >>> now. >> >> Thanks. The attached patch makes the patch cleaner and it survives >> "some" barebone testing. It also touches the cost function to >> avoid being too overly trigger-happy. I've also ended up using >> ix86_cost->sse_op instead of COSTS_N_INSN-based magic. In >> particular we estimated GPR reg-reg move as COST_N_INSNS(2) while >> move costs shouldn't be wrapped in COST_N_INSNS. >> IMHO we should probably disregard any reg-reg moves for costing pre-RA. >> At least with the current code every reg-reg move biases in favor of >> SSE... > > This is currently a bit mixed-up area in x86 target support. HJ is > looking into this [1] and I hope Honza can review the patch. Yea, Honza's input on that would be greatly appreciated.
Jeff >