On Thu, 1 Aug 2019, Uros Bizjak wrote: > On Wed, Jul 31, 2019 at 1:21 PM Richard Biener <rguent...@suse.de> wrote: > > > > On Sat, 27 Jul 2019, Uros Bizjak wrote: > > > > > On Sat, Jul 27, 2019 at 12:07 PM Uros Bizjak <ubiz...@gmail.com> wrote: > > > > > > > > How would one write smaxsi3 as a splitter to be split after > > > > > reload in the case LRA assigned the GPR alternative? Is it > > > > > even worth doing? Even the SSE reg alternative can be split > > > > > to remove the not needed CC clobber. > > > > > > > > > > Finally I'm unsure about the add where I needed to place > > > > > the SSE alternative before the 2nd op memory one since it > > > > > otherwise gets the same cost and wins. > > > > > > > > > > So - how to go forward with this? > > > > > > > > Sorry to come a bit late to the discussion. > > > > > > > > We are aware of CMOV issue for quite some time, but the issue is not > > > > understood yet in detail (I was hoping for Intel people to look at > > > > this). However, you demonstrated that using PMAX and PMIN instead of > > > > scalar CMOV can bring us big gains, and this thread now deals on how > > > > to best implement PMAX/PMIN for scalar code. > > > > > > > > I think that the way to go forward is with STV infrastructure. > > > > Currently, the implementation only deals with DImode on SSE2 32bit > > > > targets, but I see no issues on using STV pass also for SImode (on > > > > 32bit and 64bit targets). There are actually two STV passes, the first > > > > one (currently run on 64bit targets) is run before cse2, and the > > > > second (which currently runs on 32bit SSE2 only) is run after combine > > > > and before split1 pass. The second pass is interesting to us. > > > > > > > > The base idea of the second STV pass (for 32bit targets!) is that we > > > > introduce a DImode _doubleword instructons that otherwise do not exist > > > > with integer registers. Now, the passes up to and including combine > > > > pass can use these instructions to simplify and optimize the insn > > > > flow. Later, based on cost analysis, STV pass either converts the > > > > _doubleword instructions to a real vector ones (e.g. V2DImode > > > > patterns) or leaves them intact, and a follow-up split pass splits > > > > them into scalar SImode instruction pairs. STV pass also takes care to > > > > move and preload values from their scalar form to a vector > > > > representation (using SUBREGs). Please note that all this happens on > > > > pseudos, and register allocator will later simply use scalar (integer) > > > > registers in scalar patterns and vector registers with vector insn > > > > patterns. > > > > > > > > Your approach to amend existing scalar SImode patterns with vector > > > > registers will introduce no end of problems. Register allocator will > > > > do funny things during register pressure, where values will take a > > > > trip to a vector register before being stored to memory (and vice > > > > versa, you already found some of them). Current RA simply can't > > > > distinguish clearly between two register sets. > > > > > > > > So, my advice would be to use STV pass also for SImode values, on > > > > 64bit and 32bit targets. On both targets, we will be able to use > > > > instructions that operate on vector register set, and for 32bit > > > > targets (and to some extent on 64bit targets), we would perhaps be > > > > able to relax register pressure in a kind of controlled way. > > > > > > > > So, to demonstrate the benefits of existing STV pass, it should be > > > > relatively easy to introduce 64bit max/min pattern on 32bit target to > > > > handle 64bit values. For 32bit values, the pass should be re-run to > > > > convert SImode scalar operations to vector operations in a controlled > > > > way, based on various cost functions. > > > > I've looked at STV before trying to use RA to solve the issue but > > quickly stepped away because of its structure which seems to be > > tied to particular modes, duplicating things for TImode and DImode > > so it looked like I have to write up everything again for SImode... > > ATM, DImode is used exclusively for x86_32 while TImode is used > exclusively for x86_64. Also, TImode is used for different purpose > before combine, while DImode is used after combine. I don't remember > the details, but IIRC it made sense for the intended purpose. > > > > It really should be possible to run the pass once, handling a set > > of modes rather than re-running it for the SImode case I am after. > > See also a recent PR about STV slowness and tendency to hog memory > > because it seems to enable every DF problem that is around... > > Huh, I was not aware of implementation details... > > > > Please find attached patch to see STV in action. The compilation will > > > crash due to non-existing V2DImode SMAX insn, but in the _.268r.stv2 > > > dump, you will be able to see chain building, cost calculation and > > > conversion insertion. > > > > So you unconditionally add a smaxdi3 pattern - indeed this looks > > necessary even when going the STV route. The actual regression > > for the testcase could also be solved by turing the smaxsi3 > > back into a compare and jump rather than a conditional move sequence. > > So I wonder how you'd do that given that there's pass_if_after_reload > > after pass_split_after_reload and I'm not sure we can split > > as late as pass_split_before_sched2 (there's also a split _after_ > > sched2 on x86 it seems). > > > > So how would you go implement {s,u}{min,max}{si,di}3 for the > > case STV doesn't end up doing any transform? > > If STV doesn't transform the insn, then a pre-reload splitter splits > the insn back to compare+cmove.
OK, that would work. But there's no way to force a jumpy sequence then which we know is faster than compare+cmove because later RTL if-conversion passes happily re-discover the smax (or conditional move) sequence. > However, considering the SImode move > from/to int/xmm register is relatively cheap, the cost function should > be tuned so that STV always converts smaxsi3 pattern. Note that on both Zen and even more so bdverN the int/xmm transition makes it no longer profitable but a _lot_ slower than the cmp/cmov sequence... (for the loop in hmmer which is the only one I see any effect of any of my patches). So identifying chains that start/end in memory is important for cost reasons. So I think the splitting has to happen after the last if-conversion pass (and thus we may need to allocate a scratch register for this purpose?) > (As said before, > the fix of the slowdown with consecutive cmov insns is a side effect > of the transformation to smax insn that helps in this particular case, > I think that this issue should be fixed in a general way, there are > already a couple of PRs reported). > > > You could save me some guesswork here if you can come up with > > a reasonably complete final set of patterns (ok, I only care > > about smaxsi3) so I can have a look at the STV approach again > > (you may remember I simply "split" at assembler emission time). > > I think that the cost function should always enable smaxsi3 > generation. To further optimize STV chain (to avoid unnecessary > xmm<->int transitions) we could add all integer logic, arithmetic and > constant shifts to the candidates (the ones that DImode STV converts). > > Uros. > > > Thanks, > > Richard. > > > > > The testcase: > > > > > > --cut here-- > > > long long test (long long a, long long b) > > > { > > > return (a > b) ? a : b; > > > } > > > --cut here-- > > > > > > gcc -O2 -m32 -msse2 (-mstv): > > > > > > _.268r.stv2 dump: > > > > > > Searching for mode conversion candidates... > > > insn 2 is marked as a candidate > > > insn 3 is marked as a candidate > > > insn 7 is marked as a candidate > > > Created a new instruction chain #1 > > > Building chain #1... > > > Adding insn 2 to chain #1 > > > Adding insn 7 into chain's #1 queue > > > Adding insn 7 to chain #1 > > > r85 use in insn 12 isn't convertible > > > Mark r85 def in insn 7 as requiring both modes in chain #1 > > > Adding insn 3 into chain's #1 queue > > > Adding insn 3 to chain #1 > > > Collected chain #1... > > > insns: 2, 3, 7 > > > defs to convert: r85 > > > Computing gain for chain #1... > > > Instruction conversion gain: 24 > > > Registers conversion cost: 6 > > > Total gain: 18 > > > Converting chain #1... > > > > > > ... > > > > > > (insn 2 5 3 2 (set (reg/v:DI 83 [ a ]) > > > (mem/c:DI (reg/f:SI 16 argp) [1 a+0 S8 A32])) "max.c":2:1 66 > > > {*movdi_internal} > > > (nil)) > > > (insn 3 2 4 2 (set (reg/v:DI 84 [ b ]) > > > (mem/c:DI (plus:SI (reg/f:SI 16 argp) > > > (const_int 8 [0x8])) [1 b+0 S8 A32])) "max.c":2:1 66 > > > {*movdi_internal} > > > (nil)) > > > (note 4 3 7 2 NOTE_INSN_FUNCTION_BEG) > > > (insn 7 4 15 2 (set (subreg:V2DI (reg:DI 85) 0) > > > (smax:V2DI (subreg:V2DI (reg/v:DI 84 [ b ]) 0) > > > (subreg:V2DI (reg/v:DI 83 [ a ]) 0))) "max.c":3:22 -1 > > > (expr_list:REG_DEAD (reg/v:DI 84 [ b ]) > > > (expr_list:REG_DEAD (reg/v:DI 83 [ a ]) > > > (expr_list:REG_UNUSED (reg:CC 17 flags) > > > (nil))))) > > > (insn 15 7 16 2 (set (reg:V2DI 87) > > > (subreg:V2DI (reg:DI 85) 0)) "max.c":3:22 -1 > > > (nil)) > > > (insn 16 15 17 2 (set (subreg:SI (reg:DI 86) 0) > > > (subreg:SI (reg:V2DI 87) 0)) "max.c":3:22 -1 > > > (nil)) > > > (insn 17 16 18 2 (set (reg:V2DI 87) > > > (lshiftrt:V2DI (reg:V2DI 87) > > > (const_int 32 [0x20]))) "max.c":3:22 -1 > > > (nil)) > > > (insn 18 17 12 2 (set (subreg:SI (reg:DI 86) 4) > > > (subreg:SI (reg:V2DI 87) 0)) "max.c":3:22 -1 > > > (nil)) > > > (insn 12 18 13 2 (set (reg/i:DI 0 ax) > > > (reg:DI 86)) "max.c":4:1 66 {*movdi_internal} > > > (expr_list:REG_DEAD (reg:DI 86) > > > (nil))) > > > (insn 13 12 0 2 (use (reg/i:DI 0 ax)) "max.c":4:1 -1 > > > (nil)) > > > > > > Uros. > > > > > > > -- > > Richard Biener <rguent...@suse.de> > > SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany; > > GF: Felix Imendörffer, Mary Higgins, Sri Rasiah; HRB 21284 (AG Nürnberg) > -- Richard Biener <rguent...@suse.de> SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany; GF: Felix Imendörffer, Mary Higgins, Sri Rasiah; HRB 21284 (AG Nürnberg)