Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs

Richard Biener Thu, 01 Aug 2019 02:28:24 -0700

On Thu, 1 Aug 2019, Uros Bizjak wrote:

> On Wed, Jul 31, 2019 at 1:21 PM Richard Biener <rguent...@suse.de> wrote:
> >
> > On Sat, 27 Jul 2019, Uros Bizjak wrote:
> >
> > > On Sat, Jul 27, 2019 at 12:07 PM Uros Bizjak <ubiz...@gmail.com> wrote:
> > >
> > > > > How would one write smaxsi3 as a splitter to be split after
> > > > > reload in the case LRA assigned the GPR alternative?  Is it
> > > > > even worth doing?  Even the SSE reg alternative can be split
> > > > > to remove the not needed CC clobber.
> > > > >
> > > > > Finally I'm unsure about the add where I needed to place
> > > > > the SSE alternative before the 2nd op memory one since it
> > > > > otherwise gets the same cost and wins.
> > > > >
> > > > > So - how to go forward with this?
> > > >
> > > > Sorry to come a bit late to the discussion.
> > > >
> > > > We are aware of CMOV issue for quite some time, but the issue is not
> > > > understood yet in detail (I was hoping for Intel people to look at
> > > > this). However, you demonstrated that using PMAX and PMIN  instead of
> > > > scalar CMOV can bring us big gains, and this thread now deals on how
> > > > to best implement PMAX/PMIN for scalar code.
> > > >
> > > > I think that the way to go forward is with STV infrastructure.
> > > > Currently, the implementation only deals with DImode on SSE2 32bit
> > > > targets, but I see no issues on using STV pass also for SImode (on
> > > > 32bit and 64bit targets). There are actually two STV passes, the first
> > > > one (currently run on 64bit targets) is run before cse2, and the
> > > > second (which currently runs on 32bit SSE2 only) is run after combine
> > > > and before split1 pass. The second pass is interesting to us.
> > > >
> > > > The base idea of the second STV pass (for 32bit targets!) is that we
> > > > introduce a DImode _doubleword instructons that otherwise do not exist
> > > > with integer registers. Now, the passes up to and including combine
> > > > pass can use these instructions to simplify and optimize the insn
> > > > flow. Later, based on cost analysis, STV pass either converts the
> > > > _doubleword instructions to a real vector ones (e.g. V2DImode
> > > > patterns) or leaves them intact, and a follow-up split pass splits
> > > > them into scalar SImode instruction pairs. STV pass also takes care to
> > > > move and preload values from their scalar form to a vector
> > > > representation (using SUBREGs). Please note that all this happens on
> > > > pseudos, and register allocator will later simply use scalar (integer)
> > > > registers in scalar patterns and vector registers with vector insn
> > > > patterns.
> > > >
> > > > Your approach to amend existing scalar SImode patterns with vector
> > > > registers will introduce no end of problems. Register allocator will
> > > > do funny things during register pressure, where values will take a
> > > > trip to a vector register before being stored to memory (and vice
> > > > versa, you already found some of them). Current RA simply can't
> > > > distinguish clearly between two register sets.
> > > >
> > > > So, my advice would be to use STV pass also for SImode values, on
> > > > 64bit and 32bit targets. On both targets, we will be able to use
> > > > instructions that operate on vector register set, and for 32bit
> > > > targets (and to some extent on 64bit targets), we would perhaps be
> > > > able to relax register pressure in a kind of controlled way.
> > > >
> > > > So, to demonstrate the benefits of existing STV pass, it should be
> > > > relatively easy to introduce 64bit max/min pattern on 32bit target to
> > > > handle 64bit values. For 32bit values, the pass should be re-run to
> > > > convert SImode scalar operations to vector operations in a controlled
> > > > way, based on various cost functions.
> >
> > I've looked at STV before trying to use RA to solve the issue but
> > quickly stepped away because of its structure which seems to be
> > tied to particular modes, duplicating things for TImode and DImode
> > so it looked like I have to write up everything again for SImode...
> 
> ATM, DImode is used exclusively for x86_32 while TImode is used
> exclusively for x86_64. Also, TImode is used for different purpose
> before combine, while DImode is used after combine. I don't remember
> the details, but IIRC it made sense for the intended purpose.
> >
> > It really should be possible to run the pass once, handling a set
> > of modes rather than re-running it for the SImode case I am after.
> > See also a recent PR about STV slowness and tendency to hog memory
> > because it seems to enable every DF problem that is around...
> 
> Huh, I was not aware of implementation details...
> 
> > > Please find attached patch to see STV in action. The compilation will
> > > crash due to non-existing V2DImode SMAX insn, but in the _.268r.stv2
> > > dump, you will be able to see chain building, cost calculation and
> > > conversion insertion.
> >
> > So you unconditionally add a smaxdi3 pattern - indeed this looks
> > necessary even when going the STV route.  The actual regression
> > for the testcase could also be solved by turing the smaxsi3
> > back into a compare and jump rather than a conditional move sequence.
> > So I wonder how you'd do that given that there's pass_if_after_reload
> > after pass_split_after_reload and I'm not sure we can split
> > as late as pass_split_before_sched2 (there's also a split _after_
> > sched2 on x86 it seems).
> >
> > So how would you go implement {s,u}{min,max}{si,di}3 for the
> > case STV doesn't end up doing any transform?
> 
> If STV doesn't transform the insn, then a pre-reload splitter splits
> the insn back to compare+cmove.


OK, that would work.  But there's no way to force a jumpy sequence then
which we know is faster than compare+cmove because later RTL
if-conversion passes happily re-discover the smax (or conditional move)
sequence.

> However, considering the SImode move
> from/to int/xmm register is relatively cheap, the cost function should
> be tuned so that STV always converts smaxsi3 pattern.

Note that on both Zen and even more so bdverN the int/xmm transition
makes it no longer profitable but a _lot_ slower than the cmp/cmov
sequence... (for the loop in hmmer which is the only one I see
any effect of any of my patches).  So identifying chains that
start/end in memory is important for cost reasons.

So I think the splitting has to happen after the last if-conversion
pass (and thus we may need to allocate a scratch register for this
purpose?)

> (As said before,
> the fix of the slowdown with consecutive cmov insns is a side effect
> of the transformation to smax insn that helps in this particular case,
> I think that this issue should be fixed in a general way, there are
> already a couple of PRs reported).
> 
> > You could save me some guesswork here if you can come up with
> > a reasonably complete final set of patterns (ok, I only care
> > about smaxsi3) so I can have a look at the STV approach again
> > (you may remember I simply "split" at assembler emission time).
> 
> I think that the cost function should always enable smaxsi3
> generation. To further optimize STV chain (to avoid unnecessary
> xmm<->int transitions) we could add all integer logic, arithmetic and
> constant shifts to the candidates (the ones that DImode STV converts).
>
> Uros.
> 
> > Thanks,
> > Richard.
> >
> > > The testcase:
> > >
> > > --cut here--
> > > long long test (long long a, long long b)
> > > {
> > >   return (a > b) ? a : b;
> > > }
> > > --cut here--
> > >
> > > gcc -O2 -m32 -msse2 (-mstv):
> > >
> > > _.268r.stv2 dump:
> > >
> > > Searching for mode conversion candidates...
> > >   insn 2 is marked as a candidate
> > >   insn 3 is marked as a candidate
> > >   insn 7 is marked as a candidate
> > > Created a new instruction chain #1
> > > Building chain #1...
> > >   Adding insn 2 to chain #1
> > >   Adding insn 7 into chain's #1 queue
> > >   Adding insn 7 to chain #1
> > >   r85 use in insn 12 isn't convertible
> > >   Mark r85 def in insn 7 as requiring both modes in chain #1
> > >   Adding insn 3 into chain's #1 queue
> > >   Adding insn 3 to chain #1
> > > Collected chain #1...
> > >   insns: 2, 3, 7
> > >   defs to convert: r85
> > > Computing gain for chain #1...
> > >   Instruction conversion gain: 24
> > >   Registers conversion cost: 6
> > >   Total gain: 18
> > > Converting chain #1...
> > >
> > > ...
> > >
> > > (insn 2 5 3 2 (set (reg/v:DI 83 [ a ])
> > >         (mem/c:DI (reg/f:SI 16 argp) [1 a+0 S8 A32])) "max.c":2:1 66
> > > {*movdi_internal}
> > >      (nil))
> > > (insn 3 2 4 2 (set (reg/v:DI 84 [ b ])
> > >         (mem/c:DI (plus:SI (reg/f:SI 16 argp)
> > >                 (const_int 8 [0x8])) [1 b+0 S8 A32])) "max.c":2:1 66
> > > {*movdi_internal}
> > >      (nil))
> > > (note 4 3 7 2 NOTE_INSN_FUNCTION_BEG)
> > > (insn 7 4 15 2 (set (subreg:V2DI (reg:DI 85) 0)
> > >         (smax:V2DI (subreg:V2DI (reg/v:DI 84 [ b ]) 0)
> > >             (subreg:V2DI (reg/v:DI 83 [ a ]) 0))) "max.c":3:22 -1
> > >      (expr_list:REG_DEAD (reg/v:DI 84 [ b ])
> > >         (expr_list:REG_DEAD (reg/v:DI 83 [ a ])
> > >             (expr_list:REG_UNUSED (reg:CC 17 flags)
> > >                 (nil)))))
> > > (insn 15 7 16 2 (set (reg:V2DI 87)
> > >         (subreg:V2DI (reg:DI 85) 0)) "max.c":3:22 -1
> > >      (nil))
> > > (insn 16 15 17 2 (set (subreg:SI (reg:DI 86) 0)
> > >         (subreg:SI (reg:V2DI 87) 0)) "max.c":3:22 -1
> > >      (nil))
> > > (insn 17 16 18 2 (set (reg:V2DI 87)
> > >         (lshiftrt:V2DI (reg:V2DI 87)
> > >             (const_int 32 [0x20]))) "max.c":3:22 -1
> > >      (nil))
> > > (insn 18 17 12 2 (set (subreg:SI (reg:DI 86) 4)
> > >         (subreg:SI (reg:V2DI 87) 0)) "max.c":3:22 -1
> > >      (nil))
> > > (insn 12 18 13 2 (set (reg/i:DI 0 ax)
> > >         (reg:DI 86)) "max.c":4:1 66 {*movdi_internal}
> > >      (expr_list:REG_DEAD (reg:DI 86)
> > >         (nil)))
> > > (insn 13 12 0 2 (use (reg/i:DI 0 ax)) "max.c":4:1 -1
> > >      (nil))
> > >
> > > Uros.
> > >
> >
> > --
> > Richard Biener <rguent...@suse.de>
> > SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany;
> > GF: Felix Imendörffer, Mary Higgins, Sri Rasiah; HRB 21284 (AG Nürnberg)
> 

-- 
Richard Biener <rguent...@suse.de>
SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany;
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah; HRB 21284 (AG Nürnberg)

Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs

Reply via email to