On Fri, Aug 9, 2019 at 3:00 PM Richard Biener <rguent...@suse.de> wrote:
> > > > > > > > > > (define_mode_iterator MAXMIN_IMODE [SI "TARGET_SSE4_1"] [DI > > > > > > > > > > "TARGET_AVX512F"]) > > > > > > > > > > > > > > > > > > > > and then we need to split DImode for 32bits, too. > > > > > > > > > > > > > > > > > > For now, please add "TARGET_64BIT && TARGET_AVX512F" for > > > > > > > > > DImode > > > > > > > > > condition, I'll provide _doubleword splitter later. > > > > > > > > > > > > > > > > Shouldn't that be TARGET_AVX512VL instead? Or does the insn > > > > > > > > use %g0 etc. > > > > > > > > to force use of %zmmN? > > > > > > > > > > > > > > It generates V4SI mode, so - yes, AVX512VL. > > > > > > > > > > > > case SMAX: > > > > > > case SMIN: > > > > > > case UMAX: > > > > > > case UMIN: > > > > > > if ((mode == DImode && (!TARGET_64BIT || !TARGET_AVX512VL)) > > > > > > || (mode == SImode && !TARGET_SSE4_1)) > > > > > > return false; > > > > > > > > > > > > so there's no way to use AVX512VL for 32bit? > > > > > > > > > > There is a way, but on 32bit targets, we need to split DImode > > > > > operation to a sequence of SImode operations for unconverted pattern. > > > > > This is of course doable, but somehow more complex than simply > > > > > emitting a DImode compare + DImode cmove, which is what current > > > > > splitter does. So, a follow-up task. > > > > > > > > Please find attached the complete .md part that enables SImode for > > > > TARGET_SSE4_1 and DImode for TARGET_AVX512VL for both, 32bit and 64bit > > > > targets. The patterns also allows for memory operand 2, so STV has > > > > chance to create the vector pattern with implicit load. In case STV > > > > fails, the memory operand 2 is loaded to the register first; operand > > > > 2 is used in compare and cmove instruction, so pre-loading of the > > > > operand should be beneficial. > > > > > > Thanks. > > > > > > > Also note, that splitting should happen rarely. Due to the cost > > > > function, STV should effectively always convert minmax to a vector > > > > insn. > > > > > > I've analyzed the 464.h264ref slowdown on Haswell and it is due to > > > this kind of "simple" conversion: > > > > > > 5.50 │1d0: test %esi,%es > > > 0.07 │ mov $0x0,%ex > > > │ cmovs %eax,%es > > > 5.84 │ imul %r8d,%es > > > > > > to > > > > > > 0.65 │1e0: vpxor %xmm0,%xmm0,%xmm0 > > > 0.32 │ vpmaxs -0x10(%rsp),%xmm0,%xmm0 > > > 40.45 │ vmovd %xmm0,%eax > > > 2.45 │ imul %r8d,%eax > > > > > > which looks like a RA artifact in the end. We spill %esi only > > > with -mstv here as STV introduces a (subreg:V4SI ...) use > > > of a pseudo ultimatively set from di. STV creates an additional > > > pseudo for this (copy-in) but it places that copy next to the > > > original def rather than next to the start of the chain it > > > converts which is probably the issue why we spill. And this > > > is because it inserts those at each definition of the pseudo > > > rather than just at the reaching definition(s) or at the > > > uses of the pseudo in the chain (that because there may be > > > defs of that pseudo in the chain itself). Note that STV emits > > > such "conversion" copies as simple reg-reg moves: > > > > > > (insn 1094 3 4 2 (set (reg:SI 777) > > > (reg/v:SI 438 [ y ])) "refbuf.i":4:1 -1 > > > (nil)) > > > > > > but those do not prevail very long (this one gets removed by CSE2). > > > So IRA just sees the (subreg:V4SI (reg/v:SI 438 [ y ]) 0) use > > > and computes > > > > > > r438: preferred NO_REGS, alternative NO_REGS, allocno NO_REGS > > > a297(r438,l0) costs: SSE_REGS:5628,5628 MEM:3618,3618 > > > > > > so I wonder if STV shouldn't instead emit gpr->xmm moves > > > here (but I guess nothing again prevents RTL optimizers from > > > combining that with the single-use in the max instruction...). > > > > > > So this boils down to STV splitting live-ranges but other > > > passes undoing that and then RA not considering splitting > > > live-ranges here, arriving at unoptimal allocation. > > > > > > A testcase showing this issue is (simplified from 464.h264ref > > > UMVLine16Y_11): > > > > > > unsigned short > > > UMVLine16Y_11 (short unsigned int * Pic, int y, int width) > > > { > > > if (y != width) > > > { > > > y = y < 0 ? 0 : y; > > > return Pic[y * width]; > > > } > > > return Pic[y]; > > > } > > > > > > where the condition and the Pic[y] load mimics the other use of y. > > > Different, even worse spilling is generated by > > > > > > unsigned short > > > UMVLine16Y_11 (short unsigned int * Pic, int y, int width) > > > { > > > y = y < 0 ? 0 : y; > > > return Pic[y * width] + y; > > > } > > > > > > I guess this all shows that STVs "trick" of simply wrapping > > > integer mode pseudos in (subreg:vector-mode ...) is bad? > > > > > > I've added a (failing) testcase to reflect the above. > > > > Experimenting a bit with just for the conversion insns using > > V4SImode pseudos we end up preserving those moves (but I > > do have to use a lowpart set, using reg:V4SI = subreg:V4SI Simode-reg > > ends up using movv4si_internal which only leaves us with > > memory for the SImode operand) _plus_ moving the move next > > to the actual use has an effect. Not necssarily a good one > > though: > > > > vpxor %xmm0, %xmm0, %xmm0 > > vmovaps %xmm0, -16(%rsp) > > movl %esi, -16(%rsp) > > vpmaxsd -16(%rsp), %xmm0, %xmm0 > > vmovd %xmm0, %eax > > > > eh? I guess the lowpart set is not good (my patch has this > > as well, but I got saved by never having vector modes to subset...). > > Using > > > > (vec_merge:V4SI (vec_duplicate:V4SI (reg/v:SI 83 [ i ])) > > (const_vector:V4SI [ > > (const_int 0 [0]) repeated x4 > > ]) > > (const_int 1 [0x1]))) "t3.c":5:10 -1 > > > > for the move ends up with > > > > vpxor %xmm1, %xmm1, %xmm1 > > vpinsrd $0, %esi, %xmm1, %xmm0 > > > > eh? LRA chooses the correct alternative here but somehow > > postreload CSE CSEs the zero with the xmm1 clearing, leading > > to the vpinsrd... (I guess a general issue, not sure if really > > worse - definitely a larger instruction). Unfortunately > > postreload-cse doesn't add a reg-equal note. This happens only > > when emitting the reg move before the use, not doing that emits > > a vmovd as expected. > > > > At least the spilling is gone here. > > > > I am re-testing as follows, the main change is that > > general_scalar_chain::make_vector_copies now generates a > > vector pseudo as destination (and I've fixed up the code > > to not generate (subreg:V4SI (reg:V4SI 1234) 0)). > > > > Hope this fixes the observed slowdowns (it fixes the new testcase). > > It fixes the slowdown observed in 416.gamess and 464.h264ref. > > Bootstrapped on x86_64-unknown-linux-gnu, testing still in progress. > > CCing Jeff who "knows RTL". > > OK? Please add -mno-stv to gcc.target/i386/minmax-{1,2}.c to avoid spurious test failures on SSE4.1 targets. Uros. > Thanks, > Richard. > > > Richard. > > > > mccas.F:twotff_ for 416.gamess > > refbuf.c:UMVLine16Y_11 for 464.h264ref > > > > 2019-08-07 Richard Biener <rguent...@suse.de> > > > > PR target/91154 > > * config/i386/i386-features.h (scalar_chain::scalar_chain): Add > > mode arguments. > > (scalar_chain::smode): New member. > > (scalar_chain::vmode): Likewise. > > (dimode_scalar_chain): Rename to... > > (general_scalar_chain): ... this. > > (general_scalar_chain::general_scalar_chain): Take mode arguments. > > (timode_scalar_chain::timode_scalar_chain): Initialize scalar_chain > > base with TImode and V1TImode. > > * config/i386/i386-features.c (scalar_chain::scalar_chain): Adjust. > > (general_scalar_chain::vector_const_cost): Adjust for SImode > > chains. > > (general_scalar_chain::compute_convert_gain): Likewise. Fix > > reg-reg move cost gain, use ix86_cost->sse_op cost and adjust > > scalar costs. Add {S,U}{MIN,MAX} support. Dump per-instruction > > gain if not zero. > > (general_scalar_chain::replace_with_subreg): Use vmode/smode. > > Elide the subreg if the reg is already vector. > > (general_scalar_chain::make_vector_copies): Likewise. Handle > > non-DImode chains appropriately. Use a vector-mode pseudo as > > destination. > > (general_scalar_chain::convert_reg): Likewise. > > (general_scalar_chain::convert_op): Likewise. Elide the > > subreg if the reg is already vector. > > (general_scalar_chain::convert_insn): Likewise. Add > > fatal_insn_not_found if the result is not recognized. > > (convertible_comparison_p): Pass in the scalar mode and use that. > > (general_scalar_to_vector_candidate_p): Likewise. Rename from > > dimode_scalar_to_vector_candidate_p. Add {S,U}{MIN,MAX} support. > > (scalar_to_vector_candidate_p): Remove by inlining into single > > caller. > > (general_remove_non_convertible_regs): Rename from > > dimode_remove_non_convertible_regs. > > (remove_non_convertible_regs): Remove by inlining into single caller. > > (convert_scalars_to_vector): Handle SImode and DImode chains > > in addition to TImode chains. > > * config/i386/i386.md (<maxmin><SWI48>3): New insn split after STV. > > > > * gcc.target/i386/pr91154.c: New testcase. > > * gcc.target/i386/minmax-3.c: Likewise. > > * gcc.target/i386/minmax-4.c: Likewise. > > * gcc.target/i386/minmax-5.c: Likewise. > > * gcc.target/i386/minmax-6.c: Likewise.