On 8/9/19 7:00 AM, Richard Biener wrote: > On Fri, 9 Aug 2019, Richard Biener wrote: > >> On Fri, 9 Aug 2019, Richard Biener wrote: >> >>> On Fri, 9 Aug 2019, Uros Bizjak wrote: >>> >>>> On Mon, Aug 5, 2019 at 3:09 PM Uros Bizjak <ubiz...@gmail.com> wrote: >>>> >>>>>>>>>> (define_mode_iterator MAXMIN_IMODE [SI "TARGET_SSE4_1"] [DI >>>>>>>>>> "TARGET_AVX512F"]) >>>>>>>>>> >>>>>>>>>> and then we need to split DImode for 32bits, too. >>>>>>>>> >>>>>>>>> For now, please add "TARGET_64BIT && TARGET_AVX512F" for DImode >>>>>>>>> condition, I'll provide _doubleword splitter later. >>>>>>>> >>>>>>>> Shouldn't that be TARGET_AVX512VL instead? Or does the insn use %g0 >>>>>>>> etc. >>>>>>>> to force use of %zmmN? >>>>>>> >>>>>>> It generates V4SI mode, so - yes, AVX512VL. >>>>>> >>>>>> case SMAX: >>>>>> case SMIN: >>>>>> case UMAX: >>>>>> case UMIN: >>>>>> if ((mode == DImode && (!TARGET_64BIT || !TARGET_AVX512VL)) >>>>>> || (mode == SImode && !TARGET_SSE4_1)) >>>>>> return false; >>>>>> >>>>>> so there's no way to use AVX512VL for 32bit? >>>>> >>>>> There is a way, but on 32bit targets, we need to split DImode >>>>> operation to a sequence of SImode operations for unconverted pattern. >>>>> This is of course doable, but somehow more complex than simply >>>>> emitting a DImode compare + DImode cmove, which is what current >>>>> splitter does. So, a follow-up task. >>>> >>>> Please find attached the complete .md part that enables SImode for >>>> TARGET_SSE4_1 and DImode for TARGET_AVX512VL for both, 32bit and 64bit >>>> targets. The patterns also allows for memory operand 2, so STV has >>>> chance to create the vector pattern with implicit load. In case STV >>>> fails, the memory operand 2 is loaded to the register first; operand >>>> 2 is used in compare and cmove instruction, so pre-loading of the >>>> operand should be beneficial. >>> >>> Thanks. >>> >>>> Also note, that splitting should happen rarely. Due to the cost >>>> function, STV should effectively always convert minmax to a vector >>>> insn. >>> >>> I've analyzed the 464.h264ref slowdown on Haswell and it is due to >>> this kind of "simple" conversion: >>> >>> 5.50 │1d0: test %esi,%es >>> 0.07 │ mov $0x0,%ex >>> │ cmovs %eax,%es >>> 5.84 │ imul %r8d,%es >>> >>> to >>> >>> 0.65 │1e0: vpxor %xmm0,%xmm0,%xmm0 >>> 0.32 │ vpmaxs -0x10(%rsp),%xmm0,%xmm0 >>> 40.45 │ vmovd %xmm0,%eax >>> 2.45 │ imul %r8d,%eax >>> >>> which looks like a RA artifact in the end. We spill %esi only >>> with -mstv here as STV introduces a (subreg:V4SI ...) use >>> of a pseudo ultimatively set from di. STV creates an additional >>> pseudo for this (copy-in) but it places that copy next to the >>> original def rather than next to the start of the chain it >>> converts which is probably the issue why we spill. And this >>> is because it inserts those at each definition of the pseudo >>> rather than just at the reaching definition(s) or at the >>> uses of the pseudo in the chain (that because there may be >>> defs of that pseudo in the chain itself). Note that STV emits >>> such "conversion" copies as simple reg-reg moves: >>> >>> (insn 1094 3 4 2 (set (reg:SI 777) >>> (reg/v:SI 438 [ y ])) "refbuf.i":4:1 -1 >>> (nil)) >>> >>> but those do not prevail very long (this one gets removed by CSE2). >>> So IRA just sees the (subreg:V4SI (reg/v:SI 438 [ y ]) 0) use >>> and computes >>> >>> r438: preferred NO_REGS, alternative NO_REGS, allocno NO_REGS >>> a297(r438,l0) costs: SSE_REGS:5628,5628 MEM:3618,3618 >>> >>> so I wonder if STV shouldn't instead emit gpr->xmm moves >>> here (but I guess nothing again prevents RTL optimizers from >>> combining that with the single-use in the max instruction...). >>> >>> So this boils down to STV splitting live-ranges but other >>> passes undoing that and then RA not considering splitting >>> live-ranges here, arriving at unoptimal allocation. >>> >>> A testcase showing this issue is (simplified from 464.h264ref >>> UMVLine16Y_11): >>> >>> unsigned short >>> UMVLine16Y_11 (short unsigned int * Pic, int y, int width) >>> { >>> if (y != width) >>> { >>> y = y < 0 ? 0 : y; >>> return Pic[y * width]; >>> } >>> return Pic[y]; >>> } >>> >>> where the condition and the Pic[y] load mimics the other use of y. >>> Different, even worse spilling is generated by >>> >>> unsigned short >>> UMVLine16Y_11 (short unsigned int * Pic, int y, int width) >>> { >>> y = y < 0 ? 0 : y; >>> return Pic[y * width] + y; >>> } >>> >>> I guess this all shows that STVs "trick" of simply wrapping >>> integer mode pseudos in (subreg:vector-mode ...) is bad? >>> >>> I've added a (failing) testcase to reflect the above. >> >> Experimenting a bit with just for the conversion insns using >> V4SImode pseudos we end up preserving those moves (but I >> do have to use a lowpart set, using reg:V4SI = subreg:V4SI Simode-reg >> ends up using movv4si_internal which only leaves us with >> memory for the SImode operand) _plus_ moving the move next >> to the actual use has an effect. Not necssarily a good one >> though: >> >> vpxor %xmm0, %xmm0, %xmm0 >> vmovaps %xmm0, -16(%rsp) >> movl %esi, -16(%rsp) >> vpmaxsd -16(%rsp), %xmm0, %xmm0 >> vmovd %xmm0, %eax >> >> eh? I guess the lowpart set is not good (my patch has this >> as well, but I got saved by never having vector modes to subset...). >> Using >> >> (vec_merge:V4SI (vec_duplicate:V4SI (reg/v:SI 83 [ i ])) >> (const_vector:V4SI [ >> (const_int 0 [0]) repeated x4 >> ]) >> (const_int 1 [0x1]))) "t3.c":5:10 -1 >> >> for the move ends up with >> >> vpxor %xmm1, %xmm1, %xmm1 >> vpinsrd $0, %esi, %xmm1, %xmm0 >> >> eh? LRA chooses the correct alternative here but somehow >> postreload CSE CSEs the zero with the xmm1 clearing, leading >> to the vpinsrd... (I guess a general issue, not sure if really >> worse - definitely a larger instruction). Unfortunately >> postreload-cse doesn't add a reg-equal note. This happens only >> when emitting the reg move before the use, not doing that emits >> a vmovd as expected. >> >> At least the spilling is gone here. >> >> I am re-testing as follows, the main change is that >> general_scalar_chain::make_vector_copies now generates a >> vector pseudo as destination (and I've fixed up the code >> to not generate (subreg:V4SI (reg:V4SI 1234) 0)). >> >> Hope this fixes the observed slowdowns (it fixes the new testcase). > > It fixes the slowdown observed in 416.gamess and 464.h264ref. > > Bootstrapped on x86_64-unknown-linux-gnu, testing still in progress. > > CCing Jeff who "knows RTL". What specifically do you want me to look at? I'm not really familiar with the STV stuff, but can certainly take a peek.
Jeff