On 8/9/19 7:00 AM, Richard Biener wrote:
> On Fri, 9 Aug 2019, Richard Biener wrote:
> 
>> On Fri, 9 Aug 2019, Richard Biener wrote:
>>
>>> On Fri, 9 Aug 2019, Uros Bizjak wrote:
>>>
>>>> On Mon, Aug 5, 2019 at 3:09 PM Uros Bizjak <ubiz...@gmail.com> wrote:
>>>>
>>>>>>>>>> (define_mode_iterator MAXMIN_IMODE [(SI "TARGET_SSE4_1")
>>>>>>>>>>                                     (DI "TARGET_AVX512F")])
>>>>>>>>>>
>>>>>>>>>> and then we need to split DImode for 32bits, too.
>>>>>>>>>
>>>>>>>>> For now, please add "TARGET_64BIT && TARGET_AVX512F" for DImode
>>>>>>>>> condition, I'll provide _doubleword splitter later.
>>>>>>>>
>>>>>>>> Shouldn't that be TARGET_AVX512VL instead?  Or does the insn use %g0 
>>>>>>>> etc.
>>>>>>>> to force use of %zmmN?
>>>>>>>
>>>>>>> It generates V4SI mode, so - yes, AVX512VL.
>>>>>>
>>>>>>     case SMAX:
>>>>>>     case SMIN:
>>>>>>     case UMAX:
>>>>>>     case UMIN:
>>>>>>       if ((mode == DImode && (!TARGET_64BIT || !TARGET_AVX512VL))
>>>>>>           || (mode == SImode && !TARGET_SSE4_1))
>>>>>>         return false;
>>>>>>
>>>>>> so there's no way to use AVX512VL for 32bit?
>>>>>
>>>>> There is a way, but on 32bit targets, we need to split the DImode
>>>>> operation into a sequence of SImode operations for the unconverted
>>>>> pattern.
>>>>> This is of course doable, but somehow more complex than simply
>>>>> emitting a DImode compare + DImode cmove, which is what current
>>>>> splitter does. So, a follow-up task.
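[The doubleword split described above can be modeled in plain C. This is a hypothetical sketch of what a _doubleword splitter would have to compute, not the actual GCC splitter; the function name is illustrative.]

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the doubleword split: a 64-bit signed max
   computed using only 32-bit halves, roughly the SImode sequence a
   splitter would emit on 32-bit targets (vs. the current DImode
   compare + DImode cmove approach).  */
static int64_t
smax_di_split (int64_t a, int64_t b)
{
  int32_t ahi = (int32_t) (a >> 32), bhi = (int32_t) (b >> 32);
  uint32_t alo = (uint32_t) a, blo = (uint32_t) b;

  /* Compare the signed high parts first; fall back to an unsigned
     compare of the low parts when the high parts are equal.  */
  int a_ge_b = ahi != bhi ? ahi > bhi : alo >= blo;
  return a_ge_b ? a : b;
}
```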
>>>>
>>>> Please find attached the complete .md part that enables SImode for
>>>> TARGET_SSE4_1 and DImode for TARGET_AVX512VL for both 32-bit and
>>>> 64-bit targets. The patterns also allow a memory operand 2, so STV
>>>> has a chance to create the vector pattern with an implicit load. In
>>>> case STV fails, the memory operand 2 is loaded into a register
>>>> first; operand 2 is used in both the compare and the cmove
>>>> instruction, so pre-loading the operand should be beneficial.
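[A minimal example of the kind of scalar max these patterns target; with -O2 -msse4.1 the compare + cmove below is a candidate for STV to turn into a single pmaxsd, possibly with a memory source for the second operand. The function name is illustrative, not from the patch.]

```c
#include <assert.h>

/* Scalar signed max: compiled as compare + cmove, which STV can
   convert to pmaxsd under -msse4.1 per the patterns discussed.  */
static int
smax_si (int a, int b)
{
  return a > b ? a : b;
}
```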
>>>
>>> Thanks.
>>>
>>>> Also note that splitting should happen rarely. Due to the cost
>>>> function, STV should effectively always convert minmax to a vector
>>>> insn.
>>>
>>> I've analyzed the 464.h264ref slowdown on Haswell and it is due to
>>> this kind of "simple" conversion:
>>>
>>>   5.50 │1d0:   test   %esi,%esi
>>>   0.07 │       mov    $0x0,%eax
>>>        │       cmovs  %eax,%esi
>>>   5.84 │       imul   %r8d,%esi
>>>
>>> to
>>>
>>>   0.65 │1e0:   vpxor  %xmm0,%xmm0,%xmm0
>>>   0.32 │       vpmaxsd -0x10(%rsp),%xmm0,%xmm0
>>>  40.45 │       vmovd  %xmm0,%eax
>>>   2.45 │       imul   %r8d,%eax
>>>
>>> which looks like a RA artifact in the end.  We spill %esi only
>>> with -mstv here as STV introduces a (subreg:V4SI ...) use
>>> of a pseudo ultimately set from di.  STV creates an additional
>>> pseudo for this (copy-in) but it places that copy next to the
>>> original def rather than next to the start of the chain it
>>> converts which is probably the issue why we spill.  And this
>>> is because it inserts those at each definition of the pseudo
>>> rather than just at the reaching definition(s) or at the
>>> uses of the pseudo in the chain (that is because there may be
>>> defs of that pseudo in the chain itself).  Note that STV emits
>>> such "conversion" copies as simple reg-reg moves:
>>>
>>> (insn 1094 3 4 2 (set (reg:SI 777)
>>>         (reg/v:SI 438 [ y ])) "refbuf.i":4:1 -1
>>>      (nil))
>>>
>>> but those do not survive very long (this one gets removed by CSE2).
>>> So IRA just sees the (subreg:V4SI (reg/v:SI 438 [ y ]) 0) use
>>> and computes
>>>
>>>     r438: preferred NO_REGS, alternative NO_REGS, allocno NO_REGS
>>>     a297(r438,l0) costs: SSE_REGS:5628,5628 MEM:3618,3618
>>>
>>> so I wonder if STV shouldn't instead emit gpr->xmm moves
>>> here (but I guess nothing again prevents RTL optimizers from
>>> combining that with the single-use in the max instruction...).
>>>
>>> So this boils down to STV splitting live-ranges but other
>>> passes undoing that and then RA not considering splitting
>>> live-ranges here, arriving at a suboptimal allocation.
>>>
>>> A testcase showing this issue is (simplified from 464.h264ref
>>> UMVLine16Y_11):
>>>
>>> unsigned short
>>> UMVLine16Y_11 (short unsigned int * Pic, int y, int width)
>>> {
>>>   if (y != width)
>>>     {
>>>       y = y < 0 ? 0 : y;
>>>       return Pic[y * width];
>>>     }
>>>   return Pic[y];
>>> }
>>>
>>> where the condition and the Pic[y] load mimic the other use of y.
>>> Different, even worse spilling is generated by
>>>
>>> unsigned short
>>> UMVLine16Y_11 (short unsigned int * Pic, int y, int width)
>>> {
>>>   y = y < 0 ? 0 : y;
>>>   return Pic[y * width] + y;
>>> }
>>>
>>> I guess this all shows that STV's "trick" of simply wrapping
>>> integer mode pseudos in (subreg:vector-mode ...) is bad?
>>>
>>> I've added a (failing) testcase to reflect the above.
>>
>> Experimenting a bit with using V4SImode pseudos just for the
>> conversion insns, we end up preserving those moves (but I
>> do have to use a lowpart set; using reg:V4SI = subreg:V4SI SImode-reg
>> ends up using movv4si_internal, which only leaves us with
>> memory for the SImode operand) _plus_ moving the move next
>> to the actual use has an effect.  Not necessarily a good one
>> though:
>>
>>         vpxor   %xmm0, %xmm0, %xmm0
>>         vmovaps %xmm0, -16(%rsp)
>>         movl    %esi, -16(%rsp)
>>         vpmaxsd -16(%rsp), %xmm0, %xmm0
>>         vmovd   %xmm0, %eax
>>
>> eh?  I guess the lowpart set is not good (my patch has this
>> as well, but I got saved by never having vector modes to subset...).
>> Using
>>
>>     (vec_merge:V4SI (vec_duplicate:V4SI (reg/v:SI 83 [ i ]))
>>             (const_vector:V4SI [
>>                     (const_int 0 [0]) repeated x4
>>                 ])
>>             (const_int 1 [0x1]))) "t3.c":5:10 -1
>>
>> for the move ends up with
>>
>>         vpxor   %xmm1, %xmm1, %xmm1
>>         vpinsrd $0, %esi, %xmm1, %xmm0
>>
>> eh?  LRA chooses the correct alternative here but somehow
>> postreload CSE CSEs the zero with the xmm1 clearing, leading
>> to the vpinsrd...  (I guess a general issue, not sure if really
>> worse - definitely a larger instruction).  Unfortunately
>> postreload-cse doesn't add a reg-equal note.  This happens only
>> when emitting the reg move before the use; not doing that emits
>> a vmovd as expected.
>>
>> At least the spilling is gone here.
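[The vec_merge expression shown earlier can be modeled in plain C: duplicate the scalar into every lane, then merge with the zero vector under mask 1, keeping only lane 0 from the duplicate. This is purely an illustration of the RTL semantics, not GCC code; the function name is made up.]

```c
#include <assert.h>

/* C model of (vec_merge (vec_duplicate s) (const_vector 0) (const_int 1)):
   the result is { s, 0, 0, 0 }, i.e. a vmovd-style scalar-to-vector move. */
static void
vec_merge_dup_zero (int s, int out[4])
{
  int dup[4]  = { s, s, s, s };   /* vec_duplicate:V4SI */
  int zero[4] = { 0, 0, 0, 0 };   /* const_vector:V4SI of zeros */
  int mask = 1;                   /* const_int 1: select lane 0 from dup */
  for (int i = 0; i < 4; i++)
    out[i] = ((mask >> i) & 1) ? dup[i] : zero[i];
}
```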
>>
>> I am re-testing as follows; the main change is that
>> general_scalar_chain::make_vector_copies now generates a
>> vector pseudo as the destination (and I've fixed up the code
>> to not generate (subreg:V4SI (reg:V4SI 1234) 0)).
>>
>> Hope this fixes the observed slowdowns (it fixes the new testcase).
> 
> It fixes the slowdown observed in 416.gamess and 464.h264ref.
> 
> Bootstrapped on x86_64-unknown-linux-gnu, testing still in progress.
> 
> CCing Jeff who "knows RTL".
What specifically do you want me to look at?  I'm not really familiar
with the STV stuff, but can certainly take a peek.


Jeff
