Hi,

Please note that currently the test:
int a[N];
short b[N*2];

for (int i = 0; i < N; ++i)
  a[i] = b[i*2];

is compiled to (with -march=corei7 -O2 -ftree-vectorize):

movdqa b(%rax), %xmm0
movdqa b-16(%rax), %xmm2
pand %xmm1, %xmm0
pand %xmm1, %xmm2
packusdw %xmm2, %xmm0
pmovsxwd %xmm0, %xmm2
psrldq $8, %xmm0
pmovsxwd %xmm0, %xmm0
movaps %xmm2, a-32(%rax)
movaps %xmm0, a-16(%rax)

which is closer to the requested sequence.

Thanks,
Evgeny

On Wed, Jun 25, 2014 at 8:34 PM, Cong Hou <co...@google.com> wrote:
> On Tue, Jun 24, 2014 at 4:05 AM, Richard Biener
> <richard.guent...@gmail.com> wrote:
>> On Sat, May 3, 2014 at 2:39 AM, Cong Hou <co...@google.com> wrote:
>>> On Mon, Apr 28, 2014 at 4:04 AM, Richard Biener <rguent...@suse.de> wrote:
>>>> On Thu, 24 Apr 2014, Cong Hou wrote:
>>>>
>>>>> Given the following loop:
>>>>>
>>>>> int a[N];
>>>>> short b[N*2];
>>>>>
>>>>> for (int i = 0; i < N; ++i)
>>>>>   a[i] = b[i*2];
>>>>>
>>>>> After being vectorized, the access to b[i*2] will be compiled into
>>>>> several packing statements, while the type promotion from short to int
>>>>> will be compiled into several unpacking statements. With this patch,
>>>>> each pair of pack/unpack statements will be replaced by less expensive
>>>>> statements (with shift or bit-and operations).
>>>>>
>>>>> On x86_64, the loop above will be compiled into the following assembly
>>>>> (with -O2 -ftree-vectorize):
>>>>>
>>>>> movdqu 0x10(%rcx),%xmm3
>>>>> movdqu -0x20(%rcx),%xmm0
>>>>> movdqa %xmm0,%xmm2
>>>>> punpcklwd %xmm3,%xmm0
>>>>> punpckhwd %xmm3,%xmm2
>>>>> movdqa %xmm0,%xmm3
>>>>> punpcklwd %xmm2,%xmm0
>>>>> punpckhwd %xmm2,%xmm3
>>>>> movdqa %xmm1,%xmm2
>>>>> punpcklwd %xmm3,%xmm0
>>>>> pcmpgtw %xmm0,%xmm2
>>>>> movdqa %xmm0,%xmm3
>>>>> punpckhwd %xmm2,%xmm0
>>>>> punpcklwd %xmm2,%xmm3
>>>>> movups %xmm0,-0x10(%rdx)
>>>>> movups %xmm3,-0x20(%rdx)
>>>>>
>>>>> With this patch, the generated assembly is shown below:
>>>>>
>>>>> movdqu 0x10(%rcx),%xmm0
>>>>> movdqu -0x20(%rcx),%xmm1
>>>>> pslld $0x10,%xmm0
>>>>> psrad $0x10,%xmm0
>>>>> pslld $0x10,%xmm1
>>>>> movups %xmm0,-0x10(%rdx)
>>>>> psrad $0x10,%xmm1
>>>>> movups %xmm1,-0x20(%rdx)
>>>>>
>>>>> Bootstrapped and tested on x86-64. OK for trunk?
>>>>
>>>> This is an odd place to implement such a transform. Also, whether it
>>>> is faster or not depends on the exact ISA you target; for
>>>> example, ppc has constraints on the maximum number of shifts
>>>> carried out in parallel, and the above has 4 in very short
>>>> succession, especially on the sign-extend path.
>>>
>>> Thank you for the information about ppc. If this is an issue, I think
>>> we can do it in a target-dependent way.
>>>
>>>> So this looks more like an opportunity for a post-vectorizer
>>>> transform on RTL, or for the vectorizer special-casing
>>>> widening loads with a vectorizer pattern.
>>>
>>> I am not sure whether the RTL transform is more difficult to implement. I
>>> prefer the widening-load method, which can be detected in a pattern
>>> recognizer. The target-related issue will be resolved by only
>>> expanding the widening load on those targets where this pattern is
>>> beneficial. But this requires new tree operations to be defined. What
>>> is your suggestion?
>>>
>>> I apologize for the delayed reply.
>>
>> Likewise ;)
>>
>> I suggest implementing this optimization in vector lowering in
>> tree-vect-generic.c.
>> For your example, this sees:
>>
>> vect__5.7_32 = MEM[symbol: b, index: ivtmp.15_13, offset: 0B];
>> vect__5.8_34 = MEM[symbol: b, index: ivtmp.15_13, offset: 16B];
>> vect_perm_even_35 = VEC_PERM_EXPR <vect__5.7_32, vect__5.8_34, { 0,
>> 2, 4, 6, 8, 10, 12, 14 }>;
>> vect__6.9_37 = [vec_unpack_lo_expr] vect_perm_even_35;
>> vect__6.9_38 = [vec_unpack_hi_expr] vect_perm_even_35;
>>
>> where you can apply the pattern matching and transform (after checking
>> with the target, of course).
>
> This sounds good to me! I'll try to make a patch following your suggestion.
>
> Thank you!
>
>
> Cong
>
>>
>> Richard.
>>
>>>
>>> thanks,
>>> Cong
>>>
>>>>
>>>> Richard.
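
For reference, the shift-based sequence the original patch aims for (pslld $16
followed by psrad $16 on each vector) can be written down directly with SSE2
intrinsics. The sketch below is only illustrative: the helper name
gather_even_shorts, the unaligned loads/stores, and the assumption that n is a
multiple of 4 are not from the thread.

#include <emmintrin.h>  /* SSE2 intrinsics */

/* Illustrative only: a[i] = b[2*i] for 0 <= i < n, with n assumed to be
   a multiple of 4.  */
static void
gather_even_shorts (int *a, const short *b, int n)
{
  for (int i = 0; i < n; i += 4)
    {
      /* Load 8 consecutive shorts; each 32-bit lane now holds
         (b[2*i+2*k+1] << 16) | (unsigned short) b[2*i+2*k].  */
      __m128i v = _mm_loadu_si128 ((const __m128i *) &b[2 * i]);
      /* pslld $16: push the even-indexed short into the high half of
         its lane, discarding the odd-indexed one.  */
      v = _mm_slli_epi32 (v, 16);
      /* psrad $16: shift it back arithmetically, i.e. sign-extended.  */
      v = _mm_srai_epi32 (v, 16);
      _mm_storeu_si128 ((__m128i *) &a[i], v);
    }
}

Each intrinsic corresponds to one instruction in the patched assembly quoted
above, which is what lets the pack/unpack pair collapse into two shifts per
vector.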
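The -march=corei7 sequence Evgeny reports (pand, packusdw, pmovsxwd) has a
similarly direct form with SSE4.1 intrinsics. Again a sketch only: the helper
name, the unaligned memory accesses, and the assumption that n is a multiple
of 8 are illustrative rather than taken from the thread.

#include <smmintrin.h>  /* SSE4.1 intrinsics */

/* Illustrative only: a[i] = b[2*i] for 0 <= i < n, with n assumed to be
   a multiple of 8.  */
static void
gather_even_shorts_sse41 (int *a, const short *b, int n)
{
  const __m128i mask = _mm_set1_epi32 (0xffff);
  for (int i = 0; i < n; i += 8)
    {
      __m128i v0 = _mm_loadu_si128 ((const __m128i *) &b[2 * i]);
      __m128i v1 = _mm_loadu_si128 ((const __m128i *) &b[2 * i + 8]);
      /* pand: keep only the even-indexed short of each 32-bit lane.  */
      v0 = _mm_and_si128 (v0, mask);
      v1 = _mm_and_si128 (v1, mask);
      /* packusdw: narrow both vectors into 8 even-indexed shorts; the
         values fit in 16 bits, so the unsigned saturation never hits.  */
      __m128i evens = _mm_packus_epi32 (v0, v1);
      /* pmovsxwd: widen each half back to int with sign extension.  */
      __m128i lo = _mm_cvtepi16_epi32 (evens);
      __m128i hi = _mm_cvtepi16_epi32 (_mm_srli_si128 (evens, 8));
      _mm_storeu_si128 ((__m128i *) &a[i], lo);
      _mm_storeu_si128 ((__m128i *) &a[i + 4], hi);
    }
}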