Hi,

Please note that currently the test:
int a[N];
short b[N*2];

for (int i = 0; i < N; ++i)
  a[i] = b[i*2];

is compiled to (with -march=corei7 -O2 -ftree-vectorize):

movdqa b(%rax), %xmm0
movdqa b-16(%rax), %xmm2
pand %xmm1, %xmm0
pand %xmm1, %xmm2
packusdw %xmm2, %xmm0
pmovsxwd %xmm0, %xmm2
psrldq $8, %xmm0
pmovsxwd %xmm0, %xmm0
movaps %xmm2, a-32(%rax)
movaps %xmm0, a-16(%rax)

which is closer to the requested sequence.

Thanks,
Evgeny

On Wed, Jun 25, 2014 at 8:34 PM, Cong Hou <co...@google.com> wrote:
> On Tue, Jun 24, 2014 at 4:05 AM, Richard Biener
> <richard.guent...@gmail.com> wrote:
>> On Sat, May 3, 2014 at 2:39 AM, Cong Hou <co...@google.com> wrote:
>>> On Mon, Apr 28, 2014 at 4:04 AM, Richard Biener <rguent...@suse.de> wrote:
>>>> On Thu, 24 Apr 2014, Cong Hou wrote:
>>>>
>>>>> Given the following loop:
>>>>>
>>>>> int a[N];
>>>>> short b[N*2];
>>>>>
>>>>> for (int i = 0; i < N; ++i)
>>>>>   a[i] = b[i*2];
>>>>>
>>>>> After being vectorized, the access to b[i*2] will be compiled into
>>>>> several packing statements, while the type promotion from short to int
>>>>> will be compiled into several unpacking statements. With this patch,
>>>>> each pair of pack/unpack statements will be replaced by less expensive
>>>>> statements (with shift or bit-and operations).
>>>>>
>>>>> On x86_64, the loop above will be compiled into the following assembly
>>>>> (with -O2 -ftree-vectorize):
>>>>>
>>>>> movdqu 0x10(%rcx),%xmm3
>>>>> movdqu -0x20(%rcx),%xmm0
>>>>> movdqa %xmm0,%xmm2
>>>>> punpcklwd %xmm3,%xmm0
>>>>> punpckhwd %xmm3,%xmm2
>>>>> movdqa %xmm0,%xmm3
>>>>> punpcklwd %xmm2,%xmm0
>>>>> punpckhwd %xmm2,%xmm3
>>>>> movdqa %xmm1,%xmm2
>>>>> punpcklwd %xmm3,%xmm0
>>>>> pcmpgtw %xmm0,%xmm2
>>>>> movdqa %xmm0,%xmm3
>>>>> punpckhwd %xmm2,%xmm0
>>>>> punpcklwd %xmm2,%xmm3
>>>>> movups %xmm0,-0x10(%rdx)
>>>>> movups %xmm3,-0x20(%rdx)
>>>>>
>>>>> With this patch, the generated assembly is shown below:
>>>>>
>>>>> movdqu 0x10(%rcx),%xmm0
>>>>> movdqu -0x20(%rcx),%xmm1
>>>>> pslld $0x10,%xmm0
>>>>> psrad $0x10,%xmm0
>>>>> pslld $0x10,%xmm1
>>>>> movups %xmm0,-0x10(%rdx)
>>>>> psrad $0x10,%xmm1
>>>>> movups %xmm1,-0x20(%rdx)
>>>>>
>>>>> Bootstrapped and tested on x86-64. OK for trunk?
>>>>
>>>> This is an odd place to implement such a transform. Also, whether it
>>>> is faster or not depends on the exact ISA you target; for
>>>> example, ppc has constraints on the maximum number of shifts
>>>> carried out in parallel, and the above has 4 in very short
>>>> succession, especially on the sign-extend path.
>>>
>>> Thank you for the information about ppc. If this is an issue, I think
>>> we can do it in a target-dependent way.
>>>
>>>> So this looks more like an opportunity for a post-vectorizer
>>>> transform on RTL, or for the vectorizer special-casing
>>>> widening loads with a vectorizer pattern.
>>>
>>> I am not sure whether the RTL transform is more difficult to implement. I
>>> prefer the widening-load method, which can be detected in a pattern
>>> recognizer. The target-related issue will be resolved by only
>>> expanding the widening load on those targets where this pattern is
>>> beneficial. But this requires new tree operations to be defined. What
>>> is your suggestion?
>>>
>>> I apologize for the delayed reply.
>>
>> Likewise ;)
>>
>> I suggest implementing this optimization in vector lowering in
>> tree-vect-generic.c.
>> For your example, this sees:
>>
>> vect__5.7_32 = MEM[symbol: b, index: ivtmp.15_13, offset: 0B];
>> vect__5.8_34 = MEM[symbol: b, index: ivtmp.15_13, offset: 16B];
>> vect_perm_even_35 = VEC_PERM_EXPR <vect__5.7_32, vect__5.8_34, { 0,
>> 2, 4, 6, 8, 10, 12, 14 }>;
>> vect__6.9_37 = [vec_unpack_lo_expr] vect_perm_even_35;
>> vect__6.9_38 = [vec_unpack_hi_expr] vect_perm_even_35;
>>
>> where you can apply the pattern matching and transform (after checking
>> with the target, of course).
>
> This sounds good to me! I'll try to make a patch following your suggestion.
>
> Thank you!
>
>
> Cong
>
>>
>> Richard.
>>
>>>
>>> thanks,
>>> Cong
>>>
>>>>
>>>> Richard.
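
For reference, the shift-based sequence the original patch aims for (pslld $16
followed by psrad $16 on each vector) can be written down directly with SSE2
intrinsics. The sketch below is only illustrative: the helper name
gather_even_shorts, the unaligned loads/stores, and the assumption that n is a
multiple of 4 are not from the thread.

#include <emmintrin.h>  /* SSE2 intrinsics */

/* Illustrative only: a[i] = b[2*i] for 0 <= i < n, with n assumed to be
   a multiple of 4.  */
static void
gather_even_shorts (int *a, const short *b, int n)
{
  for (int i = 0; i < n; i += 4)
    {
      /* Load 8 consecutive shorts; each 32-bit lane now holds
         (b[2*i+2*k+1] << 16) | (unsigned short) b[2*i+2*k].  */
      __m128i v = _mm_loadu_si128 ((const __m128i *) &b[2 * i]);
      /* pslld $16: push the even-indexed short into the high half of
         its lane, discarding the odd-indexed one.  */
      v = _mm_slli_epi32 (v, 16);
      /* psrad $16: shift it back arithmetically, i.e. sign-extended.  */
      v = _mm_srai_epi32 (v, 16);
      _mm_storeu_si128 ((__m128i *) &a[i], v);
    }
}

Each intrinsic corresponds to one instruction in the patched assembly quoted
above, which is what lets the pack/unpack pair collapse into two shifts per
vector.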
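The -march=corei7 sequence Evgeny reports (pand, packusdw, pmovsxwd) has a
similarly direct form with SSE4.1 intrinsics. Again a sketch only: the helper
name, the unaligned memory accesses, and the assumption that n is a multiple
of 8 are illustrative rather than taken from the thread.

#include <smmintrin.h>  /* SSE4.1 intrinsics */

/* Illustrative only: a[i] = b[2*i] for 0 <= i < n, with n assumed to be
   a multiple of 8.  */
static void
gather_even_shorts_sse41 (int *a, const short *b, int n)
{
  const __m128i mask = _mm_set1_epi32 (0xffff);
  for (int i = 0; i < n; i += 8)
    {
      __m128i v0 = _mm_loadu_si128 ((const __m128i *) &b[2 * i]);
      __m128i v1 = _mm_loadu_si128 ((const __m128i *) &b[2 * i + 8]);
      /* pand: keep only the even-indexed short of each 32-bit lane.  */
      v0 = _mm_and_si128 (v0, mask);
      v1 = _mm_and_si128 (v1, mask);
      /* packusdw: narrow both vectors into 8 even-indexed shorts; the
         values fit in 16 bits, so the unsigned saturation never hits.  */
      __m128i evens = _mm_packus_epi32 (v0, v1);
      /* pmovsxwd: widen each half back to int with sign extension.  */
      __m128i lo = _mm_cvtepi16_epi32 (evens);
      __m128i hi = _mm_cvtepi16_epi32 (_mm_srli_si128 (evens, 8));
      _mm_storeu_si128 ((__m128i *) &a[i], lo);
      _mm_storeu_si128 ((__m128i *) &a[i + 4], hi);
    }
}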