On Thu, 24 Apr 2014, Cong Hou wrote:

> Given the following loop:
>
> int a[N];
> short b[N*2];
>
> for (int i = 0; i < N; ++i)
>   a[i] = b[i*2];
>
>
> After being vectorized, the access to b[i*2] will be compiled into
> several packing statements, while the type promotion from short to int
> will be compiled into several unpacking statements. With this patch,
> each pair of pack/unpack statements will be replaced by less expensive
> statements (with shift or bit-and operations).
>
> On x86_64, the loop above will be compiled into the following assembly
> (with -O2 -ftree-vectorize):
>
> movdqu 0x10(%rcx),%xmm3
> movdqu -0x20(%rcx),%xmm0
> movdqa %xmm0,%xmm2
> punpcklwd %xmm3,%xmm0
> punpckhwd %xmm3,%xmm2
> movdqa %xmm0,%xmm3
> punpcklwd %xmm2,%xmm0
> punpckhwd %xmm2,%xmm3
> movdqa %xmm1,%xmm2
> punpcklwd %xmm3,%xmm0
> pcmpgtw %xmm0,%xmm2
> movdqa %xmm0,%xmm3
> punpckhwd %xmm2,%xmm0
> punpcklwd %xmm2,%xmm3
> movups %xmm0,-0x10(%rdx)
> movups %xmm3,-0x20(%rdx)
>
>
> With this patch, the generated assembly is shown below:
>
> movdqu 0x10(%rcx),%xmm0
> movdqu -0x20(%rcx),%xmm1
> pslld $0x10,%xmm0
> psrad $0x10,%xmm0
> pslld $0x10,%xmm1
> movups %xmm0,-0x10(%rdx)
> psrad $0x10,%xmm1
> movups %xmm1,-0x20(%rdx)
>
>
> Bootstrapped and tested on x86-64. OK for trunk?

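
For reference, the equivalence behind the quoted pslld/psrad sequence can be sketched in scalar C. This is only an illustration, not code from the patch: the function names are made up, and it assumes a little-endian target (so b[2*i] is the low half of the 32-bit word covering b[2*i] and b[2*i+1]) and a compiler where converting to int32_t wraps and >> on a negative value is an arithmetic shift, as GCC documents.

```c
#include <stdint.h>
#include <string.h>

/* Reference: what the source loop computes for element i. */
static int32_t widen_even_ref(const int16_t *b, size_t i)
{
    return (int32_t)b[2 * i];
}

/* Shift-based variant (hypothetical helper, scalar analogue of
   pslld $16 followed by psrad $16 on one 32-bit lane). */
static int32_t widen_even_shift(const int16_t *b, size_t i)
{
    uint32_t w;

    /* Load the 32-bit word holding b[2*i] and b[2*i+1]. */
    memcpy(&w, &b[2 * i], sizeof w);

    /* Assumption: little-endian, so b[2*i] sits in the low 16 bits.
       The left shift discards b[2*i+1]; the right shift sign-extends.
       Assumption: the cast wraps and >> is arithmetic (GCC behaviour). */
    return (int32_t)(w << 16) >> 16;
}

/* Count elements where the two variants disagree, for n output elements
   (b must hold 2*n shorts). */
static size_t widen_even_mismatches(const int16_t *b, size_t n)
{
    size_t bad = 0;
    for (size_t i = 0; i < n; ++i)
        if (widen_even_ref(b, i) != widen_even_shift(b, i))
            ++bad;
    return bad;
}
```

For an unsigned element type the same idea would use a bit-and with 0xffff instead of the shift pair, which is the "bit-and" case the quoted description mentions.
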
This is an odd place to implement such a transform. Also, whether it is
faster or not depends on the exact ISA you target. For example, ppc has
constraints on the maximum number of shifts carried out in parallel, and
the above has 4 in very short succession, especially on the sign-extend
path.

So this looks more like an opportunity for a post-vectorizer transform
on RTL, or for the vectorizer special-casing widening loads with a
vectorizer pattern.

Richard.