On Thu, Jan 03, 2019 at 12:16:31PM +0100, Richard Biener wrote:
> I guess it depends on target capabilities - I think
> __builtin_convertvector is a bit "misdesigned" for pack/unpack.  You
> also have to consider a v2di to v2qi conversion which requires
I'm aware of that, I know supportable_{widening,narrowing}_conversion in
the vectorizer handles those, but they are vectorizer specific and written
for the model the vectorizer uses.  In any case, I wanted to have something
correct first (i.e. the scalar ops fallback in there) and then improve what
I can, starting with the 2x narrowing and widening and going further only
once that works.

> several unpack steps.  Does the clang documentation give any
> hints how to "efficiently" use __builtin_convertvector for
> packing/unpacking without exposing too much of the target architecture?

The clang documentation is completely useless here; trying e.g.

typedef signed char v16qi __attribute__((vector_size (16 * sizeof (signed char))));
typedef int v16si __attribute__((vector_size (16 * sizeof (int))));

void
foo (v16si *x, v16qi *y)
{
  *y = __builtin_convertvector (*x, v16qi);
}

void
bar (v16qi *x, v16si *y)
{
  *y = __builtin_convertvector (*x, v16si);
}

with clang -O2 -mavx512{bw,vl,dq} shows efficient:

	vmovdqa64	(%rdi), %zmm0
	vpmovdb	%zmm0, (%rsi)

and

	vpmovsxbd	(%rdi), %zmm0
	vmovdqa64	%zmm0, (%rsi)

With -O2 -mavx2, bar is:

	vpmovsxbd	(%rdi), %ymm0
	vpmovsxbd	8(%rdi), %ymm1
	vmovdqa	%ymm1, 32(%rsi)
	vmovdqa	%ymm0, (%rsi)

which is what would be emitted for v8[qs]i twice, and foo is:

	vmovdqa	(%rdi), %ymm0
	vmovdqa	32(%rdi), %ymm1
	vmovdqa	.LCPI0_0(%rip), %ymm2	# ymm2 = [0,1,4,5,8,9,12,13,8,9,12,13,12,13,14,15,16,17,20,21,24,25,28,29,24,25,28,29,28,29,30,31]
	vpshufb	%ymm2, %ymm1, %ymm1
	vpermq	$232, %ymm1, %ymm1	# ymm1 = ymm1[0,2,2,3]
	vmovdqa	.LCPI0_1(%rip), %xmm3	# xmm3 = <0,2,4,6,8,10,12,14,u,u,u,u,u,u,u,u>
	vpshufb	%xmm3, %xmm1, %xmm1
	vpshufb	%ymm2, %ymm0, %ymm0
	vpermq	$232, %ymm0, %ymm0	# ymm0 = ymm0[0,2,2,3]
	vpshufb	%xmm3, %xmm0, %xmm0
	vpunpcklqdq	%xmm1, %xmm0, %xmm0	# xmm0 = xmm0[0],xmm1[0]
	vmovdqa	%xmm0, (%rsi)

which looks quite complicated to me.  I would think we could emit e.g.
what we emit for:

typedef signed char v16qi __attribute__((vector_size (16 * sizeof (signed char))));
typedef signed char v32qi __attribute__((vector_size (32 * sizeof (signed char))));

void
baz (v32qi *x, v16qi *y)
{
  v32qi z = __builtin_shuffle (x[0], x[1],
			       (v32qi) { 0, 4, 8, 12, 16, 20, 24, 28,
					 32, 36, 40, 44, 48, 52, 56, 60,
					 0, 4, 8, 12, 16, 20, 24, 28,
					 32, 36, 40, 44, 48, 52, 56, 60 });
  v16qi u;
  __builtin_memcpy (&u, &z, sizeof (u));
  *y = u;
}

which is with gcc trunk:

	vmovdqa	(%rdi), %ymm0
	vmovdqa	32(%rdi), %ymm1
	vpshufb	.LC0(%rip), %ymm0, %ymm0
	vpshufb	.LC1(%rip), %ymm1, %ymm1
	vpermq	$78, %ymm0, %ymm3
	vpermq	$78, %ymm1, %ymm2
	vpor	%ymm3, %ymm0, %ymm0
	vpor	%ymm2, %ymm1, %ymm1
	vpor	%ymm1, %ymm0, %ymm0
	vmovaps	%xmm0, (%rsi)

although really the upper half is a don't care (so a properly implemented
__builtin_shufflevector might be handy too, with
__builtin_shufflevector (x[0], x[1], 0, 4, 8, 12, 16, 20, 24, 28,
			 32, 36, 40, 44, 48, 52, 56, 60, -1, -1, -1, -1,
			 -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1);
).

> > Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?
>
> do_vec_conversion needs a comment.  Overall the patch (with its
> existing features) looks OK to me.

Will add one (though do_unop/do_binop don't have one either) and add
documentation.

> As of Marc's comments I agree that vector lowering happens quite late.
> It might be for example useful to lower before vectorization (or
> any loop optimization) so that un-handled generic vector code can
> eventually be vectorized differently.  But that's sth to investigate for
> GCC 10.
>
> Giving FE maintainers a chance to comment, so no overall ACK yet.

Ok.

	Jakub