On Thu, Jul 20, 2017 at 9:47 AM, Jakub Jelinek <ja...@redhat.com> wrote:
> Hi!
>
> Richard has asked me recently to look at V[24]TI vector extraction
> and initialization, which he wants to use from the vectorizer.
>
> The following is an attempt to implement that.
>
> On the testcases included in the patch we get usually better or
> significantly better code generated, the exception is f1,
> where the change is:
> -       movq    %rdi, -32(%rsp)
> -       movq    %rsi, -24(%rsp)
> -       movq    %rdx, -16(%rsp)
> -       movq    %rcx, -8(%rsp)
> -       vmovdqa -32(%rsp), %ymm0
> +       movq    %rdi, -16(%rsp)
> +       movq    %rsi, -8(%rsp)
> +       movq    %rdx, -32(%rsp)
> +       movq    %rcx, -24(%rsp)
> +       vmovdqa -32(%rsp), %xmm0
> +       vmovdqa -16(%rsp), %xmm1
> +       vinserti128     $0x1, %xmm0, %ymm1, %ymm0
> which is something that is hard to handle before RA.  If the RA
> would spill it the other way around, perhaps it would be solveable by
> transforming
>         vmovdqa -32(%rsp), %xmm1
>         vmovdqa -16(%rsp), %xmm0
>         vinserti128     $0x01, %xmm0, %ymm1, %ymm0
> into
>         vmovdqa -32(%rsp), %ymm0
> using peephole2, but no idea how to force it that way.  And f11 also
> has similar problem, that time with 3 extra insns.  But if the TImode
> variable is allocated in a %?mm* register, we get better code even in those
> cases.

Please file a PR about this issue. IIRC, I have seen this spill
problem some time ago.

> For V4TImode perhaps we could improve some special cases of vec_initv4ti,
> like broadcast or only one variable otherwise everything constant, but at
> least for the broadcast I'm not really sure what is the optimal sequence.
> vbroadcasti32x4 is only able to broadcast from memory, which is good if the
> TImode input lives in memory, but if it doesn't?  __builtin_shuffle right
> now generates vpermq with the indices loaded from memory, but that needs to
> wait for memory load...
>
> Another thing is that we actually don't permit a normal move instruction
> for V4TImode unless AVX512BW, so we used to generate terrible code (spill it
> into memory using GPRs and then load back).  Any reason for that?
> I've found:
> https://gcc.gnu.org/ml/gcc-patches/2014-08/msg01465.html
>> > > -   (V2TI "TARGET_AVX") V1TI
>> > > +   (V4TI "TARGET_AVX") (V2TI "TARGET_AVX") V1TI
>> >
>> > Are you sure TARGET_AVX is the correct condition for V4TI?
>> Right! This should be TARGET_AVX512BW (because corresponding shifts
>> belong to AVX-512BW).
> but it isn't at all clear what shifts this is talking about.  This is VMOVE,
> which is used just in mov<mode>, mov<mode>_internal and movmisalign<mode>
> patterns, I fail to see what kind of shifts would those produce.
> Those should only produce vmovdqa64, vmovdqu64, vpxord or vpternlogd insns
> with %zmm* operands, those are all AVX512F already.
>
> Anyway, bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?
>
> Maybe it would be nice to also improve bitwise logical operations on
> V2TI/V4TImode - probably just expanders like {and,ior,xor}v[24]ti
> and maybe __builtin_shuffle.
>
> Richard also talked about V2OImode support, but I'm afraid that is going to
> be way too hard, we don't really have OImode support in most places.
>
> 2017-07-20  Jakub Jelinek  <ja...@redhat.com>
>
>         PR target/80846
>         * config/i386/i386.c (ix86_expand_vector_init_general): Handle
>         V2TImode and V4TImode.
>         (ix86_expand_vector_extract): Likewise.
>         * config/i386/sse.md (VMOVE): Enable V4TImode even for just
>         TARGET_AVX512F, instead of only for TARGET_AVX512BW.
>         (ssescalarmode): Handle V4TImode and V2TImode.
>         (VEC_EXTRACT_MODE): Add V4TImode and V2TImode.
>         (*vec_extractv2ti, *vec_extractv4ti): New insns.
>         (VEXTRACTI128_MODE): New mode iterator.
>         (splitter for *vec_extractv?ti first element): New.
>         (VEC_INIT_MODE): New mode iterator.
>         (vec_init<mode>): Consolidate 3 expanders into one using
>         VEC_INIT_MODE mode iterator.
>
>         * gcc.target/i386/avx-pr80846.c: New test.
>         * gcc.target/i386/avx2-pr80846.c: New test.
>         * gcc.target/i386/avx512f-pr80846.c: New test.

LGTM.

Thanks,
Uros.
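
For readers following the thread, the V2TI initialization and extraction discussed in the quoted patch can be sketched in C with GCC's generic vector extension. This is only an illustrative sketch, not one of the testcases from the patch: the typedef name v2ti and the helpers v2ti_init, v2ti_extract and v2ti_self_check are hypothetical, and it assumes GCC on x86_64, where __int128 is available.

```c
#include <assert.h>

/* Assumption: GCC on x86_64.  A 32-byte vector of two __int128
   elements, which GCC represents as V2TImode.  The name v2ti is
   hypothetical and not taken from the patch.  */
typedef __int128 v2ti __attribute__ ((vector_size (32)));

/* Hypothetical helper: build a vector from two 128-bit scalars.
   The vector is passed by pointer so that no vector crosses a
   function boundary by value when AVX is not enabled.  */
static void
v2ti_init (v2ti *out, __int128 lo, __int128 hi)
{
  *out = (v2ti) { lo, hi };
}

/* Hypothetical helper: read element IDX (0 or 1) of the vector.  */
static __int128
v2ti_extract (const v2ti *v, int idx)
{
  return (*v)[idx];
}

/* Small self-check tying the two helpers together.  */
static int
v2ti_self_check (void)
{
  v2ti v;
  __int128 big = ((__int128) 1 << 100) + 7;

  v2ti_init (&v, 42, big);
  assert (v2ti_extract (&v, 0) == 42);
  assert (v2ti_extract (&v, 1) == big);
  return 1;
}
```

Compiling helpers like these with -O2 -mavx2 and inspecting the assembly is one way to check whether the TImode inputs go through the stack, as in the f1 case quoted above.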