On Thu, Aug 12, 2021 at 5:23 PM Jakub Jelinek <ja...@redhat.com> wrote:
>
> On Thu, Aug 12, 2021 at 01:43:23PM +0800, liuhongt wrote:
> > Hi:
> >   This is another patch to optimize vec_perm_expr to match vpmov{dw,qd,wb}
> > under AVX512.
> >   For scenarios (like pr101846-2.c) where the upper half is not used, this
> > patch generates better code with only one vpmov{wb,dw,qd} instruction.  For
> > scenarios (like pr101846-3.c) where the upper half is actually used, if the
> > src vector length is 256/512 bits, the patch can still generate better code,
> > but for 128 bits the code generation is worse.
> >
> > 128 bits, upper half not used:
> >
> > -       vpshufb .LC2(%rip), %xmm0, %xmm0
> > +       vpmovdw %xmm0, %xmm0
> >
> > 128 bits, upper half used:
> >
> > -       vpshufb .LC2(%rip), %xmm0, %xmm0
> > +       vpmovdw %xmm0, %xmm1
> > +       vmovq   %xmm1, %rax
> > +       vpinsrq $0, %rax, %xmm0, %xmm0
> >
> >   Maybe expand_vec_perm_trunc_vinsert should only deal with 256/512-bit
> > vectors, but considering that real uses of scenarios like pr101846-3.c's
> > foo_*_128 are relatively unlikely, I still keep this part of the code.
>
> I actually am not sure if even
>
> foo_dw_512:
> .LFB0:
>         .cfi_startproc
> -       vmovdqa64       %zmm0, %zmm1
> -       vmovdqa64       .LC0(%rip), %zmm0
> -       vpermi2w        %zmm1, %zmm1, %zmm0
> +       vpmovdw %zmm0, %ymm1
> +       vinserti64x4    $0x0, %ymm1, %zmm0, %zmm0
>         ret
>
> is always a win, the permutations we should care most about are in loops

Yes, and vpmov{qd,dw} and vpermi{w,d} are both available under avx512f, so
expand_vec_perm_trunc_vinsert will never be matched when it's placed after
the other 2-insn cases.  The only part we need to handle is vpmovwb, which
is under avx512bw, but vpermib requires avx512vbmi.

> and the constant load as well as the first move in that case likely go
> away and it is one permutation insn vs. two.
> Different case is e.g.
>
> -       vmovdqa64       .LC5(%rip), %zmm2
> -       vmovdqa64       %zmm0, %zmm1
> -       vmovdqa64       .LC0(%rip), %zmm0
> -       vpermi2w        %zmm1, %zmm1, %zmm2
> -       vpermi2w        %zmm1, %zmm1, %zmm0
> -       vpshufb .LC6(%rip), %zmm0, %zmm0
> -       vpshufb .LC7(%rip), %zmm2, %zmm1
> -       vporq   %zmm1, %zmm0, %zmm0
> +       vpmovwb %zmm0, %ymm1
> +       vinserti64x4    $0x0, %ymm1, %zmm0, %zmm0
>
> So, I wonder if your new routine shouldn't instead be done
> in ix86_expand_vec_perm_const_1 after expand_vec_perm_1 among the other
> 2-insn cases, and handle the other vpmovdw etc. cases in combine splitters
> (see that we only use the low half or quarter of the result and transform
> whatever permutation we've used into what we want).

Got it, I'll try that way.  (A rough sketch of the matched pattern and of
the two competing sequences is at the end of this mail.)

> And perhaps make the routine eventually more general, don't handle
> just identity permutation in the upper half, but allow there other
> permutations too (ones where that half can be represented by a single insn
> permutation).
>
> >   Bootstrapped and regtested on x86_64-linux-gnu{-m32,}.
> >   Ok for trunk?
> >
> > gcc/ChangeLog:
> >
> >       PR target/101846
> >       * config/i386/i386-expand.c (expand_vec_perm_trunc_vinsert):
> >       New function.
> >       (ix86_vectorize_vec_perm_const): Call
> >       expand_vec_perm_trunc_vinsert.
> >       * config/i386/sse.md (vec_set_lo_v32hi): New define_insn.
> >       (vec_set_lo_v64qi): Ditto.
> >       (vec_set_lo_<mode><mask_name>): Extend to no-avx512dq.
> >
> > gcc/testsuite/ChangeLog:
> >
> >       PR target/101846
> >       * gcc.target/i386/pr101846-2.c: New test.
> >       * gcc.target/i386/pr101846-3.c: New test.
> > ---
> >  gcc/config/i386/i386-expand.c              | 125 +++++++++++++++++++++
> >  gcc/config/i386/sse.md                     |  60 +++++++++-
> >  gcc/testsuite/gcc.target/i386/pr101846-2.c |  81 +++++++++++++
> >  gcc/testsuite/gcc.target/i386/pr101846-3.c |  95 ++++++++++++++++
> >  4 files changed, 359 insertions(+), 2 deletions(-)
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr101846-2.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr101846-3.c
> >
> > diff --git a/gcc/config/i386/i386-expand.c b/gcc/config/i386/i386-expand.c
> > index bd21efa9530..519caac2e15 100644
> > --- a/gcc/config/i386/i386-expand.c
> > +++ b/gcc/config/i386/i386-expand.c
> > @@ -18317,6 +18317,126 @@ expand_vec_perm_1 (struct expand_vec_perm_d *d)
> >    return false;
> >  }
> >
> > +/* A subroutine of ix86_expand_vec_perm_const_1.  Try to implement D
> > +   in terms of a pair of vpmovdw + vinserti128 instructions.  */
> > +static bool
> > +expand_vec_perm_trunc_vinsert (struct expand_vec_perm_d *d)
> > +{
> > +  unsigned i, nelt = d->nelt, mask = d->nelt - 1;
> > +  unsigned half = nelt / 2;
> > +  machine_mode half_mode, trunc_mode;
> > +
> > +  /* vpmov{wb,dw,qd} only available under AVX512.  */
> > +  if (!d->one_operand_p || !TARGET_AVX512F
> > +      || (!TARGET_AVX512VL  && GET_MODE_SIZE (d->vmode) < 64)
>
> Too many spaces.
>
> > +      || GET_MODE_SIZE (GET_MODE_INNER (d->vmode)) > 4)
> > +    return false;
> > +
> > +  /* TARGET_AVX512BW is needed for vpmovwb.  */
> > +  if (GET_MODE_INNER (d->vmode) == E_QImode && !TARGET_AVX512BW)
> > +    return false;
> > +
> > +  for (i = 0; i < nelt; i++)
> > +    {
> > +      unsigned idx = d->perm[i] & mask;
> > +      if (idx != i * 2 && i < half)
> > +       return false;
> > +      if (idx != i && i >= half)
> > +       return false;
> > +    }
> > +
> > +  rtx (*gen_trunc) (rtx, rtx) = NULL;
> > +  rtx (*gen_vec_set_lo) (rtx, rtx, rtx) = NULL;
> > +  switch (d->vmode)
> > +    {
> > +    case E_V16QImode:
> > +      gen_trunc = gen_truncv8hiv8qi2;
> > +      gen_vec_set_lo = gen_vec_setv2di;
> > +      half_mode = V8QImode;
> > +      trunc_mode = V8HImode;
> > +      break;
> > +    case E_V32QImode:
> > +      gen_trunc = gen_truncv16hiv16qi2;
> > +      gen_vec_set_lo = gen_vec_set_lo_v32qi;
> > +      half_mode = V16QImode;
> > +      trunc_mode = V16HImode;
> > +      break;
> > +    case E_V64QImode:
> > +      gen_trunc = gen_truncv32hiv32qi2;
> > +      gen_vec_set_lo = gen_vec_set_lo_v64qi;
> > +      half_mode = V32QImode;
> > +      trunc_mode = V32HImode;
> > +      break;
> > +    case E_V8HImode:
> > +      gen_trunc = gen_truncv4siv4hi2;
> > +      gen_vec_set_lo = gen_vec_setv2di;
> > +      half_mode = V4HImode;
> > +      trunc_mode = V4SImode;
> > +      break;
> > +    case E_V16HImode:
> > +      gen_trunc = gen_truncv8siv8hi2;
> > +      gen_vec_set_lo = gen_vec_set_lo_v16hi;
> > +      half_mode = V8HImode;
> > +      trunc_mode = V8SImode;
> > +      break;
> > +    case E_V32HImode:
> > +      gen_trunc = gen_truncv16siv16hi2;
> > +      gen_vec_set_lo = gen_vec_set_lo_v32hi;
> > +      half_mode = V16HImode;
> > +      trunc_mode = V16SImode;
> > +      break;
> > +    case E_V4SImode:
> > +      gen_trunc = gen_truncv2div2si2;
> > +      gen_vec_set_lo = gen_vec_setv2di;
> > +      half_mode = V2SImode;
> > +      trunc_mode = V2DImode;
> > +      break;
> > +    case E_V8SImode:
> > +      gen_trunc = gen_truncv4div4si2;
> > +      gen_vec_set_lo = gen_vec_set_lo_v8si;
> > +      half_mode = V4SImode;
> > +      trunc_mode = V4DImode;
> > +      break;
> > +    case E_V16SImode:
> > +      gen_trunc = gen_truncv8div8si2;
> > +      gen_vec_set_lo = gen_vec_set_lo_v16si;
> > +      half_mode = V8SImode;
> > +      trunc_mode = V8DImode;
> > +      break;
> > +
> > +    default:
> > +      break;
> > +    }
> > +
> > +  if (gen_trunc == NULL)
> > +    return false;
>
> Isn't it simpler to return false; in the default: case?
>
> > @@ -21028,6 +21148,11 @@ ix86_vectorize_vec_perm_const (machine_mode vmode, rtx target, rtx op0,
> >        d.op0 = nop0;
> >        d.op1 = force_reg (vmode, d.op1);
> >
> > +  /* Try to match vpmov{wb,dw,qd}, although vinserti128 will be generated,
> > +     it's very likely to be optimized off. So let's put the function here. */
>
> Two spaces after full stop.
>
>         Jakub
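
For reference, the shape of permutation the new routine accepts is: low half
= the even-indexed elements of the source (i.e. a truncation), upper half =
identity.  A minimal GNU C sketch of my own (hypothetical example, not taken
from pr101846-2.c/-3.c):

/* perm = { 0, 2, 4, 6, 4, 5, 6, 7 } satisfies the routine's check
   (perm[i] == 2*i for i < nelt/2, perm[i] == i otherwise), so the low
   half is a V4SI -> V4HI truncation and the upper half is unchanged.  */
typedef short v8hi __attribute__ ((vector_size (16)));

v8hi
trunc_lo_keep_hi (v8hi x)
{
  v8hi mask = { 0, 2, 4, 6, 4, 5, 6, 7 };
  return __builtin_shuffle (x, mask);  /* Candidate for vpmovdw + vec_set.  */
}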
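And a rough intrinsics-level model of the two foo_dw_512 sequences compared
above.  This is illustrative only: the patch matches vec_perm_expr during
expansion rather than rewriting intrinsics, the function names are mine, and
the index constant is my reconstruction of what .LC0 would hold.  Needs
-mavx512bw for vpermi2w:

#include <immintrin.h>

/* Before: the whole permutation done by one vpermi2w plus a constant-pool
   load, available only with AVX512BW.  */
__m512i
dw_512_permi2w (__m512i x)
{
  /* Elements 0..15 pick the even words (truncation), 16..31 are identity.  */
  const __m512i idx = _mm512_set_epi16 (31, 30, 29, 28, 27, 26, 25, 24,
                                        23, 22, 21, 20, 19, 18, 17, 16,
                                        30, 28, 26, 24, 22, 20, 18, 16,
                                        14, 12, 10,  8,  6,  4,  2,  0);
  return _mm512_permutex2var_epi16 (x, idx, x);  /* vpermi2w */
}

/* After: truncate, then re-insert the low half, AVX512F only.  */
__m512i
dw_512_trunc_insert (__m512i x)
{
  __m256i lo = _mm512_cvtepi32_epi16 (x);  /* vpmovdw %zmm0, %ymm1 */
  return _mm512_inserti64x4 (x, lo, 0);    /* vinserti64x4 $0x0 */
}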
--
BR,
Hongtao