On Thu, Feb 6, 2020 at 8:17 PM H.J. Lu <hjl.to...@gmail.com> wrote:
>
> On Mon, Jan 27, 2020 at 10:59 AM H.J. Lu <hjl.to...@gmail.com> wrote:
> >
> > On Mon, Jul 8, 2019 at 8:19 AM H.J. Lu <hjl.to...@gmail.com> wrote:
> > >
> > > On Tue, Jun 18, 2019 at 8:59 AM H.J. Lu <hjl.to...@gmail.com> wrote:
> > > >
> > > > On Fri, May 31, 2019 at 10:38 AM H.J. Lu <hjl.to...@gmail.com> wrote:
> > > > >
> > > > > On Tue, May 21, 2019 at 2:43 PM H.J. Lu <hjl.to...@gmail.com> wrote:
> > > > > >
> > > > > > On Fri, Feb 22, 2019 at 8:25 AM H.J. Lu <hongjiu...@intel.com> wrote:
> > > > > > >
> > > > > > > Hi Jan, Uros,
> > > > > > >
> > > > > > > This patch fixes the wrong-code bug:
> > > > > > >
> > > > > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89229
> > > > > > >
> > > > > > > Tested on AVX2 and AVX512 with and without --with-arch=native.
> > > > > > >
> > > > > > > OK for trunk?
> > > > > > >
> > > > > > > Thanks.
> > > > > > >
> > > > > > > H.J.
> > > > > > > --
> > > > > > > The i386 back end has
> > > > > > >
> > > > > > >   INT_MODE (OI, 32);
> > > > > > >   INT_MODE (XI, 64);
> > > > > > >
> > > > > > > so XImode represents 64 integer bytes = 64 * 8 = 512 bits; in the
> > > > > > > const_1 case, all 512 bits are set.
> > > > > > >
> > > > > > > We can load zeros with a narrower instruction (e.g. 256 bits via
> > > > > > > the inherent zeroing of the high part by a 128-bit xor), so
> > > > > > > TImode in this case.
> > > > > > >
> > > > > > > Some targets prefer V4SF mode, so they emit a float xorps for
> > > > > > > zeroing.
> > > > > > >
> > > > > > > sse.md has
> > > > > > >
> > > > > > > (define_insn "mov<mode>_internal"
> > > > > > >   [(set (match_operand:VMOVE 0 "nonimmediate_operand"
> > > > > > >          "=v,v ,v ,m")
> > > > > > >         (match_operand:VMOVE 1 "nonimmediate_or_sse_const_operand"
> > > > > > >          " C,BC,vm,v"))]
> > > > > > > ....
> > > > > > >   /* There is no evex-encoded vmov* for sizes smaller than
> > > > > > >      64 bytes in avx512f, so we need to use workarounds to
> > > > > > >      access sse registers 16-31, which are evex-only.  In
> > > > > > >      avx512vl we don't need workarounds.  */
> > > > > > >   if (TARGET_AVX512F && <MODE_SIZE> < 64 && !TARGET_AVX512VL
> > > > > > >       && (EXT_REX_SSE_REG_P (operands[0])
> > > > > > >           || EXT_REX_SSE_REG_P (operands[1])))
> > > > > > >     {
> > > > > > >       if (memory_operand (operands[0], <MODE>mode))
> > > > > > >         {
> > > > > > >           if (<MODE_SIZE> == 32)
> > > > > > >             return "vextract<shuffletype>64x4\t{$0x0, %g1, %0|%0, %g1, 0x0}";
> > > > > > >           else if (<MODE_SIZE> == 16)
> > > > > > >             return "vextract<shuffletype>32x4\t{$0x0, %g1, %0|%0, %g1, 0x0}";
> > > > > > >           else
> > > > > > >             gcc_unreachable ();
> > > > > > >         }
> > > > > > > ...
> > > > > > >
> > > > > > > However, since ix86_hard_regno_mode_ok has
> > > > > > >
> > > > > > >   /* TODO check for QI/HI scalars.  */
> > > > > > >   /* AVX512VL allows sse regs16+ for 128/256 bit modes.  */
> > > > > > >   if (TARGET_AVX512VL
> > > > > > >       && (mode == OImode
> > > > > > >           || mode == TImode
> > > > > > >           || VALID_AVX256_REG_MODE (mode)
> > > > > > >           || VALID_AVX512VL_128_REG_MODE (mode)))
> > > > > > >     return true;
> > > > > > >
> > > > > > >   /* xmm16-xmm31 are only available for AVX-512.  */
> > > > > > >   if (EXT_REX_SSE_REGNO_P (regno))
> > > > > > >     return false;
> > > > > > >
> > > > > > > the condition
> > > > > > >
> > > > > > >   if (TARGET_AVX512F && <MODE_SIZE> < 64 && !TARGET_AVX512VL
> > > > > > >       && (EXT_REX_SSE_REG_P (operands[0])
> > > > > > >           || EXT_REX_SSE_REG_P (operands[1])))
> > > > > > >
> > > > > > > is dead code.
> > > > > > > Also, for
> > > > > > >
> > > > > > > long long *p;
> > > > > > > volatile __m256i yy;
> > > > > > >
> > > > > > > void
> > > > > > > foo (void)
> > > > > > > {
> > > > > > >   _mm256_store_epi64 (p, yy);
> > > > > > > }
> > > > > > >
> > > > > > > with AVX512VL, we should generate
> > > > > > >
> > > > > > >   vmovdqa %ymm0, (%rax)
> > > > > > >
> > > > > > > not
> > > > > > >
> > > > > > >   vmovdqa64 %ymm0, (%rax)
> > > > > > >
> > > > > > > All TYPE_SSEMOV vector moves are consolidated into
> > > > > > > ix86_output_ssemov:
> > > > > > >
> > > > > > > 1. If xmm16-xmm31/ymm16-ymm31 registers aren't used, SSE/AVX
> > > > > > >    vector moves will be generated.
> > > > > > > 2. If xmm16-xmm31/ymm16-ymm31 registers are used:
> > > > > > >    a. With AVX512VL, AVX512VL vector moves will be generated.
> > > > > > >    b. Without AVX512VL, an xmm16-xmm31/ymm16-ymm31 register-to-
> > > > > > >       register move will be done with a zmm register move.
> > > > > > >
> > > > > > > ext_sse_reg_operand is removed since it is no longer needed.
> > > > > > >
> > > > > > > Tested on AVX2 and AVX512 with and without --with-arch=native.
> > > > > > >
> > > > > > > gcc/
> > > > > > >
> > > > > > > 	PR target/89229
> > > > > > > 	PR target/89346
> > > > > > > 	* config/i386/i386-protos.h (ix86_output_ssemov): New
> > > > > > > 	prototype.
> > > > > > > 	* config/i386/i386.c (ix86_get_ssemov): New function.
> > > > > > > 	(ix86_output_ssemov): Likewise.
> > > > > > > 	* config/i386/i386.md (*movxi_internal_avx512f): Call
> > > > > > > 	ix86_output_ssemov for TYPE_SSEMOV.
> > > > > > > 	(*movoi_internal_avx): Call ix86_output_ssemov for
> > > > > > > 	TYPE_SSEMOV.  Remove ext_sse_reg_operand and
> > > > > > > 	TARGET_AVX512VL check.
> > > > > > > 	(*movti_internal): Likewise.
> > > > > > > 	(*movdi_internal): Call ix86_output_ssemov for
> > > > > > > 	TYPE_SSEMOV.  Remove ext_sse_reg_operand check.
> > > > > > > 	(*movsi_internal): Likewise.
> > > > > > > 	(*movtf_internal): Call ix86_output_ssemov for
> > > > > > > 	TYPE_SSEMOV.
> > > > > > > 	(*movdf_internal): Call ix86_output_ssemov for
> > > > > > > 	TYPE_SSEMOV.  Remove TARGET_AVX512F, TARGET_PREFER_AVX256,
> > > > > > > 	TARGET_AVX512VL and ext_sse_reg_operand check.
> > > > > > > 	(*movsf_internal_avx): Call ix86_output_ssemov for
> > > > > > > 	TYPE_SSEMOV.  Remove TARGET_PREFER_AVX256, TARGET_AVX512VL
> > > > > > > 	and ext_sse_reg_operand check.
> > > > > > > 	* config/i386/mmx.md (MMXMODE:*mov<mode>_internal): Call
> > > > > > > 	ix86_output_ssemov for TYPE_SSEMOV.  Remove
> > > > > > > 	ext_sse_reg_operand check.
> > > > > > > 	* config/i386/sse.md (VMOVE:mov<mode>_internal): Call
> > > > > > > 	ix86_output_ssemov for TYPE_SSEMOV.  Remove TARGET_AVX512VL
> > > > > > > 	check.
> > > > > > > 	* config/i386/predicates.md (ext_sse_reg_operand): Removed.
> > > > > > >
> > > > > > > gcc/testsuite/
> > > > > > >
> > > > > > > 	PR target/89229
> > > > > > > 	PR target/89346
> > > > > > > 	* gcc.target/i386/avx512vl-vmovdqa64-1.c: Updated.
> > > > > > > 	* gcc.target/i386/pr89229-2a.c: New test.
> > > > > > > 	* gcc.target/i386/pr89229-2b.c: Likewise.
> > > > > > > 	* gcc.target/i386/pr89229-2c.c: Likewise.
> > > > > > > 	* gcc.target/i386/pr89229-3a.c: Likewise.
> > > > > > > 	* gcc.target/i386/pr89229-3b.c: Likewise.
> > > > > > > 	* gcc.target/i386/pr89229-3c.c: Likewise.
> > > > > > > 	* gcc.target/i386/pr89229-4a.c: Likewise.
> > > > > > > 	* gcc.target/i386/pr89229-4b.c: Likewise.
> > > > > > > 	* gcc.target/i386/pr89229-4c.c: Likewise.
> > > > > > > 	* gcc.target/i386/pr89229-5a.c: Likewise.
> > > > > > > 	* gcc.target/i386/pr89229-5b.c: Likewise.
> > > > > > > 	* gcc.target/i386/pr89229-5c.c: Likewise.
> > > > > > > 	* gcc.target/i386/pr89229-6a.c: Likewise.
> > > > > > > 	* gcc.target/i386/pr89229-6b.c: Likewise.
> > > > > > > 	* gcc.target/i386/pr89229-6c.c: Likewise.
> > > > > > > 	* gcc.target/i386/pr89229-7a.c: Likewise.
> > > > > > > 	* gcc.target/i386/pr89229-7b.c: Likewise.
> > > > > > > 	* gcc.target/i386/pr89229-7c.c: Likewise.
> > > > > > > ---
> > > > > >
> > > > > > PING:
> > > > > >
> > > > > > https://gcc.gnu.org/ml/gcc-patches/2019-02/msg01841.html
> > > > >
> > > > > PING.
> > > >
> > > > PING.
> > >
> > > PING.
> >
> > Here is the rebased patch.  I'd like to see it fixed for GCC 10.
>
> PING.  Here is the rebased patch.
>
PING.

https://gcc.gnu.org/ml/gcc-patches/2020-02/msg00415.html

--
H.J.