On Sun, Jan 7, 2024 at 6:53 AM Roger Sayle <ro...@nextmovesoftware.com> wrote: > > Hi Hongtao, > > Many thanks for the review. This revised patch implements several > of your suggestions, specifically to use pshufd for V4SImode and > punpcklqdq for V2DImode. These changes are demonstrated by the > examples below: > > typedef unsigned int v4si __attribute((vector_size(16))); > typedef unsigned long long v2di __attribute((vector_size(16))); > > v4si foo() { return (v4si){1,1,1,1}; } > v2di bar() { return (v2di){1,1}; } > > The previous version of my patch generated: > > foo: movdqa .LC0(%rip), %xmm0 > ret > bar: movdqa .LC1(%rip), %xmm0 > ret > > with this revised version, -O2 generates: > > foo: movl $1, %eax > movd %eax, %xmm0 > pshufd $0, %xmm0, %xmm0 > ret > bar: movl $1, %eax > movq %rax, %xmm0 > punpcklqdq %xmm0, %xmm0 > ret > > However, if it's OK with you, I'd prefer to allow this function to > return false, safely falling back to emitting a vector load from > the constant bool rather than ICEing from a gcc_assert. For one Sure, that makes sense. > thing this isn't a unrecoverable correctness issue, but at worst > a missed optimization. The deeper reason is that this usefully > provides a handle for tuning on different microarchitectures. > On some (AMD?) machines, where !TARGET_INTER_UNIT_MOVES_TO_VEC, > the first form above may be preferable to the second. Currently > the start of ix86_convert_const_wide_int_to_broadcast disables > broadcasts for !TARGET_INTER_UNIT_MOVES_TO_VEC even when an > implementation doesn't reuire an inter unit move, such as a > broadcast from memory. I plan follow-up patches that benefit > from this flexibility. > > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap > and make -k check, both with and without --target_board=unix{-m32} > with no new failures. Ok for mainline? Ok. > > gcc/ChangeLog > PR target/112992 > * config/i386/i386-expand.cc > (ix86_convert_const_wide_int_to_broadcast): Allow call to > ix86_expand_vector_init_duplicate to fail, and return NULL_RTX. > (ix86_broadcast_from_constant): Revert recent change; Return a > suitable MEMREF independently of mode/target combinations. > (ix86_expand_vector_move): Allow ix86_expand_vector_init_duplicate > to decide whether expansion is possible/preferrable. Only try > forcing DImode constants to memory (and trying again) if calling > ix86_expand_vector_init_duplicate fails with an DImode immediate > constant. > (ix86_expand_vector_init_duplicate) <case E_V2DImode>: Try using > V4SImode for suitable immediate constants. > <case E_V4DImode>: Try using V8SImode for suitable constants. > <case E_V4HImode>: Fail for CONST_INT_P, i.e. use constant pool. > <case E_V2HImode>: Likewise. > <case E_V8HImode>: For CONST_INT_P try using V4SImode via widen. > <case E_V16QImode>: For CONT_INT_P try using V8HImode via widen. > <label widen>: Handle CONT_INTs via simplify_binary_operation. > Allow recursive calls to ix86_expand_vector_init_duplicate to fail. > <case E_V16HImode>: For CONST_INT_P try V8SImode via widen. > <case E_V32QImode>: For CONST_INT_P try V16HImode via widen. > (ix86_expand_vector_init): Move try using a broadcast for all_same > with ix86_expand_vector_init_duplicate before using constant pool. > > gcc/testsuite/ChangeLog > * gcc.target/i386/auto-init-8.c: Update test case. > * gcc.target/i386/avx512f-broadcast-pr87767-1.c: Likewise. > * gcc.target/i386/avx512f-broadcast-pr87767-5.c: Likewise. > * gcc.target/i386/avx512fp16-13.c: Likewise. > * gcc.target/i386/avx512vl-broadcast-pr87767-1.c: Likewise. > * gcc.target/i386/avx512vl-broadcast-pr87767-5.c: Likewise. > * gcc.target/i386/pr100865-1.c: Likewise. > * gcc.target/i386/pr100865-10a.c: Likewise. > * gcc.target/i386/pr100865-10b.c: Likewise. > * gcc.target/i386/pr100865-2.c: Likewise. > * gcc.target/i386/pr100865-3.c: Likewise. > * gcc.target/i386/pr100865-4a.c: Likewise. > * gcc.target/i386/pr100865-4b.c: Likewise. > * gcc.target/i386/pr100865-5a.c: Likewise. > * gcc.target/i386/pr100865-5b.c: Likewise. > * gcc.target/i386/pr100865-9a.c: Likewise. > * gcc.target/i386/pr100865-9b.c: Likewise. > * gcc.target/i386/pr102021.c: Likewise. > * gcc.target/i386/pr90773-17.c: Likewise. > > Thanks in advance. > Roger > -- > > > -----Original Message----- > > From: Hongtao Liu <crazy...@gmail.com> > > Sent: 02 January 2024 05:40 > > To: Roger Sayle <ro...@nextmovesoftware.com> > > Cc: gcc-patches@gcc.gnu.org; Uros Bizjak <ubiz...@gmail.com> > > Subject: Re: [x86_64 PATCH] PR target/112992: Optimize mode for broadcast of > > constants. > > > > On Fri, Dec 22, 2023 at 6:25 PM Roger Sayle <ro...@nextmovesoftware.com> > > wrote: > > > > > > > > > This patch resolves the second part of PR target/112992, building upon > > > Hongtao Liu's solution to the first part. > > > > > > The issue addressed by this patch is that when initializing vectors by > > > broadcasting integer constants, the compiler has the flexibility to > > > select the most appropriate vector mode to perform the broadcast, as > > > long as the resulting vector has an identical bit pattern. For > > > example, the following constants are all equivalent: > > > V4SImode {0x01010101, 0x01010101, 0x01010101, 0x01010101 } V8HImode > > > {0x0101, 0x0101, 0x0101, 0x0101, 0x0101, 0x0101, 0x0101, 0x0101 } > > > V16QImode {0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, ... > > > 0x01 } So instruction sequences that construct any of these can be > > > used to construct the others (with a suitable cast/SUBREG). > > > > > > On x86_64, it turns out that broadcasts of SImode constants are > > > preferred, as DImode constants often require a longer movabs > > > instruction, and HImode and QImode broadcasts require multiple uops on > > > some > > architectures. > > > Hence, SImode is always the equal shortest/fastest implementation. > > > > > > Examples of this improvement, can be seen in the testsuite. > > > > > > gcc.target/i386/pr102021.c > > > Before: > > > 0: 48 b8 0c 00 0c 00 0c movabs $0xc000c000c000c,%rax > > > 7: 00 0c 00 > > > a: 62 f2 fd 28 7c c0 vpbroadcastq %rax,%ymm0 > > > 10: c3 retq > > > > > > After: > > > 0: b8 0c 00 0c 00 mov $0xc000c,%eax > > > 5: 62 f2 7d 28 7c c0 vpbroadcastd %eax,%ymm0 > > > b: c3 retq > > > > > > and > > > gcc.target/i386/pr90773-17.c: > > > Before: > > > 0: 48 8b 15 00 00 00 00 mov 0x0(%rip),%rdx # 7 <foo+0x7> > > > 7: b8 0c 00 00 00 mov $0xc,%eax > > > c: 62 f2 7d 08 7a c0 vpbroadcastb %eax,%xmm0 > > > 12: 62 f1 7f 08 7f 02 vmovdqu8 %xmm0,(%rdx) > > > 18: c7 42 0f 0c 0c 0c 0c movl $0xc0c0c0c,0xf(%rdx) > > > 1f: c3 retq > > > > > > After: > > > 0: 48 8b 15 00 00 00 00 mov 0x0(%rip),%rdx # 7 <foo+0x7> > > > 7: b8 0c 0c 0c 0c mov $0xc0c0c0c,%eax > > > c: 62 f2 7d 08 7c c0 vpbroadcastd %eax,%xmm0 > > > 12: 62 f1 7f 08 7f 02 vmovdqu8 %xmm0,(%rdx) > > > 18: c7 42 0f 0c 0c 0c 0c movl $0xc0c0c0c,0xf(%rdx) > > > 1f: c3 retq > > > > > > where according to Agner Fog's instruction tables broadcastd is > > > slightly faster on some microarchitectures, for example Knight's Landing. > > > > > > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap > > > and make -k check, both with and without --target_board=unix{-m32} > > > with no new failures. Ok for mainline? > > > > > > > > > 2023-12-21 Roger Sayle <ro...@nextmovesoftware.com> > > > > > > gcc/ChangeLog > > > PR target/112992 > > > * config/i386/i386-expand.cc > > > (ix86_convert_const_wide_int_to_broadcast): Allow call to > > > ix86_expand_vector_init_duplicate to fail, and return NULL_RTX. > > > (ix86_broadcast_from_constant): Revert recent change; Return a > > > suitable MEMREF independently of mode/target combinations. > > > (ix86_expand_vector_move): Allow ix86_expand_vector_init_duplicate > > > to decide whether expansion is possible/preferrable. Only try > > > forcing DImode constants to memory (and trying again) if calling > > > ix86_expand_vector_init_duplicate fails with an DImode immediate > > > constant. > > > (ix86_expand_vector_init_duplicate) <case E_V2DImode>: Try using > > > V4SImode for suitable immediate constants. > > > <case E_V4DImode>: Try using V8SImode for suitable constants. > > > <case E_V4SImode>: Use constant pool for AVX without AVX2. > > > <case E_V4HImode>: Fail for CONST_INT_P, i.e. use constant pool. > > > <case E_V2HImode>: Likewise. > > > <case E_V8HImode>: For CONST_INT_P try using V4SImode via widen. > > > <case E_V16QImode>: For CONT_INT_P try using V8HImode via widen. > > > <label widen>: Handle CONT_INTs via simplify_binary_operation. > > > Allow recursive calls to ix86_expand_vector_init_duplicate to > > > fail. > > > <case E_V16HImode>: For CONST_INT_P try V8SImode via widen. > > > <case E_V32QImode>: For CONST_INT_P try V16HImode via widen. > > > (ix86_expand_vector_init): Move try using a broadcast for all_same > > > with ix86_expand_vector_init_duplicate before using constant pool. > > > > > > gcc/testsuite/ChangeLog > > > * gcc.target/i386/avx512f-broadcast-pr87767-1.c: Update test case. > > > * gcc.target/i386/avx512f-broadcast-pr87767-5.c: Likewise. > > > * gcc.target/i386/avx512fp16-13.c: Likewise. > > > * gcc.target/i386/avx512vl-broadcast-pr87767-1.c: Likewise. > > > * gcc.target/i386/avx512vl-broadcast-pr87767-5.c: Likewise. > > > * gcc.target/i386/pr100865-10a.c: Likewise. > > > * gcc.target/i386/pr100865-10b.c: Likewise. > > > * gcc.target/i386/pr100865-11c.c: Likewise. > > > * gcc.target/i386/pr100865-12c.c: Likewise. > > > * gcc.target/i386/pr100865-2.c: Likewise. > > > * gcc.target/i386/pr100865-3.c: Likewise. > > > * gcc.target/i386/pr100865-4a.c: Likewise. > > > * gcc.target/i386/pr100865-4b.c: Likewise. > > > * gcc.target/i386/pr100865-5a.c: Likewise. > > > * gcc.target/i386/pr100865-5b.c: Likewise. > > > * gcc.target/i386/pr100865-9a.c: Likewise. > > > * gcc.target/i386/pr100865-9b.c: Likewise. > > > * gcc.target/i386/pr102021.c: Likewise. > > > * gcc.target/i386/pr90773-17.c: Likewise. > > > > > > > > > Thanks in advance, > > > Roger > > > -- > > > > > > > >+ case E_V2DImode: > > >+ if (CONST_INT_P (val)) > > >+ { > > >+ int tmp = (int)INTVAL (val); > > >+ if (tmp == (int)(INTVAL (val) >> 32)) > > >+ { > > >+ rtx reg = gen_reg_rtx (V4SImode); > > >+ ok = ix86_vector_duplicate_value (V4SImode, reg, > > >+ GEN_INT (tmp)); > > >+ if (ok) > > >+ { > > >+ emit_move_insn (target, gen_lowpart (V2DImode, reg)); > > >+ return true; > > >+ } > > >+ } > > >+ if (!TARGET_AVX) > > >+ return false; > > >+ if (!TARGET_AVX2) > > >+ val = force_const_mem (DImode, val); > > > > We can use pshufd/pshufss to broadcast for v4si and punpcklqdq for v2di, so > > I > > think there is no need to return false or force_const_mem here. > > > > >+ case E_V4SImode: > > >+ if (CONST_INT_P (val)) > > >+ { > > >+ if (!TARGET_AVX) > > >+ return false; > > >+ if (!TARGET_AVX2) > > >+ val = force_const_mem (SImode, val); > > > > Ditto. > > > > >+ } > > >+ return ix86_vector_duplicate_value (mode, target, val); > > > > > > > > > x = gen_reg_rtx (wvmode); > > > ok = ix86_expand_vector_init_duplicate (mmx_ok, wvmode, x, val); > > >- gcc_assert (ok); > > >+ if (!ok) > > >+ return false; > > Then we can still gcc_assert (ok) here. > > > emit_move_insn (target, gen_lowpart (GET_MODE (target), x)); > > >- return ok; > > >+ return true; > > > > > > >+ case E_V32QImode: > > >+ if (CONST_INT_P (val)) > > >+ goto widen; > > >+ /* FALLTHRU */ > > >+ > > > case E_V16HFmode: > > > case E_V16BFmode: > > >- case E_V32QImode: > > > if (TARGET_AVX2) > > > return ix86_vector_duplicate_value (mode, target, val); > > > else > > >@@ -15904,7 +15965,8 @@ ix86_expand_vector_init_duplicate (bool mmx_ok, > > machine_mode mode, > > > rtx x = gen_reg_rtx (hvmode); > > > > > > ok = ix86_expand_vector_init_duplicate (false, hvmode, x, val); > > >- gcc_assert (ok); > > >+ if (!ok) > > >+ return false; > > > > > > x = gen_rtx_VEC_CONCAT (mode, x, x); > > > emit_insn (gen_rtx_SET (target, x)); @@ -15941,7 +16003,8 @@ > > >ix86_expand_vector_init_duplicate (bool mmx_ok, machine_mode mode, > > > rtx x = gen_reg_rtx (hvmode); > > > > > > ok = ix86_expand_vector_init_duplicate (false, hvmode, x, val); > > >- gcc_assert (ok); > > >+ if (!ok) > > Assume it's always true for vec_dupv8si?(vec_dupv32qi widen to vec_dupv16hi > > widen to vec_dupv8si. > > >+ return false; > > > > -- > > BR, > > Hongtao
-- BR, Hongtao