On Fri, Dec 22, 2023 at 6:25 PM Roger Sayle <ro...@nextmovesoftware.com> wrote: > > > This patch resolves the second part of PR target/112992, building upon > Hongtao Liu's solution to the first part. > > The issue addressed by this patch is that when initializing vectors by > broadcasting integer constants, the compiler has the flexibility to > select the most appropriate vector mode to perform the broadcast, as > long as the resulting vector has an identical bit pattern. For > example, the following constants are all equivalent: > V4SImode {0x01010101, 0x01010101, 0x01010101, 0x01010101 } > V8HImode {0x0101, 0x0101, 0x0101, 0x0101, 0x0101, 0x0101, 0x0101, 0x0101 } > V16QImode {0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, ... 0x01 } > So instruction sequences that construct any of these can be used to > construct the others (with a suitable cast/SUBREG). > > On x86_64, it turns out that broadcasts of SImode constants are preferred, > as DImode constants often require a longer movabs instruction, and > HImode and QImode broadcasts require multiple uops on some architectures. > Hence, SImode is always the equal shortest/fastest implementation. > > Examples of this improvement, can be seen in the testsuite. > > gcc.target/i386/pr102021.c > Before: > 0: 48 b8 0c 00 0c 00 0c movabs $0xc000c000c000c,%rax > 7: 00 0c 00 > a: 62 f2 fd 28 7c c0 vpbroadcastq %rax,%ymm0 > 10: c3 retq > > After: > 0: b8 0c 00 0c 00 mov $0xc000c,%eax > 5: 62 f2 7d 28 7c c0 vpbroadcastd %eax,%ymm0 > b: c3 retq > > and > gcc.target/i386/pr90773-17.c: > Before: > 0: 48 8b 15 00 00 00 00 mov 0x0(%rip),%rdx # 7 <foo+0x7> > 7: b8 0c 00 00 00 mov $0xc,%eax > c: 62 f2 7d 08 7a c0 vpbroadcastb %eax,%xmm0 > 12: 62 f1 7f 08 7f 02 vmovdqu8 %xmm0,(%rdx) > 18: c7 42 0f 0c 0c 0c 0c movl $0xc0c0c0c,0xf(%rdx) > 1f: c3 retq > > After: > 0: 48 8b 15 00 00 00 00 mov 0x0(%rip),%rdx # 7 <foo+0x7> > 7: b8 0c 0c 0c 0c mov $0xc0c0c0c,%eax > c: 62 f2 7d 08 7c c0 vpbroadcastd %eax,%xmm0 > 12: 62 f1 7f 08 7f 02 vmovdqu8 %xmm0,(%rdx) > 18: c7 42 0f 0c 0c 0c 0c movl $0xc0c0c0c,0xf(%rdx) > 1f: c3 retq > > where according to Agner Fog's instruction tables broadcastd is slightly > faster on some microarchitectures, for example Knight's Landing. > > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap > and make -k check, both with and without --target_board=unix{-m32} > with no new failures. Ok for mainline? > > > 2023-12-21 Roger Sayle <ro...@nextmovesoftware.com> > > gcc/ChangeLog > PR target/112992 > * config/i386/i386-expand.cc > (ix86_convert_const_wide_int_to_broadcast): Allow call to > ix86_expand_vector_init_duplicate to fail, and return NULL_RTX. > (ix86_broadcast_from_constant): Revert recent change; Return a > suitable MEMREF independently of mode/target combinations. > (ix86_expand_vector_move): Allow ix86_expand_vector_init_duplicate > to decide whether expansion is possible/preferrable. Only try > forcing DImode constants to memory (and trying again) if calling > ix86_expand_vector_init_duplicate fails with an DImode immediate > constant. > (ix86_expand_vector_init_duplicate) <case E_V2DImode>: Try using > V4SImode for suitable immediate constants. > <case E_V4DImode>: Try using V8SImode for suitable constants. > <case E_V4SImode>: Use constant pool for AVX without AVX2. > <case E_V4HImode>: Fail for CONST_INT_P, i.e. use constant pool. > <case E_V2HImode>: Likewise. > <case E_V8HImode>: For CONST_INT_P try using V4SImode via widen. > <case E_V16QImode>: For CONT_INT_P try using V8HImode via widen. > <label widen>: Handle CONT_INTs via simplify_binary_operation. > Allow recursive calls to ix86_expand_vector_init_duplicate to fail. > <case E_V16HImode>: For CONST_INT_P try V8SImode via widen. > <case E_V32QImode>: For CONST_INT_P try V16HImode via widen. > (ix86_expand_vector_init): Move try using a broadcast for all_same > with ix86_expand_vector_init_duplicate before using constant pool. > > gcc/testsuite/ChangeLog > * gcc.target/i386/avx512f-broadcast-pr87767-1.c: Update test case. > * gcc.target/i386/avx512f-broadcast-pr87767-5.c: Likewise. > * gcc.target/i386/avx512fp16-13.c: Likewise. > * gcc.target/i386/avx512vl-broadcast-pr87767-1.c: Likewise. > * gcc.target/i386/avx512vl-broadcast-pr87767-5.c: Likewise. > * gcc.target/i386/pr100865-10a.c: Likewise. > * gcc.target/i386/pr100865-10b.c: Likewise. > * gcc.target/i386/pr100865-11c.c: Likewise. > * gcc.target/i386/pr100865-12c.c: Likewise. > * gcc.target/i386/pr100865-2.c: Likewise. > * gcc.target/i386/pr100865-3.c: Likewise. > * gcc.target/i386/pr100865-4a.c: Likewise. > * gcc.target/i386/pr100865-4b.c: Likewise. > * gcc.target/i386/pr100865-5a.c: Likewise. > * gcc.target/i386/pr100865-5b.c: Likewise. > * gcc.target/i386/pr100865-9a.c: Likewise. > * gcc.target/i386/pr100865-9b.c: Likewise. > * gcc.target/i386/pr102021.c: Likewise. > * gcc.target/i386/pr90773-17.c: Likewise. > > > Thanks in advance, > Roger > -- >
>+ case E_V2DImode: >+ if (CONST_INT_P (val)) >+ { >+ int tmp = (int)INTVAL (val); >+ if (tmp == (int)(INTVAL (val) >> 32)) >+ { >+ rtx reg = gen_reg_rtx (V4SImode); >+ ok = ix86_vector_duplicate_value (V4SImode, reg, >+ GEN_INT (tmp)); >+ if (ok) >+ { >+ emit_move_insn (target, gen_lowpart (V2DImode, reg)); >+ return true; >+ } >+ } >+ if (!TARGET_AVX) >+ return false; >+ if (!TARGET_AVX2) >+ val = force_const_mem (DImode, val); We can use pshufd/pshufss to broadcast for v4si and punpcklqdq for v2di, so I think there is no need to return false or force_const_mem here. >+ case E_V4SImode: >+ if (CONST_INT_P (val)) >+ { >+ if (!TARGET_AVX) >+ return false; >+ if (!TARGET_AVX2) >+ val = force_const_mem (SImode, val); Ditto. >+ } >+ return ix86_vector_duplicate_value (mode, target, val); > x = gen_reg_rtx (wvmode); > ok = ix86_expand_vector_init_duplicate (mmx_ok, wvmode, x, val); >- gcc_assert (ok); >+ if (!ok) >+ return false; Then we can still gcc_assert (ok) here. > emit_move_insn (target, gen_lowpart (GET_MODE (target), x)); >- return ok; >+ return true; >+ case E_V32QImode: >+ if (CONST_INT_P (val)) >+ goto widen; >+ /* FALLTHRU */ >+ > case E_V16HFmode: > case E_V16BFmode: >- case E_V32QImode: > if (TARGET_AVX2) > return ix86_vector_duplicate_value (mode, target, val); > else >@@ -15904,7 +15965,8 @@ ix86_expand_vector_init_duplicate (bool mmx_ok, >machine_mode mode, > rtx x = gen_reg_rtx (hvmode); > > ok = ix86_expand_vector_init_duplicate (false, hvmode, x, val); >- gcc_assert (ok); >+ if (!ok) >+ return false; > > x = gen_rtx_VEC_CONCAT (mode, x, x); > emit_insn (gen_rtx_SET (target, x)); >@@ -15941,7 +16003,8 @@ ix86_expand_vector_init_duplicate (bool mmx_ok, >machine_mode mode, > rtx x = gen_reg_rtx (hvmode); > > ok = ix86_expand_vector_init_duplicate (false, hvmode, x, val); >- gcc_assert (ok); >+ if (!ok) Assume it's always true for vec_dupv8si?(vec_dupv32qi widen to vec_dupv16hi widen to vec_dupv8si. >+ return false; -- BR, Hongtao