On Fri, Jul 1, 2022 at 10:12 AM Hongtao Liu <crazy...@gmail.com> wrote:
>
> On Fri, Jul 1, 2022 at 2:42 AM Roger Sayle <ro...@nextmovesoftware.com> wrote:
> >
> >
> > This patch is a follow-up to Hongtao's fix for PR target/105854.  That
> > fix is perfectly correct, but the thing that caught my eye was why the
> > compiler is generating a shift by zero at all.  Digging deeper it
> > turns out that we can easily optimize __builtin_ia32_palignr for
> > alignments of 0 and 64, which may be simplified to moves from the
> > lowpart or highpart respectively.
> >
> > After adding optimizations to simplify the 64-bit DImode palignr,
> > I started to add the corresponding optimizations for vpalignr (i.e.
> > 128-bit).  The first oddity is that sse.md uses TImode and a special
> > SSESCALARMODE iterator, rather than V1TImode, and indeed the comment
> > above SSESCALARMODE hints that this should be "dropped in favor of
> > VIMAX_AVX2_AVX512BW".  Hence this patch includes the migration of
> > <ssse3_avx2>_palignr<mode> to use VIMAX_AVX2_AVX512BW, basically
> > using V1TImode instead of TImode for 128-bit palignr.
> >
> > But it was only after I'd implemented this clean-up that I stumbled
> > across the strange semantics of 128-bit [v]palignr.  According to
> > https://www.felixcloutier.com/x86/palignr, the semantics are subtly
> > different based upon how the instruction is encoded.  PALIGNR leaves
> > the highpart unmodified, whilst VEX.128 encoded VPALIGNR clears the
> > highpart, and (unless I'm mistaken) it looks like GCC currently uses
> > the exact same RTL/templates for both, treating one as an alternative
> > for the other.
> I think as long as patterns or intrinsics only care about the low
> part, they should be ok.
> But if we want to rely on the default behavior for the upper bits, we
> need to restrict them to a specific isa (i.e. vmovq in vec_set<mode>_0).
> Generally, legacy 128-bit SSE instructions behave differently from the
> AVX ones for the upper bits, and that's why vzeroupper was introduced
> for sse <-> avx instruction transitions.
> >
> > Hence I thought I'd post what I have so far (part optimization and
> > part clean-up), to then ask the x86 experts for their opinions.
> >
> > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> > and make -k check, both with and without --target_board=unix{-m32},
> > with no new failures.  Ok for mainline?
> >
> >
> > 2022-06-30  Roger Sayle  <ro...@nextmovesoftware.com>
> >
> > gcc/ChangeLog
> >         * config/i386/i386-builtin.def (__builtin_ia32_palignr128): Change
> >         CODE_FOR_ssse3_palignrti to CODE_FOR_ssse3_palignrv1ti.
> >         * config/i386/i386-expand.cc (expand_vec_perm_palignr): Use V1TImode
> >         and gen_ssse3_palignrv1ti instead of TImode.
> >         * config/i386/sse.md (SSESCALARMODE): Delete.
> >         (define_mode_attr ssse3_avx2): Handle V1TImode instead of TImode.
> >         (<ssse3_avx2>_palignr<mode>): Use VIMAX_AVX2_AVX512BW as a mode
> >         iterator instead of SSESCALARMODE.
> >
> >         (ssse3_palignrdi): Optimize cases when operands[3] is 0 or 64,
> >         using a single move instruction (if required).
> >         (define_split): Likewise split UNSPEC_PALIGNR $0 into a move.
> >         (define_split): Likewise split UNSPEC_PALIGNR $64 into a move.
> >
> > gcc/testsuite/ChangeLog
> >         * gcc.target/i386/ssse3-palignr-2.c: New test case.
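To make the 0/64 simplification concrete, here is a minimal sketch of
the kind of code that now collapses into a plain move (this uses the
standard _mm_alignr_pi8 wrapper from tmmintrin.h, whose byte count is
forwarded to __builtin_ia32_palignr as a bit count; the actual
ssse3-palignr-2.c testcase may of course differ):

  #include <tmmintrin.h>

  /* [a:b] shifted right by 0 bytes: the result is just the low
     operand, i.e. a move from b.  */
  __m64 lo (__m64 a, __m64 b) { return _mm_alignr_pi8 (a, b, 0); }

  /* [a:b] shifted right by 8 bytes (64 bits): the result is the
     high operand, i.e. a move from a.  */
  __m64 hi (__m64 a, __m64 b) { return _mm_alignr_pi8 (a, b, 8); }

With the patch, each of these should compile to a single register move
(or to no instruction at all when source and destination coincide).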
> >
> >
> > Thanks in advance,
> > Roger
> > --
> >
>
> +(define_split
> +  [(set (match_operand:DI 0 "register_operand")
> +        (unspec:DI [(match_operand:DI 1 "register_operand")
> +                    (match_operand:DI 2 "register_mmxmem_operand")
> +                    (const_int 0)]
> +                   UNSPEC_PALIGNR))]
> +  ""
> +  [(set (match_dup 0) (match_dup 2))])
> +
> +(define_split
> +  [(set (match_operand:DI 0 "register_operand")
> +        (unspec:DI [(match_operand:DI 1 "register_operand")
> +                    (match_operand:DI 2 "register_mmxmem_operand")
> +                    (const_int 64)]
> +                   UNSPEC_PALIGNR))]
> +  ""
> +  [(set (match_dup 0) (match_dup 1))])
> +
> A define_split is assumed to split into 2 (or more) insns, hence
> pass_combine will only try a define_split if the number of merged
> insns is greater than 2.
> For palignr, I think most of the time there would be only 2 merged
> insns (constant propagation), so it would be better to change them
> into pre_reload splitters
> (i.e. define_insn_and_split "*avx512bw_permvar_truncv16siv16hi_1").
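For reference, a pre_reload splitter has roughly the following shape
(just a schematic sketch with a placeholder name and simplified
operands, not an actual pattern from the port):

  ;; Matches as an insn first (output template "#"), so it is not
  ;; limited to combine's define_split path; the "&& 1" split
  ;; condition fires as soon as a split pass runs.
  (define_insn_and_split "*palignr_zero_sketch"
    [(set (match_operand:DI 0 "register_operand")
          (unspec:DI [(match_operand:DI 1 "register_operand")
                      (match_operand:DI 2 "register_operand")
                      (const_int 0)]
                     UNSPEC_PALIGNR))]
    "TARGET_SSSE3 && ix86_pre_reload_split ()"
    "#"
    "&& 1"
    [(set (match_dup 0) (match_dup 2))])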
- "TARGET_SSSE3 && reload_completed - && SSE_REGNO_P (REGNO (operands[0]))" + "(TARGET_SSSE3 && reload_completed + && SSE_REGNO_P (REGNO (operands[0]))) + || INVAL(operands[3]) == 0 + || INVAL(operands[3]) == 64" and you have already handled them by + if (operands[3] == const0_rtx) + { + if (!rtx_equal_p (operands[0], operands[2])) + emit_move_insn (operands[0], operands[2]); + else + emit_note (NOTE_INSN_DELETED); + DONE; + } + else if (INTVAL (operands[3]) == 64) + { + if (!rtx_equal_p (operands[0], operands[1])) + emit_move_insn (operands[0], operands[1]); + else + emit_note (NOTE_INSN_DELETED); + DONE; + } + > > > -- > BR, > Hongtao -- BR, Hongtao