On Fri, Jun 3, 2022 at 12:08 PM Uros Bizjak <ubiz...@gmail.com> wrote: > > On Fri, Jun 3, 2022 at 11:49 AM Roger Sayle <ro...@nextmovesoftware.com> > wrote: > > > > > > Technically, PR target/91681 has already been resolved; we now recognize the > > highpart multiplication at the tree-level, we no longer use the stack, and > > we currently generate the same number of instructions as LLVM. However, it > > is still possible to do better, the current x86_64 code to generate a double > > word addition of a zero extended operand, looks like: > > > > xorl %r11d, %r11d > > addq %r10, %rax > > adcq %r11, %rdx > > > > when it's possible (as LLVM does) to use an immediate constant: > > > > addq %r10, %rax > > adcq $0, %rdx > > > > To do this, the backend required one or two simple changes, that > > then themselves required one or two more obscure tweaks. > > > > The simple starting point is to define a zero_extendditi2 pattern, > > for zero extension from DImode to TImode on TARGET_64BIT that is > > split after reload. Double word (TImode) addition/subtraction is > > split after reload, so that constrains when things should happen. > > > > With zero extension now visible to combine, we add two new > > define_insn_and_split that add/subtract a zero extended operand > > in double word mode. These apply to both 32-bit and 64-bit code > > generation, to produce adc $0 and sbb $0. > > > > The first strange tweak is that these new patterns interfere with > > the optimization that recognizes DW:DI = (HI:SI<<32)+LO:SI as a pair > > of register moves, or more accurately the combine splitter no longer > > triggers as we're now converting two instructions into two instructions > > (not three instructions into two instructions). This is easily > > repaired (and extended to handle TImode) by changing from a pair > > of define_split (that handle operand commutativity) to a set of > > four define_insn_and_split (again to handle operand commutativity). > > > > The other/final strange tweak that the above splitters now interfere > > with AVX512's kunpckdq instruction which is defined as identical RTL, > > DW:DI = (HI:SI<<32)|zero_extend(LO:SI). To distinguish this, and > > also avoid AVX512 mask registers being used by reload to perform > > SImode scalar shifts, I've added the explicit (unspec UNSPEC_MASKOP) > > to the unpack mask operations, which matches what sse.md does for > > the other mask specific (logic) operations. > > > > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap > > and make -k check, both with and without --target_board=unix{-m32}, > > with no new failures. Ok for mainline? > > > > > > 2022-06-03 Roger Sayle <ro...@nextmovesoftware.com> > > > > gcc/ChangeLog > > PR target/91681 > > * config/i386/i386.md (zero_extendditi2): New define_insn_and_split. > > (*add<dwi>3_doubleword_zext): New define_insn_and_split. > > (*sub<dwi>3_doubleword_zext): New define_insn_and_split. > > (*concat<mode><dwi>3_1): New define_insn_and_split replacing > > previous define_split for implementing DST = (HI<<32)|LO as > > pair of move instructions, setting lopart and hipart. > > (*concat<mode><dwi>3_2): Likewise. > > (*concat<mode><dwi>3_3): Likewise, where HI is zero_extended. > > (*concat<mode><dwi>3_4): Likewise, where HI is zero_extended. > > * config/i386/sse.md (kunpckhi): Add UNSPEC_MASKOP unspec. > > (kunpcksi): Likewise, add UNSPEC_MASKOP unspec. > > (kunpckdi): Likewise, add UNSPEC_MASKOP unspec. > > (vec_pack_trunc_qi): Update to specify required UNSPEC_MASKOP > > unspec. > > (vec_pack_trunc_<mode>): Likewise. > > > > gcc/testsuite/ChangeLog > > PR target/91681 > > * g++.target/i386/pr91681.C: New test case (from the PR). > > * gcc.target/i386/pr91681-1.c: New int128 test case. > > * gcc.target/i386/pr91681-2.c: Likewise. > > * gcc.target/i386/pr91681-3.c: Likewise, but for ia32.
+(define_insn_and_split "*concat<mode><dwi>3_1" + [(set (match_operand:<DWI> 0 "register_operand" "=r") + (any_or_plus:<DWI> + (ashift:<DWI> (match_operand:<DWI> 1 "register_operand" "r") + (match_operand:<DWI> 2 "const_int_operand" "n")) You can remove "n" when we deal with non-commutative (without %) operands. + (zero_extend:<DWI> (match_operand:DWIH 3 "register_operand" "r"))))] + "INTVAL (operands[2]) == <MODE_SIZE> * BITS_PER_UNIT + && ix86_pre_reload_split ()" + "#" + "&& 1" + [(set (match_dup 4) (match_dup 3)) + (set (match_dup 5) (match_dup 6))] { - operands[3] = gen_highpart (SImode, operands[0]); - operands[4] = gen_lowpart (SImode, operands[1]); - operands[5] = gen_lowpart (SImode, operands[0]); + operands[4] = gen_lowpart (<MODE>mode, operands[0]); + operands[5] = gen_highpart (<MODE>mode, operands[0]); + operands[6] = gen_lowpart (<MODE>mode, operands[1]); Pre-reload splitters need to force_reg their register_operands. You can have SUBREG here and gen_lowpar/gen_higpart will fail on SUBREG. Uros.