Hi Uros, Many thanks for your speedy review. This revised patch implements all three of your recommended improvements; the use of ix86_binary_operator_ok with code UNKNOWN, the removal of "n" constraints from const_int_operand predicates, and the use of force_reg (for input operands, and REG_P for output operands) to ensure that it's always safe to call gen_lowpart/gen_highpart.
[If we proceed with the recent proposal to split double word addition, subtraction and other operations before reload, then these new add/sub variants should be updated at the same time, but for now this patch keeps double word patterns consistent]. This revised patch has been retested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline? 2022-06-05 Roger Sayle <ro...@nextmovesoftware.com> Uroš Bizjak <ubiz...@gmail.com> gcc/ChangeLog PR target/91681 * config/i386/i386.md (zero_extendditi2): New define_insn_and_split. (*add<dwi>3_doubleword_zext): New define_insn_and_split. (*sub<dwi>3_doubleword_zext): New define_insn_and_split. (*concat<mode><dwi>3_1): New define_insn_and_split replacing previous define_split for implementing DST = (HI<<32)|LO as pair of move instructions, setting lopart and hipart. (*concat<mode><dwi>3_2): Likewise. (*concat<mode><dwi>3_3): Likewise, where HI is zero_extended. (*concat<mode><dwi>3_4): Likewise, where HI is zero_extended. * config/i386/sse.md (kunpckhi): Add UNSPEC_MASKOP unspec. (kunpcksi): Likewise, add UNSPEC_MASKOP unspec. (kunpckdi): Likewise, add UNSPEC_MASKOP unspec. (vec_pack_trunc_qi): Update to specify required UNSPEC_MASKOP unspec. (vec_pack_trunc_<mode>): Likewise. gcc/testsuite/ChangeLog PR target/91681 * g++.target/i386/pr91681.C: New test case (from the PR). * gcc.target/i386/pr91681-1.c: New int128 test case. * gcc.target/i386/pr91681-2.c: Likewise. * gcc.target/i386/pr91681-3.c: Likewise, but for ia32. Thanks again, Roger -- > -----Original Message----- > From: Uros Bizjak <ubiz...@gmail.com> > Sent: 03 June 2022 11:08 > To: Roger Sayle <ro...@nextmovesoftware.com> > Cc: GCC Patches <gcc-patches@gcc.gnu.org> > Subject: Re: [x86 PATCH] PR target/91681: zero_extendditi2 pattern for more > optimizations. > > On Fri, Jun 3, 2022 at 11:49 AM Roger Sayle <ro...@nextmovesoftware.com> > wrote: > > > > > > Technically, PR target/91681 has already been resolved; we now > > recognize the highpart multiplication at the tree-level, we no longer > > use the stack, and we currently generate the same number of > > instructions as LLVM. However, it is still possible to do better, the > > current x86_64 code to generate a double word addition of a zero extended > operand, looks like: > > > > xorl %r11d, %r11d > > addq %r10, %rax > > adcq %r11, %rdx > > > > when it's possible (as LLVM does) to use an immediate constant: > > > > addq %r10, %rax > > adcq $0, %rdx > > > > To do this, the backend required one or two simple changes, that then > > themselves required one or two more obscure tweaks. > > > > The simple starting point is to define a zero_extendditi2 pattern, for > > zero extension from DImode to TImode on TARGET_64BIT that is split > > after reload. Double word (TImode) addition/subtraction is split > > after reload, so that constrains when things should happen. > > > > With zero extension now visible to combine, we add two new > > define_insn_and_split that add/subtract a zero extended operand in > > double word mode. These apply to both 32-bit and 64-bit code > > generation, to produce adc $0 and sbb $0. > > > > The first strange tweak is that these new patterns interfere with the > > optimization that recognizes DW:DI = (HI:SI<<32)+LO:SI as a pair of > > register moves, or more accurately the combine splitter no longer > > triggers as we're now converting two instructions into two > > instructions (not three instructions into two instructions). This is > > easily repaired (and extended to handle TImode) by changing from a > > pair of define_split (that handle operand commutativity) to a set of > > four define_insn_and_split (again to handle operand commutativity). > > > > The other/final strange tweak that the above splitters now interfere > > with AVX512's kunpckdq instruction which is defined as identical RTL, > > DW:DI = (HI:SI<<32)|zero_extend(LO:SI). To distinguish this, and also > > avoid AVX512 mask registers being used by reload to perform SImode > > scalar shifts, I've added the explicit (unspec UNSPEC_MASKOP) to the > > unpack mask operations, which matches what sse.md does for the other > > mask specific (logic) operations. > > > > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap > > and make -k check, both with and without --target_board=unix{-m32}, > > with no new failures. Ok for mainline? > > > > > > 2022-06-03 Roger Sayle <ro...@nextmovesoftware.com> > > > > gcc/ChangeLog > > PR target/91681 > > * config/i386/i386.md (zero_extendditi2): New define_insn_and_split. > > (*add<dwi>3_doubleword_zext): New define_insn_and_split. > > (*sub<dwi>3_doubleword_zext): New define_insn_and_split. > > (*concat<mode><dwi>3_1): New define_insn_and_split replacing > > previous define_split for implementing DST = (HI<<32)|LO as > > pair of move instructions, setting lopart and hipart. > > (*concat<mode><dwi>3_2): Likewise. > > (*concat<mode><dwi>3_3): Likewise, where HI is zero_extended. > > (*concat<mode><dwi>3_4): Likewise, where HI is zero_extended. > > * config/i386/sse.md (kunpckhi): Add UNSPEC_MASKOP unspec. > > (kunpcksi): Likewise, add UNSPEC_MASKOP unspec. > > (kunpckdi): Likewise, add UNSPEC_MASKOP unspec. > > (vec_pack_trunc_qi): Update to specify required UNSPEC_MASKOP > > unspec. > > (vec_pack_trunc_<mode>): Likewise. > > > > gcc/testsuite/ChangeLog > > PR target/91681 > > * g++.target/i386/pr91681.C: New test case (from the PR). > > * gcc.target/i386/pr91681-1.c: New int128 test case. > > * gcc.target/i386/pr91681-2.c: Likewise. > > * gcc.target/i386/pr91681-3.c: Likewise, but for ia32. > > + "MEM_P (operands[0]) ? rtx_equal_p (operands[0], operands[1]) && > + !MEM_P (operands[2]) > + : !MEM_P (operands[1])" > > Can we use ix86_binary_operator_ok (UNKNOWN, ...mode..., operands) here > instead? > > (UNKNOWN RTX code is used to prevent unwanted optimization with > commutative operands). > > Uros.
diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md index 602dfa7..b3d2c90 100644 --- a/gcc/config/i386/i386.md +++ b/gcc/config/i386/i386.md @@ -4325,6 +4325,16 @@ (set_attr "type" "imovx,mskmov,mskmov") (set_attr "mode" "SI,QI,QI")]) +(define_insn_and_split "zero_extendditi2" + [(set (match_operand:TI 0 "nonimmediate_operand" "=r,o") + (zero_extend:TI (match_operand:DI 1 "nonimmediate_operand" "rm,r")))] + "TARGET_64BIT" + "#" + "&& reload_completed" + [(set (match_dup 3) (match_dup 1)) + (set (match_dup 4) (const_int 0))] + "split_double_mode (TImode, &operands[0], 1, &operands[3], &operands[4]);") + ;; Transform xorl; mov[bw] (set strict_low_part) into movz[bw]l. (define_peephole2 [(parallel [(set (match_operand:SWI48 0 "general_reg_operand") @@ -6453,6 +6463,31 @@ [(set_attr "type" "alu") (set_attr "mode" "QI")]) +(define_insn_and_split "*add<dwi>3_doubleword_zext" + [(set (match_operand:<DWI> 0 "nonimmediate_operand" "=r,o") + (plus:<DWI> + (zero_extend:<DWI> + (match_operand:DWIH 2 "nonimmediate_operand" "rm,r")) + (match_operand:<DWI> 1 "nonimmediate_operand" "0,0"))) + (clobber (reg:CC FLAGS_REG))] + "ix86_binary_operator_ok (UNKNOWN, <DWI>mode, operands)" + "#" + "&& reload_completed" + [(parallel [(set (reg:CCC FLAGS_REG) + (compare:CCC + (plus:DWIH (match_dup 1) (match_dup 2)) + (match_dup 1))) + (set (match_dup 0) + (plus:DWIH (match_dup 1) (match_dup 2)))]) + (parallel [(set (match_dup 3) + (plus:DWIH + (plus:DWIH + (ltu:DWIH (reg:CC FLAGS_REG) (const_int 0)) + (match_dup 4)) + (const_int 0))) + (clobber (reg:CC FLAGS_REG))])] + "split_double_mode (<DWI>mode, &operands[0], 2, &operands[0], &operands[3]);") + ;; Like DWI, but use POImode instead of OImode. (define_mode_attr DPWI [(QI "HI") (HI "SI") (SI "DI") (DI "TI") (TI "POI")]) @@ -6903,6 +6938,29 @@ } }) +(define_insn_and_split "*sub<dwi>3_doubleword_zext" + [(set (match_operand:<DWI> 0 "nonimmediate_operand" "=r,o") + (minus:<DWI> + (match_operand:<DWI> 1 "nonimmediate_operand" "0,0") + (zero_extend:<DWI> + (match_operand:DWIH 2 "nonimmediate_operand" "rm,r")))) + (clobber (reg:CC FLAGS_REG))] + "ix86_binary_operator_ok (UNKNOWN, <DWI>mode, operands)" + "#" + "&& reload_completed" + [(parallel [(set (reg:CC FLAGS_REG) + (compare:CC (match_dup 1) (match_dup 2))) + (set (match_dup 0) + (minus:DWIH (match_dup 1) (match_dup 2)))]) + (parallel [(set (match_dup 3) + (minus:DWIH + (minus:DWIH + (match_dup 4) + (ltu:DWIH (reg:CC FLAGS_REG) (const_int 0))) + (const_int 0))) + (clobber (reg:CC FLAGS_REG))])] + "split_double_mode (<DWI>mode, &operands[0], 2, &operands[0], &operands[3]);") + (define_insn "*sub<mode>_1" [(set (match_operand:SWI 0 "nonimmediate_operand" "=<r>m,<r>") (minus:SWI @@ -10956,34 +11014,82 @@ ;; Split DST = (HI<<32)|LO early to minimize register usage. (define_code_iterator any_or_plus [plus ior xor]) -(define_split - [(set (match_operand:DI 0 "register_operand") - (any_or_plus:DI - (ashift:DI (match_operand:DI 1 "register_operand") - (const_int 32)) - (zero_extend:DI (match_operand:SI 2 "register_operand"))))] - "!TARGET_64BIT" - [(set (match_dup 3) (match_dup 4)) - (set (match_dup 5) (match_dup 2))] +(define_insn_and_split "*concat<mode><dwi>3_1" + [(set (match_operand:<DWI> 0 "register_operand" "=r") + (any_or_plus:<DWI> + (ashift:<DWI> (match_operand:<DWI> 1 "register_operand" "r") + (match_operand:<DWI> 2 "const_int_operand")) + (zero_extend:<DWI> (match_operand:DWIH 3 "register_operand" "r"))))] + "INTVAL (operands[2]) == <MODE_SIZE> * BITS_PER_UNIT + && REG_P (operands[0]) + && ix86_pre_reload_split ()" + "#" + "&& 1" + [(set (match_dup 4) (match_dup 3)) + (set (match_dup 5) (match_dup 6))] +{ + operands[1] = force_reg (<DWI>mode, operands[1]); + operands[4] = gen_lowpart (<MODE>mode, operands[0]); + operands[5] = gen_highpart (<MODE>mode, operands[0]); + operands[6] = gen_lowpart (<MODE>mode, operands[1]); +}) + +(define_insn_and_split "*concat<mode><dwi>3_2" + [(set (match_operand:<DWI> 0 "register_operand" "=r") + (any_or_plus:<DWI> + (zero_extend:<DWI> (match_operand:DWIH 1 "register_operand" "r")) + (ashift:<DWI> (match_operand:<DWI> 2 "register_operand" "r") + (match_operand:<DWI> 3 "const_int_operand"))))] + "INTVAL (operands[3]) == <MODE_SIZE> * BITS_PER_UNIT + && REG_P (operands[0]) + && ix86_pre_reload_split ()" + "#" + "&& 1" + [(set (match_dup 4) (match_dup 1)) + (set (match_dup 5) (match_dup 6))] +{ + operands[2] = force_reg (<DWI>mode, operands[2]); + operands[4] = gen_lowpart (<MODE>mode, operands[0]); + operands[5] = gen_highpart (<MODE>mode, operands[0]); + operands[6] = gen_lowpart (<MODE>mode, operands[2]); +}) + +(define_insn_and_split "*concat<mode><dwi>3_3" + [(set (match_operand:<DWI> 0 "register_operand" "=r") + (any_or_plus:<DWI> + (ashift:<DWI> + (zero_extend:<DWI> (match_operand:DWIH 1 "register_operand" "r")) + (match_operand:<DWI> 2 "const_int_operand")) + (zero_extend:<DWI> (match_operand:DWIH 3 "register_operand" "r"))))] + "INTVAL (operands[2]) == <MODE_SIZE> * BITS_PER_UNIT + && REG_P (operands[0]) + && ix86_pre_reload_split ()" + "#" + "&& 1" + [(set (match_dup 4) (match_dup 3)) + (set (match_dup 5) (match_dup 1))] { - operands[3] = gen_highpart (SImode, operands[0]); - operands[4] = gen_lowpart (SImode, operands[1]); - operands[5] = gen_lowpart (SImode, operands[0]); + operands[4] = gen_lowpart (<MODE>mode, operands[0]); + operands[5] = gen_highpart (<MODE>mode, operands[0]); }) -(define_split - [(set (match_operand:DI 0 "register_operand") - (any_or_plus:DI - (zero_extend:DI (match_operand:SI 1 "register_operand")) - (ashift:DI (match_operand:DI 2 "register_operand") - (const_int 32))))] - "!TARGET_64BIT" - [(set (match_dup 3) (match_dup 4)) - (set (match_dup 5) (match_dup 1))] +(define_insn_and_split "*concat<mode><dwi>3_4" + [(set (match_operand:<DWI> 0 "register_operand" "=r") + (any_or_plus:<DWI> + (zero_extend:<DWI> (match_operand:DWIH 1 "register_operand" "r")) + (ashift:<DWI> + (zero_extend:<DWI> (match_operand:DWIH 2 "register_operand" "r")) + (match_operand:<DWI> 3 "const_int_operand"))))] + "INTVAL (operands[3]) == <MODE_SIZE> * BITS_PER_UNIT + && REG_P (operands[0]) + && ix86_pre_reload_split ()" + "#" + "&& 1" + [(set (match_dup 4) (match_dup 1)) + (set (match_dup 5) (match_dup 2))] { - operands[3] = gen_highpart (SImode, operands[0]); - operands[4] = gen_lowpart (SImode, operands[2]); - operands[5] = gen_lowpart (SImode, operands[0]); + operands[4] = gen_lowpart (<MODE>mode, operands[0]); + operands[5] = gen_highpart (<MODE>mode, operands[0]); }) ;; Negation instructions diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md index 8b2602b..0198156 100644 --- a/gcc/config/i386/sse.md +++ b/gcc/config/i386/sse.md @@ -2070,7 +2070,8 @@ (ashift:HI (zero_extend:HI (match_operand:QI 1 "register_operand" "k")) (const_int 8)) - (zero_extend:HI (match_operand:QI 2 "register_operand" "k"))))] + (zero_extend:HI (match_operand:QI 2 "register_operand" "k")))) + (unspec [(const_int 0)] UNSPEC_MASKOP)] "TARGET_AVX512F" "kunpckbw\t{%2, %1, %0|%0, %1, %2}" [(set_attr "mode" "HI") @@ -2083,7 +2084,8 @@ (ashift:SI (zero_extend:SI (match_operand:HI 1 "register_operand" "k")) (const_int 16)) - (zero_extend:SI (match_operand:HI 2 "register_operand" "k"))))] + (zero_extend:SI (match_operand:HI 2 "register_operand" "k")))) + (unspec [(const_int 0)] UNSPEC_MASKOP)] "TARGET_AVX512BW" "kunpckwd\t{%2, %1, %0|%0, %1, %2}" [(set_attr "mode" "SI")]) @@ -2094,7 +2096,8 @@ (ashift:DI (zero_extend:DI (match_operand:SI 1 "register_operand" "k")) (const_int 32)) - (zero_extend:DI (match_operand:SI 2 "register_operand" "k"))))] + (zero_extend:DI (match_operand:SI 2 "register_operand" "k")))) + (unspec [(const_int 0)] UNSPEC_MASKOP)] "TARGET_AVX512BW" "kunpckdq\t{%2, %1, %0|%0, %1, %2}" [(set_attr "mode" "DI")]) @@ -17398,21 +17401,26 @@ }) (define_expand "vec_pack_trunc_qi" - [(set (match_operand:HI 0 "register_operand") - (ior:HI (ashift:HI (zero_extend:HI (match_operand:QI 2 "register_operand")) - (const_int 8)) - (zero_extend:HI (match_operand:QI 1 "register_operand"))))] + [(parallel + [(set (match_operand:HI 0 "register_operand") + (ior:HI + (ashift:HI (zero_extend:HI (match_operand:QI 2 "register_operand")) + (const_int 8)) + (zero_extend:HI (match_operand:QI 1 "register_operand")))) + (unspec [(const_int 0)] UNSPEC_MASKOP)])] "TARGET_AVX512F") (define_expand "vec_pack_trunc_<mode>" - [(set (match_operand:<DOUBLEMASKMODE> 0 "register_operand") - (ior:<DOUBLEMASKMODE> - (ashift:<DOUBLEMASKMODE> + [(parallel + [(set (match_operand:<DOUBLEMASKMODE> 0 "register_operand") + (ior:<DOUBLEMASKMODE> + (ashift:<DOUBLEMASKMODE> + (zero_extend:<DOUBLEMASKMODE> + (match_operand:SWI24 2 "register_operand")) + (match_dup 3)) (zero_extend:<DOUBLEMASKMODE> - (match_operand:SWI24 2 "register_operand")) - (match_dup 3)) - (zero_extend:<DOUBLEMASKMODE> - (match_operand:SWI24 1 "register_operand"))))] + (match_operand:SWI24 1 "register_operand")))) + (unspec [(const_int 0)] UNSPEC_MASKOP)])] "TARGET_AVX512BW" { operands[3] = GEN_INT (GET_MODE_BITSIZE (<MODE>mode)); diff --git a/gcc/testsuite/g++.target/i386/pr91681.C b/gcc/testsuite/g++.target/i386/pr91681.C new file mode 100644 index 0000000..0271e43 --- /dev/null +++ b/gcc/testsuite/g++.target/i386/pr91681.C @@ -0,0 +1,20 @@ +/* { dg-do compile { target int128 } } */ +/* { dg-options "-O2" } */ + +void multiply128x64x2_3 ( + const unsigned long a, + const unsigned long b, + const unsigned long c, + const unsigned long d, + __uint128_t o[2]) +{ + __uint128_t B0 = (__uint128_t) b * c; + __uint128_t B2 = (__uint128_t) a * c; + __uint128_t B1 = (__uint128_t) b * d; + __uint128_t B3 = (__uint128_t) a * d; + + o[0] = B2 + (B0 >> 64); + o[1] = B3 + (B1 >> 64); +} + +/* { dg-final { scan-assembler-not "xor" } } */ diff --git a/gcc/testsuite/gcc.target/i386/pr91681-1.c b/gcc/testsuite/gcc.target/i386/pr91681-1.c new file mode 100644 index 0000000..ab83cc4 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr91681-1.c @@ -0,0 +1,20 @@ +/* { dg-do compile { target int128 } } */ +/* { dg-options "-O2" } */ +unsigned __int128 m; + +unsigned __int128 foo(unsigned __int128 x, unsigned long long y) +{ + return x + y; +} + +void bar(unsigned __int128 x, unsigned long long y) +{ + m = x + y; +} + +void baz(unsigned long long y) +{ + m += y; +} + +/* { dg-final { scan-assembler-not "xor" } } */ diff --git a/gcc/testsuite/gcc.target/i386/pr91681-2.c b/gcc/testsuite/gcc.target/i386/pr91681-2.c new file mode 100644 index 0000000..ea52c72 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr91681-2.c @@ -0,0 +1,20 @@ +/* { dg-do compile { target int128 } } */ +/* { dg-options "-O2" } */ +unsigned __int128 m; + +unsigned __int128 foo(unsigned __int128 x, unsigned long long y) +{ + return x - y; +} + +void bar(unsigned __int128 x, unsigned long long y) +{ + m = x - y; +} + +void baz(unsigned long long y) +{ + m -= y; +} + +/* { dg-final { scan-assembler-not "xor" } } */ diff --git a/gcc/testsuite/gcc.target/i386/pr91681-3.c b/gcc/testsuite/gcc.target/i386/pr91681-3.c new file mode 100644 index 0000000..22a03c2 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr91681-3.c @@ -0,0 +1,16 @@ +/* { dg-do compile { target ia32 } } */ +/* { dg-options "-O2" } */ + +unsigned long long m; + +unsigned long long foo(unsigned long long x, unsigned int y) +{ + return x - y; +} + +void bar(unsigned long long x, unsigned int y) +{ + m = x - y; +} + +/* { dg-final { scan-assembler-not "xor" } } */