Hi Uros,
Many thanks for your speedy review.  This revised patch implements
all three of your recommended improvements; the use of
ix86_binary_operator_ok with code UNKNOWN, the removal of
"n" constraints from const_int_operand predicates, and the use
of force_reg (for input operands, and REG_P for output operands)
to ensure that it's always safe to call gen_lowpart/gen_highpart.

[If we proceed with the recent proposal to split double word 
addition, subtraction and other operations before reload, then
these new add/sub variants should be updated at the same time,
but for now this patch keeps double word patterns consistent].
 
This revised patch has been retested on x86_64-pc-linux-gnu with
make bootstrap and make -k check, both with and without 
--target_board=unix{-m32} with no new failures.  Ok for mainline?


2022-06-05  Roger Sayle  <ro...@nextmovesoftware.com>
            Uroš Bizjak  <ubiz...@gmail.com>

gcc/ChangeLog
        PR target/91681
        * config/i386/i386.md (zero_extendditi2): New define_insn_and_split.
        (*add<dwi>3_doubleword_zext): New define_insn_and_split.
        (*sub<dwi>3_doubleword_zext): New define_insn_and_split.
        (*concat<mode><dwi>3_1): New define_insn_and_split replacing
        previous define_split for implementing DST = (HI<<32)|LO as
        pair of move instructions, setting lopart and hipart.
        (*concat<mode><dwi>3_2): Likewise.
        (*concat<mode><dwi>3_3): Likewise, where HI is zero_extended.
        (*concat<mode><dwi>3_4): Likewise, where HI is zero_extended.
        * config/i386/sse.md (kunpckhi): Add UNSPEC_MASKOP unspec.
        (kunpcksi): Likewise, add UNSPEC_MASKOP unspec.
        (kunpckdi): Likewise, add UNSPEC_MASKOP unspec.
        (vec_pack_trunc_qi): Update to specify required UNSPEC_MASKOP unspec.
        (vec_pack_trunc_<mode>): Likewise.

gcc/testsuite/ChangeLog
        PR target/91681
        * g++.target/i386/pr91681.C: New test case (from the PR).
        * gcc.target/i386/pr91681-1.c: New int128 test case.
        * gcc.target/i386/pr91681-2.c: Likewise.
        * gcc.target/i386/pr91681-3.c: Likewise, but for ia32.


Thanks again,
Roger
--

> -----Original Message-----
> From: Uros Bizjak <ubiz...@gmail.com>
> Sent: 03 June 2022 11:08
> To: Roger Sayle <ro...@nextmovesoftware.com>
> Cc: GCC Patches <gcc-patches@gcc.gnu.org>
> Subject: Re: [x86 PATCH] PR target/91681: zero_extendditi2 pattern for more
> optimizations.
> 
> On Fri, Jun 3, 2022 at 11:49 AM Roger Sayle <ro...@nextmovesoftware.com>
> wrote:
> >
> >
> > Technically, PR target/91681 has already been resolved; we now
> > recognize the highpart multiplication at the tree-level, we no longer
> > use the stack, and we currently generate the same number of
> > instructions as LLVM.  However, it is still possible to do better, the
> > current x86_64 code to generate a double word addition of a zero extended
> operand, looks like:
> >
> >         xorl    %r11d, %r11d
> >         addq    %r10, %rax
> >         adcq    %r11, %rdx
> >
> > when it's possible (as LLVM does) to use an immediate constant:
> >
> >         addq    %r10, %rax
> >         adcq    $0, %rdx
> >
> > To do this, the backend required one or two simple changes, that then
> > themselves required one or two more obscure tweaks.
> >
> > The simple starting point is to define a zero_extendditi2 pattern, for
> > zero extension from DImode to TImode on TARGET_64BIT that is split
> > after reload.  Double word (TImode) addition/subtraction is split
> > after reload, so that constrains when things should happen.
> >
> > With zero extension now visible to combine, we add two new
> > define_insn_and_split that add/subtract a zero extended operand in
> > double word mode.  These apply to both 32-bit and 64-bit code
> > generation, to produce adc $0 and sbb $0.
> >
> > The first strange tweak is that these new patterns interfere with the
> > optimization that recognizes DW:DI = (HI:SI<<32)+LO:SI as a pair of
> > register moves, or more accurately the combine splitter no longer
> > triggers as we're now converting two instructions into two
> > instructions (not three instructions into two instructions).  This is
> > easily repaired (and extended to handle TImode) by changing from a
> > pair of define_split (that handle operand commutativity) to a set of
> > four define_insn_and_split (again to handle operand commutativity).
> >
> > The other/final strange tweak that the above splitters now interfere
> > with AVX512's kunpckdq instruction which is defined as identical RTL,
> > DW:DI = (HI:SI<<32)|zero_extend(LO:SI).  To distinguish this, and also
> > avoid AVX512 mask registers being used by reload to perform SImode
> > scalar shifts, I've added the explicit (unspec UNSPEC_MASKOP) to the
> > unpack mask operations, which matches what sse.md does for the other
> > mask specific (logic) operations.
> >
> > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> > and make -k check, both with and without --target_board=unix{-m32},
> > with no new failures.  Ok for mainline?
> >
> >
> > 2022-06-03  Roger Sayle  <ro...@nextmovesoftware.com>
> >
> > gcc/ChangeLog
> >         PR target/91681
> >         * config/i386/i386.md (zero_extendditi2): New define_insn_and_split.
> >         (*add<dwi>3_doubleword_zext): New define_insn_and_split.
> >         (*sub<dwi>3_doubleword_zext): New define_insn_and_split.
> >         (*concat<mode><dwi>3_1): New define_insn_and_split replacing
> >         previous define_split for implementing DST = (HI<<32)|LO as
> >         pair of move instructions, setting lopart and hipart.
> >         (*concat<mode><dwi>3_2): Likewise.
> >         (*concat<mode><dwi>3_3): Likewise, where HI is zero_extended.
> >         (*concat<mode><dwi>3_4): Likewise, where HI is zero_extended.
> >         * config/i386/sse.md (kunpckhi): Add UNSPEC_MASKOP unspec.
> >         (kunpcksi): Likewise, add UNSPEC_MASKOP unspec.
> >         (kunpckdi): Likewise, add UNSPEC_MASKOP unspec.
> >         (vec_pack_trunc_qi): Update to specify required UNSPEC_MASKOP
> > unspec.
> >         (vec_pack_trunc_<mode>): Likewise.
> >
> > gcc/testsuite/ChangeLog
> >         PR target/91681
> >         * g++.target/i386/pr91681.C: New test case (from the PR).
> >         * gcc.target/i386/pr91681-1.c: New int128 test case.
> >         * gcc.target/i386/pr91681-2.c: Likewise.
> >         * gcc.target/i386/pr91681-3.c: Likewise, but for ia32.
> 
> +  "MEM_P (operands[0]) ? rtx_equal_p (operands[0], operands[1]) &&
> + !MEM_P (operands[2])
> +       : !MEM_P (operands[1])"
> 
> Can we use ix86_binary_operator_ok (UNKNOWN, ...mode..., operands) here
> instead?
> 
> (UNKNOWN RTX code is used to prevent unwanted optimization with
> commutative operands).
> 
> Uros.
diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 602dfa7..b3d2c90 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -4325,6 +4325,16 @@
    (set_attr "type" "imovx,mskmov,mskmov")
    (set_attr "mode" "SI,QI,QI")])
 
+(define_insn_and_split "zero_extendditi2"
+  [(set (match_operand:TI 0 "nonimmediate_operand" "=r,o")
+       (zero_extend:TI (match_operand:DI 1 "nonimmediate_operand" "rm,r")))]
+  "TARGET_64BIT"
+  "#"
+  "&& reload_completed"
+  [(set (match_dup 3) (match_dup 1))
+   (set (match_dup 4) (const_int 0))]
+  "split_double_mode (TImode, &operands[0], 1, &operands[3], &operands[4]);")
+
 ;; Transform xorl; mov[bw] (set strict_low_part) into movz[bw]l.
 (define_peephole2
   [(parallel [(set (match_operand:SWI48 0 "general_reg_operand")
@@ -6453,6 +6463,31 @@
   [(set_attr "type" "alu")
    (set_attr "mode" "QI")])
 
+(define_insn_and_split "*add<dwi>3_doubleword_zext"
+  [(set (match_operand:<DWI> 0 "nonimmediate_operand" "=r,o")
+       (plus:<DWI>
+         (zero_extend:<DWI>
+           (match_operand:DWIH 2 "nonimmediate_operand" "rm,r")) 
+         (match_operand:<DWI> 1 "nonimmediate_operand" "0,0")))
+   (clobber (reg:CC FLAGS_REG))]
+  "ix86_binary_operator_ok (UNKNOWN, <DWI>mode, operands)"
+  "#"
+  "&& reload_completed"
+  [(parallel [(set (reg:CCC FLAGS_REG)
+                  (compare:CCC
+                    (plus:DWIH (match_dup 1) (match_dup 2))
+                    (match_dup 1)))
+             (set (match_dup 0)
+                  (plus:DWIH (match_dup 1) (match_dup 2)))])
+   (parallel [(set (match_dup 3)
+                  (plus:DWIH
+                    (plus:DWIH
+                      (ltu:DWIH (reg:CC FLAGS_REG) (const_int 0))
+                      (match_dup 4))
+                    (const_int 0)))
+             (clobber (reg:CC FLAGS_REG))])]
+ "split_double_mode (<DWI>mode, &operands[0], 2, &operands[0], &operands[3]);")
+
 ;; Like DWI, but use POImode instead of OImode.
 (define_mode_attr DPWI [(QI "HI") (HI "SI") (SI "DI") (DI "TI") (TI "POI")])
 
@@ -6903,6 +6938,29 @@
     }
 })
 
+(define_insn_and_split "*sub<dwi>3_doubleword_zext"
+  [(set (match_operand:<DWI> 0 "nonimmediate_operand" "=r,o")
+       (minus:<DWI>
+         (match_operand:<DWI> 1 "nonimmediate_operand" "0,0")
+         (zero_extend:<DWI>
+           (match_operand:DWIH 2 "nonimmediate_operand" "rm,r"))))
+   (clobber (reg:CC FLAGS_REG))]
+  "ix86_binary_operator_ok (UNKNOWN, <DWI>mode, operands)"
+  "#"
+  "&& reload_completed"
+  [(parallel [(set (reg:CC FLAGS_REG)
+                  (compare:CC (match_dup 1) (match_dup 2)))
+             (set (match_dup 0)
+                  (minus:DWIH (match_dup 1) (match_dup 2)))])
+   (parallel [(set (match_dup 3)
+                  (minus:DWIH
+                    (minus:DWIH
+                      (match_dup 4)
+                      (ltu:DWIH (reg:CC FLAGS_REG) (const_int 0)))
+                    (const_int 0)))
+             (clobber (reg:CC FLAGS_REG))])]
+  "split_double_mode (<DWI>mode, &operands[0], 2, &operands[0], 
&operands[3]);")
+
 (define_insn "*sub<mode>_1"
   [(set (match_operand:SWI 0 "nonimmediate_operand" "=<r>m,<r>")
        (minus:SWI
@@ -10956,34 +11014,82 @@
 
 ;; Split DST = (HI<<32)|LO early to minimize register usage.
 (define_code_iterator any_or_plus [plus ior xor])
-(define_split
-  [(set (match_operand:DI 0 "register_operand")
-       (any_or_plus:DI
-         (ashift:DI (match_operand:DI 1 "register_operand")
-                    (const_int 32))
-         (zero_extend:DI (match_operand:SI 2 "register_operand"))))]
-  "!TARGET_64BIT"
-  [(set (match_dup 3) (match_dup 4))
-   (set (match_dup 5) (match_dup 2))]
+(define_insn_and_split "*concat<mode><dwi>3_1"
+  [(set (match_operand:<DWI> 0 "register_operand" "=r")
+       (any_or_plus:<DWI>
+         (ashift:<DWI> (match_operand:<DWI> 1 "register_operand" "r")
+                       (match_operand:<DWI> 2 "const_int_operand"))
+         (zero_extend:<DWI> (match_operand:DWIH 3 "register_operand" "r"))))]
+  "INTVAL (operands[2]) == <MODE_SIZE> * BITS_PER_UNIT
+   && REG_P (operands[0])
+   && ix86_pre_reload_split ()"
+  "#"
+  "&& 1"
+  [(set (match_dup 4) (match_dup 3))
+   (set (match_dup 5) (match_dup 6))]
+{
+  operands[1] = force_reg (<DWI>mode, operands[1]);
+  operands[4] = gen_lowpart (<MODE>mode, operands[0]);
+  operands[5] = gen_highpart (<MODE>mode, operands[0]);
+  operands[6] = gen_lowpart (<MODE>mode, operands[1]);
+})
+
+(define_insn_and_split "*concat<mode><dwi>3_2"
+  [(set (match_operand:<DWI> 0 "register_operand" "=r")
+       (any_or_plus:<DWI>
+         (zero_extend:<DWI> (match_operand:DWIH 1 "register_operand" "r"))
+         (ashift:<DWI> (match_operand:<DWI> 2 "register_operand" "r")
+                       (match_operand:<DWI> 3 "const_int_operand"))))]
+  "INTVAL (operands[3]) == <MODE_SIZE> * BITS_PER_UNIT
+   && REG_P (operands[0])
+   && ix86_pre_reload_split ()"
+  "#"
+  "&& 1"
+  [(set (match_dup 4) (match_dup 1))
+   (set (match_dup 5) (match_dup 6))]
+{
+  operands[2] = force_reg (<DWI>mode, operands[2]);
+  operands[4] = gen_lowpart (<MODE>mode, operands[0]);
+  operands[5] = gen_highpart (<MODE>mode, operands[0]);
+  operands[6] = gen_lowpart (<MODE>mode, operands[2]);
+})
+
+(define_insn_and_split "*concat<mode><dwi>3_3"
+  [(set (match_operand:<DWI> 0 "register_operand" "=r")
+       (any_or_plus:<DWI>
+         (ashift:<DWI>
+           (zero_extend:<DWI> (match_operand:DWIH 1 "register_operand" "r"))
+           (match_operand:<DWI> 2 "const_int_operand"))
+         (zero_extend:<DWI> (match_operand:DWIH 3 "register_operand" "r"))))]
+  "INTVAL (operands[2]) == <MODE_SIZE> * BITS_PER_UNIT
+   && REG_P (operands[0])
+   && ix86_pre_reload_split ()"
+  "#"
+  "&& 1"
+  [(set (match_dup 4) (match_dup 3))
+   (set (match_dup 5) (match_dup 1))]
 {
-  operands[3] = gen_highpart (SImode, operands[0]);
-  operands[4] = gen_lowpart (SImode, operands[1]);
-  operands[5] = gen_lowpart (SImode, operands[0]);
+  operands[4] = gen_lowpart (<MODE>mode, operands[0]);
+  operands[5] = gen_highpart (<MODE>mode, operands[0]);
 })
 
-(define_split
-  [(set (match_operand:DI 0 "register_operand")
-       (any_or_plus:DI
-         (zero_extend:DI (match_operand:SI 1 "register_operand"))
-         (ashift:DI (match_operand:DI 2 "register_operand")
-                    (const_int 32))))]
-  "!TARGET_64BIT"
-  [(set (match_dup 3) (match_dup 4))
-   (set (match_dup 5) (match_dup 1))]
+(define_insn_and_split "*concat<mode><dwi>3_4"
+  [(set (match_operand:<DWI> 0 "register_operand" "=r")
+       (any_or_plus:<DWI>
+         (zero_extend:<DWI> (match_operand:DWIH 1 "register_operand" "r"))
+         (ashift:<DWI>
+           (zero_extend:<DWI> (match_operand:DWIH 2 "register_operand" "r"))
+           (match_operand:<DWI> 3 "const_int_operand"))))]
+  "INTVAL (operands[3]) == <MODE_SIZE> * BITS_PER_UNIT
+   && REG_P (operands[0])
+   && ix86_pre_reload_split ()"
+  "#"
+  "&& 1"
+  [(set (match_dup 4) (match_dup 1))
+   (set (match_dup 5) (match_dup 2))]
 {
-  operands[3] = gen_highpart (SImode, operands[0]);
-  operands[4] = gen_lowpart (SImode, operands[2]);
-  operands[5] = gen_lowpart (SImode, operands[0]);
+  operands[4] = gen_lowpart (<MODE>mode, operands[0]);
+  operands[5] = gen_highpart (<MODE>mode, operands[0]);
 })
 
 ;; Negation instructions
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index 8b2602b..0198156 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -2070,7 +2070,8 @@
          (ashift:HI
            (zero_extend:HI (match_operand:QI 1 "register_operand" "k"))
            (const_int 8))
-         (zero_extend:HI (match_operand:QI 2 "register_operand" "k"))))]
+         (zero_extend:HI (match_operand:QI 2 "register_operand" "k"))))
+   (unspec [(const_int 0)] UNSPEC_MASKOP)]
   "TARGET_AVX512F"
   "kunpckbw\t{%2, %1, %0|%0, %1, %2}"
   [(set_attr "mode" "HI")
@@ -2083,7 +2084,8 @@
          (ashift:SI
            (zero_extend:SI (match_operand:HI 1 "register_operand" "k"))
            (const_int 16))
-         (zero_extend:SI (match_operand:HI 2 "register_operand" "k"))))]
+         (zero_extend:SI (match_operand:HI 2 "register_operand" "k"))))
+   (unspec [(const_int 0)] UNSPEC_MASKOP)]
   "TARGET_AVX512BW"
   "kunpckwd\t{%2, %1, %0|%0, %1, %2}"
   [(set_attr "mode" "SI")])
@@ -2094,7 +2096,8 @@
          (ashift:DI
            (zero_extend:DI (match_operand:SI 1 "register_operand" "k"))
            (const_int 32))
-         (zero_extend:DI (match_operand:SI 2 "register_operand" "k"))))]
+         (zero_extend:DI (match_operand:SI 2 "register_operand" "k"))))
+   (unspec [(const_int 0)] UNSPEC_MASKOP)]
   "TARGET_AVX512BW"
   "kunpckdq\t{%2, %1, %0|%0, %1, %2}"
   [(set_attr "mode" "DI")])
@@ -17398,21 +17401,26 @@
 })
 
 (define_expand "vec_pack_trunc_qi"
-  [(set (match_operand:HI 0 "register_operand")
-       (ior:HI (ashift:HI (zero_extend:HI (match_operand:QI 2 
"register_operand"))
-                           (const_int 8))
-               (zero_extend:HI (match_operand:QI 1 "register_operand"))))]
+  [(parallel
+    [(set (match_operand:HI 0 "register_operand")
+       (ior:HI
+          (ashift:HI (zero_extend:HI (match_operand:QI 2 "register_operand"))
+                     (const_int 8))
+          (zero_extend:HI (match_operand:QI 1 "register_operand"))))
+     (unspec [(const_int 0)] UNSPEC_MASKOP)])]
   "TARGET_AVX512F")
 
 (define_expand "vec_pack_trunc_<mode>"
-  [(set (match_operand:<DOUBLEMASKMODE> 0 "register_operand")
-       (ior:<DOUBLEMASKMODE>
-         (ashift:<DOUBLEMASKMODE>
+  [(parallel
+    [(set (match_operand:<DOUBLEMASKMODE> 0 "register_operand")
+         (ior:<DOUBLEMASKMODE>
+           (ashift:<DOUBLEMASKMODE>
+             (zero_extend:<DOUBLEMASKMODE>
+               (match_operand:SWI24 2 "register_operand"))
+             (match_dup 3))
            (zero_extend:<DOUBLEMASKMODE>
-             (match_operand:SWI24 2 "register_operand"))
-           (match_dup 3))
-         (zero_extend:<DOUBLEMASKMODE>
-           (match_operand:SWI24 1 "register_operand"))))]
+             (match_operand:SWI24 1 "register_operand"))))
+     (unspec [(const_int 0)] UNSPEC_MASKOP)])]
   "TARGET_AVX512BW"
 {
   operands[3] = GEN_INT (GET_MODE_BITSIZE (<MODE>mode));
diff --git a/gcc/testsuite/g++.target/i386/pr91681.C 
b/gcc/testsuite/g++.target/i386/pr91681.C
new file mode 100644
index 0000000..0271e43
--- /dev/null
+++ b/gcc/testsuite/g++.target/i386/pr91681.C
@@ -0,0 +1,20 @@
+/* { dg-do compile { target int128 } } */
+/* { dg-options "-O2" } */
+
+void multiply128x64x2_3 ( 
+    const unsigned long a, 
+    const unsigned long b, 
+    const unsigned long c, 
+    const unsigned long d, 
+    __uint128_t o[2])
+{
+    __uint128_t B0 = (__uint128_t) b * c;
+    __uint128_t B2 = (__uint128_t) a * c;
+    __uint128_t B1 = (__uint128_t) b * d;
+    __uint128_t B3 = (__uint128_t) a * d;
+
+    o[0] = B2 + (B0 >> 64);
+    o[1] = B3 + (B1 >> 64);
+}
+
+/* { dg-final { scan-assembler-not "xor" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr91681-1.c 
b/gcc/testsuite/gcc.target/i386/pr91681-1.c
new file mode 100644
index 0000000..ab83cc4
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr91681-1.c
@@ -0,0 +1,20 @@
+/* { dg-do compile { target int128 } } */
+/* { dg-options "-O2" } */
+unsigned __int128 m;
+
+unsigned __int128 foo(unsigned __int128 x, unsigned long long y)
+{
+    return x + y;
+}
+
+void bar(unsigned __int128 x, unsigned long long y)
+{
+    m = x + y;
+}
+
+void baz(unsigned long long y)
+{
+    m += y;
+}
+
+/* { dg-final { scan-assembler-not "xor" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr91681-2.c 
b/gcc/testsuite/gcc.target/i386/pr91681-2.c
new file mode 100644
index 0000000..ea52c72
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr91681-2.c
@@ -0,0 +1,20 @@
+/* { dg-do compile { target int128 } } */
+/* { dg-options "-O2" } */
+unsigned __int128 m;
+
+unsigned __int128 foo(unsigned __int128 x, unsigned long long y)
+{
+    return x - y;
+}
+
+void bar(unsigned __int128 x, unsigned long long y)
+{
+    m = x - y;
+}
+
+void baz(unsigned long long y)
+{
+    m -= y;
+}
+
+/* { dg-final { scan-assembler-not "xor" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr91681-3.c 
b/gcc/testsuite/gcc.target/i386/pr91681-3.c
new file mode 100644
index 0000000..22a03c2
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr91681-3.c
@@ -0,0 +1,16 @@
+/* { dg-do compile { target ia32 } } */
+/* { dg-options "-O2" } */
+
+unsigned long long m;
+
+unsigned long long foo(unsigned long long x, unsigned int y)
+{
+    return x - y;
+}
+
+void bar(unsigned long long x, unsigned int y)
+{
+    m = x - y;
+}
+
+/* { dg-final { scan-assembler-not "xor" } } */

Reply via email to