For the x86 part:

   if (negate)
-    cmp = ix86_expand_int_sse_cmp (operands[0], EQ, cmp,
-                                  CONST0_RTX (GET_MODE (cmp)),
-                                  NULL, NULL, &negate);
-
-  gcc_assert (!negate);
+    {
+      if (TARGET_AVX512F && GET_MODE_SIZE (GET_MODE (cmp)) >= 16)
+       cmp = gen_rtx_XOR (GET_MODE (cmp), cmp, CONSTM1_RTX (GET_MODE (cmp)));
+      else
+       {
+         cmp = ix86_expand_int_sse_cmp (operands[0], EQ, cmp,
+                                        CONST0_RTX (GET_MODE (cmp)),
+                                        NULL, NULL, &negate);
+         gcc_assert (!negate);
+       }
+    }

Technically this is correct; however, in practice AVX512 targets (x86-64-v4) 
go through ix86_expand_mask_vec_cmp instead, so this optimization appears to 
apply only to the avx512f + no-avx512vl case with VL == 16/32, which doesn't 
seem particularly useful.

+    case EQ:
+    case GT:
+    case GTU:
+    case LT:
+    case LTU:
+      if (TARGET_SSE2
+         && GET_MODE_CLASS (mode) == MODE_VECTOR_INT
+         && GET_MODE_SIZE (mode) >= 8)

Do we need "GET_MODE_SIZE (mode) >= 8"? We also support vec_cmp for 
VI_16_32 (2/4-byte) modes.

+       {
+         /* vpcmpeq */

Please also mention vpcmpgt: /* vpcmpeq/vpcmpgt */

+         *total = speed ? COSTS_N_INSNS (1) : COSTS_N_BYTES (4);
+         if (!REG_P (XEXP (x, 0)))
+           *total += rtx_cost (XEXP (x, 0), mode, code, 0, speed);
+         if (!REG_P (XEXP (x, 1)))
+           *total += rtx_cost (XEXP (x, 1), mode, code, 1, speed);
+         return true;
+       }

+      if (TARGET_SSE2
+         && GET_MODE_CLASS (mode) == MODE_VECTOR_INT
+         && GET_MODE_SIZE (mode) >= 8)
+       {
+         if (TARGET_AVX512F && GET_MODE_SIZE (mode) >= 16)
+           /* vpcmpeq + vpternlog */
+           *total = speed ? COSTS_N_INSNS (2) : COSTS_N_BYTES (11);
+         else
+           /* vpcmpeq + pxor + vpcmpeq */

Ditto /* vpcmpeq/vpcmpgt + pxor + vpcmpeq */

+           *total = speed ? COSTS_N_INSNS (3) : COSTS_N_BYTES (12);
+         if (!REG_P (XEXP (x, 0)))
+           *total += rtx_cost (XEXP (x, 0), mode, code, 0, speed);
+         if (!REG_P (XEXP (x, 1)))
+           *total += rtx_cost (XEXP (x, 1), mode, code, 1, speed);
+         return true;

Other x86 parts LGTM.

> -----Original Message-----
> From: Roger Sayle <[email protected]>
> Sent: Sunday, March 1, 2026 8:02 PM
> To: 'GCC Patches' <[email protected]>
> Cc: 'Richard Biener' <[email protected]>; 'Andrew Pinski'
> <[email protected]>; 'Hongtao Liu' <[email protected]>
> Subject: [PATCH v2] PR target/123238: VCOND_MASK regression on aarch64
> and x86_64.
> 
> 
> This is my revised patch for PR target/123238, previously posted at
> https://gcc.gnu.org/pipermail/gcc-patches/2026-February/708283.html
> incorporating the feedback from both Andrew Pinski and Richard Biener, and
> additionally resolving this code quality regression on aarch64, which exhibits
> many of the same issues as x86_64.  Alas, the only native aarch64 that I have
> convenient access to is an Apple M1 laptop, so the examples and patches
> below are against Iain Sandoe's tree for
> aarch64-apple-darwin24.3.0 [as that target isn't yet supported by mainline
> GCC 16].
> 
> To explain and motivate things for the aarch64 maintainers:
> 
> void foo(char c[])
> {
>     for (int i = 0; i < 8; i++)
>         c[i] = c[i] != 'a' ? 'c' : 'e';
> }
> 
> currently generates with -O2 the following:
> 
> foo:    movi    v30.8b, 0x61
>         ldr     d0, [x0]
>         movi    v29.8b, 0x63
>         movi    v31.8b, 0x65
>         cmeq    v30.8b, v0.8b, v30.8b
>         not     v30.8b, v30.8b
>         bit     v31.8b, v29.8b, v30.8b
>         str     d31, [x0]
>         ret
> 
> where a cmeq followed by a not is used to implement NE_EXPR.
> c.f. the comment "Handle NE as !EQ" in aarch64-simd.md's expander of
> vec_cmp<mode><mode>.  With this patch, including the change to
> aarch64_rtx_costs to indicate that NE is more expensive than EQ, the middle-
> end swaps the VCOND_EXPR, reducing the number of instructions in the
> example above [to what it was in GCC 14].
> 
> Changes from the previous version of this patch include Richard's
> recommendation to use get_gimple_for_ssa_name instead of
> SSA_NAME_DEF_STMT, and Andrew's suggestion to log expand's decision to
> the dump file, which now contains lines such as:
> 
> ;; swapping operands of .VCOND_MASK
> ;; cost of original ne: 8
> ;; cost of replacement eq: 4
> 
> or (for failure)
> 
> ;; not swapping operands of .VCOND_MASK
> ;; cost of original eq: 4
> ;; cost of replacement ne: 8
> 
> I've kept the handling of an embedded tcc_comparison, this both helps
> document what the code intends to do, and (defensively) supports expand's
> ability to work with GENERIC, and not just GIMPLE trees (with or without
> TER).  During RTL expansion, the middle-end may temporarily construct trees
> (for intermediate
> expressions) that aren't always valid in earlier passes.
> 
> Finally, I've no objection to someone else attempting to perform this type of
> (target dependent) optimization during the gimple-isel pass, though I
> expect it
> will be difficult to maintain hardware specific idioms, such as aarch64's
> representation of NE as !EQ (in some machine modes) when match.pd and
> elsewhere in the middle-end attempts to keep tree's canonical (and machine
> independent) by reducing operators, commuting constants in binary
> expressions etc.
> 
> I've previously explained how the ISEL pass adversely affects code generation
> on x86_64 (making decisions about lowering that are better left to the
> backends), but the same issue is also there for aarch64.
> 
> Consider the (closely related) case:
> 
> void bar(char c[])
> {
>     for (int i = 0; i < 8; i++)
>         c[i] = c[i] != 'a';
> }
> 
> Here gimple-isel makes the decision to turn a 0/-1 vector to a
> 0/1 vector using bitwise-AND with 1, so on aarch64 we generate:
> 
> bar:    movi    v30.8b, 0x61
>         ldr     d0, [x0]
>         movi    v31.8b, 0x1
>         cmeq    v30.8b, v0.8b, v30.8b
>         bic     v31.8b, v31.8b, v30.8b
>         str     d31, [x0]
>         ret
> 
> Ideally, aarch64 should be able to use "ushr v31.8b, v31.8b, 7"
> but once the .VCOND_MASK is lowered to an AND, the semantics that the
> original vector was a -1/0 mask has been lost, so later passes are unable to
> improve things.  Alas because gimple-isel thinks it knows best, GCC even de-
> optimizes code, introducing instructions into:
> 
> typedef char v8qi __attribute__((vector_size(8))); void baz(v8qi *c) {
>     v8qi t = *c;
>     t = t == (v8qi){'a','a','a','a','a','a','a','a'};
>     t >>= 7;
>     *c = t;
> }
> 
> In theory, one might be able to work around the above glitch with -fdisable-
> tree-isel but that currently ICEs, and would need my
> (rejected) patch from May 2022:
> https://gcc.gnu.org/pipermail/gcc-patches/2022-May/595382.html
> But this is digressing far off topic.
> 
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and
> make -k check, both with and without --target_board=unix{-m32} with no
> new failures.  It has also been tested on iains'
> aarch64-apple-darwin24.3.0 also with no new failures.
> Ok for mainline?  (the middle-end, aarch64 and x86_64 pieces can all be
> approved/applied independently).
> 
> Thanks in advance (and for the previous reviews).
> 
> 
> 2026-03-01  Roger Sayle  <[email protected]>
> 
> gcc/ChangeLog
>         PR target/123238
>         * expr.cc (convert_tree_comp_to_rtx): Make global.
>         * expr.h (convert_tree_comp_to_rtx): Prototype here.
>         * internal-fn.cc (expand_vec_cond_mask_optab_fn): Use rtx_costs
>         to determine whether swapping operands would result in better
>         code.
> 
>         * config/aarch64/aarch64.cc (aarch64_rtx_costs) <case NE/EQ>:
>         Provide costs for integer vector equality and inequality.
> 
>         * config/i386/i386-expand.cc (ix86_expand_int_vec_cmp): On
>         AVX512 targets use a ternlog instead of a comparison to negate
>         the mask (requires one instruction instead of two).
>         * config/i386/i386.cc (ix86_rtx_costs): Refactor code for UNSPEC.
>         Provide costs for UNSPEC_BLENDV and UNSPEC_MOVMSK.  Provide
>         costs for comparison operators of integer vector modes.
> 
> gcc/testsuite/ChangeLog
>         PR target/123238
>         * gcc.target/aarch64/pr123238.c: New test case.
>         * gcc.target/i386/pr123238.c: Likewise.
> 
> Roger
> --
