This is my revised patch for PR target/123238, previously posted at
https://gcc.gnu.org/pipermail/gcc-patches/2026-February/708283.html
incorporating the feedback from both Andrew Pinski and Richard Biener,
and additionally resolving this code quality regression on aarch64,
which exhibits many of the same issues as x86_64. Alas, the only
native aarch64 hardware I have convenient access to is an Apple M1 laptop,
so the examples and patches below are against Iain Sandoe's tree for
aarch64-apple-darwin24.3.0 [as that target isn't yet supported by
mainline GCC 16].
To explain and motivate things for the aarch64 maintainers:
void foo(char c[])
{
for (int i = 0; i < 8; i++)
c[i] = c[i] != 'a' ? 'c' : 'e';
}
currently generates with -O2 the following:
foo: movi v30.8b, 0x61
ldr d0, [x0]
movi v29.8b, 0x63
movi v31.8b, 0x65
cmeq v30.8b, v0.8b, v30.8b
not v30.8b, v30.8b
bit v31.8b, v29.8b, v30.8b
str d31, [x0]
ret
where a cmeq followed by a not is used to implement NE_EXPR;
cf. the comment "Handle NE as !EQ" in aarch64-simd.md's expander
for vec_cmp<mode><mode>. With this patch, including the change
to aarch64_rtx_costs to indicate that NE is more expensive than EQ,
the middle-end swaps the operands of the .VCOND_MASK, reducing the
number of instructions in the example above [to what it was in GCC 14].
Changes from the previous version of this patch include Richard's
recommendation to use get_gimple_for_ssa_name instead of
SSA_NAME_DEF_STMT, and Andrew's suggestion to log expand's
decision to the dump file, which now contains lines such as:
;; swapping operands of .VCOND_MASK
;; cost of original ne: 8
;; cost of replacement eq: 4
or (for failure)
;; not swapping operands of .VCOND_MASK
;; cost of original eq: 4
;; cost of replacement ne: 8
I've kept the handling of an embedded tcc_comparison, as this both
helps document what the code intends to do and (defensively)
supports expand's ability to work with GENERIC, and not just
GIMPLE trees (with or without TER). During RTL expansion, the
middle-end may temporarily construct trees (for intermediate
expressions) that aren't always valid in earlier passes.
Finally, I've no objection to someone else attempting to perform
this type of (target-dependent) optimization during the gimple-isel
pass, though I expect it will be difficult to maintain
hardware-specific idioms, such as aarch64's representation of NE
as !EQ (in some machine modes), when match.pd and elsewhere in the
middle-end attempt to keep trees canonical (and machine-independent)
by reducing operators, commuting constants in binary expressions,
etc.
I've previously explained how the ISEL pass adversely affects
code generation on x86_64 (making decisions about lowering that
are better left to the backends), but the same issue is also
there for aarch64.
Consider the (closely related) case:
void bar(char c[])
{
for (int i = 0; i < 8; i++)
c[i] = c[i] != 'a';
}
Here gimple-isel makes the decision to turn a 0/-1 vector into a
0/1 vector using a bitwise AND with 1, so on aarch64 we generate:
bar: movi v30.8b, 0x61
ldr d0, [x0]
movi v31.8b, 0x1
cmeq v30.8b, v0.8b, v30.8b
bic v31.8b, v31.8b, v30.8b
str d31, [x0]
ret
Ideally, aarch64 should be able to use "ushr v31.8b, v31.8b, 7",
but once the .VCOND_MASK is lowered to an AND, the knowledge that
the original vector was a -1/0 mask is lost, so later passes
are unable to improve things. Alas, because gimple-isel thinks it
knows best, GCC even de-optimizes code, introducing extra
instructions into:
typedef char v8qi __attribute__((vector_size(8)));
void baz(v8qi *c)
{
v8qi t = *c;
t = t == (v8qi){'a','a','a','a','a','a','a','a'};
t >>= 7;
*c = t;
}
In theory, one might be able to work around the above glitch with
-fdisable-tree-isel, but that currently ICEs and would need my
(rejected) patch from May 2022:
https://gcc.gnu.org/pipermail/gcc-patches/2022-May/595382.html
But this is digressing far off topic.
This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32}
with no new failures. It has also been tested on Iain's
aarch64-apple-darwin24.3.0 tree with no new failures.
Ok for mainline? (the middle-end, aarch64 and x86_64 pieces
can all be approved/applied independently).
Thanks in advance (and for the previous reviews).
2026-03-01 Roger Sayle <[email protected]>
gcc/ChangeLog
PR target/123238
* expr.cc (convert_tree_comp_to_rtx): Make global.
* expr.h (convert_tree_comp_to_rtx): Prototype here.
* internal-fn.cc (expand_vec_cond_mask_optab_fn): Use rtx_costs
to determine whether swapping operands would result in better
code.
* config/aarch64/aarch64.cc (aarch64_rtx_costs) <case NE/EQ>:
Provide costs for integer vector equality and inequality.
* config/i386/i386-expand.cc (ix86_expand_int_vec_cmp): On
AVX512 targets use a ternlog instead of a comparison to negate
the mask (requires one instruction instead of two).
* config/i386/i386.cc (ix86_rtx_costs): Refactor code for UNSPEC.
Provide costs for UNSPEC_BLENDV and UNSPEC_MOVMSK. Provide
costs for comparison operators of integer vector modes.
gcc/testsuite/ChangeLog
PR target/123238
* gcc.target/aarch64/pr123238.c: New test case.
* gcc.target/i386/pr123238.c: Likewise.
Roger
--
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index f2ecb0ee8cb..d92a5aa491c 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -16224,8 +16224,34 @@ cost_plus:
return aarch64_if_then_else_costs (XEXP (x, 0), XEXP (x, 1),
XEXP (x, 2), cost, speed);
- case EQ:
case NE:
+ if (GET_MODE_CLASS (mode) == MODE_VECTOR_INT)
+ {
+ op0 = XEXP (x, 0);
+ op1 = XEXP (x, 1);
+ machine_mode inner_mode = GET_MODE (op0);
+ *cost += rtx_cost (op0, inner_mode, NE, 0, speed);
+ /* NE costs one_cmpl<mode>2 more than EQ. */
+ if (op1 != CONST0_RTX (inner_mode))
+ *cost += rtx_cost (op1, inner_mode, NE, 1, speed)
+ + COSTS_N_INSNS (1);
+ return true;
+ }
+ return false; /* All arguments must be in registers. */
+
+ case EQ:
+ if (GET_MODE_CLASS (mode) == MODE_VECTOR_INT)
+ {
+ op0 = XEXP (x, 0);
+ op1 = XEXP (x, 1);
+ machine_mode inner_mode = GET_MODE (op0);
+ *cost += rtx_cost (op0, inner_mode, EQ, 0, speed);
+ if (op1 != CONST0_RTX (inner_mode))
+ *cost += rtx_cost (op1, inner_mode, EQ, 1, speed);
+ return true;
+ }
+ return false; /* All arguments must be in registers. */
+
case GT:
case GTU:
case LT:
diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
index a82bb4399c9..366ad513da9 100644
--- a/gcc/config/i386/i386-expand.cc
+++ b/gcc/config/i386/i386-expand.cc
@@ -5282,11 +5282,17 @@ ix86_expand_int_vec_cmp (rtx operands[])
return false;
if (negate)
- cmp = ix86_expand_int_sse_cmp (operands[0], EQ, cmp,
- CONST0_RTX (GET_MODE (cmp)),
- NULL, NULL, &negate);
-
- gcc_assert (!negate);
+ {
+ if (TARGET_AVX512F && GET_MODE_SIZE (GET_MODE (cmp)) >= 16)
+ cmp = gen_rtx_XOR (GET_MODE (cmp), cmp, CONSTM1_RTX (GET_MODE (cmp)));
+ else
+ {
+ cmp = ix86_expand_int_sse_cmp (operands[0], EQ, cmp,
+ CONST0_RTX (GET_MODE (cmp)),
+ NULL, NULL, &negate);
+ gcc_assert (!negate);
+ }
+ }
if (operands[0] != cmp)
emit_move_insn (operands[0], cmp);
diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index acedc73b825..a1c32ba953e 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -23069,36 +23069,71 @@ ix86_rtx_costs (rtx x, machine_mode mode, int outer_code_i, int opno,
return false;
case UNSPEC:
- if (XINT (x, 1) == UNSPEC_TP)
- *total = 0;
- else if (XINT (x, 1) == UNSPEC_VTERNLOG)
+ switch (XINT (x, 1))
{
+ case UNSPEC_TP:
+ *total = 0;
+ break;
+
+ case UNSPEC_VTERNLOG:
*total = cost->sse_op;
- *total += rtx_cost (XVECEXP (x, 0, 0), mode, code, 0, speed);
- *total += rtx_cost (XVECEXP (x, 0, 1), mode, code, 1, speed);
- *total += rtx_cost (XVECEXP (x, 0, 2), mode, code, 2, speed);
+ if (!REG_P (XVECEXP (x, 0, 0)))
+ *total += rtx_cost (XVECEXP (x, 0, 0), mode, code, 0, speed);
+ if (!REG_P (XVECEXP (x, 0, 1)))
+ *total += rtx_cost (XVECEXP (x, 0, 1), mode, code, 1, speed);
+ if (!REG_P (XVECEXP (x, 0, 2)))
+ *total += rtx_cost (XVECEXP (x, 0, 2), mode, code, 2, speed);
return true;
- }
- else if (XINT (x, 1) == UNSPEC_PTEST)
- {
+
+ case UNSPEC_PTEST:
+ {
+ *total = cost->sse_op;
+ rtx test_op0 = XVECEXP (x, 0, 0);
+ if (!rtx_equal_p (test_op0, XVECEXP (x, 0, 1)))
+ return false;
+ if (GET_CODE (test_op0) == AND)
+ {
+ rtx and_op0 = XEXP (test_op0, 0);
+ if (GET_CODE (and_op0) == NOT)
+ and_op0 = XEXP (and_op0, 0);
+ *total += rtx_cost (and_op0, GET_MODE (and_op0),
+ AND, 0, speed)
+ + rtx_cost (XEXP (test_op0, 1), GET_MODE (and_op0),
+ AND, 1, speed);
+ }
+ else
+ *total = rtx_cost (test_op0, GET_MODE (test_op0),
+ UNSPEC, 0, speed);
+ }
+ return true;
+
+ case UNSPEC_BLENDV:
*total = cost->sse_op;
- rtx test_op0 = XVECEXP (x, 0, 0);
- if (!rtx_equal_p (test_op0, XVECEXP (x, 0, 1)))
- return false;
- if (GET_CODE (test_op0) == AND)
+ if (!REG_P (XVECEXP (x, 0, 0)))
+ *total += rtx_cost (XVECEXP (x, 0, 0), mode, code, 0, speed);
+ if (!REG_P (XVECEXP (x, 0, 1)))
+ *total += rtx_cost (XVECEXP (x, 0, 1), mode, code, 1, speed);
+ if (!REG_P (XVECEXP (x, 0, 2)))
{
- rtx and_op0 = XEXP (test_op0, 0);
- if (GET_CODE (and_op0) == NOT)
- and_op0 = XEXP (and_op0, 0);
- *total += rtx_cost (and_op0, GET_MODE (and_op0),
- AND, 0, speed)
- + rtx_cost (XEXP (test_op0, 1), GET_MODE (and_op0),
- AND, 1, speed);
+ rtx cond = XVECEXP (x, 0, 2);
+ if ((GET_CODE (cond) == LT || GET_CODE (cond) == GT)
+ && CONST_VECTOR_P (XEXP (cond, 1)))
+ {
+ /* avx2_blendvpd256_gt and friends. */
+ if (!REG_P (XEXP (cond, 0)))
+ *total += rtx_cost (XEXP (cond, 0), mode, code, 2, speed);
+ }
+ else
+ *total += rtx_cost (cond, mode, code, 2, speed);
}
- else
- *total = rtx_cost (test_op0, GET_MODE (test_op0),
- UNSPEC, 0, speed);
return true;
+
+ case UNSPEC_MOVMSK:
+ *total = cost->sse_op;
+ return true;
+
+ default:
+ break;
}
return false;
@@ -23315,6 +23350,70 @@ ix86_rtx_costs (rtx x, machine_mode mode, int outer_code_i, int opno,
}
return false;
+ case EQ:
+ case GT:
+ case GTU:
+ case LT:
+ case LTU:
+ if (TARGET_SSE2
+ && GET_MODE_CLASS (mode) == MODE_VECTOR_INT
+ && GET_MODE_SIZE (mode) >= 8)
+ {
+ /* vpcmpeq */
+ *total = speed ? COSTS_N_INSNS (1) : COSTS_N_BYTES (4);
+ if (!REG_P (XEXP (x, 0)))
+ *total += rtx_cost (XEXP (x, 0), mode, code, 0, speed);
+ if (!REG_P (XEXP (x, 1)))
+ *total += rtx_cost (XEXP (x, 1), mode, code, 1, speed);
+ return true;
+ }
+ if (TARGET_XOP
+ && GET_MODE_CLASS (mode) == MODE_VECTOR_INT
+ && GET_MODE_SIZE (mode) <= 16)
+ {
+ /* vpcomeq */
+ *total = speed ? COSTS_N_INSNS (1) : COSTS_N_BYTES (6);
+ if (!REG_P (XEXP (x, 0)))
+ *total += rtx_cost (XEXP (x, 0), mode, code, 0, speed);
+ if (!REG_P (XEXP (x, 1)))
+ *total += rtx_cost (XEXP (x, 1), mode, code, 1, speed);
+ return true;
+ }
+ return false;
+
+ case NE:
+ case GE:
+ case GEU:
+ if (TARGET_XOP
+ && GET_MODE_CLASS (mode) == MODE_VECTOR_INT
+ && GET_MODE_SIZE (mode) <= 16)
+ {
+ /* vpcomneq */
+ *total = speed ? COSTS_N_INSNS (1) : COSTS_N_BYTES (6);
+ if (!REG_P (XEXP (x, 0)))
+ *total += rtx_cost (XEXP (x, 0), mode, code, 0, speed);
+ if (!REG_P (XEXP (x, 1)))
+ *total += rtx_cost (XEXP (x, 1), mode, code, 1, speed);
+ return true;
+ }
+ if (TARGET_SSE2
+ && GET_MODE_CLASS (mode) == MODE_VECTOR_INT
+ && GET_MODE_SIZE (mode) >= 8)
+ {
+ if (TARGET_AVX512F && GET_MODE_SIZE (mode) >= 16)
+ /* vpcmpeq + vpternlog */
+ *total = speed ? COSTS_N_INSNS (2) : COSTS_N_BYTES (11);
+ else
+ /* vpcmpeq + pxor + vpcmpeq */
+ *total = speed ? COSTS_N_INSNS (3) : COSTS_N_BYTES (12);
+ if (!REG_P (XEXP (x, 0)))
+ *total += rtx_cost (XEXP (x, 0), mode, code, 0, speed);
+ if (!REG_P (XEXP (x, 1)))
+ *total += rtx_cost (XEXP (x, 1), mode, code, 1, speed);
+ return true;
+ }
+ return false;
+
default:
return false;
}
diff --git a/gcc/expr.cc b/gcc/expr.cc
index 4d1a6f3dd1c..203c116f2fa 100644
--- a/gcc/expr.cc
+++ b/gcc/expr.cc
@@ -9121,7 +9121,7 @@ highest_pow2_factor_for_target (const_tree target, const_tree exp)
/* Convert the tree comparison code TCODE to the rtl one where the
signedness is UNSIGNEDP. */
-static enum rtx_code
+enum rtx_code
convert_tree_comp_to_rtx (enum tree_code tcode, int unsignedp)
{
enum rtx_code code;
diff --git a/gcc/expr.h b/gcc/expr.h
index ddd47cb4ecc..1e89a142d8c 100644
--- a/gcc/expr.h
+++ b/gcc/expr.h
@@ -338,6 +338,7 @@ extern tree string_constant (tree, tree *, tree *, tree *);
a constant. */
extern tree byte_representation (tree, tree *, tree *, tree *);
+extern enum rtx_code convert_tree_comp_to_rtx (enum tree_code, int);
extern enum tree_code maybe_optimize_mod_cmp (enum tree_code, tree *, tree *);
extern void maybe_optimize_sub_cmp_0 (enum tree_code, tree *, tree *);
diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
index d879568c6e3..65be5aed35c 100644
--- a/gcc/internal-fn.cc
+++ b/gcc/internal-fn.cc
@@ -3229,6 +3229,74 @@ expand_vec_cond_mask_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
gcc_assert (icode != CODE_FOR_nothing);
+ /* Find the comparison generating the mask OP0. */
+ tree cmp_op0 = NULL_TREE;
+ tree cmp_op1 = NULL_TREE;
+ enum tree_code cmp_code = TREE_CODE (op0);
+ if (TREE_CODE_CLASS (cmp_code) == tcc_comparison)
+ {
+ cmp_op0 = TREE_OPERAND (op0, 0);
+ cmp_op1 = TREE_OPERAND (op0, 1);
+ }
+ else if (cmp_code == SSA_NAME)
+ {
+ gimple *def_stmt = get_gimple_for_ssa_name (op0);
+ if (def_stmt && is_gimple_assign (def_stmt))
+ {
+ cmp_code = gimple_assign_rhs_code (def_stmt);
+ if (TREE_CODE_CLASS (cmp_code) == tcc_comparison)
+ {
+ cmp_op0 = gimple_assign_rhs1 (def_stmt);
+ cmp_op1 = gimple_assign_rhs2 (def_stmt);
+ }
+ }
+ }
+
+ /* Decide whether to invert comparison based on rtx_cost. */
+ if (cmp_op0)
+ {
+ enum tree_code rev_code;
+ tree op_type = TREE_TYPE (cmp_op0);
+ int unsignedp = TYPE_UNSIGNED (op_type);
+ rev_code = invert_tree_comparison (cmp_code, HONOR_NANS (op_type));
+
+ if (rev_code != ERROR_MARK)
+ {
+ tree cmp_type = TREE_TYPE (op0);
+ machine_mode cmp_mode = TYPE_MODE (cmp_type);
+ machine_mode op_mode = TYPE_MODE (op_type);
+ bool speed_p = optimize_insn_for_speed_p ();
+ rtx reg = gen_raw_REG (op_mode, LAST_VIRTUAL_REGISTER + 1);
+ enum rtx_code cmp_rtx_code = convert_tree_comp_to_rtx (cmp_code,
+ unsignedp);
+ rtx veccmp = gen_rtx_fmt_ee (cmp_rtx_code, cmp_mode, reg, reg);
+ int old_cost = rtx_cost (veccmp, cmp_mode, SET, 0, speed_p);
+ enum rtx_code rev_rtx_code = convert_tree_comp_to_rtx (rev_code,
+ unsignedp);
+ PUT_CODE (veccmp, rev_rtx_code);
+ int new_cost = rtx_cost (veccmp, cmp_mode, SET, 0, speed_p);
+ if (new_cost < old_cost)
+ {
+ op0 = fold_build2_loc (EXPR_LOCATION (op0), rev_code,
+ cmp_type, cmp_op0, cmp_op1);
+ std::swap (op1, op2);
+ }
+
+ if (dump_file && (dump_flags & TDF_DETAILS))
+ {
+ fprintf (dump_file,
+ ";; %sswapping operands of .VCOND_MASK\n",
+ new_cost >= old_cost ? "not " : "");
+ fprintf (dump_file,
+ ";; cost of original %s: %d\n",
+ GET_RTX_NAME (cmp_rtx_code), old_cost);
+ fprintf (dump_file,
+ ";; cost of replacement %s: %d\n",
+ GET_RTX_NAME (rev_rtx_code), new_cost);
+ }
+ }
+ }
+
mask = expand_normal (op0);
rtx_op1 = expand_normal (op1);
rtx_op2 = expand_normal (op2);
diff --git a/gcc/testsuite/gcc.target/aarch64/pr123238.c b/gcc/testsuite/gcc.target/aarch64/pr123238.c
new file mode 100644
index 00000000000..3228dde2e76
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr123238.c
@@ -0,0 +1,16 @@
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+
+void foo(char c[])
+{
+ for (int i = 0; i < 8; i++)
+ c[i] = c[i] != 'a' ? 'c' : 'e';
+}
+
+void bar(char c[])
+{
+ for (int i = 0; i < 8; i++)
+ c[i] = c[i] == 'a' ? 'c' : 'e';
+}
+
+/* { dg-final { scan-assembler-not "not" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr123238.c b/gcc/testsuite/gcc.target/i386/pr123238.c
new file mode 100644
index 00000000000..63906ae0fb4
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr123238.c
@@ -0,0 +1,10 @@
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2" } */
+
+void f(char c[])
+{
+ for (int i = 0; i < 8; i++)
+ c[i] = c[i] ? 'a' : 'c';
+}
+
+/* { dg-final { scan-assembler-times "pcmpeqb" 1 } } */