For the x86 part:
if (negate)
- cmp = ix86_expand_int_sse_cmp (operands[0], EQ, cmp,
- CONST0_RTX (GET_MODE (cmp)),
- NULL, NULL, &negate);
-
- gcc_assert (!negate);
+ {
+ if (TARGET_AVX512F && GET_MODE_SIZE (GET_MODE (cmp)) >= 16)
+ cmp = gen_rtx_XOR (GET_MODE (cmp), cmp, CONSTM1_RTX (GET_MODE (cmp)));
+ else
+ {
+ cmp = ix86_expand_int_sse_cmp (operands[0], EQ, cmp,
+ CONST0_RTX (GET_MODE (cmp)),
+ NULL, NULL, &negate);
+ gcc_assert (!negate);
+ }
+ }
Technically it's correct; however, in practice AVX512 targets (x86-64-v4)
will enter ix86_expand_mask_vec_cmp, so this optimization appears to only
cover the avx512f + no-avx512vl + VL == 16/32 case, which doesn't sound
particularly useful.
+ case EQ:
+ case GT:
+ case GTU:
+ case LT:
+ case LTU:
+ if (TARGET_SSE2
+ && GET_MODE_CLASS (mode) == MODE_VECTOR_INT
+ && GET_MODE_SIZE (mode) >= 8)
Do we need "GET_MODE_SIZE (mode) >= 8"? We also support vec_cmp for
VI_16_32 (2/4 bytes).
+ {
+ /* vpcmpeq */
Please also mention vpcmpgt here: /* vpcmpeq/vpcmpgt */
+ *total = speed ? COSTS_N_INSNS (1) : COSTS_N_BYTES (4);
+ if (!REG_P (XEXP (x, 0)))
+ *total += rtx_cost (XEXP (x, 0), mode, code, 0, speed);
+ if (!REG_P (XEXP (x, 1)))
+ *total += rtx_cost (XEXP (x, 1), mode, code, 1, speed);
+ return true;
+ }
+ if (TARGET_SSE2
+ && GET_MODE_CLASS (mode) == MODE_VECTOR_INT
+ && GET_MODE_SIZE (mode) >= 8)
+ {
+ if (TARGET_AVX512F && GET_MODE_SIZE (mode) >= 16)
+ /* vpcmpeq + vpternlog */
+ *total = speed ? COSTS_N_INSNS (2) : COSTS_N_BYTES (11);
+ else
+ /* vpcmpeq + pxor + vpcmpeq */
Ditto: /* vpcmpeq/vpcmpgt + pxor + vpcmpeq */
+ *total = speed ? COSTS_N_INSNS (3) : COSTS_N_BYTES (12);
+ if (!REG_P (XEXP (x, 0)))
+ *total += rtx_cost (XEXP (x, 0), mode, code, 0, speed);
+ if (!REG_P (XEXP (x, 1)))
+ *total += rtx_cost (XEXP (x, 1), mode, code, 1, speed);
+ return true;
Other x86 parts LGTM.
> -----Original Message-----
> From: Roger Sayle <[email protected]>
> Sent: Sunday, March 1, 2026 8:02 PM
> To: 'GCC Patches' <[email protected]>
> Cc: 'Richard Biener' <[email protected]>; 'Andrew Pinski'
> <[email protected]>; 'Hongtao Liu' <[email protected]>
> Subject: [PATCH v2] PR target/123238: VCOND_MASK regression on aarch64
> and x86_64.
>
>
> This is my revised patch for PR target/123238, previously posted at
> https://gcc.gnu.org/pipermail/gcc-patches/2026-February/708283.html
> incorporating the feedback from both Andrew Pinski and Richard Biener, and
> additionally resolving this code quality regression on aarch64, which exhibits
> many of the same issues as x86_64. Alas, the only native aarch64 that I have
> convenient access to is an Apple M1 laptop, so the examples and patches
> below are against Iain Sandoe's tree for
> aarch64-apple-darwin24.3.0 [as that target isn't yet supported by mainline
> GCC 16].
>
> To explain and motivate things for the aarch64 maintainers:
>
> void foo(char c[])
> {
> for (int i = 0; i < 8; i++)
> c[i] = c[i] != 'a' ? 'c' : 'e';
> }
>
> currently generates with -O2 the following:
>
> foo: movi v30.8b, 0x61
> ldr d0, [x0]
> movi v29.8b, 0x63
> movi v31.8b, 0x65
> cmeq v30.8b, v0.8b, v30.8b
> not v30.8b, v30.8b
> bit v31.8b, v29.8b, v30.8b
> str d31, [x0]
> ret
>
> where a cmeq followed by a not is used to implement NE_EXPR.
> c.f. the comment "Handle NE as !EQ" in aarch64-simd.md's expander of
> vec_cmp<mode><mode>. With this patch, including the change to
> aarch64_rtx_costs to indicate that NE is more expensive than EQ, the
> middle-end swaps the VCOND_EXPR, reducing the number of instructions in the
> example above [to what it was in GCC 14].
>
> Changes from the previous version of this patch include Richard's
> recommendation to use get_gimple_for_ssa_name instead of
> SSA_NAME_DEF_STMT, and Andrew's suggestion to log expand's decision to
> the dump file, which now contains lines such as:
>
> ;; swapping operands of .VCOND_MASK
> ;; cost of original ne: 8
> ;; cost of replacement eq: 4
>
> or (for failure)
>
> ;; not swapping operands of .VCOND_MASK
> ;; cost of original eq: 4
> ;; cost of replacement ne: 8
>
> I've kept the handling of an embedded tcc_comparison; this both helps
> document what the code intends to do, and (defensively) supports expand's
> ability to work with GENERIC, and not just GIMPLE trees (with or without
> TER). During RTL expansion, the middle-end may temporarily construct
> trees (for intermediate expressions) that aren't always valid in earlier
> passes.
>
> Finally, I've no objection to someone else attempting to perform this type
> of (target dependent) optimization during the gimple-isel pass, though I
> expect it will be difficult to maintain hardware specific idioms, such as
> aarch64's representation of NE as !EQ (in some machine modes) when match.pd
> and elsewhere in the middle-end attempts to keep trees canonical (and
> machine independent) by reducing operators, commuting constants in binary
> expressions etc.
>
> I've previously explained how the ISEL pass adversely affects code generation
> on x86_64 (making decisions about lowering that are better left to the
> backends), but the same issue is also there for aarch64.
>
> Consider the (closely related) case:
>
> void bar(char c[])
> {
> for (int i = 0; i < 8; i++)
> c[i] = c[i] != 'a';
> }
>
> Here gimple-isel makes the decision to turn a 0/-1 vector to a
> 0/1 vector using bitwise-AND with 1, so on aarch64 we generate:
>
> bar: movi v30.8b, 0x61
> ldr d0, [x0]
> movi v31.8b, 0x1
> cmeq v30.8b, v0.8b, v30.8b
> bic v31.8b, v31.8b, v30.8b
> str d31, [x0]
> ret
>
> Ideally, aarch64 should be able to use "ushr v31.8b, v31.8b, 7"
> but once the .VCOND_MASK is lowered to an AND, the semantics that the
> original vector was a -1/0 mask has been lost, so later passes are unable to
> improve things. Alas, because gimple-isel thinks it knows best, GCC even
> de-optimizes code, introducing instructions into:
>
> typedef char v8qi __attribute__((vector_size(8)));
> void baz(v8qi *c) {
> v8qi t = *c;
> t = t == (v8qi){'a','a','a','a','a','a','a','a'};
> t >>= 7;
> *c = t;
> }
>
> In theory, one might be able to work around the above glitch with
> -fdisable-tree-isel but that currently ICEs, and would need my
> (rejected) patch from May 2022:
> https://gcc.gnu.org/pipermail/gcc-patches/2022-May/595382.html
> But this is digressing far off topic.
>
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and
> make -k check, both with and without --target_board=unix{-m32} with no
> new failures. It has also been tested on Iain's
> aarch64-apple-darwin24.3.0 tree, also with no new failures.
> Ok for mainline? (the middle-end, aarch64 and x86_64 pieces can all be
> approved/applied independently).
>
> Thanks in advance (and for the previous reviews).
>
>
> 2026-03-01 Roger Sayle <[email protected]>
>
> gcc/ChangeLog
> PR target/123238
> * expr.cc (convert_tree_comp_to_rtx): Make global.
> * expr.h (convert_tree_comp_to_rtx): Prototype here.
> * internal-fn.cc (expand_vec_cond_mask_optab_fn): Use rtx_costs
> to determine whether swapping operands would result in better
> code.
>
> * config/aarch64/aarch64.cc (aarch64_rtx_costs) <case NE/EQ>:
> Provide costs for integer vector equality and inequality.
>
> * config/i386/i386-expand.cc (ix86_expand_int_vec_cmp): On
> AVX512 targets use a ternlog instead of a comparison to negate
> the mask (requires one instruction instead of two).
> * config/i386/i386.cc (ix86_rtx_costs): Refactor code for UNSPEC.
> Provide costs for UNSPEC_BLENDV and UNSPEC_MOVMSK. Provide
> costs for comparison operators of integer vector modes.
>
> gcc/testsuite/ChangeLog
> PR target/123238
> * gcc.target/aarch64/pr123238.c: New test case.
> * gcc.target/i386/pr123238.c: Likewise.
>
> Roger
> --