On Sun, 1 Mar 2026, Roger Sayle wrote:

> This is my revised patch for PR target/123238, previously posted at
> https://gcc.gnu.org/pipermail/gcc-patches/2026-February/708283.html
> incorporating the feedback from both Andrew Pinski and Richard Biener,
> and additionally resolving this code quality regression on aarch64,
> which exhibits many of the same issues as x86_64.  Alas, the only
> native aarch64 that I have convenient access to is an Apple M1 laptop,
> so the examples and patches below are against Iain Sandoe's tree for
> aarch64-apple-darwin24.3.0 [as that target isn't yet supported by
> mainline GCC 16].
>
> To explain and motivate things for the aarch64 maintainers:
>
> void foo(char c[])
> {
>   for (int i = 0; i < 8; i++)
>     c[i] = c[i] != 'a' ? 'c' : 'e';
> }
>
> currently generates with -O2 the following:
>
> foo:    movi    v30.8b, 0x61
>         ldr     d0, [x0]
>         movi    v29.8b, 0x63
>         movi    v31.8b, 0x65
>         cmeq    v30.8b, v0.8b, v30.8b
>         not     v30.8b, v30.8b
>         bit     v31.8b, v29.8b, v30.8b
>         str     d31, [x0]
>         ret
>
> where a cmeq followed by a not is used to implement NE_EXPR,
> cf. the comment "Handle NE as !EQ" in aarch64-simd.md's expander
> of vec_cmp<mode><mode>.  With this patch, including the change
> to aarch64_rtx_costs to indicate that NE is more expensive than EQ,
> the middle-end swaps the operands of the .VCOND_MASK, reducing the
> number of instructions in the example above [to what it was in GCC 14].
>
> Changes from the previous version of this patch include Richard's
> recommendation to use get_gimple_for_ssa_name instead of
> SSA_NAME_DEF_STMT, and Andrew's suggestion to log expand's
> decision to the dump file, which now contains lines such as:
>
> ;; swapping operands of .VCOND_MASK
> ;; cost of original ne: 8
> ;; cost of replacement eq: 4
>
> or (for failure)
>
> ;; not swapping operands of .VCOND_MASK
> ;; cost of original eq: 4
> ;; cost of replacement ne: 8
>
> I've kept the handling of an embedded tcc_comparison; this both
> helps document what the code intends to do, and (defensively)
> supports expand's ability to work with GENERIC, and not just
> GIMPLE trees (with or without TER).  During RTL expansion, the
> middle-end may temporarily construct trees (for intermediate
> expressions) that aren't always valid in earlier passes.
>
> Finally, I've no objection to someone else attempting to perform
> this type of (target dependent) optimization during the gimple-isel
> pass, though I expect it will be difficult to maintain hardware
> specific idioms, such as aarch64's representation of NE as !EQ
> (in some machine modes), when match.pd and elsewhere in the
> middle-end attempt to keep trees canonical (and machine independent)
> by reducing operators, commuting constants in binary expressions,
> etc.
>
> I've previously explained how the ISEL pass adversely affects
> code generation on x86_64 (making decisions about lowering that
> are better left to the backends), but the same issue is also
> there for aarch64.
>
> Consider the (closely related) case:
>
> void bar(char c[])
> {
>   for (int i = 0; i < 8; i++)
>     c[i] = c[i] != 'a';
> }
>
> Here gimple-isel makes the decision to turn a 0/-1 vector into a
> 0/1 vector using a bitwise AND with 1, so on aarch64 we generate:
>
> bar:    movi    v30.8b, 0x61
>         ldr     d0, [x0]
>         movi    v31.8b, 0x1
>         cmeq    v30.8b, v0.8b, v30.8b
>         bic     v31.8b, v31.8b, v30.8b
>         str     d31, [x0]
>         ret
>
> Ideally, aarch64 should be able to use "ushr v31.8b, v31.8b, 7",
> but once the .VCOND_MASK is lowered to an AND, the knowledge that
> the original vector was a -1/0 mask has been lost, so later passes
> are unable to improve things.  Alas, because gimple-isel thinks it
> knows best, GCC even de-optimizes code, introducing instructions into:
>
> typedef char v8qi __attribute__((vector_size(8)));
> void baz(v8qi *c)
> {
>   v8qi t = *c;
>   t = t == (v8qi){'a','a','a','a','a','a','a','a'};
>   t >>= 7;
>   *c = t;
> }
>
> In theory, one might be able to work around the above glitch with
> -fdisable-tree-isel, but that currently ICEs, and would need my
> (rejected) patch from May 2022:
> https://gcc.gnu.org/pipermail/gcc-patches/2022-May/595382.html
> But this is digressing far off topic.
>
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, both with and without --target_board=unix{-m32},
> with no new failures.  It has also been tested on iains'
> aarch64-apple-darwin24.3.0, also with no new failures.
> Ok for mainline?  (The middle-end, aarch64 and x86_64 pieces
> can all be approved/applied independently.)
Are both the middle-end and the backend changes required to fix the PR,
or do they complement each other, possibly on different testcases?

The middle-end change looks correct now, technically, while still being
contrary to the original "spirit" we designed the VEC_COND_EXPR lowering
to .VCOND_MASK to: that the chosen .VCOND_MASK directly maps to the
optab that should be chosen.  In other places we discussed what would be
best in a situation like this, and there are related issues with
.VEC_CMP, where most targets cannot do all comparison operators but
usually claim they can.  So the "obvious" answer was to stop doing that,
but that would require fixup since the middle-end assumes that if one
operator is supported, all others are as well.

To me a target specific fix at this point looks less risky, since we
don't really know what targets do in their rtx-cost hook (though one
could hope that at best all of them would return equal cost if nobody
paid attention).

Thanks for working on this; I think the middle-end parts are OK if
required.  The TREE_CODE_CLASS (cmp_code) == tcc_comparison path is
still dead code, even with TER (the get_gimple_for_ssa_name is what
"TER" is about).

Thanks,
Richard.

> Thanks in advance (and for the previous reviews).
>
>
> 2026-03-01  Roger Sayle  <[email protected]>
>
> gcc/ChangeLog
>         PR target/123238
>         * expr.cc (convert_tree_comp_to_rtx): Make global.
>         * expr.h (convert_tree_comp_to_rtx): Prototype here.
>         * internal-fn.cc (expand_vec_cond_mask_optab_fn): Use rtx_costs
>         to determine whether swapping operands would result in better
>         code.
>
>         * config/aarch64/aarch64.cc (aarch64_rtx_costs) <case NE/EQ>:
>         Provide costs for integer vector equality and inequality.
>
>         * config/i386/i386-expand.cc (ix86_expand_int_vec_cmp): On
>         AVX512 targets use a ternlog instead of a comparison to negate
>         the mask (requires one instruction instead of two).
>         * config/i386/i386.cc (ix86_rtx_costs): Refactor code for UNSPEC.
>         Provide costs for UNSPEC_BLENDV and UNSPEC_MOVMSK.  Provide
>         costs for comparison operators of integer vector modes.
>
> gcc/testsuite/ChangeLog
>         PR target/123238
>         * gcc.target/aarch64/pr123238.c: New test case.
>         * gcc.target/i386/pr123238.c: Likewise.
>
> Roger
> --
>

--
Richard Biener <[email protected]>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Jochen Jaser, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
