On Sun, 1 Mar 2026, Roger Sayle wrote:

> This is my revised patch for PR target/123238, previously posted at
> https://gcc.gnu.org/pipermail/gcc-patches/2026-February/708283.html
> incorporating the feedback from both Andrew Pinski and Richard Biener,
> and additionally resolving this code quality regression on aarch64,
> which exhibits many of the same issues as x86_64.  Alas, the only
> native aarch64 that I have convenient access to is an Apple M1 laptop,
> so the examples and patches below are against Iain Sandoe's tree for
> aarch64-apple-darwin24.3.0 [as that target isn't yet supported by
> mainline GCC 16].
>
> To explain and motivate things for the aarch64 maintainers:
>
> void foo(char c[])
> {
>   for (int i = 0; i < 8; i++)
>     c[i] = c[i] != 'a' ? 'c' : 'e';
> }
>
> currently generates with -O2 the following:
>
> foo:    movi    v30.8b, 0x61
>         ldr     d0, [x0]
>         movi    v29.8b, 0x63
>         movi    v31.8b, 0x65
>         cmeq    v30.8b, v0.8b, v30.8b
>         not     v30.8b, v30.8b
>         bit     v31.8b, v29.8b, v30.8b
>         str     d31, [x0]
>         ret
>
> where a cmeq followed by a not is used to implement NE_EXPR,
> cf. the comment "Handle NE as !EQ" in aarch64-simd.md's expander
> of vec_cmp<mode><mode>.  With this patch, including the change
> to aarch64_rtx_costs to indicate that NE is more expensive than EQ,
> the middle-end swaps the operands of the .VCOND_MASK, reducing the
> number of instructions in the example above [to what it was in GCC 14].
>
> Changes from the previous version of this patch include Richard's
> recommendation to use get_gimple_for_ssa_name instead of
> SSA_NAME_DEF_STMT, and Andrew's suggestion to log expand's
> decision to the dump file, which now contains lines such as:
>
> ;; swapping operands of .VCOND_MASK
> ;; cost of original ne: 8
> ;; cost of replacement eq: 4
>
> or (for failure)
>
> ;; not swapping operands of .VCOND_MASK
> ;; cost of original eq: 4
> ;; cost of replacement ne: 8
>
> I've kept the handling of an embedded tcc_comparison; this both
> helps document what the code intends to do, and (defensively)
> supports expand's ability to work with GENERIC, and not just
> GIMPLE trees (with or without TER).  During RTL expansion, the
> middle-end may temporarily construct trees (for intermediate
> expressions) that aren't always valid in earlier passes.
>
> Finally, I've no objection to someone else attempting to perform
> this type of (target dependent) optimization during the gimple-isel
> pass, though I expect it will be difficult to maintain hardware
> specific idioms, such as aarch64's representation of NE as !EQ
> (in some machine modes), when match.pd and elsewhere in the
> middle-end attempt to keep trees canonical (and machine independent)
> by reducing operators, commuting constants in binary expressions,
> etc.
>
> I've previously explained how the ISEL pass adversely affects
> code generation on x86_64 (making decisions about lowering that
> are better left to the backends), but the same issue is also
> there for aarch64.
>
> Consider the (closely related) case:
>
> void bar(char c[])
> {
>   for (int i = 0; i < 8; i++)
>     c[i] = c[i] != 'a';
> }
>
> Here gimple-isel makes the decision to turn a 0/-1 vector into a
> 0/1 vector using a bitwise AND with 1, so on aarch64 we generate:
>
> bar:    movi    v30.8b, 0x61
>         ldr     d0, [x0]
>         movi    v31.8b, 0x1
>         cmeq    v30.8b, v0.8b, v30.8b
>         bic     v31.8b, v31.8b, v30.8b
>         str     d31, [x0]
>         ret
>
> Ideally, aarch64 should be able to use "ushr v31.8b, v31.8b, 7",
> but once the .VCOND_MASK is lowered to an AND, the knowledge that
> the original vector was a -1/0 mask has been lost, so later passes
> are unable to improve things.  Alas, because gimple-isel thinks it
> knows best, GCC even de-optimizes code, introducing instructions into:
>
> typedef char v8qi __attribute__((vector_size(8)));
> void baz(v8qi *c)
> {
>   v8qi t = *c;
>   t = t == (v8qi){'a','a','a','a','a','a','a','a'};
>   t >>= 7;
>   *c = t;
> }
>
> In theory, one might be able to work around the above glitch with
> -fdisable-tree-isel, but that currently ICEs, and would need my
> (rejected) patch from May 2022:
> https://gcc.gnu.org/pipermail/gcc-patches/2022-May/595382.html
> But this is digressing far off topic.
>
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, both with and without --target_board=unix{-m32},
> with no new failures.  It has also been tested on iains'
> aarch64-apple-darwin24.3.0, also with no new failures.
> Ok for mainline?  (The middle-end, aarch64 and x86_64 pieces
> can all be approved/applied independently.)
Are both the middle-end and the backend changes required to fix the PR,
or do they complement each other, possibly on different testcases?

The middle-end change looks correct now, technically, while still being
contrary to the original "spirit" we designed the VEC_COND_EXPR lowering
to .VCOND_MASK to: that the chosen .VCOND_MASK directly maps to the
optab that should be chosen.  In other places we discussed what would be
best in a situation like this, and there are related issues with
.VEC_CMP, where most targets cannot do all comparison operators but
usually claim they can.  So the "obvious" answer was to stop doing that,
but that would require fixup since the middle-end assumes that if one
operator is supported, all others are as well.

To me a target specific fix at this point looks less risky, since we
don't really know what targets do in their rtx-cost hook (though one
could hope that at best all of them would return equal cost if nobody
paid attention).

Thanks for working on this; I think the middle-end parts are OK if
required.  The TREE_CODE_CLASS (cmp_code) == tcc_comparison path is
still dead code, even with TER (the get_gimple_for_ssa_name is what
"TER" is about).

Thanks,
Richard.

> Thanks in advance (and for the previous reviews).
>
>
> 2026-03-01  Roger Sayle  <[email protected]>
>
> gcc/ChangeLog
>         PR target/123238
>         * expr.cc (convert_tree_comp_to_rtx): Make global.
>         * expr.h (convert_tree_comp_to_rtx): Prototype here.
>         * internal-fn.cc (expand_vec_cond_mask_optab_fn): Use rtx_costs
>         to determine whether swapping operands would result in better
>         code.
>
>         * config/aarch64/aarch64.cc (aarch64_rtx_costs) <case NE/EQ>:
>         Provide costs for integer vector equality and inequality.
>
>         * config/i386/i386-expand.cc (ix86_expand_int_vec_cmp): On
>         AVX512 targets use a ternlog instead of a comparison to negate
>         the mask (requires one instruction instead of two).
>         * config/i386/i386.cc (ix86_rtx_costs): Refactor code for UNSPEC.
>         Provide costs for UNSPEC_BLENDV and UNSPEC_MOVMSK.  Provide
>         costs for comparison operators of integer vector modes.
>
> gcc/testsuite/ChangeLog
>         PR target/123238
>         * gcc.target/aarch64/pr123238.c: New test case.
>         * gcc.target/i386/pr123238.c: Likewise.
>
> Roger
> --
>

--
Richard Biener <[email protected]>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Jochen Jaser, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
