Hello, I'd like to apply this to resolve PR 85758. In the bug discussion Marc noted that what we are doing leads to code with higher instruction-level parallelism, but it also has higher register pressure and longer code if the target does not expose and-not patterns.
Now that we have 2->2 combine, it is able to recover SSE and-not for

    typedef unsigned T __attribute__((vector_size(16)));
    void g(T, T);
    void f(T a, T b, T m, T s)
    {
      m &= s;
      a += m;
      m ^= s;
      b += m;
      g(a, b);
    }

but in the scalar case costs say the original sequence is cheaper, even
with -mbmi.  I think it's fine.

OK for trunk?  If a testcase is needed, please tell me how to implement it.

Alexander

	* match.pd ((X & Y) ^ Y): Add :s qualifier to inner operand.

diff --git a/gcc/match.pd b/gcc/match.pd
index cb3c93e3e16..d43e52d05cd 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -1027,7 +1027,7 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
 (for opo (bit_and bit_xor)
      opi (bit_xor bit_and)
  (simplify
-  (opo:c (opi:c @0 @1) @1)
+  (opo:c (opi:cs @0 @1) @1)
  (bit_and (bit_not @0) @1)))
 /* Given a bit-wise operation CODE applied to ARG0 and ARG1, see if both