https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114522

--- Comment #4 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
The function in aes_xor_combine.c which fails is:

```
#include <arm_neon.h>

#define AESE(r, v, key) (r = vaeseq_u8 ((v), (key)));
#define AESD(r, v, key) (r = vaesdq_u8 ((v), (key)));

const uint8x16_t zero = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};

uint8x16_t foo1 (uint8x16_t a, uint8x16_t b)
{
  uint8x16_t dummy;
  AESE(dummy, a ^ b, zero);
  AESE(dummy, dummy ^ a, zero);
  return dummy;
}
```

Compile with `-O3 -mcpu=cortex-a55+aes`.
What we have in this case is:
```
(insn 7 4 8 2 (set (reg:V16QI 107 [ _1 ])
        (xor:V16QI (reg/v:V16QI 105 [ aD.22791 ])
            (reg/v:V16QI 106 [ bD.22792 ]))) "/app/example.cpp":14:3 1771 {xorv16qi3}
     (expr_list:REG_DEAD (reg/v:V16QI 106 [ bD.22792 ])
        (nil)))
(insn 8 7 9 2 (set (reg:V16QI 108)
        (const_vector:V16QI [
                (const_int 0 [0]) repeated x16
            ])) "/opt/compiler-explorer/arm64/gcc-trunk-20240816/aarch64-unknown-linux-gnu/lib/gcc/aarch64-unknown-linux-gnu/15.0.0/include/arm_neon.h":7311:10 1268 {*aarch64_simd_movv16qi}
     (nil))
(insn 9 8 10 2 (set (reg:V16QI 103 [ _6 ])
        (unspec:V16QI [
                (xor:V16QI (reg:V16QI 107 [ _1 ])
                    (reg:V16QI 108))
            ] UNSPEC_AESE)) "/opt/compiler-explorer/arm64/gcc-trunk-20240816/aarch64-unknown-linux-gnu/lib/gcc/aarch64-unknown-linux-gnu/15.0.0/include/arm_neon.h":7311:10 5203 {aarch64_crypto_aesev16qi}
     (expr_list:REG_EQUAL (unspec:V16QI [
                (reg:V16QI 107 [ _1 ])
            ] UNSPEC_AESE)
        (expr_list:REG_DEAD (reg:V16QI 108)
            (expr_list:REG_DEAD (reg:V16QI 107 [ _1 ])
                (nil)))))
```

Which is:
r107 = r105 ^ r106
r108 = 0 (used again below)
r103 = AESE (r107 ^ r108)
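
A minimal sketch of why folding the XOR into the AESE operand is valid, assuming AESE is modelled as "XOR data with key, then apply a per-byte round step". The names `toy_round`, `toy_aese` and `fold_is_equivalent` are hypothetical, and `toy_round` is a stand-in, not the real SubBytes/ShiftRows transform:

```c
#include <stdint.h>
#include <string.h>

/* Stand-in for the SubBytes + ShiftRows step; NOT the real AES transform. */
static uint8_t toy_round (uint8_t x)
{
  return (uint8_t) (x * 7u + 3u);
}

/* Toy model of AESE: XOR data with key, then apply the round step per byte. */
static void toy_aese (uint8_t out[16], const uint8_t data[16],
                      const uint8_t key[16])
{
  for (int i = 0; i < 16; i++)
    out[i] = toy_round ((uint8_t) (data[i] ^ key[i]));
}

/* Returns 1 when AESE (a ^ b, 0) and AESE (a, b) agree in the toy model. */
static int fold_is_equivalent (const uint8_t a[16], const uint8_t b[16])
{
  uint8_t zero[16] = {0};
  uint8_t a_xor_b[16], separate[16], folded[16];

  for (int i = 0; i < 16; i++)
    a_xor_b[i] = (uint8_t) (a[i] ^ b[i]);

  toy_aese (separate, a_xor_b, zero); /* r103 = AESE (r107 ^ r108), r108 = 0 */
  toy_aese (folded, a, b);            /* XOR folded into the AESE operands */
  return memcmp (separate, folded, 16) == 0;
}
```

In this model the two forms agree for any inputs because the only difference is an extra XOR with zero, so the zero vector in r108 contributes nothing to insn 9.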

But there is no pattern that accepts `AESE (r107)`, which is part of the cause
of the issue here.
Actually, the 3->2 combine should have handled it, but since i2 didn't change
we reject it even though its cost is better.

```
Trying 8, 7 -> 9:
    8: r108:V16QI=const_vector
    7: r107:V16QI=r105:V16QI^r113:V16QI
      REG_DEAD r113:V16QI
    9: r103:V16QI=unspec[r107:V16QI^r108:V16QI] 229
      REG_DEAD r107:V16QI
      REG_EQUAL unspec[r107:V16QI] 229
Failed to match this instruction:
(parallel [
        (set (reg:V16QI 103 [ _6 ])
            (unspec:V16QI [
                    (xor:V16QI (reg/v:V16QI 105 [ aD.22791 ])
                        (reg:V16QI 113))
                ] UNSPEC_AESE))
        (set (reg:V16QI 108)
            (const_vector:V16QI [
                    (const_int 0 [0]) repeated x16
                ]))
    ])
Failed to match this instruction:
(parallel [
        (set (reg:V16QI 103 [ _6 ])
            (unspec:V16QI [
                    (xor:V16QI (reg/v:V16QI 105 [ aD.22791 ])
                        (reg:V16QI 113))
                ] UNSPEC_AESE))
        (set (reg:V16QI 108)
            (const_vector:V16QI [
                    (const_int 0 [0]) repeated x16
                ]))
    ])
Successfully matched this instruction:
(set (reg:V16QI 108)
    (const_vector:V16QI [
            (const_int 0 [0]) repeated x16
        ]))
Successfully matched this instruction:
(set (reg:V16QI 103 [ _6 ])
    (unspec:V16QI [
            (xor:V16QI (reg/v:V16QI 105 [ aD.22791 ])
                (reg:V16QI 113))
        ] UNSPEC_AESE))
allowing combination of insns 7, 8 and 9
original costs 8 + 4 + 4 = 16
replacement costs 4 + 4 = 8
i2 didn't change, not doing this
```

I would have expected that rejection only for 2->2 combines, not for 3->2
combines, which should almost always be a win.
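
The distinction can be sketched as a toy model. This is not GCC's combine code: `struct attempt`, `accept_observed` and `accept_expected` are hypothetical names, and the cost check is simplified to "replacement is no more expensive than the original".

```c
#include <stdbool.h>

/* Hypothetical summary of one combine attempt; not a GCC data structure. */
struct attempt
{
  int n_orig;        /* number of insns before the combination (2 or 3) */
  int n_new;         /* number of insns after the combination */
  int orig_cost;     /* summed cost of the original insns */
  int new_cost;      /* summed cost of the replacement insns */
  bool i2_unchanged; /* new i2 pattern is identical to the old i2 pattern */
};

/* Behaviour seen in the dump: an unchanged i2 always rejects the attempt. */
static bool accept_observed (const struct attempt *t)
{
  if (t->i2_unchanged)
    return false;
  return t->new_cost <= t->orig_cost;
}

/* Expected behaviour: an unchanged i2 only rejects the attempt when no insn
   is eliminated (for example 2->2); a 3->2 attempt is judged on cost.  */
static bool accept_expected (const struct attempt *t)
{
  if (t->i2_unchanged && t->n_new >= t->n_orig)
    return false;
  return t->new_cost <= t->orig_cost;
}
```

With the numbers from the dump above (`{3, 2, 16, 8, true}`), `accept_observed` rejects the attempt while `accept_expected` accepts it; a 2->2 attempt with an unchanged i2 is rejected by both.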
