https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114522
--- Comment #4 from Andrew Pinski <pinskia at gcc dot gnu.org> ---

The function in aes_xor_combine.c which fails is:
```
#include <arm_neon.h>

#define AESE(r, v, key) (r = vaeseq_u8 ((v), (key)));
#define AESD(r, v, key) (r = vaesdq_u8 ((v), (key)));

const uint8x16_t zero = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};

uint8x16_t foo1 (uint8x16_t a, uint8x16_t b)
{
  uint8x16_t dummy;
  AESE(dummy, a ^ b, zero);
  AESE(dummy, dummy ^ a, zero);
  return dummy;
}
```
Compile with `-O3 -mcpu=cortex-a55+aes`.

What we have in this case is:
```
(insn 7 4 8 2 (set (reg:V16QI 107 [ _1 ])
        (xor:V16QI (reg/v:V16QI 105 [ aD.22791 ])
            (reg/v:V16QI 106 [ bD.22792 ]))) "/app/example.cpp":14:3 1771 {xorv16qi3}
     (expr_list:REG_DEAD (reg/v:V16QI 106 [ bD.22792 ])
        (nil)))
(insn 8 7 9 2 (set (reg:V16QI 108)
        (const_vector:V16QI [
                (const_int 0 [0]) repeated x16
            ])) "/opt/compiler-explorer/arm64/gcc-trunk-20240816/aarch64-unknown-linux-gnu/lib/gcc/aarch64-unknown-linux-gnu/15.0.0/include/arm_neon.h":7311:10 1268 {*aarch64_simd_movv16qi}
     (nil))
(insn 9 8 10 2 (set (reg:V16QI 103 [ _6 ])
        (unspec:V16QI [
                (xor:V16QI (reg:V16QI 107 [ _1 ])
                    (reg:V16QI 108))
            ] UNSPEC_AESE)) "/opt/compiler-explorer/arm64/gcc-trunk-20240816/aarch64-unknown-linux-gnu/lib/gcc/aarch64-unknown-linux-gnu/15.0.0/include/arm_neon.h":7311:10 5203 {aarch64_crypto_aesev16qi}
     (expr_list:REG_EQUAL (unspec:V16QI [
                (reg:V16QI 107 [ _1 ])
            ] UNSPEC_AESE)
        (expr_list:REG_DEAD (reg:V16QI 108)
            (expr_list:REG_DEAD (reg:V16QI 107 [ _1 ])
                (nil)))))
```

Which is:
r107 = r105 ^ r106
r108 = 0 (used again below)
r103 = AESE (r107 ^ r108)

But there is no pattern that accepts `AESE (r107)`, which partly causes the issue here.

Actually, the 3->2 combine should have handled it, but since i2 didn't change we reject it even though its cost is better.
```
Trying 8, 7 -> 9:
    8: r108:V16QI=const_vector
    7: r107:V16QI=r105:V16QI^r113:V16QI
      REG_DEAD r113:V16QI
    9: r103:V16QI=unspec[r107:V16QI^r108:V16QI] 229
      REG_DEAD r107:V16QI
      REG_EQUAL unspec[r107:V16QI] 229
Failed to match this instruction:
(parallel [
        (set (reg:V16QI 103 [ _6 ])
            (unspec:V16QI [
                    (xor:V16QI (reg/v:V16QI 105 [ aD.22791 ])
                        (reg:V16QI 113))
                ] UNSPEC_AESE))
        (set (reg:V16QI 108)
            (const_vector:V16QI [
                    (const_int 0 [0]) repeated x16
                ]))
    ])
Failed to match this instruction:
(parallel [
        (set (reg:V16QI 103 [ _6 ])
            (unspec:V16QI [
                    (xor:V16QI (reg/v:V16QI 105 [ aD.22791 ])
                        (reg:V16QI 113))
                ] UNSPEC_AESE))
        (set (reg:V16QI 108)
            (const_vector:V16QI [
                    (const_int 0 [0]) repeated x16
                ]))
    ])
Successfully matched this instruction:
(set (reg:V16QI 108)
    (const_vector:V16QI [
            (const_int 0 [0]) repeated x16
        ]))
Successfully matched this instruction:
(set (reg:V16QI 103 [ _6 ])
    (unspec:V16QI [
            (xor:V16QI (reg/v:V16QI 105 [ aD.22791 ])
                (reg:V16QI 113))
        ] UNSPEC_AESE))
allowing combination of insns 7, 8 and 9
original costs 8 + 4 + 4 = 16
replacement costs 4 + 4 = 8
i2 didn't change, not doing this
```

I would have expected that check to apply only to 2->2 combines and not to 3->2 combines, which should almost always be a win.
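
To make the complaint concrete, here is a toy model in C of the decision reported in the dump. It is only a sketch and not GCC's combine code: `struct attempt`, `accept_observed`, `accept_expected` and `check_dump_case` are hypothetical names, and the "new cost must not exceed old cost" rule is a simplifying assumption.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical summary of one combine attempt; not a GCC data structure. */
struct attempt {
  int n_inputs;       /* insns handed to combine: 2 or 3 */
  int old_cost;       /* sum of the original insn costs */
  int new_cost;       /* sum of the replacement insn costs */
  bool i2_unchanged;  /* the new i2 pattern equals the old one */
};

/* Behaviour seen in the dump: an unchanged i2 always rejects the attempt. */
bool accept_observed(const struct attempt *a)
{
  if (a->i2_unchanged)
    return false;
  return a->new_cost <= a->old_cost;  /* assumed cost rule */
}

/* Behaviour the comment expects: only 2->2 attempts are rejected that way. */
bool accept_expected(const struct attempt *a)
{
  if (a->i2_unchanged && a->n_inputs == 2)
    return false;
  return a->new_cost <= a->old_cost;  /* assumed cost rule */
}

/* Replays the numbers from "Trying 8, 7 -> 9" above. */
void check_dump_case(void)
{
  struct attempt t = { 3, 8 + 4 + 4, 4 + 4, true };
  assert(!accept_observed(&t));  /* rejected today, despite cost 16 -> 8 */
  assert(accept_expected(&t));   /* would be accepted as a 3->2 combine */
}
```

With the values from the dump (three input insns, cost 16 before and 8 after, i2 unchanged), the first predicate rejects the attempt and the second accepts it, which is the difference the comment is pointing at.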