Hi All,

This patch series optimizes AArch64 codegen for narrowing operations,
shift-and-narrow combinations, and some comparisons with bitmasks.

There are more to come, but this is the first batch.  The series shows a 2%
gain on x264 in SPECCPU2017 with a 0.05% code-size reduction, and a 5-10%
performance gain on various intrinsics-optimized real-world libraries.

One part that is still missing and needs additional work is combining stores
to sequential locations.  Consider:

#include <arm_neon.h>

#define SIZE 8
#define SIZE2 8 * 8 * 8

extern void pop (uint8_t*);

void foo (int16x8_t row0, int16x8_t row1, int16x8_t row2, int16x8_t row3,
          int16x8_t row4, int16x8_t row5, int16x8_t row6, int16x8_t row7) {
    uint8_t block_nbits[SIZE2];

    uint8x8_t row0_nbits = vsub_u8(vdup_n_u8(16),
                                   vmovn_u16(vreinterpretq_u16_s16(row0)));
    uint8x8_t row1_nbits = vsub_u8(vdup_n_u8(16),
                                   vmovn_u16(vreinterpretq_u16_s16(row1)));
    uint8x8_t row2_nbits = vsub_u8(vdup_n_u8(16),
                                   vmovn_u16(vreinterpretq_u16_s16(row2)));
    uint8x8_t row3_nbits = vsub_u8(vdup_n_u8(16),
                                   vmovn_u16(vreinterpretq_u16_s16(row3)));
    uint8x8_t row4_nbits = vsub_u8(vdup_n_u8(16),
                                   vmovn_u16(vreinterpretq_u16_s16(row4)));
    uint8x8_t row5_nbits = vsub_u8(vdup_n_u8(16),
                                   vmovn_u16(vreinterpretq_u16_s16(row5)));
    uint8x8_t row6_nbits = vsub_u8(vdup_n_u8(16),
                                   vmovn_u16(vreinterpretq_u16_s16(row6)));
    uint8x8_t row7_nbits = vsub_u8(vdup_n_u8(16),
                                   vmovn_u16(vreinterpretq_u16_s16(row7)));

    vst1_u8(block_nbits + 0 * SIZE, row0_nbits);
    vst1_u8(block_nbits + 1 * SIZE, row1_nbits);
    vst1_u8(block_nbits + 2 * SIZE, row2_nbits);
    vst1_u8(block_nbits + 3 * SIZE, row3_nbits);
    vst1_u8(block_nbits + 4 * SIZE, row4_nbits);
    vst1_u8(block_nbits + 5 * SIZE, row5_nbits);
    vst1_u8(block_nbits + 6 * SIZE, row6_nbits);
    vst1_u8(block_nbits + 7 * SIZE, row7_nbits);

    pop (block_nbits);
}

currently generates:

        movi    v1.8b, #0x10
        xtn     v17.8b, v17.8h
        xtn     v23.8b, v23.8h
        xtn     v22.8b, v22.8h
        xtn     v4.8b, v21.8h
        xtn     v20.8b, v20.8h
        xtn     v19.8b, v19.8h
        xtn     v18.8b, v18.8h
        xtn     v24.8b, v24.8h
        sub     v17.8b, v1.8b, v17.8b
        sub     v23.8b, v1.8b, v23.8b
        sub     v22.8b, v1.8b, v22.8b
        sub     v16.8b, v1.8b, v4.8b
        sub     v8.8b, v1.8b, v20.8b
        sub     v4.8b, v1.8b, v19.8b
        sub     v2.8b, v1.8b, v18.8b
        sub     v1.8b, v1.8b, v24.8b
        stp     d17, d23, [sp, #224]
        stp     d22, d16, [sp, #240]
        stp     d8, d4, [sp, #256]
        stp     d2, d1, [sp, #272]

where the optimized codegen for this is:

        movi    v1.16b, #0x10
        uzp1    v17.16b, v17.16b, v23.16b
        uzp1    v22.16b, v22.16b, v4.16b
        uzp1    v20.16b, v20.16b, v19.16b
        uzp1    v24.16b, v18.16b, v24.16b
        sub     v17.16b, v1.16b, v17.16b
        sub     v18.16b, v1.16b, v22.16b
        sub     v19.16b, v1.16b, v20.16b
        sub     v20.16b, v1.16b, v24.16b
        stp     q17, q18, [sp, #224]
        stp     q19, q20, [sp, #256]
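At the source level, the optimized sequence corresponds roughly to rewriting
the body of foo above as in the sketch below.  This is only an illustration
(the nbits16/row01/... names are made up here), and it assumes little-endian,
where vuzp1q_u8 on the reinterpreted rows picks out the same low bytes of
each 16-bit lane that vmovn_u16 would produce:

    /* Narrow two rows at once: uzp1 keeps the even-indexed bytes, i.e. the
       low byte of every 16-bit lane of both inputs (little-endian).  */
    uint8x16_t nbits16 = vdupq_n_u8(16);
    uint8x16_t row01 = vuzp1q_u8(vreinterpretq_u8_s16(row0),
                                 vreinterpretq_u8_s16(row1));
    uint8x16_t row23 = vuzp1q_u8(vreinterpretq_u8_s16(row2),
                                 vreinterpretq_u8_s16(row3));
    uint8x16_t row45 = vuzp1q_u8(vreinterpretq_u8_s16(row4),
                                 vreinterpretq_u8_s16(row5));
    uint8x16_t row67 = vuzp1q_u8(vreinterpretq_u8_s16(row6),
                                 vreinterpretq_u8_s16(row7));

    /* One 16-byte subtract and one 16-byte store replace two 8-byte ones.  */
    vst1q_u8(block_nbits + 0 * SIZE, vsubq_u8(nbits16, row01));
    vst1q_u8(block_nbits + 2 * SIZE, vsubq_u8(nbits16, row23));
    vst1q_u8(block_nbits + 4 * SIZE, vsubq_u8(nbits16, row45));
    vst1q_u8(block_nbits + 6 * SIZE, vsubq_u8(nbits16, row67));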
Getting there requires us to recognize the stores into sequential locations
(the multiple stp d blocks in the current example) and merge them into one.
This pattern happens reasonably often, but I am unsure how best to handle it.
For one thing, it requires st1 and friends not to be unspecs, which is
currently the focus of
https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579582.html

Thanks,
Tamar

--- inline copy of patch --
--