https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88915
Bug ID: 88915
Summary: Try smaller vectorisation factors in scalar fallback
Product: gcc
Version: 9.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: ktkachov at gcc dot gnu.org
Blocks: 53947
Target Milestone: ---

The hot function get_ref in 525.x264_r inlines a hot helper that performs a
vector average:

void pixel_avg( unsigned char *dst,  int i_dst_stride,
                unsigned char *src1, int i_src1_stride,
                unsigned char *src2, int i_src2_stride,
                int i_width, int i_height )
{
    for( int y = 0; y < i_height; y++ )
    {
        for( int x = 0; x < i_width; x++ )
            dst[x] = ( src1[x] + src2[x] + 1 ) >> 1;
        dst  += i_dst_stride;
        src1 += i_src1_stride;
        src2 += i_src2_stride;
    }
}

GCC 9 already knows how to generate vector average instructions (PR 85694).
For aarch64 it generates a 16x vectorised loop. Runtime profiling of the
arguments to this function, however, shows that more than 50% of the time
i_width has the value 8, so the vector loop is skipped in favour of the
scalar fallback:

32.07%  40ed2c  ldrb  w3, [x0,x5]
11.41%  40ed30  ldrb  w11, [x4,x5]
        40ed34  add   w3, w3, w11
        40ed38  add   w3, w3, #0x1
        40ed3c  asr   w3, w3, #1
 0.71%  40ed40  strb  w3, [x2,x5]
        40ed44  add   x5, x5, #0x1
        40ed48  cmp   w6, w5
        40ed4c  b.gt  <loop>

The most frequent runtime combinations of inputs to this function are:

29240545  i_height: 8,  i_width: 8,  i_dst_stride: 16, i_src1_stride: 1344, i_src2_stride: 1344
22714355  i_height: 16, i_width: 16, i_dst_stride: 16, i_src1_stride: 1344, i_src2_stride: 1344
19669512  i_height: 8,  i_width: 8,  i_dst_stride: 16, i_src1_stride: 704,  i_src2_stride: 704
 3689216  i_height: 16, i_width: 8,  i_dst_stride: 16, i_src1_stride: 1344, i_src2_stride: 1344
 3670639  i_height: 8,  i_width: 16, i_dst_stride: 16, i_src1_stride: 1344, i_src2_stride: 1344

That's a shame: AArch64 supports the V8QI form of the vector average
instruction (and advertises it through optabs).
With --param vect-epilogues-nomask=1 we already generate something like:

if (bytes_left > 16)
  {
    while (bytes_left > 16)
      16x_vectorised;
    if (bytes_left > 8)
      8x_vectorised;
    unrolled_scalar_epilogue;
  }
else
  scalar_loop;

Could we perhaps generate:

while (bytes_left > 16)
  16x_vectorised;
if (bytes_left > 8)
  8x_vectorised;
unrolled_scalar_epilogue; // or keep it as a rolled scalar_loop to save on codesize?

Basically I'm looking for a way to take advantage of the 8x vectorised form.

Referenced Bugs:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations