https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86504
Bug ID: 86504
Summary: vectorization failure for a nest loop
Product: gcc
Version: 9.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: jiangning.liu at amperecomputing dot com
Target Milestone: ---

Created attachment 44386
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=44386&action=edit
bad vectorizatoin result for boundary size 16

For the case below, the code generated by “gcc -O3” is very ugly, and the inner loop can be correctly vectorized. Please refer to the attached file test_loop_inner_16.s.

    char g_d[1024], g_s1[1024], g_s2[1024];
    void test_loop(void)
    {
        char *d = g_d, *s1 = g_s1, *s2 = g_s2;

        for ( int y = 0; y < 128; y++ )
        {
            for ( int x = 0; x < 16; x++ )
                d[x] = s1[x] + s2[x];
            d += 16;
        }
    }

If we change the inner loop “for ( int x = 0; x < 16; x++ )” to “for ( int x = 0; x < 32; x++ )”, i.e. the loop boundary size changes from 16 to 32, much cleaner vectorized code is generated. For example, the code below is the aarch64 result for loop boundary size 32, and it is the same case for x86.

    test_loop:
    .LFB0:
            .cfi_startproc
            adrp    x2, g_s1
            adrp    x3, g_s2
            add     x2, x2, :lo12:g_s1
            add     x3, x3, :lo12:g_s2
            adrp    x0, g_d
            adrp    x1, g_d+2048
            add     x0, x0, :lo12:g_d
            add     x1, x1, :lo12:g_d+2048
            ldp     q1, q2, [x2]
            ldp     q3, q0, [x3]
            add     v1.16b, v1.16b, v3.16b
            add     v0.16b, v0.16b, v2.16b
            .p2align 3,,7
    .L2:
            str     q1, [x0]
            str     q0, [x0, 16]!
            cmp     x0, x1
            bne     .L2
            ret

The code generated for loop boundary size 8 is also very bad. Any ideas?
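
To compare the three boundary sizes mentioned above (8, 16 and 32) from a single source file, the test case can be parameterized roughly as follows. This is only a sketch and not part of the original report: the INNER, ROWS, STRIDE and BUF_LEN macros and the check_result() helper are hypothetical names introduced here, and the arrays are sized from the loop bounds instead of the fixed 1024 bytes so that every store stays in bounds, which may make the generated code differ slightly from the attached listing.

```c
#ifndef INNER
#define INNER 16    /* inner trip count; the report compares 8, 16 and 32 */
#endif
#define ROWS   128  /* outer trip count, as in the report */
#define STRIDE 16   /* d advances by 16 bytes per outer iteration */

/* Assumption of this sketch: buffers are large enough for the last row,
   which starts at (ROWS - 1) * STRIDE and is INNER bytes long. */
#define BUF_LEN ((ROWS - 1) * STRIDE + INNER)

char g_d[BUF_LEN], g_s1[BUF_LEN], g_s2[BUF_LEN];

/* Same loop nest as in the report, with the inner bound parameterized. */
void test_loop(void)
{
    char *d = g_d, *s1 = g_s1, *s2 = g_s2;

    for (int y = 0; y < ROWS; y++) {
        for (int x = 0; x < INNER; x++)
            d[x] = s1[x] + s2[x];
        d += STRIDE;
    }
}

/* Scalar reference check. Returns 1 if every byte written by test_loop()
   holds the value stored by the last row that covers it, 0 otherwise.
   Rows overlap when INNER > STRIDE, so the last writer of position p is
   row min(p / STRIDE, ROWS - 1). Bytes no row touches are skipped. */
int check_result(void)
{
    for (int p = 0; p < BUF_LEN; p++) {
        int y = p / STRIDE;
        if (y > ROWS - 1)
            y = ROWS - 1;
        int x = p - y * STRIDE;
        if (x >= INNER)
            continue;   /* gap between rows when INNER < STRIDE */
        if (g_d[p] != (char)(g_s1[x] + g_s2[x]))
            return 0;
    }
    return 1;
}
```

Compiling with, for example, “gcc -O3 -S -DINNER=16” and “gcc -O3 -S -DINNER=32” then gives two assembly files that can be compared directly, and check_result() can be called from a small driver to confirm that the vectorized and scalar results agree.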