https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67682
            Bug ID: 67682
           Summary: Missed vectorization: (another) straight-line
                    memcpy/memset not vectorized when equivalent loop is
           Product: gcc
           Version: 6.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: alalaw01 at gcc dot gnu.org
  Target Milestone: ---
            Target: aarch64

This code:

void test (int*__restrict a, int*__restrict b)
{
  a[0] = b[0];
  a[1] = b[1];
  a[2] = b[2];
  a[3] = b[3];
  a[4] = 0;
  a[5] = 0;
  a[6] = 0;
  a[7] = 0;
}

is not vectorized; -fdump-tree-slp-details reveals

test.c:4:13: note: Build SLP failed: different operation in stmt MEM[(int *)a_4(D) + 28B] = 0;
test.c:4:13: note: original stmt *a_4(D) = _3;
test.c:4:13: note: === vect_slp_analyze_data_ref_dependences ===
test.c:4:13: note: === vect_slp_analyze_operations ===
test.c:4:13: note: not vectorized: bad operation in basic block.
test.c:4:13: note: ***** Re-trying analysis with vector size 8
...
test.c:4:13: note: Build SLP failed: different operation in stmt MEM[(int *)a_4(D) + 28B] = 0;
test.c:4:13: note: original stmt *a_4(D) = _3;
test.c:4:13: note: === vect_slp_analyze_data_ref_dependences ===
test.c:4:13: note: === vect_slp_analyze_operations ===
test.c:4:13: note: not vectorized: bad operation in basic block.

(the failure with vector size 8 is expected, but vector size 4 should succeed)

Output is:

test:
        ldp     w4, w3, [x1]
        ldp     w2, w1, [x1, 8]
        stp     w4, w3, [x0]
        stp     w2, w1, [x0, 8]
        stp     wzr, wzr, [x0, 16]
        stp     wzr, wzr, [x0, 24]
        ret

Curiously, a similar code but writing elements a[0..3] and a[5..8] (missing out
a[4]) is SLP'd, producing superior:

test:
        ldr     q0, [x1]
        movi    v1.4s, 0
        str     q1, [x0, 20]
        str     q0, [x0]
        ret

And similarly for (equivalent to the first):

void test (int*__restrict a, int*__restrict b)
{
  for (int i = 0; i < 4; i++)
    a[i] = b[i];
  for (int i = 4; i < 8; i++)
    a[i] = 0;
}

producing:

test:
        movi    v0.4s, 0
        ldp     x2, x3, [x1]
        stp     x2, x3, [x0]
        str     q0, [x0, 16]
        ret
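
The report describes the a[0..3] / a[5..8] variant only in prose and states that the loop form is equivalent to the straight-line form. The sketch below puts the three variants side by side so they can be compiled and compared. The function names test_straight, test_loop, test_gap and the helper variants_agree are hypothetical names chosen for this illustration, and the body of test_gap is a reconstruction from the prose description and the 20-byte store offset in the assembly, not the reporter's actual source.

```c
#include <assert.h>
#include <string.h>

/* Straight-line form from the report (the one that is not SLP-vectorized). */
void test_straight(int *__restrict a, int *__restrict b)
{
  a[0] = b[0];
  a[1] = b[1];
  a[2] = b[2];
  a[3] = b[3];
  a[4] = 0;
  a[5] = 0;
  a[6] = 0;
  a[7] = 0;
}

/* Loop form from the report (described as equivalent to the first). */
void test_loop(int *__restrict a, int *__restrict b)
{
  for (int i = 0; i < 4; i++)
    a[i] = b[i];
  for (int i = 4; i < 8; i++)
    a[i] = 0;
}

/* ASSUMPTION: reconstruction of the "missing out a[4]" variant.
   It writes a[0..3] and a[5..8], so `a` needs at least 9 elements. */
void test_gap(int *__restrict a, int *__restrict b)
{
  a[0] = b[0];
  a[1] = b[1];
  a[2] = b[2];
  a[3] = b[3];
  a[5] = 0;
  a[6] = 0;
  a[7] = 0;
  a[8] = 0;
}

/* Hypothetical helper: returns 1 if the straight-line and loop forms
   leave identical contents in their destination arrays. */
int variants_agree(void)
{
  int src[4] = { 1, 2, 3, 4 };
  int x[8], y[8];

  /* Pre-fill with a non-zero pattern so the zero stores are observable. */
  memset(x, 0xff, sizeof x);
  memset(y, 0xff, sizeof y);

  test_straight(x, src);
  test_loop(y, src);
  return memcmp(x, y, sizeof x) == 0;
}
```

Compiling this file for aarch64 with optimization and -fdump-tree-slp-details should allow the SLP decisions for each function to be compared directly; the exact dump text may differ between GCC versions.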