https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67682

            Bug ID: 67682
           Summary: Missed vectorization: (another) straight-line
                    memcpy/memset not vectorized when equivalent loop is
           Product: gcc
           Version: 6.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: alalaw01 at gcc dot gnu.org
  Target Milestone: ---
            Target: aarch64

This code:

void
test (int*__restrict a, int*__restrict b)
{
    a[0] = b[0];
    a[1] = b[1];
    a[2] = b[2];
    a[3] = b[3];
    a[4] = 0;
    a[5] = 0;
    a[6] = 0;
    a[7] = 0;
}

is not vectorized; -fdump-tree-slp-details reveals

test.c:4:13: note: Build SLP failed: different operation in stmt MEM[(int *)a_4(D) + 28B] = 0;
test.c:4:13: note: original stmt *a_4(D) = _3;
test.c:4:13: note: === vect_slp_analyze_data_ref_dependences ===
test.c:4:13: note: === vect_slp_analyze_operations ===
test.c:4:13: note: not vectorized: bad operation in basic block.
test.c:4:13: note: ***** Re-trying analysis with vector size 8
...
test.c:4:13: note: Build SLP failed: different operation in stmt MEM[(int *)a_4(D) + 28B] = 0;
test.c:4:13: note: original stmt *a_4(D) = _3;
test.c:4:13: note: === vect_slp_analyze_data_ref_dependences ===
test.c:4:13: note: === vect_slp_analyze_operations ===
test.c:4:13: note: not vectorized: bad operation in basic block.

(the failure with vector size 8 is expected, but vector size 4 should succeed)

Output is:
test:
        ldp     w4, w3, [x1]
        ldp     w2, w1, [x1, 8]
        stp     w4, w3, [x0]
        stp     w2, w1, [x0, 8]
        stp     wzr, wzr, [x0, 16]
        stp     wzr, wzr, [x0, 24]
        ret

Curiously, similar code that writes elements a[0..3] and a[5..8] (skipping
a[4]) is SLP-vectorized, producing the superior:

test:
        ldr     q0, [x1]
        movi    v1.4s, 0
        str     q1, [x0, 20]
        str     q0, [x0]
        ret
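
For reference, the a[0..3] / a[5..8] variant described above might look like the following. This is a reconstruction from the description and the assembly (the zero store at byte offset 20 corresponds to a[5]), not necessarily the exact source that was compiled. The names test_gap and check_test_gap are hypothetical, and the self-check is only there to pin down the intended semantics.

```c
#include <assert.h>

/* Hypothetical reconstruction: copy four elements, skip a[4], zero four more. */
void
test_gap (int *__restrict a, int *__restrict b)
{
    a[0] = b[0];
    a[1] = b[1];
    a[2] = b[2];
    a[3] = b[3];
    /* a[4] is deliberately left untouched. */
    a[5] = 0;
    a[6] = 0;
    a[7] = 0;
    a[8] = 0;
}

/* Self-check of the semantics above; returns 1 if every assertion holds. */
int
check_test_gap (void)
{
    int b[4] = { 1, 2, 3, 4 };
    int a[9] = { -1, -1, -1, -1, -1, -1, -1, -1, -1 };

    test_gap (a, b);

    for (int i = 0; i < 4; i++)
        assert (a[i] == b[i]);   /* copied from b */
    assert (a[4] == -1);         /* the gap keeps its old value */
    for (int i = 5; i < 9; i++)
        assert (a[i] == 0);      /* zeroed */
    return 1;
}
```

Note that the destination needs at least nine elements here, one more than in the original test case.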

The loop form (equivalent to the first example) is likewise vectorized:

void
test (int*__restrict a, int*__restrict b)
{
  for (int i = 0; i < 4; i++)
    a[i] = b[i];
  for (int i = 4; i < 8; i++)
    a[i] = 0;
}

producing:

test:
        movi    v0.4s, 0
        ldp     x2, x3, [x1]
        stp     x2, x3, [x0]
        str     q0, [x0, 16]
        ret
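
As the summary suggests, both the straight-line and the loop form amount to a four-element memcpy followed by a four-element memset. A minimal sketch of that equivalence follows; the names test_libc and check_test_libc are hypothetical, and nothing here claims how GCC compiles this version.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical reference version with the same effect as the test cases. */
void
test_libc (int *__restrict a, int *__restrict b)
{
    memcpy (a, b, 4 * sizeof (int));       /* a[0..3] = b[0..3] */
    memset (a + 4, 0, 4 * sizeof (int));   /* a[4..7] = 0 */
}

/* Self-check of the semantics above; returns 1 if every assertion holds. */
int
check_test_libc (void)
{
    int b[4] = { 1, 2, 3, 4 };
    int a[8] = { -1, -1, -1, -1, -1, -1, -1, -1 };

    test_libc (a, b);

    for (int i = 0; i < 4; i++)
        assert (a[i] == b[i]);   /* copied from b */
    for (int i = 4; i < 8; i++)
        assert (a[i] == 0);      /* zeroed */
    return 1;
}
```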
