https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117738
Bug ID: 117738
Summary: Failure to recognize dot-product pattern in inner loop
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: fxue at os dot amperecomputing.com
Target Milestone: ---

Take a two-level loop nest:

    void foo(int8_t *__restrict__ A, int8_t *__restrict__ B,
             int32_t *__restrict__ sum, int n, int m)
    {
      for (int i = 0; i < n; ++i) {
        int8_t a = A[i];

        for (int j = 0; j < m; j++) {
          int8_t b = B[T_FN(j) + i];

          sum[j] += a * b;
        }
      }
    }

Suppose T_FN() is some kind of pure mathematical function. Although gcc can vectorize the inner loop independently of the outer one when T_FN() has a simple form, the result is far from optimal. If we consider the loop nest as a whole and unroll the outer loop by an appropriate VF (for example, VF=8 for a 128-bit vectorization width), we can make the accumulate statement of the inner loop fit the more compact dot-product pattern (the leftover epilogue loop is omitted):

    for (int i = 0; i < n; i += 8) {
      <vector(8) int8_t> v_a = LOAD<vector(8) int8_t>(&A[i]);

      for (int j = 0; j < m; j++) {
        <vector(8) int8_t> v_b = LOAD<vector(8) int8_t>(&B[T_FN(j) + i]);

        sum[j] += DOT_PROD(v_a * v_b);
      }
    }
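
To make the intended equivalence concrete, here is a minimal plain-C sketch of the transformed loop nest next to the original one. It is only an illustration under stated assumptions, not a claim about how gcc would or should implement this: t_fn() with ROW_STRIDE is a hypothetical stand-in for T_FN() (the report does not define it), dot_prod8() is a hypothetical scalar model of what an 8-lane dot-product operation computes, and the names foo_scalar and foo_blocked are invented for the comparison. The block width of 8 follows the VF=8 example above, and the epilogue loop omitted above is written out.

```c
#include <stdint.h>

/* Hypothetical stand-in for T_FN(): a fixed row stride. Not from the report. */
#define ROW_STRIDE 64
static inline int t_fn(int j) { return j * ROW_STRIDE; }

/* Reference: the scalar loop nest from the report. */
void foo_scalar(const int8_t *restrict A, const int8_t *restrict B,
                int32_t *restrict sum, int n, int m)
{
    for (int i = 0; i < n; ++i) {
        int8_t a = A[i];
        for (int j = 0; j < m; j++) {
            int8_t b = B[t_fn(j) + i];
            sum[j] += a * b;
        }
    }
}

/* Hypothetical scalar model of an 8-lane int8 dot product that
   accumulates into a 32-bit result. */
static inline int32_t dot_prod8(const int8_t *a, const int8_t *b)
{
    int32_t acc = 0;
    for (int k = 0; k < 8; k++)
        acc += (int32_t)a[k] * (int32_t)b[k];
    return acc;
}

/* Outer loop processed in blocks of 8 iterations, so each inner-loop
   iteration performs one 8-lane dot product into sum[j]. */
void foo_blocked(const int8_t *restrict A, const int8_t *restrict B,
                 int32_t *restrict sum, int n, int m)
{
    int i = 0;

    /* Main loop: full blocks of 8 outer iterations. */
    for (; i + 8 <= n; i += 8) {
        for (int j = 0; j < m; j++)
            sum[j] += dot_prod8(&A[i], &B[t_fn(j) + i]);
    }

    /* Epilogue: remaining n % 8 outer iterations, in scalar form. */
    for (; i < n; ++i) {
        int8_t a = A[i];
        for (int j = 0; j < m; j++)
            sum[j] += a * B[t_fn(j) + i];
    }
}
```

Both functions should leave identical values in sum[] for the same inputs, since the blocked version only regroups the same additions (assuming no int32_t overflow). Note that the 8 elements loaded from B for a given j are contiguous only because i appears as a plain additive term in the index.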