https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117738

            Bug ID: 117738
           Summary: Failure to recognize dot-product pattern in inner loop
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: fxue at os dot amperecomputing.com
  Target Milestone: ---

Take a two-level loop-nest:

  void foo(int8_t *__restrict__ A, int8_t *__restrict__ B,
           int32_t *__restrict__ sum, int n, int m)
  {
    for (int i = 0; i < n; ++i) {
      int8_t a = A[i];

      for (int j = 0; j < m; j++) {
        int8_t b = B[T_FN(j) + i];

        sum[j] += a * b;
      }
    }
  }

Suppose T_FN() is some pure mathematical function of j. Although gcc can
vectorize the inner loop independently of the outer one when T_FN() has a
simple form, the result is far from optimal. If we consider the loop nest as a
whole and unroll the outer loop by an appropriate VF (for example, VF=8 for a
128-bit vectorization width), the accumulate statement of the inner loop fits
the more compact dot-product pattern, as follows (the leftover epilogue loop is
omitted):

    for (int i = 0; i < n; i += 8) {
      <vector(8) int8_t> v_a = LOAD<vector(8) int8_t>(&A[i]);

      for (int j = 0; j < m; j++) {
        <vector(8) int8_t> v_b = LOAD<vector(8) int8_t>(&B[T_FN(j) + i]); 

        sum[j] += DOT_PROD(v_a, v_b);
      }
    }
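
For illustration only, here is a scalar C sketch of the transformation described
above. It is not taken from the report and is not what gcc emits. T_FN() is
given a hypothetical definition (a fixed row stride, ROW_STRIDE), dot_prod8() is
a made-up scalar stand-in for an 8-lane int8 dot-product instruction, and the
names foo_ref/foo_dot are invented here. Unlike the pseudocode, the epilogue
loop for n not divisible by 8 is written out.

```c
#include <stdint.h>

/* ASSUMPTION: hypothetical stand-in for T_FN(); the report leaves it
   unspecified.  B is treated as rows of ROW_STRIDE bytes, so n must not
   exceed ROW_STRIDE. */
#define ROW_STRIDE 64
#define T_FN(j) ((j) * ROW_STRIDE)

/* The original loop nest from the report. */
void foo_ref(const int8_t *restrict A, const int8_t *restrict B,
             int32_t *restrict sum, int n, int m)
{
  for (int i = 0; i < n; ++i) {
    int8_t a = A[i];

    for (int j = 0; j < m; j++)
      sum[j] += a * B[T_FN(j) + i];
  }
}

/* Scalar model of DOT_PROD on two 8-element int8 vectors: widen each
   product to int32 and add the eight products together. */
static int32_t dot_prod8(const int8_t *a, const int8_t *b)
{
  int32_t acc = 0;

  for (int k = 0; k < 8; k++)
    acc += a[k] * b[k];
  return acc;
}

/* Outer loop unrolled by VF=8 so each inner iteration is one dot product. */
void foo_dot(const int8_t *restrict A, const int8_t *restrict B,
             int32_t *restrict sum, int n, int m)
{
  int i = 0;

  for (; i + 8 <= n; i += 8) {
    for (int j = 0; j < m; j++)
      sum[j] += dot_prod8(&A[i], &B[T_FN(j) + i]);
  }

  /* Epilogue: the remaining n % 8 outer iterations, in scalar form. */
  for (; i < n; ++i) {
    int8_t a = A[i];

    for (int j = 0; j < m; j++)
      sum[j] += a * B[T_FN(j) + i];
  }
}
```

Both functions should leave the same values in sum[], since the unrolled form
only regroups the int32 additions (assuming no overflow).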