https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79245
            Bug ID: 79245
           Summary: [7 Regression] Inefficient loop distribution to memcpy
           Product: gcc
           Version: 7.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jgreenhalgh at gcc dot gnu.org
  Target Milestone: ---

Created attachment 40591
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=40591&action=edit
Code generation for -Ofast -fomit-frame-pointer -mcpu=cortex-a72+crypto
-fno-tree-loop-distribute-patterns

After r242038, loop distribution starts splitting the inner loop in this
testcase and turning the copy into a call to memcpy:

  #define k 128000
  double a[k][k];
  double b[k][k];
  double c[k][k];

  int x;
  int y;

  void
  foo (void)
  {
    for (int j = 0; j < x; j++)
      {
        for (int i = 0; i < y; i++)
          {
            c[j][i] = b[j][i] - a[j][i];
            a[j][i] = b[j][i];
          }
      }
  }

The result (on AArch64) is that we first do a vectorized subtract loop, then a
call to memcpy. That's a suboptimal cache access pattern: as we walk the inner
loop we already have a[j][i], b[j][i] and c[j][i] in cache, and loop
distribution forces us to pull a and b in again for the memcpy.
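For illustration, here is a minimal hand-written sketch (not actual compiler
output) of the shape the distributed code takes, reusing the globals from the
testcase above; __builtin_memcpy stands in for whatever copy call GCC emits:

  void
  foo_distributed (void)
  {
    for (int j = 0; j < x; j++)
      {
        /* First pass over row j: the subtract, which gets vectorized.  */
        for (int i = 0; i < y; i++)
          c[j][i] = b[j][i] - a[j][i];

        /* Second pass over row j: the copy, recognized as a memcpy.  It has
           to touch a[j] and b[j] again after the first pass may already have
           pushed them out of cache.  */
        if (y > 0)
          __builtin_memcpy (a[j], b[j], sizeof (double) * (unsigned) y);
      }
  }

The original fused loop touches each cache line of a[j], b[j] and c[j] once
per iteration of the outer loop; the distributed form walks a[j] and b[j]
twice.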

Attached is the code generated for aarch64-none-elf at -Ofast
-fomit-frame-pointer -mcpu=cortex-a72+crypto, with
-ftree-loop-distribute-patterns enabled and disabled, which shows the issue.
