https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79245
            Bug ID: 79245
           Summary: [7 Regression] Inefficient loop distribution to memcpy
           Product: gcc
           Version: 7.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jgreenhalgh at gcc dot gnu.org
  Target Milestone: ---

Created attachment 40591
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=40591&action=edit
Code generation for -Ofast -fomit-frame-pointer -mcpu=cortex-a72+crypto
-fno-tree-loop-distribute-patterns

After r242038 we start to distribute the memcpy in the inner loop of this
testcase:

#define k 128000
double a[k][k];
double b[k][k];
double c[k][k];

int x;
int y;

void
foo (void)
{
  for (int j = 0; j < x; j++)
    {
      for (int i = 0; i < y; i++)
        {
          c[j][i] = b[j][i] - a[j][i];
          a[j][i] = b[j][i];
        }
    }
}

As a result, on AArch64 we first do a vectorized subtract, then a call to
memcpy. That's a suboptimal cache access pattern: we already have a[j][i],
b[j][i] and c[j][i] in cache as we walk the inner loop, and loop distribution
just causes us to pull them in again for the memcpy.

Attached are compilation runs on aarch64-none-elf with -Ofast
-fomit-frame-pointer -mcpu=cortex-a72+crypto, with
-ftree-loop-distribute-patterns enabled and disabled, which show the issue.
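For illustration only, a minimal sketch of the shape the distributed code
takes (an assumption about what -ftree-loop-distribute-patterns produces, not
GCC's actual output; the function name foo_distributed is hypothetical). The
inner loop is split into a subtraction loop plus a per-row memcpy, so each row
of a and b is walked twice:

#include <string.h>

#define k 128000
extern double a[k][k];
extern double b[k][k];
extern double c[k][k];
extern int x;
extern int y;

void
foo_distributed (void)
{
  for (int j = 0; j < x; j++)
    {
      /* First pass over the row: the vectorized subtract.  */
      for (int i = 0; i < y; i++)
        c[j][i] = b[j][i] - a[j][i];

      /* Second pass over the same row, recognized as a memcpy pattern
         and turned into a library call.  */
      if (y > 0)
        memcpy (&a[j][0], &b[j][0], (size_t) y * sizeof (double));
    }
}

The second pass over a[j] and b[j] is where the extra cache traffic comes
from compared with the original fused loop.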