https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110456
Bug ID:           110456
Summary:          vectorization with loop masking prone to STLF issues
Product:          gcc
Version:          14.0
Status:           UNCONFIRMED
Severity:         normal
Priority:         P3
Component:        target
Assignee:         unassigned at gcc dot gnu.org
Reporter:         rguenth at gcc dot gnu.org
Target Milestone: ---

#include <stdlib.h>

void __attribute__((noipa))
test (double * __restrict a, double *b, int n, int m)
{
  for (int j = 0; j < m; ++j)
    for (int i = 0; i < n; ++i)
      a[i + j*n] = a[i + j*n /* + 512 */] + b[i + j*n];
}

double a[1024];
double b[1024];

int main(int argc, char **argv)
{
  int m = atoi (argv[1]);
  for (long i = 0; i < 1000000000; ++i)
    test (a + 4, b + 4, 4, m);
}

This shows that when we apply loop masking with --param vect-partial-vector-usage,
masked stores will generally prohibit store-to-load forwarding, especially when a
following load only partially overlaps the store, as when traversing a
multi-dimensional array as above.  The above runs noticeably slower than when the
loads are offset (uncomment the /* + 512 */).

The situation is difficult to avoid in general, but there might be easy heuristics
that could be implemented, such as avoiding loop masking when a loop contains a
read-modify-write operation to the same memory location (with or without an
immediately visible outer loop).  For unknown dependences, and thus runtime
disambiguation, a proper distance between any read and write operations could be
ensured as well.