https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110456

            Bug ID: 110456
           Summary: vectorization with loop masking prone to STLF issues
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: rguenth at gcc dot gnu.org
  Target Milestone: ---

void __attribute__((noipa))
test (double * __restrict a, double *b, int n, int m)
{
  for (int j = 0; j < m; ++j)
    for (int i = 0; i < n; ++i)
      a[i + j*n] = a[i + j*n /* + 512 */] + b[i + j*n];
}

double a[1024];
double b[1024]; 

#include <stdlib.h>

int main(int argc, char **argv)
{
  int m = atoi (argv[1]);
  for (long i = 0; i < 1000000000; ++i)
    test (a + 4, b + 4, 4, m);
}


Shows that when we apply loop masking with --param vect-partial-vector-usage,
masked stores will generally prevent store-to-load forwarding, especially
when there is only a partial overlap with a following load, as when
traversing a multi-dimensional array as above.  The above runs noticeably
slower than when the loads are offset from the stores
(uncomment the /* + 512 */).
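One plausible way to build and run the reproducer (the target flags are an
assumption; the report does not give the exact command, and partial-vector
masking on x86 requires an AVX-512-capable -march):

```shell
# stlf.c contains the reproducer above; the flags below are illustrative only
gcc -O3 -march=znver4 --param vect-partial-vector-usage=2 stlf.c -o stlf
time ./stlf 256   # compare against a build with the /* + 512 */ offset
```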

The situation is difficult to avoid in general, but there might be simple
heuristics worth implementing, such as avoiding loop masking when there is
a read-modify-write operation to the same memory location in a loop (with
or without an immediately visible outer loop).  For unknown dependences,
and thus runtime disambiguation, a proper distance between any read/write
pair could be ensured as well.
