[Bug fortran/116128] missed optimisation: fortran sum instrinsic performed in order

anlauf at gcc dot gnu.org via Gcc-bugs Mon, 26 Aug 2024 13:03:37 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116128


--- Comment #6 from anlauf at gcc dot gnu.org ---
Guided by the dump-tree for the inlining code, I played a little to see
what kind of code the middle-end expects.  To this end I used C code.

The reference for the sum over a rank-1 array (given stride and count):

double __attribute__ ((noipa,noinline))
sum_v1 (double x[], int stride, int n)
{
  double sum = 0.;

  for (int i = 0; i < n; i++)
    sum += x[i*stride];

  return sum;
}


An optimizer might see a pattern in the array access.  I guess that for
stride=1 it generates a separate path like:

double __attribute__ ((noipa,noinline))
sum_v2 (double x[], int stride, int n)
{
  double sum = 0.;

  if (stride == 1)
    {
      for (int i = 0; i < n; i++)
        sum += x[i];
    }
  else
    {
      for (int i = 0; i < n; i++)
        sum += x[i*stride];
    }

  return sum;
}

Unit stride should give faster loads (it does on vector machines).
I wonder if gcc does that rather obvious transformation if stride can
be 1 and the loop is "hot".

Another optimization could be partial sums with reordering - which is what
you mention - but only if associative math is enabled.  Code might look like:

double __attribute__ ((noipa,noinline))
sum_v4 (double x[], int stride, int n)
{
  double sum;
  double tmp[4];
  int i, off, n4;

  if (n <= 0)
    return 0.;

  for (int i = 0; i < 4; i++)
    tmp[i] = 0.;

  n4 = n & ~0x3;
  off = 0;
  for (i = 0; i < n4; i += 4)
    {
      tmp[0] += x[off];
      tmp[1] += x[off+1*stride];
      tmp[2] += x[off+2*stride];
      tmp[3] += x[off+3*stride];
      off += 4*stride;
    }
  sum = (tmp[0] + tmp[1]) + (tmp[2] + tmp[3]);

  for (; i < n; i++)
    sum += x[off + (i-n4)*stride];

  return sum;
}

While this may look natural when memory bandwidth is "infinite" and SIMD
performance is limited, I do not know what the situation is for common
scalar architectures.

Ideally, if annotation of a reduction loop (e.g. "associative math") could
tell the middle-end that any of the above transformations is allowed, the
generation of inline code could be rather target-agnostic.
Doing code transforms as above when generating inline code in the frontend
could end up in a nightmare.

[Bug fortran/116128] missed optimisation: fortran sum instrinsic performed in order

Reply via email to