https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110279

--- Comment #1 from Di Zhao <dizhao at os dot amperecomputing.com> ---
Here's a small example for the issue exposed in 508.namd_r:

  #define LOOP_COUNT 800000000
  typedef double data_e;
  #include <stdio.h>

  __attribute_noinline__ data_e
  foo (data_e a, data_e b, data_e c, data_e d, data_e x, data_e y)
  {
    data_e tmp1, tmp2;
    data_e result = 0;

    for (int ic = 0; ic < LOOP_COUNT; ic++)
      {
        /* LHS is an operand of another FMA; rewriting to parallel is worse.  */
        tmp1 = a + c * c - d * d + x * y;

        tmp2 = x * tmp1;
        result += (a + c - d + tmp2);

        a -= 0.1;
        b += 0.9;
        c *= 1.02;
        x *= 0.1;
        y *= y;
        d *= 0.61;
      }

    return result;
  }

  int
  main (int argc, char **argv)
  {
    printf ("%f\n", foo (-1.0, 0.01, 9.8, 1e2, -1.9, 0.2));
  }

On the following platforms, rewriting both op lists is slower than not
rewriting at all, or than rewriting only "result" (compiled with "-Ofast
--param tree-reassoc-width=4 -march=native"):

Run time      no        rewrite    rewrite   rewrite
              rewrite   "result"   "tmp1"    both
----------------------------------------------------
Ampere1       1.80      1.93       2.04      2.10
Neoverse-n1   1.36      1.45       1.49      1.52
Intel Xeon    1.57      1.55       1.66      1.62
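
For readers less familiar with the two forms being compared, here is a rough
sketch of the evaluation shapes for "tmp1". It is my own illustration, not the
GIMPLE that reassoc actually emits: the function names are made up, and it
assumes each "t + u * v" step is contracted into a single FMA by the back end.

```c
/* Chained form: each step consumes the previous result, so the three
   multiply-adds form one dependency chain (assumed to become three FMAs).  */
double
tmp1_chain (double a, double c, double d, double x, double y)
{
  double t = a + c * c;   /* step 1 */
  t = t - d * d;          /* step 2, waits on step 1 */
  t = t + x * y;          /* step 3, waits on step 2 */
  return t;
}

/* Parallel form: two independent partial sums joined by a final add.
   The chain is shorter, but the final operation is a plain add and one
   of the multiplies may no longer fuse.  */
double
tmp1_parallel (double a, double c, double d, double x, double y)
{
  double p = a + c * c;       /* independent of q */
  double q = x * y - d * d;   /* independent of p */
  return p + q;
}
```

Both functions compute the same value when no rounding occurs (for example
a=1, c=2, d=3, x=4, y=0.5 gives -2 in either form); they differ only in
dependency structure and in how many operations can be fused.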
