https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110279
--- Comment #1 from Di Zhao <dizhao at os dot amperecomputing.com> ---
Here's a small example for the issue exposed in 508.namd_r:

#define LOOP_COUNT 800000000
typedef double data_e;

#include <stdio.h>

__attribute_noinline__ data_e
foo (data_e a, data_e b, data_e c, data_e d, data_e x, data_e y)
{
  data_e tmp1, tmp2;
  data_e result = 0;

  for (int ic = 0; ic < LOOP_COUNT; ic++)
    {
      /* LHS is operator of another FMA, re-writing to parallel is worse.  */
      tmp1 = a + c * c - d * d + x * y;
      tmp2 = x * tmp1;
      result += (a + c - d + tmp2);

      a -= 0.1;
      b += 0.9;
      c *= 1.02;
      x *= 0.1;
      y *= y;
      d *= 0.61;
    }

  return result;
}

int
main (int argc, char **argv)
{
  printf ("%f\n", foo (-1.0, 0.01, 9.8, 1e2, -1.9, 0.2));
}

On the following platforms, rewriting both op lists is slower than not
rewriting at all or rewriting only "result" (the compile options I used
are "-Ofast --param tree-reassoc-width=4 -march=native"):

run            no       rewrite   rewrite  rewrite
time           rewrite  "result"  "tmp1"   both
--------------------------------------------------
Ampere1        1.80     1.93      2.04     2.10
Neoverse-n1    1.36     1.45      1.49     1.52
Intel Xeon     1.57     1.55      1.66     1.62