I forgot to mention that I have a variant of fatigue
in which I have done the inlining manually, along with a few other
optimizations. Its timings are:
[macbook] lin/test% gfc -Ofast fatigue_v8.f90
[macbook] lin/test% time a.out > /dev/null
2.793u 0.002s 0:02.79 100.0% 0+0k 0+1io 0pf+0w
[macbook] lin/test% gfc -Ofast -fwhole-program fatigue_v8.f90
[macbook] lin/test% time a.out > /dev/null
2.680u 0.003s 0:02.68 100.0% 0+0k 0+2io 0pf+0w
[macbook] lin/test% gfc -Ofast -fwhole-program -flto fatigue_v8.f90
[macbook] lin/test% time a.out > /dev/null
2.671u 0.002s 0:02.67 100.0% 0+0k 0+2io 0pf+0w
[macbook] lin/test% gfc -Ofast -fwhole-program -fstack-arrays fatigue_v8.f90
[macbook] lin/test% time a.out > /dev/null
2.680u 0.003s 0:02.68 100.0% 0+0k 0+2io 0pf+0w
[macbook] lin/test% gfc -Ofast -fwhole-program -flto -fstack-arrays fatigue_v8.f90
[macbook] lin/test% time a.out > /dev/null
2.677u 0.003s 0:02.68 99.6% 0+0k 0+0io 0pf+0w
So the timing of the original code with
-Ofast -finline-limit=600 -fwhole-program -flto -fstack-arrays
is quite close to this lower bound.
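
To put the numbers above in perspective, here is a small Python sketch (not part of the measurements themselves) that computes the gain of each flag combination relative to the plain -Ofast build. The user times are copied from the runs above; the names `timings` and `relative_gain` are only illustrative, and single runs like these say nothing about run-to-run noise.

```python
# User CPU times in seconds, copied from the "time a.out" lines above.
timings = {
    "-Ofast": 2.793,
    "-Ofast -fwhole-program": 2.680,
    "-Ofast -fwhole-program -flto": 2.671,
    "-Ofast -fwhole-program -fstack-arrays": 2.680,
    "-Ofast -fwhole-program -flto -fstack-arrays": 2.677,
}


def relative_gain(baseline, t):
    """Percentage of time saved with respect to the baseline run."""
    return 100.0 * (baseline - t) / baseline


baseline = timings["-Ofast"]
gains = {flags: relative_gain(baseline, t) for flags, t in timings.items()}

for flags, gain in gains.items():
    print(f"{flags:45s} {timings[flags]:.3f}s  {gain:5.2f}%")
```

All the combinations with -fwhole-program land within a few hundredths of a second of each other, so for this hand-inlined variant the extra options make little difference.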
I have also looked at the failure for gfortran.dg/elemental_dependency_1.f90.
It seems to be due to a spurious second integer(kind=4) A.37[4]; (and friends) in
integer(kind=8) D.1674;
integer(kind=4) A.37[4];
struct array1_integer(kind=4) atmp.36;
void * D.1669;
integer(kind=8) D.1668;
struct array1_integer(kind=4) parm.35;
parm.35.dtype = 265;
parm.35.dim[0].lbound = 1;
parm.35.dim[0].ubound = 4;
parm.35.dim[0].stride = 1;
parm.35.data = (void *) &a[1];
parm.35.offset = -2;
atmp.36.dtype = 265;
atmp.36.dim[0].stride = 1;
atmp.36.dim[0].lbound = 0;
atmp.36.dim[0].ubound = 3;
integer(kind=4) A.37[4];
atmp.36.data = (void * restrict) &A.37;
compared to
integer(kind=8) D.1658;
integer(kind=4) A.37[4];
struct array1_integer(kind=4) atmp.36;
void * D.1653;
integer(kind=8) D.1652;
struct array1_integer(kind=4) parm.35;
parm.35.dtype = 265;
parm.35.dim[0].lbound = 1;
parm.35.dim[0].ubound = 4;
parm.35.dim[0].stride = 1;
parm.35.data = (void *) &a[1];
parm.35.offset = -2;
atmp.36.dtype = 265;
atmp.36.dim[0].stride = 1;
atmp.36.dim[0].lbound = 0;
atmp.36.dim[0].ubound = 3;
atmp.36.data = (void * restrict) &A.37;
Note that this is without the -fstack-arrays option.
Same thing for gfortran.dg/vector_subscript_4.f90.
Dominique