I forgot to mention that I have a variant of fatigue in which I have done the inlining manually, along with a few other optimizations. The timings for it are:

[macbook] lin/test% gfc -Ofast fatigue_v8.f90
[macbook] lin/test% time a.out > /dev/null
2.793u 0.002s 0:02.79 100.0% 0+0k 0+1io 0pf+0w
[macbook] lin/test% gfc -Ofast -fwhole-program fatigue_v8.f90
[macbook] lin/test% time a.out > /dev/null
2.680u 0.003s 0:02.68 100.0% 0+0k 0+2io 0pf+0w
[macbook] lin/test% gfc -Ofast -fwhole-program -flto fatigue_v8.f90
[macbook] lin/test% time a.out > /dev/null
2.671u 0.002s 0:02.67 100.0% 0+0k 0+2io 0pf+0w
[macbook] lin/test% gfc -Ofast -fwhole-program -fstack-arrays fatigue_v8.f90
[macbook] lin/test% time a.out > /dev/null
2.680u 0.003s 0:02.68 100.0% 0+0k 0+2io 0pf+0w
[macbook] lin/test% gfc -Ofast -fwhole-program -flto -fstack-arrays fatigue_v8.f90
[macbook] lin/test% time a.out > /dev/null
2.677u 0.003s 0:02.68 99.6% 0+0k 0+0io 0pf+0w

So the timing of the original code with -Ofast -finline-limit=600 -fwhole-program -flto -fstack-arrays is quite close to this lower bound.

I have also looked at the failure for gfortran.dg/elemental_dependency_1.f90, and it seems to be due to a spurious

  integer(kind=4) A.37[4];

(and friends) in

  integer(kind=8) D.1674;
  integer(kind=4) A.37[4];
  struct array1_integer(kind=4) atmp.36;
  void * D.1669;
  integer(kind=8) D.1668;
  struct array1_integer(kind=4) parm.35;

  parm.35.dtype = 265;
  parm.35.dim[0].lbound = 1;
  parm.35.dim[0].ubound = 4;
  parm.35.dim[0].stride = 1;
  parm.35.data = (void *) &a[1];
  parm.35.offset = -2;
  atmp.36.dtype = 265;
  atmp.36.dim[0].stride = 1;
  atmp.36.dim[0].lbound = 0;
  atmp.36.dim[0].ubound = 3;
  integer(kind=4) A.37[4];
  atmp.36.data = (void * restrict) &A.37;

compared to

  integer(kind=8) D.1658;
  integer(kind=4) A.37[4];
  struct array1_integer(kind=4) atmp.36;
  void * D.1653;
  integer(kind=8) D.1652;
  struct array1_integer(kind=4) parm.35;

  parm.35.dtype = 265;
  parm.35.dim[0].lbound = 1;
  parm.35.dim[0].ubound = 4;
  parm.35.dim[0].stride = 1;
  parm.35.data = (void *) &a[1];
  parm.35.offset = -2;
  atmp.36.dtype = 265;
  atmp.36.dim[0].stride = 1;
  atmp.36.dim[0].lbound = 0;
  atmp.36.dim[0].ubound = 3;
  atmp.36.data = (void * restrict) &A.37;

Note that this is without the -fstack-arrays option. The same thing happens for gfortran.dg/vector_subscript_4.f90.

Dominique