https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79930
--- Comment #12 from Adam Hirst <adam at aphirst dot karoo.co.uk> ---
Created attachment 40940
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=40940&action=edit
call graph of my "real" application

Thanks Thomas,

My "real" application is of course not using random numbers for the NU and NV,
but I will bear in mind the point about generating large chunks for the future.

I noticed too that with enough optimisation flags the reported execution time
dropped to 0 seconds, presumably because the compiler eliminated the whole
timed computation as dead code. I worked around it by writing all the results
into an array, evaluating the second "timing" variable, and only then asking
for user input to specify which result(s) to print, so the work can't be
optimised away (a sketch of that pattern is at the end of this comment).

In my "real" application, the tensor P (or D, whatever I'm calling it this
week) is a 4x4 segment of a larger 'array' of Type(Vector), whose elements
keep varying (they're the control points of a B-Spline surface, and I'm
more or less doing shape optimisation on that surface).

The whole reason I was looking into this in the first place is that gprof
(along with useful plots by gprof2dot, one of which is attached) consistently
shows that this TensorProduct routine dominates the runtime BY FAR. So my
options are either (i) make it faster, or (ii) call it less often (which is
more a matter of algorithm design, and is a TODO for later investigation).

In any case, switching my TensorProduct routine to the variant where the
matmul() and dot_product() are computed separately (though with no further
array temporaries; see one of my earlier comments in this thread) yielded the
best speed-up in my "real" application: not as drastic as in the reduced test
case, but still much more than a factor of two faster, whether building with
-O2 or -Ofast -flto. A sketch of that split is below.
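Roughly, the split form looks like this. This is only a minimal sketch: the
Vector declaration, the dp kind, and the component names x/y/z are assumed
here for illustration, not necessarily the exact declarations from my earlier
comment.

  module tensor_sketch
    implicit none
    integer, parameter :: dp = kind(1.0d0)

    type :: Vector
      real(dp) :: x, y, z
    end type Vector

  contains

    ! Split form: evaluate matmul() into an explicit local first, then
    ! take the dot_product(), rather than nesting the two intrinsics in
    ! one expression and leaving the temporary to the compiler.
    pure function TensorProduct(D, NU, NV) result(res)
      type(Vector), intent(in) :: D(4,4)
      real(dp),     intent(in) :: NU(4), NV(4)
      type(Vector)             :: res
      real(dp)                 :: t(4)

      t = matmul(D%x, NV)            ! 4x4 matrix times length-4 vector
      res%x = dot_product(NU, t)
      t = matmul(D%y, NV)
      res%y = dot_product(NU, t)
      t = matmul(D%z, NV)
      res%z = dot_product(NU, t)
    end function TensorProduct

  end module tensor_sketch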
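And for completeness, the timing workaround mentioned above as a
self-contained sketch (the loop body is just a stand-in, not my real kernel):
storing every result and making the printed output depend on run-time input
keeps the optimiser from discarding the timed work.

  program bench_sketch
    implicit none
    integer, parameter :: dp = kind(1.0d0)
    integer, parameter :: n = 100000
    real(dp) :: results(n), t0, t1
    integer  :: i, idx

    call cpu_time(t0)
    do i = 1, n
      results(i) = sqrt(real(i, dp))  ! stand-in for the real computation
    end do
    call cpu_time(t1)                 ! second "timing" variable, evaluated
                                      ! before any result is consumed
    print *, 'elapsed (s):', t1 - t0

    print *, 'which result should I print?'
    read  *, idx                      ! run-time input: the compiler cannot
    print *, results(idx)             ! know which element will be needed
  end program bench_sketch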