https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69564
--- Comment #29 from Jeffrey A. Law <law at redhat dot com> ---
So to bring this BZ back to the core questions (the scope seems to have widened through the years since this was originally reported): namely, is the use of LTO or C++ making things slower, particularly for scimark's LU factorization test?

From my experiments, the answer is a very clear yes.

I hacked up the test a bit to only run LU and to run a fixed number of iterations. That makes comparisons with something like callgrind much easier.

Use of C++ adds 2-3% in terms of instruction counts. LTO adds an additional 2-3% to the instruction counts. These are additive: C++ with LTO is about 5% higher than C without LTO.

The time (not surprisingly) is lost in LU_factor; the main culprit seems to be this pair of nested loops:

      int ii;
      for (ii = j + 1; ii < M; ii++)
        {
          double *Aii = A[ii];
          double *Aj = A[j];
          double AiiJ = Aii[j];    /* Here */
          int jj;
          for (jj = j + 1; jj < N; jj++)
            Aii[jj] -= AiiJ * Aj[jj];
        }

Callgrind calls out the marked line, which in reality probably means the preheader for the inner loop. For C w/o LTO it's ~12 million instructions. For C++ with LTO it's ~21 million instructions (remember, I'm just running LU and for a relatively small number of iterations).

It's a bit of a surprise as these loops are dead simple, but it appears we've got to be doing something dumb somewhere. Hopefully that narrows things down a bit.
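
For anyone who wants to reproduce the comparison without pulling apart the scimark harness, below is a minimal standalone sketch of the kind of fixed-work driver described above. It is not the actual hacked-up scimark source; the matrix dimensions, iteration count, and initialization values are placeholder assumptions. The only goal is to give callgrind a deterministic amount of work around the quoted elimination kernel, so the same file can be built as C or C++, with and without -flto, and the resulting instruction counts compared directly.

      /* Standalone driver sketch isolating the LU_factor elimination kernel.
         M, N and ITERS are placeholder values, not the scimark settings.
         The malloc casts are only there so this file also compiles as C++.  */
      #include <stdio.h>
      #include <stdlib.h>

      #define M 100
      #define N 100
      #define ITERS 200

      int
      main (void)
      {
        /* Row-pointer matrix matching the A[ii][jj] access pattern above.  */
        double **A = (double **) malloc (M * sizeof (double *));
        int i;
        for (i = 0; i < M; i++)
          A[i] = (double *) malloc (N * sizeof (double));

        int iter;
        for (iter = 0; iter < ITERS; iter++)
          {
            /* Reset the matrix so every iteration does identical work and
               the values stay small and finite.  */
            for (i = 0; i < M; i++)
              {
                int k;
                for (k = 0; k < N; k++)
                  A[i][k] = (i * N + k + 1) / (1000.0 * M * N);
              }

            /* Sweep the elimination over each column; the inner pair of
               loops is the one callgrind flags inside LU_factor.  */
            int j;
            for (j = 0; j < M - 1; j++)
              {
                int ii;
                for (ii = j + 1; ii < M; ii++)
                  {
                    double *Aii = A[ii];
                    double *Aj = A[j];
                    double AiiJ = Aii[j];    /* Here */
                    int jj;
                    for (jj = j + 1; jj < N; jj++)
                      Aii[jj] -= AiiJ * Aj[jj];
                  }
              }
          }

        /* Print a checksum so the loops cannot be optimized away.  */
        double sum = 0.0;
        for (i = 0; i < M; i++)
          sum += A[i][N - 1];
        printf ("checksum: %g\n", sum);

        for (i = 0; i < M; i++)
          free (A[i]);
        free (A);
        return 0;
      }

Building it four ways (gcc vs. g++, each with and without -flto, same -O options) and running each binary under valgrind --tool=callgrind gives per-line instruction counts that can be compared the same way as the numbers above.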