https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83017
--- Comment #13 from Richard Biener <rguenth at gcc dot gnu.org> ---
Ok, so we do slightly better for the runtime test than for the static test:

  if (loop->inner)
    m_p_thread=2;
  else
    m_p_thread=MIN_PER_THREAD;

so with 2 threads we should have exactly 2 iterations but ... the runtime
check uses the number of latch executions, which is 3, and thus arrives at
1 iteration per thread.  Fixing this off-by-one gets us

> /usr/bin/time ./a.out
 PI   2.98876095
 PI   3.14159274
4.02user 0.00system 0:04.02elapsed 99%CPU (0avgtext+0avgdata 2460maxresident)k
0inputs+0outputs (0major+102minor)pagefaults 0swaps

vs.

> /usr/bin/time ./a.out
 PI   8.59536934
 PI   3.14159274
10.90user 0.00system 0:05.54elapsed 196%CPU (0avgtext+0avgdata 2840maxresident)k
0inputs+0outputs (0major+126minor)pagefaults 0swaps

I guess the different computation outcome means we're doing something wrong
somewhere.  Also, at least on my machine, the result isn't any faster (when
parallelizing the outer loop).  As usual, auto-parallelization may harm
followup transforms.
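
The off-by-one described above can be illustrated with a small standalone sketch. This is not the code from GCC's parallelizer. The function names are hypothetical, and it assumes the loop body runs one more time than the latch (back edge) is executed, so 3 latch executions correspond to 4 iterations. It also reads "2 iterations" in the comment as 2 iterations per thread, which is an interpretation.

```c
#include <assert.h>

/* Hypothetical helper, not a GCC function: how many iterations each
   thread receives when ITERATIONS are split evenly across N_THREADS.  */
unsigned long
iters_per_thread (unsigned long iterations, unsigned long n_threads)
{
  assert (n_threads > 0);
  return iterations / n_threads;
}

/* Per-thread share when the latch execution count is mistaken for the
   iteration count (the off-by-one the comment describes).  */
unsigned long
share_from_latch_count (unsigned long latch_execs, unsigned long n_threads)
{
  return iters_per_thread (latch_execs, n_threads);
}

/* Per-thread share under the assumption stated above: the body runs
   latch_execs + 1 times.  */
unsigned long
share_from_iteration_count (unsigned long latch_execs,
                            unsigned long n_threads)
{
  return iters_per_thread (latch_execs + 1, n_threads);
}
```

With the numbers from the comment (3 latch executions, 2 threads), share_from_latch_count gives 1 iteration per thread, while share_from_iteration_count gives 2, which matches the m_p_thread=2 minimum used for loops that have an inner loop.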