https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604
--- Comment #20 from Alexander Nesterovskiy <alexander.nesterovskiy at intel dot com> ---
I've made test runs on Broadwell and Skylake, RHEL 7.3.
410.bwaves became faster after r256990 but not as fast as it was on r253678.

Comparing 410.bwaves performance,
"-Ofast -funroll-loops -flto -ftree-parallelize-loops=4":

rev        perf. relative to r253678, %
r253678    100%
r253679    54%
...
r256989    54%
r256990    71%

CPU time distribution became more flat (~34% thread0, ~22% - threads 1-3), but
a lot of time is spent spinning in
libgomp.so.1.0.0/gomp_barrier_wait_end -> do_wait -> do_spin
and
libgomp.so.1.0.0/gomp_team_barrier_wait_end -> do_wait -> do_spin

r253678 spin time is ~10% of CPU time
r256990 spin time is ~30% of CPU time