https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604
--- Comment #21 from amker at gcc dot gnu.org ---
(In reply to Alexander Nesterovskiy from comment #20)
> I've made test runs on Broadwell and Skylake, RHEL 7.3.
> 410.bwaves became faster after r256990 but not as fast as it was on r253678.
> Comparing 410.bwaves performance, "-Ofast -funroll-loops -flto
> -ftree-parallelize-loops=4":
>
> rev        perf. relative to r253678, %
> r253678    100%
> r253679    54%
> ...
> r256989    54%
> r256990    71%
>
> CPU time distribution became more flat (~34% thread0, ~22% - threads1-3),
> but a lot of time is spent spinning in
> libgomp.so.1.0.0/gomp_barrier_wait_end -> do_wait -> do_spin
> and
> libgomp.so.1.0.0/gomp_team_barrier_wait_end -> do_wait -> do_spin
> r253678 spin time is ~10% of CPU time
> r256990 spin time is ~30% of CPU time

I don't know gomp.  Does this mean we spend more time synchronizing threads now?

Thanks.