https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604

--- Comment #20 from Alexander Nesterovskiy <alexander.nesterovskiy at intel dot com> ---
I've made test runs on Broadwell and Skylake, RHEL 7.3.
410.bwaves became faster after r256990 but not as fast as it was on r253678. 
Comparison of 410.bwaves performance, built with "-Ofast -funroll-loops -flto
-ftree-parallelize-loops=4":

rev       perf. relative to r253678
r253678   100%
r253679   54%
...
r256989   54%
r256990   71%
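
To put these percentages in terms of run time, here is a minimal sketch. It assumes the performance figure is inversely proportional to run time (the comment does not say which metric was used), and the helper names runtime_factor and recovered_fraction are illustrative only.

```python
# Relative performance from the table above, in percent of r253678.
relative_perf = {"r253678": 100, "r253679": 54, "r256989": 54, "r256990": 71}


def runtime_factor(perf_percent):
    """Run time relative to the baseline.

    Assumption: performance is inversely proportional to run time.
    """
    return 100.0 / perf_percent


def recovered_fraction(before, after, baseline=100):
    """Share of the lost performance that was won back between two revisions."""
    return (after - before) / (baseline - before)


for rev, perf in relative_perf.items():
    print(f"{rev}: {perf}% perf, {runtime_factor(perf):.2f}x baseline run time")

# Share of the r253679 regression that r256990 recovers.
print(f"recovered: {recovered_fraction(54, 71):.0%}")
```

Under that assumption, r253679 runs about 1.85x as long as r253678 and r256990 about 1.41x, so r256990 recovers roughly 37% of the lost performance.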

The CPU time distribution became flatter (~34% on thread 0, ~22% on each of
threads 1-3), but a lot of time is spent spinning in
libgomp.so.1.0.0/gomp_barrier_wait_end -> do_wait -> do_spin
and
libgomp.so.1.0.0/gomp_team_barrier_wait_end -> do_wait -> do_spin 
r253678 spin time is ~10% of CPU time 
r256990 spin time is ~30% of CPU time
