https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119588
Bug ID: 119588
Summary: Possible improvement in locking strategies for libgomp
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: libgomp
Assignee: unassigned at gcc dot gnu.org
Reporter: matmal01 at gcc dot gnu.org
CC: jakub at gcc dot gnu.org
Target Milestone: ---

Created attachment 60960
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=60960&action=edit
Demonstrating locking differences

Hello,

In summary: I'm proposing that libgomp implement the "hypercube-embedded tree" locking strategy that LLVM's libomp uses by default. I would appreciate feedback on whether this would be welcome and/or feasible. The observations that led me to this suggestion are below. Apologies for the delay between asking on IRC and raising this PR.

------------------------------

We've seen on some internal workloads (the NVPL BLAS GEMM routine running on a small matrix) that the overhead of a `#pragma omp parallel` statement at high thread counts (72 or 144) is much higher with libgomp than with LLVM's libomp. In a program that combines work that can be handled with high parallelism (so OpenMP runs with many threads) with a large number of small pieces of work that must be performed with low overhead, this has been seen to accumulate into a significant cost.

I'm attaching a benchmark that measures just the creation of a `#pragma omp parallel` region (wrapped around an `asm` statement so the region doesn't get optimised away). It shows that with many threads libgomp scales worse than LLVM's libomp.
When compiled with the below:

#+begin_example
vshcmd: > ${gcc_install_path}/bin/g++ -O3 -fopenmp OpenMP-reproducer.cpp -o bench.gcc.x
vshcmd: > ${clang_install_path}/bin/clang++ -O3 -fopenmp OpenMP-reproducer.cpp -o bench.clang.x
lego-c2-qs-56:openmp-parallel-gomp-slow [04:02:41] $
lego-c2-qs-56:openmp-parallel-gomp-slow [04:02:44] $
#+end_example

The numbers I've observed show that at 144 threads the cost of just the barrier is much higher with GNU than with LLVM (N.b. this is on an AArch64 machine with 144 cores):

#+begin_example
vshcmd: > bench_gcc () {
vshcmd: >     LD_LIBRARY_PATH=${gcc_install_path}/lib64 ./bench.gcc.x
vshcmd: > }
vshcmd: > bench_clang () {
vshcmd: >     LD_LIBRARY_PATH=${clang_install_path}/lib ./bench.clang.x
vshcmd: > }
vshcmd: > three_times () {
vshcmd: >     for i in 1 2 3; do
vshcmd: >         $1
vshcmd: >     done
vshcmd: > }
vshcmd: > high_thread_counts () {
vshcmd: >     for num_threads in 72 144; do
vshcmd: >         export OMP_NUM_THREADS=$num_threads
vshcmd: >         echo " NUM = $num_threads"
vshcmd: >         OMP_PROC_BIND=true OMP_WAIT_POLICY=active three_times $1
vshcmd: >     done
vshcmd: > }
> > lego-c2-qs-56:openmp-parallel-gomp-slow [04:37:02] $
> > lego-c2-qs-56:openmp-parallel-gomp-slow [04:37:02] $
> > > > lego-c2-qs-56:openmp-parallel-gomp-slow [04:37:02] $
> > > > > > lego-c2-qs-56:openmp-parallel-gomp-slow [04:37:02] $
vshcmd: > # Without any specification of locking mechanisms, clang gets approx. thrice the performance of GCC.
vshcmd: > high_thread_counts bench_gcc
 NUM = 72
creation maxthr:72 nthr:72 min_time:10.694 us max_time:11.181 us avg_time:10.839 us stddev:23.127 us
creation maxthr:72 nthr:72 min_time:10.214 us max_time:10.567 us avg_time:10.335 us stddev:11.986 us
creation maxthr:72 nthr:72 min_time:10.147 us max_time:10.615 us avg_time:10.357 us stddev:19.212 us
 NUM = 144
creation maxthr:144 nthr:144 min_time:31.421 us max_time:32.003 us avg_time:31.735 us stddev:31.332 us
creation maxthr:144 nthr:144 min_time:30.592 us max_time:31.953 us avg_time:31.352 us stddev:132.466 us
creation maxthr:144 nthr:144 min_time:31.089 us max_time:31.953 us avg_time:31.640 us stddev:60.002 us
lego-c2-qs-56:openmp-parallel-gomp-slow [04:37:05] $
vshcmd: > high_thread_counts bench_clang
 NUM = 72
creation maxthr:72 nthr:72 min_time:8.574 us max_time:9.006 us avg_time:8.877 us stddev:17.170 us
creation maxthr:72 nthr:72 min_time:8.601 us max_time:8.749 us avg_time:8.686 us stddev:3.635 us
creation maxthr:72 nthr:72 min_time:8.206 us max_time:8.471 us avg_time:8.421 us stddev:6.070 us
 NUM = 144
creation maxthr:144 nthr:144 min_time:9.958 us max_time:11.293 us avg_time:10.388 us stddev:133.078 us
creation maxthr:144 nthr:144 min_time:9.685 us max_time:10.618 us avg_time:10.232 us stddev:83.710 us
creation maxthr:144 nthr:144 min_time:9.132 us max_time:9.783 us avg_time:9.434 us stddev:42.769 us
lego-c2-qs-56:openmp-parallel-gomp-slow [04:37:06] $
#+end_example

I believe the difference comes down to the locking algorithm used. LLVM's libomp has environment variables that adjust the locking strategy, and the default appears to be "hyper":

#+begin_example
vshcmd: > # Default for clang is hyper choices.
vshcmd: > OMP_DISPLAY_ENV=verbose \
vshcmd: > OMP_NUM_THREADS=2 \
vshcmd: > OMP_PROC_BIND=true OMP_WAIT_POLICY=active \
vshcmd: > bench_clang \
vshcmd: > 2>&1 | grep KMP_.*BARRIER_PATTERN
[host] KMP_FORKJOIN_BARRIER_PATTERN='hyper,hyper'
[host] KMP_PLAIN_BARRIER_PATTERN='hyper,hyper'
[host] KMP_REDUCTION_BARRIER_PATTERN='hyper,hyper'
lego-c2-qs-56:openmp-parallel-gomp-slow [04:35:51] $
#+end_example

Trying out the performance of these different locking types, it seems that:

1) The "linear" locking strategy is slightly worse than the current libgomp approach. From this, and my understanding that libgomp implements a straightforward locking strategy, I believe the majority of the difference between LLVM and GNU in this area is due to the locking strategy.

2) The "dist" strategy is best for this particular workload. The discussion around that locking strategy on the LLVM review (https://reviews.llvm.org/D103121) suggests that this is precisely the workload where it would excel, but also implies that it is less generally useful than "hyper".

#+begin_example
vshcmd: > # Trying out performance of different locking types.
vshcmd: > for_each_locktype () {
vshcmd: >     for locktype in linear tree hyper dist; do
vshcmd: >         echo "### ${locktype}"
vshcmd: >         KMP_PLAIN_BARRIER_PATTERN="${locktype},${locktype}" \
vshcmd: >         KMP_FORKJOIN_BARRIER_PATTERN="${locktype},${locktype}" \
vshcmd: >         $1
vshcmd: >     done
vshcmd: > }
> > > > > > > lego-c2-qs-56:openmp-parallel-gomp-slow [04:38:10] $
vshcmd: > for_each_locktype "high_thread_counts bench_clang"
### linear
 NUM = 72
creation maxthr:72 nthr:72 min_time:23.069 us max_time:24.048 us avg_time:23.381 us stddev:82.844 us
creation maxthr:72 nthr:72 min_time:21.945 us max_time:22.392 us avg_time:22.235 us stddev:19.136 us
creation maxthr:72 nthr:72 min_time:21.888 us max_time:22.827 us avg_time:22.458 us stddev:70.038 us
 NUM = 144
creation maxthr:144 nthr:144 min_time:42.401 us max_time:49.778 us avg_time:43.787 us stddev:4655.222 us
creation maxthr:144 nthr:144 min_time:41.662 us max_time:43.191 us avg_time:42.453 us stddev:219.167 us
creation maxthr:144 nthr:144 min_time:41.585 us max_time:42.580 us avg_time:42.035 us stddev:128.929 us
### tree
 NUM = 72
creation maxthr:72 nthr:72 min_time:8.336 us max_time:8.753 us avg_time:8.568 us stddev:15.185 us
creation maxthr:72 nthr:72 min_time:8.289 us max_time:8.503 us avg_time:8.366 us stddev:3.837 us
creation maxthr:72 nthr:72 min_time:7.934 us max_time:8.229 us avg_time:8.132 us stddev:10.432 us
 NUM = 144
creation maxthr:144 nthr:144 min_time:9.282 us max_time:10.082 us avg_time:9.637 us stddev:52.126 us
creation maxthr:144 nthr:144 min_time:9.471 us max_time:10.060 us avg_time:9.759 us stddev:50.012 us
creation maxthr:144 nthr:144 min_time:9.715 us max_time:10.063 us avg_time:9.880 us stddev:13.659 us
### hyper
 NUM = 72
creation maxthr:72 nthr:72 min_time:8.656 us max_time:9.096 us avg_time:8.797 us stddev:19.781 us
creation maxthr:72 nthr:72 min_time:8.528 us max_time:9.008 us avg_time:8.699 us stddev:29.078 us
creation maxthr:72 nthr:72 min_time:8.690 us max_time:8.951 us avg_time:8.821 us stddev:8.773 us
 NUM = 144
creation maxthr:144 nthr:144 min_time:10.216 us max_time:10.779 us avg_time:10.506 us stddev:39.825 us
creation maxthr:144 nthr:144 min_time:9.706 us max_time:10.624 us avg_time:10.244 us stddev:95.163 us
creation maxthr:144 nthr:144 min_time:10.115 us max_time:10.524 us avg_time:10.278 us stddev:23.869 us
### dist
 NUM = 72
creation maxthr:72 nthr:72 min_time:4.301 us max_time:4.551 us avg_time:4.361 us stddev:5.554 us
creation maxthr:72 nthr:72 min_time:4.359 us max_time:4.531 us avg_time:4.417 us stddev:2.983 us
creation maxthr:72 nthr:72 min_time:4.274 us max_time:4.307 us avg_time:4.291 us stddev:0.154 us
 NUM = 144
creation maxthr:144 nthr:144 min_time:4.543 us max_time:5.223 us avg_time:4.760 us stddev:68.559 us
creation maxthr:144 nthr:144 min_time:4.561 us max_time:5.136 us avg_time:4.756 us stddev:42.839 us
creation maxthr:144 nthr:144 min_time:4.501 us max_time:4.970 us avg_time:4.683 us stddev:23.683 us
lego-c2-qs-56:openmp-parallel-gomp-slow [04:38:31] $
#+end_example

I also found a paper online that includes performance comparisons of OpenMP directives. It likewise identifies the "hypercube-embedded tree" locking strategy as the reason for the different scaling behaviour. The paper is quite old, but I see something similar (though not as drastic) in my experiments. Figure (2) in the paper gives a graph of how the two barrier approaches scale with the number of threads (measured for `#pragma omp barrier`; Figure (5) compares the overhead of `#pragma omp parallel` against `#pragma omp barrier`). Figure (3) gives a diagram of what "hypercube-embedded tree" actually means.
https://eprints.whiterose.ac.uk/180101/6/HPCC2021.pdf