https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119588
Bug ID: 119588
Summary: Possible improvement in locking strategies for libgomp
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: libgomp
Assignee: unassigned at gcc dot gnu.org
Reporter: matmal01 at gcc dot gnu.org
CC: jakub at gcc dot gnu.org
Target Milestone: ---

Created attachment 60960
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=60960&action=edit
Demonstrating locking differences

Hello,

In summary: I'm proposing that libgomp implement the "hypercube-embedded tree" locking strategy that LLVM's libomp uses by default. I would appreciate feedback on whether this would be welcome and/or feasible. The observations that led me to this suggestion are below. Apologies for the delay between asking on IRC and raising this PR.

------------------------------

We've seen on some internal workloads (the NVPL BLAS GEMM routine running on a small matrix) that the overhead of a `#pragma omp parallel` statement at high thread counts (72 or 144) is much higher with libgomp than with LLVM's libomp. In a program that combines work that can be handled with high parallelism (so OpenMP runs with many threads) with a large number of small pieces of work that must be performed with low overhead, this has been seen to accumulate into a significant cost.

I'm attaching a benchmark that measures just the creation of a `#pragma omp parallel` region (wrapped around an `asm` statement so the region doesn't get optimised away). It shows that with many threads libgomp scales worse than LLVM's libomp.
When compiled with the below:

#+begin_example
vshcmd: > ${gcc_install_path}/bin/g++ -O3 -fopenmp OpenMP-reproducer.cpp -o bench.gcc.x
vshcmd: > ${clang_install_path}/bin/clang++ -O3 -fopenmp OpenMP-reproducer.cpp -o bench.clang.x
lego-c2-qs-56:openmp-parallel-gomp-slow [04:02:41] $
lego-c2-qs-56:openmp-parallel-gomp-slow [04:02:44] $
#+end_example

The numbers I've observed show that at 144 threads the cost of just the barrier is much higher with GNU than with LLVM (N.b. this is on an AArch64 machine with 144 cores):

#+begin_example
vshcmd: > bench_gcc () {
vshcmd: >     LD_LIBRARY_PATH=${gcc_install_path}/lib64 ./bench.gcc.x
vshcmd: > }
vshcmd: > bench_clang () {
vshcmd: >     LD_LIBRARY_PATH=${clang_install_path}/lib ./bench.clang.x
vshcmd: > }
vshcmd: > three_times () {
vshcmd: >     for i in 1 2 3; do
vshcmd: >         $1
vshcmd: >     done
vshcmd: > }
vshcmd: > high_thread_counts () {
vshcmd: >     for num_threads in 72 144; do
vshcmd: >         export OMP_NUM_THREADS=$num_threads
vshcmd: >         echo " NUM = $num_threads"
vshcmd: >         OMP_PROC_BIND=true OMP_WAIT_POLICY=active three_times $1
vshcmd: >     done
vshcmd: > }
> > lego-c2-qs-56:openmp-parallel-gomp-slow [04:37:02] $
> > lego-c2-qs-56:openmp-parallel-gomp-slow [04:37:02] $
> > > > lego-c2-qs-56:openmp-parallel-gomp-slow [04:37:02] $
> > > > > > lego-c2-qs-56:openmp-parallel-gomp-slow [04:37:02] $
vshcmd: > # Without any specification of locking mechanisms, clang gets approx. thrice the performance of GCC.
vshcmd: > high_thread_counts bench_gcc
 NUM = 72
creation maxthr:72 nthr:72 min_time:10.694 us max_time:11.181 us avg_time:10.839 us stddev:23.127 us
creation maxthr:72 nthr:72 min_time:10.214 us max_time:10.567 us avg_time:10.335 us stddev:11.986 us
creation maxthr:72 nthr:72 min_time:10.147 us max_time:10.615 us avg_time:10.357 us stddev:19.212 us
 NUM = 144
creation maxthr:144 nthr:144 min_time:31.421 us max_time:32.003 us avg_time:31.735 us stddev:31.332 us
creation maxthr:144 nthr:144 min_time:30.592 us max_time:31.953 us avg_time:31.352 us stddev:132.466 us
creation maxthr:144 nthr:144 min_time:31.089 us max_time:31.953 us avg_time:31.640 us stddev:60.002 us
lego-c2-qs-56:openmp-parallel-gomp-slow [04:37:05] $
vshcmd: > high_thread_counts bench_clang
 NUM = 72
creation maxthr:72 nthr:72 min_time:8.574 us max_time:9.006 us avg_time:8.877 us stddev:17.170 us
creation maxthr:72 nthr:72 min_time:8.601 us max_time:8.749 us avg_time:8.686 us stddev:3.635 us
creation maxthr:72 nthr:72 min_time:8.206 us max_time:8.471 us avg_time:8.421 us stddev:6.070 us
 NUM = 144
creation maxthr:144 nthr:144 min_time:9.958 us max_time:11.293 us avg_time:10.388 us stddev:133.078 us
creation maxthr:144 nthr:144 min_time:9.685 us max_time:10.618 us avg_time:10.232 us stddev:83.710 us
creation maxthr:144 nthr:144 min_time:9.132 us max_time:9.783 us avg_time:9.434 us stddev:42.769 us
lego-c2-qs-56:openmp-parallel-gomp-slow [04:37:06] $
#+end_example

I believe the difference comes down to the locking algorithm used. LLVM's libomp has environment variables that adjust the locking strategy, and the default appears to be "hyper":

#+begin_example
vshcmd: > # Default for clang is hyper choices.
vshcmd: > OMP_DISPLAY_ENV=verbose \
vshcmd: > OMP_NUM_THREADS=2 \
vshcmd: > OMP_PROC_BIND=true OMP_WAIT_POLICY=active \
vshcmd: > bench_clang \
vshcmd: > 2>&1 | grep KMP_.*BARRIER_PATTERN
[host] KMP_FORKJOIN_BARRIER_PATTERN='hyper,hyper'
[host] KMP_PLAIN_BARRIER_PATTERN='hyper,hyper'
[host] KMP_REDUCTION_BARRIER_PATTERN='hyper,hyper'
lego-c2-qs-56:openmp-parallel-gomp-slow [04:35:51] $
#+end_example

Trying out the performance of these different locking types, it seems that:

1) The "linear" locking strategy is slightly worse than the current libgomp approach. From this, and my understanding that libgomp implements a straightforward locking strategy, I believe the majority of the difference between LLVM and GNU in this area is due to the locking strategy.

2) The "dist" strategy is best for this particular workload. The discussion around that locking strategy on the LLVM review (https://reviews.llvm.org/D103121) suggests that this is precisely the workload where it would excel, but also implies that it is less generally useful than "hyper".

#+begin_example
vshcmd: > # Trying out performance of different locking types.
vshcmd: > for_each_locktype () {
vshcmd: >     for locktype in linear tree hyper dist; do
vshcmd: >         echo "### ${locktype}"
vshcmd: >         KMP_PLAIN_BARRIER_PATTERN="${locktype},${locktype}" \
vshcmd: >         KMP_FORKJOIN_BARRIER_PATTERN="${locktype},${locktype}" \
vshcmd: >         $1
vshcmd: >     done
vshcmd: > }
> > > > > > > lego-c2-qs-56:openmp-parallel-gomp-slow [04:38:10] $
vshcmd: > for_each_locktype "high_thread_counts bench_clang"
### linear
 NUM = 72
creation maxthr:72 nthr:72 min_time:23.069 us max_time:24.048 us avg_time:23.381 us stddev:82.844 us
creation maxthr:72 nthr:72 min_time:21.945 us max_time:22.392 us avg_time:22.235 us stddev:19.136 us
creation maxthr:72 nthr:72 min_time:21.888 us max_time:22.827 us avg_time:22.458 us stddev:70.038 us
 NUM = 144
creation maxthr:144 nthr:144 min_time:42.401 us max_time:49.778 us avg_time:43.787 us stddev:4655.222 us
creation maxthr:144 nthr:144 min_time:41.662 us max_time:43.191 us avg_time:42.453 us stddev:219.167 us
creation maxthr:144 nthr:144 min_time:41.585 us max_time:42.580 us avg_time:42.035 us stddev:128.929 us
### tree
 NUM = 72
creation maxthr:72 nthr:72 min_time:8.336 us max_time:8.753 us avg_time:8.568 us stddev:15.185 us
creation maxthr:72 nthr:72 min_time:8.289 us max_time:8.503 us avg_time:8.366 us stddev:3.837 us
creation maxthr:72 nthr:72 min_time:7.934 us max_time:8.229 us avg_time:8.132 us stddev:10.432 us
 NUM = 144
creation maxthr:144 nthr:144 min_time:9.282 us max_time:10.082 us avg_time:9.637 us stddev:52.126 us
creation maxthr:144 nthr:144 min_time:9.471 us max_time:10.060 us avg_time:9.759 us stddev:50.012 us
creation maxthr:144 nthr:144 min_time:9.715 us max_time:10.063 us avg_time:9.880 us stddev:13.659 us
### hyper
 NUM = 72
creation maxthr:72 nthr:72 min_time:8.656 us max_time:9.096 us avg_time:8.797 us stddev:19.781 us
creation maxthr:72 nthr:72 min_time:8.528 us max_time:9.008 us avg_time:8.699 us stddev:29.078 us
creation maxthr:72 nthr:72 min_time:8.690 us max_time:8.951 us avg_time:8.821 us stddev:8.773 us
 NUM = 144
creation maxthr:144 nthr:144 min_time:10.216 us max_time:10.779 us avg_time:10.506 us stddev:39.825 us
creation maxthr:144 nthr:144 min_time:9.706 us max_time:10.624 us avg_time:10.244 us stddev:95.163 us
creation maxthr:144 nthr:144 min_time:10.115 us max_time:10.524 us avg_time:10.278 us stddev:23.869 us
### dist
 NUM = 72
creation maxthr:72 nthr:72 min_time:4.301 us max_time:4.551 us avg_time:4.361 us stddev:5.554 us
creation maxthr:72 nthr:72 min_time:4.359 us max_time:4.531 us avg_time:4.417 us stddev:2.983 us
creation maxthr:72 nthr:72 min_time:4.274 us max_time:4.307 us avg_time:4.291 us stddev:0.154 us
 NUM = 144
creation maxthr:144 nthr:144 min_time:4.543 us max_time:5.223 us avg_time:4.760 us stddev:68.559 us
creation maxthr:144 nthr:144 min_time:4.561 us max_time:5.136 us avg_time:4.756 us stddev:42.839 us
creation maxthr:144 nthr:144 min_time:4.501 us max_time:4.970 us avg_time:4.683 us stddev:23.683 us
lego-c2-qs-56:openmp-parallel-gomp-slow [04:38:31] $
#+end_example

I also found a paper online that includes performance comparisons of OpenMP directives. It likewise identifies the "hypercube-embedded tree" locking strategy as the reason for the different scaling behaviour. The paper is quite old, but I see something similar (though not as drastic) in my experiments. Figure (2) in the paper gives a graph of how the two barrier approaches scale with the number of threads (measured for `#pragma omp barrier`; Figure (5) compares the overhead of `#pragma omp parallel` against `#pragma omp barrier`). Figure (3) gives a diagram of what "hypercube-embedded tree" actually means.
https://eprints.whiterose.ac.uk/180101/6/HPCC2021.pdf