Ah, yes -- apologies.
There are two patches I built on top of. I would very much appreciate
target maintainer attention to both of these as well.
I split them out into independent patches and forgot to mention them in
the email (plus they didn't get properly sent due to mail server problems).
After re-sending I can now link:
This one first (to fix https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122314):
https://gcc.gnu.org/pipermail/gcc-patches/2025-November/700117.html
And this one second (to fix
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122356):
https://gcc.gnu.org/pipermail/gcc-patches/2025-November/700031.html
MM
On 11/10/25 10:45, Andrew Stubbs wrote:
I don't seem to be able to apply your patches. Did I miss a prerequisite?
Specifically, the hunks in gomp_team_barrier_wait_end and
gomp_team_barrier_wait_cancel_end have context that does not match
mainline.
Andrew
On 10/11/2025 10:06, [email protected] wrote:
From: Matthew Malcomson <[email protected]>
Cc'ing the maintainers of the nvptx, gcn, and rtems ports for the
target-specific changes (and especially to request runtime testing).
Having updated the target code, looked into various TODOs, and run
more testing, I've combined my previous patches (in the links below)
into one patchset.
https://gcc.gnu.org/pipermail/gcc-patches/2025-September/695005.html
https://gcc.gnu.org/pipermail/gcc-patches/2025-October/698257.html
In combination with the below patch this patchset drastically improves
the performance on the micro-benchmark 119588.
https://gcc.gnu.org/pipermail/gcc-patches/2025-November/700031.html
This micro-benchmark represents a significant slowdown in some OMP uses
in NVPL BLAS running GEMM routines on small matrices with a high level
of parallelism. The high level of parallelism comes from other routines
in the code benefiting from many threads, with no low-overhead way to
change the level of parallelism between routines.
This patchset has two commits:
1) Changes the linux/ barrier implementation from the "centralized"
   method currently used to a combination of a "linear" barrier gather
   and "centralized" barrier release.
   - I see this gives about a 3x improvement in the time taken to get
     through a highly contended barrier on a 144-core machine.
2) Follows the LLVM example and "waits" between parallel regions *inside*
   the barrier rather than between two barriers. This halves the
   overhead from barriers on many consecutive parallel regions.
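
To illustrate the idea behind commit (1), here is a minimal sketch of a
barrier with a "linear" gather and a "centralized" release. It is not
the code in the patch: every name in it (flat_barrier, arrived_gen,
release_gen, flat_barrier_demo, MAX_THREADS) is hypothetical, and it
spins with sched_yield() where a real implementation would more likely
sleep on a futex. The point it shows is that each thread announces its
arrival in its own cache line, selected by a thread ID passed to the
wait function, so arrivals do not contend on one shared counter. The
primary thread then releases everyone through a single shared word.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

#define MAX_THREADS 64  /* illustrative limit, not a libgomp constant */

/* Hypothetical layout; not the libgomp gomp_barrier_t.  */
struct flat_barrier {
    unsigned nthreads;
    /* One arrival slot per thread, each on its own cache line.  */
    struct { _Alignas(64) atomic_uint arrived_gen; } slot[MAX_THREADS];
    /* Single release word that every secondary thread waits on.  */
    _Alignas(64) atomic_uint release_gen;
};

static void flat_barrier_init(struct flat_barrier *b, unsigned nthreads)
{
    assert(nthreads >= 1 && nthreads <= MAX_THREADS);
    b->nthreads = nthreads;
    for (unsigned i = 0; i < MAX_THREADS; i++)
        atomic_init(&b->slot[i].arrived_gen, 0);
    atomic_init(&b->release_gen, 0);
}

/* The caller passes its own ID; ID 0 plays the primary thread.  */
static void flat_barrier_wait(struct flat_barrier *b, unsigned id)
{
    /* The release word cannot advance until this thread has arrived,
       so it still holds the previous generation here.  */
    unsigned gen =
        atomic_load_explicit(&b->release_gen, memory_order_relaxed) + 1;

    if (id == 0) {
        /* Linear gather: poll each secondary thread's slot in turn.  */
        for (unsigned i = 1; i < b->nthreads; i++)
            while (atomic_load_explicit(&b->slot[i].arrived_gen,
                                        memory_order_acquire) != gen)
                sched_yield();
        /* Centralized release: one store wakes all waiters.  */
        atomic_store_explicit(&b->release_gen, gen, memory_order_release);
    } else {
        /* Announce arrival in this thread's private slot.  */
        atomic_store_explicit(&b->slot[id].arrived_gen, gen,
                              memory_order_release);
        while (atomic_load_explicit(&b->release_gen,
                                    memory_order_acquire) != gen)
            sched_yield();
    }
}

struct worker_arg {
    struct flat_barrier *bar;
    atomic_uint *counter, *violations;
    unsigned id, rounds;
};

static void *worker(void *p)
{
    struct worker_arg *a = p;
    for (unsigned r = 0; r < a->rounds; r++) {
        atomic_fetch_add(a->counter, 1);
        flat_barrier_wait(a->bar, a->id);
        /* After the barrier, every thread must have counted round r.  */
        if (atomic_load(a->counter) < (r + 1) * a->bar->nthreads)
            atomic_fetch_add(a->violations, 1);
    }
    return NULL;
}

/* Runs the barrier for `rounds` rounds; returns the number of times a
   thread got through before all threads had arrived.  */
unsigned flat_barrier_demo(unsigned nthreads, unsigned rounds)
{
    struct flat_barrier bar;
    atomic_uint counter, violations;
    pthread_t tid[MAX_THREADS];
    struct worker_arg arg[MAX_THREADS];

    flat_barrier_init(&bar, nthreads);
    atomic_init(&counter, 0);
    atomic_init(&violations, 0);
    for (unsigned i = 0; i < nthreads; i++)
        arg[i] = (struct worker_arg){ &bar, &counter, &violations, i, rounds };
    for (unsigned i = 1; i < nthreads; i++) {
        int rc = pthread_create(&tid[i], NULL, worker, &arg[i]);
        assert(rc == 0);
        (void) rc;
    }
    worker(&arg[0]);  /* the calling thread acts as the primary thread */
    for (unsigned i = 1; i < nthreads; i++)
        pthread_join(tid[i], NULL);
    return atomic_load(&violations);
}
```

Built with -pthread, a call such as flat_barrier_demo(4, 1000) is
expected to return 0, meaning no thread was seen leaving the barrier
early. The generation counter lets the same slots be reused on every
round without a separate reset step.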
In order to use the feature introduced in patch (1) we have to change
the barrier API to pass an ID. For patch (1) alone we also need to
introduce some relatively awkward interfaces for adjusting the size of
the barrier.
Patch (2) removes the need for that awkward interface (the only
barrier that needs size adjustment is then no longer in the fast path).
Since I hope to get both patches in, I have only made the changes for
other targets on top of patch (2). This avoids writing an implementation
of the awkward interface that I intend never to be actually used.
N.b. when I did bootstrap & regtest on the posix/ target I saw flaky
tests both before and after the patches. I believe they are the same
flaky tests.
Matthew Malcomson (2):
  libgomp: Implement "flat" barrier for linux/ target
  libgomp: Removing one barrier in non-nested thread loop

 libgomp/barrier.c                             |   4 +-
 libgomp/config/gcn/bar.c                      |  45 +-
 libgomp/config/gcn/bar.h                      |  83 +-
 libgomp/config/gcn/team.c                     |   2 +-
 libgomp/config/linux/bar.c                    | 790 ++++++++++++++++--
 libgomp/config/linux/bar.h                    | 330 +++++++-
 libgomp/config/linux/futex_waitv.h            | 129 +++
 libgomp/config/linux/simple-bar.h             |  66 ++
 libgomp/config/linux/wait.h                   |  15 +-
 libgomp/config/nvptx/bar.c                    |  36 +-
 libgomp/config/nvptx/bar.h                    |  79 +-
 libgomp/config/nvptx/team.c                   |   2 +-
 libgomp/config/posix/bar.c                    |  29 +-
 libgomp/config/posix/bar.h                    |  74 +-
 libgomp/config/posix/pool.h                   |   1 +
 libgomp/config/posix/simple-bar.h             |  10 +-
 libgomp/config/rtems/bar.c                    | 185 +++-
 libgomp/config/rtems/bar.h                    |  82 +-
 libgomp/libgomp.h                             |  13 +-
 libgomp/single.c                              |   4 +-
 libgomp/task.c                                |  43 +-
 libgomp/team.c                                | 255 +++++-
 .../testsuite/libgomp.c++/task-reduction-20.C | 136 +++
 .../testsuite/libgomp.c++/task-reduction-21.C | 140 ++++
 .../libgomp.c/primary-thread-tasking.c        |  80 ++
 libgomp/work.c                                |  26 +-
 26 files changed, 2392 insertions(+), 267 deletions(-)
 create mode 100644 libgomp/config/linux/futex_waitv.h
 create mode 100644 libgomp/config/linux/simple-bar.h
 create mode 100644 libgomp/testsuite/libgomp.c++/task-reduction-20.C
 create mode 100644 libgomp/testsuite/libgomp.c++/task-reduction-21.C
 create mode 100644 libgomp/testsuite/libgomp.c/primary-thread-tasking.c