Ping x3.

On 2022/10/31 10:18 PM, Chung-Lin Tang wrote:
> Ping x2.
>
> On 2022/10/17 10:29 PM, Chung-Lin Tang wrote:
>> Ping.
>>
>> On 2022/9/21 3:45 PM, Chung-Lin Tang via Gcc-patches wrote:
>>> Hi Tom,
>>> I had a patch submitted earlier, where I reported that the current way of
>>> implementing barriers in libgomp on nvptx created a quite significant
>>> performance drop on some SPEChpc2021 benchmarks:
>>> https://gcc.gnu.org/pipermail/gcc-patches/2022-September/600818.html
>>>
>>> That previous patch wasn't accepted well (admittedly, it was kind of a hack).
>>> So in this patch, I tried to (mostly) re-implement team-barriers for NVPTX.
>>>
>>> Basically, instead of trying to have the GPU do CPU-with-OS-like things
>>> that it isn't suited for, barriers are implemented simplistically with
>>> bar.* synchronization instructions.
>>> Tasks are processed after threads have joined, and only if
>>> team->task_count != 0.
>>>
>>> (Arguably, there might be a little bit of performance forfeited where
>>> earlier arriving threads could've been used to process tasks ahead of
>>> other threads. But that again falls into requiring the implementation of
>>> complex futex-wait/wake like behavior. Really, that kind of tasking is not
>>> what target offloading is usually used for.)
>>>
>>> Implementation highlight notes:
>>> 1. gomp_team_barrier_wake() is now an empty function (threads never "wake"
>>>    in the usual manner)
>>> 2. gomp_team_barrier_cancel() now uses the "exit" PTX instruction.
>>> 3. gomp_barrier_wait_last() is now implemented using "bar.arrive"
>>>
>>> 4. gomp_team_barrier_wait_end()/gomp_team_barrier_wait_cancel_end():
>>>    The main synchronization is done using a 'bar.red' instruction. This
>>>    reduces the condition (team->task_count != 0) across all threads, to
>>>    enable the task processing down below if any thread created a task.
>>>    (This bar.red usage requires the second GCC patch in this series.)
>>>
>>> This patch has been tested on x86_64/powerpc64le with nvptx offloading,
>>> using the libgomp, ovo, omptests, and sollve_vv testsuites, all without
>>> regressions. Also verified that the SPEChpc 2021 521.miniswp_t and
>>> 534.hpgmgfv_t performance regressions that occurred in the GCC12 cycle
>>> are gone, with performance restored to devel/omp/gcc-11 (OG11) branch
>>> levels. Is this okay for trunk?
>>>
>>> (Also suggest backporting to the GCC12 branch, if a performance regression
>>> can be considered a defect.)
>>>
>>> Thanks,
>>> Chung-Lin
>>>
>>> libgomp/ChangeLog:
>>>
>>> 2022-09-21  Chung-Lin Tang  <clt...@codesourcery.com>
>>>
>>>     * config/nvptx/bar.c (generation_to_barrier): Remove.
>>>     (futex_wait,futex_wake,do_spin,do_wait): Remove.
>>>     (GOMP_WAIT_H): Remove.
>>>     (#include "../linux/bar.c"): Remove.
>>>     (gomp_barrier_wait_end): New function.
>>>     (gomp_barrier_wait): Likewise.
>>>     (gomp_barrier_wait_last): Likewise.
>>>     (gomp_team_barrier_wait_end): Likewise.
>>>     (gomp_team_barrier_wait): Likewise.
>>>     (gomp_team_barrier_wait_final): Likewise.
>>>     (gomp_team_barrier_wait_cancel_end): Likewise.
>>>     (gomp_team_barrier_wait_cancel): Likewise.
>>>     (gomp_team_barrier_cancel): Likewise.
>>>     * config/nvptx/bar.h (gomp_team_barrier_wake): Remove
>>>     prototype, add new static inline function.
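
For readers unfamiliar with the reduction described in note 4, here is a small host-side C model of the decision it makes. It is only a sketch, not the patch's code: it runs sequentially on the CPU and emits no PTX. The function name and the per-thread array are hypothetical, and the use of an OR reduction is an assumption inferred from the wording "if any thread created a task". Only the condition (team->task_count != 0) comes from the message above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: observed_task_count[i] is the value of
   team->task_count that thread i sees when it reaches the barrier.
   Each thread contributes the predicate (task_count != 0); the
   barrier hands the OR of all predicates back to every thread.
   Assumption: the reduction is an OR, as "if any thread" suggests.  */
bool
team_barrier_should_run_tasks (const unsigned *observed_task_count,
                               size_t nthreads)
{
  assert (nthreads > 0);

  bool any_tasks = false;
  for (size_t i = 0; i < nthreads; i++)
    any_tasks = any_tasks || (observed_task_count[i] != 0);

  /* Every thread gets the same answer, so all of them either enter
     the task-processing path after the barrier or all skip it.  */
  return any_tasks;
}
```

Because the result is uniform across the team, no thread needs to be woken individually, which fits with gomp_team_barrier_wake() becoming an empty function.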
[Ping x3] Re: [PATCH, nvptx, 1/2] Reimplement libgomp barriers for nvptx
Chung-Lin Tang via Gcc-patches Mon, 07 Nov 2022 08:34:57 -0800