Ping x6

On 2022/12/6 12:21 AM, Chung-Lin Tang wrote:
> Ping x5
>
> On 2022/11/22 12:24 AM, Chung-Lin Tang wrote:
>> Ping x4
>>
>> On 2022/11/8 12:34 AM, Chung-Lin Tang wrote:
>>> Ping x3.
>>>
>>> On 2022/10/31 10:18 PM, Chung-Lin Tang wrote:
>>>> Ping x2.
>>>>
>>>> On 2022/10/17 10:29 PM, Chung-Lin Tang wrote:
>>>>> Ping.
>>>>>
>>>>> On 2022/9/21 3:45 PM, Chung-Lin Tang via Gcc-patches wrote:
>>>>>> Hi Tom,
>>>>>> I had a patch submitted earlier, where I reported that the current
>>>>>> way of implementing barriers in libgomp on nvptx created a quite
>>>>>> significant performance drop on some SPEChpc2021 benchmarks:
>>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2022-September/600818.html
>>>>>>
>>>>>> That previous patch wasn't accepted well (admittedly, it was kind
>>>>>> of a hack). So in this patch, I tried to (mostly) re-implement
>>>>>> team-barriers for NVPTX.
>>>>>>
>>>>>> Basically, instead of trying to have the GPU do CPU-with-OS-like
>>>>>> things that it isn't suited for, barriers are implemented
>>>>>> simplistically with bar.* synchronization instructions.
>>>>>> Tasks are processed after threads have joined, and only if
>>>>>> team->task_count != 0
>>>>>>
>>>>>> (arguably, there might be a little bit of performance forfeited
>>>>>> where earlier arriving threads could've been used to process tasks
>>>>>> ahead of other threads. But that again falls into requiring
>>>>>> implementing complex futex-wait/wake like behavior. Really, that
>>>>>> kind of tasking is not what target offloading is usually used for)
>>>>>>
>>>>>> Implementation highlight notes:
>>>>>> 1. gomp_team_barrier_wake() is now an empty function (threads never
>>>>>>    "wake" in the usual manner)
>>>>>> 2. gomp_team_barrier_cancel() now uses the "exit" PTX instruction.
>>>>>> 3. gomp_barrier_wait_last() now is implemented using "bar.arrive"
>>>>>>
>>>>>> 4. gomp_team_barrier_wait_end()/gomp_team_barrier_wait_cancel_end():
>>>>>>    The main synchronization is done using a 'bar.red' instruction.
>>>>>>    This reduces across all threads the condition
>>>>>>    (team->task_count != 0), to enable the task processing down
>>>>>>    below if any thread created a task.
>>>>>>    (this bar.red usage required the need of the second GCC patch
>>>>>>    in this series)
>>>>>>
>>>>>> This patch has been tested on x86_64/powerpc64le with nvptx
>>>>>> offloading, using libgomp, ovo, omptests, and sollve_vv testsuites,
>>>>>> all without regressions. Also verified that the SPEChpc 2021
>>>>>> 521.miniswp_t and 534.hpgmgfv_t performance regressions that
>>>>>> occurred in the GCC12 cycle have been restored to devel/omp/gcc-11
>>>>>> (OG11) branch levels. Is this okay for trunk?
>>>>>>
>>>>>> (also suggest backporting to GCC12 branch, if performance
>>>>>> regression can be considered a defect)
>>>>>>
>>>>>> Thanks,
>>>>>> Chung-Lin
>>>>>>
>>>>>> libgomp/ChangeLog:
>>>>>>
>>>>>> 2022-09-21  Chung-Lin Tang  <clt...@codesourcery.com>
>>>>>>
>>>>>>     * config/nvptx/bar.c (generation_to_barrier): Remove.
>>>>>>     (futex_wait,futex_wake,do_spin,do_wait): Remove.
>>>>>>     (GOMP_WAIT_H): Remove.
>>>>>>     (#include "../linux/bar.c"): Remove.
>>>>>>     (gomp_barrier_wait_end): New function.
>>>>>>     (gomp_barrier_wait): Likewise.
>>>>>>     (gomp_barrier_wait_last): Likewise.
>>>>>>     (gomp_team_barrier_wait_end): Likewise.
>>>>>>     (gomp_team_barrier_wait): Likewise.
>>>>>>     (gomp_team_barrier_wait_final): Likewise.
>>>>>>     (gomp_team_barrier_wait_cancel_end): Likewise.
>>>>>>     (gomp_team_barrier_wait_cancel): Likewise.
>>>>>>     (gomp_team_barrier_cancel): Likewise.
>>>>>>     * config/nvptx/bar.h (gomp_team_barrier_wake): Remove
>>>>>>     prototype, add new static inline function.
>>>
>>
>
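
For readers following the thread, the control flow described in note 4 can be modelled on the host roughly as below. This is only a sequential sketch of the semantics, not the code from the patch: model_bar_red_or(), model_team_barrier() and MODEL_MAX_THREADS are hypothetical names made up for illustration, and the real implementation issues a PTX 'bar.red' instruction on the device rather than looping over an array. The assumption is that each thread contributes the predicate (team->task_count != 0) as it observed it on arrival, and that every thread receives the OR of all contributions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical cap on the modelled team size; not taken from the patch. */
enum { MODEL_MAX_THREADS = 1024 };

/* Hypothetical model of an OR-reducing barrier: every thread contributes
   one predicate, and every thread gets back the OR of all of them.  */
static bool
model_bar_red_or (const bool *pred, size_t nthreads)
{
  bool any = false;
  for (size_t i = 0; i < nthreads; i++)
    any = any || pred[i];
  return any;
}

/* Hypothetical model of one team barrier.  task_count_seen[i] is the value
   of team->task_count as observed by thread i when it reaches the barrier.
   Returns the number of threads that go on to the task-processing step
   after the join: all of them if any thread saw a pending task, otherwise
   none.  */
static size_t
model_team_barrier (const unsigned *task_count_seen, size_t nthreads)
{
  bool pred[MODEL_MAX_THREADS];

  assert (nthreads <= MODEL_MAX_THREADS);

  /* Phase 1: each thread arrives with its own view of the condition.  */
  for (size_t i = 0; i < nthreads; i++)
    pred[i] = task_count_seen[i] != 0;

  /* Phase 2: the reduction hands the same answer to every thread, so the
     whole team takes the same branch.  */
  bool run_tasks = model_bar_red_or (pred, nthreads);

  return run_tasks ? nthreads : 0;
}
```

In this model, a team of four in which only one thread observed a non-zero task count still sends all four threads into task processing, while a team that saw no tasks skips that step entirely. That is the trade-off the message describes: no thread starts on tasks before the whole team has joined.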
[Ping x6] Re: [PATCH, nvptx, 1/2] Reimplement libgomp barriers for nvptx
Chung-Lin Tang via Gcc-patches Mon, 12 Dec 2022 03:13:22 -0800