Hi! During GCC/OpenMP/nvptx reverse offload investigations, about how to replace the problematic global 'GOMP_REV_OFFLOAD_VAR', I may have found something re:
On 2022-08-26T11:07:28+0200, Tobias Burnus <tob...@codesourcery.com> wrote:
> Better suggestions are welcome for the busy loop in
> libgomp/plugin/plugin-nvptx.c regarding the variable placement and
> checking its value.

> On the host side, the last address is checked - if fn_addr != NULL,
> it passes all arguments on to the generic (target.c) gomp_target_rev
> to do the actual offloading.
>
> CUDA does lockup when trying to copy data from the currently running
> stream; hence, a new stream is generated to do the memory copying.

> Future work for nvptx:
> * Adjust 'sleep', possibly [...]
>   to do shorter sleeps than usleep(1)?

... this busy loop.

Current 'libgomp/plugin/plugin-nvptx.c:GOMP_OFFLOAD_run':

    [...]
      if (reverse_offload)
        CUDA_CALL_ASSERT (cuStreamCreate, &copy_stream, CU_STREAM_NON_BLOCKING);
      r = CUDA_CALL_NOCHECK (cuLaunchKernel, function, teams, 1, 1,
                             32, threads, 1, 0, NULL, NULL, config);
      if (r != CUDA_SUCCESS)
        GOMP_PLUGIN_fatal ("cuLaunchKernel error: %s", cuda_error (r));
      if (reverse_offload)
        while (true)
          {
            r = CUDA_CALL_NOCHECK (cuStreamQuery, NULL);
            if (r == CUDA_SUCCESS)
              break;
            if (r == CUDA_ERROR_LAUNCH_FAILED)
              GOMP_PLUGIN_fatal ("cuStreamQuery error: %s %s\n", cuda_error (r),
                                 maybe_abort_msg);
            else if (r != CUDA_ERROR_NOT_READY)
              GOMP_PLUGIN_fatal ("cuStreamQuery error: %s", cuda_error (r));

            if (__atomic_load_n (&ptx_dev->rev_data->fn, __ATOMIC_ACQUIRE) != 0)
              {
                struct rev_offload *rev_data = ptx_dev->rev_data;
                GOMP_PLUGIN_target_rev (rev_data->fn, rev_data->mapnum,
                                        rev_data->addrs, rev_data->sizes,
                                        rev_data->kinds, rev_data->dev_num,
                                        rev_off_dev_to_host_cpy,
                                        rev_off_host_to_dev_cpy, copy_stream);
                CUDA_CALL_ASSERT (cuStreamSynchronize, copy_stream);
                __atomic_store_n (&rev_data->fn, 0, __ATOMIC_RELEASE);
              }
            usleep (1);
          }
      else
        r = CUDA_CALL_NOCHECK (cuCtxSynchronize, );
      if (reverse_offload)
        CUDA_CALL_ASSERT (cuStreamDestroy, copy_stream);
    [...]

Instead of this 'while (true)', 'usleep (1)' loop, shouldn't we be able to
use "Stream Memory Operations",
<https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEMOP.html>,
which make it possible to "Wait on a memory location" "until the given
condition on the memory is satisfied"?

For reference, current 'libgomp/config/nvptx/target.c:GOMP_target_ext':

    [...]
      GOMP_REV_OFFLOAD_VAR->mapnum = mapnum;
      GOMP_REV_OFFLOAD_VAR->addrs = (uint64_t) hostaddrs;
      GOMP_REV_OFFLOAD_VAR->sizes = (uint64_t) sizes;
      GOMP_REV_OFFLOAD_VAR->kinds = (uint64_t) kinds;
      GOMP_REV_OFFLOAD_VAR->dev_num = GOMP_ADDITIONAL_ICVS.device_num;

      /* Set 'fn' to trigger processing on the host; wait for completion,
         which is flagged by setting 'fn' back to 0 on the host.  */
      uint64_t addr_struct_fn = (uint64_t) &GOMP_REV_OFFLOAD_VAR->fn;
    #if __PTX_SM__ >= 700
      asm volatile ("st.global.release.sys.u64 [%0], %1;"
                    : : "r" (addr_struct_fn), "r" (fn) : "memory");
    #else
      __sync_synchronize ();  /* membar.sys */
      asm volatile ("st.volatile.global.u64 [%0], %1;"
                    : : "r" (addr_struct_fn), "r" (fn) : "memory");
    #endif

    #if __PTX_SM__ >= 700
      uint64_t fn2;
      do
        {
          asm volatile ("ld.acquire.sys.global.u64 %0, [%1];"
                        : "=r" (fn2) : "r" (addr_struct_fn) : "memory");
        }
      while (fn2 != 0);
    #else
      /* ld.global.u64 %r64,[__gomp_rev_offload_var];
         ld.u64 %r36,[%r64];
         membar.sys;  */
      while (__atomic_load_n (&GOMP_REV_OFFLOAD_VAR->fn, __ATOMIC_ACQUIRE) != 0)
        ;  /* spin */
    #endif
    [...]
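To make that idea concrete, here is a minimal, untested sketch of what the
host-side loop might become.  Several names are made up for illustration:
'wait_stream' is a second helper stream; 'fn_dev_addr' is the device-visible
address of 'ptx_dev->rev_data->fn' (for the page-locked allocation,
presumably obtainable via cuMemHostGetDevicePointer); and
'GOMP_REV_OFFLOAD_DONE' is a sentinel value that no valid function address
can take:

    #define GOMP_REV_OFFLOAD_DONE ((uint64_t) -1)

      CUstream wait_stream;
      if (reverse_offload)
        {
          CUDA_CALL_ASSERT (cuStreamCreate, &copy_stream,
                            CU_STREAM_NON_BLOCKING);
          CUDA_CALL_ASSERT (cuStreamCreate, &wait_stream,
                            CU_STREAM_NON_BLOCKING);
        }
      r = CUDA_CALL_NOCHECK (cuLaunchKernel, function, teams, 1, 1,
                             32, threads, 1, 0, NULL, NULL, config);
      if (r != CUDA_SUCCESS)
        GOMP_PLUGIN_fatal ("cuLaunchKernel error: %s", cuda_error (r));
      if (reverse_offload)
        {
          /* Enqueued on the same (default) stream as the kernel, this
             write only executes once the kernel has exited, and thus
             unblocks the final wait below.  */
          CUDA_CALL_ASSERT (cuStreamWriteValue64, NULL, fn_dev_addr,
                            GOMP_REV_OFFLOAD_DONE,
                            CU_STREAM_WRITE_VALUE_DEFAULT);
          while (true)
            {
              /* Sleep until any bit of 'fn' is set, that is, until
                 'fn != 0'.  'AND' with all-ones is used instead of
                 'GEQ 1', as the latter is a cyclic comparison that the
                 all-ones sentinel wouldn't satisfy.  */
              CUDA_CALL_ASSERT (cuStreamWaitValue64, wait_stream,
                                fn_dev_addr, ~(uint64_t) 0,
                                CU_STREAM_WAIT_VALUE_AND);
              CUDA_CALL_ASSERT (cuStreamSynchronize, wait_stream);
              struct rev_offload *rev_data = ptx_dev->rev_data;
              uint64_t fn = __atomic_load_n (&rev_data->fn,
                                             __ATOMIC_ACQUIRE);
              if (fn == GOMP_REV_OFFLOAD_DONE)
                /* Kernel has completed; no further requests can arrive.  */
                break;
              GOMP_PLUGIN_target_rev (fn, rev_data->mapnum, rev_data->addrs,
                                      rev_data->sizes, rev_data->kinds,
                                      rev_data->dev_num,
                                      rev_off_dev_to_host_cpy,
                                      rev_off_host_to_dev_cpy, copy_stream);
              CUDA_CALL_ASSERT (cuStreamSynchronize, copy_stream);
              /* Signal completion back to the spinning device thread.  */
              __atomic_store_n (&rev_data->fn, 0, __ATOMIC_RELEASE);
            }
        }
      else
        r = CUDA_CALL_NOCHECK (cuCtxSynchronize, );

Whether 'cuStreamSynchronize' then idles politely instead of spinning
depends on the context's CU_CTX_SCHED_* flags, and whether the wait
reliably observes the device's 'st.global.release.sys' store to the
page-locked allocation is exactly what would need testing.  Error handling
also needs more thought: if the kernel crashes, the sentinel write
presumably never executes, so some cuStreamQuery-based check (or a timeout)
would still be required to avoid blocking indefinitely.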
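Availability is a concern, though: stream memory operations are not
supported everywhere, so presumably we'd have to probe for them at device
initialization and keep the current 'usleep (1)' loop as a fallback.
Another sketch, with the made-up field 'ptx_dev->rev_offload_memops_p',
placed in 'nvptx_open_device':

      /* Probe whether stream memory operations are usable on this
         device.  (On older drivers, they additionally had to be enabled
         via the kernel module parameter 'NVreg_EnableStreamMemOPs=1'.)  */
      int pi;
      CUDA_CALL_ERET (NULL, cuDeviceGetAttribute, &pi,
                      CU_DEVICE_ATTRIBUTE_CAN_USE_STREAM_MEM_OPS, dev);
      ptx_dev->rev_offload_memops_p = (pi != 0);

'GOMP_OFFLOAD_run' would then select between the wait-value path and the
existing polling loop based on that flag.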
Regards,
Thomas