Hi Andrew, hi Thomas, hi all, @Thomas: I wouldn't mind if you could glance at the nvptx/CUDA bits.
On 12.09.23 16:27, Andrew Stubbs wrote:
This patch implements parallel execution of OpenMP reverse offload kernels. ... The device threads that sent requests are still blocked waiting for the completion signal, but any other threads may continue as usual.
Which matches the spec. (Except that, starting with TR12, a user may also use the 'nowait' clause on the reverse-offload target directive.)
+++ b/libgomp/config/nvptx/target.c
@@ -93,7 +93,6 @@ GOMP_target_ext (int device, void (*fn) (void *), size_t mapnum,
		  void **hostaddrs, size_t *sizes, unsigned short *kinds,
		  unsigned int flags, void **depend, void **args)
 {
...
+  if ((unsigned int) (index + 1) < GOMP_REV_OFFLOAD_VAR->consumed)
+    abort ();  /* Overflow.  */
[I assume the idea is that this gets diagnosed by the host (via GOMP_PLUGIN_fatal) and that this diagnosis is faster than the propagation of the abort () from the device to the host, such that the message is always printed.] Should there be a comment like "Error message is printed via the nvptx plugin on the host", or something along those lines?
+++ b/libgomp/libgomp.texi
...
+* GOMP_REVERSE_OFFLOAD_THREADS:: Set the maximum number of host threads
 @end menu
...
+@node GOMP_REVERSE_OFFLOAD_THREADS
+@section @env{GOMP_REVERSE_OFFLOAD_THREADS} -- Set the maximum number of host threads
Thanks - but can you also update the gcn/nvptx description? We currently have (https://gcc.gnu.org/onlinedocs/libgomp/index.html):

--------<cut>----------
@section AMD Radeon (GCN)
...
@item Reverse offload regions (i.e. @code{target} regions with
@code{device(ancestor:1)}) are processed serially per @code{target}
region such that the next reverse offload region is only executed
after the previous one returned.
...
@section nvptx
...
@item Reverse offload regions (i.e. @code{target} regions with
@code{device(ancestor:1)}) are processed serially per @code{target}
region such that the next reverse offload region is only executed
after the previous one returned.
--------<cut>----------

Possibly also add a @ref{GOMP_REVERSE_OFFLOAD_THREADS} there so that a user can find this.

(I wonder whether the UINT_MAX (= 4294967295) limit should be documented, or whether that is an implementation detail we do not need to document. But given that even a long-running code taking 4 weeks can still issue 1775 reverse offloads per second before exhausting the count, that should be a nonissue.)
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -1639,9 +1639,10 @@ nvptx_goacc_asyncqueue_construct (unsigned int flags)
 }
 
 struct goacc_asyncqueue *
-GOMP_OFFLOAD_openacc_async_construct (int device __attribute__((unused)))
+GOMP_OFFLOAD_openacc_async_construct (int device)
 {
-  return nvptx_goacc_asyncqueue_construct (CU_STREAM_DEFAULT);
+  nvptx_attach_host_thread_to_device (device);
+  return nvptx_goacc_asyncqueue_construct (CU_STREAM_NON_BLOCKING);
 }
That's not really new - we have plenty of code of this kind - but isn't this a race if it is called nearly simultaneously for multiple devices? (Still, the new code is surely better than the previous one.) @Thomas: ?

I will have another look after the Cauldron, but I think the patch is otherwise okay.

Tobias

-----------------
Siemens Electronic Design Automation GmbH; address: Arnulfstraße 201, 80634 München; limited liability company (GmbH); managing directors: Thomas Heurung, Frank Thürauf; registered office: München; commercial register: München, HRB 106955