Hi Andrew, hi Thomas, hi all,

@Thomas: I wouldn't mind if you could glance at the nvptx/CUDA bits.

On 12.09.23 16:27, Andrew Stubbs wrote:
This patch implements parallel execution of OpenMP reverse offload
kernels.
...
The device threads that sent requests are still blocked waiting for
the completion signal, but any other threads may continue as usual.
Which matches the spec. (Except that, starting with TR12, a user may
also use the 'nowait' clause on the reverse-offload target directive.)
+++ b/libgomp/config/nvptx/target.c
@@ -93,7 +93,6 @@ GOMP_target_ext (int device, void (*fn) (void *), size_t mapnum,
               void **hostaddrs, size_t *sizes, unsigned short *kinds,
               unsigned int flags, void **depend, void **args)
  {
...
+  if ((unsigned int) (index + 1) < GOMP_REV_OFFLOAD_VAR->consumed)
+    abort ();  /* Overflow.  */

[I assume the idea is that this gets diagnosed by the host (via
GOMP_PLUGIN_fatal) and that that diagnosis is faster than the
propagation of the abort() from the device to the host, such that the
message is always printed.]

Should there be a comment like "Error message is printed via the nvptx
plugin on the host" or something along those lines?


+++ b/libgomp/libgomp.texi
...
+* GOMP_REVERSE_OFFLOAD_THREADS:: Set the maximum number of host threads
  @end menu
...
+@node GOMP_REVERSE_OFFLOAD_THREADS
+@section @env{GOMP_REVERSE_OFFLOAD_THREADS} -- Set the maximum number of host threads

Thanks, but can you also update the gcn/nvptx descriptions? We currently have:
https://gcc.gnu.org/onlinedocs/libgomp/index.html

--------<cut->----------
@section AMD Radeon (GCN)
...
@item Reverse offload regions (i.e. @code{target} regions with
      @code{device(ancestor:1)}) are processed serially per @code{target} region
      such that the next reverse offload region is only executed after the previous
      one returned.
...
@section nvptx
...
@item Reverse offload regions (i.e. @code{target} regions with
      @code{device(ancestor:1)}) are processed serially per @code{target} region
      such that the next reverse offload region is only executed after the previous
      one returned.
--------<cut->----------

Possibly by also adding a @ref{GOMP_REVERSE_OFFLOAD_THREADS} so that a user
can find this.

(I wonder whether the "UINTMAX (= 4294967296)" should be documented or whether
that's an implementation detail we do not need to document. But given that a
long-running code taking 4 weeks can still issue 1775 reverse offloads per
second before exhausting the counter, that should be a nonissue.)



+++ b/libgomp/plugin/plugin-nvptx.c
@@ -1639,9 +1639,10 @@ nvptx_goacc_asyncqueue_construct (unsigned int flags)
  }

  struct goacc_asyncqueue *
-GOMP_OFFLOAD_openacc_async_construct (int device __attribute__((unused)))
+GOMP_OFFLOAD_openacc_async_construct (int device)
  {
-  return nvptx_goacc_asyncqueue_construct (CU_STREAM_DEFAULT);
+  nvptx_attach_host_thread_to_device (device);
+  return nvptx_goacc_asyncqueue_construct (CU_STREAM_NON_BLOCKING);
  }

That's not really new and we have plenty of code of this kind, but isn't
this a race if it is called nearly simultaneously for multiple
devices? (Still, the new code is surely better than the previous one.)

@Thomas: ?

I will have another look after the Cauldron, but I think the patch is
otherwise okay.

Tobias

-----------------
Siemens Electronic Design Automation GmbH; Address: Arnulfstraße 201, 80634
München; limited liability company (GmbH); Managing directors: Thomas
Heurung, Frank Thürauf; Registered office: München; Commercial register
München, HRB 106955
