Hi!

CCing various people, because I'd like to have something that won't work on
XeonPhi only.
On Fri, Oct 02, 2015 at 10:28:01PM +0300, Ilya Verbin wrote:
> On Tue, Sep 08, 2015 at 11:20:14 +0200, Jakub Jelinek wrote:
> > nowait support for #pragma omp target is not implemented yet; supposedly we
> > need to mark those somehow (some flag) already in the struct gomp_task
> > structure.  Essentially it will need either 2 or 3 callbacks:
> > the current one, executed when the dependencies are resolved (it actually
> > waits until some thread schedules it after that point; I think it is
> > undesirable to run it with the tasking lock held), which would perform
> > the gomp_map_vars and initiate the running of the region; then some
> > query routine which would poll the plugin whether the task is done or not,
> > and either perform the finalization (unmap_vars) if it is done (and in any
> > case return a bool saying whether it should be polled again or not); and,
> > if the finalization is not done there, also another callback for the
> > finalization.
> > Also, there is the issue that if we are waiting for a task that needs to be
> > polled, and we don't have any further tasks to run, we shouldn't really
> > attempt to sleep on some semaphore (e.g. in taskwait, end of
> > taskgroup, etc.) or barrier, but rather either need to keep polling it, or
> > call the query hook with some argument telling it to sleep in there until
> > the work is done by the offloading device.
> > Also, there needs to be a way for the target nowait first callback to say
> > that it is using host fallback and thus acts as a normal task, so that
> > once the task fn finishes, the task is done.
>
> Here is my WIP patch.  The target.c part is obviously incorrect, but it
> demonstrates a possible libgomp <-> plugin interface for running a target
> task function asynchronously and checking whether it is completed or not.
> (Refactored liboffloadmic/runtime/emulator from trunk is required to run
> the target-tmp.c testcase.)

The difficulty is designing something that will work (fast, if possible) on
the various devices we want to eventually support (at least XeonPhi, XeonPhi
emul, PTX/Cuda and HSA), ideally without too much busy waiting.

The OpenMP 4.5 spec says that there is a special "target task" on the host
side around the target region, that the "target task" is mergeable and, if
nowait is not specified, included (otherwise it may be), and that the
mapping operations (which include target device memory allocation, refcount
management and mapping data structure updates, as well as the memory copying
to the target device) happen only after the (optional) dependencies are
satisfied.  After the memory mapping operations are done, the offloading
kernel starts, and when it finishes, the unmapping operations are performed
(which include memory copying from the target device, refcount management
and mapping data structure updates, and finally memory deallocation).

Right now on the OpenMP side everything is synchronous; e.g. target
enter/exit data and update are asynchronous only in that the mapping or
unmapping operation is scheduled as a task, but the whole mapping or
unmapping operation, including all the above mentioned subparts, is
performed while holding the particular device's lock.

To make that more asynchronous, e.g. for Cuda we might want to use Cuda
(non-default) streams, and perform the allocation, refcount management and
mapping data structure updates, and the data copying to the device, already
as part of the stream.
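To make the above concrete, the rough shape of the plugin hooks I have in
mind is something like the following (just a sketch, not a final interface;
async_run_func and async_is_completed_func follow your patch, the third hook
and the exact signatures are made up):

  /* Sketch of possible gomp_device_descr hooks; all signatures tentative.  */

  /* Called once the dependencies are resolved, without the tasking lock
     held.  Starts the offload kernel asynchronously (after gomp_map_vars);
     returns true if host fallback was used, i.e. the task then behaves
     like a normal task and is done once the task fn returns.  */
  bool (*async_run_func) (int target_id, void *fn_addr, void *vars,
			  void *async_data);

  /* Polled by the tasking code; returns true once the device has finished,
     at which point the finalization (unmap_vars) can be performed, either
     here or in the third hook below.  */
  bool (*async_is_completed_func) (int target_id, void *async_data);

  /* Hypothetical third hook: blocking wait, for when there are no host
     tasks left and we'd otherwise just busy poll.  */
  void (*async_wait_func) (int target_id, void *async_data);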
Except that it means that if another target mapping/unmapping operation is
enqueued at that point and it refers to any of the affected objects, it
could acquire the device lock, yet the data copying would still be in flux.
Dunno if it would be acceptable to e.g. add some flags to the mapping data
structures, saying this memory range either has pending data transfers, or
has enqueued data transfers that depend on whether the refcount will become
zero or not.  When mapping, if we wanted to touch any of the regions marked
with such an in_flux flag, we'd need to wait until all of the other stream's
operations finish and the unmapping operations are performed (and the device
lock released again) before continuing.  That way we could get good
performance if either concurrent async regions touch different variables, or
target data or non-async target enter data or exit data has been put around
the mappings so the streams can be independent; but worst case we'd make
them non-concurrent.

Anyway, let's put the asynchronous memory data transfers (which also imply
the ability to enqueue multiple different target regions into a stream for
the device to operate on independently from the host) aside for now, just
discuss what we want for the actual async execution, and for now keep a
device lock around all the mapping or unmapping operations.

If the "target task" has unresolved dependencies, then it will use the
existing task.c waiting code first (if the above is resolved somehow, there
could be exceptions for a "target task" depending on another "target task").
When the dependencies are resolved, we can run the gomp_target_task_fn
callback (but not with the team's tasking lock held), which can perform the
gomp_map_vars call and start the async execution.  For host fallback, that
is all we do; the task is at this point a normal task.  For an offloading
task, we now want the host to continue scheduling other tasks if there are
any, which means (not currently implemented on the task.c side) we want to
move the task somewhere so that we don't consider it finished, knowing that
we'll need to schedule it again at some point to perform the unmapping
(perhaps requeue it again in a WAITING or some other state).  Right now, the
tasking code would in that case try to schedule another task, and if there
are none, or none that are runnable among the tasks of interest, it can go
to sleep and expects to be awakened when some task it is waiting for
finishes.

And the main question is how to find out on the various devices whether the
async execution has completed already.

From what I can see in the liboffloadmic plugin, you have an extra host
thread that can run a callback function on the host.  Such a callback could,
say, tweak the state of the "target task", could take the team's tasking
lock, and even wake sleepers, maybe even take the device lock and perform
the unmapping of vars?  The function would need to be in task.c so that it
can access everything defined in there.  Or the callback could just change
something in the "target task" state and let the tasking code poll for the
change.
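E.g. something as simple as (a sketch only; the done field and the callback
are hypothetical):

  /* Hypothetical field in struct gomp_target_task, written by the plugin
     (or its helper thread) on completion, read by the tasking code.  */
  int done;

  /* What the host-side callback could do.  */
  static void
  gomp_target_task_completion (void *data)
  {
    struct gomp_target_task *ttask = (struct gomp_target_task *) data;
    __atomic_store_n (&ttask->done, 1, MEMMODEL_RELEASE);
    /* Perhaps also take team->task_lock, requeue the task and
       gomp_sem_post any sleepers, or even perform the unmapping;
       that is the part still to be decided.  */
  }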
Looking at Cuda, for async target region kernels we'd probably use a
non-default stream and enqueue the async kernel in there.  I see we can e.g.
cudaEventRecord into the stream and then either cudaEventQuery to busy poll
the event, or cudaEventSynchronize to block until the event occurs; plus
there is cudaStreamWaitEvent that perhaps might even be used to resolve the
above mentioned mapping/unmapping async issues for Cuda - like add an event
after the mapping operations that the other target tasks could wait for if
they see any in_flux stuff, wait for an event, etc.  I don't see a
possibility to have something like a callback on stream completion though,
so it has to be handled with polling.  If that is true, it means the tasking
code can't go to sleep if there are any pending target tasks it wants to
wait for (at least for devices that can't do a callback); it would need to
call the poll methods of the plugins it wants to wait for in a loop (unless
there are no host tasks left and only a single device is involved, in which
case it could call a blocking method).

For HSA I have no idea.

Now, for the polling case, the question is how expensive the polling is,
i.e. whether it can be performed with the team's tasking lock held or not.
If XeonPhi doesn't do the full host callback, but polling, it could just
read some memory from the target_task struct and thus be fast enough to run
with the lock held.  How expensive is cudaEventQuery?
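For reference, the event based variant I mean would look roughly like this
(untested; runtime API shown for brevity, a plugin would presumably use the
driver API counterparts cuEventRecord/cuEventQuery/cuEventSynchronize;
offload_kernel and tgt_start are placeholders):

  cudaStream_t stream;
  cudaEvent_t ev;
  cudaStreamCreate (&stream);
  cudaEventCreateWithFlags (&ev, cudaEventDisableTiming);

  /* Enqueue the target region kernel into the stream and record an
     event right after it.  */
  offload_kernel<<<1, 128, 0, stream>>> (tgt_start);
  cudaEventRecord (ev, stream);

  /* Non-blocking poll; the question above is whether this is cheap
     enough to call with the tasking lock held.  */
  bool done = (cudaEventQuery (ev) == cudaSuccess);

  /* Blocking variant for when there is no other work left.  */
  cudaEventSynchronize (ev);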
> diff --git a/libgomp/target.c b/libgomp/target.c
> index 77bd442..31f034c 100644
> --- a/libgomp/target.c
> +++ b/libgomp/target.c
> @@ -45,6 +45,10 @@
>  #include "plugin-suffix.h"
>  #endif
>  
> +/* FIXME: TMP */
> +#include <stdio.h>
> +#include <unistd.h>

I hope you mean to remove this later on.

> @@ -1227,6 +1231,44 @@ gomp_target_fallback (void (*fn) (void *), void **hostaddrs)
>    *thr = old_thr;
>  }
>  
> +/* Host fallback with firstprivate map-type handling.  */
> +
> +static void
> +gomp_target_fallback_firstprivate (void (*fn) (void *), size_t mapnum,
> +				   void **hostaddrs, size_t *sizes,
> +				   unsigned short *kinds)
> +{
> +  size_t i, tgt_align = 0, tgt_size = 0;
> +  char *tgt = NULL;
> +  for (i = 0; i < mapnum; i++)
> +    if ((kinds[i] & 0xff) == GOMP_MAP_FIRSTPRIVATE)
> +      {
> +	size_t align = (size_t) 1 << (kinds[i] >> 8);
> +	if (tgt_align < align)
> +	  tgt_align = align;
> +	tgt_size = (tgt_size + align - 1) & ~(align - 1);
> +	tgt_size += sizes[i];
> +      }
> +  if (tgt_align)
> +    {
> +      tgt = gomp_alloca (tgt_size + tgt_align - 1);
> +      uintptr_t al = (uintptr_t) tgt & (tgt_align - 1);
> +      if (al)
> +	tgt += tgt_align - al;
> +      tgt_size = 0;
> +      for (i = 0; i < mapnum; i++)
> +	if ((kinds[i] & 0xff) == GOMP_MAP_FIRSTPRIVATE)
> +	  {
> +	    size_t align = (size_t) 1 << (kinds[i] >> 8);
> +	    tgt_size = (tgt_size + align - 1) & ~(align - 1);
> +	    memcpy (tgt + tgt_size, hostaddrs[i], sizes[i]);
> +	    hostaddrs[i] = tgt + tgt_size;
> +	    tgt_size = tgt_size + sizes[i];
> +	  }
> +    }
> +  gomp_target_fallback (fn, hostaddrs);
> +}

This is ok.

> +
>  /* Helper function of GOMP_target{,_41} routines.  */
>  
>  static void *
> @@ -1311,40 +1353,19 @@ GOMP_target_41 (int device, void (*fn) (void *), size_t mapnum,
>    if (devicep == NULL
>        || !(devicep->capabilities & GOMP_OFFLOAD_CAP_OPENMP_400))
>      {
> -      size_t i, tgt_align = 0, tgt_size = 0;
> -      char *tgt = NULL;
> -      for (i = 0; i < mapnum; i++)
> -	if ((kinds[i] & 0xff) == GOMP_MAP_FIRSTPRIVATE)
> -	  {
> -	    size_t align = (size_t) 1 << (kinds[i] >> 8);
> -	    if (tgt_align < align)
> -	      tgt_align = align;
> -	    tgt_size = (tgt_size + align - 1) & ~(align - 1);
> -	    tgt_size += sizes[i];
> -	  }
> -      if (tgt_align)
> -	{
> -	  tgt = gomp_alloca (tgt_size + tgt_align - 1);
> -	  uintptr_t al = (uintptr_t) tgt & (tgt_align - 1);
> -	  if (al)
> -	    tgt += tgt_align - al;
> -	  tgt_size = 0;
> -	  for (i = 0; i < mapnum; i++)
> -	    if ((kinds[i] & 0xff) == GOMP_MAP_FIRSTPRIVATE)
> -	      {
> -		size_t align = (size_t) 1 << (kinds[i] >> 8);
> -		tgt_size = (tgt_size + align - 1) & ~(align - 1);
> -		memcpy (tgt + tgt_size, hostaddrs[i], sizes[i]);
> -		hostaddrs[i] = tgt + tgt_size;
> -		tgt_size = tgt_size + sizes[i];
> -	      }
> -	}
> -      gomp_target_fallback (fn, hostaddrs);
> +      gomp_target_fallback_firstprivate (fn, mapnum, hostaddrs, sizes,
> +					 kinds);
>        return;
>      }

This too.

>    void *fn_addr = gomp_get_target_fn_addr (devicep, fn);
>  
> +  if (flags & GOMP_TARGET_FLAG_NOWAIT)
> +    {
> +      gomp_create_target_task (devicep, fn_addr, mapnum, hostaddrs, sizes,
> +			       kinds, flags, depend);
> +      return;
> +    }

But this is not ok.  You need to do this far earlier, already before the
if (depend != NULL) code in GOMP_target_41.  And, I think you should just
not pass fn_addr, but fn itself.

> @@ -1636,34 +1657,58 @@ void
>  gomp_target_task_fn (void *data)
>  {
>    struct gomp_target_task *ttask = (struct gomp_target_task *) data;
> +  struct gomp_device_descr *devicep = ttask->devicep;
> +
>    if (ttask->fn != NULL)
>      {
> -      /* GOMP_target_41 */
> +      if (devicep == NULL
> +	  || !(devicep->capabilities & GOMP_OFFLOAD_CAP_OPENMP_400))
> +	{
> +	  /* FIXME: Save host fn addr into gomp_target_task?  */
> +	  gomp_target_fallback_firstprivate (NULL, ttask->mapnum,

If you pass fn above instead of fn_addr, ttask->fn is what you want to pass
to gomp_target_fallback_firstprivate here, and you can remove the FIXME.

> +					     ttask->hostaddrs, ttask->sizes,
> +					     ttask->kinds);
> +	  return;
> +	}
> +
> +      struct target_mem_desc *tgt_vars
> +	= gomp_map_vars (devicep, ttask->mapnum, ttask->hostaddrs, NULL,
> +			 ttask->sizes, ttask->kinds, true,
> +			 GOMP_MAP_VARS_TARGET);
> +      devicep->async_run_func (devicep->target_id, ttask->fn,
> +			       (void *) tgt_vars->tgt_start, data);

You need to
  void *fn_addr = gomp_get_target_fn_addr (devicep, ttask->fn);
first obviously, and pass fn_addr.
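I.e. the non-fallback path would end up looking roughly like this (a sketch
using your patch's names; what exactly happens after async_run_func returns
is the open task.c question discussed above):

      void *fn_addr = gomp_get_target_fn_addr (devicep, ttask->fn);
      struct target_mem_desc *tgt_vars
	= gomp_map_vars (devicep, ttask->mapnum, ttask->hostaddrs, NULL,
			 ttask->sizes, ttask->kinds, true,
			 GOMP_MAP_VARS_TARGET);
      devicep->async_run_func (devicep->target_id, fn_addr,
			       (void *) tgt_vars->tgt_start, data);
      /* Return to the tasking code instead of waiting here; the task needs
	 to be requeued in a WAITING-like state, and only once the plugin
	 reports completion should gomp_unmap_vars (tgt_vars, true) run.  */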
> +
> +      /* FIXME: TMP example of checking for completion.
> +	 Alternatively the plugin can set some completion flag in ttask.  */
> +      while (!devicep->async_is_completed_func (devicep->target_id, data))
> +	{
> +	  fprintf (stderr, "-");
> +	  usleep (100000);
> +	}

This obviously doesn't belong here.

>    if (device->capabilities & GOMP_OFFLOAD_CAP_OPENACC_200)

> diff --git a/libgomp/testsuite/libgomp.c/target-tmp.c b/libgomp/testsuite/libgomp.c/target-tmp.c
> new file mode 100644
> index 0000000..23a739c
> --- /dev/null
> +++ b/libgomp/testsuite/libgomp.c/target-tmp.c
> @@ -0,0 +1,40 @@
> +#include <stdio.h>
> +#include <unistd.h>
> +
> +#pragma omp declare target
> +void foo (int n)
> +{
> +  printf ("Start tgt %d\n", n);
> +  usleep (5000000);

5s is too long.  Not to mention that I'm not sure if PTX can do printf, and
especially usleep.

> diff --git a/liboffloadmic/plugin/libgomp-plugin-intelmic.cpp b/liboffloadmic/plugin/libgomp-plugin-intelmic.cpp
> index 26ac6fe..c843710 100644
> --- a/liboffloadmic/plugin/libgomp-plugin-intelmic.cpp
> +++ b/liboffloadmic/plugin/libgomp-plugin-intelmic.cpp
...
> +/* Set of asynchronously running target tasks.  */
> +static std::set<const void *> *async_tasks;
> +
>  /* Thread-safe registration of the main image.  */
>  static pthread_once_t main_image_is_registered = PTHREAD_ONCE_INIT;
>  
> +/* Mutex for protecting async_tasks.  */
> +static pthread_mutex_t async_tasks_lock = PTHREAD_MUTEX_INITIALIZER;
> +
>  static VarDesc vd_host2tgt = {
>    { 1, 1 },		/* dst, src */
>    { 1, 0 },		/* in, out */
> @@ -156,6 +163,8 @@ init (void)
>  
>  out:
>    address_table = new ImgDevAddrMap;
> +  async_tasks = new std::set<const void *>;
> +  pthread_mutex_init (&async_tasks_lock, NULL);

PTHREAD_MUTEX_INITIALIZER should already initialize the lock.
But, do you really need async_tasks and the lock at all?  Better to store
something into some plugin-owned field in the target_task struct and let the
plugin callback be passed the address of that field rather than the whole
target_task?

> diff --git a/liboffloadmic/runtime/offload_host.cpp b/liboffloadmic/runtime/offload_host.cpp
> index 08f626f..8cee12c 100644
> --- a/liboffloadmic/runtime/offload_host.cpp
> +++ b/liboffloadmic/runtime/offload_host.cpp
> @@ -64,6 +64,9 @@ static void __offload_fini_library(void);
>  #define GET_OFFLOAD_NUMBER(timer_data) \
>      timer_data? timer_data->offload_number : 0
>  
> +extern "C" void
> +__gomp_offload_intelmic_async_completed (const void *);
> +
>  extern "C" {
>  #ifdef TARGET_WINNT
>  // Windows does not support imports from libraries without actually
> @@ -2507,7 +2510,7 @@ extern "C" {
>      const void *info
>    )
>    {
> -    /* TODO: Call callback function, pass info.  */
> +    __gomp_offload_intelmic_async_completed (info);
>    }
>  }

Is this for the emul only, or KNL only, or both?  In any case, I'm not sure
how it works: this is in liboffloadmic.so, and the function is defined in
the plugin?

	Jakub