Re: [gomp4] OpenACC async re-work

Cesar Philippidis Wed, 28 Jun 2017 15:32:18 -0700

On 06/27/2017 03:56 AM, Chung-Lin Tang wrote:
> On 2017/6/27 6:45 AM, Cesar Philippidis wrote:
>>> (1) Instead of essentially implementing the entire OpenACC async support
>>> inside the plugin, we now use an opaque 'goacc_asyncqueue' implemented
>>> by the plugin, along with core 'test', 'synchronize', 'serialize', etc.
>>> plugin functions. Most of the OpenACC specific logic is pulled into
>>> libgomp/oacc-async.c
>> I'm not sure if plugins need to maintain backwards compatibility.
>> However, I don't see any changes inside libgomp.map, so maybe it's not
>> required.
> 
> This patch is pretty large, but only inner workings (including libgomp vs. 
> plugin interface) were modified.
> No user API compatibility was altered.
> 
>>> (3) For 'wait + async', we now add a local thread synchronize, instead
>>> of just ordering the streams.
>>>
>>> (4) To work with the (3) change, some front end changes were added to
>>> propagate argument-less wait clauses as 'wait(GOACC_ASYNC_NOVAL)' to
>>> represent a 'wait all'.
>> What's the significance of GOMP_ASYNC_NOVAL? Wouldn't it have been
>> easier to make that change in the gimplifier?
> 
> Actually, we were basically throwing away argument-less wait clauses in 
> front-ends
> before this patch; i.e. '#pragma acc parallel async' and '#pragma acc 
> parallel wait async'
> were internally the same.
>
> The use of GOMP_ASYNC_NOVAL (-1) was just following the current 'async' 
> clause representation
> convention.


So then then wait was implied before? Or maybe that's why 'wait async'
didn't work.

>>> Patch was tested to have no regressions on gomp-4_0-branch. I'll commit
>>> this after the weekend (or Tues.)
>>>     * plugin/plugin-nvptx.c (struct cuda_map): Remove.
>>>         (GOMP_OFFLOAD_openacc_exec): Adjust parameters and code.
>>>         (GOMP_OFFLOAD_openacc_async_exec): New plugin hook function.
>> These two functions seem extremely similar.  I wonder if you should
>> consolidate them.
> 
> It would be nice to have a proper set of pthreads based host fallback hooks
> for the openacc.async substruct later. Ideally, an accelerator plugin can
> just implement GOMP_OFFLOAD_openacc_exec, and the default host pthreads-based
> GOMP_OFFLOAD_openacc_async_exec can be implemented in terms of
> the synchronous GOMP_OFFLOAD_openacc_exec. Combining the two hook routines
> would make this less clean.

After looking at this some more, I like how your patch simplifies
things. This small bit of somewhat duplicated code is much better than
what we had before. So I'm ok with it.

>> Overall, I like how you were able eliminate the externally managed map_*
>> data structure which was used to pass in arguments to nvptx_exec.
>> Although I wonder if we should just pass in those individual arguments
>> directly to cuLaunchKernel. But that's a big change in itself.
> 
> I didn't think of that when working on the current patch, maybe later.

Here's some more comments regarding the code below. One high-level
comment regarding the usage of async-specific locks. Can't you get by
with using the global device lock, instead of a special async queue or
would that cause a deadlock?

> Index: libgomp/oacc-async.c
> ===================================================================
> --- libgomp/oacc-async.c      (revision 249620)
> +++ libgomp/oacc-async.c      (working copy)
> @@ -27,10 +27,85 @@
>     <http://www.gnu.org/licenses/>.  */
>
>  #include <assert.h>
> +#include <string.h>
>  #include "openacc.h"
>  #include "libgomp.h"
>  #include "oacc-int.h"
>
> +static struct goacc_thread *
> +get_goacc_thread (void)
> +{
> +  struct goacc_thread *thr = goacc_thread ();
> +  if (!thr || !thr->dev)
> +    gomp_fatal ("no device active");
> +  return thr;
> +}
> +
> +static struct gomp_device_descr *
> +get_goacc_thread_device (void)
> +{
> +  struct goacc_thread *thr = goacc_thread ();
> +
> +  if (!thr || !thr->dev)
> +    gomp_fatal ("no device active");
> +
> +  return thr->dev;
> +}

These two functions can be made public because a lot of other functioncs
can use them too. I don't know where to stash them though. You can
change that later though.

> +attribute_hidden struct goacc_asyncqueue *
> +lookup_goacc_asyncqueue (struct goacc_thread *thr, bool create, int
async)
> +{
> +  /* The special value acc_async_noval (-1) maps to the thread-specific
> +     default async stream.  */
> +  if (async == acc_async_noval)
> +    async = thr->default_async;

Is the default async queue device independent? I thought the default
async queue is defined in the acc_async_t enum. Maybe set

 async = acc_async_default

?

> +  if (async == acc_async_sync)
> +    return NULL;
> +
> +  if (async < 0)
> +    gomp_fatal ("bad async %d", async);
> +
> +  struct gomp_device_descr *dev = thr->dev;
> +
> +  if (!create
> +      && (async >= dev->openacc.async.nasyncqueue
> +       || !dev->openacc.async.asyncqueue[async]))
> +    return NULL;
> +
> +  gomp_mutex_lock (&dev->openacc.async.lock);
Is this lock sufficient? What happens if the device is released?

> +  if (async >= dev->openacc.async.nasyncqueue)
> +    {
Not your fault, but I wonder if we would be better off just hard-capping
the number of async queues. What happens if the user does something like
wait (1<<30)? That can be addressed later.

> +      int diff = async + 1 - dev->openacc.async.nasyncqueue;
> +      dev->openacc.async.asyncqueue
> +     = gomp_realloc (dev->openacc.async.asyncqueue,
> +                     sizeof (goacc_aq) * (async + 1));
> +      memset (dev->openacc.async.asyncqueue +
dev->openacc.async.nasyncqueue,
> +           0, sizeof (goacc_aq) * diff);
> +      dev->openacc.async.nasyncqueue = async + 1;
> +    }
> +
> +  if (!dev->openacc.async.asyncqueue[async])
> +    {
> +      dev->openacc.async.asyncqueue[async] =
dev->openacc.async.construct_func ();
> +
> +      /* Link new async queue into active list.  */
> +      goacc_aq_list n = gomp_malloc (sizeof (struct
goacc_asyncqueue_list));
> +      n->aq = dev->openacc.async.asyncqueue[async];
> +      n->next = dev->openacc.async.active;
> +      dev->openacc.async.active = n;
> +    }
> +  gomp_mutex_unlock (&dev->openacc.async.lock);
> +  return dev->openacc.async.asyncqueue[async];
> +}
> +
> +attribute_hidden struct goacc_asyncqueue *
> +get_goacc_asyncqueue (int async)
> +{
> +  struct goacc_thread *thr = get_goacc_thread ();
> +  return lookup_goacc_asyncqueue (thr, true, async);
> +}
> +
>  int
>  acc_async_test (int async)
>  {
> @@ -54,15 +129,14 @@ acc_async_test (int async)
>    if (!thr || !thr->dev)
>      gomp_fatal ("no device active");
>
> -  int res = thr->dev->openacc.async_test_func (async);
> -
>    if (profiling_setup_p)
>      {
>        thr->prof_info = NULL;
>        thr->api_info = NULL;
>      }
> -
> -  return res;
> +
> +  goacc_aq aq = lookup_goacc_asyncqueue (thr, true, async);
> +  return thr->dev->openacc.async.test_func (aq);
I'm not sure how the profling stuff works. Should the profling state be
state be set after calling thr->dev->openacc.async.test_func?

>  }
>
>  int
> @@ -69,7 +143,6 @@ int
>  acc_async_test_all (void)
>  {
>    struct goacc_thread *thr = goacc_thread ();
> -
>    acc_prof_info prof_info;
>    acc_api_info api_info;
>    bool profiling_setup_p
> @@ -79,8 +152,6 @@ acc_async_test_all (void)
>    if (!thr || !thr->dev)
>      gomp_fatal ("no device active");
>
> -  int res = thr->dev->openacc.async_test_all_func ();
> -
>    if (profiling_setup_p)
>      {
>        thr->prof_info = NULL;
> @@ -87,7 +158,17 @@ acc_async_test_all (void)
>        thr->api_info = NULL;
>      }
>
> -  return res;
> +  int ret = 1;
> +  /*struct goacc_thread *thr = get_goacc_thread ();*/
> +  gomp_mutex_lock (&thr->dev->openacc.async.lock);
> +  for (goacc_aq_list l = thr->dev->openacc.async.active; l; l = l->next)
> +    if (!thr->dev->openacc.async.test_func (l->aq))
> +      {
> +     ret = 0;
> +     break;
> +      }
> +  gomp_mutex_unlock (&thr->dev->openacc.async.lock);
> +  return ret;
Likewise.

>  }
>
>  void
> @@ -113,7 +194,8 @@ acc_wait (int async)
>    if (!thr || !thr->dev)
>      gomp_fatal ("no device active");
>
> -  thr->dev->openacc.async_wait_func (async);
> +  goacc_aq aq = lookup_goacc_asyncqueue (thr, true, async);
> +  thr->dev->openacc.async.synchronize_func (aq);
>
>    if (profiling_setup_p)
>      {

> Index: libgomp/oacc-cuda.c
> ===================================================================
> --- libgomp/oacc-cuda.c       (revision 249620)
> +++ libgomp/oacc-cuda.c       (working copy)
> @@ -99,17 +99,12 @@ acc_get_cuda_stream (int async)
>        prof_info.async_queue = prof_info.async;
>      }
>
> -  void *ret = NULL;
>    if (thr && thr->dev && thr->dev->openacc.cuda.get_stream_func)
> -    ret = thr->dev->openacc.cuda.get_stream_func (async);
> -
> -  if (profiling_setup_p)
>      {
> -      thr->prof_info = NULL;
> -      thr->api_info = NULL;
> +      goacc_aq aq = lookup_goacc_asyncqueue (thr, false, async);
> +      return aq ? thr->dev->openacc.cuda.get_stream_func (aq) : NULL;
Again, strange ordering fo profiling_setup_p.

By the way, why not use get_goacc_thread here and other places in this
function? Again, that's a problem for another day.

>      }
> -
> -  return ret;
> +  return NULL;
>  }
>
>  int
> @@ -138,7 +133,12 @@ acc_set_cuda_stream (int async, void *stream)
>
>    int ret = -1;
>    if (thr && thr->dev && thr->dev->openacc.cuda.set_stream_func)
> -    ret = thr->dev->openacc.cuda.set_stream_func (async, stream);
> +    {
> +      goacc_aq aq = get_goacc_asyncqueue (async);
> +      gomp_mutex_lock (&thr->dev->openacc.async.lock);
> +      ret = thr->dev->openacc.cuda.set_stream_func (aq, stream);
> +      gomp_mutex_unlock (&thr->dev->openacc.async.lock);
> +    }
>
>    if (profiling_setup_p)
>      {
> Index: libgomp/oacc-int.h
> ===================================================================
> --- libgomp/oacc-int.h        (revision 249620)
> +++ libgomp/oacc-int.h        (working copy)
> @@ -109,6 +109,15 @@ void goacc_restore_bind (void);
>  void goacc_lazy_initialize (void);
>  void goacc_host_init (void);
>
> +void goacc_init_asyncqueues (struct gomp_device_descr *);
> +bool goacc_fini_asyncqueues (struct gomp_device_descr *);
> +void goacc_async_copyout_unmap_vars (struct target_mem_desc *,
> +                                  struct goacc_asyncqueue *);
> +void goacc_async_free (struct gomp_device_descr *,
> +                    struct goacc_asyncqueue *, void *);
> +struct goacc_asyncqueue *get_goacc_asyncqueue (int);
> +struct goacc_asyncqueue *lookup_goacc_asyncqueue (struct goacc_thread
*, bool, int);
> +
>  void goacc_profiling_initialize (void);
>  bool goacc_profiling_setup_p (struct goacc_thread *,
>                             acc_prof_info *, acc_api_info *);
> Index: libgomp/oacc-mem.c
> ===================================================================
> --- libgomp/oacc-mem.c        (revision 249620)
> +++ libgomp/oacc-mem.c        (working copy)
> @@ -224,19 +224,12 @@ memcpy_tofrom_device (bool from, void *d, void *h,
>        goto out;
>      }
>
> -  if (async > acc_async_sync)
> -    thr->dev->openacc.async_set_async_func (async);
> +  goacc_aq aq = get_goacc_asyncqueue (async);
> +  if (from)
> +    gomp_copy_dev2host (thr->dev, aq, h, d, s);
> +  else
> +    gomp_copy_host2dev (thr->dev, aq, d, h, s);
>
> -  bool ret = (from
> -           ? thr->dev->dev2host_func (thr->dev->target_id, h, d, s)
> -           : thr->dev->host2dev_func (thr->dev->target_id, d, h, s));
> -
> -  if (async > acc_async_sync)
> -    thr->dev->openacc.async_set_async_func (acc_async_sync);
> -
> -  if (!ret)
> -    gomp_fatal ("error in %s", libfnname);
> -
>   out:
>    if (profiling_setup_p)
>      {
> @@ -381,7 +374,7 @@ acc_is_present (void *h, size_t s)
>
>    gomp_mutex_unlock (&acc_dev->lock);
>
> -  return n != NULL;
> +  return (n ? 1 : 0);
>  }
>
>  /* Create a mapping for host [H,+S] -> device [D,+S] */
> @@ -613,17 +606,13 @@ present_create_copy (unsigned f, void *h, size_t s
>
>        gomp_mutex_unlock (&acc_dev->lock);
>
> -      if (async > acc_async_sync)
> -     acc_dev->openacc.async_set_async_func (async);
> +      goacc_aq aq = get_goacc_asyncqueue (async);

Do you want to call async_set_async_func outside of the protection of
acc_dev->lock?

> -      tgt = gomp_map_vars (acc_dev, mapnum, &hostaddrs, NULL, &s,
&kinds, true,
> -                        GOMP_MAP_VARS_OPENACC);
> +      tgt = gomp_map_vars_async (acc_dev, aq, mapnum, &hostaddrs,
NULL, &s,
> +                              &kinds, true, GOMP_MAP_VARS_OPENACC);
>        /* Initialize dynamic refcount.  */
>        tgt->list[0].key->dynamic_refcount = 1;
>
> -      if (async > acc_async_sync)
> -     acc_dev->openacc.async_set_async_func (acc_async_sync);
> -
>        gomp_mutex_lock (&acc_dev->lock);
>
>        d = tgt->to_free;
> @@ -798,11 +787,8 @@ delete_copyout (unsigned f, void *h, size_t s, int
>
>        if (f & FLAG_COPYOUT)
>       {
> -       if (async > acc_async_sync)
> -         acc_dev->openacc.async_set_async_func (async);
> -       acc_dev->dev2host_func (acc_dev->target_id, h, d, s);
> -       if (async > acc_async_sync)
> -         acc_dev->openacc.async_set_async_func (acc_async_sync);
> +       goacc_aq aq = get_goacc_asyncqueue (async);
> +       gomp_copy_dev2host (acc_dev, aq, h, d, s);
>       }
>        gomp_remove_var (acc_dev, n);
>      }
> @@ -904,19 +890,15 @@ update_dev_host (int is_dev, void *h, size_t s, in
>    d = (void *) (n->tgt->tgt_start + n->tgt_offset
>               + (uintptr_t) h - n->host_start);
>
> -  if (async > acc_async_sync)
> -    acc_dev->openacc.async_set_async_func (async);
> +  goacc_aq aq = get_goacc_asyncqueue (async);
>
>    if (is_dev)
> -    acc_dev->host2dev_func (acc_dev->target_id, d, h, s);
> +    gomp_copy_host2dev (acc_dev, aq, d, h, s);
>    else
> -    acc_dev->dev2host_func (acc_dev->target_id, h, d, s);
> +    gomp_copy_dev2host (acc_dev, aq, h, d, s);
>
> -  if (async > acc_async_sync)
> -    acc_dev->openacc.async_set_async_func (acc_async_sync);
> -
Why did you remove this, but not add a clal to set_goacc_asyncqueue?
Maybe it's redundant.

>    gomp_mutex_unlock (&acc_dev->lock);
> -
> +
>    if (profiling_setup_p)
>      {
>        thr->prof_info = NULL;
> @@ -978,7 +960,7 @@ gomp_acc_declare_allocate (bool allocate, size_t m
>
>  void
>  gomp_acc_insert_pointer (size_t mapnum, void **hostaddrs, size_t *sizes,
> -                      void *kinds)
> +                      void *kinds, int async)
>  {
>    struct target_mem_desc *tgt;
>    struct goacc_thread *thr = goacc_thread ();
> @@ -1008,8 +990,9 @@ gomp_acc_insert_pointer (size_t mapnum, void **hos
>      }
>
>    gomp_debug (0, "  %s: prepare mappings\n", __FUNCTION__);
> -  tgt = gomp_map_vars (acc_dev, mapnum, hostaddrs,
> -                    NULL, sizes, kinds, true, GOMP_MAP_VARS_OPENACC);
> +  goacc_aq aq = get_goacc_asyncqueue (async);
> +  tgt = gomp_map_vars_async (acc_dev, aq, mapnum, hostaddrs,
> +                          NULL, sizes, kinds, true, GOMP_MAP_VARS_OPENACC);
>    gomp_debug (0, "  %s: mappings prepared\n", __FUNCTION__);
>
>    /* Initialize dynamic refcount.  */
> @@ -1098,11 +1081,15 @@ gomp_acc_remove_pointer (void *h, size_t s, bool f
>           t->list[i].copy_from = force_copyfrom ? 1 : 0;
>           break;
>         }
> -      if (async > acc_async_sync)
> -     acc_dev->openacc.async_set_async_func (async);
> -      gomp_unmap_vars (t, true);
> -      if (async > acc_async_sync)
> -     acc_dev->openacc.async_set_async_func (acc_async_sync);
> +
> +      /* If running synchronously, unmap immediately.  */
> +      if (async < acc_async_noval)
> +     gomp_unmap_vars (t, true);
> +      else
> +     {
> +       goacc_aq aq = get_goacc_asyncqueue (async);
> +       goacc_async_copyout_unmap_vars (t, aq);
> +     }
>      }
>
>    gomp_mutex_unlock (&acc_dev->lock);
> Index: libgomp/oacc-parallel.c
> ===================================================================
> --- libgomp/oacc-parallel.c   (revision 249620)
> +++ libgomp/oacc-parallel.c   (working copy)
> @@ -215,7 +215,9 @@ GOACC_parallel_keyed (int device, void (*fn) (void
>        fn (hostaddrs);
>        goto out;
>      }
> -
> +  else if (profiling_dispatch_p)
> +    api_info.device_api = acc_device_api_cuda;
> +
That seems target specific. Does that belong in the generic code path?

>    /* Default: let the runtime choose.  */
>    for (i = 0; i != GOMP_DIM_MAX; i++)
>      dims[i] = 0;
> @@ -260,10 +262,14 @@ GOACC_parallel_keyed (int device, void (*fn) (void
>
>       case GOMP_LAUNCH_WAIT:
>         {
> -         unsigned num_waits = GOMP_LAUNCH_OP (tag);
> +         /* Be careful to cast the op field as a signed 16-bit, and
> +            sign-extend to full integer.  */
> +         int num_waits = ((signed short) GOMP_LAUNCH_OP (tag));
>
> -         if (num_waits)
> +         if (num_waits > 0)
>             goacc_wait (async, num_waits, &ap);
> +         else if (num_waits == acc_async_noval)
> +           acc_wait_all_async (async);
>           break;
>         }
>

Cesar

Re: [gomp4] OpenACC async re-work

Reply via email to