On 06/27/2017 03:56 AM, Chung-Lin Tang wrote: > On 2017/6/27 6:45 AM, Cesar Philippidis wrote: >>> (1) Instead of essentially implementing the entire OpenACC async support >>> inside the plugin, we now use an opaque 'goacc_asyncqueue' implemented >>> by the plugin, along with core 'test', 'synchronize', 'serialize', etc. >>> plugin functions. Most of the OpenACC specific logic is pulled into >>> libgomp/oacc-async.c >> I'm not sure if plugins need to maintain backwards compatibility. >> However, I don't see any changes inside libgomp.map, so maybe it's not >> required. > > This patch is pretty large, but only inner workings (including libgomp vs. > plugin interface) were modified. > No user API compatibility was altered. > >>> (3) For 'wait + async', we now add a local thread synchronize, instead >>> of just ordering the streams. >>> >>> (4) To work with the (3) change, some front end changes were added to >>> propagate argument-less wait clauses as 'wait(GOACC_ASYNC_NOVAL)' to >>> represent a 'wait all'. >> What's the significance of GOMP_ASYNC_NOVAL? Wouldn't it have been >> easier to make that change in the gimplifier? > > Actually, we were basically throwing away argument-less wait clauses in > front-ends > before this patch; i.e. '#pragma acc parallel async' and '#pragma acc > parallel wait async' > were internally the same. > > The use of GOMP_ASYNC_NOVAL (-1) was just following the current 'async' > clause representation > convention.
So then then wait was implied before? Or maybe that's why 'wait async' didn't work. >>> Patch was tested to have no regressions on gomp-4_0-branch. I'll commit >>> this after the weekend (or Tues.) >>> * plugin/plugin-nvptx.c (struct cuda_map): Remove. >>> (GOMP_OFFLOAD_openacc_exec): Adjust parameters and code. >>> (GOMP_OFFLOAD_openacc_async_exec): New plugin hook function. >> These two functions seem extremely similar. I wonder if you should >> consolidate them. > > It would be nice to have a proper set of pthreads based host fallback hooks > for the openacc.async substruct later. Ideally, an accelerator plugin can > just implement GOMP_OFFLOAD_openacc_exec, and the default host pthreads-based > GOMP_OFFLOAD_openacc_async_exec can be implemented in terms of > the synchronous GOMP_OFFLOAD_openacc_exec. Combining the two hook routines > would make this less clean. After looking at this some more, I like how your patch simplifies things. This small bit of somewhat duplicated code is much better than what we had before. So I'm ok with it. >> Overall, I like how you were able eliminate the externally managed map_* >> data structure which was used to pass in arguments to nvptx_exec. >> Although I wonder if we should just pass in those individual arguments >> directly to cuLaunchKernel. But that's a big change in itself. > > I didn't think of that when working on the current patch, maybe later. Here's some more comments regarding the code below. One high-level comment regarding the usage of async-specific locks. Can't you get by with using the global device lock, instead of a special async queue or would that cause a deadlock? > Index: libgomp/oacc-async.c > =================================================================== > --- libgomp/oacc-async.c (revision 249620) > +++ libgomp/oacc-async.c (working copy) > @@ -27,10 +27,85 @@ > <http://www.gnu.org/licenses/>. */ > > #include <assert.h> > +#include <string.h> > #include "openacc.h" > #include "libgomp.h" > #include "oacc-int.h" > > +static struct goacc_thread * > +get_goacc_thread (void) > +{ > + struct goacc_thread *thr = goacc_thread (); > + if (!thr || !thr->dev) > + gomp_fatal ("no device active"); > + return thr; > +} > + > +static struct gomp_device_descr * > +get_goacc_thread_device (void) > +{ > + struct goacc_thread *thr = goacc_thread (); > + > + if (!thr || !thr->dev) > + gomp_fatal ("no device active"); > + > + return thr->dev; > +} These two functions can be made public because a lot of other functioncs can use them too. I don't know where to stash them though. You can change that later though. > +attribute_hidden struct goacc_asyncqueue * > +lookup_goacc_asyncqueue (struct goacc_thread *thr, bool create, int async) > +{ > + /* The special value acc_async_noval (-1) maps to the thread-specific > + default async stream. */ > + if (async == acc_async_noval) > + async = thr->default_async; Is the default async queue device independent? I thought the default async queue is defined in the acc_async_t enum. Maybe set async = acc_async_default ? > + if (async == acc_async_sync) > + return NULL; > + > + if (async < 0) > + gomp_fatal ("bad async %d", async); > + > + struct gomp_device_descr *dev = thr->dev; > + > + if (!create > + && (async >= dev->openacc.async.nasyncqueue > + || !dev->openacc.async.asyncqueue[async])) > + return NULL; > + > + gomp_mutex_lock (&dev->openacc.async.lock); Is this lock sufficient? What happens if the device is released? > + if (async >= dev->openacc.async.nasyncqueue) > + { Not your fault, but I wonder if we would be better off just hard-capping the number of async queues. What happens if the user does something like wait (1<<30)? That can be addressed later. > + int diff = async + 1 - dev->openacc.async.nasyncqueue; > + dev->openacc.async.asyncqueue > + = gomp_realloc (dev->openacc.async.asyncqueue, > + sizeof (goacc_aq) * (async + 1)); > + memset (dev->openacc.async.asyncqueue + dev->openacc.async.nasyncqueue, > + 0, sizeof (goacc_aq) * diff); > + dev->openacc.async.nasyncqueue = async + 1; > + } > + > + if (!dev->openacc.async.asyncqueue[async]) > + { > + dev->openacc.async.asyncqueue[async] = dev->openacc.async.construct_func (); > + > + /* Link new async queue into active list. */ > + goacc_aq_list n = gomp_malloc (sizeof (struct goacc_asyncqueue_list)); > + n->aq = dev->openacc.async.asyncqueue[async]; > + n->next = dev->openacc.async.active; > + dev->openacc.async.active = n; > + } > + gomp_mutex_unlock (&dev->openacc.async.lock); > + return dev->openacc.async.asyncqueue[async]; > +} > + > +attribute_hidden struct goacc_asyncqueue * > +get_goacc_asyncqueue (int async) > +{ > + struct goacc_thread *thr = get_goacc_thread (); > + return lookup_goacc_asyncqueue (thr, true, async); > +} > + > int > acc_async_test (int async) > { > @@ -54,15 +129,14 @@ acc_async_test (int async) > if (!thr || !thr->dev) > gomp_fatal ("no device active"); > > - int res = thr->dev->openacc.async_test_func (async); > - > if (profiling_setup_p) > { > thr->prof_info = NULL; > thr->api_info = NULL; > } > - > - return res; > + > + goacc_aq aq = lookup_goacc_asyncqueue (thr, true, async); > + return thr->dev->openacc.async.test_func (aq); I'm not sure how the profling stuff works. Should the profling state be state be set after calling thr->dev->openacc.async.test_func? > } > > int > @@ -69,7 +143,6 @@ int > acc_async_test_all (void) > { > struct goacc_thread *thr = goacc_thread (); > - > acc_prof_info prof_info; > acc_api_info api_info; > bool profiling_setup_p > @@ -79,8 +152,6 @@ acc_async_test_all (void) > if (!thr || !thr->dev) > gomp_fatal ("no device active"); > > - int res = thr->dev->openacc.async_test_all_func (); > - > if (profiling_setup_p) > { > thr->prof_info = NULL; > @@ -87,7 +158,17 @@ acc_async_test_all (void) > thr->api_info = NULL; > } > > - return res; > + int ret = 1; > + /*struct goacc_thread *thr = get_goacc_thread ();*/ > + gomp_mutex_lock (&thr->dev->openacc.async.lock); > + for (goacc_aq_list l = thr->dev->openacc.async.active; l; l = l->next) > + if (!thr->dev->openacc.async.test_func (l->aq)) > + { > + ret = 0; > + break; > + } > + gomp_mutex_unlock (&thr->dev->openacc.async.lock); > + return ret; Likewise. > } > > void > @@ -113,7 +194,8 @@ acc_wait (int async) > if (!thr || !thr->dev) > gomp_fatal ("no device active"); > > - thr->dev->openacc.async_wait_func (async); > + goacc_aq aq = lookup_goacc_asyncqueue (thr, true, async); > + thr->dev->openacc.async.synchronize_func (aq); > > if (profiling_setup_p) > { > Index: libgomp/oacc-cuda.c > =================================================================== > --- libgomp/oacc-cuda.c (revision 249620) > +++ libgomp/oacc-cuda.c (working copy) > @@ -99,17 +99,12 @@ acc_get_cuda_stream (int async) > prof_info.async_queue = prof_info.async; > } > > - void *ret = NULL; > if (thr && thr->dev && thr->dev->openacc.cuda.get_stream_func) > - ret = thr->dev->openacc.cuda.get_stream_func (async); > - > - if (profiling_setup_p) > { > - thr->prof_info = NULL; > - thr->api_info = NULL; > + goacc_aq aq = lookup_goacc_asyncqueue (thr, false, async); > + return aq ? thr->dev->openacc.cuda.get_stream_func (aq) : NULL; Again, strange ordering fo profiling_setup_p. By the way, why not use get_goacc_thread here and other places in this function? Again, that's a problem for another day. > } > - > - return ret; > + return NULL; > } > > int > @@ -138,7 +133,12 @@ acc_set_cuda_stream (int async, void *stream) > > int ret = -1; > if (thr && thr->dev && thr->dev->openacc.cuda.set_stream_func) > - ret = thr->dev->openacc.cuda.set_stream_func (async, stream); > + { > + goacc_aq aq = get_goacc_asyncqueue (async); > + gomp_mutex_lock (&thr->dev->openacc.async.lock); > + ret = thr->dev->openacc.cuda.set_stream_func (aq, stream); > + gomp_mutex_unlock (&thr->dev->openacc.async.lock); > + } > > if (profiling_setup_p) > { > Index: libgomp/oacc-int.h > =================================================================== > --- libgomp/oacc-int.h (revision 249620) > +++ libgomp/oacc-int.h (working copy) > @@ -109,6 +109,15 @@ void goacc_restore_bind (void); > void goacc_lazy_initialize (void); > void goacc_host_init (void); > > +void goacc_init_asyncqueues (struct gomp_device_descr *); > +bool goacc_fini_asyncqueues (struct gomp_device_descr *); > +void goacc_async_copyout_unmap_vars (struct target_mem_desc *, > + struct goacc_asyncqueue *); > +void goacc_async_free (struct gomp_device_descr *, > + struct goacc_asyncqueue *, void *); > +struct goacc_asyncqueue *get_goacc_asyncqueue (int); > +struct goacc_asyncqueue *lookup_goacc_asyncqueue (struct goacc_thread *, bool, int); > + > void goacc_profiling_initialize (void); > bool goacc_profiling_setup_p (struct goacc_thread *, > acc_prof_info *, acc_api_info *); > Index: libgomp/oacc-mem.c > =================================================================== > --- libgomp/oacc-mem.c (revision 249620) > +++ libgomp/oacc-mem.c (working copy) > @@ -224,19 +224,12 @@ memcpy_tofrom_device (bool from, void *d, void *h, > goto out; > } > > - if (async > acc_async_sync) > - thr->dev->openacc.async_set_async_func (async); > + goacc_aq aq = get_goacc_asyncqueue (async); > + if (from) > + gomp_copy_dev2host (thr->dev, aq, h, d, s); > + else > + gomp_copy_host2dev (thr->dev, aq, d, h, s); > > - bool ret = (from > - ? thr->dev->dev2host_func (thr->dev->target_id, h, d, s) > - : thr->dev->host2dev_func (thr->dev->target_id, d, h, s)); > - > - if (async > acc_async_sync) > - thr->dev->openacc.async_set_async_func (acc_async_sync); > - > - if (!ret) > - gomp_fatal ("error in %s", libfnname); > - > out: > if (profiling_setup_p) > { > @@ -381,7 +374,7 @@ acc_is_present (void *h, size_t s) > > gomp_mutex_unlock (&acc_dev->lock); > > - return n != NULL; > + return (n ? 1 : 0); > } > > /* Create a mapping for host [H,+S] -> device [D,+S] */ > @@ -613,17 +606,13 @@ present_create_copy (unsigned f, void *h, size_t s > > gomp_mutex_unlock (&acc_dev->lock); > > - if (async > acc_async_sync) > - acc_dev->openacc.async_set_async_func (async); > + goacc_aq aq = get_goacc_asyncqueue (async); Do you want to call async_set_async_func outside of the protection of acc_dev->lock? > - tgt = gomp_map_vars (acc_dev, mapnum, &hostaddrs, NULL, &s, &kinds, true, > - GOMP_MAP_VARS_OPENACC); > + tgt = gomp_map_vars_async (acc_dev, aq, mapnum, &hostaddrs, NULL, &s, > + &kinds, true, GOMP_MAP_VARS_OPENACC); > /* Initialize dynamic refcount. */ > tgt->list[0].key->dynamic_refcount = 1; > > - if (async > acc_async_sync) > - acc_dev->openacc.async_set_async_func (acc_async_sync); > - > gomp_mutex_lock (&acc_dev->lock); > > d = tgt->to_free; > @@ -798,11 +787,8 @@ delete_copyout (unsigned f, void *h, size_t s, int > > if (f & FLAG_COPYOUT) > { > - if (async > acc_async_sync) > - acc_dev->openacc.async_set_async_func (async); > - acc_dev->dev2host_func (acc_dev->target_id, h, d, s); > - if (async > acc_async_sync) > - acc_dev->openacc.async_set_async_func (acc_async_sync); > + goacc_aq aq = get_goacc_asyncqueue (async); > + gomp_copy_dev2host (acc_dev, aq, h, d, s); > } > gomp_remove_var (acc_dev, n); > } > @@ -904,19 +890,15 @@ update_dev_host (int is_dev, void *h, size_t s, in > d = (void *) (n->tgt->tgt_start + n->tgt_offset > + (uintptr_t) h - n->host_start); > > - if (async > acc_async_sync) > - acc_dev->openacc.async_set_async_func (async); > + goacc_aq aq = get_goacc_asyncqueue (async); > > if (is_dev) > - acc_dev->host2dev_func (acc_dev->target_id, d, h, s); > + gomp_copy_host2dev (acc_dev, aq, d, h, s); > else > - acc_dev->dev2host_func (acc_dev->target_id, h, d, s); > + gomp_copy_dev2host (acc_dev, aq, h, d, s); > > - if (async > acc_async_sync) > - acc_dev->openacc.async_set_async_func (acc_async_sync); > - Why did you remove this, but not add a clal to set_goacc_asyncqueue? Maybe it's redundant. > gomp_mutex_unlock (&acc_dev->lock); > - > + > if (profiling_setup_p) > { > thr->prof_info = NULL; > @@ -978,7 +960,7 @@ gomp_acc_declare_allocate (bool allocate, size_t m > > void > gomp_acc_insert_pointer (size_t mapnum, void **hostaddrs, size_t *sizes, > - void *kinds) > + void *kinds, int async) > { > struct target_mem_desc *tgt; > struct goacc_thread *thr = goacc_thread (); > @@ -1008,8 +990,9 @@ gomp_acc_insert_pointer (size_t mapnum, void **hos > } > > gomp_debug (0, " %s: prepare mappings\n", __FUNCTION__); > - tgt = gomp_map_vars (acc_dev, mapnum, hostaddrs, > - NULL, sizes, kinds, true, GOMP_MAP_VARS_OPENACC); > + goacc_aq aq = get_goacc_asyncqueue (async); > + tgt = gomp_map_vars_async (acc_dev, aq, mapnum, hostaddrs, > + NULL, sizes, kinds, true, GOMP_MAP_VARS_OPENACC); > gomp_debug (0, " %s: mappings prepared\n", __FUNCTION__); > > /* Initialize dynamic refcount. */ > @@ -1098,11 +1081,15 @@ gomp_acc_remove_pointer (void *h, size_t s, bool f > t->list[i].copy_from = force_copyfrom ? 1 : 0; > break; > } > - if (async > acc_async_sync) > - acc_dev->openacc.async_set_async_func (async); > - gomp_unmap_vars (t, true); > - if (async > acc_async_sync) > - acc_dev->openacc.async_set_async_func (acc_async_sync); > + > + /* If running synchronously, unmap immediately. */ > + if (async < acc_async_noval) > + gomp_unmap_vars (t, true); > + else > + { > + goacc_aq aq = get_goacc_asyncqueue (async); > + goacc_async_copyout_unmap_vars (t, aq); > + } > } > > gomp_mutex_unlock (&acc_dev->lock); > Index: libgomp/oacc-parallel.c > =================================================================== > --- libgomp/oacc-parallel.c (revision 249620) > +++ libgomp/oacc-parallel.c (working copy) > @@ -215,7 +215,9 @@ GOACC_parallel_keyed (int device, void (*fn) (void > fn (hostaddrs); > goto out; > } > - > + else if (profiling_dispatch_p) > + api_info.device_api = acc_device_api_cuda; > + That seems target specific. Does that belong in the generic code path? > /* Default: let the runtime choose. */ > for (i = 0; i != GOMP_DIM_MAX; i++) > dims[i] = 0; > @@ -260,10 +262,14 @@ GOACC_parallel_keyed (int device, void (*fn) (void > > case GOMP_LAUNCH_WAIT: > { > - unsigned num_waits = GOMP_LAUNCH_OP (tag); > + /* Be careful to cast the op field as a signed 16-bit, and > + sign-extend to full integer. */ > + int num_waits = ((signed short) GOMP_LAUNCH_OP (tag)); > > - if (num_waits) > + if (num_waits > 0) > goacc_wait (async, num_waits, &ap); > + else if (num_waits == acc_async_noval) > + acc_wait_all_async (async); > break; > } > Cesar