Hi!

CCing various people, because I'd like to have something that won't work on
XeonPhi only.
On Fri, Oct 02, 2015 at 10:28:01PM +0300, Ilya Verbin wrote:
> On Tue, Sep 08, 2015 at 11:20:14 +0200, Jakub Jelinek wrote:
> > nowait support for #pragma omp target is not implemented yet; supposedly we
> > need to mark those somehow (some flag) already in the struct gomp_task
> > structure.  Essentially it will need either 2 or 3 callbacks:
> > the current one, executed when the dependencies are resolved (it actually
> > waits until some thread schedules it after that point; I think it is
> > undesirable to run it with the tasking lock held), which would perform
> > the gomp_map_vars and initiate the running of the region; then some
> > query routine which would poll the plugin whether the task is done or not,
> > and either perform the finalization (unmap_vars) if it is done (and in any
> > case return a bool saying whether it should be polled again or not); and,
> > if the finalization is not done there, also another callback for the
> > finalization.
> > Also, there is the issue that if we are waiting for a task that needs to be
> > polled, and we don't have any further tasks to run, we shouldn't really
> > attempt to sleep on some semaphore (e.g. in taskwait, end of
> > taskgroup, etc.) or barrier, but rather either need to keep polling it, or
> > call the query hook with some argument telling it to sleep in there until
> > the work is done by the offloading device.
> > Also, there needs to be a way for the target nowait first callback to say
> > that it is using host fallback and thus acts as a normal task, so that
> > once the task fn finishes, the task is done.
>
> Here is my WIP patch.  The target.c part is obviously incorrect, but it
> demonstrates a possible libgomp <-> plugin interface for running a target
> task function asynchronously and checking whether it is completed or not.
> (Refactored liboffloadmic/runtime/emulator from trunk is required to run
> the target-tmp.c testcase.)

The difficulty is designing something that will work (fast, if possible) on
the various devices we want to eventually support (at least XeonPhi, XeonPhi
emul, PTX/Cuda and HSA), ideally without too much busy waiting.

The OpenMP 4.5 spec says that there is a special "target task" on the host
side around the target region, that the "target task" is mergeable and, if
nowait is not specified, included (otherwise it may be), and that the
mapping operations (which include target device memory allocation, refcount
management and mapping data structure updates, as well as the memory copying
to the target device) happen only after the (optional) dependencies are
satisfied.  After the memory mapping operations are done, the offloading
kernel starts, and when it finishes, the unmapping operations are performed
(which include memory copying from the target device, refcount management
and mapping data structure updates, and finally memory deallocation).

Right now on the OpenMP side everything is synchronous; e.g. target
enter/exit data and update are asynchronous only in that the mapping or
unmapping operation is scheduled as a task, but the whole mapping or
unmapping operation, including all the above mentioned subparts, is
performed while holding the particular device's lock.

To make that more asynchronous, e.g. for Cuda we might want to use Cuda
(non-default) streams, and perform the allocation, refcount management and
mapping data structure updates, and the data copying to the device, already
as part of the stream.
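To make the above concrete, the rough shape of the plugin hooks I have in
mind is something like the following (just a sketch, not a final interface;
async_run_func and async_is_completed_func follow your patch, the third hook
and the exact signatures are made up):

  /* Sketch of possible gomp_device_descr hooks; all signatures tentative.  */

  /* Called once the dependencies are resolved, without the tasking lock
     held.  Starts the offload kernel asynchronously (after gomp_map_vars);
     returns true if host fallback was used, i.e. the task then behaves
     like a normal task and is done once the task fn returns.  */
  bool (*async_run_func) (int target_id, void *fn_addr, void *vars,
			  void *async_data);

  /* Polled by the tasking code; returns true once the device has finished,
     at which point the finalization (unmap_vars) can be performed, either
     here or in the third hook below.  */
  bool (*async_is_completed_func) (int target_id, void *async_data);

  /* Hypothetical third hook: blocking wait, for when there are no host
     tasks left and we'd otherwise just busy poll.  */
  void (*async_wait_func) (int target_id, void *async_data);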
Except that it means that if another target mapping/unmapping operation is
enqueued at that point and it refers to any of the affected objects, it
could acquire the device lock, yet the data copying would still be in flux.
Dunno if it would be acceptable to e.g. add some flags to the mapping data
structures, saying this memory range either has pending data transfers, or
has enqueued data transfers that depend on whether the refcount will become
zero or not.  When mapping, if we wanted to touch any of the regions marked
with such an in_flux flag, we'd need to wait until all of the other stream's
operations finish and the unmapping operations are performed (and the device
lock released again) before continuing.  That way we could get good
performance if either concurrent async regions touch different variables, or
target data or non-async target enter data or exit data has been put around
the mappings so the streams can be independent; but worst case we'd make
them non-concurrent.

Anyway, let's put the asynchronous memory data transfers (which also imply
the ability to enqueue multiple different target regions into a stream for
the device to operate on independently from the host) aside for now, just
discuss what we want for the actual async execution, and for now keep a
device lock around all the mapping or unmapping operations.

If the "target task" has unresolved dependencies, then it will use the
existing task.c waiting code first (if the above is resolved somehow, there
could be exceptions for a "target task" depending on another "target task").
When the dependencies are resolved, we can run the gomp_target_task_fn
callback (but not with the team's tasking lock held), which can perform the
gomp_map_vars call and start the async execution.  For host fallback, that
is all we do; the task is at this point a normal task.  For an offloading
task, we now want the host to continue scheduling other tasks if there are
any, which means (not currently implemented on the task.c side) we want to
move the task somewhere so that we don't consider it finished, knowing that
we'll need to schedule it again at some point to perform the unmapping
(perhaps requeue it again in a WAITING or some other state).  Right now, the
tasking code would in that case try to schedule another task, and if there
are none, or none that are runnable among the tasks of interest, it can go
to sleep and expects to be awakened when some task it is waiting for
finishes.

And the main question is how to find out on the various devices whether the
async execution has completed already.

From what I can see in the liboffloadmic plugin, you have an extra host
thread that can run a callback function on the host.  Such a callback could,
say, tweak the state of the "target task", could take the team's tasking
lock, and even wake sleepers, maybe even take the device lock and perform
the unmapping of vars?  The function would need to be in task.c so that it
can access everything defined in there.  Or the callback could just change
something in the "target task" state and let the tasking code poll for the
change.
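E.g. something as simple as (a sketch only; the done field and the callback
are hypothetical):

  /* Hypothetical field in struct gomp_target_task, written by the plugin
     (or its helper thread) on completion, read by the tasking code.  */
  int done;

  /* What the host-side callback could do.  */
  static void
  gomp_target_task_completion (void *data)
  {
    struct gomp_target_task *ttask = (struct gomp_target_task *) data;
    __atomic_store_n (&ttask->done, 1, MEMMODEL_RELEASE);
    /* Perhaps also take team->task_lock, requeue the task and
       gomp_sem_post any sleepers, or even perform the unmapping;
       that is the part still to be decided.  */
  }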
Looking at Cuda, for async target region kernels we'd probably use a
non-default stream and enqueue the async kernel in there.  I see we can e.g.
cudaEventRecord into the stream and then either cudaEventQuery to busy poll
the event, or cudaEventSynchronize to block until the event occurs; plus
there is cudaStreamWaitEvent that perhaps might even be used to resolve the
above mentioned mapping/unmapping async issues for Cuda - like add an event
after the mapping operations that the other target tasks could wait for if
they see any in_flux stuff, wait for an event, etc.  I don't see a
possibility to have something like a callback on stream completion though,
so it has to be handled with polling.  If that is true, it means the tasking
code can't go to sleep if there are any pending target tasks it wants to
wait for (at least for devices that can't do a callback); it would need to
call the poll methods of the plugins it wants to wait for in a loop (unless
there are no host tasks left and only a single device is involved, in which
case it could call a blocking method).

For HSA I have no idea.

Now, for the polling case, the question is how expensive the polling is,
i.e. whether it can be performed with the team's tasking lock held or not.
If XeonPhi doesn't do the full host callback, but polling, it could just
read some memory from the target_task struct and thus be fast enough to run
with the lock held.  How expensive is cudaEventQuery?
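For reference, the event based variant I mean would look roughly like this
(untested; runtime API shown for brevity, a plugin would presumably use the
driver API counterparts cuEventRecord/cuEventQuery/cuEventSynchronize;
offload_kernel and tgt_start are placeholders):

  cudaStream_t stream;
  cudaEvent_t ev;
  cudaStreamCreate (&stream);
  cudaEventCreateWithFlags (&ev, cudaEventDisableTiming);

  /* Enqueue the target region kernel into the stream and record an
     event right after it.  */
  offload_kernel<<<1, 128, 0, stream>>> (tgt_start);
  cudaEventRecord (ev, stream);

  /* Non-blocking poll; the question above is whether this is cheap
     enough to call with the tasking lock held.  */
  bool done = (cudaEventQuery (ev) == cudaSuccess);

  /* Blocking variant for when there is no other work left.  */
  cudaEventSynchronize (ev);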
> diff --git a/libgomp/target.c b/libgomp/target.c
> index 77bd442..31f034c 100644
> --- a/libgomp/target.c
> +++ b/libgomp/target.c
> @@ -45,6 +45,10 @@
>  #include "plugin-suffix.h"
>  #endif
>  
> +/* FIXME: TMP */
> +#include <stdio.h>
> +#include <unistd.h>

I hope you mean to remove this later on.

> @@ -1227,6 +1231,44 @@ gomp_target_fallback (void (*fn) (void *), void **hostaddrs)
>    *thr = old_thr;
>  }
>  
> +/* Host fallback with firstprivate map-type handling.  */
> +
> +static void
> +gomp_target_fallback_firstprivate (void (*fn) (void *), size_t mapnum,
> +				   void **hostaddrs, size_t *sizes,
> +				   unsigned short *kinds)
> +{
> +  size_t i, tgt_align = 0, tgt_size = 0;
> +  char *tgt = NULL;
> +  for (i = 0; i < mapnum; i++)
> +    if ((kinds[i] & 0xff) == GOMP_MAP_FIRSTPRIVATE)
> +      {
> +	size_t align = (size_t) 1 << (kinds[i] >> 8);
> +	if (tgt_align < align)
> +	  tgt_align = align;
> +	tgt_size = (tgt_size + align - 1) & ~(align - 1);
> +	tgt_size += sizes[i];
> +      }
> +  if (tgt_align)
> +    {
> +      tgt = gomp_alloca (tgt_size + tgt_align - 1);
> +      uintptr_t al = (uintptr_t) tgt & (tgt_align - 1);
> +      if (al)
> +	tgt += tgt_align - al;
> +      tgt_size = 0;
> +      for (i = 0; i < mapnum; i++)
> +	if ((kinds[i] & 0xff) == GOMP_MAP_FIRSTPRIVATE)
> +	  {
> +	    size_t align = (size_t) 1 << (kinds[i] >> 8);
> +	    tgt_size = (tgt_size + align - 1) & ~(align - 1);
> +	    memcpy (tgt + tgt_size, hostaddrs[i], sizes[i]);
> +	    hostaddrs[i] = tgt + tgt_size;
> +	    tgt_size = tgt_size + sizes[i];
> +	  }
> +    }
> +  gomp_target_fallback (fn, hostaddrs);
> +}

This is ok.

> +
>  /* Helper function of GOMP_target{,_41} routines.  */
>  
>  static void *
> @@ -1311,40 +1353,19 @@ GOMP_target_41 (int device, void (*fn) (void *), size_t mapnum,
>    if (devicep == NULL
>        || !(devicep->capabilities & GOMP_OFFLOAD_CAP_OPENMP_400))
>      {
> -      size_t i, tgt_align = 0, tgt_size = 0;
> -      char *tgt = NULL;
> -      for (i = 0; i < mapnum; i++)
> -	if ((kinds[i] & 0xff) == GOMP_MAP_FIRSTPRIVATE)
> -	  {
> -	    size_t align = (size_t) 1 << (kinds[i] >> 8);
> -	    if (tgt_align < align)
> -	      tgt_align = align;
> -	    tgt_size = (tgt_size + align - 1) & ~(align - 1);
> -	    tgt_size += sizes[i];
> -	  }
> -      if (tgt_align)
> -	{
> -	  tgt = gomp_alloca (tgt_size + tgt_align - 1);
> -	  uintptr_t al = (uintptr_t) tgt & (tgt_align - 1);
> -	  if (al)
> -	    tgt += tgt_align - al;
> -	  tgt_size = 0;
> -	  for (i = 0; i < mapnum; i++)
> -	    if ((kinds[i] & 0xff) == GOMP_MAP_FIRSTPRIVATE)
> -	      {
> -		size_t align = (size_t) 1 << (kinds[i] >> 8);
> -		tgt_size = (tgt_size + align - 1) & ~(align - 1);
> -		memcpy (tgt + tgt_size, hostaddrs[i], sizes[i]);
> -		hostaddrs[i] = tgt + tgt_size;
> -		tgt_size = tgt_size + sizes[i];
> -	      }
> -	}
> -      gomp_target_fallback (fn, hostaddrs);
> +      gomp_target_fallback_firstprivate (fn, mapnum, hostaddrs, sizes,
> +					 kinds);
>        return;
>      }

This too.

>    void *fn_addr = gomp_get_target_fn_addr (devicep, fn);
>  
> +  if (flags & GOMP_TARGET_FLAG_NOWAIT)
> +    {
> +      gomp_create_target_task (devicep, fn_addr, mapnum, hostaddrs, sizes,
> +			       kinds, flags, depend);
> +      return;
> +    }

But this is not ok.  You need to do this far earlier, already before the
if (depend != NULL) code in GOMP_target_41.  And, I think you should just
not pass fn_addr, but fn itself.

> @@ -1636,34 +1657,58 @@ void
>  gomp_target_task_fn (void *data)
>  {
>    struct gomp_target_task *ttask = (struct gomp_target_task *) data;
> +  struct gomp_device_descr *devicep = ttask->devicep;
> +
>    if (ttask->fn != NULL)
>      {
> -      /* GOMP_target_41 */
> +      if (devicep == NULL
> +	  || !(devicep->capabilities & GOMP_OFFLOAD_CAP_OPENMP_400))
> +	{
> +	  /* FIXME: Save host fn addr into gomp_target_task?  */
> +	  gomp_target_fallback_firstprivate (NULL, ttask->mapnum,

If you pass fn above instead of fn_addr, ttask->fn is what you want to pass
to gomp_target_fallback_firstprivate here, and you can remove the FIXME.

> +					     ttask->hostaddrs, ttask->sizes,
> +					     ttask->kinds);
> +	  return;
> +	}
> +
> +      struct target_mem_desc *tgt_vars
> +	= gomp_map_vars (devicep, ttask->mapnum, ttask->hostaddrs, NULL,
> +			 ttask->sizes, ttask->kinds, true,
> +			 GOMP_MAP_VARS_TARGET);
> +      devicep->async_run_func (devicep->target_id, ttask->fn,
> +			       (void *) tgt_vars->tgt_start, data);

You need to
  void *fn_addr = gomp_get_target_fn_addr (devicep, ttask->fn);
first obviously, and pass fn_addr.
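I.e. the non-fallback path would end up looking roughly like this (a sketch
using your patch's names; what exactly happens after async_run_func returns
is the open task.c question discussed above):

      void *fn_addr = gomp_get_target_fn_addr (devicep, ttask->fn);
      struct target_mem_desc *tgt_vars
	= gomp_map_vars (devicep, ttask->mapnum, ttask->hostaddrs, NULL,
			 ttask->sizes, ttask->kinds, true,
			 GOMP_MAP_VARS_TARGET);
      devicep->async_run_func (devicep->target_id, fn_addr,
			       (void *) tgt_vars->tgt_start, data);
      /* Return to the tasking code instead of waiting here; the task needs
	 to be requeued in a WAITING-like state, and only once the plugin
	 reports completion should gomp_unmap_vars (tgt_vars, true) run.  */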
> +
> +      /* FIXME: TMP example of checking for completion.
> +	 Alternatively the plugin can set some completion flag in ttask.  */
> +      while (!devicep->async_is_completed_func (devicep->target_id, data))
> +	{
> +	  fprintf (stderr, "-");
> +	  usleep (100000);
> +	}

This obviously doesn't belong here.

>    if (device->capabilities & GOMP_OFFLOAD_CAP_OPENACC_200)

> diff --git a/libgomp/testsuite/libgomp.c/target-tmp.c b/libgomp/testsuite/libgomp.c/target-tmp.c
> new file mode 100644
> index 0000000..23a739c
> --- /dev/null
> +++ b/libgomp/testsuite/libgomp.c/target-tmp.c
> @@ -0,0 +1,40 @@
> +#include <stdio.h>
> +#include <unistd.h>
> +
> +#pragma omp declare target
> +void foo (int n)
> +{
> +  printf ("Start tgt %d\n", n);
> +  usleep (5000000);

5s is too long.  Not to mention that I'm not sure if PTX can do printf, and
especially usleep.

> diff --git a/liboffloadmic/plugin/libgomp-plugin-intelmic.cpp b/liboffloadmic/plugin/libgomp-plugin-intelmic.cpp
> index 26ac6fe..c843710 100644
> --- a/liboffloadmic/plugin/libgomp-plugin-intelmic.cpp
> +++ b/liboffloadmic/plugin/libgomp-plugin-intelmic.cpp
...
> +/* Set of asynchronously running target tasks.  */
> +static std::set<const void *> *async_tasks;
> +
>  /* Thread-safe registration of the main image.  */
>  static pthread_once_t main_image_is_registered = PTHREAD_ONCE_INIT;
>  
> +/* Mutex for protecting async_tasks.  */
> +static pthread_mutex_t async_tasks_lock = PTHREAD_MUTEX_INITIALIZER;
> +
>  static VarDesc vd_host2tgt = {
>    { 1, 1 },		/* dst, src */
>    { 1, 0 },		/* in, out */
> @@ -156,6 +163,8 @@ init (void)
>  
>  out:
>    address_table = new ImgDevAddrMap;
> +  async_tasks = new std::set<const void *>;
> +  pthread_mutex_init (&async_tasks_lock, NULL);

PTHREAD_MUTEX_INITIALIZER should already initialize the lock.
But, do you really need async_tasks and the lock at all?  Better to store
something into some plugin-owned field in the target_task struct and let the
plugin callback be passed the address of that field rather than the whole
target_task?

> diff --git a/liboffloadmic/runtime/offload_host.cpp b/liboffloadmic/runtime/offload_host.cpp
> index 08f626f..8cee12c 100644
> --- a/liboffloadmic/runtime/offload_host.cpp
> +++ b/liboffloadmic/runtime/offload_host.cpp
> @@ -64,6 +64,9 @@ static void __offload_fini_library(void);
>  #define GET_OFFLOAD_NUMBER(timer_data) \
>      timer_data? timer_data->offload_number : 0
>  
> +extern "C" void
> +__gomp_offload_intelmic_async_completed (const void *);
> +
>  extern "C" {
>  #ifdef TARGET_WINNT
>  // Windows does not support imports from libraries without actually
> @@ -2507,7 +2510,7 @@ extern "C" {
>      const void *info
>    )
>    {
> -    /* TODO: Call callback function, pass info.  */
> +    __gomp_offload_intelmic_async_completed (info);
>    }
>  }

Is this for the emul only, or KNL only, or both?  In any case, I'm not sure
how it works: this is in liboffloadmic.so, and the function is defined in
the plugin?

	Jakub