Hi,

On Thu, Oct 15, 2015 at 04:01:56PM +0200, Jakub Jelinek wrote:
> Hi!
>
> CCing various people, because I'd like to have something that won't work on
> XeonPhi only.
thanks.  However, I have not paid too much attention to OMP tasks yet.
Nevertheless, let me try to answer some of the questions.

>
> On Fri, Oct 02, 2015 at 10:28:01PM +0300, Ilya Verbin wrote:
> > On Tue, Sep 08, 2015 at 11:20:14 +0200, Jakub Jelinek wrote:
> > > nowait support for #pragma omp target is not implemented yet, supposedly
> > > we need to mark those somehow (some flag) already in the struct gomp_task
> > > structure, essentially it will need either 2 or 3 callbacks
> > > (the current one, executed when the dependencies are resolved (it actually
> > > waits until some thread schedules it after that point, I think it is
> > > undesirable to run it with the tasking lock held), which would perform
> > > the gomp_map_vars and initiate the running of the region, and then some
> > > query routine which would poll the plugin whether the task is done or not,
> > > and either perform the finalization (unmap_vars) if it is done (and in any
> > > case return bool whether it should be polled again or not), and if the
> > > finalization is not done there, also another callback for the finalization.
> > > Also, there is the issue that if we are waiting for task that needs to be
> > > polled, and we don't have any further tasks to run, we shouldn't really
> > > attempt to sleep on some semaphore (e.g. in taskwait, end of
> > > taskgroup, etc.) or barrier, but rather either need to keep polling it, or
> > > call the query hook with some argument that it should sleep in there until
> > > the work is done by the offloading device.
> > > Also, there needs to be a way for the target nowait first callback to say
> > > that it is using host fallback and thus acts as a normal task, therefore
> > > once the task fn finishes, the task is done.
> > Here is my WIP patch.  target.c part is obviously incorrect, but it
> > demonstrates a possible libgomp <-> plugin interface for running a target
> > task function asynchronously and checking whether it is completed or not.
> > (Refactored liboffloadmic/runtime/emulator from trunk is required to run
> > target-tmp.c testcase.)
>
> The difficulty is designing something that will work (if possible fast) on the
> various devices we want to eventually support (at least XeonPhi, XeonPhi emul,
> PTX/Cuda and HSA), ideally without too much busy waiting.
>
> The OpenMP 4.5 spec says that there is a special "target task" on the host
> side around the target region, and that the "target task" is mergeable and
> if nowait is not specified is included (otherwise it may be), and that the
> mapping operations (which include target device memory allocation,
> refcount management and mapping data structure updates as well as the
> memory copying to target device) happens only after the (optional) dependencies
> are satisfied.  After the memory mapping operations are done, the offloading
> kernel starts, and when it finishes, the unmapping operations are performed
> (which includes memory copying from the target device, refcount management
> and mapping data structure updates, and finally memory deallocation).
>
> Right now on the OpenMP side everything is synchronous, e.g. target
> enter/exit data and update are asynchronous only in that the mapping or
> unmapping operation is scheduled as a task, but the whole mapping or
> unmapping operations including all the above mentioned subparts are
> performed while holding the particular device's lock.

Memory mapping and unmapping are no-ops on HSA, so this is fortunately not
a concern for us.  (I'm assuming that ref-counting is also something
device-specific and not part of running a task here.)
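Just to make sure I read the proposed scheme correctly, here is a rough
sketch of the two/three callbacks as I picture them.  All the type,
function and parameter names below are invented purely for illustration;
they do not correspond to anything that exists in libgomp or the plugins
today:

/* Hypothetical sketch of the proposed callbacks, names made up.  */

#include <stdbool.h>

struct target_task_async_state
{
  void *plugin_data;   /* Plugin-private handle for the running kernel.  */
  bool host_fallback;  /* Set if the plugin decided to run on the host.  */
};

/* 1) Run once the dependencies are resolved (without the tasking lock):
   performs gomp_map_vars and starts the kernel asynchronously.  Returns
   false when the plugin falls back to host execution, in which case the
   task behaves like an ordinary host task from then on.  */
extern bool plugin_async_run (int device_id, void *tgt_fn, void *tgt_vars,
                              struct target_task_async_state *state);

/* 2) Query hook polled by the tasking code: returns true once the device
   has finished the region.  If BLOCK is set, the plugin may sleep inside
   until the work is done instead of returning immediately.  */
extern bool plugin_async_done (int device_id,
                               struct target_task_async_state *state,
                               bool block);

/* 3) Finalization, run when 2) reports completion: gomp_unmap_vars,
   refcount updates and deallocation, under the device lock.  */
extern void plugin_async_finalize (int device_id,
                                   struct target_task_async_state *state);

If that is roughly the intended interface, then for HSA the interesting
question is how cheaply 2) can be implemented, which I try to describe
below.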
> Anyway, let's put the asynchronous memory data transfers (which also implies
> the ability to enqueue multiple different target regions into a stream for
> the device to operate on independently from the host) on the side for now
> and just discuss what we want for the actual async execution and for now
> keep a device lock around all the mapping or unmapping operations.
>
> If the "target task" has unresolved dependencies, then it will use existing
> task.c waiting code first (if the above is resolved somehow, there could be
> exceptions of "target task" depending on another "target task").
> When the dependencies are resolved, we can run the gomp_target_task_fn
> callback (but not with the team's tasking lock held), which can perform
> the gomp_map_vars call and start the async execution.  For host fallback,
> that is all we do, the task is at this point a normal task.
> For offloading task, we now want the host to continue scheduling other tasks
> if there are any, which means (not currently implemented on the task.c side)
> we want to move the task somewhere that we don't consider it finished, and
> that we'll need to schedule it again at some point to perform the unmapping
> (perhaps requeue it again in a WAITING or some other state).
>
> Right now, the tasking code would in that case try to schedule another task,
> and if there are none or none that are runnable among the tasks of interest,
> it can go to sleep and expect to be awaken when some task it is waiting for
> is awaken.
>
> And the main question is how to find out on the various devices whether
> the async execution has completed already.
>
> From what I can see in the liboffloadmic plugin, you have an extra host
> thread that can run a callback function on the host.  Such a callback could
> say tweak the state of the "target task", could take the team's tasking
> lock, and even awake sleepers, maybe even take the device lock and perform
> unmapping of vars?  The function would need to be in task.c
> so that it can access everything defined in there.  Or the callback could
> just change something in the "target task" state and let the tasking poll
> for the change.
> ...
>
> For HSA I have no idea.
>

In HSA, task completion is signaled via so-called "signals", which are
basically (long) integers that you can atomically read/write (etc.) with
the given API, and on which you can wait until a specified condition
(eq, ne, lt, gte) holds.  Atomic reading should be very cheap.  I do not
see a way to wait on multiple signals, but we can arrange it so that the
completions of a number of kernels are communicated through a single
signal.

At the moment we simply wait and do not create any special servicing
threads in our libgomp plugin, and, as far as I know, the run-time itself
does not offer a way of registering a callback to announce kernel
completion.  So polling is certainly a possibility, and a blocking wait
is also simple if the HSA task(s) are the last ones we are waiting for.
Sleeping until either a CPU or an HSA task completes might be tricky.

I hope this helps,

Martin
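PS: for illustration, this is roughly what the cheap poll and the blocking
wait could look like on top of the plain HSA signal API
(hsa_signal_load_acquire / hsa_signal_wait_acquire).  The helper names and
the convention that a single completion signal counts the outstanding
kernels are just assumptions for this sketch:

/* Sketch only: assumes one hsa_signal_t per "target task" (or group of
   kernels), initialized to the number of kernels dispatched and
   decremented by the packet processor as each of them completes.  */

#include <hsa.h>
#include <stdbool.h>
#include <stdint.h>

/* Cheap non-blocking check, suitable for polling from the tasking code.  */
static bool
hsa_kernels_completed_p (hsa_signal_t completion_signal)
{
  return hsa_signal_load_acquire (completion_signal) <= 0;
}

/* Blocking variant for when the HSA work is the only thing left to wait
   for.  HSA_WAIT_STATE_BLOCKED lets the runtime put the thread to sleep;
   the loop is needed because the wait may return before the condition
   (value < 1) is actually satisfied.  */
static void
hsa_wait_for_kernels (hsa_signal_t completion_signal)
{
  while (hsa_signal_wait_acquire (completion_signal, HSA_SIGNAL_CONDITION_LT,
                                  1, UINT64_MAX, HSA_WAIT_STATE_BLOCKED) >= 1)
    ;
}

(This of course does not address the mixed case of sleeping until either a
CPU or an HSA task completes.)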