Hi,

On Thu, Oct 15, 2015 at 04:01:56PM +0200, Jakub Jelinek wrote:
> Hi!
> 
> CCing various people, because I'd like to have something that won't work on
> XeonPhi only.

Thanks.  I have not paid much attention to OMP tasks yet, but let me try to
answer some of the questions.

> 
> On Fri, Oct 02, 2015 at 10:28:01PM +0300, Ilya Verbin wrote:
> > On Tue, Sep 08, 2015 at 11:20:14 +0200, Jakub Jelinek wrote:
> > > nowait support for #pragma omp target is not implemented yet, supposedly we
> > > need to mark those somehow (some flag) already in the struct gomp_task
> > > structure, essentially it will need either 2 or 3 callbacks
> > > (the current one, executed when the dependencies are resolved (it actually
> > > waits until some thread schedules it after that point, I think it is
> > > undesirable to run it with the tasking lock held), which would perform
> > > the gomp_map_vars and initiate the running of the region, and then some
> > > query routine which would poll the plugin whether the task is done or not,
> > > and either perform the finalization (unmap_vars) if it is done (and in any
> > > case return bool whether it should be polled again or not), and if the
> > > finalization is not done there, also another callback for the finalization.
> > > Also, there is the issue that if we are waiting for a task that needs to be
> > > polled, and we don't have any further tasks to run, we shouldn't really
> > > attempt to sleep on some semaphore (e.g. in taskwait, end of
> > > taskgroup, etc.) or barrier, but rather either need to keep polling it, or
> > > call the query hook with some argument that it should sleep in there until
> > > the work is done by the offloading device.
> > > Also, there needs to be a way for the target nowait first callback to say
> > > that it is using host fallback and thus acts as a normal task, therefore
> > > once the task fn finishes, the task is done.
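
If I understand the proposed design right, the plugin-facing part boils down
to roughly the following three hooks.  All names and signatures below are only
my illustration of the three steps described above, not an existing or
proposed interface:

#include <stdbool.h>

struct gomp_task;   /* opaque for this sketch */

/* 1) Run once the dependencies are resolved, without the tasking lock
   held: performs gomp_map_vars and starts the offloaded region.
   Returns true when host fallback is used, i.e. from then on the task
   behaves like a normal host task.  */
bool (*ttask_start) (struct gomp_task *ttask);

/* 2) Poll hook: returns true if the task needs to be polled again,
   false once the device has finished.  May perform the finalization
   (unmap_vars) itself when it detects completion.  */
bool (*ttask_query) (struct gomp_task *ttask);

/* 3) Finalization hook (unmap_vars etc.), used when the query hook did
   not already perform it.  */
void (*ttask_finalize) (struct gomp_task *ttask);
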
> > 
> > Here is my WIP patch.  The target.c part is obviously incorrect, but it
> > demonstrates a possible libgomp <-> plugin interface for running a target
> > task function asynchronously and checking whether it is completed or not.
> > (The refactored liboffloadmic/runtime/emulator from trunk is required to run
> > the target-tmp.c testcase.)
> 
> The difficulty is designing something that will work (if possible, fast) on the
> various devices we want to eventually support (at least XeonPhi, XeonPhi emul,
> PTX/Cuda and HSA), ideally without too much busy waiting.
> 
> The OpenMP 4.5 spec says that there is a special "target task" on the host
> side around the target region, and that the "target task" is mergeable and
> if nowait is not specified is included (otherwise it may be), and that the
> mapping operations (which include target device memory allocation,
> refcount management and mapping data structure updates as well as the
> memory copying to target device) happens only after the (optional) dependencies
> are satisfied.  After the memory mapping operations are done, the offloading
> kernel starts, and when it finishes, the unmapping operations are performed
> (which includes memory copying from the target device, refcount management
> and mapping data structure updates, and finally memory deallocation).
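
Just to restate that in user terms (the ordering guarantees are the ones you
describe from the 4.5 spec; the example itself is of course made up):

#include <stdio.h>

int
main (void)
{
  int x = 0, y = 0;

#pragma omp parallel
#pragma omp single
  {
#pragma omp task depend(out: x) shared(x)
    x = 42;

    /* The "target task": only after the dependence on x is satisfied are
       x and y mapped (allocated and copied to the device), then the kernel
       runs, and once it finishes they are unmapped (copied back and
       deallocated).  */
#pragma omp target nowait depend(in: x) depend(out: y) map(to: x) map(from: y)
    y = x + 1;

#pragma omp taskwait
  }

  printf ("%d\n", y);   /* prints 43 */
  return 0;
}
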
> 
> Right now on the OpenMP side everything is synchronous, e.g. target
> enter/exit data and update are asynchronous only in that the mapping or
> unmapping operation is scheduled as a task, but the whole mapping or
> unmapping operations including all the above mentioned subparts are
> performed while holding the particular device's lock.

Memory mapping and unmapping are no-ops on HSA, so this is fortunately not
a concern for us.  (I'm assuming that the ref-counting is also something
device-specific and not part of running a task here.)

> Anyway, let's put the asynchronous memory data transfers (which also implies
> the ability to enqueue multiple different target regions into a stream for
> the device to operate on independently from the host) on the side for now
> and just discuss what we want for the actual async execution and for now
> keep a device lock around all the mapping or unmapping operations.
> 
> If the "target task" has unresolved dependencies, then it will use existing
> task.c waiting code first (if the above is resolved somehow, there could be
> exceptions of "target task" depending on another "target task").
> When the dependencies are resolved, we can run the gomp_target_task_fn
> callback (but not with the team's tasking lock held), which can perform
> the gomp_map_vars call and start the async execution.  For host fallback,
> that is all we do, the task is at this point a normal task.
> For offloading task, we now want the host to continue scheduling other tasks
> if there are any, which means (not currently implemented on the task.c side)
> we want to move the task somewhere that we don't consider it finished, and
> that we'll need to schedule it again at some point to perform the unmapping
> (perhaps requeue it again in a WAITING or some other state).
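
So, if I read this correctly, a target task would essentially move through
states like these (again, only an illustration, not existing libgomp code):

enum target_task_state
{
  TTASK_WAITING_DEPS,     /* unresolved dependencies, handled by the
                             existing task.c waiting code  */
  TTASK_RUNNING,          /* gomp_map_vars done, kernel running on the
                             device; host threads schedule other tasks  */
  TTASK_KERNEL_FINISHED,  /* device done; the task is requeued (e.g. in a
                             WAITING-like state) so that some host thread
                             performs the unmapping  */
  TTASK_DONE              /* fully finished, dependent tasks released  */
};

/* Host fallback skips the device states: after TTASK_WAITING_DEPS it
   simply runs as a normal task.  */
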
> 
> Right now, the tasking code would in that case try to schedule another task,
> and if there are none or none that are runnable among the tasks of interest,
> it can go to sleep and expect to be woken up when some task it is waiting
> for completes.
> 
> And the main question is how to find out on the various devices whether
> the async execution has completed already.
> 
> From what I can see in the liboffloadmic plugin, you have an extra host
> thread that can run a callback function on the host.  Such a callback
> could, say, tweak the state of the "target task", take the team's tasking
> lock, even wake up sleepers, and maybe even take the device lock and perform
> unmapping of vars?  The function would need to be in task.c
> so that it can access everything defined in there.  Or the callback could
> just change something in the "target task" state and let the tasking poll
> for the change.
> 
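
Just to check that I follow the idea, that callback would do something along
these lines?  (A sketch mixing real libgomp internals with made-up fields and
state names, so please take it only as an illustration.)

static void
target_task_completion (void *data)
{
  struct gomp_target_task *ttask = (struct gomp_target_task *) data;  /* made up */
  struct gomp_team *team = ttask->team;

  gomp_mutex_lock (&team->task_lock);
  ttask->state = TTASK_KERNEL_FINISHED;   /* see the enum sketched above */
  /* Wake a thread sleeping in taskwait/taskgroup/barrier so that it
     re-examines the queues and eventually performs the unmapping.  */
  gomp_team_barrier_wake (&team->barrier, 1);
  gomp_mutex_unlock (&team->task_lock);
}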

...

> 
> For HSA I have no idea.
> 

In HSA, task completion is signaled via so-called "signals," which are
basically (long) integers that you can atomically read, write and update
through the HSA API, and on which you can wait until a specified condition
(eq, ne, lt, gte) is satisfied.  Atomic reading should be very cheap.  I do
not see a way to wait on multiple signals, but we can arrange things so that
the completions of a number of kernels are communicated through a single
signal.
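
To make that concrete, with the standard hsa.h API (the signal below being the
completion signal of the kernel dispatch packet, created with an initial value
of 1 and decremented to 0 by the packet processor when the kernel finishes):

#include <stdint.h>
#include <hsa.h>

/* Cheap polling, e.g. from a tasking scheduler loop.  */
static int
kernel_finished_p (hsa_signal_t completion)
{
  return hsa_signal_load_acquire (completion) == 0;
}

/* Blocking wait, usable when the HSA task is the last thing we wait for.  */
static void
wait_for_kernel (hsa_signal_t completion)
{
  hsa_signal_wait_acquire (completion, HSA_SIGNAL_CONDITION_LT, 1,
                           UINT64_MAX, HSA_WAIT_STATE_BLOCKED);
}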

At the moment we simply wait and do not create any special servicing threads
in our libgomp plugin, and, as far as I know, the HSA run-time itself does
not offer a way of registering a callback to announce kernel completion.

So polling is certainly a possibility, and a blocking wait is also simple if
the HSA task(s) are the last ones we wait for.  Sleeping until either a CPU
or an HSA task completes might be tricky.

I hope this helps,

Martin
