On Thu, Oct 15, 2015 at 07:18:53PM +0300, Alexander Monakov wrote:
> On Thu, 15 Oct 2015, Jakub Jelinek wrote:
> > Looking at Cuda, for async target region kernels we'd probably use
> > a non-default stream and enqueue the async kernel in there.  I see
> > we can e.g. cudaEventRecord into the stream and then either cudaEventQuery
> > to busy poll the event, or cudaEventSynchronize to block until the event
> > occurs, plus there is cudaStreamWaitEvent that perhaps might be even used to
> > resolve the above mentioned mapping/unmapping async issues for Cuda
> > - like add an event after the mapping operations that the other target tasks
> > could wait for if they see any in_flux stuff, and wait for an event etc.
> > I don't see a possibility to have something like a callback on stream
> > completion though, so it has to be handled with polling.
> 
> Not sure why you say so.  There's cu[da]StreamAddCallback, which exists
> exactly for registering completion callbacks, but there are restrictions:

Ah, thanks.
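
For the record, the event based variant I had in mind above would be
roughly the following (untested sketch against the CUDA runtime API; the
function names are made up and how the kernel itself gets launched is
elided):

  #include <cuda_runtime.h>

  /* Enqueue an async target region into a non-default stream and
     record an event after it, so completion can be detected without
     blocking the host.  */
  static cudaEvent_t
  enqueue_target_region (cudaStream_t stream)
  {
    cudaEvent_t done;
    cudaEventCreateWithFlags (&done, cudaEventDisableTiming);
    /* ... launch the target region kernel into STREAM here ...  */
    cudaEventRecord (done, stream);
    return done;
  }

  /* Busy-poll from the scheduler; cudaEventQuery returns
     cudaErrorNotReady while work preceding the event is unfinished.
     Alternatively, block in cudaEventSynchronize (done).  */
  static int
  target_region_finished_p (cudaEvent_t done)
  {
    return cudaEventQuery (done) == cudaSuccess;
  }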

>   - this functionality doesn't currently work through CUDA MPS ("multi-process
>     server", for funneling CUDA calls from different processes through a
>     single "server" process, avoiding context-switch overhead on the device,
>     sometimes used for CUDA-with-MPI applications);

That shouldn't be an issue for the OpenMP 4.5 / PTX offloading, right?

>   - it is explicitly forbidden to invoke CUDA API calls from the callback;
>     perhaps understandable, as the callback may be running in a signal-handler
>     context (unlikely), or, more plausibly, in a different thread than the one
>     that registered the callback.

So, is it actually run from async signal handlers, or is that just a
possibility?  If all we need to achieve is to change some word in the
target_task struct, then it should be enough to just asynchronously memcpy
the value there, or e.g. use the events.  If we also need to gomp_sem_post,
then for config/linux/ that is likewise something that can be done from
async signal contexts, but not for other OSes (though perhaps on those OSes
we could just not go to sleep while there are pending offloading tasks).
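
I.e. something like the following untested sketch: arm a completion word in
page-locked host memory that the stream stores to once all preceding work
has finished, so at the point of the check the scheduler only reads a plain
word and makes no CUDA calls (the flag handling and function names here are
made up, not necessarily how libgomp would structure it):

  #include <cuda_runtime.h>

  static volatile int *host_flag;  /* Page-locked, read by scheduler.  */
  static int *dev_flag;            /* Device word holding the value 1.  */

  static void
  completion_init (void)
  {
    const int one = 1;
    cudaHostAlloc ((void **) &host_flag, sizeof *host_flag,
                   cudaHostAllocDefault);
    cudaMalloc ((void **) &dev_flag, sizeof *dev_flag);
    cudaMemcpy (dev_flag, &one, sizeof one, cudaMemcpyHostToDevice);
  }

  /* Call after enqueuing the target region into STREAM.  The copy is
     stream-ordered, so *host_flag becoming 1 signals that all prior
     work in STREAM has finished.  */
  static void
  arm_completion_word (cudaStream_t stream)
  {
    *host_flag = 0;
    cudaMemcpyAsync ((void *) host_flag, dev_flag, sizeof *host_flag,
                     cudaMemcpyDeviceToHost, stream);
  }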

> Ideally we'd queue all accelerator work up front via
> EventRecord/StreamWaitEvent, and not rely on callbacks.  If host-side
> work must be done on completion, we could spawn a helper thread waiting
> on cudaEventSynchronize.

Spawning a helper thread is very expensive, and we need something to be done
upon completion pretty much always.  Perhaps we can optimize and somehow
merge multiple async tasks that are waiting on each other, but the user
could have intermixed the offloading tasks with host tasks and have
dependencies between them, plus there are all the various spots where the
user wants to wait for both host and offloading tasks, or e.g. offloading
tasks from two different devices, or multiple offloading tasks from the same
device (multiple streams), etc.
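
For the stream-to-stream (or device-to-device) dependencies at least, the
up-front queueing could presumably be done with events alone, something like
this untested sketch (the function name is made up; cudaStreamWaitEvent is
documented to work even across devices):

  #include <cuda_runtime.h>

  /* Make all future work enqueued into WAITER start only after
     everything currently enqueued into PRODUCER has finished, with no
     host-side involvement at completion time.  */
  static void
  order_streams (cudaStream_t producer, cudaStream_t waiter)
  {
    cudaEvent_t ev;
    cudaEventCreateWithFlags (&ev, cudaEventDisableTiming);
    cudaEventRecord (ev, producer);
    cudaStreamWaitEvent (waiter, ev, 0);  /* Flags must be 0.  */
    /* Destroying the event right away is fine; its resources are
       released once the enqueued wait completes.  */
    cudaEventDestroy (ev);
  }

But that still doesn't help with the dependencies involving host tasks,
where something on the host side has to observe the completion.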

        Jakub
