On Mon, Oct 26, 2020 at 07:14:48AM -0700, Julian Brown wrote:
> This patch adds caching for the stack block allocated for offloaded
> OpenMP kernel launches on NVPTX. This is a performance optimisation --
> we observed an average 11% or so performance improvement with this patch
> across a set of accelerated GPU benchmarks on one machine (results vary
> according to individual benchmark and with hardware used).
> 
> A given kernel launch will reuse the stack block from the previous launch
> if it is large enough, else it is freed and reallocated. A slight caveat
> is that memory will not be freed until the device is closed, so e.g. if
> code is using highly variable launch geometries and large amounts of
> GPU RAM, you might run out of resources slightly quicker with this patch.
> 
> Another way this patch gains performance is by omitting the
> synchronisation at the end of an OpenMP offload kernel launch -- it's
> safe for the GPU and CPU to continue executing in parallel at that point,
> because e.g. copies-back from the device will be synchronised properly
> with kernel completion anyway.
> 
> In turn, the last part necessitates a change to the way "(perhaps abort
> was called)" errors are detected and reported.
> 
> Tested with offloading to NVPTX. OK for mainline?

I'm afraid I don't know the plugin nor CUDA well enough to review this
properly (therefore I'd like to hear from Thomas, Tom and/or Alexander).
Anyway, just two questions.  First, wouldn't it make sense to add some
upper bound limit above which the stacks wouldn't be cached, so that
caching would still happen most of the time for normal programs, but one
really excessive kernel followed by many normal ones wouldn't result in
memory allocation failures?
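
To make the idea concrete, I'm thinking of something along these lines
(completely untested; nvptx_stacks_acquire, NVPTX_STACKS_CACHE_LIMIT and
the omp_stacks fields are just names I'm making up for the sketch, not
necessarily what the patch uses):

/* Sketch only; omp_stacks.{ptr,size} are assumed fields on ptx_device
   caching the last stacks allocation for the device.  */
#define NVPTX_STACKS_CACHE_LIMIT (128 * 1024 * 1024)  /* Arbitrary.  */

static CUdeviceptr
nvptx_stacks_acquire (struct ptx_device *ptx_dev, size_t size)
{
  /* Reuse the cached block if it is large enough for this launch.  */
  if (ptx_dev->omp_stacks.ptr && ptx_dev->omp_stacks.size >= size)
    return ptx_dev->omp_stacks.ptr;

  /* Otherwise free the old block (if any) and allocate a new one.  */
  if (ptx_dev->omp_stacks.ptr)
    CUDA_CALL_ASSERT (cuMemFree, ptx_dev->omp_stacks.ptr);
  CUDA_CALL_ASSERT (cuMemAlloc, &ptx_dev->omp_stacks.ptr, size);

  /* Cache the size only when it is below the limit; recording zero
     makes the next launch free and reallocate, so a single huge launch
     doesn't pin device memory until the device is closed.  */
  ptx_dev->omp_stacks.size = size <= NVPTX_STACKS_CACHE_LIMIT ? size : 0;
  return ptx_dev->omp_stacks.ptr;
}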

And, second, in which context are cuStreamAddCallback registered
callbacks run?  E.g. if it is inside of an asynchronous interrupt,
using locking in there might not be the best thing to do.
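
For reference, such a callback has the CUstreamCallback shape below
(void (CUDA_CB *) (CUstream, CUresult, void *) per cuda.h); the body is
only my guess at what nvptx_stacks_release does based on the hunk quoted
underneath, i.e. dropping a lock taken around the launch, so take it as
illustration rather than the actual patch:

static void CUDA_CB
nvptx_stacks_release (CUstream stream, CUresult res, void *ptr)
{
  struct ptx_device *ptx_dev = (struct ptx_device *) ptr;

  /* Presumably this drops whatever mutex serialises access to the
     cached stacks block while the kernel may still be using it; if the
     driver invokes callbacks somewhere blocking is unsafe, this unlock
     (and the matching lock) is exactly what would break.  */
  pthread_mutex_unlock (&ptx_dev->omp_stacks.lock);
}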

> -  r = CUDA_CALL_NOCHECK (cuCtxSynchronize, );
> -  if (r == CUDA_ERROR_LAUNCH_FAILED)
> -    GOMP_PLUGIN_fatal ("cuCtxSynchronize error: %s %s\n", cuda_error (r),
> -                       maybe_abort_msg);
> -  else if (r != CUDA_SUCCESS)
> -    GOMP_PLUGIN_fatal ("cuCtxSynchronize error: %s", cuda_error (r));
> -  nvptx_stacks_free (stacks, teams * threads);
> +  CUDA_CALL_ASSERT (cuStreamAddCallback, NULL, nvptx_stacks_release,
> +                    (void *) ptx_dev, 0);
>  }
>  
>  /* TODO: Implement GOMP_OFFLOAD_async_run. */
> -- 
> 2.28.0

        Jakub
