Hi!

Responding the per device question (sorry I can't make it shorter, the topic is 
quite dense).

A typical CUDA application uses a single cuda context, and multiple cuda 
streams to allow asynchronicity between cuda tasks (memory transfers, kernels, 
memsets) and make overlapping between those tasks possible, to save potentially 
a lot to execution time.

Even with Timo's changes, by default the behaviour of ffmpeg is the same. Each 
ffmpeg decoding instance creates a new cuda context by calling 
cuda_device_create.

The result of this, can be seen with (for instance) NVIDIA NSIGHT Visual Studio 
plugin timeline. Basically, the GPU changes it's cuda context every frame, as 
many times as videos you are decoding. This is less optimal than what is 
possible with Timo's additions.

The reason for the un-efficiency is partly because of the time the GPU needs to 
change between contexts (very small in recent hardware), but also and more 
importantly, because any cuda task executing in a cuda context, will never 
overlap with tasks in other cuda contexts. And this can imply a huge waste of 
time, specially if you need to download each frame from GPU to CPU.

Before Timo's changes, I could already use the same cuda contex for all 
decoding instances, using a CPU thread per video, that will not call 
cuda_device_create. Instead, each thread will allocate an ffmpeg hw_ctx, and 
manually set the cuda contex that already exists in the application. This way 
you solve the problem of switching contexts, and another problem. All cuda 
tasks enqueued by nvdec and ffmpeg for the decoding task, now are using the 
default stream of the same cuda context, so now I can synchronize this default 
stream, with another non-default stream in my aplication, making all the 
following cuda tasks to be asynchronous and able to overlap. Before, I had to 
copy the data from GPU to CPU in a blocking way, indirectly using the context 
created by ffmpeg. But still, the cuda tasks enqueued by nvdec and ffmpeg are 
not overlapping because they are not asynchronous, because they use the default 
cuda stream.

So now, we are using the same "device" or cuda context for all decoding feeds 
in the application.

Thanks to Timo's changes, besides the removal of some memory copies (very 
nice), now additionally to being able to manually set the cuda context, I can 
set a non-default cuda stream in the ffmpeg cuda hw_ctx struct, so all cuda 
tasks performed by ffmpeg and nvdec will use this stream, and can overlap with 
all my other cuda tasks, instead of blocking the gpu until they finish.

I our use case, this is a very relevant execution time improvement, and we are 
going to use this. So I hope this, or something similar can go into the master 
branch of the project.

Hope my explanation is useful.

Oscar
________________________________________
De: ffmpeg-devel <ffmpeg-devel-boun...@ffmpeg.org> en nombre de wm4 
<nfx...@googlemail.com>
Enviado: martes, 8 de mayo de 2018 17:55
Para: ffmpeg-devel@ffmpeg.org
Asunto: Re: [FFmpeg-devel] [PATCH 3/6] avutil/hwcontext_cuda: add CUstream in 
cuda hwctx

On Tue, 8 May 2018 17:48:21 +0200
Timo Rothenpieler <t...@rothenpieler.org> wrote:

> Am 08.05.2018 um 17:26 schrieb wm4:
> > On Tue,  8 May 2018 15:31:29 +0200
> > Timo Rothenpieler <t...@rothenpieler.org> wrote:
> >
> >> ---
> >>   configure                  | 6 ++++--
> >>   doc/APIchanges             | 3 ++-
> >>   libavutil/hwcontext_cuda.c | 3 +++
> >>   libavutil/hwcontext_cuda.h | 1 +
> >>   libavutil/version.h        | 2 +-
> >>   5 files changed, 11 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/configure b/configure
> >> index 7c143238a8..cae8a235a4 100755
> >> --- a/configure
> >> +++ b/configure
> >> @@ -5887,8 +5887,10 @@ check_type "va/va.h va/va_enc_vp9.h"  
> >> "VAEncPictureParameterBufferVP9"
> >>   check_type "vdpau/vdpau.h" "VdpPictureInfoHEVC"
> >>
> >>   if ! disabled ffnvcodec; then
> >> -    check_pkg_config ffnvcodec "ffnvcodec >= 8.0.14.1" \
> >> -        "ffnvcodec/nvEncodeAPI.h ffnvcodec/dynlink_cuda.h 
> >> ffnvcodec/dynlink_cuviddec.h ffnvcodec/dynlink_nvcuvid.h" ""
> >> +    check_pkg_config ffnvcodec "ffnvcodec >= 8.1.24.2" \
> >> +          "ffnvcodec/nvEncodeAPI.h ffnvcodec/dynlink_cuda.h 
> >> ffnvcodec/dynlink_cuviddec.h ffnvcodec/dynlink_nvcuvid.h" "" || \
> >> +        { test_pkg_config ffnvcodec_tmp "ffnvcodec < 8.1" "" "" && 
> >> check_pkg_config ffnvcodec "ffnvcodec >= 8.0.14.2" \
> >> +          "ffnvcodec/nvEncodeAPI.h ffnvcodec/dynlink_cuda.h 
> >> ffnvcodec/dynlink_cuviddec.h ffnvcodec/dynlink_nvcuvid.h" ""; }
> >>   fi
> >>
> >>   check_cpp_condition winrt windows.h 
> >> "!WINAPI_FAMILY_PARTITION(WINAPI_PARTITION_DESKTOP)"
> >> diff --git a/doc/APIchanges b/doc/APIchanges
> >> index f8ae6b0433..7a0a8522f9 100644
> >> --- a/doc/APIchanges
> >> +++ b/doc/APIchanges
> >> @@ -15,8 +15,9 @@ libavutil:     2017-10-21
> >>
> >>   API changes, most recent first:
> >>
> >> -2018-05-xx - xxxxxxxxxx - lavu 56.19.100 - hwcontext_cuda.h
> >> +2018-05-xx - xxxxxxxxxx, xxxxxxxxxx - lavu 56.19.100/101 - 
> >> hwcontext_cuda.h
> >>     Add AVCUDAFramesContext and AVCUDAFramesContext.flags.
> >> +  Add AVCUDADeviceContext.stream.
> >>
> >>   2018-04-xx - xxxxxxxxxx - lavu 56.18.100 - pixdesc.h
> >>     Add AV_PIX_FMT_FLAG_ALPHA to AV_PIX_FMT_PAL8.
> >> diff --git a/libavutil/hwcontext_cuda.c b/libavutil/hwcontext_cuda.c
> >> index b0b4bf24ae..8024eec79d 100644
> >> --- a/libavutil/hwcontext_cuda.c
> >> +++ b/libavutil/hwcontext_cuda.c
> >> @@ -395,6 +395,9 @@ static int cuda_device_create(AVHWDeviceContext *ctx, 
> >> const char *device,
> >>           goto error;
> >>       }
> >>
> >> +    // Setting stream to NULL will make functions automatically use the 
> >> default CUstream
> >> +    hwctx->stream = NULL;
> >> +
> >>       cu->cuCtxPopCurrent(&dummy);
> >>
> >>       hwctx->internal->is_allocated = 1;
> >> diff --git a/libavutil/hwcontext_cuda.h b/libavutil/hwcontext_cuda.h
> >> index 19accbd9be..cd797ae920 100644
> >> --- a/libavutil/hwcontext_cuda.h
> >> +++ b/libavutil/hwcontext_cuda.h
> >> @@ -41,6 +41,7 @@ typedef struct AVCUDADeviceContextInternal 
> >> AVCUDADeviceContextInternal;
> >>    */
> >>   typedef struct AVCUDADeviceContext {
> >>       CUcontext cuda_ctx;
> >> +    CUstream stream;
> >>       AVCUDADeviceContextInternal *internal;
> >>   } AVCUDADeviceContext;
> >>
> >> diff --git a/libavutil/version.h b/libavutil/version.h
> >> index 84409b1d69..f84ec89154 100644
> >> --- a/libavutil/version.h
> >> +++ b/libavutil/version.h
> >> @@ -80,7 +80,7 @@
> >>
> >>   #define LIBAVUTIL_VERSION_MAJOR  56
> >>   #define LIBAVUTIL_VERSION_MINOR  19
> >> -#define LIBAVUTIL_VERSION_MICRO 100
> >> +#define LIBAVUTIL_VERSION_MICRO 101
> >>
> >>   #define LIBAVUTIL_VERSION_INT   AV_VERSION_INT(LIBAVUTIL_VERSION_MAJOR, \
> >>                                                  LIBAVUTIL_VERSION_MINOR, \
> >
> > What is this?
>
> https://docs.nvidia.com/cuda/cuda-driver-api/stream-sync-behavior.html
>
> It allows asynchronous processing of CUDA workloads. The next couple
> patches make use of it.
> There's no change in behaviour if it remains unset/NULL, but if you set
> one, the workload won't block the main CUDA stream so you can do
> multiple transcode sessions in the same application without blocking one
> another.
>

Could probably be documented.

It seems a bit strange that this is per device. Wouldn't it be per
operation?
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel

Reply via email to