av_hwdevice_ctx_create() for decoding with NVDEC

Oscar Amoros Huguet Fri, 20 Apr 2018 09:54:26 -0700

Hi!

We changed four files in ffmpeg: libavcodec/nvdec.c, libavutil/hwcontext.c, 
libavutil/hwcontext_cuda.h and libavutil/hwcontext_cuda.c.


The purpose of this modification is very simple. For performance reasons 
(per-frame execution time), we needed nvdec.c to use the same CUDA context 
that we use in our software.

The reason for this is not so simple, and it is twofold:
- We wanted to remove the overhead of having the GPU constantly switching 
contexts, as we use up to 8 nvdec instances at the same time, plus a lot of 
CUDA computations.
- For video synchronization and buffering purposes, after decoding we need to 
download the frame from GPU to CPU, but in a non-blocking manner, overlapped 
with computation and other transfers, so that the impact of the transfer is 
almost zero.

In order to do the latter, we need to be able to synchronize our manually 
created CUDA stream with the CUDA stream being used by ffmpeg, which by default 
is the legacy default stream.
To do so, we need to be in the same CUDA context, otherwise we don't have 
access to the legacy CUDA stream being used by ffmpeg.

The consequence is that, without changing the ffmpeg code, the transfer of the 
frame from GPU to CPU could not be asynchronous: when made asynchronous, it 
overlapped with the device-to-device cuMemcpy made internally by ffmpeg, and 
the resulting frames were therefore (many times) a mix of two frames.
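
To make the failure mode above concrete, here is a toy, CPU-only model of it. 
It is only a sketch: no CUDA is involved, the "device" buffer is a plain array, 
and the names copy_rows and download_is_mixed are made up for this 
illustration. It assumes the download reads a buffer that the next 
device-to-device copy overwrites, which is how we understand the problem.

```c
#include <string.h>

#define ROWS 4

/* Copy rows [from, to) of src into dst; stands in for part of a memcpy. */
static void copy_rows(char *dst, const char *src, int from, int to)
{
    memcpy(dst + from, src + from, (size_t)(to - from));
}

/* Returns 1 if host[] ends up holding rows from two different frames. */
static int download_is_mixed(int synchronized)
{
    char frame_a[ROWS], frame_b[ROWS], device[ROWS], host[ROWS];

    memset(frame_a, 'A', ROWS);
    memset(frame_b, 'B', ROWS);
    memcpy(device, frame_a, ROWS);           /* frame A is on the "device" */

    if (synchronized) {
        copy_rows(host, device, 0, ROWS);    /* download of A completes first */
        copy_rows(device, frame_b, 0, ROWS); /* then B overwrites the buffer */
    } else {
        copy_rows(host, device, 0, ROWS / 2);    /* download starts ...        */
        copy_rows(device, frame_b, 0, ROWS);     /* ... B overwrites buffer ... */
        copy_rows(host, device, ROWS / 2, ROWS); /* ... download ends with B   */
    }

    for (int i = 1; i < ROWS; i++)
        if (host[i] != host[0])
            return 1;
    return 0;
}
```

With synchronized set to 0 the host copy holds half of frame A and half of 
frame B; with it set to 1 the host copy is entirely frame A.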

So what did we change?

- Outside of the ffmpeg code, we allocate an AVBufferRef with 
av_hwdevice_ctx_alloc(AV_HWDEVICE_TYPE_CUDA), and we access the associated 
AVCUDADeviceContext to set the CUDA context (cuda_ctx).
- We modified av_hwdevice_ctx_create() in libavutil/hwcontext.c so that it 
detects that the AVBufferRef being passed was allocated externally. We don't 
check that AVHWDeviceType is AV_HWDEVICE_TYPE_CUDA. Let us know if you think we 
should check that, and otherwise fall back to the default behavior.
- If the AVBufferRef was already allocated, we skip the allocation call and 
pass its data, as an AVHWDeviceContext, to cuda_device_create.
- We modified libavutil/hwcontext_cuda.c in several places:
  - cuda_device_create detects whether a CUDA context is already present in the 
  AVCUDADeviceContext and, if so, sets the new field 
  AVCUDADeviceContext.is_ctx_externally_allocated to 1.
  - This way, all the successive calls into this file take into account that 
  ffmpeg is not responsible for the creation, thread binding/unbinding or 
  destruction of the CUDA context.
  - Also, we skip the context push and pop if the context was passed externally 
  (especially in non-initialization calls), to reduce the number of calls to 
  the CUDA driver API and improve the execution times of the CPU threads using 
  ffmpeg.
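
The ownership rule described in the list above can be sketched without CUDA 
or ffmpeg at all. The code below is a simplified model, not the patch itself: 
MockCtx, MockDeviceContext and the mock_* functions are hypothetical stand-ins 
for CUcontext, AVCUDADeviceContext and the functions in hwcontext_cuda.c, and 
a counter replaces the real cuCtxPushCurrent/cuCtxPopCurrent calls.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a CUcontext. */
typedef struct MockCtx { int destroyed; } MockCtx;

/* Hypothetical stand-in for AVCUDADeviceContext. */
typedef struct MockDeviceContext {
    MockCtx *cuda_ctx;               /* preset by the caller, or NULL */
    int is_ctx_externally_allocated; /* the flag added by the patch */
    int push_pop_calls;              /* counts simulated push/pop calls */
} MockDeviceContext;

static MockCtx internal_ctx; /* what a simulated cuCtxCreate hands back */

/* Models cuda_device_create(): create a context only if none was supplied. */
static int mock_device_create(MockDeviceContext *hw)
{
    hw->is_ctx_externally_allocated = hw->cuda_ctx != NULL;
    if (!hw->is_ctx_externally_allocated) {
        internal_ctx.destroyed = 0;
        hw->cuda_ctx = &internal_ctx;
    }
    return 0;
}

/* Models cuda_pool_alloc() and friends: bind the context only if we own it. */
static void mock_buffer_op(MockDeviceContext *hw)
{
    assert(hw->cuda_ctx);
    if (!hw->is_ctx_externally_allocated)
        hw->push_pop_calls++; /* simulated cuCtxPushCurrent */
    /* ... the allocation or copy itself would happen here ... */
    if (!hw->is_ctx_externally_allocated)
        hw->push_pop_calls++; /* simulated cuCtxPopCurrent */
}

/* Models cuda_device_uninit(): never destroy a context we did not create. */
static void mock_device_uninit(MockDeviceContext *hw)
{
    if (hw->cuda_ctx && !hw->is_ctx_externally_allocated) {
        hw->cuda_ctx->destroyed = 1;
        hw->cuda_ctx = NULL;
    }
}
```

In this model a context supplied by the application is never pushed, popped 
or destroyed, while a context created internally keeps the previous behavior.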

With this, we managed to have all the CUDA calls in the application in the same 
CUDA context. Also, we use the CUDA per-thread default stream, so in order to 
synchronize with the CUDA stream used by ffmpeg, we only had to issue the 
GPU-to-CPU copy on the globally accessible cudaStreamPerThread CUDA stream.

So, of the 33 ms we have available per frame, we save more than 6 ms that 
were being used by the blocking copies from GPU to CPU.
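
As a quick check of what that saving means relative to the frame budget, here 
is a small helper. The name saved_percent is made up for this illustration; 
the 33 ms and 6 ms figures are the ones quoted above.

```c
/* Share of the per-frame time budget freed by a given saving, in percent. */
static double saved_percent(double budget_ms, double saved_ms)
{
    return 100.0 * saved_ms / budget_ms;
}
```

Calling saved_percent(33.0, 6.0) gives a little over 18, so the asynchronous 
download frees roughly a fifth of the per-frame budget.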

We considered further optimizing the code by changing ffmpeg so that it can 
internally use cudaStreamPerThread and cuMemcpyAsync, so that the 
device-to-device copies are also asynchronous and overlapped with the rest of 
the computation, but the time saved is much lower, and we have other 
optimizations to do in our code that can save more time.

Nevertheless, if you find this last optimization interesting, let us know.

Also, please let us know about anything we did wrong or missed.

Thanks! 
   
---
 libavcodec/nvdec.c         | 14 +++++--
 libavutil/hwcontext.c      | 15 ++++---
 libavutil/hwcontext_cuda.c | 97 ++++++++++++++++++++++++++++------------------
 libavutil/hwcontext_cuda.h |  1 +
 4 files changed, 80 insertions(+), 47 deletions(-)

diff --git a/libavcodec/nvdec.c b/libavcodec/nvdec.c
index ab3cb88..af92218 100644
--- a/libavcodec/nvdec.c
+++ b/libavcodec/nvdec.c
@@ -39,6 +39,7 @@ typedef struct NVDECDecoder {
 
     AVBufferRef *hw_device_ref;
     CUcontext    cuda_ctx;
+    int          is_ctx_externally_allocated;
 
     CudaFunctions *cudl;
     CuvidFunctions *cvdl;
@@ -188,6 +189,7 @@ static int nvdec_decoder_create(AVBufferRef **out, AVBufferRef *hw_device_ref,
         goto fail;
     }
     decoder->cuda_ctx = device_hwctx->cuda_ctx;
+    decoder->is_ctx_externally_allocated = device_hwctx->is_ctx_externally_allocated;
     decoder->cudl = device_hwctx->internal->cuda_dl;
 
     ret = cuvid_load_functions(&decoder->cvdl, logctx);
@@ -370,9 +372,11 @@ static int nvdec_retrieve_data(void *logctx, AVFrame *frame)
     unsigned int offset = 0;
     int ret = 0;
 
-    err = decoder->cudl->cuCtxPushCurrent(decoder->cuda_ctx);
-    if (err != CUDA_SUCCESS)
-        return AVERROR_UNKNOWN;
+    if (!decoder->is_ctx_externally_allocated) {
+        err = decoder->cudl->cuCtxPushCurrent(decoder->cuda_ctx);
+        if (err != CUDA_SUCCESS)
+            return AVERROR_UNKNOWN;
+    }
 
     err = decoder->cvdl->cuvidMapVideoFrame(decoder->decoder, cf->idx, &devptr,
                                             &pitch, &vpp);
@@ -411,7 +415,9 @@ copy_fail:
     decoder->cvdl->cuvidUnmapVideoFrame(decoder->decoder, devptr);
 
 finish:
-    decoder->cudl->cuCtxPopCurrent(&dummy);
+    if (!decoder->is_ctx_externally_allocated)
+        decoder->cudl->cuCtxPopCurrent(&dummy);
+
     return ret;
 }
 
diff --git a/libavutil/hwcontext.c b/libavutil/hwcontext.c
index 70c556e..51bc8c8 100644
--- a/libavutil/hwcontext.c
+++ b/libavutil/hwcontext.c
@@ -575,12 +575,17 @@ int av_hwdevice_ctx_create(AVBufferRef **pdevice_ref, enum AVHWDeviceType type,
     AVHWDeviceContext *device_ctx;
     int ret = 0;
 
-    device_ref = av_hwdevice_ctx_alloc(type);
-    if (!device_ref) {
-        ret = AVERROR(ENOMEM);
-        goto fail;
+    if (!*pdevice_ref) {
+        device_ref = av_hwdevice_ctx_alloc(type);
+        if (!device_ref) {
+            ret = AVERROR(ENOMEM);
+            goto fail;
+        }
+        device_ctx = (AVHWDeviceContext*)device_ref->data;
+    } else {
+        device_ref = *pdevice_ref;
+        device_ctx = (AVHWDeviceContext*)device_ref->data;
     }
-    device_ctx = (AVHWDeviceContext*)device_ref->data;
 
     if (!device_ctx->internal->hw_type->device_create) {
         ret = AVERROR(ENOSYS);
diff --git a/libavutil/hwcontext_cuda.c b/libavutil/hwcontext_cuda.c
index 37827a7..c7ad0ed 100644
--- a/libavutil/hwcontext_cuda.c
+++ b/libavutil/hwcontext_cuda.c
@@ -73,11 +73,13 @@ static void cuda_buffer_free(void *opaque, uint8_t *data)
 
     CUcontext dummy;
 
-    cu->cuCtxPushCurrent(hwctx->cuda_ctx);
+    if (!hwctx->is_ctx_externally_allocated)
+        cu->cuCtxPushCurrent(hwctx->cuda_ctx);
 
     cu->cuMemFree((CUdeviceptr)data);
 
-    cu->cuCtxPopCurrent(&dummy);
+    if (!hwctx->is_ctx_externally_allocated)
+        cu->cuCtxPopCurrent(&dummy);
 }
 
 static AVBufferRef *cuda_pool_alloc(void *opaque, int size)
@@ -91,10 +93,12 @@ static AVBufferRef *cuda_pool_alloc(void *opaque, int size)
     CUdeviceptr data;
     CUresult err;
 
-    err = cu->cuCtxPushCurrent(hwctx->cuda_ctx);
-    if (err != CUDA_SUCCESS) {
-        av_log(ctx, AV_LOG_ERROR, "Error setting current CUDA context\n");
-        return NULL;
+    if (!hwctx->is_ctx_externally_allocated) {
+        err = cu->cuCtxPushCurrent(hwctx->cuda_ctx);
+        if (err != CUDA_SUCCESS) {
+            av_log(ctx, AV_LOG_ERROR, "Error setting current CUDA context\n");
+            return NULL;
+        }
     }
 
     err = cu->cuMemAlloc(&data, size);
@@ -108,7 +112,9 @@ static AVBufferRef *cuda_pool_alloc(void *opaque, int size)
     }
 
 fail:
-    cu->cuCtxPopCurrent(&dummy);
+    if (!hwctx->is_ctx_externally_allocated)
+        cu->cuCtxPopCurrent(&dummy);
+
     return ret;
 }
 
@@ -242,9 +248,11 @@ static int cuda_transfer_data_from(AVHWFramesContext *ctx, AVFrame *dst,
     CUresult err;
     int i;
 
-    err = cu->cuCtxPushCurrent(device_hwctx->cuda_ctx);
-    if (err != CUDA_SUCCESS)
-        return AVERROR_UNKNOWN;
+    if (!device_hwctx->is_ctx_externally_allocated) {
+        err = cu->cuCtxPushCurrent(device_hwctx->cuda_ctx);
+        if (err != CUDA_SUCCESS)
+            return AVERROR_UNKNOWN;
+    }
 
     for (i = 0; i < FF_ARRAY_ELEMS(src->data) && src->data[i]; i++) {
         CUDA_MEMCPY2D cpy = {
@@ -265,7 +273,8 @@ static int cuda_transfer_data_from(AVHWFramesContext *ctx, AVFrame *dst,
         }
     }
 
-    cu->cuCtxPopCurrent(&dummy);
+    if (!device_hwctx->is_ctx_externally_allocated)
+        cu->cuCtxPopCurrent(&dummy);
 
     return 0;
 }
@@ -281,9 +290,11 @@ static int cuda_transfer_data_to(AVHWFramesContext *ctx, AVFrame *dst,
     CUresult err;
     int i;
 
-    err = cu->cuCtxPushCurrent(device_hwctx->cuda_ctx);
-    if (err != CUDA_SUCCESS)
-        return AVERROR_UNKNOWN;
+    if (!device_hwctx->is_ctx_externally_allocated) {
+        err = cu->cuCtxPushCurrent(device_hwctx->cuda_ctx);
+        if (err != CUDA_SUCCESS)
+            return AVERROR_UNKNOWN;
+    }
 
     for (i = 0; i < FF_ARRAY_ELEMS(src->data) && src->data[i]; i++) {
         CUDA_MEMCPY2D cpy = {
@@ -304,7 +315,8 @@ static int cuda_transfer_data_to(AVHWFramesContext *ctx, AVFrame *dst,
         }
     }
 
-    cu->cuCtxPopCurrent(&dummy);
+    if (!device_hwctx->is_ctx_externally_allocated)
+        cu->cuCtxPopCurrent(&dummy);
 
     return 0;
 }
@@ -314,7 +326,8 @@ static void cuda_device_uninit(AVHWDeviceContext *ctx)
     AVCUDADeviceContext *hwctx = ctx->hwctx;
 
     if (hwctx->internal) {
-        if (hwctx->internal->is_allocated && hwctx->cuda_ctx) {
+        if (hwctx->internal->is_allocated && hwctx->cuda_ctx
+            && !hwctx->is_ctx_externally_allocated) {
             hwctx->internal->cuda_dl->cuCtxDestroy(hwctx->cuda_ctx);
             hwctx->cuda_ctx = NULL;
         }
@@ -351,42 +364,50 @@ error:
 }
 
 static int cuda_device_create(AVHWDeviceContext *ctx, const char *device,
-                              AVDictionary *opts, int flags)
+    AVDictionary *opts, int flags)
 {
     AVCUDADeviceContext *hwctx = ctx->hwctx;
     CudaFunctions *cu;
     CUdevice cu_device;
     CUcontext dummy;
     CUresult err;
-    int device_idx = 0;
-
-    if (device)
-        device_idx = strtol(device, NULL, 0);
 
+    if (hwctx->cuda_ctx) {
+        hwctx->is_ctx_externally_allocated = 1;
+    } else {
+        hwctx->is_ctx_externally_allocated = 0;
+    }
     if (cuda_device_init(ctx) < 0)
         goto error;
 
-    cu = hwctx->internal->cuda_dl;
+    if (!hwctx->is_ctx_externally_allocated) {
+        int device_idx = 0;
 
-    err = cu->cuInit(0);
-    if (err != CUDA_SUCCESS) {
-        av_log(ctx, AV_LOG_ERROR, "Could not initialize the CUDA driver API\n");
-        goto error;
-    }
+        if (device)
+            device_idx = strtol(device, NULL, 0);
 
-    err = cu->cuDeviceGet(&cu_device, device_idx);
-    if (err != CUDA_SUCCESS) {
-        av_log(ctx, AV_LOG_ERROR, "Could not get the device number %d\n", device_idx);
-        goto error;
-    }
+        cu = hwctx->internal->cuda_dl;
 
-    err = cu->cuCtxCreate(&hwctx->cuda_ctx, CU_CTX_SCHED_BLOCKING_SYNC, cu_device);
-    if (err != CUDA_SUCCESS) {
-        av_log(ctx, AV_LOG_ERROR, "Error creating a CUDA context\n");
-        goto error;
-    }
+        err = cu->cuInit(0);
+        if (err != CUDA_SUCCESS) {
+            av_log(ctx, AV_LOG_ERROR, "Could not initialize the CUDA driver API\n");
+            goto error;
+        }
+
+        err = cu->cuDeviceGet(&cu_device, device_idx);
+        if (err != CUDA_SUCCESS) {
+            av_log(ctx, AV_LOG_ERROR, "Could not get the device number %d\n", device_idx);
+            goto error;
+        }
 
-    cu->cuCtxPopCurrent(&dummy);
+        err = cu->cuCtxCreate(&hwctx->cuda_ctx, CU_CTX_SCHED_BLOCKING_SYNC, cu_device);
+        if (err != CUDA_SUCCESS) {
+            av_log(ctx, AV_LOG_ERROR, "Error creating a CUDA context\n");
+            goto error;
+        }
+
+        cu->cuCtxPopCurrent(&dummy);
+    }
 
     hwctx->internal->is_allocated = 1;
 
diff --git a/libavutil/hwcontext_cuda.h b/libavutil/hwcontext_cuda.h
index 12dae84..e6435c1 100644
--- a/libavutil/hwcontext_cuda.h
+++ b/libavutil/hwcontext_cuda.h
@@ -42,6 +42,7 @@ typedef struct AVCUDADeviceContextInternal AVCUDADeviceContextInternal;
 typedef struct AVCUDADeviceContext {
     CUcontext cuda_ctx;
     AVCUDADeviceContextInternal *internal;
+    int is_ctx_externally_allocated;
 } AVCUDADeviceContext;
 
 /**
-- 
2.7.4

[FFmpeg-devel] [PATCH] Added the possibility to pass an externally created CUDA context to libavutil/hwcontext.c/av_hwdevice_ctx_create() for decoding with NVDEC