[FFmpeg-devel] [PATCH] avfilter: add CUDA stack filters (hstack_cuda, vstack_cuda, xstack_cuda)

Faeez Kadiri Fri, 23 May 2025 14:58:46 -0700

Add hardware-accelerated stack filters for CUDA that provide equivalent
functionality to the software stack filters but with GPU acceleration.


Features:
- Support for hstack, vstack, and xstack operations
- Compatible pixel formats such as:
  yuv420p, nv12, yuv444p, p010le, p016le, yuv444p16le, rgb0, bgr0, rgba, bgra
- Fill color support with automatic RGB to YUV conversion for YUV formats
- Proper chroma subsampling handling for all supported formats
- Integration with existing stack filter infrastructure via stack_internal.h

The implementation follows the established CUDA filter pattern from
vf_scale_cuda.c, using PTX modules for kernel execution and proper
CUDA context management. Copy operations handle frame placement while
color operations fill background areas when using fill colors.

This enables efficient video composition workflows entirely on GPU
without CPU-GPU memory transfers, significantly improving performance
for multi-input video processing pipelines.

Examples:
$ ffmpeg -hwaccel cuda -i input.h265 -filter_complex "[0:v][0:v]hstack_cuda" 
-c:v hevc_nvenc out.h265

$ ffmpeg \
  -hwaccel cuda -i input1.mp4 \
  -hwaccel cuda -i input2.mp4 \
  -hwaccel cuda -i input3.mp4 \
  -hwaccel cuda -i input4.mp4 \
  -filter_complex 
"[0:v]hwupload_cuda[0v];[1:v]hwupload_cuda[1v];[2:v]hwupload_cuda[2v];[3:v]hwupload_cuda[3v];[0v][1v][2v][3v]xstack_cuda=inputs=4:fill=black:layout=0_0|w0_0|0_h0|w0_h0"
 \
  -c:v hevc_nvenc out.mp4

Signed-off-by: Faeez Kadiri <f1k2fa...@gmail.com>
---
 Changelog                    |   1 +
 configure                    |   6 +
 doc/filters.texi             |  78 +++++
 libavfilter/Makefile         |   3 +
 libavfilter/allfilters.c     |   3 +
 libavfilter/vf_stack_cuda.c  | 589 +++++++++++++++++++++++++++++++++++
 libavfilter/vf_stack_cuda.cu | 389 +++++++++++++++++++++++
 7 files changed, 1069 insertions(+)
 create mode 100644 libavfilter/vf_stack_cuda.c
 create mode 100644 libavfilter/vf_stack_cuda.cu

diff --git a/Changelog b/Changelog
index 4217449438..0dec3443d4 100644
--- a/Changelog
+++ b/Changelog
@@ -18,6 +18,7 @@ version <next>:
 - APV encoding support through a libopenapv wrapper
 - VVC decoder supports all content of SCC (Screen Content Coding):
   IBC (Inter Block Copy), Palette Mode and ACT (Adaptive Color Transform
+- hstack_cuda, vstack_cuda and xstack_cuda filters
 
 
 version 7.1:
diff --git a/configure b/configure
index 3730b0524c..5c2d6e132d 100755
--- a/configure
+++ b/configure
@@ -4033,6 +4033,12 @@ xfade_vulkan_filter_deps="vulkan spirv_compiler"
 yadif_cuda_filter_deps="ffnvcodec"
 yadif_cuda_filter_deps_any="cuda_nvcc cuda_llvm"
 yadif_videotoolbox_filter_deps="metal corevideo videotoolbox"
+hstack_cuda_filter_deps="ffnvcodec"
+hstack_cuda_filter_deps_any="cuda_nvcc cuda_llvm"
+vstack_cuda_filter_deps="ffnvcodec"
+vstack_cuda_filter_deps_any="cuda_nvcc cuda_llvm"
+xstack_cuda_filter_deps="ffnvcodec"
+xstack_cuda_filter_deps_any="cuda_nvcc cuda_llvm"
 hstack_vaapi_filter_deps="vaapi_1"
 vstack_vaapi_filter_deps="vaapi_1"
 xstack_vaapi_filter_deps="vaapi_1"
diff --git a/doc/filters.texi b/doc/filters.texi
index 6d2df07508..1c9afac9eb 100644
--- a/doc/filters.texi
+++ b/doc/filters.texi
@@ -26850,6 +26850,84 @@ Only deinterlace frames marked as interlaced.
 The default value is @code{all}.
 @end table
 
+@section hstack_cuda
+Stack input videos horizontally.
+
+This is the CUDA variant of the @ref{vstack} filter, each input stream may
+have different width, this filter will scale down/up each input stream while
+keeping the orignal aspect.
+
+It accepts the following options:
+
+@table @option
+@item inputs
+See @ref{hstack}.
+
+@item shortest
+See @ref{hstack}.
+
+@item height
+Set height of output. If set to 0, this filter will set height of output to
+height of the first input stream. Default value is 0.
+@end table
+
+@section vstack_cuda
+Stack input videos vertically.
+
+This is the CUDA variant of the @ref{vstack} filter, each input stream may
+have different width, this filter will scale down/up each input stream while
+keeping the orignal aspect.
+
+It accepts the following options:
+
+@table @option
+@item inputs
+See @ref{vstack}.
+
+@item shortest
+See @ref{vstack}.
+
+@item width
+Set width of output. If set to 0, this filter will set width of output to
+width of the first input stream. Default value is 0.
+@end table
+
+@section xstack_cuda
+Stack video inputs into custom layout.
+
+This is the CUDA variant of the @ref{xstack} filter,  each input stream may
+have different size, this filter will scale down/up each input stream to the
+given output size, or the size of the first input stream.
+
+It accepts the following options:
+
+@table @option
+@item inputs
+See @ref{xstack}.
+
+@item shortest
+See @ref{xstack}.
+
+@item layout
+See @ref{xstack}.
+Moreover, this permits the user to supply output size for each input stream.
+@example
+xstack_cuda=inputs=4:layout=0_0_1920x1080|0_h0_1920x1080|w0_0_1920x1080|w0_h0_1920x1080
+@end example
+
+@item grid
+See @ref{xstack}.
+
+@item grid_tile_size
+Set output size for each input stream when @option{grid} is set. If this option
+is not set, this filter will set output size by default to the size of the
+first input stream. For the syntax of this option, check the
+@ref{video size syntax,,"Video size" section in the ffmpeg-utils 
manual,ffmpeg-utils}.
+
+@item fill
+See @ref{xstack}.
+@end table
+
 @anchor{CUDA NPP}
 @section CUDA NPP
 Below is a description of the currently available NVIDIA Performance 
Primitives (libnpp) video filters.
diff --git a/libavfilter/Makefile b/libavfilter/Makefile
index 0effe4127f..ad876ccd53 100644
--- a/libavfilter/Makefile
+++ b/libavfilter/Makefile
@@ -583,6 +583,9 @@ OBJS-$(CONFIG_YAEPBLUR_FILTER)               += 
vf_yaepblur.o
 OBJS-$(CONFIG_ZMQ_FILTER)                    += f_zmq.o
 OBJS-$(CONFIG_ZOOMPAN_FILTER)                += vf_zoompan.o
 OBJS-$(CONFIG_ZSCALE_FILTER)                 += vf_zscale.o
+OBJS-$(CONFIG_HSTACK_CUDA_FILTER)            += vf_stack_cuda.o framesync.o 
vf_stack_cuda.ptx.o cuda/load_helper.o
+OBJS-$(CONFIG_VSTACK_CUDA_FILTER)            += vf_stack_cuda.o framesync.o 
vf_stack_cuda.ptx.o cuda/load_helper.o
+OBJS-$(CONFIG_XSTACK_CUDA_FILTER)            += vf_stack_cuda.o framesync.o 
vf_stack_cuda.ptx.o cuda/load_helper.o
 OBJS-$(CONFIG_HSTACK_VAAPI_FILTER)           += vf_stack_vaapi.o framesync.o 
vaapi_vpp.o
 OBJS-$(CONFIG_VSTACK_VAAPI_FILTER)           += vf_stack_vaapi.o framesync.o 
vaapi_vpp.o
 OBJS-$(CONFIG_XSTACK_VAAPI_FILTER)           += vf_stack_vaapi.o framesync.o 
vaapi_vpp.o
diff --git a/libavfilter/allfilters.c b/libavfilter/allfilters.c
index 5ea33cdf01..89a7fb9277 100644
--- a/libavfilter/allfilters.c
+++ b/libavfilter/allfilters.c
@@ -545,6 +545,9 @@ extern const FFFilter ff_vf_yaepblur;
 extern const FFFilter ff_vf_zmq;
 extern const FFFilter ff_vf_zoompan;
 extern const FFFilter ff_vf_zscale;
+extern const FFFilter ff_vf_hstack_cuda;
+extern const FFFilter ff_vf_vstack_cuda;
+extern const FFFilter ff_vf_xstack_cuda;
 extern const FFFilter ff_vf_hstack_vaapi;
 extern const FFFilter ff_vf_vstack_vaapi;
 extern const FFFilter ff_vf_xstack_vaapi;
diff --git a/libavfilter/vf_stack_cuda.c b/libavfilter/vf_stack_cuda.c
new file mode 100644
index 0000000000..002602b2bf
--- /dev/null
+++ b/libavfilter/vf_stack_cuda.c
@@ -0,0 +1,589 @@
+/*
+ * Copyright (c) 2025, Faeez Kadiri < f1k2faeez at gmail dot com>
+ *
+ * This file is part of FFmpeg.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+ * DEALINGS IN THE SOFTWARE.
+ */
+
+/**
+ * @file
+ * Hardware accelerated hstack, vstack and xstack filters based on CUDA
+ */
+
+#include "config_components.h"
+
+#include "libavutil/opt.h"
+#include "libavutil/common.h"
+#include "libavutil/pixdesc.h"
+#include "libavutil/eval.h"
+#include "libavutil/hwcontext.h"
+#include "libavutil/hwcontext_cuda_internal.h"
+#include "libavutil/cuda_check.h"
+#include "libavutil/avstring.h"
+#include "libavutil/avassert.h"
+#include "libavutil/imgutils.h"
+#include "libavutil/mathematics.h"
+#include "libavutil/parseutils.h"
+#include "libavutil/colorspace.h"
+#include "libavutil/mem.h"
+
+#include "filters.h"
+#include "formats.h"
+#include "video.h"
+
+#include "framesync.h"
+#include "cuda/load_helper.h"
+
+static const enum AVPixelFormat supported_formats[] = {
+    AV_PIX_FMT_YUV420P,
+    AV_PIX_FMT_NV12,
+    AV_PIX_FMT_YUV444P,
+    AV_PIX_FMT_P010,
+    AV_PIX_FMT_P016,
+    AV_PIX_FMT_YUV444P16,
+    AV_PIX_FMT_0RGB32,
+    AV_PIX_FMT_0BGR32,
+    AV_PIX_FMT_RGB32,
+    AV_PIX_FMT_BGR32,
+};
+
+#define DIV_UP(a, b) ( ((a) + (b) - 1) / (b) )
+#define BLOCKX 32
+#define BLOCKY 16
+
+#define CHECK_CU(x) FF_CUDA_CHECK_DL(ctx, s->hwctx->internal->cuda_dl, x)
+
+typedef struct CUDAStackContext {
+    AVCUDADeviceContext *hwctx;
+    CudaFunctions *cuda_dl;
+
+    CUcontext   cu_ctx;
+    CUmodule    cu_module;
+    CUstream    cu_stream;
+
+    // For copy operations
+    CUfunction  cu_func_copy;
+    CUfunction  cu_func_copy_uv;
+
+    // For color operations
+    CUfunction  cu_func_color;
+    CUfunction  cu_func_color_uv;
+
+    enum AVPixelFormat in_fmt;
+    const AVPixFmtDescriptor *in_desc;
+    int in_planes;
+    int in_plane_depths[4];
+    int in_plane_channels[4];
+
+    uint8_t fillcolor_rgba[4];
+    uint8_t fillcolor_yuv[4];
+} CUDAStackContext;
+
+#define HSTACK_NAME             "hstack_cuda"
+#define VSTACK_NAME             "vstack_cuda"
+#define XSTACK_NAME             "xstack_cuda"
+#define HWContext               CUDAStackContext
+#define StackHWContext          StackCudaContext
+#include "stack_internal.h"
+
+typedef struct StackCudaContext {
+    StackBaseContext base;
+    CUDAStackContext cuda;
+} StackCudaContext;
+
+static int format_is_supported(enum AVPixelFormat fmt)
+{
+    int i;
+
+    for (i = 0; i < FF_ARRAY_ELEMS(supported_formats); i++)
+        if (supported_formats[i] == fmt)
+            return 1;
+    return 0;
+}
+
+static void rgb2yuv(float r, float g, float b, int *y, int *u, int *v, int 
depth)
+{
+    *y = ((0.21260*219.0/255.0) * r + (0.71520*219.0/255.0) * g +
+         (0.07220*219.0/255.0) * b) * ((1 << depth) - 1);
+    *u = (-(0.11457*224.0/255.0) * r - (0.38543*224.0/255.0) * g +
+         (0.50000*224.0/255.0) * b + 0.5) * ((1 << depth) - 1);
+    *v = ((0.50000*224.0/255.0) * r - (0.45415*224.0/255.0) * g -
+         (0.04585*224.0/255.0) * b + 0.5) * ((1 << depth) - 1);
+}
+
+static av_cold int cuda_stack_load_functions(AVFilterContext *ctx, enum 
AVPixelFormat format)
+{
+    StackCudaContext *sctx = ctx->priv;
+    CUDAStackContext *s = &sctx->cuda;
+    CUcontext dummy, cuda_ctx = s->hwctx->cuda_ctx;
+    CudaFunctions *cu = s->cuda_dl;
+    int ret;
+    char buf[128];
+
+    const char *fmt_name = av_get_pix_fmt_name(format);
+
+    extern const unsigned char ff_vf_stack_cuda_ptx_data[];
+    extern const unsigned int ff_vf_stack_cuda_ptx_len;
+
+    ret = CHECK_CU(cu->cuCtxPushCurrent(cuda_ctx));
+    if (ret < 0)
+        return ret;
+
+    ret = ff_cuda_load_module(ctx, s->hwctx, &s->cu_module,
+                              ff_vf_stack_cuda_ptx_data, 
ff_vf_stack_cuda_ptx_len);
+    if (ret < 0)
+        goto fail;
+
+    // Load copy functions
+    snprintf(buf, sizeof(buf), "StackCopy_%s_%s", fmt_name, fmt_name);
+    ret = CHECK_CU(cu->cuModuleGetFunction(&s->cu_func_copy, s->cu_module, 
buf));
+    if (ret < 0) {
+        av_log(ctx, AV_LOG_FATAL, "Unsupported format for copy: %s\n", 
fmt_name);
+        ret = AVERROR(ENOSYS);
+        goto fail;
+    }
+
+    snprintf(buf, sizeof(buf), "StackCopy_%s_%s_uv", fmt_name, fmt_name);
+    ret = CHECK_CU(cu->cuModuleGetFunction(&s->cu_func_copy_uv, s->cu_module, 
buf));
+    if (ret < 0)
+        goto fail;
+
+    // Load color functions
+    snprintf(buf, sizeof(buf), "SetColor_%s", fmt_name);
+    ret = CHECK_CU(cu->cuModuleGetFunction(&s->cu_func_color, s->cu_module, 
buf));
+    if (ret < 0) {
+        av_log(ctx, AV_LOG_FATAL, "Unsupported format for color: %s\n", 
fmt_name);
+        ret = AVERROR(ENOSYS);
+        goto fail;
+    }
+
+    snprintf(buf, sizeof(buf), "SetColor_%s_uv", fmt_name);
+    ret = CHECK_CU(cu->cuModuleGetFunction(&s->cu_func_color_uv, s->cu_module, 
buf));
+    if (ret < 0)
+        goto fail;
+
+fail:
+    CHECK_CU(cu->cuCtxPopCurrent(&dummy));
+    return ret;
+}
+
+static av_cold int cuda_stack_color_kernel(AVFilterContext *ctx, CUfunction 
func,
+                            AVFrame *out_frame, const uint8_t *color,
+                            int width, int height,
+                            int dst_x, int dst_y,
+                            int dst_width, int dst_height, int dst_pitch)
+{
+    StackCudaContext *sctx = ctx->priv;
+    CUDAStackContext *s = &sctx->cuda;
+    CudaFunctions *cu = s->cuda_dl;
+
+    CUdeviceptr dst_devptr[4] = {
+        (CUdeviceptr)out_frame->data[0], (CUdeviceptr)out_frame->data[1],
+        (CUdeviceptr)out_frame->data[2], (CUdeviceptr)out_frame->data[3]
+    };
+
+    void *args[] = {
+        &dst_devptr[0], &dst_devptr[1], &dst_devptr[2], &dst_devptr[3],
+        &width, &height, &dst_pitch,
+        &dst_x, &dst_y,
+        (void *)&color[0], (void *)&color[1], (void *)&color[2], (void 
*)&color[3],
+        &dst_width, &dst_height,
+    };
+
+    return CHECK_CU(cu->cuLaunchKernel(func,
+                                     DIV_UP(width, BLOCKX), DIV_UP(height, 
BLOCKY), 1,
+                                     BLOCKX, BLOCKY, 1,
+                                     0, s->cu_stream, args, NULL));
+}
+
+static av_cold int cuda_stack_copy_kernel(AVFilterContext *ctx, CUfunction 
func,
+                            CUtexObject src_tex[4],
+                            AVFrame *out_frame,
+                            int width, int height,
+                            int dst_x, int dst_y, int dst_pitch,
+                            int src_width, int src_height)
+{
+    StackCudaContext *sctx = ctx->priv;
+    CUDAStackContext *s = &sctx->cuda;
+    CudaFunctions *cu = s->cuda_dl;
+
+    CUdeviceptr dst_devptr[4] = {
+        (CUdeviceptr)out_frame->data[0], (CUdeviceptr)out_frame->data[1],
+        (CUdeviceptr)out_frame->data[2], (CUdeviceptr)out_frame->data[3]
+    };
+
+    void *args[] = {
+        &src_tex[0], &src_tex[1], &src_tex[2], &src_tex[3],
+        &dst_devptr[0], &dst_devptr[1], &dst_devptr[2], &dst_devptr[3],
+        &width, &height, &dst_pitch,
+        &dst_x, &dst_y,
+        &src_width, &src_height,
+        &out_frame->width, &out_frame->height
+    };
+
+    return CHECK_CU(cu->cuLaunchKernel(func,
+                                     DIV_UP(width, BLOCKX), DIV_UP(height, 
BLOCKY), 1,
+                                     BLOCKX, BLOCKY, 1,
+                                     0, s->cu_stream, args, NULL));
+}
+
+static int cuda_stack_color_op(AVFilterContext *ctx, StackItemRegion *region, 
AVFrame *out, const uint8_t *color) {
+    StackCudaContext *sctx = ctx->priv;
+    CUDAStackContext *s = &sctx->cuda;
+    CudaFunctions *cu = s->cuda_dl;
+    int ret = 0;
+    CUcontext dummy;
+
+    // Push CUDA context
+    ret = CHECK_CU(cu->cuCtxPushCurrent(s->hwctx->cuda_ctx));
+    if (ret < 0)
+        return ret;
+
+    ret = cuda_stack_color_kernel(ctx, s->cu_func_color,
+                                out, color, region->width, region->height,
+                                region->x, region->y,
+                                out->width, out->height,
+                                out->linesize[0]);
+    if (ret < 0) {
+        av_log(ctx, AV_LOG_ERROR, "Error during color operation: %d\n", ret);
+        goto fail;
+    }
+
+    if (s->in_planes > 1) {
+        ret = cuda_stack_color_kernel(ctx, s->cu_func_color_uv,
+                                    out, color,
+                                    AV_CEIL_RSHIFT(region->width, 
s->in_desc->log2_chroma_w),
+                                    AV_CEIL_RSHIFT(region->height, 
s->in_desc->log2_chroma_h),
+                                    AV_CEIL_RSHIFT(region->x, 
s->in_desc->log2_chroma_w),
+                                    AV_CEIL_RSHIFT(region->y, 
s->in_desc->log2_chroma_h),
+                                    out->width, out->height,
+                                    out->linesize[1]);
+        if (ret < 0)
+            av_log(ctx, AV_LOG_ERROR, "Error during color UV operation: %d\n", 
ret);
+    }
+
+fail:
+    CHECK_CU(cu->cuCtxPopCurrent(&dummy));
+    return ret;
+}
+
+static int cuda_stack_copy_op(AVFilterContext *ctx, StackItemRegion *region, 
AVFrame *in, AVFrame *out) {
+    StackCudaContext *sctx = ctx->priv;
+    CUDAStackContext *s = &sctx->cuda;
+    CudaFunctions *cu = s->cuda_dl;
+    CUtexObject tex[4] = { 0, 0, 0, 0 };
+    int ret = 0;
+    int i;
+    CUcontext dummy;
+
+    // Push CUDA context
+    ret = CHECK_CU(cu->cuCtxPushCurrent(s->hwctx->cuda_ctx));
+    if (ret < 0)
+        return ret;
+
+    for (i = 0; i < s->in_planes; i++) {
+        CUDA_TEXTURE_DESC tex_desc = {
+            .filterMode = CU_TR_FILTER_MODE_POINT,
+            .flags = CU_TRSF_READ_AS_INTEGER,
+        };
+
+        CUDA_RESOURCE_DESC res_desc = {
+            .resType = CU_RESOURCE_TYPE_PITCH2D,
+            .res.pitch2D.format = s->in_plane_depths[i] <= 8 ?
+                                  CU_AD_FORMAT_UNSIGNED_INT8 :
+                                  CU_AD_FORMAT_UNSIGNED_INT16,
+            .res.pitch2D.numChannels = s->in_plane_channels[i],
+            .res.pitch2D.pitchInBytes = in->linesize[i],
+            .res.pitch2D.devPtr = (CUdeviceptr)in->data[i],
+        };
+
+        if (i == 1 || i == 2) {
+            res_desc.res.pitch2D.width = AV_CEIL_RSHIFT(in->width, 
s->in_desc->log2_chroma_w);
+            res_desc.res.pitch2D.height = AV_CEIL_RSHIFT(in->height, 
s->in_desc->log2_chroma_h);
+        } else {
+            res_desc.res.pitch2D.width = in->width;
+            res_desc.res.pitch2D.height = in->height;
+        }
+
+        ret = CHECK_CU(cu->cuTexObjectCreate(&tex[i], &res_desc, &tex_desc, 
NULL));
+        if (ret < 0)
+            goto fail;
+    }
+
+    ret = cuda_stack_copy_kernel(ctx, s->cu_func_copy,
+                             tex, out, region->width, region->height,
+                                region->x, region->y, out->linesize[0],
+                                in->width, in->height);
+    if (ret < 0) {
+        av_log(ctx, AV_LOG_ERROR, "Error during copy operation: %d\n", ret);
+        goto fail;
+    }
+
+    if (s->in_planes > 1) {
+        ret = cuda_stack_copy_kernel(ctx, s->cu_func_copy_uv, tex, out,
+                                    AV_CEIL_RSHIFT(region->width, 
s->in_desc->log2_chroma_w),
+                                    AV_CEIL_RSHIFT(region->height, 
s->in_desc->log2_chroma_h),
+                                    AV_CEIL_RSHIFT(region->x, 
s->in_desc->log2_chroma_w),
+                                    AV_CEIL_RSHIFT(region->y, 
s->in_desc->log2_chroma_h),
+                                    out->linesize[1],
+                                    AV_CEIL_RSHIFT(in->width, 
s->in_desc->log2_chroma_w),
+                                    AV_CEIL_RSHIFT(in->height, 
s->in_desc->log2_chroma_h));
+        if (ret < 0)
+            av_log(ctx, AV_LOG_ERROR, "Error during copy UV operation: %d\n", 
ret);
+    }
+
+fail:
+    for (i = 0; i < FF_ARRAY_ELEMS(tex); i++)
+        if (tex[i])
+            CHECK_CU(cu->cuTexObjectDestroy(tex[i]));
+
+    CHECK_CU(cu->cuCtxPopCurrent(&dummy));
+    return ret;
+}
+
+static int process_frame(FFFrameSync *fs)
+{
+    AVFilterContext *ctx = fs->parent;
+    StackCudaContext *sctx = fs->opaque;
+    CUDAStackContext *s = &sctx->cuda;
+    AVFilterLink *outlink = ctx->outputs[0];
+    AVFrame *out_frame = NULL;
+    AVFrame *in_frame = NULL;
+    int ret = 0;
+
+    out_frame = ff_get_video_buffer(outlink, outlink->w, outlink->h);
+    if (!out_frame)
+        return AVERROR(ENOMEM);
+
+    // Fill the entire output frame with fill color if enabled
+    if (sctx->base.fillcolor_enable) {
+        StackItemRegion full_region = {
+            .x = 0,
+            .y = 0,
+            .width = outlink->w,
+            .height = outlink->h
+        };
+
+        ret = cuda_stack_color_op(ctx, &full_region, out_frame, 
s->fillcolor_yuv);
+        if (ret < 0) {
+            av_log(ctx, AV_LOG_ERROR, "Failed to fill background color\n");
+            goto fail;
+        }
+    }
+
+    for (int i = 0; i < ctx->nb_inputs; i++) {
+        ret = ff_framesync_get_frame(fs, i, &in_frame, 0);
+        if (ret)
+            goto fail;
+
+        if (i == 0) {
+            ret = av_frame_copy_props(out_frame, in_frame);
+            if (ret < 0)
+                goto fail;
+        }
+
+        ret = cuda_stack_copy_op(ctx, &sctx->base.regions[i], in_frame, 
out_frame);
+        if (ret < 0)
+            goto fail;
+    }
+
+    out_frame->pts = av_rescale_q(sctx->base.fs.pts, sctx->base.fs.time_base, 
outlink->time_base);
+    out_frame->sample_aspect_ratio = outlink->sample_aspect_ratio;
+
+    return ff_filter_frame(outlink, out_frame);
+
+fail:
+    av_frame_free(&out_frame);
+    return ret;
+}
+
+static int config_output(AVFilterLink *outlink)
+{
+    AVFilterContext *ctx = outlink->src;
+    StackCudaContext *sctx = ctx->priv;
+    CUDAStackContext *s = &sctx->cuda;
+    AVFilterLink *inlink0 = ctx->inputs[0];
+    FilterLink      *inl0 = ff_filter_link(inlink0);
+    FilterLink      *outl = ff_filter_link(outlink);
+    enum AVPixelFormat in_format;
+    int depth = 8, ret;
+    AVHWFramesContext *in_frames_ctx;
+    AVBufferRef *hw_frames_ctx;
+    AVHWFramesContext *out_frames_ctx;
+
+    if (inlink0->format != AV_PIX_FMT_CUDA || !inl0->hw_frames_ctx || 
!inl0->hw_frames_ctx->data) {
+        av_log(ctx, AV_LOG_ERROR, "Software pixel format is not supported.\n");
+        return AVERROR(EINVAL);
+    }
+
+    in_frames_ctx = (AVHWFramesContext*)inl0->hw_frames_ctx->data;
+    in_format = in_frames_ctx->sw_format;
+
+    if (!format_is_supported(in_format)) {
+        av_log(ctx, AV_LOG_ERROR, "Unsupported input format: %s\n",
+               av_get_pix_fmt_name(in_format));
+        return AVERROR(ENOSYS);
+    }
+
+    s->in_fmt = in_format;
+    s->in_desc = av_pix_fmt_desc_get(s->in_fmt);
+    s->in_planes = av_pix_fmt_count_planes(s->in_fmt);
+
+    // Set up plane information
+    for (int i = 0; i < s->in_desc->nb_components; i++) {
+        int d = (s->in_desc->comp[i].depth + 7) / 8;
+        int p = s->in_desc->comp[i].plane;
+        s->in_plane_channels[p] = FFMAX(s->in_plane_channels[p], 
s->in_desc->comp[i].step / d);
+        s->in_plane_depths[p] = s->in_desc->comp[i].depth;
+    }
+
+    s->hwctx = in_frames_ctx->device_ctx->hwctx;
+    s->cuda_dl = s->hwctx->internal->cuda_dl;
+    s->cu_stream = s->hwctx->stream;
+
+    for (int i = 1; i < sctx->base.nb_inputs; i++) {
+        AVFilterLink *inlink = ctx->inputs[i];
+        FilterLink      *inl = ff_filter_link(inlink);
+        AVHWFramesContext *hwfc = NULL;
+
+        if (inlink->format != AV_PIX_FMT_CUDA || !inl->hw_frames_ctx || 
!inl->hw_frames_ctx->data) {
+            av_log(ctx, AV_LOG_ERROR, "Software pixel format is not 
supported.\n");
+            return AVERROR(EINVAL);
+        }
+
+        hwfc = (AVHWFramesContext *)inl->hw_frames_ctx->data;
+
+        if (in_frames_ctx->sw_format != hwfc->sw_format) {
+            av_log(ctx, AV_LOG_ERROR, "All inputs should have the same 
underlying software pixel format.\n");
+            return AVERROR(EINVAL);
+        }
+    }
+
+    if (in_format == AV_PIX_FMT_P010)
+        depth = 10;
+
+    if (sctx->base.fillcolor_enable) {
+        // Check if this is an RGB format
+        if (s->in_desc->flags & AV_PIX_FMT_FLAG_RGB) {
+            // For RGB formats, use RGB values directly
+            s->fillcolor_yuv[0] = sctx->base.fillcolor[0]; // R
+            s->fillcolor_yuv[1] = sctx->base.fillcolor[1]; // G
+            s->fillcolor_yuv[2] = sctx->base.fillcolor[2]; // B
+            s->fillcolor_yuv[3] = sctx->base.fillcolor[3]; // A
+        } else {
+            // For YUV formats, convert RGB to YUV
+            int Y, U, V;
+
+            rgb2yuv(sctx->base.fillcolor[0] / 255.0, sctx->base.fillcolor[1] / 
255.0,
+                    sctx->base.fillcolor[2] / 255.0, &Y, &U, &V, depth);
+            s->fillcolor_yuv[0] = Y;
+            s->fillcolor_yuv[1] = U;
+            s->fillcolor_yuv[2] = V;
+            s->fillcolor_yuv[3] = sctx->base.fillcolor[3];
+        }
+    }
+
+    ret = config_comm_output(outlink);
+    if (ret < 0)
+        return ret;
+
+    ret = cuda_stack_load_functions(ctx, in_format);
+    if (ret < 0)
+        return ret;
+
+    // Initialize hardware frames context for output
+    hw_frames_ctx = av_hwframe_ctx_alloc(in_frames_ctx->device_ref);
+    if (!hw_frames_ctx)
+        return AVERROR(ENOMEM);
+
+    out_frames_ctx = (AVHWFramesContext*)hw_frames_ctx->data;
+    out_frames_ctx->format = AV_PIX_FMT_CUDA;
+    out_frames_ctx->sw_format = in_format;
+    out_frames_ctx->width = outlink->w;
+    out_frames_ctx->height = outlink->h;
+
+    ret = av_hwframe_ctx_init(hw_frames_ctx);
+    if (ret < 0) {
+        av_buffer_unref(&hw_frames_ctx);
+        return ret;
+    }
+
+    av_buffer_unref(&outl->hw_frames_ctx);
+    outl->hw_frames_ctx = hw_frames_ctx;
+
+    return 0;
+}
+
+static int cuda_stack_init(AVFilterContext *ctx)
+{
+    int ret;
+
+    ret = stack_init(ctx);
+    if (ret)
+        return ret;
+
+    return 0;
+}
+
+static av_cold void cuda_stack_uninit(AVFilterContext *ctx)
+{
+    StackCudaContext *sctx = ctx->priv;
+    CUDAStackContext *s = &sctx->cuda;
+
+    if (s->hwctx && s->cu_module) {
+        CudaFunctions *cu = s->cuda_dl;
+        CUcontext dummy;
+
+        CHECK_CU(cu->cuCtxPushCurrent(s->hwctx->cuda_ctx));
+        CHECK_CU(cu->cuModuleUnload(s->cu_module));
+        s->cu_module = NULL;
+        CHECK_CU(cu->cuCtxPopCurrent(&dummy));
+    }
+
+    stack_uninit(ctx);
+}
+
+static const enum AVPixelFormat cuda_stack_pix_fmts[] = {
+    AV_PIX_FMT_CUDA,
+    AV_PIX_FMT_NONE,
+};
+
+#include "stack_internal.c"
+
+#if CONFIG_HSTACK_CUDA_FILTER
+
+DEFINE_HSTACK_OPTIONS(cuda);
+DEFINE_STACK_FILTER(hstack, cuda, "CUDA", 0);
+
+#endif
+
+#if CONFIG_VSTACK_CUDA_FILTER
+
+DEFINE_VSTACK_OPTIONS(cuda);
+DEFINE_STACK_FILTER(vstack, cuda, "CUDA", 0);
+
+#endif
+
+#if CONFIG_XSTACK_CUDA_FILTER
+
+DEFINE_XSTACK_OPTIONS(cuda);
+DEFINE_STACK_FILTER(xstack, cuda, "CUDA", 0);
+
+#endif
\ No newline at end of file
diff --git a/libavfilter/vf_stack_cuda.cu b/libavfilter/vf_stack_cuda.cu
new file mode 100644
index 0000000000..c19595e0a6
--- /dev/null
+++ b/libavfilter/vf_stack_cuda.cu
@@ -0,0 +1,389 @@
+/*
+ * Copyright (c) 2025, Faeez Kadiri < f1k2faeez at gmail dot com>
+ *
+ * This file is part of FFmpeg.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+ * DEALINGS IN THE SOFTWARE.
+ */
+
+#include "cuda/vector_helpers.cuh"
+
+// --- CONVERSION LOGIC ---
+
+static const ushort mask_10bit = 0xFFC0;
+static const ushort mask_16bit = 0xFFFF;
+
+static inline __device__ ushort conv_8to16(uchar in, ushort mask)
+{
+    return ((ushort)in | ((ushort)in << 8)) & mask;
+}
+
+// FFmpeg passes pitch in bytes, CUDA uses potentially larger types
+#define FIXED_PITCH \
+    (dst_pitch/sizeof(*dst[0]))
+
+#define DEFAULT_DST(n) \
+    dst[n][yo*FIXED_PITCH+xo]
+
+#define OFFSET_DST(n) \
+    dst[n][(yo+dst_y)*FIXED_PITCH+(xo+dst_x)]
+
+// --- COMMON BOUNDS CHECKING ---
+
+#define BOUNDS_CHECK() \
+    if (xo >= width || yo >= height) \
+        return; \
+    int target_x = xo + dst_x; \
+    int target_y = yo + dst_y; \
+    if (target_x < 0 || target_y < 0 || target_x >= frame_width || target_y >= 
frame_height) \
+        return;
+
+#define BOUNDS_CHECK_UV(chroma_shift) \
+    if (xo >= width || yo >= height) \
+        return; \
+    int target_x = xo + dst_x; \
+    int target_y = yo + dst_y; \
+    int frame_uv_height = frame_height >> chroma_shift; \
+    if (target_x < 0 || target_y < 0 || target_x >= frame_width || target_y >= 
frame_uv_height) \
+        return;
+
+#define COPY_BOUNDS_CHECK() \
+    int target_x = xo + dst_x; \
+    int target_y = yo + dst_y; \
+    if (target_x < 0 || target_y < 0 || target_x >= frame_width || target_y >= 
frame_height) \
+        return;
+
+#define COPY_BOUNDS_CHECK_UV(chroma_shift) \
+    int target_x = xo + dst_x; \
+    int target_y = yo + dst_y; \
+    int frame_uv_width = frame_width >> chroma_shift; \
+    int frame_uv_height = frame_height >> chroma_shift; \
+    if (target_x < 0 || target_y < 0 || target_x >= frame_uv_width || target_y 
>= frame_uv_height) \
+        return;
+
+// --- COLOR OPERATIONS ---
+
+#define COLOR_DEF_F(N, T) \
+    __device__ static inline void N(T *dst[4], int xo, int yo, \
+                                     int width, int height, int dst_pitch, \
+                                     int dst_x, int dst_y, \
+                                     unsigned char y_color, unsigned char 
u_color, \
+                                     unsigned char v_color, unsigned char 
a_color, \
+                                     int frame_width, int frame_height)
+
+// Macro for YUV planar formats (420p, 444p, etc.)
+#define DEFINE_SETCOLOR_YUV_PLANAR(name, out_type, out_type_uv, y_assign, 
uv_assign) \
+struct SetColor_##name \
+{ \
+    typedef out_type out_T; \
+    typedef out_type_uv out_T_uv; \
+    \
+    COLOR_DEF_F(SetColor, out_T) \
+    { \
+        BOUNDS_CHECK(); \
+        OFFSET_DST(0) = y_assign; \
+    } \
+    \
+    COLOR_DEF_F(SetColor_uv, out_T_uv) \
+    { \
+        BOUNDS_CHECK(); \
+        uv_assign; \
+    } \
+};
+
+// Macro for NV12-style formats (interleaved UV)
+#define DEFINE_SETCOLOR_NV(name, out_type, out_type_uv, y_assign, uv_assign) \
+struct SetColor_##name \
+{ \
+    typedef out_type out_T; \
+    typedef out_type_uv out_T_uv; \
+    \
+    COLOR_DEF_F(SetColor, out_T) \
+    { \
+        BOUNDS_CHECK(); \
+        OFFSET_DST(0) = y_assign; \
+    } \
+    \
+    COLOR_DEF_F(SetColor_uv, out_T_uv) \
+    { \
+        BOUNDS_CHECK_UV(1); \
+        OFFSET_DST(1) = uv_assign; \
+    } \
+};
+
+// Macro for RGB formats
+#define DEFINE_SETCOLOR_RGB(name, out_type, color_assign) \
+struct SetColor_##name \
+{ \
+    typedef out_type out_T; \
+    typedef uchar out_T_uv; \
+    \
+    COLOR_DEF_F(SetColor, out_T) \
+    { \
+        BOUNDS_CHECK(); \
+        OFFSET_DST(0) = color_assign; \
+    } \
+    \
+    COLOR_DEF_F(SetColor_uv, out_T_uv) \
+    { \
+        /* No UV plane for RGB formats */ \
+    } \
+};
+
+// Define all SetColor structs using macros
+DEFINE_SETCOLOR_YUV_PLANAR(yuv420p, uchar, uchar, y_color,
+    OFFSET_DST(1) = u_color; OFFSET_DST(2) = v_color)
+
+DEFINE_SETCOLOR_NV(nv12, uchar, uchar2, y_color,
+    make_uchar2(u_color, v_color))
+
+DEFINE_SETCOLOR_YUV_PLANAR(yuv444p, uchar, uchar, y_color,
+    OFFSET_DST(1) = u_color; OFFSET_DST(2) = v_color)
+
+DEFINE_SETCOLOR_NV(p010le, ushort, ushort2, conv_8to16(y_color, mask_10bit),
+    make_ushort2(conv_8to16(u_color, mask_10bit), conv_8to16(v_color, 
mask_10bit)))
+
+DEFINE_SETCOLOR_NV(p016le, ushort, ushort2, conv_8to16(y_color, mask_16bit),
+    make_ushort2(conv_8to16(u_color, mask_16bit), conv_8to16(v_color, 
mask_16bit)))
+
+DEFINE_SETCOLOR_YUV_PLANAR(yuv444p16le, ushort, ushort, conv_8to16(y_color, 
mask_16bit),
+    OFFSET_DST(1) = conv_8to16(u_color, mask_16bit); OFFSET_DST(2) = 
conv_8to16(v_color, mask_16bit))
+
+DEFINE_SETCOLOR_RGB(rgb0, uchar4, make_uchar4(y_color, u_color, v_color, 0))
+DEFINE_SETCOLOR_RGB(bgr0, uchar4, make_uchar4(v_color, u_color, y_color, 0))
+DEFINE_SETCOLOR_RGB(rgba, uchar4, make_uchar4(y_color, u_color, v_color, 
a_color))
+DEFINE_SETCOLOR_RGB(bgra, uchar4, make_uchar4(v_color, u_color, y_color, 
a_color))
+
+// --- COPY OPERATIONS ---
+
+template<typename T>
+using copy_function_t = T (*)(cudaTextureObject_t tex, int xo, int yo,
+                               int dst_width, int dst_height,
+                               int src_width, int src_height,
+                               int bit_depth);
+
+#define COPY_DEF_F(N, T) \
+    template<copy_function_t<in_T> copy_func_y,                                
           \
+             copy_function_t<in_T_uv> copy_func_uv>                            
           \
+    __device__ static inline void N(cudaTextureObject_t src_tex[4], T *dst[4], 
int xo, int yo, \
+                                        int dst_width, int dst_height, int 
dst_pitch,          \
+                                        int dst_x, int dst_y,                  
                \
+                                        int src_width, int src_height,         
               \
+                                        int frame_width, int frame_height)
+
+#define COPY_SUB_F(m, plane) \
+    copy_func_##m(src_tex[plane], xo, yo, \
+                  dst_width, dst_height,  \
+                  src_width, src_height,  \
+                  in_bit_depth)
+
+// Macro for YUV planar copy operations
+#define DEFINE_STACKCOPY_YUV_PLANAR(name, bit_depth, in_type, in_type_uv, 
out_type, out_type_uv, chroma_shift) \
+struct StackCopy_##name \
+{ \
+    static const int in_bit_depth = bit_depth; \
+    typedef in_type in_T; \
+    typedef in_type_uv in_T_uv; \
+    typedef out_type out_T; \
+    typedef out_type_uv out_T_uv; \
+    \
+    COPY_DEF_F(StackCopy, out_T) \
+    { \
+        COPY_BOUNDS_CHECK(); \
+        OFFSET_DST(0) = COPY_SUB_F(y, 0); \
+    } \
+    \
+    COPY_DEF_F(StackCopy_uv, out_T_uv) \
+    { \
+        COPY_BOUNDS_CHECK_UV(chroma_shift); \
+        OFFSET_DST(1) = COPY_SUB_F(uv, 1); \
+        OFFSET_DST(2) = COPY_SUB_F(uv, 2); \
+    } \
+};
+
+// Macro for NV12-style copy operations
+#define DEFINE_STACKCOPY_NV(name, bit_depth, in_type, in_type_uv, out_type, 
out_type_uv) \
+struct StackCopy_##name \
+{ \
+    static const int in_bit_depth = bit_depth; \
+    typedef in_type in_T; \
+    typedef in_type_uv in_T_uv; \
+    typedef out_type out_T; \
+    typedef out_type_uv out_T_uv; \
+    \
+    COPY_DEF_F(StackCopy, out_T) \
+    { \
+        COPY_BOUNDS_CHECK(); \
+        OFFSET_DST(0) = COPY_SUB_F(y, 0); \
+    } \
+    \
+    COPY_DEF_F(StackCopy_uv, out_T_uv) \
+    { \
+        COPY_BOUNDS_CHECK_UV(1); \
+        OFFSET_DST(1) = COPY_SUB_F(uv, 1); \
+    } \
+};
+
+// Macro for RGB copy operations
+#define DEFINE_STACKCOPY_RGB(name, bit_depth, in_type, out_type) \
+struct StackCopy_##name \
+{ \
+    static const int in_bit_depth = bit_depth; \
+    typedef in_type in_T; \
+    typedef uchar in_T_uv; \
+    typedef out_type out_T; \
+    typedef uchar out_T_uv; \
+    \
+    COPY_DEF_F(StackCopy, out_T) \
+    { \
+        COPY_BOUNDS_CHECK(); \
+        OFFSET_DST(0) = COPY_SUB_F(y, 0); \
+    } \
+    \
+    COPY_DEF_F(StackCopy_uv, out_T_uv) \
+    { \
+        /* No UV plane for RGB formats */ \
+    } \
+};
+
+// Define all StackCopy structs using macros
+DEFINE_STACKCOPY_YUV_PLANAR(yuv420p_yuv420p, 8, uchar, uchar, uchar, uchar, 1)
+DEFINE_STACKCOPY_NV(nv12_nv12, 8, uchar, uchar2, uchar, uchar2)
+DEFINE_STACKCOPY_YUV_PLANAR(yuv444p_yuv444p, 8, uchar, uchar, uchar, uchar, 0)
+DEFINE_STACKCOPY_NV(p010le_p010le, 10, ushort, ushort2, ushort, ushort2)
+DEFINE_STACKCOPY_NV(p016le_p016le, 16, ushort, ushort2, ushort, ushort2)
+DEFINE_STACKCOPY_YUV_PLANAR(yuv444p16le_yuv444p16le, 16, ushort, ushort, 
ushort, ushort, 0)
+DEFINE_STACKCOPY_RGB(rgb0_rgb0, 8, uchar4, uchar4)
+DEFINE_STACKCOPY_RGB(bgr0_bgr0, 8, uchar4, uchar4)
+DEFINE_STACKCOPY_RGB(rgba_rgba, 8, uchar4, uchar4)
+DEFINE_STACKCOPY_RGB(bgra_bgra, 8, uchar4, uchar4)
+
+// --- COPY LOGIC ---
+
+template<typename T>
+__device__ static inline T StackCopyPixel(cudaTextureObject_t tex,
+                                        int xo, int yo,
+                                        int dst_width, int dst_height,
+                                        int src_width, int src_height,
+                                        int bit_depth)
+{
+    float hscale = (float)src_width / (float)dst_width;
+    float vscale = (float)src_height / (float)dst_height;
+    float xi = (xo + 0.5f) * hscale;
+    float yi = (yo + 0.5f) * vscale;
+
+    return tex2D<T>(tex, xi, yi);
+}
+
+/// --- FUNCTION EXPORTS ---
+
+#define COLOR_KERNEL_ARGS(T) \
+    T *dst_0, T *dst_1, T *dst_2, T *dst_3,             \
+    int width, int height, int dst_pitch,               \
+    int dst_x, int dst_y,                               \
+    unsigned char y_color, unsigned char u_color,       \
+    unsigned char v_color, unsigned char a_color,       \
+    int frame_width, int frame_height
+
+#define COLOR_FUNC(SetColorFunc, T) \
+    T *dst[4] = { dst_0, dst_1, dst_2, dst_3 };           \
+    int xo = blockIdx.x * blockDim.x + threadIdx.x;       \
+    int yo = blockIdx.y * blockDim.y + threadIdx.y;       \
+    if (yo >= height || xo >= width) return;              \
+    SetColorFunc(                                         \
+        dst, xo, yo,                                      \
+        width, height, dst_pitch,                         \
+        dst_x, dst_y, y_color, u_color, v_color, a_color, \
+        frame_width, frame_height);
+
+#define COPY_KERNEL_ARGS(T) \
+    cudaTextureObject_t src_tex_0, cudaTextureObject_t src_tex_1, \
+    cudaTextureObject_t src_tex_2, cudaTextureObject_t src_tex_3, \
+    T *dst_0, T *dst_1, T *dst_2, T *dst_3,                       \
+    int dst_width, int dst_height, int dst_pitch,                 \
+    int dst_x, int dst_y,                                         \
+    int src_width, int src_height,                                \
+    int frame_width, int frame_height
+
+#define COPY_FUNC(StackCopyFunc, T) \
+    cudaTextureObject_t src_tex[4] =                    \
+        { src_tex_0, src_tex_1, src_tex_2, src_tex_3 }; \
+    T *dst[4] = { dst_0, dst_1, dst_2, dst_3 };         \
+    int xo = blockIdx.x * blockDim.x + threadIdx.x;     \
+    int yo = blockIdx.y * blockDim.y + threadIdx.y;     \
+    if (yo >= dst_height || xo >= dst_width) return;   \
+    StackCopyFunc(                                      \
+        src_tex, dst, xo, yo,                           \
+        dst_width, dst_height, dst_pitch,               \
+        dst_x, dst_y,                                   \
+        src_width, src_height,                          \
+        frame_width, frame_height);
+
+extern "C" {
+
+#define COLOR_KERNEL(C, S) \
+    __global__ void SetColor_##C##S(                                  \
+        COLOR_KERNEL_ARGS(SetColor_##C::out_T##S))                    \
+    {                                                                 \
+        COLOR_FUNC(SetColor_##C::SetColor##S, SetColor_##C::out_T##S) \
+    }
+
+#define COLOR_KERNEL_RAW(C) \
+    COLOR_KERNEL(C,)        \
+    COLOR_KERNEL(C,_uv)
+
+// Define color kernels for all supported formats
+COLOR_KERNEL_RAW(yuv420p)
+COLOR_KERNEL_RAW(nv12)
+COLOR_KERNEL_RAW(yuv444p)
+COLOR_KERNEL_RAW(p010le)
+COLOR_KERNEL_RAW(p016le)
+COLOR_KERNEL_RAW(yuv444p16le)
+COLOR_KERNEL_RAW(rgb0)
+COLOR_KERNEL_RAW(bgr0)
+COLOR_KERNEL_RAW(rgba)
+COLOR_KERNEL_RAW(bgra)
+
+#define COPY_KERNEL(C, S) \
+    __global__ void StackCopy_##C##S(                      \
+        COPY_KERNEL_ARGS(StackCopy_##C::out_T##S))                 \
+    {                                                        \
+        COPY_FUNC((StackCopy_##C::StackCopy##S<                \
+                  StackCopyPixel<StackCopy_##C::in_T>,           \
+                  StackCopyPixel<StackCopy_##C::in_T_uv> >),     \
+                  StackCopy_##C::out_T##S)                    \
+    }
+
+#define COPY_KERNEL_RAW(C) \
+    COPY_KERNEL(C,)   \
+    COPY_KERNEL(C,_uv)
+
+// Define copy kernels for all supported formats
+COPY_KERNEL_RAW(yuv420p_yuv420p)
+COPY_KERNEL_RAW(nv12_nv12)
+COPY_KERNEL_RAW(yuv444p_yuv444p)
+COPY_KERNEL_RAW(p010le_p010le)
+COPY_KERNEL_RAW(p016le_p016le)
+COPY_KERNEL_RAW(yuv444p16le_yuv444p16le)
+COPY_KERNEL_RAW(rgb0_rgb0)
+COPY_KERNEL_RAW(bgr0_bgr0)
+COPY_KERNEL_RAW(rgba_rgba)
+COPY_KERNEL_RAW(bgra_bgra)
+
+}
\ No newline at end of file
-- 
2.34.1

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".

[FFmpeg-devel] [PATCH] avfilter: add CUDA stack filters (hstack_cuda, vstack_cuda, xstack_cuda)

Reply via email to