[FFmpeg-devel] [PATCH] HDR Transcode VUI Info Copy

2017-04-03 Thread Ben Chang
Hi,

This patch copies HDR VUI information from the decode context to the encode
context. Currently, the fields gated by colour_description_present_flag (e.g.
colour_primaries, transfer_characteristics, matrix_coeffs) are not copied to
the output stream when a transcode happens.

Testing performed:
ffmpeg.exe -y -hwaccel cuvid -vcodec hevc_cuvid -i input.h265 -vcodec 
hevc_nvenc output.h265
Verified that the output bitstream contains the same
colour_description_present_flag fields as the input.
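
For illustration, the copy amounts to forwarding the decoder's colour
description into the encoder context before encoding starts. A rough sketch of
the idea (not the patch itself; dec_ctx and enc_ctx are assumed to be the
opened decoder and encoder AVCodecContexts):

    /* Sketch: forward the VUI colour description from decoder to encoder. */
    enc_ctx->color_primaries = dec_ctx->color_primaries;
    enc_ctx->color_trc       = dec_ctx->color_trc;
    enc_ctx->colorspace      = dec_ctx->colorspace;
    enc_ctx->color_range     = dec_ctx->color_range;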

I am also attaching the previous discussion on this subject.

Thanks,
Ben



HDR_transcode_VUI_copy.patch
Description: HDR_transcode_VUI_copy.patch
--- Begin Message ---
On Fri, Mar 3, 2017 at 4:38 AM, Ben Chang  wrote:
>
> In short, is there any way to transfer meta data between a decode and encode 
> context in transcode scenario? If not, would it be supported in foreseeable 
> future?
>

Hi,

AVFrames do contain fields for:
* color_primaries
* color_trc
* colorspace
... and then there is the "new" side data type for the mastering
display data (AVMasteringDisplayMetadata).

So in theory, if the decoder exports that information (AVC supports at least
the first three, and HEVC the mastering-display side data as well) and the
encoder module uses those values from the AVFrame, this should be achievable.
AVFilter also works on AVFrames, which makes it possible to carry such
information through the chain, and to update it, when using the libav*
framework.

Also, the recently merged improvements to input initialization in ffmpeg.c
may help with some formats, provided ffmpeg.c is used and the components
involved support reading the information from the AVFrames.
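
In code, the idea looks roughly like this (an untested sketch; frame is the
decoded AVFrame and enc_ctx is the encoder's AVCodecContext):

    /* Sketch: propagate colour properties from a decoded AVFrame. */
    AVFrameSideData *sd;

    enc_ctx->color_primaries = frame->color_primaries;
    enc_ctx->color_trc       = frame->color_trc;
    enc_ctx->colorspace      = frame->colorspace;

    /* HEVC decoders may additionally export mastering-display side data. */
    sd = av_frame_get_side_data(frame, AV_FRAME_DATA_MASTERING_DISPLAY_METADATA);
    if (sd) {
        AVMasteringDisplayMetadata *mdm = (AVMasteringDisplayMetadata *)sd->data;
        /* ... feed mdm into the encoder's SEI-writing path ... */
    }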

Best regards,
Jan Ekström
--- End Message ---


[FFmpeg-devel] [Patch] NVENC Surface Allocation Reduction

2017-04-19 Thread Ben Chang
Hi,

This patch aims to reduce the number of input/output surfaces NVENC allocates
per session. The previous default allocated 32 surfaces (unless the user
specified a value or lookahead was involved). A large number of surfaces
consumes extra video memory (especially for higher-resolution encoding), and
the performance return saturates beyond a certain point. The patch changes
the surface-count calculation for the default, B-frames, and lookahead
scenarios respectively.

The other change involves surface selection. Previously, if a session
allocated x surfaces, only x-1 were actually used (due to a combination of
output delay and lock-toggle logic). To avoid leaving surfaces unused,
surface rotation now uses a predefined FIFO.
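
As a rough illustration of the FIFO approach (a sketch only; it borrows the
unused_surface_queue/tmp_surf names from the patch but omits error handling):

    /* Sketch: keep every allocated surface in a FIFO of pointers and pop the
     * next free one, so that all nb_surfaces surfaces actually get used. */
    int i;
    NvencSurface *tmp_surf;
    AVFifoBuffer *unused_surface_queue =
        av_fifo_alloc(ctx->nb_surfaces * sizeof(NvencSurface *));

    for (i = 0; i < ctx->nb_surfaces; i++) {      /* filled at init time */
        NvencSurface *surf = &ctx->surfaces[i];
        av_fifo_generic_write(unused_surface_queue, &surf, sizeof(surf), NULL);
    }

    /* Per frame: take a surface out; write it back once its output is read. */
    if (!av_fifo_size(unused_surface_queue))
        return AVERROR(EAGAIN);                   /* all surfaces in flight */
    av_fifo_generic_read(unused_surface_queue, &tmp_surf, sizeof(tmp_surf), NULL);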

Testing done:
- Ensured the above changes have no performance impact and do not change the
  bitstream.

Thanks,
Ben



NVENC_surface_allocation_reduction.patch
Description: NVENC_surface_allocation_reduction.patch


[FFmpeg-devel] [PATCH] NVENC Surface Allocation Reduction

2017-04-24 Thread Ben Chang
[sorry for re-sending; but still looking for review. Thanks!]


Hi,

This patch aims to reduce the number of input/output surfaces NVENC allocates
per session. The previous default allocated 32 surfaces (unless the user
specified a value or lookahead was involved). A large number of surfaces
consumes extra video memory (especially for higher-resolution encoding), and
the performance return saturates beyond a certain point. The patch changes
the surface-count calculation for the default, B-frames, and lookahead
scenarios respectively.

The other change involves surface selection. Previously, if a session
allocated x surfaces, only x-1 were actually used (due to a combination of
output delay and lock-toggle logic). To avoid leaving surfaces unused,
surface rotation now uses a predefined FIFO.

Testing done:
- Ensured the above changes have no performance impact and do not change the
  bitstream.

Thanks,
Ben




NVENC_surface_allocation_reduction.patch
Description: NVENC_surface_allocation_reduction.patch


Re: [FFmpeg-devel] [PATCH] NVENC Surface Allocation Reduction

2017-04-25 Thread Ben Chang
Hi Timo,

Thanks for the review. I am attaching an updated patch with your suggestions
applied, and answering some queries from the previous email below.


>Did you test if and how much it affects performance to reduce the default 
>delay from 32 to 4?

>This was originally done because nvenc is extremely slow if you try to 
>download the frames without some delay headroom.


I have not seen a drop in performance on Windows in various scenarios
(encode-only, CPU -> NVENC transcode, NVDEC -> NVENC transcode) across
several GPU architectures (Kepler, Maxwell, and Pascal). In fact, in some
cases performance increases (by 1-2 fps). I am using the fps number reported
by ffmpeg in most cases. Reducing the number of surfaces effectively reduces
the output delay (async_depth), which I believe is why encode/transcode time
decreases.


> What do you mean by "*2 for number of NVENCs"?


This is a hardcoded value for the number of NVENC engines present on a GPU.
Commercial GPUs usually have up to two. There is no API support yet for
querying the number of NVENC engines on a GPU. I have changed the comment to
"multiply by 2 for number of NVENCs on gpu (hardcode to 2)" for clearer
wording.
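
Purely to illustrate the shape of the calculation (this is not the patch's
exact formula; the numbers here are assumptions):

    /* Illustrative sketch only - pick a small default instead of a fixed 32. */
    if (ctx->nb_surfaces <= 0) {
        int base = 4;                     /* covers the default output delay */
        if (ctx->rc_lookahead > 0)
            base = ctx->rc_lookahead + 4; /* lookahead queues extra frames   */
        ctx->nb_surfaces = base * 2;      /* x2 for the two NVENC engines    */
    }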


>> --- a/libavcodec/nvenc_h264.c
>> +++ b/libavcodec/nvenc_h264.c
>> @@ -79,8 +79,8 @@ static const AVOption options[] = {
>>                        0, AV_OPT_TYPE_CONST, { .i64 = NV_ENC_PARAMS_RC_2_PASS_FRAMESIZE_CAP }, 0, 0, VE, "rc" },
>>      { "vbr_2pass",    "Multi-pass variable bitrate mode", 0, AV_OPT_TYPE_CONST, { .i64 = NV_ENC_PARAMS_RC_2_PASS_VBR }, 0, 0, VE, "rc" },
>>      { "rc-lookahead", "Number of frames to look ahead for rate-control",
>> -                      OFFSET(rc_lookahead), AV_OPT_TYPE_INT, { .i64 = -1 }, -1, INT_MAX, VE },
>> -    { "surfaces",     "Number of concurrent surfaces", OFFSET(nb_surfaces), AV_OPT_TYPE_INT, { .i64 = 32 }, 0, MAX_REGISTERED_FRAMES, VE },
>> +                      OFFSET(rc_lookahead), AV_OPT_TYPE_INT, { .i64 = 0 }, 0, INT_MAX, VE },

>Why the change of default here? Kinda gives up the possibility to 
>differentiate between unset and user-set to 0.


I just thought it would make more sense for these values to be 0 or greater,
since they may be used in positive-integer calculations. Currently, there are
condition checks to prevent the unspecified value (-1) from being used; but I
feel user-unspecified and user-set-to-0 are essentially the same thing for
rc_lookahead and nb_surfaces.



Other things addressed as you suggested:
- removed lockCount from the NvencSurface struct, as it is no longer referenced
- renamed the temporary surface variable to tmp_surf to avoid camel case
- renamed IO_surface_queue to unused_surface_queue
- removed pointless braces
- changed the statement !(ctx->nb_surfaces > 0) to ctx->nb_surfaces <= 0
- fixed mixed code and declarations

Thanks!
Ben



NVENC_surface_allocation_reduction_v2.patch
Description: NVENC_surface_allocation_reduction_v2.patch


[FFmpeg-devel] HEVC Video Transcode Transfer VUI, SEI Information

2017-03-02 Thread Ben Chang
Hi,

I posted a query regarding HDR transcoding on the user forum a few weeks ago
and did not get a very clear answer; re-posting here on devel to see if I can
gain more insight (sorry about the duplication).

>I was wondering if ffmpeg supports transferring VUI and SEI info between an
>original bitstream and the re-encoded bitstream (in a transcode scenario).
>I have been digging through the documentation; the closest I can find is
>using -f metadata and -map_metadata, but these don't seem to include the
>params I need.
>This is specific to HDR bitstream transcoding, where I want to maintain
>information such as colour_description_present_flag, colour_primaries,
>transfer_characteristics, etc.

In short, is there any way to transfer metadata between a decode and an
encode context in a transcode scenario? If not, will it be supported in the
foreseeable future?

Thanks,
Ben





[FFmpeg-devel] [PATCH]: Change Stack Frame Limit in Cuda Context

2018-01-24 Thread Ben Chang
Hi,

Please help review this patch, which reduces the stack frame size per GPU
thread. The default allocation of 1024 bytes per thread is excessive and can
be reduced to 128 bytes based on NVIDIA CUDA kernel compilation statistics.
This should help reduce video memory usage per CUDA context.
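
The core of the change is a single driver-API call made while the context is
current; roughly (a sketch, minus the dynamic-loader wrappers the actual code
uses):

    /* Sketch: shrink the per-thread stack allocation of the current CUDA
     * context from the 1024-byte default down to 128 bytes. */
    CUresult res = cuCtxSetLimit(CU_LIMIT_STACK_SIZE, 128);
    if (res != CUDA_SUCCESS)
        av_log(avctx, AV_LOG_ERROR, "Failed to set CUDA stack limit\n");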

Thanks,
Ben



0001-Reduce-cuda-context-s-stack-frame-size-limit-through.patch
Description: 0001-Reduce-cuda-context-s-stack-frame-size-limit-through.patch


Re: [FFmpeg-devel] [PATCH]: Change Stack Frame Limit in Cuda Context

2018-01-24 Thread Ben Chang
Thanks for the review, Carl.

> This looks as if your commit message spans several lines, should be one line 
> followed by an empty line and as many more lines as you need.
Fixed. Reattaching.

> Is there a reason why the error messages are different?
I am following the existing convention for CUDA error messages in each file
(hwcontext_cuda.c and nvenc.c).

> Please remove this or use another email address.
Is this absolutely necessary? I am unable to disable the disclaimer through
Outlook; it seems to be tied to NVIDIA's IT infrastructure.

Thanks,
Ben



0001-Reduce-cuda-context-s-stack-frame-size-limit-through.patch
Description: 0001-Reduce-cuda-context-s-stack-frame-size-limit-through.patch


Re: [FFmpeg-devel] [PATCH]: Change Stack Frame Limit in Cuda Context

2018-01-25 Thread Ben Chang
>Just use another provider like gmail.

Done.


Patch-wise, is it approved?

Thanks,
Ben


Re: [FFmpeg-devel] [PATCH]: Change Stack Frame Limit in Cuda Context

2018-01-26 Thread Ben Chang
Thanks for the review, Mark.

On Thu, Jan 25, 2018 at 4:13 PM, Mark Thompson  wrote:
>
> > diff --git a/libavcodec/nvenc.c b/libavcodec/nvenc.c
> > index 4a91d99..2da251b 100644
> > --- a/libavcodec/nvenc.c
> > +++ b/libavcodec/nvenc.c
> > @@ -420,6 +420,12 @@ static av_cold int nvenc_check_device(AVCodecContext *avctx, int idx)
> >          goto fail;
> >      }
> >
> > +    cu_res = dl_fn->cuda_dl->cuCtxSetLimit(CU_LIMIT_STACK_SIZE, 128);
> > +    if (cu_res != CUDA_SUCCESS) {
> > +        av_log(avctx, AV_LOG_FATAL, "Failed reducing CUDA context stack limit for NVENC: 0x%x\n", (int)cu_res);
> > +        goto fail;
> > +    }
> > +
> >      ctx->cu_context = ctx->cu_context_internal;
> >
> >      if ((ret = nvenc_pop_context(avctx)) < 0)
>
> Does this actually have any effect?  I was under the impression that the
> CUDA context created inside the NVENC encoder wouldn't actually be used for
> any CUDA operations at all (really just a GPU device handle).
>
There are some CUDA kernels in the driver that may be invoked depending on
the NVENC operations specified on the command line. My observation from
looking at the nvcc statistics is that the stack frame size for most of these
CUDA kernels is 0 (the highest observed was 120 bytes).
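
For reference, a per-kernel bound can also be checked at run time via the
driver API; a minimal sketch (kernels.cubin and my_kernel are hypothetical
names, and error checking is omitted):

    #include <cuda.h>
    #include <stdio.h>

    int main(void)
    {
        CUdevice dev;
        CUcontext cu_ctx;
        CUmodule mod;
        CUfunction func;
        int local_bytes = 0;

        cuInit(0);
        cuDeviceGet(&dev, 0);
        cuCtxCreate(&cu_ctx, 0, dev);
        cuModuleLoad(&mod, "kernels.cubin");          /* hypothetical module */
        cuModuleGetFunction(&func, mod, "my_kernel"); /* hypothetical kernel */

        /* Local memory per thread includes the stack frame and any spills. */
        cuFuncGetAttribute(&local_bytes, CU_FUNC_ATTRIBUTE_LOCAL_SIZE_BYTES, func);
        printf("local memory per thread: %d bytes\n", local_bytes);

        cuCtxDestroy(cu_ctx);
        return 0;
    }

At compile time, the corresponding numbers are what nvcc reports when built
with -Xptxas -v.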

>
> > diff --git a/libavutil/hwcontext_cuda.c b/libavutil/hwcontext_cuda.c
> > index 37827a7..1f022fa 100644
> > --- a/libavutil/hwcontext_cuda.c
> > +++ b/libavutil/hwcontext_cuda.c
> > @@ -386,6 +386,12 @@ static int cuda_device_create(AVHWDeviceContext *ctx, const char *device,
> >          goto error;
> >      }
> >
> > +    err = cu->cuCtxSetLimit(CU_LIMIT_STACK_SIZE, 128);
> > +    if (err != CUDA_SUCCESS) {
> > +        av_log(ctx, AV_LOG_ERROR, "Error reducing CUDA context stack limit\n");
> > +        goto error;
> > +    }
> > +
> >      cu->cuCtxPopCurrent(&dummy);
> >
> >      hwctx->internal->is_allocated = 1;
> > --
> > 2.9.1
> >
>
> This is technically a user-visible change, since it will apply to all user
> programs run on the CUDA context created here as well as those inside
> ffmpeg.  I'm not sure how many people actually use that, though, so maybe
> it won't affect anyone.
>
In ffmpeg, I see vf_thumbnail_cuda and vf_scale_cuda available (I am not sure
whether there are more, but these two should not be affected by this
reduction). Users can always raise the stack limit if their own custom
kernels require a larger stack frame.
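
For example, a downstream program can query and raise the limit on the
context it shares with ffmpeg (a sketch; assumes the CUDA driver API and a
current context):

    /* Sketch: a user program restoring a larger stack for its own kernels. */
    size_t cur = 0;
    cuCtxGetLimit(&cur, CU_LIMIT_STACK_SIZE);   /* e.g. 128 after this patch */
    if (cur < 1024)
        cuCtxSetLimit(CU_LIMIT_STACK_SIZE, 1024);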

>
> If the stack limit is violated, what happens?  Will that be undefined
> behaviour with random effects (crash / incorrect results), or is it likely
> to be caught at program compile/load-time?
>
The stack will likely overflow and the kernel will terminate (though I have
yet to encounter this).

Thanks,
Ben


Re: [FFmpeg-devel] [PATCH]: Change Stack Frame Limit in Cuda Context

2018-01-26 Thread Ben Chang
On Fri, Jan 26, 2018 at 3:32 AM, Mark Thompson  wrote:

> On 26/01/18 09:06, Ben Chang wrote:
> > Thanks for the review, Mark.
> >
> > There are some CUDA kernels in the driver that may be invoked depending on
> > the NVENC operations specified on the command line. My observation from
> > looking at the nvcc statistics is that the stack frame size for most of
> > these CUDA kernels is 0 (the highest observed was 120 bytes).
>
> Right, that makes sense.  If Nvidia is happy that this will always work in
> drivers compatible with this API version (including any future ones) then
> sure.
>
I am not saying this should be the "permanent" value for the stack frame size
per GPU thread. However, at this moment (looking at the existing CUDA kernels
that devs have control over), I do not see this reduction as an issue.

>
> >> This is technically a user-visible change, since it will apply to all user
> >> programs run on the CUDA context created here as well as those inside
> >> ffmpeg.  I'm not sure how many people actually use that, though, so maybe
> >> it won't affect anyone.
> >>
> > In ffmpeg, I see vf_thumbnail_cuda and vf_scale_cuda available (I am not
> > sure whether there are more, but these two should not be affected by this
> > reduction). Users can always raise the stack limit if their own custom
> > kernels require a larger stack frame.
>
> I don't mean filters inside ffmpeg, I mean a user program which probably
> uses NVDEC and/or NVENC (and possibly other things) from libavcodec but
> then does its own CUDA processing with the same context.  This is silently
> changing the setup underneath it, and 128 feels like a very small number.
>
Yes, this is really a trade-off between reducing memory usage (there are
numerous complaints of high memory usage preventing people from running more
ffmpeg instances) and user convenience (custom CUDA implementations may be
impacted). My thought (which may be wrong) is that users who implement their
own CUDA kernels have better knowledge of CUDA (e.g. how much stack frame
their kernel needs, or how to use the CUDA debugger to find out what issue
they may have). Kernel sizes are really implementation-dependent (e.g.
allocating arrays on the stack or the heap, recursion, how many registers
spill, etc.), so stack frame sizes may vary widely. The default, 1024 bytes,
may not always be enough, and the user would then need to adjust the stack
limit accordingly anyway.

>
> >> If the stack limit is violated, what happens?  Will that be undefined
> >> behaviour with random effects (crash / incorrect results), or is it likely
> >> to be caught at program compile/load-time?
> >
> > The stack will likely overflow and the kernel will terminate (though I
> > have yet to encounter this).
>
> As long as the user gets a clear message that a stack overflow has
> occurred so that they can realise that they need to raise the value then it
> should be fine.

I believe you will see the stack overflow if attached to the CUDA debugger,
but the default error may just be a kernel launch error/failure. This goes
back to my opinion that a CUDA developer should be able to figure this out
relatively easily if they want to customize the CUDA part of their program.

Copying Timo's comment from another thread to consolidate the discussion.
>> Wouldn't it affect potential future CUDA filters, which might make more
>> use of the stack?
If NVIDIA introduces a new kernel that exceeds this limit, changes will need
to be made (but I do not think that will happen anytime soon).

Thanks,
Ben


Re: [FFmpeg-devel] [PATCH]: Change Stack Frame Limit in Cuda Context

2018-01-30 Thread Ben Chang
On Fri, Jan 26, 2018 at 3:10 PM, Mark Thompson  wrote:

> On 26/01/18 20:51, Ben Chang wrote:
> > On Fri, Jan 26, 2018 at 3:32 AM, Mark Thompson  wrote:
> >
> >> On 26/01/18 09:06, Ben Chang wrote:
> >>> Thanks for the review Mark.
> >>>
>
> To clarify, since it is less clear now with the trimmed context: my two
> comments about this change are completely independent.  (Given that, maybe
> it should be split into two parts - one for hwcontext and one for nvenc?)

Sorry for the delay in replying, Mark; I have been caught up with something else.

>
> This part is about the change to NVENC:
>
> >>> There are some CUDA kernels in the driver that may be invoked depending
> >>> on the NVENC operations specified on the command line. My observation
> >>> from looking at the nvcc statistics is that the stack frame size for
> >>> most of these CUDA kernels is 0 (the highest observed was 120 bytes).
> >>
> >> Right, that makes sense.  If Nvidia is happy that this will always work in
> >> drivers compatible with this API version (including any future ones) then
> >> sure.
> >>
> > I am not saying this should be the "permanent" value for the stack frame
> > size per GPU thread. However, at this moment (looking at the existing CUDA
> > kernels that devs have control over), I do not see this reduction as an
> > issue.
>
> I think you should be confident that the chosen value here will last well
> into the future for NVENC use.  Consider that this will end up in releases
> - if a future Nvidia driver update happens to need a larger stack then all
> previous releases and binaries will stop working for all users.
>
> This part is about the change to the hwcontext device creation:
>
> >>>> This is technically a user-visible change, since it will apply to all
> >>>> user programs run on the CUDA context created here as well as those
> >>>> inside ffmpeg.  I'm not sure how many people actually use that, though,
> >>>> so maybe it won't affect anyone.
> >>>>
> >>> In ffmpeg, I see vf_thumbnail_cuda and vf_scale_cuda available (I am not
> >>> sure whether there are more, but these two should not be affected by
> >>> this reduction). Users can always raise the stack limit if their own
> >>> custom kernels require a larger stack frame.
> >>
> >> I don't mean filters inside ffmpeg, I mean a user program which probably
> >> uses NVDEC and/or NVENC (and possibly other things) from libavcodec but
> >> then does its own CUDA processing with the same context.  This is silently
> >> changing the setup underneath it, and 128 feels like a very small number.
> >>
> > Yes, this is really a trade-off between reducing memory usage (there are
> > numerous complaints of high memory usage preventing people from running
> > more ffmpeg instances) and user convenience (custom CUDA implementations
> > may be impacted). My thought (which may be wrong) is that users who
> > implement their own CUDA kernels have better knowledge of CUDA (e.g. how
> > much stack frame their kernel needs, or how to use the CUDA debugger to
> > find out what issue they may have). Kernel sizes are really
> > implementation-dependent (e.g. allocating arrays on the stack or the heap,
> > recursion, how many registers spill, etc.), so stack frame sizes may vary
> > widely. The default, 1024 bytes, may not always be enough, and the user
> > would then need to adjust the stack limit accordingly anyway.
>
> Note that since you are changing a library the users in this context are
> all ffmpeg library users.  So, it means any program or library which uses
> ffmpeg, and transitively anyone who uses them.  The end-user need not be
> informed about CUDA at all.
>
> (What you've said also makes it sound like it can change by compiler
> version, but I guess such changes should be small.)
>
> >>>> If the stack limit is violated, what happens?  Will that be undefined
> >>>> behaviour with random effects (crash / incorrect results), or is it
> >>>> likely to be caught at program compile/load-time?
> >>>>
> >>> The stack will likely overflow and the kernel will terminate (though I
> >>> have yet to encounter this).
> >>
> >> As long as the user gets a clear message that a stack overflow has
> >> occurred so that they can realise that they need to raise the value then
> >> it should be fine.

Re: [FFmpeg-devel] [PATCH]: Change Stack Frame Limit in Cuda Context

2018-02-05 Thread Ben Chang
Hi,

Do we have a conclusion on whether this patch can be pushed in?

Thanks,
Ben

On Tue, Jan 30, 2018 at 4:25 PM, Ben Chang  wrote:

> [...]