Re: [FFmpeg-devel] [PATCH] Improved the performance of 1 decode + N filter graphs and adaptive bitrate.

Mark Thompson Tue, 26 Mar 2019 15:44:29 -0700

On 26/03/2019 22:07, Shaofei Wang wrote:
> It enabled MULTIPLE SIMPLE filter graph concurrency, which bring above about
> 4%~20% improvement in some 1:N scenarios by CPU or GPU acceleration
> 
> ...
> ---
> The patch will only effect on multiple SIMPLE filter graphs pipeline,
> Passed fate and refine the possible data race,
> AFL tested, without introducing extra crashs/hangs
> 
>  fftools/ffmpeg.c | 172 
> +++++++++++++++++++++++++++++++++++++++++++++++++------
>  fftools/ffmpeg.h |  13 +++++
>  2 files changed, 169 insertions(+), 16 deletions(-)


On further thought, I don't think this patch should continue in its present 
form.

The fundamental problem is that the bolted-on parallelism renders it pretty 
much unreviewable - the codebase here was not designed with any parallelism in 
mind, so it does not follow the usual rules expected of such code.  In 
particular, there are global variables in many places accessed without regard 
for data races, and that becomes fatal undefined behaviour once parallelism is 
added.  While in theory every one of those can be fixed (by adding more locks, 
or one big lock covering many instances together), it will be very hard to 
verify that all cases have been covered and the code will suffer significantly 
in readability/maintainability for the change.

To give an explicit example of the sort of problems you are hitting, the 
-benchmark_all option is entirely broken in the current proposed version.  It 
invokes undefined behaviour by writing to the current_time global in every 
thread.  Even if that were avoided by moving the value to a thread-local 
structure, it also gives wrong results - getrusage(RUSAGE_SELF) returns values 
for the whole process, so it won't say anything useful once there are multiple 
threads looking at the value in parallel.  Perhaps that is fixable by using 
some sort of per-thread storage and RUSAGE_THREAD (and some equivalent on 
Windows), but whatever that is certainly requires more investigation and 
probably a separate patch.

Three possible thoughts on where to go from here:

* Start by refactoring the code to make it easier to verify.  This would entail 
something like removing all global variable accesses from parallel paths, and 
then ensuring that each thread only ever writes to its own local working set 
(e.g. OutputStream structure) while marking all other structures as const.  
After such refactoring has happened, a new version of the present patch would 
then be possible to assess.

* Run exhaustively in tsan/valgrind/other race detectors and fix every problem 
it finds, then provide evidence from that to help with review.  Since these 
issues can be hidden behind specific option combinations (as the benchmark one 
is), this would need a lot of care to ensure full coverage, and some parts 
(like the benchmark one) would need rewriting anyway.  As noted above I'm not 
sure this would be very good for the code, but we could at least have some 
amount of confidence that the result was correct.

* Reconsider whether the patch is worth pursuing.  While I agree that a correct 
implementation of this would help in the specific use-cases you describe, I'm 
not sure that the set of users who gain from it is actually particularly large 
- most people wanting to maximise throughput in the way this change can help 
with already run multiple FFmpeg instances (or their own programs using the 
libraries) because the serial limitations of FFmpeg are well-known.


- Mark
_______________________________________________
ffmpeg-devel mailing list
[email protected]
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
[email protected] with subject "unsubscribe".

Re: [FFmpeg-devel] [PATCH] Improved the performance of 1 decode + N filter graphs and adaptive bitrate.

Reply via email to