On Sat, 29 Jun 2024 15:41:58 +0800 Zhao Zhili <quinkbl...@foxmail.com> wrote:
> 
> > On Jun 22, 2024, at 21:13, Niklas Haas <ffm...@haasn.xyz> wrote:
> > 
> > Hey,
> > 
> > As some of you know, I got contracted (by STF 2024) to work on improving
> > swscale, over the course of the next couple of months. I want to share my
> > current plans and gather feedback + measure sentiment.
> > 
> > ## Problem statement
> > 
> > The two issues I'd like to focus on for now are:
> > 
> > 1. Lack of support for a lot of modern formats and conversions (HDR,
> >    ICtCp, IPTc2, BT.2020-CL, XYZ, YCgCo, Dolby Vision, ...)
> > 2. Complicated context management, with cascaded contexts, threading,
> >    stateful configuration, multi-step init procedures, etc; and related
> >    bugs
> > 
> > In order to make these feasible, some amount of internal re-organization
> > of duties inside swscale is prudent.
> > 
> > ## Proposed approach
> > 
> > The first step is to create a new API, which will (tentatively) live in
> > <libswscale/avscale.h>. This API will initially start off as a near-copy
> > of the current swscale public API, but with the major difference that I
> > want it to be state-free and only access metadata in terms of AVFrame
> > properties. So there will be no independent configuration of the input
> > chroma location etc. like there is currently, and no need to re-configure
> > or re-init the context when feeding it frames with different properties.
> > The goal is for users to be able to just feed it AVFrame pairs and have
> > it internally cache expensive pre-processing steps as needed. Finally,
> > avscale_* should ultimately also support hardware frames directly, in
> > which case it will dispatch to some equivalent of scale_vulkan/vaapi/cuda
> > or possibly even libplacebo. (But I will defer this to a future
> > milestone)
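
For illustration, the usage model described above could look roughly like
this from the caller's side. This is only a sketch: AVScaleContext,
avscale_alloc_context(), avscale_frame() and avscale_free_context() are
placeholder names, not a settled API.

    #include <libavutil/error.h>
    #include <libavutil/frame.h>

    /* Sketch only -- all avscale_* names below are placeholders. */
    int convert(AVFrame *dst, const AVFrame *src)
    {
        AVScaleContext *ctx = avscale_alloc_context();
        if (!ctx)
            return AVERROR(ENOMEM);

        /* Colorspace, range, chroma location etc. are taken entirely from
         * the AVFrame properties of src and dst; nothing is configured on
         * the context up front. Frames with different properties can be fed
         * later, and the internally cached conversion path is re-derived as
         * needed. */
        int ret = avscale_frame(ctx, dst, src);

        avscale_free_context(&ctx);
        return ret;
    }
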
> > After this API is established, I want to start expanding the
> > functionality in the following manner:
> > 
> > ### Phase 1
> > 
> > For basic operation, avscale_* will just dispatch to a sequence of
> > swscale_* invocations. In the basic case, it will just directly invoke
> > swscale with minimal overhead. In more advanced cases, it might resolve
> > to a *sequence* of swscale operations, with other operations (e.g.
> > colorspace conversions a la vf_colorspace) mixed in.
> > 
> > This will allow us to gain new functionality in a minimally invasive way,
> > and will let API users start porting to the new API. This will also serve
> > as a good "selling point" for the new API, allowing us to hopefully break
> > up the legacy swscale API afterwards.
> > 
> > ### Phase 2
> > 
> > After this is working, I want to cleanly separate swscale into two
> > distinct components:
> > 
> > 1. vertical/horizontal scaling
> > 2. input/output conversions
> > 
> > Right now, these operations both live inside the main SwsContext, even
> > though they are conceptually orthogonal. Input handling is done entirely
> > by the abstract callbacks lumToYV12 etc., while output conversion is
> > currently "merged" with vertical scaling (yuv2planeX etc.).
> > 
> > I want to cleanly separate these components so they can live inside
> > independent contexts, and be considered as semantically distinct steps.
> > (In particular, there should ideally be no more "unscaled special
> > converters"; instead, this can be seen as a special case where there
> > simply is no vertical/horizontal scaling step.)
> > 
> > The idea is for the colorspace conversion layer to sit in between the
> > input/output converters and the horizontal/vertical scalers. This all
> > would be orchestrated by the avscale_* abstraction.
> > 
> > ## Implementation details
> > 
> > To avoid performance loss from separating "merged" functions into their
> > constituents, care needs to be taken such that all intermediate data, in
> > addition to all involved look-up tables, will fit comfortably inside the
> > L1 cache. The approach I propose, which is also (afaict) used by zscale,
> > is to loop over line segments, applying each operation in sequence, on a
> > small temporary buffer.
> > 
> > e.g.
> > 
> > hscale_row(pixel *dst, const pixel *src, int img_width)
> > {
> >     const int SIZE = 256; // or some other small-ish figure, possibly a
> >                           // design constant of the API so that SIMD
> >                           // implementations can be appropriately unrolled
> > 
> >     pixel tmp[SIZE];
> >     for (int i = 0; i < img_width; i += SIZE) {
> >         int pixels = FFMIN(SIZE, img_width - i);
> > 
> >         { /* inside read input callback */
> >             unpack_input(tmp, src, pixels);
> >             // the amount of separation here will depend on the performance
> >             apply_matrix3x3(tmp, yuv2rgb, pixels);
> >             apply_lut3x1d(tmp, gamma_lut, pixels);
> >             ...
> >         }
> > 
> >         hscale(dst, tmp, filter, pixels);
> > 
> >         src += pixels;
> >         dst += scale_factor(pixels);
> >     }
> > }
> > 
> > This function can then output rows into a ring buffer for use inside the
> > vertical scaler, after which the same procedure happens (in reverse) for
> > the final output pass.
> > 
> > Possibly, we also want to limit the maximum size of a row in the
> > horizontal scaler, to allow arbitrarily large input images.
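
To make the data flow sketched above a little more concrete, here is a rough
sketch of how the vertical pass could consume horizontally scaled rows from
such a ring buffer. Everything here (the VFilter type, the scale_frame()
driver, the fixed MAX_W cap) is invented for illustration and is not existing
swscale code; hscale_row() refers to the sketch quoted above.

    #define MAX_W 4096              /* arbitrary cap, for the sketch only */

    typedef float pixel;            /* placeholder element type */

    typedef struct VFilter {
        int tap_count;              /* vertical taps per output row */
        const int *src_pos;         /* first input row needed per output row */
        const float *coeff;         /* tap_count coefficients per output row */
    } VFilter;

    /* hscale_row() as sketched above: unpack + convert + horizontal scale */
    void hscale_row(pixel *dst, const pixel *src, int img_width);

    void scale_frame(pixel *dst, int dst_w, int dst_h,
                     const pixel *src, int src_w, int src_h,
                     const VFilter *vf)
    {
        enum { RING = 16 };                 /* must be >= vf->tap_count */
        static pixel ring[RING][MAX_W];     /* horizontally scaled rows */
        int rows_done = 0;                  /* input rows produced so far */

        for (int y = 0; y < dst_h; y++) {
            int first = vf->src_pos[y];
            int last  = first + vf->tap_count - 1;

            /* horizontally scale any input rows not yet resident; only the
             * most recent RING rows are ever kept around, so the working set
             * stays small no matter how tall the image is */
            for (; rows_done <= last && rows_done < src_h; rows_done++)
                hscale_row(ring[rows_done % RING],
                           src + rows_done * src_w, src_w);

            /* vertical convolution over the resident rows (edge clamping and
             * the final output conversion/packing are omitted for brevity) */
            for (int x = 0; x < dst_w; x++) {
                float acc = 0;
                for (int t = 0; t < vf->tap_count; t++)
                    acc += vf->coeff[y * vf->tap_count + t] *
                           ring[(first + t) % RING][x];
                dst[y * dst_w + x] = acc;
            }
        }
    }
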
> 
> I did a simple benchmark to compare the performance between libswscale and
> libyuv. With Apple M1 arm64, libyuv is about 10 times faster than
> libswscale for unscaled rgba to yuv420p. After the recent aarch64 NEON
> optimizations, libyuv is still 5 times faster than libswscale. The
> situation isn't much better with scaled conversion.
> 
> Sure, libswscale has more features and can be more precise than libyuv.
> Hope we can catch up on performance after your refactor.

AFAICT, libyuv does not do any dithering or advanced filtering, which swscale
is capable of. They also do processing at a pretty low bit depth, e.g.
converting from 8-bit yuv420p straight to 8-bit rgba before scaling at, you
guessed it, 8-bit resolution. (libswscale would use 15-bit here; a toy sketch
of that quantization step is at the bottom of this mail.)

That said, there's probably still something we can/should learn from this
implementation.

> > ## Comments / feedback?
> > 
> > Does the above approach seem reasonable? How do people feel about
> > introducing a new API vs. trying to hammer the existing API into the
> > shape I want it to be?
> > 
> > I've attached an example of what <avscale.h> could end up looking like.
> > If there is broad agreement on this design, I will move on to an
> > implementation.
> > 
> > <avscale.h>
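
Coming back to the dithering / bit depth point above, here is a toy sketch
(explicitly not swscale's actual implementation) of the kind of dithered
15-bit -> 8-bit quantization step that an 8-bit-only pipeline skips entirely.
On smooth gradients, plain truncation produces visible banding that the
sub-LSB dither offsets break up:

    #include <stdint.h>

    /* Quantize one row of 15-bit intermediates (0..32767) down to 8 bits,
     * adding an ordered-dither offset below the output LSB.  Toy code for
     * illustration only. */
    static void dither_row_15to8(uint8_t *dst, const uint16_t *src, int w,
                                 int y)
    {
        /* 4x4 Bayer matrix scaled to the 7 bits that get discarded */
        static const uint8_t bayer[4][4] = {
            {   0,  64,  16,  80 },
            {  96,  32, 112,  48 },
            {  24,  88,   8,  72 },
            { 120,  56, 104,  40 },
        };

        for (int x = 0; x < w; x++) {
            unsigned v = src[x] + bayer[y & 3][x & 3]; /* add sub-LSB noise */
            if (v > 0x7FFF)
                v = 0x7FFF;                            /* stay in 15-bit range */
            dst[x] = v >> 7;                           /* 15 -> 8 bits */
        }
    }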