Re: [FFmpeg-devel] [PATCH V2] avutil/tx: add check against (*ctx)
>Ruiling Song (12019-05-16):
>> ctx is a pointer to pointer here.
>>
>> Signed-off-by: Ruiling Song
>> ---
>> libavutil/tx.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/libavutil/tx.c b/libavutil/tx.c
>> index 934ef27c81..1690604040 100644
>> --- a/libavutil/tx.c
>> +++ b/libavutil/tx.c
>> @@ -697,7 +697,7 @@ static int gen_mdct_exptab(AVTXContext *s, int len4, double scale)
>>
>> av_cold void av_tx_uninit(AVTXContext **ctx)
>> {
>
>> -if (!ctx)
>> +if (!ctx || !(*ctx))
>
>That would protect somebody stupid enough to call av_tx_uninit(NULL)
>instead of av_tx_uninit(&var). A hard crash is completely warranted in
>this case. An assert would be acceptable.

Actually that is what the original code does. What you appear to want is

if (!*ctx)

which protects against multi-free and is useful in that it can be called unconditionally in cleanup code (assuming initial null assignments) and crashes in what you describe as the "stupid" case.

>> return;
>>
>> av_free((*ctx)->pfatab);
>
>Regards,

Regards

John Cox
___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".
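To make the distinction in this reply concrete, here is a minimal C sketch. `DemoCtx`, `demo_init` and `demo_uninit` are hypothetical stand-ins, not the libavutil API; only the shape of the guard matters. The sketch uses `if (!*ctx)` so that cleanup code can call the function unconditionally, and an assert for the NULL-argument case, which the quoted review says would be acceptable.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a context type such as AVTXContext. */
typedef struct DemoCtx {
    int *table;
} DemoCtx;

/* Allocate a context; returns 0 on success, -1 on allocation failure. */
static int demo_init(DemoCtx **ctx)
{
    DemoCtx *s = calloc(1, sizeof(*s));
    if (!s)
        return -1;
    s->table = calloc(16, sizeof(*s->table));
    if (!s->table) {
        free(s);
        return -1;
    }
    *ctx = s;
    return 0;
}

/* Free a context through a pointer-to-pointer. */
static void demo_uninit(DemoCtx **ctx)
{
    assert(ctx);   /* demo_uninit(NULL) is a caller bug, so fail loudly */
    if (!*ctx)     /* never initialised, or already freed: nothing to do */
        return;
    free((*ctx)->table);
    free(*ctx);
    *ctx = NULL;   /* a repeated call now takes the early return above */
}
```

With `DemoCtx *c = NULL;`, calling `demo_uninit(&c)` before init, or twice after it, does nothing harmful, while `demo_uninit(NULL)` trips the assert in a debug build.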
[FFmpeg-devel] HEVC decoder for Raspberry Pi
Hi

I have been developing a hevc decoder for Raspberry Pi for some time now. As active development has now pretty much ceased and the code is believed stable it seems a good time to try presenting it to the group.

You can find the current code on branch test/4.1.0/rpi_main in repo https://github.com/jc-kynesim/rpi-ffmpeg.git. It is based off tag n4.1 so if you diff it against n4.1 you should get a patch.

This code has been in use by the Raspberry Pi version of Kodi for over two years now.

If you think it would be a good idea to add this to the main ffmpeg distribution then I am willing to put reasonable effort into beating it into an appropriate shape.

If not then it contains a reasonable number of ARM asm functions and other code that you might like to take/adapt for the current decoder.

You will find the config scripts I have been using and a few notes in the pi-util directory if you wish to try building it for yourself.

Just in case it isn't obvious: this will only run on a Pi. Slightly less obviously you need a Pi2 or better as the Pi0 & Pi1 don't have neon and are just too slow anyway.

Notes on the hevc_rpi decoder & associated support code
-------------------------------------------------------

There are 3 main parts to the existing code:

1) The decoder - this is all in libavcodec as rpi_hevc*.

2) A few filters to deal with Sand frames and a small patch to automatically select the sand->i420 converter when required.

3) A kludge in ffmpeg.c to display the decoded video. This could & should be converted into a proper ffmpeg display module.

Decoder
-------

The decoder is a modified version of the existing ffmpeg hevc decoder. Generally it is ~100% faster than the existing ffmpeg hevc s/w decoder. More complex bitstreams can be up to ~200% faster but particularly easy streams can cut its advantage down to ~50%. This means that a Pi3+ can display nearly all 8-bit 1080p30 streams and with some overclocking it can display most lower bitrate 10-bit 1080p30 streams - this latter case is not helped by the requirement to downsample to 8-bit before display on a Pi.

It has had co-processor offload added for inter-pred and large block residual transform. Various parts have had optimized ARM NEON assembler added and the existing ARM asm sections have been profiled and re-optimized for A53. The main C code has been substantially reworked at its lower levels in an attempt to optimize it and minimize memory bandwidth. To some extent code paths that deal with frame types that it doesn't support have been pruned.

It outputs frames in Broadcom Sand format. This is a somewhat annoying layout that doesn't fit into ffmpeg's standard frame descriptions. It has vertical stripes of 128 horizontal pixels (64 in 10 bit forms) with Y for the stripe followed by interleaved U & V, that is then followed by the Y for the next stripe, etc. The final stripe is always padded to stripe-width. This is used in an attempt to help with cache locality and cut down on the number of dram bank switches. It is annoying to use for inter-pred with conventional processing but the way the Pi QPU (which is used for inter-pred) works means that it has negligible downsides here and the improved memory performance exceeds the overhead of the increased complexity in the rest of the code.

Frames must be allocated out of GPU memory (as otherwise they can't be accessed by the co-processors). Utility functions (in rpi_zc.c) have been written to make this easier. As the frames are already in GPU memory they can be displayed by the Pi h/w without any further copying.

Known non-features
------------------

Frame allocation should probably be done in some other way in order to fit into the standard framework better.

Sand frames are currently declared as software frames; there is an argument that they should be hardware frames but they aren't really.

There must be a better way of auto-selecting the hevc_rpi decoder over the normal s/w hevc decoder, but I became confused by the existing h/w acceleration framework and what I wanted to do didn't seem to fit in neatly.

Display should be a proper device rather than a kludge in ffmpeg.c.

Regards

John Cox
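The Sand layout described in the notes above can be illustrated with a small address calculation. This is a sketch under assumptions, not the code from rpi_zc.c: it covers only the 8-bit form, it assumes the interleaved chroma starts immediately after `height` rows of luma within each stripe with U stored before V, and `SandGeom`, `stripe_rows` and the function names are all made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define SAND_STRIPE_W 128  /* bytes per stripe row for 8-bit video */

/* Hypothetical geometry description for one Sand frame. */
typedef struct SandGeom {
    unsigned height;       /* number of luma rows */
    unsigned stripe_rows;  /* rows allocated per stripe, >= height * 3 / 2 */
} SandGeom;

/* Size in bytes of one complete stripe (its luma plus its chroma). */
static size_t sand_stripe_bytes(const SandGeom *g)
{
    return (size_t)SAND_STRIPE_W * g->stripe_rows;
}

/* Byte offset of luma sample (x, y) from the start of the frame. */
static size_t sand_luma_offset(const SandGeom *g, unsigned x, unsigned y)
{
    assert(y < g->height);
    return (x / SAND_STRIPE_W) * sand_stripe_bytes(g)  /* which stripe */
         + (size_t)y * SAND_STRIPE_W                   /* row in stripe */
         + x % SAND_STRIPE_W;                          /* column in stripe */
}

/* Byte offset of the U sample at chroma position (cx, cy). */
static size_t sand_u_offset(const SandGeom *g, unsigned cx, unsigned cy)
{
    unsigned bx = cx * 2;  /* U and V alternate, so each pair uses 2 bytes */
    assert(cy < (g->height + 1) / 2);
    return (bx / SAND_STRIPE_W) * sand_stripe_bytes(g)
         + (size_t)g->height * SAND_STRIPE_W           /* skip stripe luma */
         + (size_t)cy * SAND_STRIPE_W
         + bx % SAND_STRIPE_W;
}

/* The matching V sample is assumed to be the next byte after U. */
static size_t sand_v_offset(const SandGeom *g, unsigned cx, unsigned cy)
{
    return sand_u_offset(g, cx, cy) + 1;
}
```

Moving one pixel right inside a stripe advances one byte, but crossing a stripe boundary jumps by a whole stripe, which is why the layout does not fit a planar description with a single line stride per plane.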
Re: [FFmpeg-devel] HEVC decoder for Raspberry Pi
Hi

>Hi
>
>On Tue, Nov 13, 2018 at 03:52:18PM +0000, John Cox wrote:
>> Hi
>>
>> I have been developing a hevc decoder for Raspberry Pi for some time
>> now. As active development has now pretty much ceased and the code is
>> believed stable it seems a good time to try presenting it to the group.
>>
>> You can find the current code on branch test/4.1.0/rpi_main in repo
>> https://github.com/jc-kynesim/rpi-ffmpeg.git. It is based off tag n4.1
>> so if you diff it against n4.1 you should get a patch.
>>
>> This code has been in use by the Raspberry Pi version of Kodi for over
>> two years now.
>>
>> If you think it would be a good idea to add this to the main ffmpeg
>> distribution then I am willing to put reasonable effort into beating it
>> into an appropriate shape.
>>
>> If not then it contains a reasonable number of ARM asm functions and
>> other code that you might like to take/adapt for the current decoder.
>>
>> You will find the config scripts I have been using and a few notes in
>> the pi-util directory if you wish to try building it for yourself.
>>
>> Just in case it isn't obvious: this will only run on a Pi. Slightly
>> less obviously you need a Pi2 or better as the Pi0 & Pi1 don't have neon
>> and are just too slow anyway.
>
>others may have other opinions, but i think optimized code in FFmpeg
>for Pi would be a good idea.
>How to integrate this best though i do not know. And i can't know as
>i have just quickly scrolled over the changes not really looked in detail

Well if you want help with understanding what I've done feel free to email me and I'll do my best to explain.

>But it's certainly better to have hw optimizations in main git and
>not have a separate repository that needs to be maintained separately
>for each platform ... and that the user has to find also ... and then
>3rd party apps could have even more issues here if they wanted to use
>optimized libs ...

As I said I'm happy to put in reasonable amounts of work to make this happen. If we do want to go ahead then may I suggest that the most efficient way of proceeding would be that I take advice from one experienced person who understands the current hevc code (Michael?) by email until the work is mostly done and then return to the list for final polish?

Regards

JC
Re: [FFmpeg-devel] HEVC decoder for Raspberry Pi
Hi

>On Wed, Nov 14, 2018 at 11:35:50AM +0000, John Cox wrote:
>> Hi
>>
>> >Hi
>> >
>> >On Tue, Nov 13, 2018 at 03:52:18PM +0000, John Cox wrote:
>> >> Hi
>> >>
>> >> I have been developing a hevc decoder for Raspberry Pi for some time
>> >> now. As active development has now pretty much ceased and the code is
>> >> believed stable it seems a good time to try presenting it to the group.
>> >>
>> >> You can find the current code on branch test/4.1.0/rpi_main in repo
>> >> https://github.com/jc-kynesim/rpi-ffmpeg.git. It is based off tag n4.1
>> >> so if you diff it against n4.1 you should get a patch.
>> >>
>> >> This code has been in use by the Raspberry Pi version of Kodi for over
>> >> two years now.
>> >>
>> >> If you think it would be a good idea to add this to the main ffmpeg
>> >> distribution then I am willing to put reasonable effort into beating it
>> >> into an appropriate shape.
>> >>
>> >> If not then it contains a reasonable number of ARM asm functions and
>> >> other code that you might like to take/adapt for the current decoder.
>> >>
>> >> You will find the config scripts I have been using and a few notes in
>> >> the pi-util directory if you wish to try building it for yourself.
>> >>
>> >> Just in case it isn't obvious: this will only run on a Pi. Slightly
>> >> less obviously you need a Pi2 or better as the Pi0 & Pi1 don't have neon
>> >> and are just too slow anyway.
>> >
>> >others may have other opinions, but i think optimized code in FFmpeg
>> >for Pi would be a good idea.
>> >How to integrate this best though i do not know. And i can't know as
>> >i have just quickly scrolled over the changes not really looked in detail
>>
>> Well if you want help with understanding what I've done feel free to
>> email me and I'll do my best to explain.
>>
>> >But it's certainly better to have hw optimizations in main git and
>> >not have a separate repository that needs to be maintained separately
>> >for each platform ... and that the user has to find also ... and then
>> >3rd party apps could have even more issues here if they wanted to use
>> >optimized libs ...
>>
>> As I said I'm happy to put in reasonable amounts of work to make this
>> happen. If we do want to go ahead then may I suggest that the most
>> efficient way of proceeding would be that I take advice from one
>> experienced person who understands the current hevc code (Michael?) by
>> email until the work is mostly done and then return to the list for
>> final polish?
>
>well, there are multiple ways this could be integrated, and it's not
>really my decision which way to go. What's important is that before
>doing substantial work you ensure that there's no one around who has
>an issue with the choice before.
>
>Now one way it could be integrated would be as a separate decoder

That is how I've currently built it and therefore probably the easiest option.

>another is inside the hevc decoder

It started life there but became a very uneasy fit with too many ifdefs.

>a 3rd is, similar to the hwaccel stuff
>and a 4th would be that the decoder could be an external lib that
>is used through hwaccel similar to other hwaccel libs

Possibly - this is where I wanted advice as my attempts to understand how that lot is meant to work simply ended in confusion or a feeling that what I wanted to do was a very bad fit with the current framework - some of the issue with that is in vps/sps/pps setup where I build somewhat different tables to the common code that is used by most other h/w decodes.

>you need to obtain the community's preference here not just my
>opinion ...
>especially comments from people actively working on hwaccel stuff
>are needed here

I welcome their comments

>But there is surely code in this change which can be integrated
>and which would not change depending on the higher level integration
>design. An example would be the asm that you already mentioned
>You could split that out into patches and submit these

I'd prefer to get the whole thing in, but if someone else wants to cherry-pick my changes then they are completely welcome.

>another thing that can be worked on may be to reduce code duplication.

Yup

Regards

JC
[FFmpeg-devel] How to do (HEVC) decoder fallback?
Hi

I have an HEVC decoder built from the standard ffmpeg hevc decoder. It has been heavily optimised for the Raspberry Pi and uses the support processors (QPU & VPU) of that chip to achieve plausible speed (on a Pi3 it can normally decode 10Mbit/sec 30fps 8-bit 4:2:0 1080p and has a decent go at 10-bit 1080p but you will need some overclock to get reliable 30fps).

It only supports 8 & 10bit, 4:2:0 HEVC with a max width of 2048, and outputs frames in a somewhat odd Broadcom format (sand) which doesn't fit any of the existing FFmpeg models as it is arranged in 128 byte wide vertical stripes rather than any sort of planar format. I also have a few functions that deal with sand conversion to raw 420 for conformance testing.

What I want to do is to add this in such a way that ffmpeg will use it if the incoming stream is one it can deal with but will fall back to the standard hevc decoder if it can't. I've looked at the h/w accel route, but at first sight (I'll admit to becoming quite confused here) that appears to (a) want the hwaccel to produce the same format frames as the base decoder would (which it doesn't) and (b) to use the same vps/sps/pps processing as the base decoder (and I've modified that a bit).

What I would really like is for there to be some sort of fallback route for software decoders that share the same AVCodecID s.t. if one fails init then the next one is tried but that doesn't seem to be possible with the current setup. Am I missing something?

As it stands the code is built into the main hevc decoder code with a lot of ifdefs & if (rpi_enable), but I think it would be better off in its own decoder.

If you want to look at the current state of the art then you can find it in https://github.com/jc-kynesim/rpi-ffmpeg.git on branch test/wpp_1 - I do have a separated decoder version but I'd like to find out how I should integrate it before I commit it.

Many thanks

John Cox
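The fallback behaviour asked for above can be modelled in a few lines of plain C. None of this is FFmpeg API: `DemoDecoder`, `StreamInfo`, `demo_decoders` and `open_with_fallback` are hypothetical, and the model assumes the stream parameters are already known when init is attempted, which a real decoder may only learn after parsing the parameter sets.

```c
#include <assert.h>
#include <stddef.h>

enum { DEMO_CODEC_HEVC = 1 };  /* hypothetical codec identifier */

/* Hypothetical summary of the stream properties that matter here. */
typedef struct StreamInfo {
    int width;
    int bit_depth;
    int chroma_420;
} StreamInfo;

/* Hypothetical decoder descriptor: several may share one codec_id. */
typedef struct DemoDecoder {
    const char *name;
    int codec_id;
    int (*init)(const StreamInfo *si);  /* 0 = accepted, negative = refused */
} DemoDecoder;

/* Restricted decoder: 8/10-bit 4:2:0 only, width up to 2048. */
static int rpi_init(const StreamInfo *si)
{
    if (si->width > 2048 || !si->chroma_420)
        return -1;
    if (si->bit_depth != 8 && si->bit_depth != 10)
        return -1;
    return 0;
}

/* General decoder: accepts everything in this toy model. */
static int generic_init(const StreamInfo *si)
{
    (void)si;
    return 0;
}

/* Listed in priority order: the specialised decoder is tried first. */
static const DemoDecoder demo_decoders[] = {
    { "hevc_rpi", DEMO_CODEC_HEVC, rpi_init },
    { "hevc",     DEMO_CODEC_HEVC, generic_init },
};

/* Return the first decoder for codec_id whose init succeeds, or NULL. */
static const DemoDecoder *open_with_fallback(const DemoDecoder *list, size_t n,
                                             int codec_id, const StreamInfo *si)
{
    assert(list && si);
    for (size_t i = 0; i < n; i++) {
        if (list[i].codec_id != codec_id)
            continue;
        if (list[i].init(si) == 0)
            return &list[i];
    }
    return NULL;
}
```

In this model a 1920-wide 8-bit 4:2:0 stream selects the first entry, while a 3840-wide stream is refused by `rpi_init` and falls through to the general decoder.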
[FFmpeg-devel] [RFC] swscale RGB24->YUV420P
Hi

The Pi has a use for a fast RGB24->YUV420P path for encoding camera video. There is an existing BGR24 converter but if I build a RGB24 converter using the same logic (rearrange the conversion matrix and use the same code) I get a fate fail on filter-fps-cfr (and possibly others) which appears to decode a file to RGB24, convert to YUV420P and take the CRC of that so almost any change to the conversion breaks this (unrelated?) test.

My initial assumption was that if the conversion code in libswscale/rgb2rgb_template:bgr24toyv12_c was good enough for BGR24->YUV then it should be good enough for RGB24->YUV too. However it breaks this fate case - what is an acceptable way to resolve this?

A further question assuming that the above can be resolved - I have also written aarch64 asm for this (RGB24/BGR24->YUV420P). It costs nothing in the asm to round the output values to nearest rather than just rounding down as the C template does and doing so improves the accuracy reported by tests/swscale - however at that point the asm and the C don't produce identical results. I assume that the x86 asm for BGR24 conversion does match the C. What is the best thing to do here?

I've tested by hand with libswscale/test/swscale but fate integration would be obviously better - I'm currently a bit lost in fate, where/how should I do this?

Many thanks

John Cox
Re: [FFmpeg-devel] [RFC] swscale RGB24->YUV420P
On Wed, 16 Aug 2023 19:37:02 +0200, you wrote:

>On Wed, Aug 16, 2023 at 05:15:23PM +0100, John Cox wrote:
>> Hi
>>
>> The Pi has a use for a fast RGB24->YUV420P path for encoding camera
>> video. There is an existing BGR24 converter but if I build a RGB24
>> converter using the same logic (rearrange the conversion matrix and use
>> the same code) I get a fate fail on filter-fps-cfr (and possibly others)
>> which appears to decode a file to RGB24, convert to YUV420P and take the
>> CRC of that so almost any change to the conversion breaks this
>> (unrelated?) test.
>>
>> My initial assumption was that if the code to conversion in
>> libswscale/rgb2rgb_template:bgr24toyv12_c was good enough for BGR24->YUV
>> then it should be good enough for RGB24->YUV too. However it breaks this
>> fate case - what is an acceptable way to resolve this?
>
>update the checksum (if needed), and put the code under appropriate bitexact
>flags checks
>(there may be remaining issues but hard to say without seeing and being
>able to test the code)

Thanks for the prompt answer.

The current test invocation goes:

/home/jc/work/rpi/ffmpeg2/out/x86/ffmpeg -nostdin -nostats -noauto_conversion_filters -cpuflags all -auto_conversion_filters -hwaccel none -threads 1 -thread_type frame+slice -i /home/jc/rpi/conform/fate-suite/qtrle/apple-animation-variable-fps-bug.mov -r 30 -vsync cfr -pix_fmt yuv420p -bitexact -f framecrc -

Which appears, at first sight, to already have the required bitexact flag in it, however it doesn't get passed to the swscale context - in order for that to happen I need something like:

/home/jc/work/rpi/ffmpeg2/out/x86/ffmpeg -fflags bitexact -nostdin -nostats -noauto_conversion_filters -cpuflags all -auto_conversion_filters -hwaccel none -threads 1 -thread_type frame+slice -i /home/jc/rpi/conform/fate-suite/qtrle/apple-animation-variable-fps-bug.mov -r 30 -vsync cfr -vf scale=sws_flags=bitexact -pix_fmt yuv420p -bitexact -f framecrc -

i.e. adding an explicit "-vf scale=sws_flags=bitexact". Is this the correct answer or is it a bug that the auto conversion fails to respect the existing bitexact flag?

>> A further question assuming that the above can be resolved - I have also
>> written aarch64 asm for this (RGB24/BGR24->YUV420P). It costs nothing in
>> the asm to round the output values to nearest rather than just rounding
>> down as the C template does and doing so improves the accuracy reported
>> by tests/swscale - however at that point the asm and the C don't produce
>> identical results. I assume that the x86 asm for BGR24 conversion does
>> match the C. What is the best thing to do here?
>
>The more differences there are between implementations the more annoying
>it is but there is a bitexact flag that allows differences

Thanks

John Cox

>thx
>
>[...]
[FFmpeg-devel] [PATCH v1 0/6] swscale: Add dedicated RGB->YUV unscaled functions & aarch64 asm
This patch set expands the set of dedicated RGB->YUV unscaled functions to help with encoding camera output on a Pi. Obviously there are other uses but that was the motivation.

It enforces the general bitexact path for the fate tests that depend on it.

It renames the existing bgr function as bgr... so we don't end up with the counterintuitive situation where BGR is handled by rgb... and BGR would be handled by rgb..

Adds RGB functions

Improves the rounding in the dedicated function as that improves its score when tested with test/swscale and fixes it to allow any width (contrary to the comment any height was already allowed).

Adds XRGB->YUV functions to complete the set

Adds Aarch64 neon for BGR24 & RGB24

I haven't built fate tests for this as I'm not quite sure what the appropriate tests would be. The x86 asm doesn't match either the C template with improved rounding or the previous template (I'm not quite sure what it does but it produces a different score out of tests/swscale to either method) so a simple results match isn't going to work.

Regards

John Cox

John Cox (6):
  fate-filter-fps: Set swscale bitexact for tests that do conversions
  swscale: Rename BGR24->YUV conversion functions as bgr...
  swscale: Add explicit rgb24->yv12 conversion
  swscale: RGB24->YUV allow odd widths & improve C rounding
  swscale: Add unscaled XRGB->YUV420P functions
  swscale: Add aarch64 functions for RGB24->YUV420P

 libswscale/aarch64/rgb2rgb.c      |   8 +
 libswscale/aarch64/rgb2rgb_neon.S | 356 ++
 libswscale/bayer_template.c       |   2 +-
 libswscale/rgb2rgb.c              |  25 +++
 libswscale/rgb2rgb.h              |  23 ++
 libswscale/rgb2rgb_template.c     | 174 +--
 libswscale/swscale_unscaled.c     | 114 +-
 libswscale/x86/rgb2rgb_template.c |  13 +-
 tests/fate/filter-video.mak       |   4 +-
 9 files changed, 694 insertions(+), 25 deletions(-)

-- 
2.39.2
[FFmpeg-devel] [PATCH v1 1/6] fate-filter-fps: Set swscale bitexact for tests that do conversions
-bitexact as a general flag doesn't affect swscale so add swscale option too to get correct CRCs in all circumstances. Signed-off-by: John Cox --- tests/fate/filter-video.mak | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/tests/fate/filter-video.mak b/tests/fate/filter-video.mak index 789ec6414c..811a96d124 100644 --- a/tests/fate/filter-video.mak +++ b/tests/fate/filter-video.mak @@ -391,8 +391,8 @@ fate-filter-fps-start-drop: CMD = framecrc -lavfi testsrc2=r=7:d=3.5,fps=3:start fate-filter-fps-start-fill: CMD = framecrc -lavfi testsrc2=r=7:d=1.5,setpts=PTS+14,fps=3:start_time=1.5 FATE_FILTER_SAMPLES-$(call FILTERDEMDEC, FPS SCALE, MOV, QTRLE) += fate-filter-fps-cfr fate-filter-fps -fate-filter-fps-cfr: CMD = framecrc -auto_conversion_filters -i $(TARGET_SAMPLES)/qtrle/apple-animation-variable-fps-bug.mov -r 30 -vsync cfr -pix_fmt yuv420p -fate-filter-fps: CMD = framecrc -auto_conversion_filters -i $(TARGET_SAMPLES)/qtrle/apple-animation-variable-fps-bug.mov -vf fps=30 -pix_fmt yuv420p +fate-filter-fps-cfr: CMD = framecrc -auto_conversion_filters -i $(TARGET_SAMPLES)/qtrle/apple-animation-variable-fps-bug.mov -r 30 -vsync cfr -vf scale=sws_flags=bitexact -pix_fmt yuv420p +fate-filter-fps: CMD = framecrc -auto_conversion_filters -i $(TARGET_SAMPLES)/qtrle/apple-animation-variable-fps-bug.mov -vf fps=30,scale=sws_flags=bitexact -pix_fmt yuv420p FATE_FILTER_ALPHAEXTRACT_ALPHAMERGE := $(addprefix fate-filter-alphaextract_alphamerge_, rgb yuv) FATE_FILTER_VSYNTH_PGMYUV-$(call ALLYES, SCALE_FILTER FORMAT_FILTER SPLIT_FILTER ALPHAEXTRACT_FILTER ALPHAMERGE_FILTER) += $(FATE_FILTER_ALPHAEXTRACT_ALPHAMERGE) -- 2.39.2 ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org https://ffmpeg.org/mailman/listinfo/ffmpeg-devel To unsubscribe, visit link above, or email ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".
[FFmpeg-devel] [PATCH v1 2/6] swscale: Rename BGR24->YUV conversion functions as bgr...
Rename swscale conversion functions for converting BGR24 frames to YUV as bgr24toyuv12 rather than rgb24toyuv12 as that is just confusing and would be even more confusing with the addition of RGB24 converters. Signed-off-by: John Cox --- libswscale/bayer_template.c | 2 +- libswscale/rgb2rgb.c | 2 +- libswscale/rgb2rgb.h | 4 ++-- libswscale/rgb2rgb_template.c | 4 ++-- libswscale/swscale_unscaled.c | 2 +- libswscale/x86/rgb2rgb_template.c | 8 6 files changed, 11 insertions(+), 11 deletions(-) diff --git a/libswscale/bayer_template.c b/libswscale/bayer_template.c index 46b5a4984d..06d917c97f 100644 --- a/libswscale/bayer_template.c +++ b/libswscale/bayer_template.c @@ -188,7 +188,7 @@ * invoke ff_rgb24toyv12 for 2x2 pixels */ #define rgb24toyv12_2x2(src, dstY, dstU, dstV, luma_stride, src_stride, rgb2yuv) \ -ff_rgb24toyv12(src, dstY, dstV, dstU, 2, 2, luma_stride, 0, src_stride, rgb2yuv) +ff_bgr24toyv12(src, dstY, dstV, dstU, 2, 2, luma_stride, 0, src_stride, rgb2yuv) static void BAYER_RENAME(rgb24_copy)(const uint8_t *src, int src_stride, uint8_t *dst, int dst_stride, int width) { diff --git a/libswscale/rgb2rgb.c b/libswscale/rgb2rgb.c index e98fdac8ea..8707917800 100644 --- a/libswscale/rgb2rgb.c +++ b/libswscale/rgb2rgb.c @@ -78,7 +78,7 @@ void (*yuy2toyv12)(const uint8_t *src, uint8_t *ydst, uint8_t *udst, uint8_t *vdst, int width, int height, int lumStride, int chromStride, int srcStride); -void (*ff_rgb24toyv12)(const uint8_t *src, uint8_t *ydst, +void (*ff_bgr24toyv12)(const uint8_t *src, uint8_t *ydst, uint8_t *udst, uint8_t *vdst, int width, int height, int lumStride, int chromStride, int srcStride, diff --git a/libswscale/rgb2rgb.h b/libswscale/rgb2rgb.h index f3951d523e..305b830920 100644 --- a/libswscale/rgb2rgb.h +++ b/libswscale/rgb2rgb.h @@ -76,7 +76,7 @@ void rgb15tobgr15(const uint8_t *src, uint8_t *dst, int src_size); void rgb12tobgr12(const uint8_t *src, uint8_t *dst, int src_size); voidrgb12to15(const uint8_t *src, uint8_t *dst, int src_size); 
-void ff_rgb24toyv12_c(const uint8_t *src, uint8_t *ydst, uint8_t *udst, +void ff_bgr24toyv12_c(const uint8_t *src, uint8_t *ydst, uint8_t *udst, uint8_t *vdst, int width, int height, int lumStride, int chromStride, int srcStride, int32_t *rgb2yuv); @@ -124,7 +124,7 @@ extern void (*yuv422ptouyvy)(const uint8_t *ysrc, const uint8_t *usrc, const uin * Chrominance data is only taken from every second line, others are ignored. * FIXME: Write high quality version. */ -extern void (*ff_rgb24toyv12)(const uint8_t *src, uint8_t *ydst, uint8_t *udst, uint8_t *vdst, +extern void (*ff_bgr24toyv12)(const uint8_t *src, uint8_t *ydst, uint8_t *udst, uint8_t *vdst, int width, int height, int lumStride, int chromStride, int srcStride, int32_t *rgb2yuv); diff --git a/libswscale/rgb2rgb_template.c b/libswscale/rgb2rgb_template.c index 42c69801ba..8ef4a2cf5d 100644 --- a/libswscale/rgb2rgb_template.c +++ b/libswscale/rgb2rgb_template.c @@ -646,7 +646,7 @@ static inline void uyvytoyv12_c(const uint8_t *src, uint8_t *ydst, * others are ignored in the C version. * FIXME: Write HQ version. 
*/ -void ff_rgb24toyv12_c(const uint8_t *src, uint8_t *ydst, uint8_t *udst, +void ff_bgr24toyv12_c(const uint8_t *src, uint8_t *ydst, uint8_t *udst, uint8_t *vdst, int width, int height, int lumStride, int chromStride, int srcStride, int32_t *rgb2yuv) { @@ -979,7 +979,7 @@ static av_cold void rgb2rgb_init_c(void) yuv422ptouyvy = yuv422ptouyvy_c; yuy2toyv12 = yuy2toyv12_c; planar2x = planar2x_c; -ff_rgb24toyv12 = ff_rgb24toyv12_c; +ff_bgr24toyv12 = ff_bgr24toyv12_c; interleaveBytes= interleaveBytes_c; deinterleaveBytes = deinterleaveBytes_c; vu9_to_vu12= vu9_to_vu12_c; diff --git a/libswscale/swscale_unscaled.c b/libswscale/swscale_unscaled.c index 9af2e7ecc3..32e0d7f63c 100644 --- a/libswscale/swscale_unscaled.c +++ b/libswscale/swscale_unscaled.c @@ -1641,7 +1641,7 @@ static int bgr24ToYv12Wrapper(SwsContext *c, const uint8_t *src[], int srcStride[], int srcSliceY, int srcSliceH, uint8_t *dst[], int dstStride[]) { -ff_rgb24toyv12( +ff_bgr24toyv12( src[0], dst[0] + srcSliceY * dstStride[0], dst[1] + (srcSliceY >> 1) * dstStride[1], diff --git a/libswscale/x86/rgb2rgb_template.c b/libswscale/x86/rgb2rgb_template.c index 4aba25dd51..dc2b4e205a 100644 --- a/libswscale/x86/rgb2rgb_template.c +++ b/libswscale/x86/rgb2rgb_template.c @@ -1544,7 +1544,7 @@ static inline void RENAME(uyvy
[FFmpeg-devel] [PATCH v1 3/6] swscale: Add explicit rgb24->yv12 conversion
Add a rgb24->yuv420p conversion. Uses the same code as the existing bgr24->yuv converter but permutes the conversion array to swap R & B coefficients. Signed-off-by: John Cox --- libswscale/rgb2rgb.c | 5 + libswscale/rgb2rgb.h | 7 +++ libswscale/rgb2rgb_template.c | 38 ++- libswscale/swscale_unscaled.c | 24 +- 4 files changed, 68 insertions(+), 6 deletions(-) diff --git a/libswscale/rgb2rgb.c b/libswscale/rgb2rgb.c index 8707917800..de90e5193f 100644 --- a/libswscale/rgb2rgb.c +++ b/libswscale/rgb2rgb.c @@ -83,6 +83,11 @@ void (*ff_bgr24toyv12)(const uint8_t *src, uint8_t *ydst, int width, int height, int lumStride, int chromStride, int srcStride, int32_t *rgb2yuv); +void (*ff_rgb24toyv12)(const uint8_t *src, uint8_t *ydst, + uint8_t *udst, uint8_t *vdst, + int width, int height, + int lumStride, int chromStride, int srcStride, + int32_t *rgb2yuv); void (*planar2x)(const uint8_t *src, uint8_t *dst, int width, int height, int srcStride, int dstStride); void (*interleaveBytes)(const uint8_t *src1, const uint8_t *src2, uint8_t *dst, diff --git a/libswscale/rgb2rgb.h b/libswscale/rgb2rgb.h index 305b830920..f7a76a92ba 100644 --- a/libswscale/rgb2rgb.h +++ b/libswscale/rgb2rgb.h @@ -79,6 +79,9 @@ voidrgb12to15(const uint8_t *src, uint8_t *dst, int src_size); void ff_bgr24toyv12_c(const uint8_t *src, uint8_t *ydst, uint8_t *udst, uint8_t *vdst, int width, int height, int lumStride, int chromStride, int srcStride, int32_t *rgb2yuv); +void ff_rgb24toyv12_c(const uint8_t *src, uint8_t *ydst, uint8_t *udst, + uint8_t *vdst, int width, int height, int lumStride, + int chromStride, int srcStride, int32_t *rgb2yuv); /** * Height should be a multiple of 2 and width should be a multiple of 16. 
@@ -128,6 +131,10 @@ extern void (*ff_bgr24toyv12)(const uint8_t *src, uint8_t *ydst, uint8_t *udst, int width, int height, int lumStride, int chromStride, int srcStride, int32_t *rgb2yuv); +extern void (*ff_rgb24toyv12)(const uint8_t *src, uint8_t *ydst, uint8_t *udst, uint8_t *vdst, + int width, int height, + int lumStride, int chromStride, int srcStride, + int32_t *rgb2yuv); extern void (*planar2x)(const uint8_t *src, uint8_t *dst, int width, int height, int srcStride, int dstStride); diff --git a/libswscale/rgb2rgb_template.c b/libswscale/rgb2rgb_template.c index 8ef4a2cf5d..e57bfa6545 100644 --- a/libswscale/rgb2rgb_template.c +++ b/libswscale/rgb2rgb_template.c @@ -646,13 +646,14 @@ static inline void uyvytoyv12_c(const uint8_t *src, uint8_t *ydst, * others are ignored in the C version. * FIXME: Write HQ version. */ -void ff_bgr24toyv12_c(const uint8_t *src, uint8_t *ydst, uint8_t *udst, +static void rgb24toyv12_x(const uint8_t *src, uint8_t *ydst, uint8_t *udst, uint8_t *vdst, int width, int height, int lumStride, - int chromStride, int srcStride, int32_t *rgb2yuv) + int chromStride, int srcStride, int32_t *rgb2yuv, + const uint8_t x[9]) { -int32_t ry = rgb2yuv[RY_IDX], gy = rgb2yuv[GY_IDX], by = rgb2yuv[BY_IDX]; -int32_t ru = rgb2yuv[RU_IDX], gu = rgb2yuv[GU_IDX], bu = rgb2yuv[BU_IDX]; -int32_t rv = rgb2yuv[RV_IDX], gv = rgb2yuv[GV_IDX], bv = rgb2yuv[BV_IDX]; +int32_t ry = rgb2yuv[x[0]], gy = rgb2yuv[x[1]], by = rgb2yuv[x[2]]; +int32_t ru = rgb2yuv[x[3]], gu = rgb2yuv[x[4]], bu = rgb2yuv[x[5]]; +int32_t rv = rgb2yuv[x[6]], gv = rgb2yuv[x[7]], bv = rgb2yuv[x[8]]; int y; const int chromWidth = width >> 1; @@ -707,6 +708,32 @@ void ff_bgr24toyv12_c(const uint8_t *src, uint8_t *ydst, uint8_t *udst, } } +static const uint8_t x_bgr[9] = { +RY_IDX, GY_IDX, BY_IDX, +RU_IDX, GU_IDX, BU_IDX, +RV_IDX, GV_IDX, BV_IDX, +}; + +static const uint8_t x_rgb[9] = { + BY_IDX, GY_IDX, RY_IDX, + BU_IDX, GU_IDX, RU_IDX, + BV_IDX, GV_IDX, RV_IDX, +}; + +void ff_bgr24toyv12_c(const 
uint8_t *src, uint8_t *ydst, uint8_t *udst, + uint8_t *vdst, int width, int height, int lumStride, + int chromStride, int srcStride, int32_t *rgb2yuv) +{ +rgb24toyv12_x(src, ydst, udst, vdst, width, height, lumStride, chromStride, srcStride, rgb2yuv, x_bgr); +} + +void ff_rgb24toyv12_c(const uint8_t *src, uint8_t *ydst, uint8_t *udst, + uint8_t *vdst, int width, int height, int lumStride, + int chromStride, int srcStride, int32_t *rgb2yuv) +{ +rgb24toyv12_x
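The index-permutation idea in this patch can be seen in isolation with a luma-only sketch. Everything prefixed `demo_` is hypothetical, as is `luma_x`: the coefficient values are illustrative BT.601-style numbers at an assumed 15-bit scale, not values read from libswscale. The kernel always reads bytes as if they were B, G, R, and the lookup table decides which coefficient each byte position really gets.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical coefficient indices and fixed-point scale. */
enum { DEMO_RY, DEMO_GY, DEMO_BY, DEMO_NB };
#define DEMO_SHIFT 15

/* Illustrative luma weights for R, G and B at a scale of 1 << 15. */
static const int32_t demo_coef[DEMO_NB] = { 8414, 16519, 3208 };

/* Identity mapping: byte 0 is B, byte 2 is R. */
static const uint8_t demo_x_bgr[3] = { DEMO_RY, DEMO_GY, DEMO_BY };
/* Swapped mapping: byte 0 is R, so the "b" slot gets the R weight. */
static const uint8_t demo_x_rgb[3] = { DEMO_BY, DEMO_GY, DEMO_RY };

/* Shared kernel: one pixel of luma, truncating as the older C code did. */
static unsigned luma_x(const uint8_t px[3], const int32_t *coef,
                       const uint8_t x[3])
{
    int32_t ry, gy, by, b, g, r;

    assert(px && coef && x);
    ry = coef[x[0]];
    gy = coef[x[1]];
    by = coef[x[2]];
    b  = px[0];  /* named for the BGR case; holds R when input is RGB */
    g  = px[1];
    r  = px[2];
    return (unsigned)((ry * r + gy * g + by * b) >> DEMO_SHIFT) + 16;
}
```

A pure red pixel gives the same result whether it is passed as BGR bytes with `demo_x_bgr` or as RGB bytes with `demo_x_rgb`, so the inner loop needs no second copy.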
[FFmpeg-devel] [PATCH v1 4/6] swscale: RGB24->YUV allow odd widths & improve C rounding
Allow odd widths for conversion it costs very little and simplifies setup slightly. x86 asm will fall back to the C code if width is odd. Round to nearest rather than just down. This reduces the Y error reported by tests/swscale from 3 to 1. x86 asm doesn't mirror the C so exact correspondence isn't an issue there. Signed-off-by: John Cox --- libswscale/rgb2rgb_template.c | 42 ++- libswscale/swscale_unscaled.c | 5 ++-- libswscale/x86/rgb2rgb_template.c | 5 3 files changed, 32 insertions(+), 20 deletions(-) diff --git a/libswscale/rgb2rgb_template.c b/libswscale/rgb2rgb_template.c index e57bfa6545..5503e58a29 100644 --- a/libswscale/rgb2rgb_template.c +++ b/libswscale/rgb2rgb_template.c @@ -656,6 +656,8 @@ static void rgb24toyv12_x(const uint8_t *src, uint8_t *ydst, uint8_t *udst, int32_t rv = rgb2yuv[x[6]], gv = rgb2yuv[x[7]], bv = rgb2yuv[x[8]]; int y; const int chromWidth = width >> 1; +const int32_t ky = ((16 << 1) + 1) << (RGB2YUV_SHIFT - 1); +const int32_t kc = ((128 << 1) + 1) << (RGB2YUV_SHIFT - 1); for (y = 0; y < height; y += 2) { int i; @@ -664,9 +666,9 @@ static void rgb24toyv12_x(const uint8_t *src, uint8_t *ydst, uint8_t *udst, unsigned int g = src[6 * i + 1]; unsigned int r = src[6 * i + 2]; -unsigned int Y = ((ry * r + gy * g + by * b) >> RGB2YUV_SHIFT) + 16; -unsigned int V = ((rv * r + gv * g + bv * b) >> RGB2YUV_SHIFT) + 128; -unsigned int U = ((ru * r + gu * g + bu * b) >> RGB2YUV_SHIFT) + 128; +unsigned int Y = (ry * r + gy * g + by * b + ky) >> RGB2YUV_SHIFT; +unsigned int V = (rv * r + gv * g + bv * b + kc) >> RGB2YUV_SHIFT; +unsigned int U = (ru * r + gu * g + bu * b + kc) >> RGB2YUV_SHIFT; udst[i] = U; vdst[i] = V; @@ -676,30 +678,36 @@ static void rgb24toyv12_x(const uint8_t *src, uint8_t *ydst, uint8_t *udst, g = src[6 * i + 4]; r = src[6 * i + 5]; -Y = ((ry * r + gy * g + by * b) >> RGB2YUV_SHIFT) + 16; +Y = ((ry * r + gy * g + by * b + ky) >> RGB2YUV_SHIFT); ydst[2 * i + 1] = Y; } -ydst += lumStride; -src += srcStride; - -if (y+1 == 
height) -break; - -for (i = 0; i < chromWidth; i++) { +if ((width & 1) != 0) { unsigned int b = src[6 * i + 0]; unsigned int g = src[6 * i + 1]; unsigned int r = src[6 * i + 2]; -unsigned int Y = ((ry * r + gy * g + by * b) >> RGB2YUV_SHIFT) + 16; +unsigned int Y = (ry * r + gy * g + by * b + ky) >> RGB2YUV_SHIFT; +unsigned int V = (rv * r + gv * g + bv * b + kc) >> RGB2YUV_SHIFT; +unsigned int U = (ru * r + gu * g + bu * b + kc) >> RGB2YUV_SHIFT; +udst[i] = U; +vdst[i] = V; ydst[2 * i] = Y; +} +ydst += lumStride; +src += srcStride; -b = src[6 * i + 3]; -g = src[6 * i + 4]; -r = src[6 * i + 5]; +if (y+1 == height) +break; -Y = ((ry * r + gy * g + by * b) >> RGB2YUV_SHIFT) + 16; -ydst[2 * i + 1] = Y; +for (i = 0; i < width; i++) { +unsigned int b = src[3 * i + 0]; +unsigned int g = src[3 * i + 1]; +unsigned int r = src[3 * i + 2]; + +unsigned int Y = (ry * r + gy * g + by * b + ky) >> RGB2YUV_SHIFT; + +ydst[i] = Y; } udst += chromStride; vdst += chromStride; diff --git a/libswscale/swscale_unscaled.c b/libswscale/swscale_unscaled.c index 751bdcb2e4..e10f967755 100644 --- a/libswscale/swscale_unscaled.c +++ b/libswscale/swscale_unscaled.c @@ -1994,7 +1994,6 @@ void ff_get_unscaled_swscale(SwsContext *c) const enum AVPixelFormat dstFormat = c->dstFormat; const int flags = c->flags; const int dstH = c->dstH; -const int dstW = c->dstW; int needsDither; needsDither = isAnyRGB(dstFormat) && @@ -2052,12 +2051,12 @@ void ff_get_unscaled_swscale(SwsContext *c) /* bgr24toYV12 */ if (srcFormat == AV_PIX_FMT_BGR24 && (dstFormat == AV_PIX_FMT_YUV420P || dstFormat == AV_PIX_FMT_YUVA420P) && -!(flags & (SWS_ACCURATE_RND | SWS_BITEXACT)) && !(dstW&1)) +!(flags & (SWS_ACCURATE_RND | SWS_BITEXACT))) c->convert_unscaled = bgr24ToYv12Wrapper; /* rgb24toYV12 */ if (srcFormat == AV_PIX_FMT_RGB24 && (dstFormat == AV_PIX_FMT_YUV420P || dstFormat == AV_PIX_FMT_YUVA420P) && -!(flags & (SWS_ACCURATE_RND | SWS_BITEXACT)) && !(dstW&1)) +!(flags & (
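The rounding change in this patch can be looked at in isolation. What follows is only a minimal sketch, not code from the patch: it assumes RGB2YUV_SHIFT is 15, uses made-up coefficient values, and the names y_trunc and y_round are hypothetical. It shows that ky folds the +16 luma offset and a half-LSB rounding term into one constant that is added before the shift.

```c
#include <stdint.h>

#define RGB2YUV_SHIFT 15 /* assumed value, for illustration only */

/* Old form: shift first (rounds down), then add the luma offset. */
static unsigned y_trunc(int32_t ry, int32_t gy, int32_t by,
                        int r, int g, int b)
{
    return (unsigned)(((ry * r + gy * g + by * b) >> RGB2YUV_SHIFT) + 16);
}

/* New form: ky == (16 << RGB2YUV_SHIFT) + (1 << (RGB2YUV_SHIFT - 1)),
 * i.e. the offset plus one half in fixed point, added before the shift
 * so that the result is rounded to nearest. */
static unsigned y_round(int32_t ry, int32_t gy, int32_t by,
                        int r, int g, int b)
{
    const int32_t ky = ((16 << 1) + 1) << (RGB2YUV_SHIFT - 1);
    return (unsigned)((ry * r + gy * g + by * b + ky) >> RGB2YUV_SHIFT);
}
```

With a weighted sum of exactly 1.5 in this fixed-point format, the old form gives 17 and the new form gives 18; with 1.25 both give 17.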
[FFmpeg-devel] [PATCH v1 5/6] swscale: Add unscaled XRGB->YUV420P functions
Add simple C functions for converting XRGB to YUV420P. Same logic as the RGB24 functions but dropping the A channel. Signed-off-by: John Cox --- libswscale/rgb2rgb.c | 20 +++ libswscale/rgb2rgb.h | 16 + libswscale/rgb2rgb_template.c | 106 ++ libswscale/swscale_unscaled.c | 89 4 files changed, 231 insertions(+) diff --git a/libswscale/rgb2rgb.c b/libswscale/rgb2rgb.c index de90e5193f..b976341e70 100644 --- a/libswscale/rgb2rgb.c +++ b/libswscale/rgb2rgb.c @@ -88,6 +88,26 @@ void (*ff_rgb24toyv12)(const uint8_t *src, uint8_t *ydst, int width, int height, int lumStride, int chromStride, int srcStride, int32_t *rgb2yuv); +void (*ff_rgbxtoyv12)(const uint8_t *src, uint8_t *ydst, + uint8_t *udst, uint8_t *vdst, + int width, int height, + int lumStride, int chromStride, int srcStride, + int32_t *rgb2yuv); +void (*ff_bgrxtoyv12)(const uint8_t *src, uint8_t *ydst, + uint8_t *udst, uint8_t *vdst, + int width, int height, + int lumStride, int chromStride, int srcStride, + int32_t *rgb2yuv); +void (*ff_xrgbtoyv12)(const uint8_t *src, uint8_t *ydst, + uint8_t *udst, uint8_t *vdst, + int width, int height, + int lumStride, int chromStride, int srcStride, + int32_t *rgb2yuv); +void (*ff_xbgrtoyv12)(const uint8_t *src, uint8_t *ydst, + uint8_t *udst, uint8_t *vdst, + int width, int height, + int lumStride, int chromStride, int srcStride, + int32_t *rgb2yuv); void (*planar2x)(const uint8_t *src, uint8_t *dst, int width, int height, int srcStride, int dstStride); void (*interleaveBytes)(const uint8_t *src1, const uint8_t *src2, uint8_t *dst, diff --git a/libswscale/rgb2rgb.h b/libswscale/rgb2rgb.h index f7a76a92ba..0015b1568a 100644 --- a/libswscale/rgb2rgb.h +++ b/libswscale/rgb2rgb.h @@ -135,6 +135,22 @@ extern void (*ff_rgb24toyv12)(const uint8_t *src, uint8_t *ydst, uint8_t *udst, int width, int height, int lumStride, int chromStride, int srcStride, int32_t *rgb2yuv); +extern void (*ff_rgbxtoyv12)(const uint8_t *src, uint8_t *ydst, uint8_t *udst, uint8_t *vdst, + int width, int 
height, + int lumStride, int chromStride, int srcStride, + int32_t *rgb2yuv); +extern void (*ff_bgrxtoyv12)(const uint8_t *src, uint8_t *ydst, uint8_t *udst, uint8_t *vdst, + int width, int height, + int lumStride, int chromStride, int srcStride, + int32_t *rgb2yuv); +extern void (*ff_xrgbtoyv12)(const uint8_t *src, uint8_t *ydst, uint8_t *udst, uint8_t *vdst, + int width, int height, + int lumStride, int chromStride, int srcStride, + int32_t *rgb2yuv); +extern void (*ff_xbgrtoyv12)(const uint8_t *src, uint8_t *ydst, uint8_t *udst, uint8_t *vdst, + int width, int height, + int lumStride, int chromStride, int srcStride, + int32_t *rgb2yuv); extern void (*planar2x)(const uint8_t *src, uint8_t *dst, int width, int height, int srcStride, int dstStride); diff --git a/libswscale/rgb2rgb_template.c b/libswscale/rgb2rgb_template.c index 5503e58a29..22326807c5 100644 --- a/libswscale/rgb2rgb_template.c +++ b/libswscale/rgb2rgb_template.c @@ -742,6 +742,108 @@ void ff_rgb24toyv12_c(const uint8_t *src, uint8_t *ydst, uint8_t *udst, rgb24toyv12_x(src, ydst, udst, vdst, width, height, lumStride, chromStride, srcStride, rgb2yuv, x_rgb); } +static void rgbxtoyv12_x(const uint8_t *src, uint8_t *ydst, uint8_t *udst, + uint8_t *vdst, int width, int height, int lumStride, + int chromStride, int srcStride, int32_t *rgb2yuv, + const uint8_t x[9]) +{ +int32_t ry = rgb2yuv[x[0]], gy = rgb2yuv[x[1]], by = rgb2yuv[x[2]]; +int32_t ru = rgb2yuv[x[3]], gu = rgb2yuv[x[4]], bu = rgb2yuv[x[5]]; +int32_t rv = rgb2yuv[x[6]], gv = rgb2yuv[x[7]], bv = rgb2yuv[x[8]]; +int y; +const int chromWidth = width >
[FFmpeg-devel] [PATCH v1 6/6] swscale: Add aarch64 functions for RGB24->YUV420P
Neon RGB24->YUV420P and BGR24->YUV420P functions. Works on 16 pixel blocks and can do any width or height, though for widths less than 32 or so the C is likely faster. Signed-off-by: John Cox --- libswscale/aarch64/rgb2rgb.c | 8 + libswscale/aarch64/rgb2rgb_neon.S | 356 ++ 2 files changed, 364 insertions(+) diff --git a/libswscale/aarch64/rgb2rgb.c b/libswscale/aarch64/rgb2rgb.c index a9bf6ff9e0..b2d68c1df3 100644 --- a/libswscale/aarch64/rgb2rgb.c +++ b/libswscale/aarch64/rgb2rgb.c @@ -30,6 +30,12 @@ void ff_interleave_bytes_neon(const uint8_t *src1, const uint8_t *src2, uint8_t *dest, int width, int height, int src1Stride, int src2Stride, int dstStride); +void ff_bgr24toyv12_neon(const uint8_t *src, uint8_t *ydst, uint8_t *udst, + uint8_t *vdst, int width, int height, int lumStride, + int chromStride, int srcStride, int32_t *rgb2yuv); +void ff_rgb24toyv12_neon(const uint8_t *src, uint8_t *ydst, uint8_t *udst, + uint8_t *vdst, int width, int height, int lumStride, + int chromStride, int srcStride, int32_t *rgb2yuv); av_cold void rgb2rgb_init_aarch64(void) { @@ -37,5 +43,7 @@ av_cold void rgb2rgb_init_aarch64(void) if (have_neon(cpu_flags)) { interleaveBytes = ff_interleave_bytes_neon; +ff_rgb24toyv12 = ff_rgb24toyv12_neon; +ff_bgr24toyv12 = ff_bgr24toyv12_neon; } } diff --git a/libswscale/aarch64/rgb2rgb_neon.S b/libswscale/aarch64/rgb2rgb_neon.S index d81110ec57..b15e69a3bd 100644 --- a/libswscale/aarch64/rgb2rgb_neon.S +++ b/libswscale/aarch64/rgb2rgb_neon.S @@ -77,3 +77,359 @@ function ff_interleave_bytes_neon, export=1 0: ret endfunc + +// Expand rgb2 into r0+r1/g0+g1/b0+b1 +.macro XRGB3Y r0, g0, b0, r1, g1, b1, r2, g2, b2 +uxtl\r0\().8h, \r2\().8b +uxtl\g0\().8h, \g2\().8b +uxtl\b0\().8h, \b2\().8b + +uxtl2 \r1\().8h, \r2\().16b +uxtl2 \g1\().8h, \g2\().16b +uxtl2 \b1\().8h, \b2\().16b +.endm + +// Expand rgb2 into r0+r1/g0+g1/b0+b1 +// and pick every other el to put back into rgb2 for chroma +.macro XRGB3YC r0, g0, b0, r1, g1, b1, r2, g2, b2 +XRGB3Y \r0, 
\g0, \b0, \r1, \g1, \b1, \r2, \g2, \b2 + +bic \r2\().8h, #0xff, LSL #8 +bic \g2\().8h, #0xff, LSL #8 +bic \b2\().8h, #0xff, LSL #8 +.endm + +.macro SMLAL3 d0, d1, s0, s1, s2, c0, c1, c2 +smull \d0\().4s, \s0\().4h, \c0 +smlal \d0\().4s, \s1\().4h, \c1 +smlal \d0\().4s, \s2\().4h, \c2 +smull2 \d1\().4s, \s0\().8h, \c0 +smlal2 \d1\().4s, \s1\().8h, \c1 +smlal2 \d1\().4s, \s2\().8h, \c2 +.endm + +// d0 may be s0 +// s0, s2 corrupted +.macro SHRN_Y d0, s0, s1, s2, s3, k128h +shrn\s0\().4h, \s0\().4s, #12 +shrn2 \s0\().8h, \s1\().4s, #12 +add \s0\().8h, \s0\().8h, \k128h\().8h // +128 (>> 3 = 16) +sqrshrun\d0\().8b, \s0\().8h, #3 +shrn\s2\().4h, \s2\().4s, #12 +shrn2 \s2\().8h, \s3\().4s, #12 +add \s2\().8h, \s2\().8h, \k128h\().8h +sqrshrun2 \d0\().16b, v28.8h, #3 +.endm + +.macro SHRN_C d0, s0, s1, k128b +shrn\s0\().4h, \s0\().4s, #14 +shrn2 \s0\().8h, \s1\().4s, #14 +sqrshrn \s0\().8b, \s0\().8h, #1 +add \d0\().8b, \s0\().8b, \k128b\().8b // +128 +.endm + +.macro STB2V s0, n, a +st1 {\s0\().b}[(\n+0)], [\a], #1 +st1 {\s0\().b}[(\n+1)], [\a], #1 +.endm + +.macro STB4V s0, n, a +STB2V \s0, (\n+0), \a +STB2V \s0, (\n+2), \a +.endm + + +// void ff_bgr24toyv12_neon( +// const uint8_t *src, // x0 +// uint8_t *ydst, // x1 +// uint8_t *udst, // x2 +// uint8_t *vdst, // x3 +// int width, // w4 +// int height, // w5 +// int lumStride, // w6 +// int chromStride,// w7 +// int srcStr, // [sp, #0] +// int32_t *rgb2yuv); // [sp, #8] + +function ff_bgr24toyv12_neon, export=1 +ldr x15, [sp, #8] +ld3 {v3.s, v4.s, v5.s}[0], [x15], #12 +ld3 {v3.s, v4.s, v5.s}[1], [x15], #12 +ld3 {v3.s, v4.s, v5.s}[2], [x15] +mov v6.16b, v3.16b +mov v3.16b, v5.16b +mov v5.16b, v6.16b +b
Re: [FFmpeg-devel] [PATCH v1 3/6] swscale: Add explicit rgb24->yv12 conversion
On Sun, 20 Aug 2023 19:16:14 +0200, you wrote: >On Sun, Aug 20, 2023 at 03:10:19PM +0000, John Cox wrote: >> Add a rgb24->yuv420p conversion. Uses the same code as the existing >> bgr24->yuv converter but permutes the conversion array to swap R & B >> coefficients. >> >> Signed-off-by: John Cox >> --- >> libswscale/rgb2rgb.c | 5 + >> libswscale/rgb2rgb.h | 7 +++ >> libswscale/rgb2rgb_template.c | 38 ++- >> libswscale/swscale_unscaled.c | 24 +- >> 4 files changed, 68 insertions(+), 6 deletions(-) >> >> diff --git a/libswscale/rgb2rgb.c b/libswscale/rgb2rgb.c >> index 8707917800..de90e5193f 100644 >> --- a/libswscale/rgb2rgb.c >> +++ b/libswscale/rgb2rgb.c >> @@ -83,6 +83,11 @@ void (*ff_bgr24toyv12)(const uint8_t *src, uint8_t *ydst, >> int width, int height, >> int lumStride, int chromStride, int srcStride, >> int32_t *rgb2yuv); >> +void (*ff_rgb24toyv12)(const uint8_t *src, uint8_t *ydst, >> + uint8_t *udst, uint8_t *vdst, >> + int width, int height, >> + int lumStride, int chromStride, int srcStride, >> + int32_t *rgb2yuv); >> void (*planar2x)(const uint8_t *src, uint8_t *dst, int width, int height, >> int srcStride, int dstStride); >> void (*interleaveBytes)(const uint8_t *src1, const uint8_t *src2, uint8_t >> *dst, >> diff --git a/libswscale/rgb2rgb.h b/libswscale/rgb2rgb.h >> index 305b830920..f7a76a92ba 100644 >> --- a/libswscale/rgb2rgb.h >> +++ b/libswscale/rgb2rgb.h >> @@ -79,6 +79,9 @@ voidrgb12to15(const uint8_t *src, uint8_t *dst, int >> src_size); >> void ff_bgr24toyv12_c(const uint8_t *src, uint8_t *ydst, uint8_t *udst, >>uint8_t *vdst, int width, int height, int lumStride, >>int chromStride, int srcStride, int32_t *rgb2yuv); >> +void ff_rgb24toyv12_c(const uint8_t *src, uint8_t *ydst, uint8_t *udst, >> + uint8_t *vdst, int width, int height, int lumStride, >> + int chromStride, int srcStride, int32_t *rgb2yuv); >> >> /** >> * Height should be a multiple of 2 and width should be a multiple of 16. 
>> @@ -128,6 +131,10 @@ extern void (*ff_bgr24toyv12)(const uint8_t *src, >> uint8_t *ydst, uint8_t *udst, >>int width, int height, >>int lumStride, int chromStride, int srcStride, >>int32_t *rgb2yuv); >> +extern void (*ff_rgb24toyv12)(const uint8_t *src, uint8_t *ydst, uint8_t >> *udst, uint8_t *vdst, >> + int width, int height, >> + int lumStride, int chromStride, int srcStride, >> + int32_t *rgb2yuv); >> extern void (*planar2x)(const uint8_t *src, uint8_t *dst, int width, int >> height, >> int srcStride, int dstStride); >> >> diff --git a/libswscale/rgb2rgb_template.c b/libswscale/rgb2rgb_template.c >> index 8ef4a2cf5d..e57bfa6545 100644 >> --- a/libswscale/rgb2rgb_template.c >> +++ b/libswscale/rgb2rgb_template.c > > >> @@ -646,13 +646,14 @@ static inline void uyvytoyv12_c(const uint8_t *src, >> uint8_t *ydst, >> * others are ignored in the C version. >> * FIXME: Write HQ version. >> */ >> -void ff_bgr24toyv12_c(const uint8_t *src, uint8_t *ydst, uint8_t *udst, >> +static void rgb24toyv12_x(const uint8_t *src, uint8_t *ydst, uint8_t *udst, > >this probably should be inline Could do, and I will if you deem it important, but the only bit that inline is going to help is the matrix coefficient loading and that happens once outside the main loops. >also i see now "FIXME: Write HQ version." above here. Do you really want to >add a low quality rgb24toyv12 ? >(it is vissible on the diagonal border (cyan / red )) in > ./ffmpeg -f lavfi -i testsrc=size=5632x3168 -pix_fmt yuv420p -vframes 1 > -qscale 1 -strict -1 new.jpg > > also on smaller sizes but for some reason its clearer on the big one zoomed > in 400% with gimp >(the gimp test was done with the whole patchset not after this patch) On the whole - yes - in the encode path on the Pi that I'm writing this for speed is more important than quality - the existing path is too slow to be usable. And honestly - using your example above comparing (Windows photo viewer zoomed in s.t. pixels are clearly
Re: [FFmpeg-devel] [PATCH v1 3/6] swscale: Add explicit rgb24->yv12 conversion
On Sun, 20 Aug 2023 19:45:11 +0200, you wrote: >On Sun, Aug 20, 2023 at 07:16:14PM +0200, Michael Niedermayer wrote: >> On Sun, Aug 20, 2023 at 03:10:19PM +0000, John Cox wrote: >> > Add a rgb24->yuv420p conversion. Uses the same code as the existing >> > bgr24->yuv converter but permutes the conversion array to swap R & B >> > coefficients. >> > >> > Signed-off-by: John Cox >> > --- >> > libswscale/rgb2rgb.c | 5 + >> > libswscale/rgb2rgb.h | 7 +++ >> > libswscale/rgb2rgb_template.c | 38 ++- >> > libswscale/swscale_unscaled.c | 24 +- >> > 4 files changed, 68 insertions(+), 6 deletions(-) >> > >> > diff --git a/libswscale/rgb2rgb.c b/libswscale/rgb2rgb.c >> > index 8707917800..de90e5193f 100644 >> > --- a/libswscale/rgb2rgb.c >> > +++ b/libswscale/rgb2rgb.c >> > @@ -83,6 +83,11 @@ void (*ff_bgr24toyv12)(const uint8_t *src, uint8_t >> > *ydst, >> > int width, int height, >> > int lumStride, int chromStride, int srcStride, >> > int32_t *rgb2yuv); >> > +void (*ff_rgb24toyv12)(const uint8_t *src, uint8_t *ydst, >> > + uint8_t *udst, uint8_t *vdst, >> > + int width, int height, >> > + int lumStride, int chromStride, int srcStride, >> > + int32_t *rgb2yuv); >> > void (*planar2x)(const uint8_t *src, uint8_t *dst, int width, int height, >> > int srcStride, int dstStride); >> > void (*interleaveBytes)(const uint8_t *src1, const uint8_t *src2, uint8_t >> > *dst, >> > diff --git a/libswscale/rgb2rgb.h b/libswscale/rgb2rgb.h >> > index 305b830920..f7a76a92ba 100644 >> > --- a/libswscale/rgb2rgb.h >> > +++ b/libswscale/rgb2rgb.h >> > @@ -79,6 +79,9 @@ voidrgb12to15(const uint8_t *src, uint8_t *dst, int >> > src_size); >> > void ff_bgr24toyv12_c(const uint8_t *src, uint8_t *ydst, uint8_t *udst, >> >uint8_t *vdst, int width, int height, int lumStride, >> >int chromStride, int srcStride, int32_t *rgb2yuv); >> > +void ff_rgb24toyv12_c(const uint8_t *src, uint8_t *ydst, uint8_t *udst, >> > + uint8_t *vdst, int width, int height, int lumStride, >> > + int chromStride, int srcStride, 
int32_t *rgb2yuv); >> > >> > /** >> > * Height should be a multiple of 2 and width should be a multiple of 16. >> > @@ -128,6 +131,10 @@ extern void (*ff_bgr24toyv12)(const uint8_t *src, >> > uint8_t *ydst, uint8_t *udst, >> >int width, int height, >> >int lumStride, int chromStride, int >> > srcStride, >> >int32_t *rgb2yuv); >> > +extern void (*ff_rgb24toyv12)(const uint8_t *src, uint8_t *ydst, uint8_t >> > *udst, uint8_t *vdst, >> > + int width, int height, >> > + int lumStride, int chromStride, int >> > srcStride, >> > + int32_t *rgb2yuv); >> > extern void (*planar2x)(const uint8_t *src, uint8_t *dst, int width, int >> > height, >> > int srcStride, int dstStride); >> > >> > diff --git a/libswscale/rgb2rgb_template.c b/libswscale/rgb2rgb_template.c >> > index 8ef4a2cf5d..e57bfa6545 100644 >> > --- a/libswscale/rgb2rgb_template.c >> > +++ b/libswscale/rgb2rgb_template.c >> >> >> > @@ -646,13 +646,14 @@ static inline void uyvytoyv12_c(const uint8_t *src, >> > uint8_t *ydst, >> > * others are ignored in the C version. >> > * FIXME: Write HQ version. >> > */ >> > -void ff_bgr24toyv12_c(const uint8_t *src, uint8_t *ydst, uint8_t *udst, >> > +static void rgb24toyv12_x(const uint8_t *src, uint8_t *ydst, uint8_t >> > *udst, >> >> this probably should be inline >> >> also i see now "FIXME: Write HQ version." above here. Do you really want to >> add a low quality rgb24toyv12 ? >> (it is vissible on the diagonal border (cyan / red )) in >> ./ffmpeg -f lavfi -i testsrc=size=5632x3168 -pix_fmt yuv420p -vframes 1 >> -qscale 1 -strict -1 new.jpg >> >> also on smaller sizes but for some reason its clearer on the big one z
Re: [FFmpeg-devel] [PATCH v1 3/6] swscale: Add explicit rgb24->yv12 conversion
On Mon, 21 Aug 2023 21:15:37 +0200, you wrote: >On Sun, Aug 20, 2023 at 07:28:40PM +0100, John Cox wrote: >> On Sun, 20 Aug 2023 19:45:11 +0200, you wrote: >> >> >On Sun, Aug 20, 2023 at 07:16:14PM +0200, Michael Niedermayer wrote: >> >> On Sun, Aug 20, 2023 at 03:10:19PM +, John Cox wrote: >> >> > Add a rgb24->yuv420p conversion. Uses the same code as the existing >> >> > bgr24->yuv converter but permutes the conversion array to swap R & B >> >> > coefficients. >> >> > >> >> > Signed-off-by: John Cox >> >> > --- >> >> > libswscale/rgb2rgb.c | 5 + >> >> > libswscale/rgb2rgb.h | 7 +++ >> >> > libswscale/rgb2rgb_template.c | 38 ++- >> >> > libswscale/swscale_unscaled.c | 24 +- >> >> > 4 files changed, 68 insertions(+), 6 deletions(-) >> >> > >> >> > diff --git a/libswscale/rgb2rgb.c b/libswscale/rgb2rgb.c >> >> > index 8707917800..de90e5193f 100644 >> >> > --- a/libswscale/rgb2rgb.c >> >> > +++ b/libswscale/rgb2rgb.c >> >> > @@ -83,6 +83,11 @@ void (*ff_bgr24toyv12)(const uint8_t *src, uint8_t >> >> > *ydst, >> >> > int width, int height, >> >> > int lumStride, int chromStride, int srcStride, >> >> > int32_t *rgb2yuv); >> >> > +void (*ff_rgb24toyv12)(const uint8_t *src, uint8_t *ydst, >> >> > + uint8_t *udst, uint8_t *vdst, >> >> > + int width, int height, >> >> > + int lumStride, int chromStride, int srcStride, >> >> > + int32_t *rgb2yuv); >> >> > void (*planar2x)(const uint8_t *src, uint8_t *dst, int width, int >> >> > height, >> >> > int srcStride, int dstStride); >> >> > void (*interleaveBytes)(const uint8_t *src1, const uint8_t *src2, >> >> > uint8_t *dst, >> >> > diff --git a/libswscale/rgb2rgb.h b/libswscale/rgb2rgb.h >> >> > index 305b830920..f7a76a92ba 100644 >> >> > --- a/libswscale/rgb2rgb.h >> >> > +++ b/libswscale/rgb2rgb.h >> >> > @@ -79,6 +79,9 @@ voidrgb12to15(const uint8_t *src, uint8_t *dst, >> >> > int src_size); >> >> > void ff_bgr24toyv12_c(const uint8_t *src, uint8_t *ydst, uint8_t *udst, >> >> >uint8_t *vdst, int width, int height, int >> >> > 
lumStride, >> >> >int chromStride, int srcStride, int32_t >> >> > *rgb2yuv); >> >> > +void ff_rgb24toyv12_c(const uint8_t *src, uint8_t *ydst, uint8_t *udst, >> >> > + uint8_t *vdst, int width, int height, int >> >> > lumStride, >> >> > + int chromStride, int srcStride, int32_t >> >> > *rgb2yuv); >> >> > >> >> > /** >> >> > * Height should be a multiple of 2 and width should be a multiple of >> >> > 16. >> >> > @@ -128,6 +131,10 @@ extern void (*ff_bgr24toyv12)(const uint8_t *src, >> >> > uint8_t *ydst, uint8_t *udst, >> >> >int width, int height, >> >> >int lumStride, int chromStride, int >> >> > srcStride, >> >> >int32_t *rgb2yuv); >> >> > +extern void (*ff_rgb24toyv12)(const uint8_t *src, uint8_t *ydst, >> >> > uint8_t *udst, uint8_t *vdst, >> >> > + int width, int height, >> >> > + int lumStride, int chromStride, int >> >> > srcStride, >> >> > + int32_t *rgb2yuv); >> >> > extern void (*planar2x)(const uint8_t *src, uint8_t *dst, int width, >> >> > int height, >> >> > int srcStride, int dstStride); >> >> > >> >> > diff --git a/libswscale/rgb2rgb_template.c >> >> > b/libswscale/rgb2rgb_template.c >> >> > index 8ef4a2cf5d..e57bfa6545 100644 >> >> > --- a/libswscale/rgb2rgb_template.c >> >> > +++ b/libswscale/rgb2rgb_template.c >> >> >&g
[FFmpeg-devel] Does rtspenc actually support AVFMT_GLOBALHEADER?
Hi

Does rtspenc actually support AVFMT_GLOBALHEADER? It is specified in the
FFOutputFormat flags but I can't see anywhere in the code where
extradata is referenced like it is in other output formats which support
that flag.

I ask because I have an encoder that supports the flag and when set
removes SPS/PPS from the stream and puts them in extradata instead which
I believe is the correct behavior - if it isn't then that is my problem
and I'd appreciate clarification of what is meant to occur. The
transmitted RTSP stream then doesn't contain SPS/PPS.

Removal of AVFMT_GLOBALHEADER from the flags in rtspenc.c fixes my
problem and I'll very happily submit a patch to that effect, but first
I'd like to know if that is in fact the root of my problem - my
understanding of the RTSP code is very limited and I'd appreciate advice
from someone who knows something about it.

Many thanks

John Cox
Re: [FFmpeg-devel] Does rtspenc actually support AVFMT_GLOBALHEADER?
On Mon, 19 Aug 2024 at 19:32, Martin Storsjö wrote:
>
> On Mon, 19 Aug 2024, John Cox wrote:
>
> > Does rtspenc actually support AVFMT_GLOBALHEADER? It is specified in the
> > FFOutputFormat flags but I can't see anywhere in the code where
> > extradata is referenced like it is in other output formats which support
> > that flag.
> >
> > I ask because I have an encoder that supports the flag and when set
> > removes SPS/PPS from the stream and puts them in extradata instead which
> > I believe is the correct behavior - if it isn't then that is my problem
> > and I'd appreciate clarification of what is meant to occur. The
> > transmitted RTSP stream then doesn't contain SPS/PPS.
>
> That's correct, the SPS/PPS gets transmitted in the SDP description, not
> in-band.

Many thanks for the info. I thought something like that should occur but
I couldn't find it. Now I know where I should be looking.

John Cox
Re: [FFmpeg-devel] [PATCH] libavdevice: Add KMS/DRM output device
On Mon, 18 Jan 2021 23:37:09 +, you wrote:

>On 16/01/2021 22:12, Nicolas Caramelli wrote:
>> This patch adds KMS/DRM output device for rendering a video stream
>> using KMS/DRM dumb buffer.
>> The proposed implementation is very basic, only bgr0 pixel format is
>> currently supported (the most common format with KMS/DRM).
>> To enable this output device you need to configure FFmpeg with
>> --enable-libdrm.
>> Example: ffmpeg -re -i INPUT -pix_fmt bgr0 -f kmsdumb /dev/dri/card0
>
>If you want to render things to a normal display device why not use a normal
>video player? Or even ffplay?
>
>IMO something like this would be of more value as a simple video player
>example with the documentation rather than including it as weirdly constrained
>library code which will see very little use.
>
>(Note that I would argue against adding more general display output devices
>which are already present, like fb and xv, because they are of essentially no
>value to libavdevice users. Removing legacy code is harder, though.)

I take your point but I personally have found it very useful to have
simple display devices on the output of ffmpeg for testing purposes.
Though I guess that if I want that then the device should be bundled
with the application rather than in a library.

John Cox
[FFmpeg-devel] [PATCH]levc/hevc_cabac Optimise ff_hevc_hls_residual_coding (especially ARM)
Hi

I've just done a fair bit of work on hevc_cabac decode for the Raspberry
Pi2 and I think that the patch is generally applicable. Patch is
attached but you may prefer to take it from git:

https://github.com/jc-kynesim/rpi-ffmpeg.git
branch: test/ff_hevc_cabac_3
commit: 423e160e639d301feb2b4ba220199d112def0164

On the Pi2 playing a 10Mbit 1080p H.265 clip (A bit of the Hobbit) it
reduces the time in ff_hevc_hls_residual_coding (until transform) from
~26Gcycles to ~18Gcycles and it almost halves the time spent in the
"core" bit of the function (from decoding the greater1 bits to the end
of decode). This was measured using the CPU cycle counter. Tests done
at Raspberry Pi suggest that on their ffmpeg branch it reduces overall
CPU loading by ~10% whilst playing H.265.

I haven't profiled it on any other platform - but I would expect useful
improvements on most streams on most platforms.

I have not yet run fate over it as I haven't yet finished downloading
the samples (the internet connection here isn't wildly fast), but I have
run it against the H265.1 conformance streams on both x86 and ARM and it
causes no regressions.

Known unknowns / possible issues:

1) I haven't tested it on anything with 64-bit ints (I don't have an
appropriate m/c) - whilst I've coded in a manner that should hopefully
be OK there I can see that there might be issues.

2) Only tested on gcc 4.8 and later (5.1 & 5.3). I've used an anonymous
union to avoid changing other cabac code - I could believe this was a
no-no and I'll have to change that.

3) Uses clz which doesn't seem to exist in the ffmpeg int libs (though
ctz does)

I'll happily accept suggestions as to what is considered better practice
for these points.

Regards

John Cox

0001-H.265-residual-decode-performance-improvements.patch
Description: Binary data
___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
Re: [FFmpeg-devel] [PATCH]levc/hevc_cabac Optimise ff_hevc_hls_residual_coding (especially ARM)
Hi

>On Tue, Jan 19, 2016 at 7:46 AM, John Cox wrote:
>
>> Hi
>>
>> I've just done a fair bit of work on hevc_cabac decode for the Raspberry
>> Pi2 and I think that the patch is generally applicable. Patch is
>> attached but you may prefer to take it from git:
>
>
>Cool! Two non-technical comments first, I'll try to make time to review
>in-depth/technically soon:
>
>1:
>
>> +#define UNCHECKED_BITSTREAM_READER 1
>
>I don't think that's right, and is a security issue.

I added that line as (nearly) every other decoder in libavcodec has it -
in particular h264_cabac.c has it.

Going forward: Checking bitstream position on every load is terribly
wasteful - if at all possible a better idea is to allocate more space
than is required in the input bitstream buffer so some overrun is
permissible without seg fault and only check position at the end of
every block or other medium sized unit. (You can nearly always work out
what the worst case overread can be.)

>2: your indentation of function declarations is weird. E.g.:
>
>+static inline uint32_t get_greaterx_bits(HEVCContext * const s, const
>unsigned int n_end, int * const levels,
>+int * const pprev_subset_coded, int * const psum,
>+const unsigned int idx0_gt1, const unsigned int idx_gt2)
>
>We tend to indent the second line so it aligns with the opening bracket of
>the first line.

Fair enough

>Alike, your indentation of const variable declarations:
>
>+uint8_t * const state0 = s->HEVClc->cabac_state + idx0_gt1;
>
>doesn't need a space between * and const.

If that is required style then I'll abide by it, but I think that
detracts noticeably from readability.

>Like I said, all non-technical, I'll do technical bits soon if I find time.
>If other people like it and I haven't responded yet, just commit it and we
>can address them post-push.

Thanks

JC

>Ronald
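The "allocate extra space and only check at the end of a block" idea from this message can be sketched roughly as below. This is a minimal illustration under stated assumptions and not how the FFmpeg bitstream reader is implemented: PaddedReader, PAD_BYTES and the function names are all hypothetical, and the worst-case overread per block is simply assumed to be at most 8 bytes.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define PAD_BYTES 8 /* hypothetical worst-case overread per block */

typedef struct {
    uint8_t *buf;    /* payload followed by PAD_BYTES zero bytes */
    size_t   size;   /* payload size in bytes, excluding padding */
    size_t   bitpos; /* current read position in bits */
} PaddedReader;

/* Copy the payload into a buffer with zeroed padding after it. */
static int padded_reader_init(PaddedReader *r, const uint8_t *data, size_t size)
{
    r->buf = calloc(size + PAD_BYTES, 1);
    if (!r->buf)
        return -1;
    memcpy(r->buf, data, size);
    r->size   = size;
    r->bitpos = 0;
    return 0;
}

/* Read one bit with no bounds check; it may land in the padding. */
static unsigned read_bit_unchecked(PaddedReader *r)
{
    unsigned bit = (r->buf[r->bitpos >> 3] >> (7 - (r->bitpos & 7))) & 1;
    r->bitpos++;
    return bit;
}

/* Read a block of n bits (n <= 32), then check the position once.
 * Must only be called while the previous call returned 0, so that the
 * overread stays within the padding. Returns -1 on overrun. */
static int read_block(PaddedReader *r, unsigned n, uint32_t *out)
{
    uint32_t v = 0;
    for (unsigned i = 0; i < n; i++)
        v = (v << 1) | read_bit_unchecked(r);
    *out = v;
    return r->bitpos > r->size * 8 ? -1 : 0;
}
```

The per-bit path carries no branch on the buffer end; the cost is that an overrun is only noticed after the block has been read, so the values decoded from the padding have to be discarded by the caller.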
Re: [FFmpeg-devel] [PATCH]levc/hevc_cabac Optimise ff_hevc_hls_residual_coding (especially ARM)
On Tue, 19 Jan 2016 15:59:39 + (UTC), you wrote:

>John Cox kynesim.co.uk> writes:
>
>> >> +#define UNCHECKED_BITSTREAM_READER 1
>> >
>> >I don't think that's right, and is a security issue.
>>
>> I added that line as (nearly) every other decoder in
>> libavcodec has it -
>
>Sure?

OK - not all:

h263dec.c
h264.c
h264_cabac.c
h264_cavlc.c
huffyuvdec.c
ituh263dec.c
mpeg12dec.c
mpeg12.c
mpeg4videodec.c
mpeg4video_parser.c

But that probably covers 90% of the video streams decoded with ffmpeg

>> in particular h264_cabac.c has it.
>
>Extensive testing was done before it was added.

Testing that it doesn't seg-fault no matter what the input or some other
sort of testing?

>Could you confirm how much of the speedup comes
>only from this change?

Not an awful lot - a few % of the total improvement, but I was looking
for everything I can get. I'll happily take it out of this patch if it
is controversial.

>While we definitely all welcome a noticeable speedup
>of hevc decoding (and while my opinion on your patch
>has limited relevance) I believe that the patch
>absolutely has to be split: First step would be to
>have a split between changes in the general code and
>changes to arm assembly, I believe the first patch
>then may be split further.

Happy to split out the arm asm. Splitting the rest of it will be harder
if you want it to continue working at all intermediate points.

>I am a little surprised that you wrote some asm
>functions that are slower than what the compiler
>produces: Did you analyze this?

Yeah - they aren't much, if at all, slower but unless they are actively
faster it seems silly to use difficult to maintain asm where the C will
do. In the end it came down to the asm constraining the order in which
stuff happens in the surrounding code and that wasn't always good.

Regards

JC

>Carl Eugen
Re: [FFmpeg-devel] [PATCH]levc/hevc_cabac Optimise ff_hevc_hls_residual_coding (especially ARM)
>John Cox kynesim.co.uk> writes: > >> On Tue, 19 Jan 2016 15:59:39 + (UTC), you wrote: >> >> >John Cox kynesim.co.uk> writes: >> > >> >> >> +#define UNCHECKED_BITSTREAM_READER 1 >> >> > >> >> >I don't think that's right, and is a security issue. >> >> >> >> I added that line as (nearly) every other decoder in >> >> liavcodec has it - >> > >> >Sure? >> >> OK - not all: >> >> h263dec.c >> h264.c >> h264_cabac.c >> h264_cavlc.c >> huffyuvdec.c >> ituh263dec.c >> mpegl2dec.c >> mpeg12.c >> mpeg4videodec.c >> mpeg4video_parser.c >> >> But that probably covers 90% of the video streams >> decoded with ffmpeg > >The three decoders mpegvideo, h263/asp and h264 are >not "(nearly) every other decoder", sorry! Sorry - I (obviously) misremembered the number of hits I got when I last did that search. >> >> in particular h264_cabac.c has it. >> > >> >Extensive testing was done before it was added. >> >> Testing that it doesn't seg-fault no matter what the >> input or some other sort of testing? > >Yes, tests that show that fuzzed input does not crash >the decoder are needed. > >But afaict, the change is unrelated to the rest of your >patch and should be discussed separately (imo). Yup - perfectly happy to put that can of worms to one side. >> >Could you confirm how much of the speedup comes >> >only from this change? >> >> Not an awful lot - a few % of the total improvement, but >> I was looking for everything I can get. I'll happily >> take it out of this patch if it is controversial. > >I wouldn't say controversial (I am all for it, sorry if >this wasn't clear) but I think it can be discussed after >your speedup was committed. 
Yup - at this point it is simply a distraction >> >While we definitely all welcome a noticeable speedup >> >of hevc decoding (and while my opinion on your patch >> >has limited relevance) I believe that the patch >> >absolutely has to be split: First step would be to >> >have a split between changes in the general code and >> >changes to arm assembly, I believe the first patch >> >then may be split further. >> >> Happy to split out the arm asm. > >Please do, my suggestion would be to start with the >changes to the C code. But it may be wise to wait for a >real review first. I've done enough review processes to know that waiting till the comments die down before doing anything is the way to go :-) JC >Carl Eugen
Re: [FFmpeg-devel] [PATCH]levc/hevc_cabac Optimise ff_hevc_hls_residual_coding (especially ARM)
>On 1/19/2016 9:46 AM, John Cox wrote: >> +// Helper fns >> +#ifndef hevc_mem_bits32 >> +static av_always_inline uint32_t hevc_mem_bits32(const void * buf, const >> unsigned int offset) >> +{ >> +return AV_RB32((const uint8_t *)buf + (offset >> 3)) << (offset & 7); >> +} >> +#endif >> + >> +#if AV_GCC_VERSION_AT_LEAST(3,4) && !defined(hevc_clz32) >> +#define hevc_clz32 hevc_clz32_builtin >> +static av_always_inline unsigned int hevc_clz32_builtin(const uint32_t x) >> +{ >> +// __builtin_clz says it works on ints - so adjust if int is >32 bits >> long >> +return __builtin_clz(x) - (sizeof(int) * 8 - 32); > >Why aren't you simply using ff_clz? Because it doesn't exist? Or at least I can't find it. >> +} >> +#endif >> + >> +// It is unlikely that we will ever need this but include for completeness > >There are at least two compilers we support that don't define __GNUC__, so >it would be used. >And in any case, isn't all this duplicating ff_clz, which is available in >libavutil/intmath.h? Are you sure of that? I can find ff_ctz but no ff_clz... I would happily be wrong. [snip] JC
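For reference, a minimal sketch of what the quoted hevc_mem_bits32 helper computes, written in portable C. The names rb32 and mem_bits32 are stand-ins invented here, with rb32 assumed to behave like AV_RB32 (a big-endian 32-bit load); this is an illustration, not the code from the patch.

```c
#include <stdint.h>

/* Stand-in for AV_RB32 (assumption): big-endian 32-bit load from p. */
static uint32_t rb32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Hypothetical equivalent of the quoted helper: returns the bits that start
 * at bit position `offset` (counted MSB first), left-aligned in 32 bits.
 * Only the top 32 - (offset & 7) bits carry data; the low bits are zero.
 * It reads 4 bytes from buf + (offset >> 3), so the buffer needs padding. */
static uint32_t mem_bits32(const void *buf, unsigned int offset)
{
    return rb32((const uint8_t *)buf + (offset >> 3)) << (offset & 7);
}
```

With the bytes 12 34 56 78 9a, an offset of 4 skips the first nibble and leaves the bottom four bits empty, which is why a caller can rely on at most 25 valid bits for an arbitrary offset.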
Re: [FFmpeg-devel] [PATCH]levc/hevc_cabac Optimise ff_hevc_hls_residual_coding (especially ARM)
On Tue, 19 Jan 2016 14:09:22 -0300, you wrote: >On 1/19/2016 2:05 PM, John Cox wrote: >>> On 1/19/2016 9:46 AM, John Cox wrote: >>>> +// Helper fns >>>> +#ifndef hevc_mem_bits32 >>>> +static av_always_inline uint32_t hevc_mem_bits32(const void * buf, const >>>> unsigned int offset) >>>> +{ >>>> +return AV_RB32((const uint8_t *)buf + (offset >> 3)) << (offset & 7); >>>> +} >>>> +#endif >>>> + >>>> +#if AV_GCC_VERSION_AT_LEAST(3,4) && !defined(hevc_clz32) >>>> +#define hevc_clz32 hevc_clz32_builtin >>>> +static av_always_inline unsigned int hevc_clz32_builtin(const uint32_t x) >>>> +{ >>>> +// __builtin_clz says it works on ints - so adjust if int is >32 bits >>>> long >>>> +return __builtin_clz(x) - (sizeof(int) * 8 - 32); >>> >>> Why aren't you simply using ff_clz? >> >> Because it doesn't exist? or at least I can't find it. >> >>>> +} >>>> +#endif >>>> + >>>> +// It is unlikely that we will ever need this but include for completeness >>> >>> There are at least two compilers we support that don't define __GNUC__, so >>> it would be used. >>> And in any case, isn't all this duplicating ff_clz, which is available in >>> libavutil/intmath.h? >> >> Are you sure of that? I can find ff_ctz but no ff_clz... >> I would happily be wrong. > >I assume you're writing this patch for the ffmpeg 2.8 branch or older, which >you shouldn't. >Always use the master branch. You'll find ff_clz there. Yes/no - the code I wrote had to work against 2.8 as that is what Raspberry Pi are using at the moment. This patch is meant to be against master so I can/will happily remove that code. (And I had the wrong version checked out when commenting previously.) By the way - can you tell me what the behaviour of ff_clz is when ints are 64 bits long, or is that never the case? Does it count up to 63 (I am aware that the behaviour applied against 0 may be undefined) or does it just work on the low 32 bits? 
(I assume the former) Thanks JC
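The question about int width can be illustrated with a small sketch. It assumes a generic count-leading-zeros over unsigned int (clz_uint below is a hypothetical stand-in, not FFmpeg's ff_clz) and shows the adjustment needed to get a 32-bit count whatever the width of int happens to be.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Hypothetical stand-in for a clz primitive that counts over the full width
 * of unsigned int. Like __builtin_clz, it has no defined result for 0. */
static unsigned int clz_uint(unsigned int x)
{
    const unsigned int top = 1u << (sizeof(unsigned int) * CHAR_BIT - 1);
    unsigned int n = 0;

    assert(x != 0);
    while (!(x & top)) {
        x <<= 1;
        n++;
    }
    return n;
}

/* Leading zeros of a 32-bit value. If unsigned int is wider than 32 bits the
 * generic count includes the extra high bits, so they are subtracted; with a
 * 32-bit int the correction is zero. */
static unsigned int clz32(uint32_t x)
{
    return clz_uint(x) - (unsigned int)(sizeof(unsigned int) * CHAR_BIT - 32);
}
```

Passing the argument as uint32_t guarantees the bits above 31 are clear, which is what makes the fixed subtraction valid.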
Re: [FFmpeg-devel] [PATCH]levc/hevc_cabac Optimise ff_hevc_hls_residual_coding (especially ARM)
>On 1/19/2016 2:24 PM, John Cox wrote: >> On Tue, 19 Jan 2016 14:09:22 -0300, you wrote: >> >>> On 1/19/2016 2:05 PM, John Cox wrote: >>>>> On 1/19/2016 9:46 AM, John Cox wrote: >>>>>> +// Helper fns >>>>>> +#ifndef hevc_mem_bits32 >>>>>> +static av_always_inline uint32_t hevc_mem_bits32(const void * buf, >>>>>> const unsigned int offset) >>>>>> +{ >>>>>> +return AV_RB32((const uint8_t *)buf + (offset >> 3)) << (offset & >>>>>> 7); >>>>>> +} >>>>>> +#endif >>>>>> + >>>>>> +#if AV_GCC_VERSION_AT_LEAST(3,4) && !defined(hevc_clz32) >>>>>> +#define hevc_clz32 hevc_clz32_builtin >>>>>> +static av_always_inline unsigned int hevc_clz32_builtin(const uint32_t >>>>>> x) >>>>>> +{ >>>>>> +// __builtin_clz says it works on ints - so adjust if int is >32 >>>>>> bits long >>>>>> +return __builtin_clz(x) - (sizeof(int) * 8 - 32); >>>>> >>>>> Why aren't you simply using ff_clz? >>>> >>>> Because it doesn't exist? or at least I can't find it. >>>> >>>>>> +} >>>>>> +#endif >>>>>> + >>>>>> +// It is unlikely that we will ever need this but include for >>>>>> completeness >>>>> >>>>> There are at least two compilers we support that don't define __GNUC__, so >>>>> it would be used. >>>>> And in any case, isn't all this duplicating ff_clz, which is available in >>>>> libavutil/inthmath.h? >>>> >>>> Are you sure of that? I can find ff_ctz but no ff_clz... >>>> I would happily be wrong. >>> >>> I assume you're writing this patch for the ffmpeg 2.8 branch or older, >>> which you shouldn't. >>> Always use the master branch. You'll find ff_clz there. >> >> Yes/no - the code I wrote had to work against 2.8 as that is what >> Rasperry Pi are using at the moment. This patch is meant to be against >> master so I can/will happily remove that code. (And I had the wrong >> version checked out when commenting previously) >> >> By the way - can you tell me what the behaviour of ff_clz is when ints >> are 64 bits long or is that never the case? 
Does it count up to 63 (I >> am aware that the behaviour applied against 0 may be undefined) or does >> it just work on the low 32 bits? (I assume the former) > >The generic version checks sizeof(unsigned), so the former. >The GNU specific version using the builtin is meant to work with an unsigned >int and not a fixed width data type, so it's probably safe to assume it will. In that case then it would appear that the definition of ff_log2 is wrong as that seems to assume a max 31: #if HAVE_FAST_CLZ #if AV_GCC_VERSION_AT_LEAST(3,4) #ifndef ff_log2 # define ff_log2(x) (31 - __builtin_clz((x)|1)) # ifndef ff_log2_16bit # define ff_log2_16bit av_log2 # endif #endif /* ff_log2 */ #endif /* AV_GCC_VERSION_AT_LEAST(3,4) */ #endif Regards JC ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
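The concern about ff_log2 can be sketched as follows: 31 - clz(x|1) only equals floor(log2(x)) when the clz counts over exactly 32 bits. A width-independent form, assuming a generic clz over unsigned int (clz_generic and log2_generic are hypothetical names used for illustration, not FFmpeg functions), derives the constant from the type instead of hard-coding 31.

```c
#include <limits.h>

/* Hypothetical stand-in for a clz that counts over the full width of
 * unsigned int; the caller must pass a nonzero value. */
static unsigned int clz_generic(unsigned int x)
{
    const unsigned int top = 1u << (sizeof(unsigned int) * CHAR_BIT - 1);
    unsigned int n = 0;

    while (!(x & top)) {
        x <<= 1;
        n++;
    }
    return n;
}

/* floor(log2(x)), with x == 0 mapped to 0 by the (x | 1) trick used in the
 * quoted macro. The index of the top bit comes from sizeof rather than 31. */
static int log2_generic(unsigned int x)
{
    const int top_bit = (int)(sizeof(unsigned int) * CHAR_BIT) - 1;

    return top_bit - (int)clz_generic(x | 1);
}
```

On a platform with 32-bit int this reduces to the quoted definition; the difference only shows up if unsigned int is wider.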
Re: [FFmpeg-devel] [PATCH]levc/hevc_cabac Optimise ff_hevc_hls_residual_coding (especially ARM)
On Wed, 20 Jan 2016 13:26:05 +0100, you wrote: >Hi, > >2016-01-19 13:46 GMT+01:00 John Cox : >> I've just done a fair bit of work on hevc_cabac decode for the Rasberry >> Pi2 and I think that the patch is generally applicable. Patch is >> attached but you may prefer to take it from git: > >This work is certainly impressive, and most people would have come >only with some of the "tricks" you used. >Although it already represents quite a bit of work, I echo others' >suggestions to have more incremental changes. > >> I have not yet run fate over it as I haven't yet finished downloading >> the samples (the internet connection here isn't wildly fast), but I have >> run it against the H265.1 conformance streams on both x86 and ARM and it >> causes no regressions. > >Your patch fails on the later fate tests linked to range extensions >(RExt sequences) on Win64. I didn't investigate why. Random thoughts: >transform_skip, cross-channel residual, some bypass-coded elements (eg >SAO). Yeah - that does fail (and I'm not sure why either at the moment) - I only tested against the published H.265.1 conformance suite and that doesn't contain the RExt tests. Do you believe that master ffmpeg produces the right answer for these tests? I didn't spot any RExt logic in the scale code when I rewrote it (it does affect how numbers are processed there) and it warns that it isn't supported when ffmpeg runs. Having said that I would still have expected my code to produce the same result as the old code so I'll look into it. >> 3) Uses clz which doesn't seem to exist in the ffmpeg int libs (though >> ctz does) > >That could be a patch in and by itself. 
Apparently ff_clz is now on master - but wasn't in 2.8 (which is what RPi need) >So, referring to your changes, it would be nice to have the following >changes split in their own patches: >1) significant coeff flag decoding, which probably is the largest gain >(and therefore would be even nicer if further sliced): > a) for instance, you avoid an indirection by flattening/merging >context tables; > b) other parts, which I fear may not translate that well for other >platforms (at least without matching x86 code), or sequences >2) you use native sized integers in some places (not sure if that can >cause issues); >3) bypass-coded stuff is a fairly large change (both in terms of code, >review and impacting the cabac struct also used by h264); it would be >nice knowing how much you gain here >4) the replacing of !!something by something when the flag is already 0/1 >5) coefficient saturation I don't have formal numbers for everything but from the profiling I did in development: The by22 code gained me an overall factor of two in the abs level decode - the gains do depend a lot on the quantity of residual - you gain a lot more on I-frames than you do otherwise as they tend to have much longer residuals. The higher the bitrate the more useful this code is. But as you note it didn't use vast amounts of time relative to everything else anyway. The reworking / simplification of the loop(s) around the abs level decode and the scaling gave me the biggest single improvement. After that the reworking of get_sig_ceoff_flag_idxs was a useful gain Special caseing the single coeff path gave a similar gain After that the scale rework - now probably 75% faster than it was previously but it wasn't taking a huge amount of time. And after that all the other bits - my experience with optimising this sort of code (I did a lot of work on a TI H.264 implementation in the past) is that no single change is going to do everything, you just have to polish everything until it goes fast enough. 
>3) is indeed the largest chunk. I don't know what your profiling >indicated, but the original code didn't seem that high-profile. But I >haven't split it to see what it actually provided, but overall numbers >look good: > >I quickly hacked (quickly being the keyword as it also means poor and >potentially resulting in faulty conclusion) something that is close to >2) + 4) for reference. >Benching REF+1)a) vs REF+1), it did seem slower on Win64/Haswell for >significant flag decoding by a few cycles (around 1% of the codeblock) >Benching REF+1)a) vs your patch, I see around 3% improvement with >something that is fairly more optimized overall than ffmpeg's master, >ie ff_hevc_hls_residual_coding is a lot more prevalent, which is >probably also the case in your rpi2 benchmarks. Sorry - I don't quite understand what you've said here. >Note: I don't think I'll review next iterations of the patch(set) with >any shape of diligence, but some of the above parts (1.a, 4 and 5) are >ok if not the cause of the fate issues. > >Best regards, Thanks JC ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
Re: [FFmpeg-devel] [PATCH]levc/hevc_cabac Optimise ff_hevc_hls_residual_coding (especially ARM)
On Wed, 20 Jan 2016 13:26:05 +0100, you wrote: >Hi, > >2016-01-19 13:46 GMT+01:00 John Cox : >> I've just done a fair bit of work on hevc_cabac decode for the Raspberry >> Pi2 and I think that the patch is generally applicable. Patch is >> attached but you may prefer to take it from git: > >This work is certainly impressive, and most people would have come >only with some of the "tricks" you used. >Although it already represents quite a bit of work, I echo others' >suggestions to have more incremental changes. > >> I have not yet run fate over it as I haven't yet finished downloading >> the samples (the internet connection here isn't wildly fast), but I have >> run it against the H265.1 conformance streams on both x86 and ARM and it >> causes no regressions. > >Your patch fails on the later fate tests linked to range extensions >(RExt sequences) on Win64. I didn't investigate why. Random thoughts: >transform_skip, cross-channel residual, some bypass-coded elements (eg >SAO). Thanks for that - it was a bug in my persistent rice processing, apparently untested by the main conformance suite. The code now passes fate (on x86 anyway). [snip] Regards JC
[FFmpeg-devel] [PATCH]levc/hevc_cabac Optimise ff_hevc_hls_residual_coding (v2)
Hi, this is v2 of my hevc residual patch. Changes: I've fixed the fate regression; I've split it into more pieces; it now uses ff_clz; some reformatting of function headers. The patches can also be found on https://github.com/jc-kynesim/rpi-ffmpeg.git on branch test/ff_hevc_cabac_4 from tag ff_hevc_cabac_4_base. Note that I will be going on holiday from the end of Friday (UK time) till the 1st Feb and will be unable to edit code or read this list during that period. Regards JC 0001-cabac-Ensure-2-byte-cabac-loads-are-on-2-byte-boundr.patch Description: Binary data 0002-cabac_functions-Cound-zeros-with-ctz-if-it-is-fast.patch Description: Binary data 0003-cabac_functions-Allow-more-functions-to-be-overridde.patch Description: Binary data 0004-hevc_cabac-Optimize-ff_hevc_hls_residual_coding.patch Description: Binary data 0005-hevc_cabac-Add-bulk-bypass-decoding.patch Description: Binary data 0006-hevc_cabac-Add-ARM-asm-functions.patch Description: Binary data
Re: [FFmpeg-devel] [PATCH]levc/hevc_cabac Optimise ff_hevc_hls_residual_coding (v2)
>On Fri, Jan 22, 2016 at 01:41:11AM +0100, Michael Niedermayer wrote: >> On Thu, Jan 21, 2016 at 10:45:55AM +0000, John Cox wrote: >> > Hi >> > >> > v2 of my hevc residual patch >> > >> > I've fixed the fate regression >> > I've split it into more pieces >> > Now uses ff_clz >> > Some reformating of function headers >> > >> > The patches can also be found on >> > https://github.com/jc-kynesim/rpi-ffmpeg.git on branch >> > test/ff_hevc_cabac_4 from tag ff_hevc_cabac_4_base >> > >> > Note that I will be going on holiday from the end of Friday (UK time) >> > till the 1st Feb and will be unable to edit code or read this list >> > during that period. >> >> seems failing here (with qemu) >> --cc='ccache arm-linux-gnueabi-gcc-4.5' --extra-cflags='-mfpu=neon >> -mfloat-abi=softfp' --cpu=cortex-a8 --arch=armv7 --target-os=linux >> --enable-cross-compile --disable-iconv --disable-pthreads >> --enable-neon-clobber-test >> tried without --enable-neon-clobber-test too >> >> qemu-arm version 1.1.0, Copyright (c) 2003-2008 >> also tried qemu-arm version 1.6.50 >> >> arm-linux-gnueabi-gcc-4.5 (Ubuntu/Linaro 4.5.3-12ubuntu2) 4.5.3 >> >> also tried your branch > >fate-hevc passes with patch 1-5, so the issue is likely in the last > >[...] Thanks - I'll fix it JC ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
Re: [FFmpeg-devel] [PATCH]levc/hevc_cabac Optimise ff_hevc_hls_residual_coding (v2)
On Fri, 22 Jan 2016 01:57:58 +0100, you wrote: >On Fri, Jan 22, 2016 at 01:41:11AM +0100, Michael Niedermayer wrote: >> On Thu, Jan 21, 2016 at 10:45:55AM +0000, John Cox wrote: >> > Hi >> > >> > v2 of my hevc residual patch >> > >> > I've fixed the fate regression >> > I've split it into more pieces >> > Now uses ff_clz >> > Some reformating of function headers >> > >> > The patches can also be found on >> > https://github.com/jc-kynesim/rpi-ffmpeg.git on branch >> > test/ff_hevc_cabac_4 from tag ff_hevc_cabac_4_base >> > >> > Note that I will be going on holiday from the end of Friday (UK time) >> > till the 1st Feb and will be unable to edit code or read this list >> > during that period. >> >> seems failing here (with qemu) >> --cc='ccache arm-linux-gnueabi-gcc-4.5' --extra-cflags='-mfpu=neon >> -mfloat-abi=softfp' --cpu=cortex-a8 --arch=armv7 --target-os=linux >> --enable-cross-compile --disable-iconv --disable-pthreads >> --enable-neon-clobber-test >> tried without --enable-neon-clobber-test too >> >> qemu-arm version 1.1.0, Copyright (c) 2003-2008 >> also tried qemu-arm version 1.6.50 >> >> arm-linux-gnueabi-gcc-4.5 (Ubuntu/Linaro 4.5.3-12ubuntu2) 4.5.3 >> >> also tried your branch > >fate-hevc passes with patch 1-5, so the issue is likely in the last > >[...] Yup - bug in the arm update_rice (again - sorry). Now passes fate on ARM too (now I've learnt how to run fate on my Pi in a finite time). New version of patch 6 attached - all others should still be good Regards JC 0006-hevc_cabac-Add-ARM-asm-functions-v2.patch Description: Binary data ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
Re: [FFmpeg-devel] [PATCH]levc/hevc_cabac Optimise ff_hevc_hls_residual_coding (especially ARM)
On Fri, 22 Jan 2016 12:18:29 +0100, you wrote: >Hi, > >2016-01-20 15:27 GMT+01:00 John Cox : >> The by22 code gained me an overall factor of two in the abs level decode >> - the gains do depend a lot on the quantity of residual - you gain a lot >> more on I-frames than you do otherwise as they tend to have much longer >> residuals. The higher the bitrate the more useful this code is. But as >> you note it didn't use vast amounts of time relative to everything else >> anyway. >> >> The reworking / simplification of the loop(s) around the abs level >> decode and the scaling gave me the biggest single improvement. > >The thing is, it provided no gain on any Win64 system I had at hand. Or >very minor, once I switched off things. The amount of new/changed code >would make it worth discussing, were it not for actual gains on arm. I think on ARM that things fitted within its register limit more often - either way it was useful. Much of the simplification work was structural so it was possible for me to extract simple functions to code in asm. >> After that the reworking of get_sig_ceoff_flag_idxs was a useful gain > >Yes, this is the most agreeable part of the non-applied parts. > >> Special casing the single coeff path gave a similar gain > >This is a big slowdown on Win64 and UHD-bluray like sequences, but >that can be switched off in that case. I'm a bit surprised that it generated a big slowdown - some cache must be running just on the edge, but yes if you normally have hi-bitrate stuff then it isn't wanted. On my test streams the bitrates were normally quite low - quite unlike what I would expect from blu-ray sequences. Default it to off on x86 but on on ARM? >> After that the scale rework - now probably 75% faster than it was >> previously but it wasn't taking a huge amount of time. > >The work is done, I don't mind. 
> >> And after that all the other bits - my experience with optimising this >> sort of code (I did a lot of work on a TI H.264 implementation in the >> past) is that no single change is going to do everything, you just have >> to polish everything until it goes fast enough. > >Sure. There may be positive interactions, but my own figures showed >the sigmap/greater than flags were the only ones worth optimizing on >Win64. Very plausibly >> Sorry - I don't quite understand what you've said here. > >Doesn't matter anymore, I think I have just laid out the parts >actually mattering, and for haswell/Win64 (ie x86_64). I think you've cleared up my misunderstanding in the expanded comments above. >I'll reply more in depth to the new patchset, but not until you're on >holidays. Which should leave me more time for reviewing it, so all the >better. Good oh. JC ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
Re: [FFmpeg-devel] [PATCH]levc/hevc_cabac Optimise ff_hevc_hls_residual_coding (v2)
On Fri, 22 Jan 2016 14:42:27 +0100, you wrote: > [snip] >> >fate-hevc passes with patch 1-5, so the issue is likely in the last >> > >> >[...] >> >> Yup - bug in the arm update_rice (again - sorry). Now passes fate on >> ARM too (now I've learnt how to run fate on my Pi in a finite time). >> >> New version of patch 6 attached - all others should still be good > >fate passes on qemu now Hurrah! Many thanks. Sorry about the false starts. >also you may want to add yourself to the MAINTAINERS file (in a patch) >for the parts you added I'll happily add myself once I have some substantial code on master to maintain. Regards JC
Re: [FFmpeg-devel] [PATCH]levc/hevc_cabac Optimise ff_hevc_hls_residual_coding (especially ARM)
Hi >Hi, > >2016-01-22 14:29 GMT+01:00 John Cox : >>>This is a big slowdown on Win64 and UHD-bluray like sequences, but >>>that can be switched off in that case. >> >> I'm a bit surprised that it generated a big slowdown - some cache must >> be running just on the edge, but yes if you normally have hi-bitrate >> stuff then it isn't wanted. On my test streams the bitrates were >> normally quite low - quite unlike what I would expect from blu-ray >> sequences. > >Initial (4 sequences): >6553 decicycles in g, 8387110 runs, 1498 skips >5916 decicycles in g,33546118 runs, 8314 skips >5028 decicycles in g,67101499 runs, 7365 skips >4729 decicycles in g,33548420 runs, 6012 skips > >Deactivating USE_N_END_1: >4746 decicycles in g,16774296 runs, 2920 skips >5373 decicycles in g,33545629 runs, 8803 skips >4141 decicycles in g,67098928 runs, 9936 skips >3869 decicycles in g,33544593 runs, 9839 skips > >But I see the first one surprisingly having half the iterations (but >this has almost converged at this point). >So 10-20%. Coo - that is big. How are you profiling that and with what streams? >I think it has more to do with cache pressure, both code, which >increases from 8 to 9.5KB, and data, with already "large" tables in a >loop that may need to be tight. I agree (and it is what I was trying to suggest in my previous comment). It also suggests that on x86 you might benefit from non-inlined cabac_gets to keep the code size small. >> Default it to off on x86 but on on ARM? > >Yes, I think so. Is ARCH_X86/ARM an appropriate switch for this? Regards JC
Re: [FFmpeg-devel] [PATCH]levc/hevc_cabac Optimise ff_hevc_hls_residual_coding (v2)
On Fri, 22 Jan 2016 18:52:23 +0100, you wrote: >Hi, > >2016-01-21 11:45 GMT+01:00 John Cox : >> Hi >> >> v2 of my hevc residual patch > >I'll review the bit not related to significant coeffs first, because I >think it is the most performance-sensitive. Also there are bits that >could be moved to other patches, at least some are related to the >later bypass patch. Here's a list you'll see detailed below: >- coefficient saturation, which I think is OK to commit >- bypass-related stuff >- boolean stuff (!!stuff), which I think is OK to commit >- cosmetics (like renaming a variable or introducing a shorthand) >- sig(nificant coefficients )map > >The fact is I've benchmarked parts of the code and seeing slowdowns as >well as speedups on x86_64, hence why it would be nice to be able to >test and evaluate each of those parts separately. Fair enough - though given that your slowdowns are almost certainly cache-related the whole may be quite different from the sum of the parts. >> +// Helper fns >> +#ifndef hevc_mem_bits32 >> +static av_always_inline uint32_t hevc_mem_bits32(const void * buf, >> const unsigned int offset) >> +{ >> +return AV_RB32((const uint8_t *)buf + (offset >> 3)) << (offset & 7); >> +} >> +#endif >> + >> +#if !defined(hevc_clz32) >> +#define hevc_clz32 hevc_clz32_builtin >> +static av_always_inline unsigned int hevc_clz32_builtin(const uint32_t x) >> +{ >> +// ff_clz says works on ints (probably) - so adjust if int is >32 bits >> long >> +// the fact that x is passed in as uint32_t will have cleared the top >> bits >> +return ff_clz(x) - (sizeof(int) * 8 - 32); >> +} >> +#endif >> + >> +#define bypass_start(s) >> +#define bypass_finish(s) > >bypass-related? 
> >> void ff_hevc_save_states(HEVCContext *s, int ctb_addr_ts) >> { >> if (s->ps.pps->entropy_coding_sync_enabled_flag && >> @@ -863,19 +928,19 @@ int ff_hevc_cbf_luma_decode(HEVCContext *s, int >> trafo_depth) >> return GET_CABAC(elem_offset[CBF_LUMA] + !trafo_depth); >> } >> >> -static int hevc_transform_skip_flag_decode(HEVCContext *s, int c_idx) >> +static int hevc_transform_skip_flag_decode(HEVCContext *s, int c_idx_nz) >> { >> -return GET_CABAC(elem_offset[TRANSFORM_SKIP_FLAG] + !!c_idx); >> +return GET_CABAC(elem_offset[TRANSFORM_SKIP_FLAG] + c_idx_nz); >> } >> >> -static int explicit_rdpcm_flag_decode(HEVCContext *s, int c_idx) >> +static int explicit_rdpcm_flag_decode(HEVCContext *s, int c_idx_nz) >> { >> -return GET_CABAC(elem_offset[EXPLICIT_RDPCM_FLAG] + !!c_idx); >> +return GET_CABAC(elem_offset[EXPLICIT_RDPCM_FLAG] + c_idx_nz); >> } >> >> -static int explicit_rdpcm_dir_flag_decode(HEVCContext *s, int c_idx) >> +static int explicit_rdpcm_dir_flag_decode(HEVCContext *s, int c_idx_nz) >> { >> -return GET_CABAC(elem_offset[EXPLICIT_RDPCM_DIR_FLAG] + !!c_idx); >> +return GET_CABAC(elem_offset[EXPLICIT_RDPCM_DIR_FLAG] + c_idx_nz); >> } > >Boolean stuff. Ideally, the whole boolean stuff topic would be better >as a separate patch, with which I would be OK. 
> >> int ff_hevc_log2_res_scale_abs(HEVCContext *s, int idx) { >> @@ -891,14 +956,14 @@ int ff_hevc_res_scale_sign_flag(HEVCContext *s, int >> idx) { >> return GET_CABAC(elem_offset[RES_SCALE_SIGN_FLAG] + idx); >> } >> >> -static av_always_inline void >> last_significant_coeff_xy_prefix_decode(HEVCContext *s, int c_idx, >> +static av_always_inline void >> last_significant_coeff_xy_prefix_decode(HEVCContext *s, int c_idx_nz, >> int log2_size, int >> *last_scx_prefix, int *last_scy_prefix) >> { >> int i = 0; >> int max = (log2_size << 1) - 1; >> int ctx_offset, ctx_shift; >> >> -if (!c_idx) { >> +if (!c_idx_nz) { >> ctx_offset = 3 * (log2_size - 2) + ((log2_size - 1) >> 2); >> ctx_shift = (log2_size + 1) >> 2; >> } else { >> @@ -929,22 +994,16 @@ static av_always_inline int >> last_significant_coeff_suffix_decode(HEVCContext *s, >> return value; >> } >> >> -static av_always_inline int >> significant_coeff_group_flag_decode(HEVCContext *s, int c_idx, int >> ctx_cg) >> +static av_always_inline int >> significant_coeff_group_flag_decode(HEVCContext *s, int c_idx_nz, int >> ctx_cg) > >cosmetics? I renamed the variable, because c_idx can have values 0..2
[FFmpeg-devel] Allocating a single YUV buffer rather than 3?
Hi In order to get a copy-free display on my target h/w I need to have my decode output YUV planes contiguous. The default allocator gets each plane separately (so they aren't contiguous, or at least aren't always). Is there a simple preferred way of getting this to work? I've got slightly lost in the maze of twisty little frame/buffer allocation functions and a pointer to the right place would be extremely helpful. If methods vary by decoder/format then I'm only really interested in H.265 8-bit 4:2:0 at the moment. Many thanks JC
Re: [FFmpeg-devel] [PATCH]levc/hevc_cabac Optimise ff_hevc_hls_residual_coding (v2)
Hi On Tue, 2 Feb 2016 12:52:15 +0100, you wrote: >Hi, > >as a motus operandi for this review, I have no time for a proper one, >or at least not fitting with John's timeframe. I'll try to close as >many pending discussions, and would prefer if someone else completed >the review/validation/commit. Thanks >2016-01-22 19:33 GMT+01:00 John Cox : >> Fair enough - though given that your slowdowns are almost certainly >> cache-related the whole may be quite different from the sum of the >> parts. > >True, they don't always translate to anything noticeable, but that's >the best tool we have to objectively decide. Yes, but it isn't always a good one. I have spent substantial time in the past optimising TI DSP based codecs and it was not uncommon that some patches would make life slightly slower until enough of them were applied and then the whole thing suddenly gained a jump in speed. Either way I'm not averse to splitting stuff up and, at least on ARM, none of the patches caused a slowdown. >>>cosmetics? >> >> I renamed the variable, because c_idx can have values 0..2 and c_idx_nz >> is a boolean with 0..1 and in the rewrite of the inc var it is important >> that we are using the _nz variant so having the var named appropriately >> seemed sensible. > >I don't really mind mixing some form of cosmetics (=supposedly without >code generation consequences) although other people prefer splitting >for easier review and regression testing. > >This is not a blocking item for me, just that finding the most >appropriate commit would be nice. My point was that I changed the inputs to that fn and so I changed the vars name to make the point clearer - it should be part of the c_idx_nz patch. >>>I suppose branch prediction could help here, but less likely than for >>>get_cabac_sig_coeff_flag_idxs. >>> >>>Also for this and some others: why inline over av_always_inline? 
>> No particularly good reason for this one - though for any fn that might >> be called from multiple places there is a strong argument for just >> "inline" as it allows the compiler to make a judgment call based on how >> big L1 cache is and how bad the call penalty. > >Anyway, those kinds of micro-optimizations I'm suggesting need more >testing (sequences, platforms), so let's roll with this. > >>>AV_WN64 of 0x0101010101010101ULL, or even a memset, as it would be >>>inlined by the compiler, given its size, and done the best possible >>>way. >> >> levels is int *, not char * > >Ok, sorry, then 0x00010001ULL. But you can ignore this, it'll >probably make no difference outside of a micro-benchmark. My experience with compilers is that this is the sort of thing that they can and will do off their own bat. (Certainly MS C has been unrolling this sort of memset loop for the past two decades and I'd be stunned if gcc doesn't too), >>>Saturation, could be a separate patch, with which I would be ok. > >btw and iirc, a comment indicated assumptions on what is a "legit" >(instead of conforming ) bitstream/coeffs, making a conscious >decision. > >I know Ronald, ffvp9's author, specifically decided to handle >equivalent cases in bitstreams (hint) from Argon Designs. I have no >opinion, but others might. > >>>Related to but not strictly bypass ? >> >> Not bypass per se, more the general optimisation of abs_level_remaining >> - it is pulled out because I had a slightly better arm asm version of >> the fn. So it could go in that patch, but this allows other asm to >> override it if they so desire. > >What I meant: would better be there than in another commit. > >>>Doing: >>>if (get_cabac(c, state0 + ctx_map[n])) >>>*p++ = n; >>>while (--n != 0) { >>>if (get_cabac(c, state0 + ctx_map[n])) >>>*p++ = n; >>>} >>>is most likely faster, probably also on arm, if the branch prediction is >>>good. >> >> Not convinced. 
That will increase code size (as get_cabac will inline)
>> which can be pure poison as you have found out with USE_N_END_1.
>
>That's 300B, not 1.5KB. And I *know* it can help, just not on all
>platforms and sequences. The same decision was made for ffh264's
>equivalent, iirc.

I'll have to take your word for it but it seems very strange to me that

fn(x); while(cond) fn(x);

is faster than

do { fn(x); } while (cond);

I guess that it might be a branch prediction thing, but the second form
uses no more conditions than the first and is shorter. (And the compiler
always has the option of unrolling
Re: [FFmpeg-devel] [PATCH] lavc/hevc_cabac Optimise ff_hevc_hls_residual_coding (v2)
On Tue, 2 Feb 2016 12:52:15 +0100, you wrote:

>Hi,
>
>as a modus operandi for this review, I have no time for a proper one,
>or at least not fitting with John's timeframe. I'll try to close as
>many pending discussions, and would prefer if someone else completed
>the review/validation/commit.

Do we have another volunteer?

>[snip]

Many thanks

JC
[FFmpeg-devel] [PATCH] configure fix arm inline defines
Hi

I believe there is a bug in the arm feature detection for inline asm in
configure and I have a patch for it.

Currently using a command line like:

./configure --enable-cross-compile --arch=arm --cpu=cortex-a7 --target-os=linux --cross-prefix=arm-linux-gnueabihf-

gives in config.h:

#define HAVE_ARMV5TE 1
#define HAVE_ARMV6 1
#define HAVE_ARMV6T2 1
#define HAVE_ARMV8 0
#define HAVE_NEON 1
#define HAVE_VFP 1
#define HAVE_VFPV3 1
#define HAVE_SETEND 1
...
#define HAVE_ARMV5TE_EXTERNAL 1
#define HAVE_ARMV6_EXTERNAL 1
#define HAVE_ARMV6T2_EXTERNAL 1
#define HAVE_ARMV8_EXTERNAL 0
#define HAVE_NEON_EXTERNAL 0
#define HAVE_VFP_EXTERNAL 1
#define HAVE_VFPV3_EXTERNAL 1
#define HAVE_SETEND_EXTERNAL 1
...
#define HAVE_ARMV5TE_INLINE 0
#define HAVE_ARMV6_INLINE 0
#define HAVE_ARMV6T2_INLINE 0
#define HAVE_ARMV8_INLINE 0
#define HAVE_NEON_INLINE 0
#define HAVE_VFP_INLINE 0
#define HAVE_VFPV3_INLINE 0
#define HAVE_SETEND_INLINE 0

With the patch below you get

...
#define HAVE_ARMV5TE 1
#define HAVE_ARMV6 1
#define HAVE_ARMV6T2 1
#define HAVE_ARMV8 0
#define HAVE_NEON 1
#define HAVE_VFP 1
#define HAVE_VFPV3 1
#define HAVE_SETEND 1
...
#define HAVE_ARMV5TE_EXTERNAL 1
#define HAVE_ARMV6_EXTERNAL 1
#define HAVE_ARMV6T2_EXTERNAL 1
#define HAVE_ARMV8_EXTERNAL 0
#define HAVE_NEON_EXTERNAL 0
#define HAVE_VFP_EXTERNAL 1
#define HAVE_VFPV3_EXTERNAL 1
#define HAVE_SETEND_EXTERNAL 1
...
#define HAVE_ARMV5TE_INLINE 1
#define HAVE_ARMV6_INLINE 1
#define HAVE_ARMV6T2_INLINE 1
#define HAVE_ARMV8_INLINE 0
#define HAVE_NEON_INLINE 0
#define HAVE_VFP_INLINE 1
#define HAVE_VFPV3_INLINE 1
#define HAVE_SETEND_INLINE 1

If I want to get Neon enabled as well then I need to have a -mfpu=neon
in the compiler flags too. I'm not sure how to get it there unless I
pass it as extra flags.

This patch adds quotes around the asm that is in the __asm__ statement

Regards

John Cox

diff --git a/configure b/configure
index 22eeca22a5..4dbee8d349 100755
--- a/configure
+++ b/configure
@@ -1040,7 +1040,7 @@ EOF
 
 check_insn(){
     log check_insn "$@"
-    check_inline_asm ${1}_inline "$2"
+    check_inline_asm ${1}_inline "\"$2\""
     check_as ${1}_external "$2"
 }
[FFmpeg-devel] [PATCH] use av_clip_uintp2_c where clip is variable
Hi

I enclose a patch that changes av_clip_uintp2 to av_clip_uintp2_c where
the bit depth is variable. This fixes compilation issues if
HAVE_ARMV6_INLINE is 1 and therefore allows arm inline detection to be
fixed too.

Regards

John Cox

variable_clip.patch
Description: Binary data
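
To illustrate why a variable bit depth needs the plain C path, here is a minimal sketch, assuming a portable clip-to-p-bits helper along the lines of av_clip_uintp2_c. The name clip_uintp2_sketch is hypothetical and this is not FFmpeg's implementation; the point is only that p is an ordinary run-time argument here, which an asm operand restricted to an immediate cannot accept.

```c
/* Hypothetical stand-in for a portable "clip to p unsigned bits" helper.
 * p may be a run-time value; assumes 1 <= p <= 30. */
static unsigned clip_uintp2_sketch(int a, int p)
{
    const int max = (1 << p) - 1;

    if (a & ~max)                       /* any bit outside the low p bits? */
        return a < 0 ? 0u : (unsigned)max;
    return (unsigned)a;
}
```

A call site with a depth read from a stream or frame descriptor, e.g. clip_uintp2_sketch(v, depth), compiles regardless of whether depth is a compile-time constant.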
Re: [FFmpeg-devel] Patch: Replace quotes for inline asm detection.
>On 5/30/2018 10:32 PM, Michael Niedermayer wrote: >> On Wed, May 30, 2018 at 09:48:51AM -0700, Frank Liberato wrote: >>> Please find attached a one line patch: >>> >>> >>>> Commit 8c893aa3cd5 removed quotes that were required to detect >>>> inline asm in clank: >>>> >>>> check_insn armv5te qadd r0, r0, r0 >>>> .../test.c:1:34: error: expected string literal in 'asm' >>>> void foo(void){ __asm__ volatile(qadd r0, r0, r0); } >>>> >>>> The correct code is: >>>> >>>> void foo(void){ __asm__ volatile("qadd r0, r0, r0"); } >>> >>> >>> Thanks >>> Frank >> >>> configure |2 +- >>> 1 file changed, 1 insertion(+), 1 deletion(-) >>> 2d51797903ad2f3cab321e72bf5e7209116c3dae >>> 0001-Replace-quotes-for-inline-asm-detection.patch >>> From 58c96127b6f1510b956b2280049d1c3778e3cab4 Mon Sep 17 00:00:00 2001 >>> From: "liber...@chromium.org" >>> Date: Tue, 29 May 2018 11:35:04 -0700 >>> Subject: [PATCH] Replace quotes for inline asm detection. >>> >>> Commit 8c893aa3cd5 removed quotes that were required to detect >>> inline asm in clank: >>> >>> check_insn armv5te qadd r0, r0, r0 >>> .../test.c:1:34: error: expected string literal in 'asm' >>> void foo(void){ __asm__ volatile(qadd r0, r0, r0); } >>> >>> The correct code is: >>> >>> void foo(void){ __asm__ volatile("qadd r0, r0, r0"); } >>> --- >>> configure | 2 +- >>> 1 file changed, 1 insertion(+), 1 deletion(-) >>> >>> diff --git a/configure b/configure >>> index 22eeca22a5..4dbee8d349 100755 >>> --- a/configure >>> +++ b/configure >>> @@ -1040,7 +1040,7 @@ EOF >>> >>> check_insn(){ >>> log check_insn "$@" >>> -check_inline_asm ${1}_inline "$2" >>> +check_inline_asm ${1}_inline "\"$2\"" >>> check_as ${1}_external "$2" >>> } >> >> This seems to break my arm qemu build: > >That'd be because vf_amplify is calling av_clip_uintp2() with a non >immediate value. The arm optimized function makes an immediate value as >second argument a requirement, so av_clip_uintp2_c() should be used >there instead. 
> >This means 3c56d673418/8c893aa3cd5 broke detection of arm inline asm >features for your qemu builds as well, and this patch restores that >functionality. > >> >> In file included from src/libavutil/intmath.h:30:0, >> from src/libavutil/common.h:106, >> from src/libavutil/avutil.h:296, >> from src/libavutil/imgutils.h:30, >> from src/libavfilter/vf_amplify.c:21: >> src/libavutil/arm/intmath.h: In function ‘amplify_frame’: >> src/libavutil/arm/intmath.h:77:5: warning: asm operand 2 probably doesn’t >> match constraints [enabled by default] >> src/libavutil/arm/intmath.h:77:5: error: impossible constraint in ‘asm’ >> make: *** [libavfilter/vf_amplify.o] Error 1 >> make: *** Waiting for unfinished jobs >> src/libavfilter/src_movie.c: In function ‘open_stream’: >> src/libavfilter/src_movie.c:175:5: warning: ‘refcounted_frames’ is >> deprecated (declared at src/libavcodec/avcodec.h:2345) >> [-Wdeprecated-declarations] >> src/libavfilter/src_movie.c: In function ‘movie_push_frame’: >> src/libavfilter/src_movie.c:529:9: warning: ‘avcodec_decode_video2’ is >> deprecated (declared at src/libavcodec/avcodec.h:4756) >> [-Wdeprecated-declarations] >> src/libavfilter/src_movie.c:532:9: warning: ‘avcodec_decode_audio4’ is >> deprecated (declared at src/libavcodec/avcodec.h:4707) >> [-Wdeprecated-declarations] >> src/libavfilter/vaf_spectrumsynth.c: In function ‘try_push_frame’: >> src/libavfilter/vaf_spectrumsynth.c:429:12: warning: ‘end’ may be used >> uninitialized in this function [-Wuninitialized] >> src/libavfilter/vaf_spectrumsynth.c:428:14: warning: ‘start’ may be used >> uninitialized in this function [-Wuninitialized] >> src/libavfilter/vaf_spectrumsynth.c: In function ‘try_push_frames’: >> src/libavfilter/vaf_spectrumsynth.c:437:9: warning: ‘ret’ may be used >> uninitialized in this function [-Wuninitialized] >> >> arm-linux-gnueabi-gcc-4.6 (Debian 4.6.3-15) 4.6.3 master is now patched s.t. 
these should compile with HAVE_ARMV6_INLINE set

Regards

John Cox
[FFmpeg-devel] [PATCH v2] configure fix arm inline defines
Hi

Actually this is the same patch as before but master has been fixed s.t.
enabling arm inline asm no longer breaks it:

I believe there is a bug in the arm feature detection for inline asm in
configure and I have a patch for it.

Currently using a command line like:

./configure --enable-cross-compile --arch=arm --cpu=cortex-a7 --target-os=linux --cross-prefix=arm-linux-gnueabihf-

gives in config.h:

#define HAVE_ARMV5TE 1
#define HAVE_ARMV6 1
#define HAVE_ARMV6T2 1
#define HAVE_ARMV8 0
#define HAVE_NEON 1
#define HAVE_VFP 1
#define HAVE_VFPV3 1
#define HAVE_SETEND 1
...
#define HAVE_ARMV5TE_EXTERNAL 1
#define HAVE_ARMV6_EXTERNAL 1
#define HAVE_ARMV6T2_EXTERNAL 1
#define HAVE_ARMV8_EXTERNAL 0
#define HAVE_NEON_EXTERNAL 0
#define HAVE_VFP_EXTERNAL 1
#define HAVE_VFPV3_EXTERNAL 1
#define HAVE_SETEND_EXTERNAL 1
...
#define HAVE_ARMV5TE_INLINE 0
#define HAVE_ARMV6_INLINE 0
#define HAVE_ARMV6T2_INLINE 0
#define HAVE_ARMV8_INLINE 0
#define HAVE_NEON_INLINE 0
#define HAVE_VFP_INLINE 0
#define HAVE_VFPV3_INLINE 0
#define HAVE_SETEND_INLINE 0

With the patch below you get

...
#define HAVE_ARMV5TE 1
#define HAVE_ARMV6 1
#define HAVE_ARMV6T2 1
#define HAVE_ARMV8 0
#define HAVE_NEON 1
#define HAVE_VFP 1
#define HAVE_VFPV3 1
#define HAVE_SETEND 1
...
#define HAVE_ARMV5TE_EXTERNAL 1
#define HAVE_ARMV6_EXTERNAL 1
#define HAVE_ARMV6T2_EXTERNAL 1
#define HAVE_ARMV8_EXTERNAL 0
#define HAVE_NEON_EXTERNAL 0
#define HAVE_VFP_EXTERNAL 1
#define HAVE_VFPV3_EXTERNAL 1
#define HAVE_SETEND_EXTERNAL 1
...
#define HAVE_ARMV5TE_INLINE 1
#define HAVE_ARMV6_INLINE 1
#define HAVE_ARMV6T2_INLINE 1
#define HAVE_ARMV8_INLINE 0
#define HAVE_NEON_INLINE 0
#define HAVE_VFP_INLINE 1
#define HAVE_VFPV3_INLINE 1
#define HAVE_SETEND_INLINE 1

If I want to get Neon enabled as well then I need to have a -mfpu=neon
in the compiler flags too. I'm not sure how to get it there unless I
pass it as extra flags.

This patch adds quotes around the asm that is in the __asm__ statement

Regards

John Cox

diff --git a/configure b/configure
index 22eeca22a5..4dbee8d349 100755
--- a/configure
+++ b/configure
@@ -1040,7 +1040,7 @@ EOF
 
 check_insn(){
     log check_insn "$@"
-    check_inline_asm ${1}_inline "$2"
+    check_inline_asm ${1}_inline "\"$2\""
     check_as ${1}_external "$2"
 }
Re: [FFmpeg-devel] [PATCH v2] configure fix arm inline defines
>Hi
>
>Actually this is the same patch as before but master has been fixed s.t.
>enabling arm inline asm no longer breaks it:
>
>I believe there is a bug in the arm feature detection for inline asm in
>configure and I have a patch for it.
>
>Currently using a command line like:
>
>./configure --enable-cross-compile --arch=arm --cpu=cortex-a7
>--target-os=linux --cross-prefix=arm-linux-gnueabihf-
>
>gives in config.h:
>
>#define HAVE_ARMV5TE 1
>#define HAVE_ARMV6 1
>#define HAVE_ARMV6T2 1
>#define HAVE_ARMV8 0
>#define HAVE_NEON 1
>#define HAVE_VFP 1
>#define HAVE_VFPV3 1
>#define HAVE_SETEND 1
>...
>#define HAVE_ARMV5TE_EXTERNAL 1
>#define HAVE_ARMV6_EXTERNAL 1
>#define HAVE_ARMV6T2_EXTERNAL 1
>#define HAVE_ARMV8_EXTERNAL 0
>#define HAVE_NEON_EXTERNAL 0
>#define HAVE_VFP_EXTERNAL 1
>#define HAVE_VFPV3_EXTERNAL 1
>#define HAVE_SETEND_EXTERNAL 1
>...
>#define HAVE_ARMV5TE_INLINE 0
>#define HAVE_ARMV6_INLINE 0
>#define HAVE_ARMV6T2_INLINE 0
>#define HAVE_ARMV8_INLINE 0
>#define HAVE_NEON_INLINE 0
>#define HAVE_VFP_INLINE 0
>#define HAVE_VFPV3_INLINE 0
>#define HAVE_SETEND_INLINE 0
>
>With the patch below you get
>
>...
>#define HAVE_ARMV5TE 1
>#define HAVE_ARMV6 1
>#define HAVE_ARMV6T2 1
>#define HAVE_ARMV8 0
>#define HAVE_NEON 1
>#define HAVE_VFP 1
>#define HAVE_VFPV3 1
>#define HAVE_SETEND 1
>...
>#define HAVE_ARMV5TE_EXTERNAL 1
>#define HAVE_ARMV6_EXTERNAL 1
>#define HAVE_ARMV6T2_EXTERNAL 1
>#define HAVE_ARMV8_EXTERNAL 0
>#define HAVE_NEON_EXTERNAL 0
>#define HAVE_VFP_EXTERNAL 1
>#define HAVE_VFPV3_EXTERNAL 1
>#define HAVE_SETEND_EXTERNAL 1
>...
>#define HAVE_ARMV5TE_INLINE 1
>#define HAVE_ARMV6_INLINE 1
>#define HAVE_ARMV6T2_INLINE 1
>#define HAVE_ARMV8_INLINE 0
>#define HAVE_NEON_INLINE 0
>#define HAVE_VFP_INLINE 1
>#define HAVE_VFPV3_INLINE 1
>#define HAVE_SETEND_INLINE 1
>
>If I want to get Neon enabled as well then I need to have a -mfpu=neon
>in the compiler flags too. I'm not sure how to get it there unless I pass
>it as extra flags.
>
>This patch adds quotes around the asm that is in the __asm__ statement
>
>Regards
>
>John Cox
>
>diff --git a/configure b/configure
>index 22eeca22a5..4dbee8d349 100755
>--- a/configure
>+++ b/configure
>@@ -1040,7 +1040,7 @@ EOF
>
> check_insn(){
>     log check_insn "$@"
>-    check_inline_asm ${1}_inline "$2"
>+    check_inline_asm ${1}_inline "\"$2\""
>     check_as ${1}_external "$2"
> }

Ping

This fixes the regression whereby no arm inline asm is ever enabled.
There is still the neon inline regression, but that will be another
patch.

Master now compiles OK with arm inline asm enabled. (Which it didn't
1st time this patch was suggested)

JC
[FFmpeg-devel] [PATCH] configure: fix inline neon regression
Hi

This patch fixes the regression whereby inline neon is not enabled.

Actually I'm a bit unsure about this patch (despite the fact I'm
submitting it). It does do its job in that if you specify an armv7a cpu
then it will try to enable neon, but it is a bit mucky due to
uncertainties about exactly what capabilities each cpu actually has.
Really configure probably wants a --fpu= option, but my understanding of
how it is meant to work isn't up to that, so for the moment if the fpu
type is specified by the user then I expect it to turn up in
cextra_flags.

I'll also note that probe_arm_arch ends up setting subarch to armv7-a
when the other bits of the script expect armv7a (although gcc wants
armv7-a in -march). Again I am confused by this but I'm not sure what
the right answer is let alone the correct fix. Maybe whoever wrote this
bit of configure could revisit it?

Regards

John Cox

neon_inline.patch
Description: Binary data
[FFmpeg-devel] [PATCH] avfilter/vf_bwdif: Add capability to deinterlace NV12
As bwdif takes no account of horizontally adjacent pixels the same code
can be used on planes that have multiple components as is used on single
component planes. Update the filtering code to cope with multi-component
planes and add NV12 to the list of supported formats.

Signed-off-by: John Cox
---
 libavfilter/vf_bwdif.c | 16 +---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/libavfilter/vf_bwdif.c b/libavfilter/vf_bwdif.c
index 353cd0b61a..e07783ff70 100644
--- a/libavfilter/vf_bwdif.c
+++ b/libavfilter/vf_bwdif.c
@@ -115,19 +115,28 @@ static void filter(AVFilterContext *ctx, AVFrame *dstpic,
     YADIFContext *yadif = &bwdif->yadif;
     ThreadData td = { .frame = dstpic, .parity = parity, .tff = tff };
     int i;
+    int last_plane = -1;
 
     for (i = 0; i < yadif->csp->nb_components; i++) {
         int w = dstpic->width;
         int h = dstpic->height;
+        const AVComponentDescriptor * const comp = yadif->csp->comp + i;
+
+        // If the last plane was the same as this plane assume we've dealt
+        // with all the pels already
+        if (last_plane == comp->plane)
+            continue;
+        last_plane = comp->plane;
 
         if (i == 1 || i == 2) {
             w = AV_CEIL_RSHIFT(w, yadif->csp->log2_chroma_w);
             h = AV_CEIL_RSHIFT(h, yadif->csp->log2_chroma_h);
         }
 
-        td.w = w;
-        td.h = h;
-        td.plane = i;
+        // comp step is in bytes but td.w is in pels
+        td.w = w * comp->step / ((comp->depth + 7) / 8);
+        td.h = h;
+        td.plane = comp->plane;
 
         ff_filter_execute(ctx, filter_slice, &td, NULL,
                           FFMIN((h+3)/4, ff_filter_get_nb_threads(ctx)));
@@ -162,6 +171,7 @@ static const enum AVPixelFormat pix_fmts[] = {
     AV_PIX_FMT_YUVA420P9, AV_PIX_FMT_YUVA422P9, AV_PIX_FMT_YUVA444P9,
     AV_PIX_FMT_YUVA420P10, AV_PIX_FMT_YUVA422P10, AV_PIX_FMT_YUVA444P10,
     AV_PIX_FMT_YUVA420P16, AV_PIX_FMT_YUVA422P16, AV_PIX_FMT_YUVA444P16,
+    AV_PIX_FMT_NV12,
     AV_PIX_FMT_GBRP, AV_PIX_FMT_GBRP9, AV_PIX_FMT_GBRP10,
     AV_PIX_FMT_GBRP12, AV_PIX_FMT_GBRP14, AV_PIX_FMT_GBRP16,
     AV_PIX_FMT_GBRAP, AV_PIX_FMT_GBRAP16,
-- 
2.40.1
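
The width calculation in the patch can be looked at in isolation. This is a minimal sketch, assuming a reduced descriptor struct: comp_desc_sketch and plane_width_in_samples are hypothetical names, and only the plane, step and depth fields mirror what the patch reads from AVComponentDescriptor. For an NV12-style interleaved chroma plane (step of 2 bytes, depth of 8 bits) each chroma pel contributes two samples to the line handed to the filter, while a planar 8-bit or 10-bit component yields the pel count unchanged.

```c
/* Hypothetical, reduced stand-in for the fields the patch reads. */
struct comp_desc_sketch {
    int plane;  /* which data plane the component lives in */
    int step;   /* bytes between horizontally consecutive pels */
    int depth;  /* bits per component */
};

/* Width handed to the line filter, counted in samples rather than pels:
 * step is in bytes, so divide by the number of bytes per sample. */
static int plane_width_in_samples(int w_pels, const struct comp_desc_sketch *c)
{
    const int bytes_per_sample = (c->depth + 7) / 8;
    return w_pels * c->step / bytes_per_sample;
}
```

With a 1920-wide frame the chroma width is 960 pels; the interleaved case then gives 1920 samples per line, the planar cases 960.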
[FFmpeg-devel] [PATCH 00/15] avfilter/vf_bwdif: Add aarch64 neon functions
Also adds a filter_line3 method which on aarch64 neon yields approx 30%
speedup over 2xfilter_line and a memcpy

John Cox (15):
  avfilter/vf_bwdif: Add outline for aarch neon functions
  avfilter/vf_bwdif: Add common macros and consts for aarch64 neon
  avfilter/vf_bwdif: Export C filter_intra
  avfilter/vf_bwdif: Add neon for filter_intra
  tests/checkasm: Add test for vf_bwdif filter_intra
  avfilter/vf_bwdif: Add clip and spatial macros for aarch64 neon
  avfilter/vf_bwdif: Export C filter_edge
  avfilter/vf_bwdif: Add neon for filter_edge
  tests/checkasm: Add test for vf_bwdif filter_edge
  avfilter/vf_bwdif: Export C filter_line
  avfilter/vf_bwdif: Add neon for filter_line
  avfilter/vf_bwdif: Add a filter_line3 method for optimisation
  avfilter/vf_bwdif: Add neon for filter_line3
  tests/checkasm: Add test for vf_bwdif filter_line3
  avfilter/vf_bwdif: Block filter slices into a multiple of 4 lines

 libavfilter/aarch64/Makefile                |   2 +
 libavfilter/aarch64/vf_bwdif_init_aarch64.c | 125
 libavfilter/aarch64/vf_bwdif_neon.S         | 780
 libavfilter/bwdif.h                         |  20 +
 libavfilter/vf_bwdif.c                      |  70 +-
 tests/checkasm/vf_bwdif.c                   | 172 +
 6 files changed, 1154 insertions(+), 15 deletions(-)
 create mode 100644 libavfilter/aarch64/vf_bwdif_init_aarch64.c
 create mode 100644 libavfilter/aarch64/vf_bwdif_neon.S

-- 
2.39.2
[FFmpeg-devel] [PATCH 01/15] avfilter/vf_bwdif: Add outline for aarch neon functions
Outline but no actual functions. Signed-off-by: John Cox --- libavfilter/aarch64/Makefile| 2 ++ libavfilter/aarch64/vf_bwdif_init_aarch64.c | 39 + libavfilter/aarch64/vf_bwdif_neon.S | 25 + libavfilter/bwdif.h | 1 + libavfilter/vf_bwdif.c | 2 ++ 5 files changed, 69 insertions(+) create mode 100644 libavfilter/aarch64/vf_bwdif_init_aarch64.c create mode 100644 libavfilter/aarch64/vf_bwdif_neon.S diff --git a/libavfilter/aarch64/Makefile b/libavfilter/aarch64/Makefile index b58daa3a3f..b68209bc94 100644 --- a/libavfilter/aarch64/Makefile +++ b/libavfilter/aarch64/Makefile @@ -1,3 +1,5 @@ +OBJS-$(CONFIG_BWDIF_FILTER) += aarch64/vf_bwdif_init_aarch64.o OBJS-$(CONFIG_NLMEANS_FILTER)+= aarch64/vf_nlmeans_init.o +NEON-OBJS-$(CONFIG_BWDIF_FILTER) += aarch64/vf_bwdif_neon.o NEON-OBJS-$(CONFIG_NLMEANS_FILTER) += aarch64/vf_nlmeans_neon.o diff --git a/libavfilter/aarch64/vf_bwdif_init_aarch64.c b/libavfilter/aarch64/vf_bwdif_init_aarch64.c new file mode 100644 index 00..86d53b2ca1 --- /dev/null +++ b/libavfilter/aarch64/vf_bwdif_init_aarch64.c @@ -0,0 +1,39 @@ +/* + * bwdif aarch64 NEON optimisations + * + * Copyright (c) 2023 John Cox + * + * This file is part of FFmpeg. + * + * FFmpeg is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2.1 of the License, or (at your option) any later version. + * + * FFmpeg is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details. 
+ * + * You should have received a copy of the GNU Lesser General Public + * License along with FFmpeg; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + */ + +#include "libavutil/common.h" +#include "libavfilter/bwdif.h" +#include "libavutil/aarch64/cpu.h" + +void +ff_bwdif_init_aarch64(BWDIFContext *s, int bit_depth) +{ +const int cpu_flags = av_get_cpu_flags(); + +if (bit_depth != 8) +return; + +if (!have_neon(cpu_flags)) +return; + +} + diff --git a/libavfilter/aarch64/vf_bwdif_neon.S b/libavfilter/aarch64/vf_bwdif_neon.S new file mode 100644 index 00..639ab22998 --- /dev/null +++ b/libavfilter/aarch64/vf_bwdif_neon.S @@ -0,0 +1,25 @@ +/* + * bwdif aarch64 NEON optimisations + * + * Copyright (c) 2023 John Cox + * + * This file is part of FFmpeg. + * + * FFmpeg is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2.1 of the License, or (at your option) any later version. + * + * FFmpeg is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details. 
+ * + * You should have received a copy of the GNU Lesser General Public + * License along with FFmpeg; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + */ + + +#include "libavutil/aarch64/asm.S" + diff --git a/libavfilter/bwdif.h b/libavfilter/bwdif.h index 5749345f78..6a0f70487a 100644 --- a/libavfilter/bwdif.h +++ b/libavfilter/bwdif.h @@ -39,5 +39,6 @@ typedef struct BWDIFContext { void ff_bwdif_init_filter_line(BWDIFContext *bwdif, int bit_depth); void ff_bwdif_init_x86(BWDIFContext *bwdif, int bit_depth); +void ff_bwdif_init_aarch64(BWDIFContext *bwdif, int bit_depth); #endif /* AVFILTER_BWDIF_H */ diff --git a/libavfilter/vf_bwdif.c b/libavfilter/vf_bwdif.c index e278cf1217..39a51429ac 100644 --- a/libavfilter/vf_bwdif.c +++ b/libavfilter/vf_bwdif.c @@ -369,6 +369,8 @@ av_cold void ff_bwdif_init_filter_line(BWDIFContext *s, int bit_depth) #if ARCH_X86 ff_bwdif_init_x86(s, bit_depth); +#elif ARCH_AARCH64 +ff_bwdif_init_aarch64(s, bit_depth); #endif } -- 2.39.2 ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org https://ffmpeg.org/mailman/listinfo/ffmpeg-devel To unsubscribe, visit link above, or email ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".
[FFmpeg-devel] [PATCH 02/15] avfilter/vf_bwdif: Add common macros and consts for aarch64 neon
Add macros for dual scalar half->single multiply and accumulate
Add macro for shift, saturate and shorten single to byte
Add filter constants

Signed-off-by: John Cox
---
 libavfilter/aarch64/vf_bwdif_neon.S | 46 +
 1 file changed, 46 insertions(+)

diff --git a/libavfilter/aarch64/vf_bwdif_neon.S b/libavfilter/aarch64/vf_bwdif_neon.S
index 639ab22998..a8f0ed525a 100644
--- a/libavfilter/aarch64/vf_bwdif_neon.S
+++ b/libavfilter/aarch64/vf_bwdif_neon.S
@@ -23,3 +23,49 @@
 
 #include "libavutil/aarch64/asm.S"
 
+.macro SQSHRUNN b, s0, s1, s2, s3, n
+        sqshrun         \s0\().4h, \s0\().4s, #\n - 8
+        sqshrun2        \s0\().8h, \s1\().4s, #\n - 8
+        sqshrun         \s1\().4h, \s2\().4s, #\n - 8
+        sqshrun2        \s1\().8h, \s3\().4s, #\n - 8
+        uzp2            \b\().16b, \s0\().16b, \s1\().16b
+.endm
+
+.macro SMULL4K a0, a1, a2, a3, s0, s1, k
+        smull           \a0\().4s, \s0\().4h, \k
+        smull2          \a1\().4s, \s0\().8h, \k
+        smull           \a2\().4s, \s1\().4h, \k
+        smull2          \a3\().4s, \s1\().8h, \k
+.endm
+
+.macro UMULL4K a0, a1, a2, a3, s0, s1, k
+        umull           \a0\().4s, \s0\().4h, \k
+        umull2          \a1\().4s, \s0\().8h, \k
+        umull           \a2\().4s, \s1\().4h, \k
+        umull2          \a3\().4s, \s1\().8h, \k
+.endm
+
+.macro UMLAL4K a0, a1, a2, a3, s0, s1, k
+        umlal           \a0\().4s, \s0\().4h, \k
+        umlal2          \a1\().4s, \s0\().8h, \k
+        umlal           \a2\().4s, \s1\().4h, \k
+        umlal2          \a3\().4s, \s1\().8h, \k
+.endm
+
+.macro UMLSL4K a0, a1, a2, a3, s0, s1, k
+        umlsl           \a0\().4s, \s0\().4h, \k
+        umlsl2          \a1\().4s, \s0\().8h, \k
+        umlsl           \a2\().4s, \s1\().4h, \k
+        umlsl2          \a3\().4s, \s1\().8h, \k
+.endm
+
+// static const uint16_t coef_lf[2] = { 4309, 213 };
+// static const uint16_t coef_hf[3] = { 5570, 3801, 1016 };
+// static const uint16_t coef_sp[2] = { 5077, 981 };
+
+        .align 16
+coeffs:
+        .hword          4309 * 4, 213 * 4               // lf[0]*4 = v0.h[0]
+        .hword          5570, 3801, 1016, -3801         // hf[0] = v0.h[2], -hf[1] = v0.h[5]
+        .hword          5077, 981                       // sp[0] = v0.h[6]
+
-- 
2.39.2
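
The "shift, saturate and shorten single to byte" step can be modelled in scalar C. This is a sketch under the assumption that SQSHRUNN behaves as its body reads: a signed-to-unsigned saturating shift by n - 8 down to 16 bits, followed by keeping the high byte of each halfword. Both function names are hypothetical; the second is the plain "shift by n, clamp to 0..255" operation that the two-step form would be expected to match.

```c
#include <stdint.h>

/* Models sqshrun (32 -> 16 bit, unsigned saturation) with a shift of n - 8,
 * then uzp2 on bytes, which keeps the high byte of each halfword.
 * Relies on >> of a negative int32_t being an arithmetic shift. */
static uint8_t sqshrunn_model(int32_t acc, int n)
{
    int32_t v = acc >> (n - 8);

    if (v < 0)
        v = 0;
    if (v > 65535)
        v = 65535;
    return (uint8_t)(v >> 8);
}

/* Reference: shift by n, then clamp to a byte. */
static uint8_t shift_clamp_u8(int32_t acc, int n)
{
    const int32_t v = acc >> n;

    return (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v);
}
```

Doing the saturation at 16 bits and then discarding the low byte gives the byte clamp for free, which is presumably why the macro shifts by n - 8 rather than n.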
[FFmpeg-devel] [PATCH 03/15] avfilter/vf_bwdif: Export C filter_intra
Needed for tail fixup of neon code

Signed-off-by: John Cox
---
 libavfilter/bwdif.h    | 3 +++
 libavfilter/vf_bwdif.c | 6 +++---
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/libavfilter/bwdif.h b/libavfilter/bwdif.h
index 6a0f70487a..ae6f6ce223 100644
--- a/libavfilter/bwdif.h
+++ b/libavfilter/bwdif.h
@@ -41,4 +41,7 @@ void ff_bwdif_init_filter_line(BWDIFContext *bwdif, int bit_depth);
 void ff_bwdif_init_x86(BWDIFContext *bwdif, int bit_depth);
 void ff_bwdif_init_aarch64(BWDIFContext *bwdif, int bit_depth);
 
+void ff_bwdif_filter_intra_c(void *dst1, void *cur1, int w, int prefs, int mrefs,
+                             int prefs3, int mrefs3, int parity, int clip_max);
+
 #endif /* AVFILTER_BWDIF_H */
diff --git a/libavfilter/vf_bwdif.c b/libavfilter/vf_bwdif.c
index 39a51429ac..035fc58670 100644
--- a/libavfilter/vf_bwdif.c
+++ b/libavfilter/vf_bwdif.c
@@ -122,8 +122,8 @@ typedef struct ThreadData {
         next2++; \
     }
 
-static void filter_intra(void *dst1, void *cur1, int w, int prefs, int mrefs,
-                         int prefs3, int mrefs3, int parity, int clip_max)
+void ff_bwdif_filter_intra_c(void *dst1, void *cur1, int w, int prefs, int mrefs,
+                             int prefs3, int mrefs3, int parity, int clip_max)
 {
     uint8_t *dst = dst1;
     uint8_t *cur = cur1;
@@ -362,7 +362,7 @@ av_cold void ff_bwdif_init_filter_line(BWDIFContext *s, int bit_depth)
         s->filter_line  = filter_line_c_16bit;
         s->filter_edge  = filter_edge_16bit;
     } else {
-        s->filter_intra = filter_intra;
+        s->filter_intra = ff_bwdif_filter_intra_c;
         s->filter_line  = filter_line_c;
         s->filter_edge  = filter_edge;
     }
-- 
2.39.2
[FFmpeg-devel] [PATCH 04/15] avfilter/vf_bwdif: Add neon for filter_intra
Signed-off-by: John Cox
---
 libavfilter/aarch64/vf_bwdif_init_aarch64.c | 17 +++
 libavfilter/aarch64/vf_bwdif_neon.S         | 53 +
 2 files changed, 70 insertions(+)

diff --git a/libavfilter/aarch64/vf_bwdif_init_aarch64.c b/libavfilter/aarch64/vf_bwdif_init_aarch64.c
index 86d53b2ca1..3ffaa07ab3 100644
--- a/libavfilter/aarch64/vf_bwdif_init_aarch64.c
+++ b/libavfilter/aarch64/vf_bwdif_init_aarch64.c
@@ -24,6 +24,22 @@
 #include "libavfilter/bwdif.h"
 #include "libavutil/aarch64/cpu.h"
 
+void ff_bwdif_filter_intra_neon(void *dst1, void *cur1, int w, int prefs, int mrefs,
+                                int prefs3, int mrefs3, int parity, int clip_max);
+
+
+static void filter_intra_helper(void *dst1, void *cur1, int w, int prefs, int mrefs,
+                                int prefs3, int mrefs3, int parity, int clip_max)
+{
+    const int w0 = clip_max != 255 ? 0 : w & ~15;
+
+    ff_bwdif_filter_intra_neon(dst1, cur1, w0, prefs, mrefs, prefs3, mrefs3, parity, clip_max);
+
+    if (w0 < w)
+        ff_bwdif_filter_intra_c((char *)dst1 + w0, (char *)cur1 + w0,
+                                w - w0, prefs, mrefs, prefs3, mrefs3, parity, clip_max);
+}
+
 void
 ff_bwdif_init_aarch64(BWDIFContext *s, int bit_depth)
 {
@@ -35,5 +51,6 @@ ff_bwdif_init_aarch64(BWDIFContext *s, int bit_depth)
     if (!have_neon(cpu_flags))
         return;
 
+    s->filter_intra = filter_intra_helper;
 }
 
diff --git a/libavfilter/aarch64/vf_bwdif_neon.S b/libavfilter/aarch64/vf_bwdif_neon.S
index a8f0ed525a..b863b3447d 100644
--- a/libavfilter/aarch64/vf_bwdif_neon.S
+++ b/libavfilter/aarch64/vf_bwdif_neon.S
@@ -69,3 +69,56 @@ coeffs:
         .hword          5570, 3801, 1016, -3801         // hf[0] = v0.h[2], -hf[1] = v0.h[5]
         .hword          5077, 981                       // sp[0] = v0.h[6]
 
+//
+//
+// void ff_bwdif_filter_intra_neon(
+//      void *dst1,     // x0
+//      void *cur1,     // x1
+//      int w,          // w2
+//      int prefs,      // w3
+//      int mrefs,      // w4
+//      int prefs3,     // w5
+//      int mrefs3,     // w6
+//      int parity,     // w7       unused
+//      int clip_max)   // [sp, #0] unused
+
+function ff_bwdif_filter_intra_neon, export=1
+        cmp             w2, #0
+        ble             99f
+
+        ldr             q0, coeffs
+
+//    for (x = 0; x < w; x++) {
+10:
+
+//        interpol = (coef_sp[0] * (cur[mrefs] + cur[prefs]) - coef_sp[1] * (cur[mrefs3] + cur[prefs3])) >> 13;
+        ldr             q31, [x1, w4, SXTW]
+        ldr             q30, [x1, w3, SXTW]
+        ldr             q29, [x1, w6, SXTW]
+        ldr             q28, [x1, w5, SXTW]
+
+        uaddl           v20.8h, v31.8b, v30.8b
+        uaddl2          v21.8h, v31.16b, v30.16b
+
+        UMULL4K         v2, v3, v4, v5, v20, v21, v0.h[6]
+
+        uaddl           v20.8h, v29.8b, v28.8b
+        uaddl2          v21.8h, v29.16b, v28.16b
+
+        UMLSL4K         v2, v3, v4, v5, v20, v21, v0.h[7]
+
+//        dst[0] = av_clip(interpol, 0, clip_max);
+        SQSHRUNN        v2, v2, v3, v4, v5, 13
+        str             q2, [x0], #16
+
+//        dst++;
+//        cur++;
+//    }
+
+        subs            w2, w2, #16
+        add             x1, x1, #16
+        bgt             10b
+
+99:
+        ret
+endfunc
-- 
2.39.2
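
For reference, the per-pixel arithmetic quoted in the asm comments can be written as scalar C. This is a minimal sketch, assuming 8-bit samples and the coef_sp values listed in the constants patch; intra_pixel_sketch and coef_sp_sketch are hypothetical names and this is not the exported ff_bwdif_filter_intra_c.

```c
#include <stdint.h>

static const int coef_sp_sketch[2] = { 5077, 981 };

/* One output pixel: weighted sum of the lines at +/-1 and +/-3 strides,
 * scaled down by 2^13 and clipped to [0, clip_max]. */
static uint8_t intra_pixel_sketch(const uint8_t *cur, int prefs, int mrefs,
                                  int prefs3, int mrefs3, int clip_max)
{
    int interpol = (coef_sp_sketch[0] * (cur[mrefs] + cur[prefs])
                  - coef_sp_sketch[1] * (cur[mrefs3] + cur[prefs3])) >> 13;

    if (interpol < 0)
        interpol = 0;
    if (interpol > clip_max)
        interpol = clip_max;
    return (uint8_t)interpol;
}
```

Since 5077 - 981 = 4096 and the sum of two equal neighbours doubles the value, a flat area reproduces its input exactly after the shift by 13, which makes for an easy sanity check of any vector version.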
[FFmpeg-devel] [PATCH 05/15] tests/checkasm: Add test for vf_bwdif filter_intra
Signed-off-by: John Cox --- tests/checkasm/vf_bwdif.c | 37 + 1 file changed, 37 insertions(+) diff --git a/tests/checkasm/vf_bwdif.c b/tests/checkasm/vf_bwdif.c index 46224bb575..034bbabb4c 100644 --- a/tests/checkasm/vf_bwdif.c +++ b/tests/checkasm/vf_bwdif.c @@ -20,6 +20,7 @@ #include "checkasm.h" #include "libavcodec/internal.h" #include "libavfilter/bwdif.h" +#include "libavutil/mem_internal.h" #define WIDTH 256 @@ -81,4 +82,40 @@ void checkasm_check_vf_bwdif(void) BODY(uint16_t, 10); report("bwdif10"); } + +if (check_func(ctx_8.filter_intra, "bwdif8.intra")) { +LOCAL_ALIGNED_16(uint8_t, cur0, [11*WIDTH]); +LOCAL_ALIGNED_16(uint8_t, cur1, [11*WIDTH]); +LOCAL_ALIGNED_16(uint8_t, dst0, [WIDTH*3]); +LOCAL_ALIGNED_16(uint8_t, dst1, [WIDTH*3]); +const int stride = WIDTH; +const int mask = (1<<8)-1; + +declare_func(void, void *dst1, void *cur1, int w, int prefs, int mrefs, + int prefs3, int mrefs3, int parity, int clip_max); + +randomize_buffers( cur0, cur1, mask, 11*WIDTH); +memset(dst0, 0xba, WIDTH * 3); +memset(dst1, 0xba, WIDTH * 3); + +call_ref(dst0 + stride, + cur0 + stride * 4, WIDTH, + stride, -stride, stride * 3, -stride * 3, + 0, mask); +call_new(dst1 + stride, + cur0 + stride * 4, WIDTH, + stride, -stride, stride * 3, -stride * 3, + 0, mask); + +if (memcmp(dst0, dst1, WIDTH*3) +|| memcmp( cur0, cur1, WIDTH*11)) +fail(); + +bench_new(dst1 + stride, + cur0 + stride * 4, WIDTH, + stride, -stride, stride * 3, -stride * 3, + 0, mask); + +report("bwdif8.intra"); +} } -- 2.39.2 ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org https://ffmpeg.org/mailman/listinfo/ffmpeg-devel To unsubscribe, visit link above, or email ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".
[FFmpeg-devel] [PATCH 06/15] avfilter/vf_bwdif: Add clip and spatial macros for aarch64 neon
Signed-off-by: John Cox
---
 libavfilter/aarch64/vf_bwdif_neon.S | 59 +
 1 file changed, 59 insertions(+)

diff --git a/libavfilter/aarch64/vf_bwdif_neon.S b/libavfilter/aarch64/vf_bwdif_neon.S
index b863b3447d..6c5d1598f4 100644
--- a/libavfilter/aarch64/vf_bwdif_neon.S
+++ b/libavfilter/aarch64/vf_bwdif_neon.S
@@ -59,6 +59,65 @@
         umlsl2          \a3\().4s, \s1\().8h, \k
 .endm

+// int b = m2s1 - m1;
+// int f = p2s1 - p1;
+// int dc = c0s1 - m1;
+// int de = c0s1 - p1;
+// int sp_max = FFMIN(p1 - c0s1, m1 - c0s1);
+// sp_max = FFMIN(sp_max, FFMAX(-b,-f));
+// int sp_min = FFMIN(c0s1 - p1, c0s1 - m1);
+// sp_min = FFMIN(sp_min, FFMAX(b,f));
+// diff = diff == 0 ? 0 : FFMAX3(diff, sp_min, sp_max);
+.macro SPAT_CHECK diff, m2s1, m1, c0s1, p1, p2s1, t0, t1, t2, t3
+        uqsub           \t0\().16b, \p1\().16b, \c0s1\().16b
+        uqsub           \t2\().16b, \m1\().16b, \c0s1\().16b
+        umin            \t2\().16b, \t0\().16b, \t2\().16b
+
+        uqsub           \t1\().16b, \m1\().16b, \m2s1\().16b
+        uqsub           \t3\().16b, \p1\().16b, \p2s1\().16b
+        umax            \t3\().16b, \t3\().16b, \t1\().16b
+        umin            \t3\().16b, \t3\().16b, \t2\().16b
+
+        uqsub           \t0\().16b, \c0s1\().16b, \p1\().16b
+        uqsub           \t2\().16b, \c0s1\().16b, \m1\().16b
+        umin            \t2\().16b, \t0\().16b, \t2\().16b
+
+        uqsub           \t1\().16b, \m2s1\().16b, \m1\().16b
+        uqsub           \t0\().16b, \p2s1\().16b, \p1\().16b
+        umax            \t0\().16b, \t0\().16b, \t1\().16b
+        umin            \t2\().16b, \t2\().16b, \t0\().16b
+
+        cmeq            \t1\().16b, \diff\().16b, #0
+        umax            \diff\().16b, \diff\().16b, \t3\().16b
+        umax            \diff\().16b, \diff\().16b, \t2\().16b
+        bic             \diff\().16b, \diff\().16b, \t1\().16b
+.endm
+
+// i0 = s0;
+// if (i0 > d0 + diff0)
+// i0 = d0 + diff0;
+// else if (i0 < d0 - diff0)
+// i0 = d0 - diff0;
+//
+// i0 = s0 is safe
+.macro DIFF_CLIP i0, s0, d0, diff, t0, t1
+        uqadd           \t0\().16b, \d0\().16b, \diff\().16b
+        uqsub           \t1\().16b, \d0\().16b, \diff\().16b
+        umin            \i0\().16b, \s0\().16b, \t0\().16b
+        umax            \i0\().16b, \i0\().16b, \t1\().16b
+.endm
+
+// i0 = FFABS(m1 - p1) > td0 ? i1 : i2;
+// DIFF_CLIP
+//
+// i0 = i1 is safe
+.macro INTERPOL i0, i1, i2, m1, d0, p1, td0, diff, t0, t1, t2
+        uabd            \t0\().16b, \m1\().16b, \p1\().16b
+        cmhi            \t0\().16b, \t0\().16b, \td0\().16b
+        bsl             \t0\().16b, \i1\().16b, \i2\().16b
+        DIFF_CLIP       \i0, \t0, \d0, \diff, \t1, \t2
+.endm
+
 // static const uint16_t coef_lf[2] = { 4309, 213 };
 // static const uint16_t coef_hf[3] = { 5570, 3801, 1016 };
 // static const uint16_t coef_sp[2] = { 5077, 981 };
-- 
2.39.2
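For readers who find the macro form hard to follow, the DIFF_CLIP macro in the patch above can be modelled one byte lane at a time. The sketch below is only an illustration, not code from the patch: the helper names sat_add_u8, sat_sub_u8 and diff_clip_u8 are hypothetical, and it assumes 8-bit samples, which is the only case the neon code is used for. Because s0 is itself a byte, saturating the bounds at 0 and 255 should give the same result as the plain integer clamp written in the macro's comment.

```c
#include <stdint.h>

/* Per-lane behaviour of uqadd: unsigned add that saturates at 255. */
static uint8_t sat_add_u8(uint8_t a, uint8_t b)
{
    unsigned sum = (unsigned)a + b;
    return sum > 255 ? 255 : (uint8_t)sum;
}

/* Per-lane behaviour of uqsub: unsigned subtract that saturates at 0. */
static uint8_t sat_sub_u8(uint8_t a, uint8_t b)
{
    return a > b ? (uint8_t)(a - b) : 0;
}

/* Hypothetical one-lane model of DIFF_CLIP:
 * clamp s0 into the range [d0 - diff, d0 + diff]. */
static uint8_t diff_clip_u8(uint8_t s0, uint8_t d0, uint8_t diff)
{
    const uint8_t hi = sat_add_u8(d0, diff);   /* uqadd t0, d0, diff */
    const uint8_t lo = sat_sub_u8(d0, diff);   /* uqsub t1, d0, diff */
    const uint8_t i0 = s0 < hi ? s0 : hi;      /* umin  i0, s0, t0   */
    return i0 > lo ? i0 : lo;                  /* umax  i0, i0, t1   */
}
```

For example, with d0 = 100 and diff = 10, an input of 200 is pulled down to 110, an input of 50 is pulled up to 90, and an input of 105 passes through unchanged.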
[FFmpeg-devel] [PATCH 13/15] avfilter/vf_bwdif: Add neon for filter_line3
Signed-off-by: John Cox
---
 libavfilter/aarch64/vf_bwdif_init_aarch64.c |  28 ++
 libavfilter/aarch64/vf_bwdif_neon.S         | 278 
 2 files changed, 306 insertions(+)

diff --git a/libavfilter/aarch64/vf_bwdif_init_aarch64.c b/libavfilter/aarch64/vf_bwdif_init_aarch64.c
index 21e67884ab..f52bc4b9b4 100644
--- a/libavfilter/aarch64/vf_bwdif_init_aarch64.c
+++ b/libavfilter/aarch64/vf_bwdif_init_aarch64.c
@@ -36,6 +36,33 @@ void ff_bwdif_filter_line_neon(void *dst1, void *prev1, void *cur1, void *next1,
                                int prefs3, int mrefs3, int prefs4, int mrefs4,
                                int parity, int clip_max);

+void ff_bwdif_filter_line3_neon(void * dst1, int d_stride,
+                                const void * prev1, const void * cur1, const void * next1, int s_stride,
+                                int w, int parity, int clip_max);
+
+
+static void filter_line3_helper(void * dst1, int d_stride,
+                                const void * prev1, const void * cur1, const void * next1, int s_stride,
+                                int w, int parity, int clip_max)
+{
+    // Asm works on 16 byte chunks
+    // If w is a multiple of 16 then all is good - if not then if width rounded
+    // up to nearest 16 will fit in both src & dst strides then allow the asm
+    // to write over the padding bytes as that is almost certainly faster than
+    // having to invoke the C version to clean up the tail.
+    const int w1 = FFALIGN(w, 16);
+    const int w0 = clip_max != 255 ? 0 :
+                   d_stride <= w1 && s_stride <= w1 ? w : w & ~15;
+
+    ff_bwdif_filter_line3_neon(dst1, d_stride,
+                               prev1, cur1, next1, s_stride,
+                               w0, parity, clip_max);
+
+    if (w0 < w)
+        ff_bwdif_filter_line3_c((char *)dst1 + w0, d_stride,
+                                (const char *)prev1 + w0, (const char *)cur1 + w0, (const char *)next1 + w0, s_stride,
+                                w - w0, parity, clip_max);
+}

 static void filter_line_helper(void *dst1, void *prev1, void *cur1, void *next1,
                                int w, int prefs, int mrefs, int prefs2, int mrefs2,
@@ -93,5 +120,6 @@ ff_bwdif_init_aarch64(BWDIFContext *s, int bit_depth)
     s->filter_intra = filter_intra_helper;
     s->filter_line = filter_line_helper;
     s->filter_edge = filter_edge_helper;
+    s->filter_line3 = filter_line3_helper;
 }

diff --git a/libavfilter/aarch64/vf_bwdif_neon.S b/libavfilter/aarch64/vf_bwdif_neon.S
index 675e97d966..bcffbe5793 100644
--- a/libavfilter/aarch64/vf_bwdif_neon.S
+++ b/libavfilter/aarch64/vf_bwdif_neon.S
@@ -128,6 +128,284 @@ coeffs:
         .hword          5570, 3801, 1016, -3801 // hf[0] = v0.h[2], -hf[1] = v0.h[5]
         .hword          5077, 981 // sp[0] = v0.h[6]

+// ===
+//
+// void ff_bwdif_filter_line3_neon(
+//      void * dst1,            // x0
+//      int d_stride,           // w1
+//      const void * prev1,     // x2
+//      const void * cur1,      // x3
+//      const void * next1,     // x4
+//      int s_stride,           // w5
+//      int w,                  // w6
+//      int parity,             // w7
+//      int clip_max);          // [sp, #0] (Ignored)
+
+function ff_bwdif_filter_line3_neon, export=1
+        // Sanity check w
+        cmp             w6, #0
+        ble             99f
+
+        // #define prev2 cur
+        //const uint8_t * restrict next2 = parity ? prev : next;
+        cmp             w7, #0
+        csel            x17, x2, x4, ne
+
+        // We want all the V registers - save all the ones we must
+        stp             d14, d15, [sp, #-64]!
+        stp             d8, d9, [sp, #48]
+        stp             d10, d11, [sp, #32]
+        stp             d12, d13, [sp, #16]
+
+        ldr             q0, coeffs
+
+        // Some rearrangement of initial values for nice layout of refs in regs
+        mov             w10, w6                 // w10 = loop count
+        neg             w9, w5                  // w9 = mref
+        lsl             w8, w9, #1              // w8 = mref2
+        add             w7, w9, w9, LSL #1      // w7 = mref3
+        lsl             w6, w9, #2              // w6 = mref4
+        mov             w11, w5                 // w11 = pref
+        lsl             w12, w5, #1             // w12 = pref2
+        add             w13, w5, w5, LSL #1     // w13 = pref3
+        lsl             w14, w5, #2             // w14 = pref4
+        add             w15, w5, w5, LSL #2     // w15 = pref5
+        add             w16, w14, w12           // w16 = pref6
+
+        lsl             w5, w1, #1              // w5 = d_stride * 2
+
+        // for (x = 0; x
[FFmpeg-devel] [PATCH 07/15] avfilter/vf_bwdif: Export C filter_edge
Needed for tail fixup of neon code

Signed-off-by: John Cox
---
 libavfilter/bwdif.h    | 4 
 libavfilter/vf_bwdif.c | 8 
 2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/libavfilter/bwdif.h b/libavfilter/bwdif.h
index ae6f6ce223..ae1616d366 100644
--- a/libavfilter/bwdif.h
+++ b/libavfilter/bwdif.h
@@ -41,6 +41,10 @@ void ff_bwdif_init_filter_line(BWDIFContext *bwdif, int bit_depth);
 void ff_bwdif_init_x86(BWDIFContext *bwdif, int bit_depth);
 void ff_bwdif_init_aarch64(BWDIFContext *bwdif, int bit_depth);

+void ff_bwdif_filter_edge_c(void *dst1, void *prev1, void *cur1, void *next1,
+                            int w, int prefs, int mrefs, int prefs2, int mrefs2,
+                            int parity, int clip_max, int spat);
+
 void ff_bwdif_filter_intra_c(void *dst1, void *cur1, int w, int prefs, int mrefs,
                              int prefs3, int mrefs3, int parity, int clip_max);

diff --git a/libavfilter/vf_bwdif.c b/libavfilter/vf_bwdif.c
index 035fc58670..bec83111b4 100644
--- a/libavfilter/vf_bwdif.c
+++ b/libavfilter/vf_bwdif.c
@@ -150,9 +150,9 @@ static void filter_line_c(void *dst1, void *prev1, void *cur1, void *next1,
     FILTER2()
 }

-static void filter_edge(void *dst1, void *prev1, void *cur1, void *next1,
-                        int w, int prefs, int mrefs, int prefs2, int mrefs2,
-                        int parity, int clip_max, int spat)
+void ff_bwdif_filter_edge_c(void *dst1, void *prev1, void *cur1, void *next1,
+                            int w, int prefs, int mrefs, int prefs2, int mrefs2,
+                            int parity, int clip_max, int spat)
 {
     uint8_t *dst = dst1;
     uint8_t *prev = prev1;
@@ -364,7 +364,7 @@ av_cold void ff_bwdif_init_filter_line(BWDIFContext *s, int bit_depth)
     } else {
         s->filter_intra = ff_bwdif_filter_intra_c;
         s->filter_line = filter_line_c;
-        s->filter_edge = filter_edge;
+        s->filter_edge = ff_bwdif_filter_edge_c;
     }

 #if ARCH_X86
-- 
2.39.2
[FFmpeg-devel] [PATCH 14/15] tests/checkasm: Add test for vf_bwdif filter_line3
Signed-off-by: John Cox --- tests/checkasm/vf_bwdif.c | 81 +++ 1 file changed, 81 insertions(+) diff --git a/tests/checkasm/vf_bwdif.c b/tests/checkasm/vf_bwdif.c index 5fdba09fdc..3399cacdf7 100644 --- a/tests/checkasm/vf_bwdif.c +++ b/tests/checkasm/vf_bwdif.c @@ -28,6 +28,10 @@ for (size_t i = 0; i < count; i++) \ buf0[i] = buf1[i] = rnd() & mask +#define randomize_overflow_check(buf0, buf1, mask, count) \ +for (size_t i = 0; i < count; i++) \ +buf0[i] = buf1[i] = (rnd() & 1) != 0 ? mask : 0; + #define BODY(type, depth) \ do { \ type prev0[9*WIDTH], prev1[9*WIDTH]; \ @@ -83,6 +87,83 @@ void checkasm_check_vf_bwdif(void) report("bwdif10"); } +if (!ctx_8.filter_line3) +ctx_8.filter_line3 = ff_bwdif_filter_line3_c; + +{ +LOCAL_ALIGNED_16(uint8_t, prev0, [11*WIDTH]); +LOCAL_ALIGNED_16(uint8_t, prev1, [11*WIDTH]); +LOCAL_ALIGNED_16(uint8_t, next0, [11*WIDTH]); +LOCAL_ALIGNED_16(uint8_t, next1, [11*WIDTH]); +LOCAL_ALIGNED_16(uint8_t, cur0, [11*WIDTH]); +LOCAL_ALIGNED_16(uint8_t, cur1, [11*WIDTH]); +LOCAL_ALIGNED_16(uint8_t, dst0, [WIDTH*3]); +LOCAL_ALIGNED_16(uint8_t, dst1, [WIDTH*3]); +const int stride = WIDTH; +const int mask = (1<<8)-1; +int parity; + +for (parity = 0; parity != 2; ++parity) { +if (check_func(ctx_8.filter_line3, "bwdif8.line3.rnd.p%d", parity)) { + +declare_func(void, void * dst1, int d_stride, + const void * prev1, const void * cur1, const void * next1, int prefs, + int w, int parity, int clip_max); + +randomize_buffers(prev0, prev1, mask, 11*WIDTH); +randomize_buffers(next0, next1, mask, 11*WIDTH); +randomize_buffers( cur0, cur1, mask, 11*WIDTH); + +call_ref(dst0, stride, + prev0 + stride * 4, cur0 + stride * 4, next0 + stride * 4, stride, + WIDTH, parity, mask); +call_new(dst1, stride, + prev1 + stride * 4, cur1 + stride * 4, next1 + stride * 4, stride, + WIDTH, parity, mask); + +if (memcmp(dst0, dst1, WIDTH*3) +|| memcmp(prev0, prev1, WIDTH*11) +|| memcmp(next0, next1, WIDTH*11) +|| memcmp( cur0, cur1, WIDTH*11)) +fail(); + +bench_new(dst1, 
stride, + prev1 + stride * 4, cur1 + stride * 4, next1 + stride * 4, stride, + WIDTH, parity, mask); +} +} + +// Use just 0s and ~0s to try to provoke bad cropping or overflow +// Parity makes no difference to this test so just test 0 +if (check_func(ctx_8.filter_line3, "bwdif8.line3.overflow")) { + +declare_func(void, void * dst1, int d_stride, + const void * prev1, const void * cur1, const void * next1, int prefs, + int w, int parity, int clip_max); + +randomize_overflow_check(prev0, prev1, mask, 11*WIDTH); +randomize_overflow_check(next0, next1, mask, 11*WIDTH); +randomize_overflow_check( cur0, cur1, mask, 11*WIDTH); + +call_ref(dst0, stride, + prev0 + stride * 4, cur0 + stride * 4, next0 + stride * 4, stride, + WIDTH, 0, mask); +call_new(dst1, stride, + prev1 + stride * 4, cur1 + stride * 4, next1 + stride * 4, stride, + WIDTH, 0, mask); + +if (memcmp(dst0, dst1, WIDTH*3) +|| memcmp(prev0, prev1, WIDTH*11) +|| memcmp(next0, next1, WIDTH*11) +|| memcmp( cur0, cur1, WIDTH*11)) +fail(); + +// No point to benching +} + +report("bwdif8.line3"); +} + { LOCAL_ALIGNED_16(uint8_t, prev0, [11*WIDTH]); LOCAL_ALIGNED_16(uint8_t, prev1, [11*WIDTH]); -- 2.39.2 ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org https://ffmpeg.org/mailman/listinfo/ffmpeg-devel To unsubscribe, visit link above, or email ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".
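The overflow test in the patch above relies on filling the reference and test buffers with only the two extreme values, 0 and the mask. A standalone sketch of that idea follows. It is an illustration under stated assumptions: demo_rnd is a hypothetical xorshift generator standing in for checkasm's rnd(), and fill_overflow_pattern is a made-up name for what the randomize_overflow_check macro does.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for checkasm's rnd(): a plain xorshift32 generator. */
static uint32_t demo_rnd_state = 0x12345678u;

static uint32_t demo_rnd(void)
{
    uint32_t x = demo_rnd_state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    demo_rnd_state = x;
    return x;
}

/* Fill both buffers with the same sequence of extreme values (0 or mask),
 * so that any missing saturation or bad cropping in the filter under test
 * is more likely to show up as a mismatch. */
static void fill_overflow_pattern(uint8_t *buf0, uint8_t *buf1,
                                  uint8_t mask, size_t count)
{
    for (size_t i = 0; i < count; i++)
        buf0[i] = buf1[i] = (demo_rnd() & 1) != 0 ? mask : 0;
}
```

Using only 0 and 255 means every sum and difference inside the filter sits at or beyond the byte range, which is where an unsaturated add or subtract would wrap.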
[FFmpeg-devel] [PATCH 08/15] avfilter/vf_bwdif: Add neon for filter_edge
Signed-off-by: John Cox
---
 libavfilter/aarch64/vf_bwdif_init_aarch64.c |  20 
 libavfilter/aarch64/vf_bwdif_neon.S         | 104 
 2 files changed, 124 insertions(+)

diff --git a/libavfilter/aarch64/vf_bwdif_init_aarch64.c b/libavfilter/aarch64/vf_bwdif_init_aarch64.c
index 3ffaa07ab3..e75cf2f204 100644
--- a/libavfilter/aarch64/vf_bwdif_init_aarch64.c
+++ b/libavfilter/aarch64/vf_bwdif_init_aarch64.c
@@ -24,10 +24,29 @@
 #include "libavfilter/bwdif.h"
 #include "libavutil/aarch64/cpu.h"

+void ff_bwdif_filter_edge_neon(void *dst1, void *prev1, void *cur1, void *next1,
+                               int w, int prefs, int mrefs, int prefs2, int mrefs2,
+                               int parity, int clip_max, int spat);
+
 void ff_bwdif_filter_intra_neon(void *dst1, void *cur1, int w, int prefs, int mrefs,
                                 int prefs3, int mrefs3, int parity, int clip_max);

+static void filter_edge_helper(void *dst1, void *prev1, void *cur1, void *next1,
+                               int w, int prefs, int mrefs, int prefs2, int mrefs2,
+                               int parity, int clip_max, int spat)
+{
+    const int w0 = clip_max != 255 ? 0 : w & ~15;
+
+    ff_bwdif_filter_edge_neon(dst1, prev1, cur1, next1, w0, prefs, mrefs, prefs2, mrefs2,
+                              parity, clip_max, spat);
+
+    if (w0 < w)
+        ff_bwdif_filter_edge_c((char *)dst1 + w0, (char *)prev1 + w0, (char *)cur1 + w0, (char *)next1 + w0,
+                               w - w0, prefs, mrefs, prefs2, mrefs2,
+                               parity, clip_max, spat);
+}
+
 static void filter_intra_helper(void *dst1, void *cur1, int w, int prefs, int mrefs,
                                 int prefs3, int mrefs3, int parity, int clip_max)
 {
@@ -52,5 +71,6 @@ ff_bwdif_init_aarch64(BWDIFContext *s, int bit_depth)
         return;

     s->filter_intra = filter_intra_helper;
+    s->filter_edge = filter_edge_helper;
 }

diff --git a/libavfilter/aarch64/vf_bwdif_neon.S b/libavfilter/aarch64/vf_bwdif_neon.S
index 6c5d1598f4..a33b235882 100644
--- a/libavfilter/aarch64/vf_bwdif_neon.S
+++ b/libavfilter/aarch64/vf_bwdif_neon.S
@@ -128,6 +128,110 @@ coeffs:
         .hword          5570, 3801, 1016, -3801 // hf[0] = v0.h[2], -hf[1] = v0.h[5]
         .hword          5077, 981 // sp[0] = v0.h[6]

+//
+//
+// void ff_bwdif_filter_edge_neon(
+//      void *dst1,     // x0
+//      void *prev1,    // x1
+//      void *cur1,     // x2
+//      void *next1,    // x3
+//      int w,          // w4
+//      int prefs,      // w5
+//      int mrefs,      // w6
+//      int prefs2,     // w7
+//      int mrefs2,     // [sp, #0]
+//      int parity,     // [sp, #8]
+//      int clip_max,   // [sp, #16] unused
+//      int spat);      // [sp, #24]
+
+function ff_bwdif_filter_edge_neon, export=1
+        // Sanity check w
+        cmp             w4, #0
+        ble             99f
+
+        // #define prev2 cur
+        // const uint8_t * restrict next2 = parity ? prev : next;
+
+        ldr             w8, [sp, #0] // mrefs2
+
+        ldr             w17, [sp, #8] // parity
+        ldr             w16, [sp, #24] // spat
+        cmp             w17, #0
+        csel            x17, x1, x3, ne
+
+        // for (x = 0; x < w; x++) {
+
+10:
+        //int m1 = cur[mrefs];
+        //int d = (prev2[0] + next2[0]) >> 1;
+        //int p1 = cur[prefs];
+        //int temporal_diff0 = FFABS(prev2[0] - next2[0]);
+        //int temporal_diff1 =(FFABS(prev[mrefs] - m1) + FFABS(prev[prefs] - p1)) >> 1;
+        //int temporal_diff2 =(FFABS(next[mrefs] - m1) + FFABS(next[prefs] - p1)) >> 1;
+        //int diff = FFMAX3(temporal_diff0 >> 1, temporal_diff1, temporal_diff2);
+        ldr             q31, [x2]
+        ldr             q21, [x17]
+        uhadd           v16.16b, v31.16b, v21.16b // d0 = v16
+        uabd            v17.16b, v31.16b, v21.16b // td0 = v17
+        ldr             q24, [x2, w6, SXTW] // m1 = v24
+        ldr             q22, [x2, w5, SXTW] // p1 = v22
+
+        ldr             q0, [x1, w6, SXTW] // prev[mrefs]
+        ldr             q2, [x1, w5, SXTW] // prev[prefs]
+        ldr             q1, [x3, w6, SXTW] // next[mrefs]
+        ldr             q3, [x3, w5, SXTW] // next[prefs]
+
+        ushr            v29.16b, v17.16b, #1
+
+        uabd            v31.16b, v0.16b, v24.16b
+        uabd            v30.16b, v2.16b, v22.16b
+        uhadd           v0.16b, v31.16b, v30.16b // td1 = q0
+
+        uabd            v31.16b, v1.16b, v24.16b
+        uabd            v30.16b, v3.16b, v22.16b
+        uhadd           v1.16b, v31.16b, v30.16b
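The helper in this patch splits each row into a head that the neon routine handles in 16-byte chunks and a tail that falls back to the C routine. The sketch below restates that arithmetic on its own; it is an illustration only, and the WidthSplit type and split_width function are hypothetical names that do not appear in the patch. When clip_max is not 255 (anything other than 8-bit samples) the head width becomes 0, so the neon function returns straight away through its "Sanity check w" branch and the C routine does the whole row.

```c
/* Hypothetical helper mirroring the w0 computation in filter_edge_helper. */
typedef struct WidthSplit {
    int simd_w;  /* bytes handed to the neon routine: a multiple of 16, or 0 */
    int tail_w;  /* bytes left for the C routine, starting at offset simd_w  */
} WidthSplit;

static WidthSplit split_width(int w, int clip_max)
{
    WidthSplit s;

    /* Only 8-bit content (clip_max == 255) takes the neon path;
     * w & ~15 rounds the width down to a multiple of 16. */
    s.simd_w = clip_max != 255 ? 0 : w & ~15;
    s.tail_w = w - s.simd_w;
    return s;
}
```

For a 100-byte row of 8-bit samples this gives a 96-byte neon head and a 4-byte C tail; a 64-byte row needs no tail call at all.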
[FFmpeg-devel] [PATCH 15/15] avfilter/vf_bwdif: Block filter slices into a multiple of 4 lines
Round job start lines down to a multiple of 4. This means that if
filter_line3 exists then filter_line will not sometimes be called
once at the end of a slice depending on thread count. The final slice
may do up to 3 extra lines but filter_edge is faster than filter_line
so it is unlikely to create any noticeable thread load variation.

Signed-off-by: John Cox
---
 libavfilter/vf_bwdif.c | 13 ++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/libavfilter/vf_bwdif.c b/libavfilter/vf_bwdif.c
index 52bc676cf8..6701208efe 100644
--- a/libavfilter/vf_bwdif.c
+++ b/libavfilter/vf_bwdif.c
@@ -237,6 +237,13 @@ static void filter_edge_16bit(void *dst1, void *prev1, void *cur1, void *next1,
     FILTER2()
 }

+// Round job start line down to multiple of 4 so that if filter_line3 exists
+// and the frame is a multiple of 4 high then filter_line will never be called
+static inline int job_start(const int jobnr, const int nb_jobs, const int h)
+{
+    return jobnr >= nb_jobs ? h : ((h * jobnr) / nb_jobs) & ~3;
+}
+
 static int filter_slice(AVFilterContext *ctx, void *arg, int jobnr, int nb_jobs)
 {
     BWDIFContext *s = ctx->priv;
@@ -246,8 +253,8 @@ static int filter_slice(AVFilterContext *ctx, void *arg, int jobnr, int nb_jobs)
     int clip_max = (1 << (yadif->csp->comp[td->plane].depth)) - 1;
     int df = (yadif->csp->comp[td->plane].depth + 7) / 8;
     int refs = linesize / df;
-    int slice_start = (td->h * jobnr ) / nb_jobs;
-    int slice_end = (td->h * (jobnr+1)) / nb_jobs;
+    int slice_start = job_start(jobnr, nb_jobs, td->h);
+    int slice_end = job_start(jobnr + 1, nb_jobs, td->h);
     int y;

     for (y = slice_start; y < slice_end; y++) {
@@ -310,7 +317,7 @@ static void filter(AVFilterContext *ctx, AVFrame *dstpic,
         td.plane = i;

         ff_filter_execute(ctx, filter_slice, &td, NULL,
-                          FFMIN(h, ff_filter_get_nb_threads(ctx)));
+                          FFMIN((h+3)/4, ff_filter_get_nb_threads(ctx)));
     }
     if (yadif->current_field == YADIF_FIELD_END) {
         yadif->current_field = YADIF_FIELD_NORMAL;
-- 
2.39.2
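To see what the rounding in this patch does to slice boundaries, job_start can be evaluated on its own. The block below copies the function body from the diff and drops the const qualifiers; the frame height of 1080 and the job count of 7 used in the comments are arbitrary values picked for illustration, not figures from the thread.

```c
/* Same arithmetic as job_start() in the patch: the start row of job jobnr,
 * rounded down to a multiple of 4, with the end of the last job pinned to h. */
static int job_start(int jobnr, int nb_jobs, int h)
{
    return jobnr >= nb_jobs ? h : ((h * jobnr) / nb_jobs) & ~3;
}

/*
 * With h = 1080 and nb_jobs = 7 the unrounded boundaries would be
 * 0, 154, 308, 462, ... and the rounded ones are:
 *
 *   job_start(0, 7, 1080) == 0
 *   job_start(1, 7, 1080) == 152    (154 rounded down)
 *   job_start(2, 7, 1080) == 308    (already a multiple of 4)
 *   job_start(3, 7, 1080) == 460    (462 rounded down)
 *   job_start(7, 7, 1080) == 1080   (end of the final slice)
 */
```

Each job runs from job_start(jobnr, ...) up to job_start(jobnr + 1, ...), so every slice except possibly the last starts and ends on a multiple of 4, and the last slice absorbs whatever rows the rounding left over.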
[FFmpeg-devel] [PATCH 09/15] tests/checkasm: Add test for vf_bwdif filter_edge
Signed-off-by: John Cox --- tests/checkasm/vf_bwdif.c | 54 +++ 1 file changed, 54 insertions(+) diff --git a/tests/checkasm/vf_bwdif.c b/tests/checkasm/vf_bwdif.c index 034bbabb4c..5fdba09fdc 100644 --- a/tests/checkasm/vf_bwdif.c +++ b/tests/checkasm/vf_bwdif.c @@ -83,6 +83,60 @@ void checkasm_check_vf_bwdif(void) report("bwdif10"); } +{ +LOCAL_ALIGNED_16(uint8_t, prev0, [11*WIDTH]); +LOCAL_ALIGNED_16(uint8_t, prev1, [11*WIDTH]); +LOCAL_ALIGNED_16(uint8_t, next0, [11*WIDTH]); +LOCAL_ALIGNED_16(uint8_t, next1, [11*WIDTH]); +LOCAL_ALIGNED_16(uint8_t, cur0, [11*WIDTH]); +LOCAL_ALIGNED_16(uint8_t, cur1, [11*WIDTH]); +LOCAL_ALIGNED_16(uint8_t, dst0, [WIDTH*3]); +LOCAL_ALIGNED_16(uint8_t, dst1, [WIDTH*3]); +const int stride = WIDTH; +const int mask = (1<<8)-1; +int spat; +int parity; + +for (spat = 0; spat != 2; ++spat) { +for (parity = 0; parity != 2; ++parity) { +if (check_func(ctx_8.filter_edge, "bwdif8.edge.s%d.p%d", spat, parity)) { + +declare_func(void, void *dst1, void *prev1, void *cur1, void *next1, +int w, int prefs, int mrefs, int prefs2, int mrefs2, +int parity, int clip_max, int spat); + +randomize_buffers(prev0, prev1, mask, 11*WIDTH); +randomize_buffers(next0, next1, mask, 11*WIDTH); +randomize_buffers( cur0, cur1, mask, 11*WIDTH); +memset(dst0, 0xba, WIDTH * 3); +memset(dst1, 0xba, WIDTH * 3); + +call_ref(dst0 + stride, + prev0 + stride * 4, cur0 + stride * 4, next0 + stride * 4, WIDTH, + stride, -stride, stride * 2, -stride * 2, + parity, mask, spat); +call_new(dst1 + stride, + prev1 + stride * 4, cur1 + stride * 4, next1 + stride * 4, WIDTH, + stride, -stride, stride * 2, -stride * 2, + parity, mask, spat); + +if (memcmp(dst0, dst1, WIDTH*3) +|| memcmp(prev0, prev1, WIDTH*11) +|| memcmp(next0, next1, WIDTH*11) +|| memcmp( cur0, cur1, WIDTH*11)) +fail(); + +bench_new(dst1 + stride, + prev1 + stride * 4, cur1 + stride * 4, next1 + stride * 4, WIDTH, + stride, -stride, stride * 2, -stride * 2, + parity, mask, spat); +} +} +} + +report("bwdif8.edge"); +} + 
if (check_func(ctx_8.filter_intra, "bwdif8.intra")) { LOCAL_ALIGNED_16(uint8_t, cur0, [11*WIDTH]); LOCAL_ALIGNED_16(uint8_t, cur1, [11*WIDTH]); -- 2.39.2 ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org https://ffmpeg.org/mailman/listinfo/ffmpeg-devel To unsubscribe, visit link above, or email ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".
[FFmpeg-devel] [PATCH 10/15] avfilter/vf_bwdif: Export C filter_line
Needed for tail fixup of neon code

Signed-off-by: John Cox
---
 libavfilter/bwdif.h    |  5 +
 libavfilter/vf_bwdif.c | 10 +-
 2 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/libavfilter/bwdif.h b/libavfilter/bwdif.h
index ae1616d366..cce99953f3 100644
--- a/libavfilter/bwdif.h
+++ b/libavfilter/bwdif.h
@@ -48,4 +48,9 @@ void ff_bwdif_filter_edge_c(void *dst1, void *prev1, void *cur1, void *next1,
 void ff_bwdif_filter_intra_c(void *dst1, void *cur1, int w, int prefs, int mrefs,
                              int prefs3, int mrefs3, int parity, int clip_max);

+void ff_bwdif_filter_line_c(void *dst1, void *prev1, void *cur1, void *next1,
+                            int w, int prefs, int mrefs, int prefs2, int mrefs2,
+                            int prefs3, int mrefs3, int prefs4, int mrefs4,
+                            int parity, int clip_max);
+
 #endif /* AVFILTER_BWDIF_H */
diff --git a/libavfilter/vf_bwdif.c b/libavfilter/vf_bwdif.c
index bec83111b4..26349da1fd 100644
--- a/libavfilter/vf_bwdif.c
+++ b/libavfilter/vf_bwdif.c
@@ -132,10 +132,10 @@ void ff_bwdif_filter_intra_c(void *dst1, void *cur1, int w, int prefs, int mrefs
     FILTER_INTRA()
 }

-static void filter_line_c(void *dst1, void *prev1, void *cur1, void *next1,
-                          int w, int prefs, int mrefs, int prefs2, int mrefs2,
-                          int prefs3, int mrefs3, int prefs4, int mrefs4,
-                          int parity, int clip_max)
+void ff_bwdif_filter_line_c(void *dst1, void *prev1, void *cur1, void *next1,
+                            int w, int prefs, int mrefs, int prefs2, int mrefs2,
+                            int prefs3, int mrefs3, int prefs4, int mrefs4,
+                            int parity, int clip_max)
 {
     uint8_t *dst = dst1;
     uint8_t *prev = prev1;
@@ -363,7 +363,7 @@ av_cold void ff_bwdif_init_filter_line(BWDIFContext *s, int bit_depth)
         s->filter_edge = filter_edge_16bit;
     } else {
         s->filter_intra = ff_bwdif_filter_intra_c;
-        s->filter_line = filter_line_c;
+        s->filter_line = ff_bwdif_filter_line_c;
         s->filter_edge = ff_bwdif_filter_edge_c;
     }

-- 
2.39.2
[FFmpeg-devel] [PATCH 11/15] avfilter/vf_bwdif: Add neon for filter_line
Signed-off-by: John Cox
---
 libavfilter/aarch64/vf_bwdif_init_aarch64.c |  21 ++
 libavfilter/aarch64/vf_bwdif_neon.S         | 215 
 2 files changed, 236 insertions(+)

diff --git a/libavfilter/aarch64/vf_bwdif_init_aarch64.c b/libavfilter/aarch64/vf_bwdif_init_aarch64.c
index e75cf2f204..21e67884ab 100644
--- a/libavfilter/aarch64/vf_bwdif_init_aarch64.c
+++ b/libavfilter/aarch64/vf_bwdif_init_aarch64.c
@@ -31,6 +31,26 @@ void ff_bwdif_filter_edge_neon(void *dst1, void *prev1, void *cur1, void *next1,
 void ff_bwdif_filter_intra_neon(void *dst1, void *cur1, int w, int prefs, int mrefs,
                                 int prefs3, int mrefs3, int parity, int clip_max);

+void ff_bwdif_filter_line_neon(void *dst1, void *prev1, void *cur1, void *next1,
+                               int w, int prefs, int mrefs, int prefs2, int mrefs2,
+                               int prefs3, int mrefs3, int prefs4, int mrefs4,
+                               int parity, int clip_max);
+
+
+static void filter_line_helper(void *dst1, void *prev1, void *cur1, void *next1,
+                               int w, int prefs, int mrefs, int prefs2, int mrefs2,
+                               int prefs3, int mrefs3, int prefs4, int mrefs4,
+                               int parity, int clip_max)
+{
+    const int w0 = clip_max != 255 ? 0 : w & ~15;
+
+    ff_bwdif_filter_line_neon(dst1, prev1, cur1, next1,
+                              w0, prefs, mrefs, prefs2, mrefs2, prefs3, mrefs3, prefs4, mrefs4, parity, clip_max);
+
+    if (w0 < w)
+        ff_bwdif_filter_line_c((char *)dst1 + w0, (char *)prev1 + w0, (char *)cur1 + w0, (char *)next1 + w0,
+                               w - w0, prefs, mrefs, prefs2, mrefs2, prefs3, mrefs3, prefs4, mrefs4, parity, clip_max);
+}

 static void filter_edge_helper(void *dst1, void *prev1, void *cur1, void *next1,
                                int w, int prefs, int mrefs, int prefs2, int mrefs2,
@@ -71,6 +91,7 @@ ff_bwdif_init_aarch64(BWDIFContext *s, int bit_depth)
         return;

     s->filter_intra = filter_intra_helper;
+    s->filter_line = filter_line_helper;
     s->filter_edge = filter_edge_helper;
 }

diff --git a/libavfilter/aarch64/vf_bwdif_neon.S b/libavfilter/aarch64/vf_bwdif_neon.S
index a33b235882..675e97d966 100644
--- a/libavfilter/aarch64/vf_bwdif_neon.S
+++ b/libavfilter/aarch64/vf_bwdif_neon.S
@@ -128,6 +128,221 @@ coeffs:
         .hword          5570, 3801, 1016, -3801 // hf[0] = v0.h[2], -hf[1] = v0.h[5]
         .hword          5077, 981 // sp[0] = v0.h[6]

+// ===
+//
+// void filter_line(
+//      void *dst1,     // x0
+//      void *prev1,    // x1
+//      void *cur1,     // x2
+//      void *next1,    // x3
+//      int w,          // w4
+//      int prefs,      // w5
+//      int mrefs,      // w6
+//      int prefs2,     // w7
+//      int mrefs2,     // [sp, #0]
+//      int prefs3,     // [sp, #8]
+//      int mrefs3,     // [sp, #16]
+//      int prefs4,     // [sp, #24]
+//      int mrefs4,     // [sp, #32]
+//      int parity,     // [sp, #40]
+//      int clip_max)   // [sp, #48]
+
+function ff_bwdif_filter_line_neon, export=1
+        // Sanity check w
+        cmp             w4, #0
+        ble             99f
+
+        // Rearrange regs to be the same as line3 for ease of debug!
+        mov             w10, w4         // w10 = loop count
+        mov             w9, w6          // w9 = mref
+        mov             w12, w7         // w12 = pref2
+        mov             w11, w5         // w11 = pref
+        ldr             w8, [sp, #0]    // w8 = mref2
+        ldr             w7, [sp, #16]   // w7 = mref3
+        ldr             w6, [sp, #32]   // w6 = mref4
+        ldr             w13, [sp, #8]   // w13 = pref3
+        ldr             w14, [sp, #24]  // w14 = pref4
+
+        mov             x4, x3
+        mov             x3, x2
+        mov             x2, x1
+
+        // #define prev2 cur
+        //const uint8_t * restrict next2 = parity ? prev : next;
+        ldr             w17, [sp, #40]  // parity
+        cmp             w17, #0
+        csel            x17, x2, x4, ne
+
+        // We want all the V registers - save all the ones we must
+        stp             d14, d15, [sp, #-64]!
+        stp             d8, d9, [sp, #48]
+        stp             d10, d11, [sp, #32]
+        stp             d12, d13, [sp, #16]
+
+        ldr             q0, coeffs
+
+        // for (x = 0; x < w; x++) {
+        // int diff0, diff2;
+        // int d0, d2;
+        // int temporal_diff0, temporal_diff2;
+        //
+        // int i1, i2;
+        // int j1, j2;
+        // int p6, p5, p4, p3, p2, p1,
[FFmpeg-devel] [PATCH 12/15] avfilter/vf_bwdif: Add a filter_line3 method for optimisation
Add an optional filter_line3 to the available optimisations.

filter_line3 is equivalent to filter_line, memcpy, filter_line.

filter_line shares quite a number of loads and some calculations in
common with its next iteration, and testing shows that with aarch64
neon filter_line3's performance is 30% better than two filter_lines
and a memcpy.

Signed-off-by: John Cox
---
 libavfilter/bwdif.h    |  7 +++
 libavfilter/vf_bwdif.c | 31 +++
 2 files changed, 38 insertions(+)

diff --git a/libavfilter/bwdif.h b/libavfilter/bwdif.h
index cce99953f3..496cec72ef 100644
--- a/libavfilter/bwdif.h
+++ b/libavfilter/bwdif.h
@@ -35,6 +35,9 @@ typedef struct BWDIFContext {
     void (*filter_edge)(void *dst, void *prev, void *cur, void *next,
                         int w, int prefs, int mrefs, int prefs2, int mrefs2,
                         int parity, int clip_max, int spat);
+    void (*filter_line3)(void *dst, int dstride,
+                         const void *prev, const void *cur, const void *next, int prefs,
+                         int w, int parity, int clip_max);
 } BWDIFContext;

 void ff_bwdif_init_filter_line(BWDIFContext *bwdif, int bit_depth);
@@ -53,4 +56,8 @@ void ff_bwdif_filter_line_c(void *dst1, void *prev1, void *cur1, void *next1,
                             int prefs3, int mrefs3, int prefs4, int mrefs4,
                             int parity, int clip_max);

+void ff_bwdif_filter_line3_c(void * dst1, int d_stride,
+                             const void * prev1, const void * cur1, const void * next1, int s_stride,
+                             int w, int parity, int clip_max);
+
 #endif /* AVFILTER_BWDIF_H */
diff --git a/libavfilter/vf_bwdif.c b/libavfilter/vf_bwdif.c
index 26349da1fd..52bc676cf8 100644
--- a/libavfilter/vf_bwdif.c
+++ b/libavfilter/vf_bwdif.c
@@ -150,6 +150,31 @@ void ff_bwdif_filter_line_c(void *dst1, void *prev1, void *cur1, void *next1,
     FILTER2()
 }

+#define NEXT_LINE()\
+    dst += d_stride; \
+    prev += prefs; \
+    cur += prefs; \
+    next += prefs;
+
+void ff_bwdif_filter_line3_c(void * dst1, int d_stride,
+                             const void * prev1, const void * cur1, const void * next1, int s_stride,
+                             int w, int parity, int clip_max)
+{
+    const int prefs = s_stride;
+    uint8_t * dst = dst1;
+    const uint8_t * prev = prev1;
+    const uint8_t * cur = cur1;
+    const uint8_t * next = next1;
+
+    ff_bwdif_filter_line_c(dst, (void*)prev, (void*)cur, (void*)next, w,
+                           prefs, -prefs, prefs * 2, - prefs * 2, prefs * 3, -prefs * 3, prefs * 4, -prefs * 4, parity, clip_max);
+    NEXT_LINE();
+    memcpy(dst, cur, w);
+    NEXT_LINE();
+    ff_bwdif_filter_line_c(dst, (void*)prev, (void*)cur, (void*)next, w,
+                           prefs, -prefs, prefs * 2, - prefs * 2, prefs * 3, -prefs * 3, prefs * 4, -prefs * 4, parity, clip_max);
+}
+
 void ff_bwdif_filter_edge_c(void *dst1, void *prev1, void *cur1, void *next1,
                             int w, int prefs, int mrefs, int prefs2, int mrefs2,
                             int parity, int clip_max, int spat)
@@ -244,6 +269,11 @@ static int filter_slice(AVFilterContext *ctx, void *arg, int jobnr, int nb_jobs)
                                refs << 1, -(refs << 1),
                                td->parity ^ td->tff, clip_max,
                                (y < 2) || ((y + 3) > td->h) ? 0 : 1);
+            } else if (s->filter_line3 && y + 2 < slice_end && y + 6 < td->h) {
+                s->filter_line3(dst, td->frame->linesize[td->plane],
+                                prev, cur, next, linesize, td->w,
+                                td->parity ^ td->tff, clip_max);
+                y += 2;
             } else {
                 s->filter_line(dst, prev, cur, next, td->w,
                                refs, -refs, refs << 1, -(refs << 1),
@@ -357,6 +387,7 @@ static int config_props(AVFilterLink *link)

 av_cold void ff_bwdif_init_filter_line(BWDIFContext *s, int bit_depth)
 {
+    s->filter_line3 = 0;
     if (bit_depth > 8) {
         s->filter_intra = filter_intra_16bit;
         s->filter_line = filter_line_c_16bit;
-- 
2.39.2
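The effect of the new branch in filter_slice is easier to see by counting which routine handles each row. The following is a rough model only, and several parts of it are assumptions because the diff shows the loop only partially: it assumes that rows where (y ^ parity) & 1 is zero are plain copies, and that rows with y < 4 or y + 5 > h go to filter_edge. Only the filter_line3 condition and the y += 2 step are taken from the patch. RowCounts and count_rows are hypothetical names.

```c
/* Hypothetical tally of which routine handles the rows of one slice. */
typedef struct RowCounts {
    int copy;   /* rows copied unchanged (assumed behaviour)     */
    int edge;   /* rows handled by filter_edge (assumed test)    */
    int line;   /* rows handled by filter_line                   */
    int line3;  /* filter_line3 calls, each covering three rows  */
} RowCounts;

static RowCounts count_rows(int slice_start, int slice_end, int h,
                            int parity, int have_line3)
{
    RowCounts c = { 0, 0, 0, 0 };

    for (int y = slice_start; y < slice_end; y++) {
        if (!((y ^ parity) & 1)) {
            c.copy++;                 /* assumption: other-field rows are copied */
        } else if (y < 4 || y + 5 > h) {
            c.edge++;                 /* assumption: edge rows near top/bottom   */
        } else if (have_line3 && y + 2 < slice_end && y + 6 < h) {
            c.line3++;                /* condition as written in the patch       */
            y += 2;                   /* line3 also did the next two rows        */
        } else {
            c.line++;
        }
    }
    return c;
}
```

Under these assumptions a single 16-row slice with parity 0 needs four filter_line calls without filter_line3 and none with it: rows 5 to 7 and 9 to 11 are each covered by one filter_line3 call, and the loop's own y++ then lands on a copy row, giving a repeating four-row pattern.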
Re: [FFmpeg-devel] [PATCH 00/15] avfilter/vf_bwdif: Add aarch64 neon functions
Hi

>On Thu, 29 Jun 2023, John Cox wrote:
>
>> Also adds a filter_line3 method which on aarch64 neon yields approx 30%
>> speedup over 2xfilter_line and a memcpy
>>
>> John Cox (15):
>>   avfilter/vf_bwdif: Add outline for aarch neon functions
>>   avfilter/vf_bwdif: Add common macros and consts for aarch64 neon
>>   avfilter/vf_bwdif: Export C filter_intra
>>   avfilter/vf_bwdif: Add neon for filter_intra
>>   tests/checkasm: Add test for vf_bwdif filter_intra
>>   avfilter/vf_bwdif: Add clip and spatial macros for aarch64 neon
>>   avfilter/vf_bwdif: Export C filter_edge
>>   avfilter/vf_bwdif: Add neon for filter_edge
>>   tests/checkasm: Add test for vf_bwdif filter_edge
>>   avfilter/vf_bwdif: Export C filter_line
>>   avfilter/vf_bwdif: Add neon for filter_line
>>   avfilter/vf_bwdif: Add a filter_line3 method for optimisation
>>   avfilter/vf_bwdif: Add neon for filter_line3
>>   tests/checkasm: Add test for vf_bwdif filter_line3
>>   avfilter/vf_bwdif: Block filter slices into a multiple of 4 lines
>
>It's nice to have this split up in small easily checkable patches, but
>this is perhaps a bit more finegrained than what's usual. But I guess
>that's ok...

I normally find that people ask me to split patches so I thought I'd cut
stuff down to the minimum plausible unit. I'm more than happy to
coalesce stuff if wanted.

JC

>I'll comment on the patches that need commenting on.
>
>// Martin
Re: [FFmpeg-devel] [PATCH 02/15] avfilter/vf_bwdif: Add common macros and consts for aarch64 neon
On Sun, 2 Jul 2023 00:35:14 +0300 (EEST), you wrote:

>On Thu, 29 Jun 2023, John Cox wrote:
>
>> Add macros for dual scalar half->single multiply and accumulate
>> Add macro for shift, saturate and shorten single to byte
>> Add filter constants
>>
>> Signed-off-by: John Cox
>> ---
>>  libavfilter/aarch64/vf_bwdif_neon.S | 46 +
>>  1 file changed, 46 insertions(+)
>>
>> diff --git a/libavfilter/aarch64/vf_bwdif_neon.S b/libavfilter/aarch64/vf_bwdif_neon.S
>> index 639ab22998..a8f0ed525a 100644
>> --- a/libavfilter/aarch64/vf_bwdif_neon.S
>> +++ b/libavfilter/aarch64/vf_bwdif_neon.S
>> @@ -23,3 +23,49 @@
>>
>>  #include "libavutil/aarch64/asm.S"
>>
>> +.macro SQSHRUNN b, s0, s1, s2, s3, n
>> +        sqshrun         \s0\().4h, \s0\().4s, #\n - 8
>> +        sqshrun2        \s0\().8h, \s1\().4s, #\n - 8
>> +        sqshrun         \s1\().4h, \s2\().4s, #\n - 8
>> +        sqshrun2        \s1\().8h, \s3\().4s, #\n - 8
>> +        uzp2            \b\().16b, \s0\().16b, \s1\().16b
>> +.endm
>> +
>> +.macro SMULL4K a0, a1, a2, a3, s0, s1, k
>> +        smull           \a0\().4s, \s0\().4h, \k
>> +        smull2          \a1\().4s, \s0\().8h, \k
>> +        smull           \a2\().4s, \s1\().4h, \k
>> +        smull2          \a3\().4s, \s1\().8h, \k
>> +.endm
>> +
>> +.macro UMULL4K a0, a1, a2, a3, s0, s1, k
>> +        umull           \a0\().4s, \s0\().4h, \k
>> +        umull2          \a1\().4s, \s0\().8h, \k
>> +        umull           \a2\().4s, \s1\().4h, \k
>> +        umull2          \a3\().4s, \s1\().8h, \k
>> +.endm
>> +
>> +.macro UMLAL4K a0, a1, a2, a3, s0, s1, k
>> +        umlal           \a0\().4s, \s0\().4h, \k
>> +        umlal2          \a1\().4s, \s0\().8h, \k
>> +        umlal           \a2\().4s, \s1\().4h, \k
>> +        umlal2          \a3\().4s, \s1\().8h, \k
>> +.endm
>> +
>> +.macro UMLSL4K a0, a1, a2, a3, s0, s1, k
>> +        umlsl           \a0\().4s, \s0\().4h, \k
>> +        umlsl2          \a1\().4s, \s0\().8h, \k
>> +        umlsl           \a2\().4s, \s1\().4h, \k
>> +        umlsl2          \a3\().4s, \s1\().8h, \k
>> +.endm
>> +
>> +// static const uint16_t coef_lf[2] = { 4309, 213 };
>> +// static const uint16_t coef_hf[3] = { 5570, 3801, 1016 };
>> +// static const uint16_t coef_sp[2] = { 5077, 981 };
>> +
>> +.align 16
>
>Note that .align for arm is power of two; this triggers a 2^16 byte
>alignment here, which certainly isn't intended.

Yikes! I'll swap that for a .balign now I've looked that up

>But in general, the usual way of defining constants is with a
>const/endconst block, which places them in the right rdata section instead
>of in the text section. But that then requires you to use a movrel macro
>for accessing the data, instead of a plain ldr instruction.

Yeah - arm has a history of mixing text & const - I went with the
simpler code. Is this a deal breaker or can I leave it as is?

JC

>> +coeffs:
>> +        .hword          4309 * 4, 213 * 4 // lf[0]*4 = v0.h[0]
>> +        .hword          5570, 3801, 1016, -3801 // hf[0] = v0.h[2], -hf[1] = v0.h[5]
>> +        .hword          5077, 981 // sp[0] = v0.h[6]
>> +
>> --
>
>
>// Martin
Re: [FFmpeg-devel] [PATCH 04/15] avfilter/vf_bwdif: Add neon for filter_intra
On Sun, 2 Jul 2023 00:37:35 +0300 (EEST), you wrote: >On Thu, 29 Jun 2023, John Cox wrote: > >> Signed-off-by: John Cox >> --- >> libavfilter/aarch64/vf_bwdif_init_aarch64.c | 17 +++ >> libavfilter/aarch64/vf_bwdif_neon.S | 53 + >> 2 files changed, 70 insertions(+) >> >> diff --git a/libavfilter/aarch64/vf_bwdif_init_aarch64.c >> b/libavfilter/aarch64/vf_bwdif_init_aarch64.c >> index 86d53b2ca1..3ffaa07ab3 100644 >> --- a/libavfilter/aarch64/vf_bwdif_init_aarch64.c >> +++ b/libavfilter/aarch64/vf_bwdif_init_aarch64.c >> @@ -24,6 +24,22 @@ >> #include "libavfilter/bwdif.h" >> #include "libavutil/aarch64/cpu.h" >> >> +void ff_bwdif_filter_intra_neon(void *dst1, void *cur1, int w, int prefs, >> int mrefs, >> +int prefs3, int mrefs3, int parity, int >> clip_max); >> + >> + >> +static void filter_intra_helper(void *dst1, void *cur1, int w, int prefs, >> int mrefs, >> +int prefs3, int mrefs3, int parity, int >> clip_max) >> +{ >> +const int w0 = clip_max != 255 ? 0 : w & ~15; >> + >> +ff_bwdif_filter_intra_neon(dst1, cur1, w0, prefs, mrefs, prefs3, >> mrefs3, parity, clip_max); >> + >> +if (w0 < w) >> +ff_bwdif_filter_intra_c((char *)dst1 + w0, (char *)cur1 + w0, >> +w - w0, prefs, mrefs, prefs3, mrefs3, >> parity, clip_max); >> +} >> + >> void >> ff_bwdif_init_aarch64(BWDIFContext *s, int bit_depth) >> { >> @@ -35,5 +51,6 @@ ff_bwdif_init_aarch64(BWDIFContext *s, int bit_depth) >> if (!have_neon(cpu_flags)) >> return; >> >> +s->filter_intra = filter_intra_helper; >> } >> >> diff --git a/libavfilter/aarch64/vf_bwdif_neon.S >> b/libavfilter/aarch64/vf_bwdif_neon.S >> index a8f0ed525a..b863b3447d 100644 >> --- a/libavfilter/aarch64/vf_bwdif_neon.S >> +++ b/libavfilter/aarch64/vf_bwdif_neon.S >> @@ -69,3 +69,56 @@ coeffs: >> .hword 5570, 3801, 1016, -3801 // hf[0] = v0.h[2], >> -hf[1] = v0.h[5] >> .hword 5077, 981 // sp[0] = v0.h[6] >> >> +// >> >> +// >> +// void ff_bwdif_filter_intra_neon( >> +// void *dst1, // x0 >> +// void *cur1, // x1 >> +// int w, // w2 >> +// int 
prefs, // w3 >> +// int mrefs, // w4 >> +// int prefs3, // w5 >> +// int mrefs3, // w6 >> +// int parity, // w7 unused >> +// int clip_max) // [sp, #0] unused > >This bit is great to have > >> + >> +function ff_bwdif_filter_intra_neon, export=1 >> +cmp w2, #0 >> +ble 99f >> + >> +ldr q0, coeffs >> + >> +//for (x = 0; x < w; x++) { >> +10: >> + >> +//interpol = (coef_sp[0] * (cur[mrefs] + cur[prefs]) - coef_sp[1] * >> (cur[mrefs3] + cur[prefs3])) >> 13; > >I guess the style with intermixed C code is a bit uncommon in our >assembly, but as long as it doesn't affect the overall code style I guess >we can keep it. I needed it to track where I was whilst writing the code & if I ever need to change it I'll be lost without it - so I, at least, rate it as valuable and in line3 where I am very tight on registers it was invaluable for keeping track of what referred to what. >> +ldr q31, [x1, w4, SXTW] >> +ldr q30, [x1, w3, SXTW] >> +ldr q29, [x1, w6, SXTW] >> +ldr q28, [x1, w5, SXTW] > >Don't use shouty uppercase SXTW here Will change. >> + >> +uaddl v20.8h, v31.8b, v30.8b >> +uaddl2 v21.8h, v31.16b, v30.16b >> + >> +UMULL4K v2, v3, v4, v5, v20, v21, v0.h[6] >> + >> +uaddl v20.8h, v29.8b, v28.8b >> +uaddl2 v21.8h, v29.16b, v28.16b >> + >> +UMLSL4K v2, v3, v4, v5, v20, v21, v0.h[7] >> + >> +//dst[0] = av_clip(interpol, 0, clip_max); >> +SQSHRUNNv2, v2, v3, v4, v5, 13 >> +str q2, [x0], #16 >> + >> +//dst++; >> +//cur++; >> +//} >> + >> +subsw2, w2, #16 >> +add x1, x1, #
Re: [FFmpeg-devel] [PATCH 08/15] avfilter/vf_bwdif: Add neon for filter_edge
On Sun, 2 Jul 2023 00:40:09 +0300 (EEST), you wrote: >On Thu, 29 Jun 2023, John Cox wrote: > >> Signed-off-by: John Cox >> --- >> libavfilter/aarch64/vf_bwdif_init_aarch64.c | 20 >> libavfilter/aarch64/vf_bwdif_neon.S | 104 >> 2 files changed, 124 insertions(+) >> >> diff --git a/libavfilter/aarch64/vf_bwdif_init_aarch64.c >> b/libavfilter/aarch64/vf_bwdif_init_aarch64.c >> index 3ffaa07ab3..e75cf2f204 100644 >> --- a/libavfilter/aarch64/vf_bwdif_init_aarch64.c >> +++ b/libavfilter/aarch64/vf_bwdif_init_aarch64.c >> @@ -24,10 +24,29 @@ >> #include "libavfilter/bwdif.h" >> #include "libavutil/aarch64/cpu.h" >> >> +void ff_bwdif_filter_edge_neon(void *dst1, void *prev1, void *cur1, void >> *next1, >> + int w, int prefs, int mrefs, int prefs2, int >> mrefs2, >> + int parity, int clip_max, int spat); >> + >> void ff_bwdif_filter_intra_neon(void *dst1, void *cur1, int w, int prefs, >> int mrefs, >> int prefs3, int mrefs3, int parity, int >> clip_max); >> >> >> +static void filter_edge_helper(void *dst1, void *prev1, void *cur1, void >> *next1, >> + int w, int prefs, int mrefs, int prefs2, int >> mrefs2, >> + int parity, int clip_max, int spat) >> +{ >> +const int w0 = clip_max != 255 ? 
0 : w & ~15; >> + >> +ff_bwdif_filter_edge_neon(dst1, prev1, cur1, next1, w0, prefs, mrefs, >> prefs2, mrefs2, >> + parity, clip_max, spat); >> + >> +if (w0 < w) >> +ff_bwdif_filter_edge_c((char *)dst1 + w0, (char *)prev1 + w0, (char >> *)cur1 + w0, (char *)next1 + w0, >> + w - w0, prefs, mrefs, prefs2, mrefs2, >> + parity, clip_max, spat); >> +} >> + >> static void filter_intra_helper(void *dst1, void *cur1, int w, int prefs, >> int mrefs, >> int prefs3, int mrefs3, int parity, int >> clip_max) >> { >> @@ -52,5 +71,6 @@ ff_bwdif_init_aarch64(BWDIFContext *s, int bit_depth) >> return; >> >> s->filter_intra = filter_intra_helper; >> +s->filter_edge = filter_edge_helper; >> } >> >> diff --git a/libavfilter/aarch64/vf_bwdif_neon.S >> b/libavfilter/aarch64/vf_bwdif_neon.S >> index 6c5d1598f4..a33b235882 100644 >> --- a/libavfilter/aarch64/vf_bwdif_neon.S >> +++ b/libavfilter/aarch64/vf_bwdif_neon.S >> @@ -128,6 +128,110 @@ coeffs: >> .hword 5570, 3801, 1016, -3801 // hf[0] = v0.h[2], >> -hf[1] = v0.h[5] >> .hword 5077, 981 // sp[0] = v0.h[6] >> >> +// >> >> +// >> +// void ff_bwdif_filter_edge_neon( >> +// void *dst1, // x0 >> +// void *prev1,// x1 >> +// void *cur1, // x2 >> +// void *next1,// x3 >> +// int w, // w4 >> +// int prefs, // w5 >> +// int mrefs, // w6 >> +// int prefs2, // w7 >> +// int mrefs2, // [sp, #0] >> +// int parity, // [sp, #8] >> +// int clip_max, // [sp, #16] unused >> +// int spat); // [sp, #24] > >This doesn't hold for macOS targets (and the checkasm tests fail on that >platform). > >On macOS, arguments that aren't passed in registers but on the stack, are >tightly packed. So since parity is 32 bit and mrefs2 also was 32 bit, >parity is available at [sp, #4]. > >Therefore, it's usually simplest for portability reasons, to pass any >arguments after the first 8, as intptr_t or ptrdiff_t, as that makes them >consistent across platforms. Not my interface - this is already existing code. What do you suggest I do? 
I'm happy either to change the interface or fix my stack offsets if there is any clue that lets me detect this ABI. As a personal preference I'd choose the latter. I don't have easy access to a Mac. Is there any easy way of getting this tested before resubmission? Thanks JC >// Martin
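For anyone following the stack-offset discussion, here is a rough C sketch of the difference being described. It only models the case in this thread, where every stack-passed argument is a 32-bit int; the helper name is made up for illustration and is not part of FFmpeg.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, for illustration only.
 * Returns the byte offset from sp (at function entry) of the idx-th
 * stack-passed argument, assuming all stack-passed arguments are int. */
static inline size_t stack_int_offset(size_t idx, int apple_abi)
{
    /* Standard AAPCS64: each stack argument occupies an 8-byte slot.
     * Apple arm64: stack arguments are packed at their natural size and
     * alignment, so a 32-bit int takes 4 bytes. */
    const size_t slot = apple_abi ? 4 : 8;

    assert(apple_abi == 0 || apple_abi == 1);
    return idx * slot;
}
```

For filter_edge the stack arguments are mrefs2, parity, clip_max and spat (indices 0 to 3), so parity sits at [sp, #8] on Linux but [sp, #4] on macOS, which matches the review comment.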
Re: [FFmpeg-devel] [PATCH 11/15] avfilter/vf_bwdif: Add neon for filter_line
On Sun, 2 Jul 2023 00:44:10 +0300 (EEST), you wrote: >On Thu, 29 Jun 2023, John Cox wrote: > >> Signed-off-by: John Cox >> --- >> libavfilter/aarch64/vf_bwdif_init_aarch64.c | 21 ++ >> libavfilter/aarch64/vf_bwdif_neon.S | 215 >> 2 files changed, 236 insertions(+) >> >> diff --git a/libavfilter/aarch64/vf_bwdif_init_aarch64.c >> b/libavfilter/aarch64/vf_bwdif_init_aarch64.c >> index e75cf2f204..21e67884ab 100644 >> --- a/libavfilter/aarch64/vf_bwdif_init_aarch64.c >> +++ b/libavfilter/aarch64/vf_bwdif_init_aarch64.c >> @@ -31,6 +31,26 @@ void ff_bwdif_filter_edge_neon(void *dst1, void *prev1, >> void *cur1, void *next1, >> void ff_bwdif_filter_intra_neon(void *dst1, void *cur1, int w, int prefs, >> int mrefs, >> int prefs3, int mrefs3, int parity, int >> clip_max); >> >> +void ff_bwdif_filter_line_neon(void *dst1, void *prev1, void *cur1, void >> *next1, >> + int w, int prefs, int mrefs, int prefs2, int >> mrefs2, >> + int prefs3, int mrefs3, int prefs4, int >> mrefs4, >> + int parity, int clip_max); >> + >> + >> +static void filter_line_helper(void *dst1, void *prev1, void *cur1, void >> *next1, >> + int w, int prefs, int mrefs, int prefs2, int >> mrefs2, >> + int prefs3, int mrefs3, int prefs4, int >> mrefs4, >> + int parity, int clip_max) >> +{ >> +const int w0 = clip_max != 255 ? 
0 : w & ~15; >> + >> +ff_bwdif_filter_line_neon(dst1, prev1, cur1, next1, >> + w0, prefs, mrefs, prefs2, mrefs2, prefs3, >> mrefs3, prefs4, mrefs4, parity, clip_max); >> + >> +if (w0 < w) >> +ff_bwdif_filter_line_c((char *)dst1 + w0, (char *)prev1 + w0, (char >> *)cur1 + w0, (char *)next1 + w0, >> + w - w0, prefs, mrefs, prefs2, mrefs2, >> prefs3, mrefs3, prefs4, mrefs4, parity, clip_max); >> +} >> >> static void filter_edge_helper(void *dst1, void *prev1, void *cur1, void >> *next1, >>int w, int prefs, int mrefs, int prefs2, int >> mrefs2, >> @@ -71,6 +91,7 @@ ff_bwdif_init_aarch64(BWDIFContext *s, int bit_depth) >> return; >> >> s->filter_intra = filter_intra_helper; >> +s->filter_line = filter_line_helper; >> s->filter_edge = filter_edge_helper; >> } >> >> diff --git a/libavfilter/aarch64/vf_bwdif_neon.S >> b/libavfilter/aarch64/vf_bwdif_neon.S >> index a33b235882..675e97d966 100644 >> --- a/libavfilter/aarch64/vf_bwdif_neon.S >> +++ b/libavfilter/aarch64/vf_bwdif_neon.S >> @@ -128,6 +128,221 @@ coeffs: >> .hword 5570, 3801, 1016, -3801 // hf[0] = v0.h[2], >> -hf[1] = v0.h[5] >> .hword 5077, 981 // sp[0] = v0.h[6] >> >> +// >> === >> +// >> +// void filter_line( >> +// void *dst1, // x0 >> +// void *prev1,// x1 >> +// void *cur1, // x2 >> +// void *next1,// x3 >> +// int w, // w4 >> +// int prefs, // w5 >> +// int mrefs, // w6 >> +// int prefs2, // w7 >> +// int mrefs2, // [sp, #0] >> +// int prefs3, // [sp, #8] >> +// int mrefs3, // [sp, #16] >> +// int prefs4, // [sp, #24] >> +// int mrefs4, // [sp, #32] >> +// int parity, // [sp, #40] >> +// int clip_max) // [sp, #48] >> + >> +function ff_bwdif_filter_line_neon, export=1 >> +// Sanity check w >> +cmp w4, #0 >> +ble 99f >> + >> +// Rearrange regs to be the same as line3 for ease of debug! 
>> +mov w10, w4 // w10 = loop count >> +mov w9, w6 // w9 = mref >> +mov w12, w7 // w12 = pref2 >> +mov w11, w5 // w11 = pref >> +ldr w8, [sp, #0] // w8 = mref2 >> +ldr w7, [sp, #16] // w7 = mref3 >> +ldr w6, [sp, #32] // w6 = mref4 >> +ld
[FFmpeg-devel] [PATCH v2 00/15] avfilter/vf_bwdif: Add aarch64 neon functions
Also adds a filter_line3 method which on aarch64 neon yields approx 30%
speedup over 2xfilter_line and a memcpy

Differences from v1:
.align 16 corrected to .balign 16
SXTW tolower
Mac ABI (hopefully) fixed
V register pop/push macroed & prettified

John Cox (15):
  avfilter/vf_bwdif: Add outline for aarch neon functions
  avfilter/vf_bwdif: Add common macros and consts for aarch64 neon
  avfilter/vf_bwdif: Export C filter_intra
  avfilter/vf_bwdif: Add neon for filter_intra
  tests/checkasm: Add test for vf_bwdif filter_intra
  avfilter/vf_bwdif: Add clip and spatial macros for aarch64 neon
  avfilter/vf_bwdif: Export C filter_edge
  avfilter/vf_bwdif: Add neon for filter_edge
  tests/checkasm: Add test for vf_bwdif filter_edge
  avfilter/vf_bwdif: Export C filter_line
  avfilter/vf_bwdif: Add neon for filter_line
  avfilter/vf_bwdif: Add a filter_line3 method for optimisation
  avfilter/vf_bwdif: Add neon for filter_line3
  tests/checkasm: Add test for vf_bwdif filter_line3
  avfilter/vf_bwdif: Block filter slices into a multiple of 4 lines

 libavfilter/aarch64/Makefile                |   2 +
 libavfilter/aarch64/vf_bwdif_init_aarch64.c | 125
 libavfilter/aarch64/vf_bwdif_neon.S         | 788
 libavfilter/bwdif.h                         |  20 +
 libavfilter/vf_bwdif.c                      |  70 +-
 tests/checkasm/vf_bwdif.c                   | 172 +
 6 files changed, 1162 insertions(+), 15 deletions(-)
 create mode 100644 libavfilter/aarch64/vf_bwdif_init_aarch64.c
 create mode 100644 libavfilter/aarch64/vf_bwdif_neon.S

-- 
2.39.2
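As a side note on the ".align 16 corrected to .balign 16" item: the sketch below illustrates, in C, why the two directives differ so much on arm targets, going by the review comment that .align there takes a power of two. The function names are hypothetical and exist only for this note.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helpers, not part of FFmpeg or binutils. */

/* ".align n" on arm/aarch64 (as described in the review): the argument
 * is an exponent, so the resulting alignment is 2^n bytes. */
static inline size_t align_directive_bytes(unsigned n)
{
    assert(n < 8 * sizeof(size_t));
    return (size_t)1 << n;
}

/* ".balign n": the argument is the alignment in bytes. */
static inline size_t balign_directive_bytes(size_t n)
{
    return n;
}

/* Padding needed to move offset up to the next multiple of align,
 * where align is a power of two. */
static inline size_t padding_for(size_t offset, size_t align)
{
    return (align - (offset & (align - 1))) & (align - 1);
}
```

So ".align 16" asks for 65536-byte alignment, while ".balign 16" asks for the 16 bytes that a q-register load of the coefficient table would want.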
[FFmpeg-devel] [PATCH v2 01/15] avfilter/vf_bwdif: Add outline for aarch neon functions
Outline but no actual functions. Signed-off-by: John Cox --- libavfilter/aarch64/Makefile| 2 ++ libavfilter/aarch64/vf_bwdif_init_aarch64.c | 39 + libavfilter/aarch64/vf_bwdif_neon.S | 25 + libavfilter/bwdif.h | 1 + libavfilter/vf_bwdif.c | 2 ++ 5 files changed, 69 insertions(+) create mode 100644 libavfilter/aarch64/vf_bwdif_init_aarch64.c create mode 100644 libavfilter/aarch64/vf_bwdif_neon.S diff --git a/libavfilter/aarch64/Makefile b/libavfilter/aarch64/Makefile index b58daa3a3f..b68209bc94 100644 --- a/libavfilter/aarch64/Makefile +++ b/libavfilter/aarch64/Makefile @@ -1,3 +1,5 @@ +OBJS-$(CONFIG_BWDIF_FILTER) += aarch64/vf_bwdif_init_aarch64.o OBJS-$(CONFIG_NLMEANS_FILTER)+= aarch64/vf_nlmeans_init.o +NEON-OBJS-$(CONFIG_BWDIF_FILTER) += aarch64/vf_bwdif_neon.o NEON-OBJS-$(CONFIG_NLMEANS_FILTER) += aarch64/vf_nlmeans_neon.o diff --git a/libavfilter/aarch64/vf_bwdif_init_aarch64.c b/libavfilter/aarch64/vf_bwdif_init_aarch64.c new file mode 100644 index 00..86d53b2ca1 --- /dev/null +++ b/libavfilter/aarch64/vf_bwdif_init_aarch64.c @@ -0,0 +1,39 @@ +/* + * bwdif aarch64 NEON optimisations + * + * Copyright (c) 2023 John Cox + * + * This file is part of FFmpeg. + * + * FFmpeg is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2.1 of the License, or (at your option) any later version. + * + * FFmpeg is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details. 
+ * + * You should have received a copy of the GNU Lesser General Public + * License along with FFmpeg; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + */ + +#include "libavutil/common.h" +#include "libavfilter/bwdif.h" +#include "libavutil/aarch64/cpu.h" + +void +ff_bwdif_init_aarch64(BWDIFContext *s, int bit_depth) +{ +const int cpu_flags = av_get_cpu_flags(); + +if (bit_depth != 8) +return; + +if (!have_neon(cpu_flags)) +return; + +} + diff --git a/libavfilter/aarch64/vf_bwdif_neon.S b/libavfilter/aarch64/vf_bwdif_neon.S new file mode 100644 index 00..639ab22998 --- /dev/null +++ b/libavfilter/aarch64/vf_bwdif_neon.S @@ -0,0 +1,25 @@ +/* + * bwdif aarch64 NEON optimisations + * + * Copyright (c) 2023 John Cox + * + * This file is part of FFmpeg. + * + * FFmpeg is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2.1 of the License, or (at your option) any later version. + * + * FFmpeg is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details. 
+ * + * You should have received a copy of the GNU Lesser General Public + * License along with FFmpeg; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + */ + + +#include "libavutil/aarch64/asm.S" + diff --git a/libavfilter/bwdif.h b/libavfilter/bwdif.h index 5749345f78..6a0f70487a 100644 --- a/libavfilter/bwdif.h +++ b/libavfilter/bwdif.h @@ -39,5 +39,6 @@ typedef struct BWDIFContext { void ff_bwdif_init_filter_line(BWDIFContext *bwdif, int bit_depth); void ff_bwdif_init_x86(BWDIFContext *bwdif, int bit_depth); +void ff_bwdif_init_aarch64(BWDIFContext *bwdif, int bit_depth); #endif /* AVFILTER_BWDIF_H */ diff --git a/libavfilter/vf_bwdif.c b/libavfilter/vf_bwdif.c index e278cf1217..39a51429ac 100644 --- a/libavfilter/vf_bwdif.c +++ b/libavfilter/vf_bwdif.c @@ -369,6 +369,8 @@ av_cold void ff_bwdif_init_filter_line(BWDIFContext *s, int bit_depth) #if ARCH_X86 ff_bwdif_init_x86(s, bit_depth); +#elif ARCH_AARCH64 +ff_bwdif_init_aarch64(s, bit_depth); #endif } -- 2.39.2 ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org https://ffmpeg.org/mailman/listinfo/ffmpeg-devel To unsubscribe, visit link above, or email ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".
[FFmpeg-devel] [PATCH v2 02/15] avfilter/vf_bwdif: Add common macros and consts for aarch64 neon
Add macros for dual scalar half->single multiply and accumulate
Add macro for shift, saturate and shorten single to byte
Add filter constants

Signed-off-by: John Cox
---
 libavfilter/aarch64/vf_bwdif_neon.S | 53 +
 1 file changed, 53 insertions(+)

diff --git a/libavfilter/aarch64/vf_bwdif_neon.S b/libavfilter/aarch64/vf_bwdif_neon.S
index 639ab22998..c2f5eb1f73 100644
--- a/libavfilter/aarch64/vf_bwdif_neon.S
+++ b/libavfilter/aarch64/vf_bwdif_neon.S
@@ -23,3 +23,56 @@
 
 #include "libavutil/aarch64/asm.S"
 
+// Space taken on the stack by an int (32-bit)
+#ifdef __APPLE__
+.set    SP_INT, 4
+#else
+.set    SP_INT, 8
+#endif
+
+.macro SQSHRUNN b, s0, s1, s2, s3, n
+        sqshrun         \s0\().4h, \s0\().4s, #\n - 8
+        sqshrun2        \s0\().8h, \s1\().4s, #\n - 8
+        sqshrun         \s1\().4h, \s2\().4s, #\n - 8
+        sqshrun2        \s1\().8h, \s3\().4s, #\n - 8
+        uzp2            \b\().16b, \s0\().16b, \s1\().16b
+.endm
+
+.macro SMULL4K a0, a1, a2, a3, s0, s1, k
+        smull           \a0\().4s, \s0\().4h, \k
+        smull2          \a1\().4s, \s0\().8h, \k
+        smull           \a2\().4s, \s1\().4h, \k
+        smull2          \a3\().4s, \s1\().8h, \k
+.endm
+
+.macro UMULL4K a0, a1, a2, a3, s0, s1, k
+        umull           \a0\().4s, \s0\().4h, \k
+        umull2          \a1\().4s, \s0\().8h, \k
+        umull           \a2\().4s, \s1\().4h, \k
+        umull2          \a3\().4s, \s1\().8h, \k
+.endm
+
+.macro UMLAL4K a0, a1, a2, a3, s0, s1, k
+        umlal           \a0\().4s, \s0\().4h, \k
+        umlal2          \a1\().4s, \s0\().8h, \k
+        umlal           \a2\().4s, \s1\().4h, \k
+        umlal2          \a3\().4s, \s1\().8h, \k
+.endm
+
+.macro UMLSL4K a0, a1, a2, a3, s0, s1, k
+        umlsl           \a0\().4s, \s0\().4h, \k
+        umlsl2          \a1\().4s, \s0\().8h, \k
+        umlsl           \a2\().4s, \s1\().4h, \k
+        umlsl2          \a3\().4s, \s1\().8h, \k
+.endm
+
+// static const uint16_t coef_lf[2] = { 4309, 213 };
+// static const uint16_t coef_hf[3] = { 5570, 3801, 1016 };
+// static const uint16_t coef_sp[2] = { 5077, 981 };
+
+.balign 16
+coeffs:
+        .hword          4309 * 4, 213 * 4               // lf[0]*4 = v0.h[0]
+        .hword          5570, 3801, 1016, -3801         // hf[0] = v0.h[2], -hf[1] = v0.h[5]
+        .hword          5077, 981                       // sp[0] = v0.h[6]
+
-- 
2.39.2
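To make the "shift, saturate and shorten single to byte" macro easier to follow, here is a scalar C model of what one lane of SQSHRUNN appears to compute, next to the av_clip-style expression from the C filter. This is a sketch under the assumption that right shifts of negative values are arithmetic (true for the usual compilers); both function names are invented for this note.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of a single lane of SQSHRUNN (illustrative only).
 * Step 1 (sqshrun): shift right by n - 8, saturate to unsigned 16 bits.
 * Step 2 (uzp2):    keep the odd bytes, i.e. the high byte of each
 *                   16-bit lane on little-endian lane layout. */
static inline uint8_t sqshrunn_lane(int32_t acc, int n)
{
    int32_t v;

    assert(n >= 9 && n <= 24);      /* sqshrun 32->16 takes shifts of 1..16 */
    v = acc >> (n - 8);             /* arithmetic shift assumed */
    if (v < 0)
        v = 0;
    if (v > 0xffff)
        v = 0xffff;
    return (uint8_t)(v >> 8);
}

/* Reference: clip(acc >> n, 0, 255), as the 8-bit C code does. */
static inline uint8_t clip_shift_ref(int32_t acc, int n)
{
    int32_t v = acc >> n;           /* arithmetic shift assumed */
    return (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v);
}
```

The two agree because (acc >> (n - 8)) reaches 65536 exactly when (acc >> n) reaches 256, so saturating at 16 bits and dropping the low byte gives the same result as shifting by n and clipping to 255.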
[FFmpeg-devel] [PATCH v2 03/15] avfilter/vf_bwdif: Export C filter_intra
Needed for tail fixup of neon code

Signed-off-by: John Cox
---
 libavfilter/bwdif.h    | 3 +++
 libavfilter/vf_bwdif.c | 6 +++---
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/libavfilter/bwdif.h b/libavfilter/bwdif.h
index 6a0f70487a..ae6f6ce223 100644
--- a/libavfilter/bwdif.h
+++ b/libavfilter/bwdif.h
@@ -41,4 +41,7 @@ void ff_bwdif_init_filter_line(BWDIFContext *bwdif, int bit_depth);
 void ff_bwdif_init_x86(BWDIFContext *bwdif, int bit_depth);
 void ff_bwdif_init_aarch64(BWDIFContext *bwdif, int bit_depth);
 
+void ff_bwdif_filter_intra_c(void *dst1, void *cur1, int w, int prefs, int mrefs,
+                             int prefs3, int mrefs3, int parity, int clip_max);
+
 #endif /* AVFILTER_BWDIF_H */
diff --git a/libavfilter/vf_bwdif.c b/libavfilter/vf_bwdif.c
index 39a51429ac..035fc58670 100644
--- a/libavfilter/vf_bwdif.c
+++ b/libavfilter/vf_bwdif.c
@@ -122,8 +122,8 @@ typedef struct ThreadData {
         next2++; \
     }
 
-static void filter_intra(void *dst1, void *cur1, int w, int prefs, int mrefs,
-                         int prefs3, int mrefs3, int parity, int clip_max)
+void ff_bwdif_filter_intra_c(void *dst1, void *cur1, int w, int prefs, int mrefs,
+                             int prefs3, int mrefs3, int parity, int clip_max)
 {
     uint8_t *dst = dst1;
     uint8_t *cur = cur1;
@@ -362,7 +362,7 @@ av_cold void ff_bwdif_init_filter_line(BWDIFContext *s, int bit_depth)
         s->filter_line = filter_line_c_16bit;
         s->filter_edge = filter_edge_16bit;
     } else {
-        s->filter_intra = filter_intra;
+        s->filter_intra = ff_bwdif_filter_intra_c;
         s->filter_line = filter_line_c;
         s->filter_edge = filter_edge;
     }
-- 
2.39.2
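For context on "tail fixup": the aarch64 helpers in this series hand the leading multiple-of-16 pixels to the neon routine and the remainder to the exported C function. A small C sketch of just that split is below; the two function names are hypothetical, and only the expression for the neon width is taken from the series.

```c
#include <assert.h>

/* Illustrative only: number of leading pixels given to the neon routine.
 * Mirrors "clip_max != 255 ? 0 : w & ~15" from the aarch64 helpers. */
static inline int neon_width(int w, int clip_max)
{
    assert(w >= 0);
    /* The neon code handles 8-bit samples only, 16 pixels per iteration. */
    return clip_max != 255 ? 0 : w & ~15;
}

/* Pixels left over for the C function (the tail). */
static inline int tail_width(int w, int clip_max)
{
    return w - neon_width(w, clip_max);
}
```

With w = 100 and clip_max = 255 the neon routine would take 96 pixels and the C function the last 4; with any other clip_max the whole line goes to C.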
[FFmpeg-devel] [PATCH v2 04/15] avfilter/vf_bwdif: Add neon for filter_intra
Signed-off-by: John Cox
---
 libavfilter/aarch64/vf_bwdif_init_aarch64.c | 17 +++
 libavfilter/aarch64/vf_bwdif_neon.S         | 53 +
 2 files changed, 70 insertions(+)

diff --git a/libavfilter/aarch64/vf_bwdif_init_aarch64.c b/libavfilter/aarch64/vf_bwdif_init_aarch64.c
index 86d53b2ca1..3ffaa07ab3 100644
--- a/libavfilter/aarch64/vf_bwdif_init_aarch64.c
+++ b/libavfilter/aarch64/vf_bwdif_init_aarch64.c
@@ -24,6 +24,22 @@
 #include "libavfilter/bwdif.h"
 #include "libavutil/aarch64/cpu.h"
 
+void ff_bwdif_filter_intra_neon(void *dst1, void *cur1, int w, int prefs, int mrefs,
+                                int prefs3, int mrefs3, int parity, int clip_max);
+
+
+static void filter_intra_helper(void *dst1, void *cur1, int w, int prefs, int mrefs,
+                                int prefs3, int mrefs3, int parity, int clip_max)
+{
+    const int w0 = clip_max != 255 ? 0 : w & ~15;
+
+    ff_bwdif_filter_intra_neon(dst1, cur1, w0, prefs, mrefs, prefs3, mrefs3, parity, clip_max);
+
+    if (w0 < w)
+        ff_bwdif_filter_intra_c((char *)dst1 + w0, (char *)cur1 + w0,
+                                w - w0, prefs, mrefs, prefs3, mrefs3, parity, clip_max);
+}
+
 void
 ff_bwdif_init_aarch64(BWDIFContext *s, int bit_depth)
 {
@@ -35,5 +51,6 @@ ff_bwdif_init_aarch64(BWDIFContext *s, int bit_depth)
     if (!have_neon(cpu_flags))
         return;
 
+    s->filter_intra = filter_intra_helper;
 }
 
diff --git a/libavfilter/aarch64/vf_bwdif_neon.S b/libavfilter/aarch64/vf_bwdif_neon.S
index c2f5eb1f73..6a614f8d6e 100644
--- a/libavfilter/aarch64/vf_bwdif_neon.S
+++ b/libavfilter/aarch64/vf_bwdif_neon.S
@@ -76,3 +76,56 @@ coeffs:
         .hword          5570, 3801, 1016, -3801         // hf[0] = v0.h[2], -hf[1] = v0.h[5]
         .hword          5077, 981                       // sp[0] = v0.h[6]
 
+//
+//
+// void ff_bwdif_filter_intra_neon(
+//      void *dst1,     // x0
+//      void *cur1,     // x1
+//      int w,          // w2
+//      int prefs,      // w3
+//      int mrefs,      // w4
+//      int prefs3,     // w5
+//      int mrefs3,     // w6
+//      int parity,     // w7       unused
+//      int clip_max)   // [sp, #0] unused
+
+function ff_bwdif_filter_intra_neon, export=1
+        cmp             w2, #0
+        ble             99f
+
+        ldr             q0, coeffs
+
+//    for (x = 0; x < w; x++) {
+10:
+
+//        interpol = (coef_sp[0] * (cur[mrefs] + cur[prefs]) - coef_sp[1] * (cur[mrefs3] + cur[prefs3])) >> 13;
+        ldr             q31, [x1, w4, sxtw]
+        ldr             q30, [x1, w3, sxtw]
+        ldr             q29, [x1, w6, sxtw]
+        ldr             q28, [x1, w5, sxtw]
+
+        uaddl           v20.8h, v31.8b, v30.8b
+        uaddl2          v21.8h, v31.16b, v30.16b
+
+        UMULL4K         v2, v3, v4, v5, v20, v21, v0.h[6]
+
+        uaddl           v20.8h, v29.8b, v28.8b
+        uaddl2          v21.8h, v29.16b, v28.16b
+
+        UMLSL4K         v2, v3, v4, v5, v20, v21, v0.h[7]
+
+//        dst[0] = av_clip(interpol, 0, clip_max);
+        SQSHRUNN        v2, v2, v3, v4, v5, 13
+        str             q2, [x0], #16
+
+//        dst++;
+//        cur++;
+//    }
+
+        subs            w2, w2, #16
+        add             x1, x1, #16
+        bgt             10b
+
+99:
+        ret
+endfunc
-- 
2.39.2
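As a reading aid for the neon loop, this is a scalar C sketch of the per-pixel computation that the C comments in the assembly describe, restricted to the 8-bit case (clip_max == 255). The function name is hypothetical; the coefficients and the formula are taken from the comments in the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar sketch of one output pixel of the intra filter (8-bit only).
 * cur points at the current pixel; prefs/mrefs and prefs3/mrefs3 are the
 * byte offsets to the lines one and three rows below/above. */
static inline uint8_t bwdif_intra_pixel(const uint8_t *cur, int prefs, int mrefs,
                                        int prefs3, int mrefs3)
{
    static const int coef_sp[2] = { 5077, 981 };
    int interpol;

    assert(cur);
    /* Right shift of a negative value is assumed to be arithmetic. */
    interpol = (coef_sp[0] * (cur[mrefs] + cur[prefs]) -
                coef_sp[1] * (cur[mrefs3] + cur[prefs3])) >> 13;

    /* av_clip(interpol, 0, 255) */
    return (uint8_t)(interpol < 0 ? 0 : interpol > 255 ? 255 : interpol);
}
```

Since 5077 - 981 = 4096 and the shift is 13 (with two samples per tap), a flat area passes through unchanged, which is a quick sanity check when comparing against the neon output.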
[FFmpeg-devel] [PATCH v2 08/15] avfilter/vf_bwdif: Add neon for filter_edge
Signed-off-by: John Cox --- libavfilter/aarch64/vf_bwdif_init_aarch64.c | 20 libavfilter/aarch64/vf_bwdif_neon.S | 104 2 files changed, 124 insertions(+) diff --git a/libavfilter/aarch64/vf_bwdif_init_aarch64.c b/libavfilter/aarch64/vf_bwdif_init_aarch64.c index 3ffaa07ab3..e75cf2f204 100644 --- a/libavfilter/aarch64/vf_bwdif_init_aarch64.c +++ b/libavfilter/aarch64/vf_bwdif_init_aarch64.c @@ -24,10 +24,29 @@ #include "libavfilter/bwdif.h" #include "libavutil/aarch64/cpu.h" +void ff_bwdif_filter_edge_neon(void *dst1, void *prev1, void *cur1, void *next1, + int w, int prefs, int mrefs, int prefs2, int mrefs2, + int parity, int clip_max, int spat); + void ff_bwdif_filter_intra_neon(void *dst1, void *cur1, int w, int prefs, int mrefs, int prefs3, int mrefs3, int parity, int clip_max); +static void filter_edge_helper(void *dst1, void *prev1, void *cur1, void *next1, + int w, int prefs, int mrefs, int prefs2, int mrefs2, + int parity, int clip_max, int spat) +{ +const int w0 = clip_max != 255 ? 
0 : w & ~15; + +ff_bwdif_filter_edge_neon(dst1, prev1, cur1, next1, w0, prefs, mrefs, prefs2, mrefs2, + parity, clip_max, spat); + +if (w0 < w) +ff_bwdif_filter_edge_c((char *)dst1 + w0, (char *)prev1 + w0, (char *)cur1 + w0, (char *)next1 + w0, + w - w0, prefs, mrefs, prefs2, mrefs2, + parity, clip_max, spat); +} + static void filter_intra_helper(void *dst1, void *cur1, int w, int prefs, int mrefs, int prefs3, int mrefs3, int parity, int clip_max) { @@ -52,5 +71,6 @@ ff_bwdif_init_aarch64(BWDIFContext *s, int bit_depth) return; s->filter_intra = filter_intra_helper; +s->filter_edge = filter_edge_helper; } diff --git a/libavfilter/aarch64/vf_bwdif_neon.S b/libavfilter/aarch64/vf_bwdif_neon.S index 48dc7bcd9d..d6e7d109f5 100644 --- a/libavfilter/aarch64/vf_bwdif_neon.S +++ b/libavfilter/aarch64/vf_bwdif_neon.S @@ -149,6 +149,110 @@ coeffs: .hword 5570, 3801, 1016, -3801 // hf[0] = v0.h[2], -hf[1] = v0.h[5] .hword 5077, 981 // sp[0] = v0.h[6] +// +// +// void ff_bwdif_filter_edge_neon( +// void *dst1, // x0 +// void *prev1,// x1 +// void *cur1, // x2 +// void *next1,// x3 +// int w, // w4 +// int prefs, // w5 +// int mrefs, // w6 +// int prefs2, // w7 +// int mrefs2, // [sp, #0] +// int parity, // [sp, #SP_INT] +// int clip_max, // [sp, #SP_INT*2] unused +// int spat); // [sp, #SP_INT*3] + +function ff_bwdif_filter_edge_neon, export=1 +// Sanity check w +cmp w4, #0 +ble 99f + +// #define prev2 cur +// const uint8_t * restrict next2 = parity ? 
prev : next; + +ldr w8, [sp, #0] // mrefs2 + +ldr w17, [sp, #SP_INT] // parity +ldr w16, [sp, #SP_INT*3]// spat +cmp w17, #0 +cselx17, x1, x3, ne + +// for (x = 0; x < w; x++) { + +10: +//int m1 = cur[mrefs]; +//int d = (prev2[0] + next2[0]) >> 1; +//int p1 = cur[prefs]; +//int temporal_diff0 = FFABS(prev2[0] - next2[0]); +//int temporal_diff1 =(FFABS(prev[mrefs] - m1) + FFABS(prev[prefs] - p1)) >> 1; +//int temporal_diff2 =(FFABS(next[mrefs] - m1) + FFABS(next[prefs] - p1)) >> 1; +//int diff = FFMAX3(temporal_diff0 >> 1, temporal_diff1, temporal_diff2); +ldr q31, [x2] +ldr q21, [x17] +uhadd v16.16b, v31.16b, v21.16b // d0 = v16 +uabdv17.16b, v31.16b, v21.16b // td0 = v17 +ldr q24, [x2, w6, sxtw] // m1 = v24 +ldr q22, [x2, w5, sxtw] // p1 = v22 + +ldr q0, [x1, w6, sxtw] // prev[mrefs] +ldr q2, [x1, w5, sxtw] // prev[prefs] +ldr q1, [x3, w6, sxtw] // next[mrefs] +ldr q3, [x3, w5, sxtw] // next[prefs] + +ushrv29.16b, v17.16b, #1 + +uabdv31.16b, v0.16b, v24.16b +uabdv30.16b, v2.16b, v22.16b +uhadd v0.16b, v31.16b, v30.16b // td1 = q0 + +uabdv31.16b, v1.16b, v24.16b +uabdv30.16b, v3.16b, v22.16b +uhadd v1.16b, v31.16b,
[FFmpeg-devel] [PATCH v2 09/15] tests/checkasm: Add test for vf_bwdif filter_edge
Signed-off-by: John Cox --- tests/checkasm/vf_bwdif.c | 54 +++ 1 file changed, 54 insertions(+) diff --git a/tests/checkasm/vf_bwdif.c b/tests/checkasm/vf_bwdif.c index 034bbabb4c..5fdba09fdc 100644 --- a/tests/checkasm/vf_bwdif.c +++ b/tests/checkasm/vf_bwdif.c @@ -83,6 +83,60 @@ void checkasm_check_vf_bwdif(void) report("bwdif10"); } +{ +LOCAL_ALIGNED_16(uint8_t, prev0, [11*WIDTH]); +LOCAL_ALIGNED_16(uint8_t, prev1, [11*WIDTH]); +LOCAL_ALIGNED_16(uint8_t, next0, [11*WIDTH]); +LOCAL_ALIGNED_16(uint8_t, next1, [11*WIDTH]); +LOCAL_ALIGNED_16(uint8_t, cur0, [11*WIDTH]); +LOCAL_ALIGNED_16(uint8_t, cur1, [11*WIDTH]); +LOCAL_ALIGNED_16(uint8_t, dst0, [WIDTH*3]); +LOCAL_ALIGNED_16(uint8_t, dst1, [WIDTH*3]); +const int stride = WIDTH; +const int mask = (1<<8)-1; +int spat; +int parity; + +for (spat = 0; spat != 2; ++spat) { +for (parity = 0; parity != 2; ++parity) { +if (check_func(ctx_8.filter_edge, "bwdif8.edge.s%d.p%d", spat, parity)) { + +declare_func(void, void *dst1, void *prev1, void *cur1, void *next1, +int w, int prefs, int mrefs, int prefs2, int mrefs2, +int parity, int clip_max, int spat); + +randomize_buffers(prev0, prev1, mask, 11*WIDTH); +randomize_buffers(next0, next1, mask, 11*WIDTH); +randomize_buffers( cur0, cur1, mask, 11*WIDTH); +memset(dst0, 0xba, WIDTH * 3); +memset(dst1, 0xba, WIDTH * 3); + +call_ref(dst0 + stride, + prev0 + stride * 4, cur0 + stride * 4, next0 + stride * 4, WIDTH, + stride, -stride, stride * 2, -stride * 2, + parity, mask, spat); +call_new(dst1 + stride, + prev1 + stride * 4, cur1 + stride * 4, next1 + stride * 4, WIDTH, + stride, -stride, stride * 2, -stride * 2, + parity, mask, spat); + +if (memcmp(dst0, dst1, WIDTH*3) +|| memcmp(prev0, prev1, WIDTH*11) +|| memcmp(next0, next1, WIDTH*11) +|| memcmp( cur0, cur1, WIDTH*11)) +fail(); + +bench_new(dst1 + stride, + prev1 + stride * 4, cur1 + stride * 4, next1 + stride * 4, WIDTH, + stride, -stride, stride * 2, -stride * 2, + parity, mask, spat); +} +} +} + +report("bwdif8.edge"); +} + 
if (check_func(ctx_8.filter_intra, "bwdif8.intra")) { LOCAL_ALIGNED_16(uint8_t, cur0, [11*WIDTH]); LOCAL_ALIGNED_16(uint8_t, cur1, [11*WIDTH]); -- 2.39.2
[FFmpeg-devel] [PATCH v2 05/15] tests/checkasm: Add test for vf_bwdif filter_intra
Signed-off-by: John Cox --- tests/checkasm/vf_bwdif.c | 37 + 1 file changed, 37 insertions(+) diff --git a/tests/checkasm/vf_bwdif.c b/tests/checkasm/vf_bwdif.c index 46224bb575..034bbabb4c 100644 --- a/tests/checkasm/vf_bwdif.c +++ b/tests/checkasm/vf_bwdif.c @@ -20,6 +20,7 @@ #include "checkasm.h" #include "libavcodec/internal.h" #include "libavfilter/bwdif.h" +#include "libavutil/mem_internal.h" #define WIDTH 256 @@ -81,4 +82,40 @@ void checkasm_check_vf_bwdif(void) BODY(uint16_t, 10); report("bwdif10"); } + +if (check_func(ctx_8.filter_intra, "bwdif8.intra")) { +LOCAL_ALIGNED_16(uint8_t, cur0, [11*WIDTH]); +LOCAL_ALIGNED_16(uint8_t, cur1, [11*WIDTH]); +LOCAL_ALIGNED_16(uint8_t, dst0, [WIDTH*3]); +LOCAL_ALIGNED_16(uint8_t, dst1, [WIDTH*3]); +const int stride = WIDTH; +const int mask = (1<<8)-1; + +declare_func(void, void *dst1, void *cur1, int w, int prefs, int mrefs, + int prefs3, int mrefs3, int parity, int clip_max); + +randomize_buffers( cur0, cur1, mask, 11*WIDTH); +memset(dst0, 0xba, WIDTH * 3); +memset(dst1, 0xba, WIDTH * 3); + +call_ref(dst0 + stride, + cur0 + stride * 4, WIDTH, + stride, -stride, stride * 3, -stride * 3, + 0, mask); +call_new(dst1 + stride, + cur0 + stride * 4, WIDTH, + stride, -stride, stride * 3, -stride * 3, + 0, mask); + +if (memcmp(dst0, dst1, WIDTH*3) +|| memcmp( cur0, cur1, WIDTH*11)) +fail(); + +bench_new(dst1 + stride, + cur0 + stride * 4, WIDTH, + stride, -stride, stride * 3, -stride * 3, + 0, mask); + +report("bwdif8.intra"); +} } -- 2.39.2 ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org https://ffmpeg.org/mailman/listinfo/ffmpeg-devel To unsubscribe, visit link above, or email ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".
[FFmpeg-devel] [PATCH v2 06/15] avfilter/vf_bwdif: Add clip and spatial macros for aarch64 neon
Signed-off-by: John Cox --- libavfilter/aarch64/vf_bwdif_neon.S | 73 + 1 file changed, 73 insertions(+) diff --git a/libavfilter/aarch64/vf_bwdif_neon.S b/libavfilter/aarch64/vf_bwdif_neon.S index 6a614f8d6e..48dc7bcd9d 100644 --- a/libavfilter/aarch64/vf_bwdif_neon.S +++ b/libavfilter/aarch64/vf_bwdif_neon.S @@ -66,6 +66,79 @@ umlsl2 \a3\().4s, \s1\().8h, \k .endm +// int b = m2s1 - m1; +// int f = p2s1 - p1; +// int dc = c0s1 - m1; +// int de = c0s1 - p1; +// int sp_max = FFMIN(p1 - c0s1, m1 - c0s1); +// sp_max = FFMIN(sp_max, FFMAX(-b,-f)); +// int sp_min = FFMIN(c0s1 - p1, c0s1 - m1); +// sp_min = FFMIN(sp_min, FFMAX(b,f)); +// diff = diff == 0 ? 0 : FFMAX3(diff, sp_min, sp_max); +.macro SPAT_CHECK diff, m2s1, m1, c0s1, p1, p2s1, t0, t1, t2, t3 +uqsub \t0\().16b, \p1\().16b, \c0s1\().16b +uqsub \t2\().16b, \m1\().16b, \c0s1\().16b +umin\t2\().16b, \t0\().16b, \t2\().16b + +uqsub \t1\().16b, \m1\().16b, \m2s1\().16b +uqsub \t3\().16b, \p1\().16b, \p2s1\().16b +umax\t3\().16b, \t3\().16b, \t1\().16b +umin\t3\().16b, \t3\().16b, \t2\().16b + +uqsub \t0\().16b, \c0s1\().16b, \p1\().16b +uqsub \t2\().16b, \c0s1\().16b, \m1\().16b +umin\t2\().16b, \t0\().16b, \t2\().16b + +uqsub \t1\().16b, \m2s1\().16b, \m1\().16b +uqsub \t0\().16b, \p2s1\().16b, \p1\().16b +umax\t0\().16b, \t0\().16b, \t1\().16b +umin\t2\().16b, \t2\().16b, \t0\().16b + +cmeq\t1\().16b, \diff\().16b, #0 +umax\diff\().16b, \diff\().16b, \t3\().16b +umax\diff\().16b, \diff\().16b, \t2\().16b +bic \diff\().16b, \diff\().16b, \t1\().16b +.endm + +// i0 = s0; +// if (i0 > d0 + diff0) +// i0 = d0 + diff0; +// else if (i0 < d0 - diff0) +// i0 = d0 - diff0; +// +// i0 = s0 is safe +.macro DIFF_CLIP i0, s0, d0, diff, t0, t1 +uqadd \t0\().16b, \d0\().16b, \diff\().16b +uqsub \t1\().16b, \d0\().16b, \diff\().16b +umin\i0\().16b, \s0\().16b, \t0\().16b +umax\i0\().16b, \i0\().16b, \t1\().16b +.endm + +// i0 = FFABS(m1 - p1) > td0 ? 
i1 : i2; +// DIFF_CLIP +// +// i0 = i1 is safe +.macro INTERPOL i0, i1, i2, m1, d0, p1, td0, diff, t0, t1, t2 +uabd\t0\().16b, \m1\().16b, \p1\().16b +cmhi\t0\().16b, \t0\().16b, \td0\().16b +bsl \t0\().16b, \i1\().16b, \i2\().16b +DIFF_CLIP \i0, \t0, \d0, \diff, \t1, \t2 +.endm + +.macro PUSH_VREGS +stp d8, d9, [sp, #-64]! +stp d10, d11, [sp, #16] +stp d12, d13, [sp, #32] +stp d14, d15, [sp, #48] +.endm + +.macro POP_VREGS +ldp d14, d15, [sp, #48] +ldp d12, d13, [sp, #32] +ldp d10, d11, [sp, #16] +ldp d8, d9, [sp], #64 +.endm + // static const uint16_t coef_lf[2] = { 4309, 213 }; // static const uint16_t coef_hf[3] = { 5570, 3801, 1016 }; // static const uint16_t coef_sp[2] = { 5077, 981 }; -- 2.39.2 ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org https://ffmpeg.org/mailman/listinfo/ffmpeg-devel To unsubscribe, visit link above, or email ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".
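For readers following the DIFF_CLIP macro in the patch above, its per-lane behaviour can be modelled in scalar C. This is only a sketch and not code from the patch: the helper names sat_add_u8, sat_sub_u8 and diff_clip_u8 are made up for illustration, and it assumes unsigned 8-bit lanes, as the .16b arrangement in the macro implies.

```c
#include <assert.h>
#include <stdint.h>

/* Saturating unsigned byte add: models what UQADD does in each lane. */
static uint8_t sat_add_u8(uint8_t a, uint8_t b)
{
    unsigned sum = (unsigned)a + b;
    return sum > 255 ? 255 : (uint8_t)sum;
}

/* Saturating unsigned byte subtract: models what UQSUB does in each lane. */
static uint8_t sat_sub_u8(uint8_t a, uint8_t b)
{
    return a > b ? (uint8_t)(a - b) : 0;
}

/* Hypothetical scalar model of DIFF_CLIP:
 * clamp s0 into the range [d0 - diff, d0 + diff], with both bounds
 * saturated to 0..255 rather than wrapping. */
static uint8_t diff_clip_u8(uint8_t s0, uint8_t d0, uint8_t diff)
{
    uint8_t hi = sat_add_u8(d0, diff);   /* uqadd t0, d0, diff */
    uint8_t lo = sat_sub_u8(d0, diff);   /* uqsub t1, d0, diff */
    uint8_t i0;

    assert(lo <= hi);                    /* holds for any d0, diff */
    i0 = s0 < hi ? s0 : hi;              /* umin i0, s0, t0 */
    return i0 > lo ? i0 : lo;            /* umax i0, i0, t1 */
}
```

With d0 = 100 and diff = 20 the allowed range is 80..120, so an input of 200 is pulled down to 120 and an input of 10 is pulled up to 80. Near the ends of the byte range the bounds saturate instead of wrapping, which is why the macro uses uqadd/uqsub rather than plain add/sub.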
[FFmpeg-devel] [PATCH v2 10/15] avfilter/vf_bwdif: Export C filter_line
Needed for tail fixup of neon code Signed-off-by: John Cox --- libavfilter/bwdif.h| 5 + libavfilter/vf_bwdif.c | 10 +- 2 files changed, 10 insertions(+), 5 deletions(-) diff --git a/libavfilter/bwdif.h b/libavfilter/bwdif.h index ae1616d366..cce99953f3 100644 --- a/libavfilter/bwdif.h +++ b/libavfilter/bwdif.h @@ -48,4 +48,9 @@ void ff_bwdif_filter_edge_c(void *dst1, void *prev1, void *cur1, void *next1, void ff_bwdif_filter_intra_c(void *dst1, void *cur1, int w, int prefs, int mrefs, int prefs3, int mrefs3, int parity, int clip_max); +void ff_bwdif_filter_line_c(void *dst1, void *prev1, void *cur1, void *next1, +int w, int prefs, int mrefs, int prefs2, int mrefs2, +int prefs3, int mrefs3, int prefs4, int mrefs4, +int parity, int clip_max); + #endif /* AVFILTER_BWDIF_H */ diff --git a/libavfilter/vf_bwdif.c b/libavfilter/vf_bwdif.c index bec83111b4..26349da1fd 100644 --- a/libavfilter/vf_bwdif.c +++ b/libavfilter/vf_bwdif.c @@ -132,10 +132,10 @@ void ff_bwdif_filter_intra_c(void *dst1, void *cur1, int w, int prefs, int mrefs FILTER_INTRA() } -static void filter_line_c(void *dst1, void *prev1, void *cur1, void *next1, - int w, int prefs, int mrefs, int prefs2, int mrefs2, - int prefs3, int mrefs3, int prefs4, int mrefs4, - int parity, int clip_max) +void ff_bwdif_filter_line_c(void *dst1, void *prev1, void *cur1, void *next1, +int w, int prefs, int mrefs, int prefs2, int mrefs2, +int prefs3, int mrefs3, int prefs4, int mrefs4, +int parity, int clip_max) { uint8_t *dst = dst1; uint8_t *prev = prev1; @@ -363,7 +363,7 @@ av_cold void ff_bwdif_init_filter_line(BWDIFContext *s, int bit_depth) s->filter_edge = filter_edge_16bit; } else { s->filter_intra = ff_bwdif_filter_intra_c; -s->filter_line = filter_line_c; +s->filter_line = ff_bwdif_filter_line_c; s->filter_edge = ff_bwdif_filter_edge_c; } -- 2.39.2 ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org https://ffmpeg.org/mailman/listinfo/ffmpeg-devel To unsubscribe, visit link above, or email 
ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".
[FFmpeg-devel] [PATCH v2 07/15] avfilter/vf_bwdif: Export C filter_edge
Needed for tail fixup of neon code Signed-off-by: John Cox --- libavfilter/bwdif.h| 4 libavfilter/vf_bwdif.c | 8 2 files changed, 8 insertions(+), 4 deletions(-) diff --git a/libavfilter/bwdif.h b/libavfilter/bwdif.h index ae6f6ce223..ae1616d366 100644 --- a/libavfilter/bwdif.h +++ b/libavfilter/bwdif.h @@ -41,6 +41,10 @@ void ff_bwdif_init_filter_line(BWDIFContext *bwdif, int bit_depth); void ff_bwdif_init_x86(BWDIFContext *bwdif, int bit_depth); void ff_bwdif_init_aarch64(BWDIFContext *bwdif, int bit_depth); +void ff_bwdif_filter_edge_c(void *dst1, void *prev1, void *cur1, void *next1, +int w, int prefs, int mrefs, int prefs2, int mrefs2, +int parity, int clip_max, int spat); + void ff_bwdif_filter_intra_c(void *dst1, void *cur1, int w, int prefs, int mrefs, int prefs3, int mrefs3, int parity, int clip_max); diff --git a/libavfilter/vf_bwdif.c b/libavfilter/vf_bwdif.c index 035fc58670..bec83111b4 100644 --- a/libavfilter/vf_bwdif.c +++ b/libavfilter/vf_bwdif.c @@ -150,9 +150,9 @@ static void filter_line_c(void *dst1, void *prev1, void *cur1, void *next1, FILTER2() } -static void filter_edge(void *dst1, void *prev1, void *cur1, void *next1, -int w, int prefs, int mrefs, int prefs2, int mrefs2, -int parity, int clip_max, int spat) +void ff_bwdif_filter_edge_c(void *dst1, void *prev1, void *cur1, void *next1, +int w, int prefs, int mrefs, int prefs2, int mrefs2, +int parity, int clip_max, int spat) { uint8_t *dst = dst1; uint8_t *prev = prev1; @@ -364,7 +364,7 @@ av_cold void ff_bwdif_init_filter_line(BWDIFContext *s, int bit_depth) } else { s->filter_intra = ff_bwdif_filter_intra_c; s->filter_line = filter_line_c; -s->filter_edge = filter_edge; +s->filter_edge = ff_bwdif_filter_edge_c; } #if ARCH_X86 -- 2.39.2 ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org https://ffmpeg.org/mailman/listinfo/ffmpeg-devel To unsubscribe, visit link above, or email ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".
[FFmpeg-devel] [PATCH v2 11/15] avfilter/vf_bwdif: Add neon for filter_line
Signed-off-by: John Cox --- libavfilter/aarch64/vf_bwdif_init_aarch64.c | 21 ++ libavfilter/aarch64/vf_bwdif_neon.S | 208 2 files changed, 229 insertions(+) diff --git a/libavfilter/aarch64/vf_bwdif_init_aarch64.c b/libavfilter/aarch64/vf_bwdif_init_aarch64.c index e75cf2f204..21e67884ab 100644 --- a/libavfilter/aarch64/vf_bwdif_init_aarch64.c +++ b/libavfilter/aarch64/vf_bwdif_init_aarch64.c @@ -31,6 +31,26 @@ void ff_bwdif_filter_edge_neon(void *dst1, void *prev1, void *cur1, void *next1, void ff_bwdif_filter_intra_neon(void *dst1, void *cur1, int w, int prefs, int mrefs, int prefs3, int mrefs3, int parity, int clip_max); +void ff_bwdif_filter_line_neon(void *dst1, void *prev1, void *cur1, void *next1, + int w, int prefs, int mrefs, int prefs2, int mrefs2, + int prefs3, int mrefs3, int prefs4, int mrefs4, + int parity, int clip_max); + + +static void filter_line_helper(void *dst1, void *prev1, void *cur1, void *next1, + int w, int prefs, int mrefs, int prefs2, int mrefs2, + int prefs3, int mrefs3, int prefs4, int mrefs4, + int parity, int clip_max) +{ +const int w0 = clip_max != 255 ? 
0 : w & ~15; + +ff_bwdif_filter_line_neon(dst1, prev1, cur1, next1, + w0, prefs, mrefs, prefs2, mrefs2, prefs3, mrefs3, prefs4, mrefs4, parity, clip_max); + +if (w0 < w) +ff_bwdif_filter_line_c((char *)dst1 + w0, (char *)prev1 + w0, (char *)cur1 + w0, (char *)next1 + w0, + w - w0, prefs, mrefs, prefs2, mrefs2, prefs3, mrefs3, prefs4, mrefs4, parity, clip_max); +} static void filter_edge_helper(void *dst1, void *prev1, void *cur1, void *next1, int w, int prefs, int mrefs, int prefs2, int mrefs2, @@ -71,6 +91,7 @@ ff_bwdif_init_aarch64(BWDIFContext *s, int bit_depth) return; s->filter_intra = filter_intra_helper; +s->filter_line = filter_line_helper; s->filter_edge = filter_edge_helper; } diff --git a/libavfilter/aarch64/vf_bwdif_neon.S b/libavfilter/aarch64/vf_bwdif_neon.S index d6e7d109f5..abc050565c 100644 --- a/libavfilter/aarch64/vf_bwdif_neon.S +++ b/libavfilter/aarch64/vf_bwdif_neon.S @@ -149,6 +149,214 @@ coeffs: .hword 5570, 3801, 1016, -3801 // hf[0] = v0.h[2], -hf[1] = v0.h[5] .hword 5077, 981 // sp[0] = v0.h[6] +// === +// +// void filter_line( +// void *dst1, // x0 +// void *prev1,// x1 +// void *cur1, // x2 +// void *next1,// x3 +// int w, // w4 +// int prefs, // w5 +// int mrefs, // w6 +// int prefs2, // w7 +// int mrefs2, // [sp, #0] +// int prefs3, // [sp, #SP_INT] +// int mrefs3, // [sp, #SP_INT*2] +// int prefs4, // [sp, #SP_INT*3] +// int mrefs4, // [sp, #SP_INT*4] +// int parity, // [sp, #SP_INT*5] +// int clip_max) // [sp, #SP_INT*6] + +function ff_bwdif_filter_line_neon, export=1 +// Sanity check w +cmp w4, #0 +ble 99f + +// Rearrange regs to be the same as line3 for ease of debug! 
+mov w10, w4 // w10 = loop count +mov w9, w6 // w9 = mref +mov w12, w7 // w12 = pref2 +mov w11, w5 // w11 = pref +ldr w8, [sp, #0] // w8 = mref2 +ldr w7, [sp, #SP_INT*2]// w7 = mref3 +ldr w6, [sp, #SP_INT*4]// w6 = mref4 +ldr w13, [sp, #SP_INT] // w13 = pref3 +ldr w14, [sp, #SP_INT*3]// w14 = pref4 + +mov x4, x3 +mov x3, x2 +mov x2, x1 + +// #define prev2 cur +//const uint8_t * restrict next2 = parity ? prev : next; +ldr w17, [sp, #SP_INT*5]// parity +cmp w17, #0 +cselx17, x2, x4, ne + +PUSH_VREGS + +ldr q0, coeffs + +// for (x = 0; x < w; x++) { +// int diff0, diff2; +// int d0, d2; +// int temporal_diff0, temporal_diff2; +// +// int i1, i2; +// int j1, j2; +// int p6, p5, p4, p3, p2, p1, c0, m1, m2, m3, m4; + +10: +// c0 = prev2[0] + next2[0];// c0 = v20, v21 +// d0 = c0 >> 1; // d0 = v10 +// temporal_diff0
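The width split done by filter_line_helper in the patch above can be looked at in isolation. The sketch below only reproduces the arithmetic; neon_width and tail_width are hypothetical names that do not appear in the patch, and the rule itself (no NEON unless clip_max is 255, otherwise whole 16-pixel blocks) is taken from the helper as posted.

```c
#include <assert.h>

/* Hypothetical helper: how many leading pixels of a line are handed to the
 * NEON routine. Mirrors "clip_max != 255 ? 0 : w & ~15" from the patch. */
static int neon_width(int w, int clip_max)
{
    assert(w >= 0);
    return clip_max != 255 ? 0 : w & ~15;
}

/* Hypothetical helper: how many trailing pixels are left for the C
 * fallback (ff_bwdif_filter_line_c in the patch). */
static int tail_width(int w, int clip_max)
{
    return w - neon_width(w, clip_max);
}
```

For a 100 pixel wide 8-bit line this gives 96 pixels to the asm and 4 to the C tail. When clip_max is not 255 the asm is called with a width of 0, which the "Sanity check w" at the top of ff_bwdif_filter_line_neon turns into an immediate return, and the whole line goes through the C path.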
[FFmpeg-devel] [PATCH v2 12/15] avfilter/vf_bwdif: Add a filter_line3 method for optimisation
Add an optional filter_line3 to the available optimisations. filter_line3 is equivalent to filter_line, memcpy, filter_line filter_line shares quite a number of loads and some calculations in common with its next iteration and testing shows that using aarch64 neon filter_line3s performance is 30% better than two filter_lines and a memcpy. Signed-off-by: John Cox --- libavfilter/bwdif.h| 7 +++ libavfilter/vf_bwdif.c | 31 +++ 2 files changed, 38 insertions(+) diff --git a/libavfilter/bwdif.h b/libavfilter/bwdif.h index cce99953f3..496cec72ef 100644 --- a/libavfilter/bwdif.h +++ b/libavfilter/bwdif.h @@ -35,6 +35,9 @@ typedef struct BWDIFContext { void (*filter_edge)(void *dst, void *prev, void *cur, void *next, int w, int prefs, int mrefs, int prefs2, int mrefs2, int parity, int clip_max, int spat); +void (*filter_line3)(void *dst, int dstride, + const void *prev, const void *cur, const void *next, int prefs, + int w, int parity, int clip_max); } BWDIFContext; void ff_bwdif_init_filter_line(BWDIFContext *bwdif, int bit_depth); @@ -53,4 +56,8 @@ void ff_bwdif_filter_line_c(void *dst1, void *prev1, void *cur1, void *next1, int prefs3, int mrefs3, int prefs4, int mrefs4, int parity, int clip_max); +void ff_bwdif_filter_line3_c(void * dst1, int d_stride, + const void * prev1, const void * cur1, const void * next1, int s_stride, + int w, int parity, int clip_max); + #endif /* AVFILTER_BWDIF_H */ diff --git a/libavfilter/vf_bwdif.c b/libavfilter/vf_bwdif.c index 26349da1fd..52bc676cf8 100644 --- a/libavfilter/vf_bwdif.c +++ b/libavfilter/vf_bwdif.c @@ -150,6 +150,31 @@ void ff_bwdif_filter_line_c(void *dst1, void *prev1, void *cur1, void *next1, FILTER2() } +#define NEXT_LINE()\ +dst += d_stride; \ +prev += prefs; \ +cur += prefs; \ +next += prefs; + +void ff_bwdif_filter_line3_c(void * dst1, int d_stride, + const void * prev1, const void * cur1, const void * next1, int s_stride, + int w, int parity, int clip_max) +{ +const int prefs = s_stride; +uint8_t * dst = dst1; 
+const uint8_t * prev = prev1; +const uint8_t * cur = cur1; +const uint8_t * next = next1; + +ff_bwdif_filter_line_c(dst, (void*)prev, (void*)cur, (void*)next, w, + prefs, -prefs, prefs * 2, - prefs * 2, prefs * 3, -prefs * 3, prefs * 4, -prefs * 4, parity, clip_max); +NEXT_LINE(); +memcpy(dst, cur, w); +NEXT_LINE(); +ff_bwdif_filter_line_c(dst, (void*)prev, (void*)cur, (void*)next, w, + prefs, -prefs, prefs * 2, - prefs * 2, prefs * 3, -prefs * 3, prefs * 4, -prefs * 4, parity, clip_max); +} + void ff_bwdif_filter_edge_c(void *dst1, void *prev1, void *cur1, void *next1, int w, int prefs, int mrefs, int prefs2, int mrefs2, int parity, int clip_max, int spat) @@ -244,6 +269,11 @@ static int filter_slice(AVFilterContext *ctx, void *arg, int jobnr, int nb_jobs) refs << 1, -(refs << 1), td->parity ^ td->tff, clip_max, (y < 2) || ((y + 3) > td->h) ? 0 : 1); +} else if (s->filter_line3 && y + 2 < slice_end && y + 6 < td->h) { +s->filter_line3(dst, td->frame->linesize[td->plane], +prev, cur, next, linesize, td->w, +td->parity ^ td->tff, clip_max); +y += 2; } else { s->filter_line(dst, prev, cur, next, td->w, refs, -refs, refs << 1, -(refs << 1), @@ -357,6 +387,7 @@ static int config_props(AVFilterLink *link) av_cold void ff_bwdif_init_filter_line(BWDIFContext *s, int bit_depth) { +s->filter_line3 = 0; if (bit_depth > 8) { s->filter_intra = filter_intra_16bit; s->filter_line = filter_line_c_16bit; -- 2.39.2 ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org https://ffmpeg.org/mailman/listinfo/ffmpeg-devel To unsubscribe, visit link above, or email ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".
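The "filter_line, memcpy, filter_line" structure described in the commit message above can be shown with a much smaller stand-in filter. This is an illustration only: toy_filter_line and toy_filter_line3 are invented names, the toy filter just averages the lines above and below, and it takes far fewer arguments than the real ff_bwdif_filter_line_c.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Signature of the toy per-line filter (an assumption for this sketch). */
typedef void (*line_fn)(uint8_t *dst, const uint8_t *cur, int w, int stride);

/* Toy stand-in for filter_line: rounded average of the lines directly
 * above and below the current one. */
static void toy_filter_line(uint8_t *dst, const uint8_t *cur, int w, int stride)
{
    for (int x = 0; x < w; x++)
        dst[x] = (uint8_t)((cur[x - stride] + cur[x + stride] + 1) >> 1);
}

/* Same shape as ff_bwdif_filter_line3_c in the patch: filter one line,
 * copy the next line unchanged, filter the line after that. */
static void toy_filter_line3(uint8_t *dst, int d_stride,
                             const uint8_t *cur, int s_stride,
                             int w, line_fn filter)
{
    assert(w >= 0);

    filter(dst, cur, w, s_stride);   /* output line 0: interpolated */
    dst += d_stride;
    cur += s_stride;
    memcpy(dst, cur, (size_t)w);     /* output line 1: copied from source */
    dst += d_stride;
    cur += s_stride;
    filter(dst, cur, w, s_stride);   /* output line 2: interpolated */
}
```

The caller must make sure the source has at least one valid line above the first filtered line and one below the last, just as the real filter needs its four lines of context on either side. The speedup quoted in the commit message comes from the asm version sharing loads between the two filtered lines, which this C shape does not attempt.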
[FFmpeg-devel] [PATCH v2 13/15] avfilter/vf_bwdif: Add neon for filter_line3
Signed-off-by: John Cox --- libavfilter/aarch64/vf_bwdif_init_aarch64.c | 28 ++ libavfilter/aarch64/vf_bwdif_neon.S | 272 2 files changed, 300 insertions(+) diff --git a/libavfilter/aarch64/vf_bwdif_init_aarch64.c b/libavfilter/aarch64/vf_bwdif_init_aarch64.c index 21e67884ab..f52bc4b9b4 100644 --- a/libavfilter/aarch64/vf_bwdif_init_aarch64.c +++ b/libavfilter/aarch64/vf_bwdif_init_aarch64.c @@ -36,6 +36,33 @@ void ff_bwdif_filter_line_neon(void *dst1, void *prev1, void *cur1, void *next1, int prefs3, int mrefs3, int prefs4, int mrefs4, int parity, int clip_max); +void ff_bwdif_filter_line3_neon(void * dst1, int d_stride, +const void * prev1, const void * cur1, const void * next1, int s_stride, +int w, int parity, int clip_max); + + +static void filter_line3_helper(void * dst1, int d_stride, +const void * prev1, const void * cur1, const void * next1, int s_stride, +int w, int parity, int clip_max) +{ +// Asm works on 16 byte chunks +// If w is a multiple of 16 then all is good - if not then if width rounded +// up to nearest 16 will fit in both src & dst strides then allow the asm +// to write over the padding bytes as that is almost certainly faster than +// having to invoke the C version to clean up the tail. +const int w1 = FFALIGN(w, 16); +const int w0 = clip_max != 255 ? 0 : + d_stride <= w1 && s_stride <= w1 ? 
w : w & ~15; + +ff_bwdif_filter_line3_neon(dst1, d_stride, + prev1, cur1, next1, s_stride, + w0, parity, clip_max); + +if (w0 < w) +ff_bwdif_filter_line3_c((char *)dst1 + w0, d_stride, +(const char *)prev1 + w0, (const char *)cur1 + w0, (const char *)next1 + w0, s_stride, +w - w0, parity, clip_max); +} static void filter_line_helper(void *dst1, void *prev1, void *cur1, void *next1, int w, int prefs, int mrefs, int prefs2, int mrefs2, @@ -93,5 +120,6 @@ ff_bwdif_init_aarch64(BWDIFContext *s, int bit_depth) s->filter_intra = filter_intra_helper; s->filter_line = filter_line_helper; s->filter_edge = filter_edge_helper; +s->filter_line3 = filter_line3_helper; } diff --git a/libavfilter/aarch64/vf_bwdif_neon.S b/libavfilter/aarch64/vf_bwdif_neon.S index abc050565c..1405ea10fb 100644 --- a/libavfilter/aarch64/vf_bwdif_neon.S +++ b/libavfilter/aarch64/vf_bwdif_neon.S @@ -149,6 +149,278 @@ coeffs: .hword 5570, 3801, 1016, -3801 // hf[0] = v0.h[2], -hf[1] = v0.h[5] .hword 5077, 981 // sp[0] = v0.h[6] +// === +// +// void ff_bwdif_filter_line3_neon( +// void * dst1, // x0 +// int d_stride,// w1 +// const void * prev1, // x2 +// const void * cur1, // x3 +// const void * next1, // x4 +// int s_stride,// w5 +// int w, // w6 +// int parity, // w7 +// int clip_max); // [sp, #0] (Ignored) + +function ff_bwdif_filter_line3_neon, export=1 +// Sanity check w +cmp w6, #0 +ble 99f + +// #define prev2 cur +//const uint8_t * restrict next2 = parity ? 
prev : next; +cmp w7, #0 +cselx17, x2, x4, ne + +// We want all the V registers - save all the ones we must +PUSH_VREGS + +ldr q0, coeffs + +// Some rearrangement of initial values for nice layout of refs in regs +mov w10, w6 // w10 = loop count +neg w9, w5 // w9 = mref +lsl w8, w9, #1// w8 = mref2 +add w7, w9, w9, LSL #1// w7 = mref3 +lsl w6, w9, #2// w6 = mref4 +mov w11, w5 // w11 = pref +lsl w12, w5, #1// w12 = pref2 +add w13, w5, w5, LSL #1// w13 = pref3 +lsl w14, w5, #2// w14 = pref4 +add w15, w5, w5, LSL #2// w15 = pref5 +add w16, w14, w12 // w16 = pref6 + +lsl w5, w1, #1// w5 = d_stride * 2 + +// for (x = 0; x < w; x++) { +// int diff0, diff2; +// int d0, d2; +// int temporal_diff0, temporal_diff2; +// +//
[FFmpeg-devel] [PATCH v2 14/15] tests/checkasm: Add test for vf_bwdif filter_line3
Signed-off-by: John Cox --- tests/checkasm/vf_bwdif.c | 81 +++ 1 file changed, 81 insertions(+) diff --git a/tests/checkasm/vf_bwdif.c b/tests/checkasm/vf_bwdif.c index 5fdba09fdc..3399cacdf7 100644 --- a/tests/checkasm/vf_bwdif.c +++ b/tests/checkasm/vf_bwdif.c @@ -28,6 +28,10 @@ for (size_t i = 0; i < count; i++) \ buf0[i] = buf1[i] = rnd() & mask +#define randomize_overflow_check(buf0, buf1, mask, count) \ +for (size_t i = 0; i < count; i++) \ +buf0[i] = buf1[i] = (rnd() & 1) != 0 ? mask : 0; + #define BODY(type, depth) \ do { \ type prev0[9*WIDTH], prev1[9*WIDTH]; \ @@ -83,6 +87,83 @@ void checkasm_check_vf_bwdif(void) report("bwdif10"); } +if (!ctx_8.filter_line3) +ctx_8.filter_line3 = ff_bwdif_filter_line3_c; + +{ +LOCAL_ALIGNED_16(uint8_t, prev0, [11*WIDTH]); +LOCAL_ALIGNED_16(uint8_t, prev1, [11*WIDTH]); +LOCAL_ALIGNED_16(uint8_t, next0, [11*WIDTH]); +LOCAL_ALIGNED_16(uint8_t, next1, [11*WIDTH]); +LOCAL_ALIGNED_16(uint8_t, cur0, [11*WIDTH]); +LOCAL_ALIGNED_16(uint8_t, cur1, [11*WIDTH]); +LOCAL_ALIGNED_16(uint8_t, dst0, [WIDTH*3]); +LOCAL_ALIGNED_16(uint8_t, dst1, [WIDTH*3]); +const int stride = WIDTH; +const int mask = (1<<8)-1; +int parity; + +for (parity = 0; parity != 2; ++parity) { +if (check_func(ctx_8.filter_line3, "bwdif8.line3.rnd.p%d", parity)) { + +declare_func(void, void * dst1, int d_stride, + const void * prev1, const void * cur1, const void * next1, int prefs, + int w, int parity, int clip_max); + +randomize_buffers(prev0, prev1, mask, 11*WIDTH); +randomize_buffers(next0, next1, mask, 11*WIDTH); +randomize_buffers( cur0, cur1, mask, 11*WIDTH); + +call_ref(dst0, stride, + prev0 + stride * 4, cur0 + stride * 4, next0 + stride * 4, stride, + WIDTH, parity, mask); +call_new(dst1, stride, + prev1 + stride * 4, cur1 + stride * 4, next1 + stride * 4, stride, + WIDTH, parity, mask); + +if (memcmp(dst0, dst1, WIDTH*3) +|| memcmp(prev0, prev1, WIDTH*11) +|| memcmp(next0, next1, WIDTH*11) +|| memcmp( cur0, cur1, WIDTH*11)) +fail(); + +bench_new(dst1, 
stride, + prev1 + stride * 4, cur1 + stride * 4, next1 + stride * 4, stride, + WIDTH, parity, mask); +} +} + +// Use just 0s and ~0s to try to provoke bad cropping or overflow +// Parity makes no difference to this test so just test 0 +if (check_func(ctx_8.filter_line3, "bwdif8.line3.overflow")) { + +declare_func(void, void * dst1, int d_stride, + const void * prev1, const void * cur1, const void * next1, int prefs, + int w, int parity, int clip_max); + +randomize_overflow_check(prev0, prev1, mask, 11*WIDTH); +randomize_overflow_check(next0, next1, mask, 11*WIDTH); +randomize_overflow_check( cur0, cur1, mask, 11*WIDTH); + +call_ref(dst0, stride, + prev0 + stride * 4, cur0 + stride * 4, next0 + stride * 4, stride, + WIDTH, 0, mask); +call_new(dst1, stride, + prev1 + stride * 4, cur1 + stride * 4, next1 + stride * 4, stride, + WIDTH, 0, mask); + +if (memcmp(dst0, dst1, WIDTH*3) +|| memcmp(prev0, prev1, WIDTH*11) +|| memcmp(next0, next1, WIDTH*11) +|| memcmp( cur0, cur1, WIDTH*11)) +fail(); + +// No point to benching +} + +report("bwdif8.line3"); +} + { LOCAL_ALIGNED_16(uint8_t, prev0, [11*WIDTH]); LOCAL_ALIGNED_16(uint8_t, prev1, [11*WIDTH]); -- 2.39.2 ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org https://ffmpeg.org/mailman/listinfo/ffmpeg-devel To unsubscribe, visit link above, or email ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".
[FFmpeg-devel] [PATCH v2 15/15] avfilter/vf_bwdif: Block filter slices into a multiple of 4 lines
Round job start lines down to a multiple of 4. This means that, if
filter_line3 exists, filter_line is no longer called once at the end of
a slice for some thread counts. The final slice may do up to 3 extra
lines but filter_edge is faster than filter_line so it is unlikely to
create any noticeable thread load variation.

Signed-off-by: John Cox
---
 libavfilter/vf_bwdif.c | 13 ++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/libavfilter/vf_bwdif.c b/libavfilter/vf_bwdif.c
index 52bc676cf8..6701208efe 100644
--- a/libavfilter/vf_bwdif.c
+++ b/libavfilter/vf_bwdif.c
@@ -237,6 +237,13 @@ static void filter_edge_16bit(void *dst1, void *prev1, void *cur1, void *next1,
     FILTER2()
 }
 
+// Round job start line down to multiple of 4 so that if filter_line3 exists
+// and the frame is a multiple of 4 high then filter_line will never be called
+static inline int job_start(const int jobnr, const int nb_jobs, const int h)
+{
+    return jobnr >= nb_jobs ? h : ((h * jobnr) / nb_jobs) & ~3;
+}
+
 static int filter_slice(AVFilterContext *ctx, void *arg, int jobnr, int nb_jobs)
 {
     BWDIFContext *s = ctx->priv;
@@ -246,8 +253,8 @@ static int filter_slice(AVFilterContext *ctx, void *arg, int jobnr, int nb_jobs)
     int clip_max = (1 << (yadif->csp->comp[td->plane].depth)) - 1;
     int df = (yadif->csp->comp[td->plane].depth + 7) / 8;
     int refs = linesize / df;
-    int slice_start = (td->h * jobnr ) / nb_jobs;
-    int slice_end = (td->h * (jobnr+1)) / nb_jobs;
+    int slice_start = job_start(jobnr, nb_jobs, td->h);
+    int slice_end = job_start(jobnr + 1, nb_jobs, td->h);
     int y;
 
     for (y = slice_start; y < slice_end; y++) {
@@ -310,7 +317,7 @@ static void filter(AVFilterContext *ctx, AVFrame *dstpic,
         td.plane = i;
 
         ff_filter_execute(ctx, filter_slice, &td, NULL,
-                          FFMIN(h, ff_filter_get_nb_threads(ctx)));
+                          FFMIN((h+3)/4, ff_filter_get_nb_threads(ctx)));
     }
     if (yadif->current_field == YADIF_FIELD_END) {
         yadif->current_field = YADIF_FIELD_NORMAL;
-- 
2.39.2
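To see what the rounding in job_start does to slice boundaries, the function can be lifted out of the patch and run on its own. In the sketch below job_start has the same body as in the patch, apart from an added assert; slice_lines is a hypothetical helper added only for illustration.

```c
#include <assert.h>

/* Same arithmetic as job_start() in the patch: the start line of each job
 * is rounded down to a multiple of 4, and the job after the last one
 * "starts" at h so that the final slice ends at the frame height. */
static int job_start(int jobnr, int nb_jobs, int h)
{
    assert(nb_jobs > 0);
    return jobnr >= nb_jobs ? h : ((h * jobnr) / nb_jobs) & ~3;
}

/* Hypothetical helper: number of lines a given job ends up processing. */
static int slice_lines(int jobnr, int nb_jobs, int h)
{
    return job_start(jobnr + 1, nb_jobs, h) - job_start(jobnr, nb_jobs, h);
}
```

For a 1080 line plane split over 7 jobs, the unrounded start of job 1 would be line 154; with the rounding it becomes 152, so job 0 covers a multiple of 4 lines. The last job starts at 924 rather than 925 and runs to 1080. For very small planes a slice can come out empty (for example h = 10 with 3 jobs gives job 0 no lines), which is harmless because the loop in filter_slice simply does not execute.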
Re: [FFmpeg-devel] [PATCH v2 12/15] avfilter/vf_bwdif: Add a filter_line3 method for optimisation
On Mon, 3 Jul 2023 00:12:46 +0300 (EEST), you wrote: >On Sun, 2 Jul 2023, Thomas Mundt wrote: > >> Am So., 2. Juli 2023 um 14:34 Uhr schrieb John Cox : >> Add an optional filter_line3 to the available optimisations. >> >> filter_line3 is equivalent to filter_line, memcpy, filter_line >> >> filter_line shares quite a number of loads and some calculations >> in >> common with its next iteration and testing shows that using >> aarch64 >> neon filter_line3s performance is 30% better than two >> filter_lines >> and a memcpy. >> >> Signed-off-by: John Cox >> --- >> libavfilter/bwdif.h | 7 +++ >> libavfilter/vf_bwdif.c | 31 +++ >> 2 files changed, 38 insertions(+) >> >> diff --git a/libavfilter/bwdif.h b/libavfilter/bwdif.h >> index cce99953f3..496cec72ef 100644 >> --- a/libavfilter/bwdif.h >> +++ b/libavfilter/bwdif.h >> @@ -35,6 +35,9 @@ typedef struct BWDIFContext { >> void (*filter_edge)(void *dst, void *prev, void *cur, void >> *next, >> int w, int prefs, int mrefs, int >> prefs2, int mrefs2, >> int parity, int clip_max, int spat); >> + void (*filter_line3)(void *dst, int dstride, >> + const void *prev, const void *cur, >> const void *next, int prefs, >> + int w, int parity, int clip_max); >> } BWDIFContext; >> >> void ff_bwdif_init_filter_line(BWDIFContext *bwdif, int >> bit_depth); >> @@ -53,4 +56,8 @@ void ff_bwdif_filter_line_c(void *dst1, void >> *prev1, void *cur1, void *next1, >> int prefs3, int mrefs3, int prefs4, >> int mrefs4, >> int parity, int clip_max); >> >> +void ff_bwdif_filter_line3_c(void * dst1, int d_stride, >> + const void * prev1, const void * >> cur1, const void * next1, int s_stride, >> + int w, int parity, int clip_max); >> + >> #endif /* AVFILTER_BWDIF_H */ >> diff --git a/libavfilter/vf_bwdif.c b/libavfilter/vf_bwdif.c >> index 26349da1fd..52bc676cf8 100644 >> --- a/libavfilter/vf_bwdif.c >> +++ b/libavfilter/vf_bwdif.c >> @@ -150,6 +150,31 @@ void ff_bwdif_filter_line_c(void *dst1, >> void *prev1, void *cur1, void *next1, >> FILTER2() >> } 
>> >> +#define NEXT_LINE()\ >> + dst += d_stride; \ >> + prev += prefs; \ >> + cur += prefs; \ >> + next += prefs; >> + >> +void ff_bwdif_filter_line3_c(void * dst1, int d_stride, >> + const void * prev1, const void * >> cur1, const void * next1, int s_stride, >> + int w, int parity, int clip_max) >> +{ >> + const int prefs = s_stride; >> + uint8_t * dst = dst1; >> + const uint8_t * prev = prev1; >> + const uint8_t * cur = cur1; >> + const uint8_t * next = next1; >> + >> + ff_bwdif_filter_line_c(dst, (void*)prev, (void*)cur, >> (void*)next, w, >> + prefs, -prefs, prefs * 2, - prefs * >> 2, prefs * 3, -prefs * 3, prefs * 4, -prefs * 4, parity, >> clip_max); >> + NEXT_LINE(); >> + memcpy(dst, cur, w); >> + NEXT_LINE(); >> + ff_bwdif_filter_line_c(dst, (void*)prev, (void*)cur, >> (void*)next, w, >> + prefs, -prefs, prefs * 2, - prefs * >> 2, prefs * 3, -prefs * 3, prefs * 4, -prefs * 4, parity, >> clip_max); >> +} >> + >> void ff_bwdif_filter_edge_c(void *dst1, void *prev1, void >> *cur1, void *next1, >> int w, int prefs, int mrefs, int >> prefs2, int mrefs2, >> int parity, int clip_max, int spat) >> @@ -244,6 +269,11 @@ static int filter_slice(AVFilterContext >> *ctx, void *arg, int jobnr, int nb_jobs) >> refs << 1, -(refs << 1), >>
Re: [FFmpeg-devel] [PATCH 02/15] avfilter/vf_bwdif: Add common macros and consts for aarch64 neon
On Mon, 3 Jul 2023 00:02:27 +0300 (EEST), you wrote:

>On Sun, 2 Jul 2023, Martin Storsjö wrote:
>
>> On Sun, 2 Jul 2023, John Cox wrote:
>>
>>> On Sun, 2 Jul 2023 00:35:14 +0300 (EEST), you wrote:
>>>
>>>> On Thu, 29 Jun 2023, John Cox wrote:
>>>>
>>>>> Add macros for dual scalar half->single multiply and accumulate
>>>>> Add macro for shift, saturate and shorten single to byte
>>>>> Add filter constants
>>>>>
>>>>> Signed-off-by: John Cox
>>>>> ---
>>>>> libavfilter/aarch64/vf_bwdif_neon.S | 46 +
>>>>> 1 file changed, 46 insertions(+)
>>>>>
>>>>> +
>>>>> +.align 16
>>>>
>>>> Note that .align for arm is power of two; this triggers a 2^16 byte
>>>> alignment here, which certainly isn't intended.
>>>
>>> Yikes! I'll swap that for a .balign now I've looked that up
>>>
>>>> But in general, the usual way of defining constants is with a
>>>> const/endconst block, which places them in the right rdata section instead
>>>> of in the text section. But that then requires you to use a movrel macro
>>>> for accessing the data, instead of a plain ldr instruction.
>>>
>>> Yeah - arm has a history of mixing text & const - I went with the
>>> simpler code. Is this a deal breaker or can I leave it as is?
>>
>> I wouldn't treat it as a deal breaker as long as it works across all
>> platforms - even if consistency with the code style elsewhere is preferred,
>> but IIRC there may be issues with MS armasm (after passed through
>> gas-preprocessor). IIRC there might be issues with starting out with straight
>> up content without the full setup made by the function/const macros. OTOH I
>> might have fixed that in gas-preprocessor too...
>>
>> Last time around, the patchset failed building in that configuration due to
>> the out of range alignment, I'll see how it fares now.
>
>I'm sorry, but I'd just recommend you to go with the const macros.
>
>Your current patch fails because gas-preprocessor,
>https://github.com/ffmpeg/gas-preprocessor, doesn't support the .balign
>directive, it only recognizes .align and .p2align. (Extending it to
>support it would be trivial though.)
>
>If I change your code to ".align 4", I get the following warning:
>
>\home\martin\code\ffmpeg-msvc-arm64\libavfilter\aarch64\vf_bwdif_neon.o.asm(1011)
>: warning A4228: Alignment value exceeds AREA alignment; alignment not
>guaranteed
> ALIGN 16
>
>Since we haven't started any section, apparently armasm defaults to a
>section with 4 byte alignment.
>
>But anyway, regardless of the alignment, it later fails with this error:
>
>\home\martin\code\ffmpeg-msvc-arm64\libavfilter\aarch64\vf_bwdif_neon.o.asm(1051)
>: error A2504: operand 2: Expected address
> ldr q0, coeffs
>
>
>So I would request you to just go with the macros we use elsewhere. The
>gas-preprocessor/armasm setup doesn't support/expect any random assembly,
>but the disciplined subset we normally write. In most cases, we
>essentially never write bare directives in the code, but only use the
>macros from asm.S, which are set up to handle portability across all
>supported platforms and their toolchains.

OK - will do.

JC

>// Martin
Re: [FFmpeg-devel] [PATCH v2 00/15] avfilter/vf_bwdif: Add aarch64 neon functions
On Mon, 3 Jul 2023 00:09:52 +0300 (EEST), you wrote:

>On Sun, 2 Jul 2023, John Cox wrote:
>
>> Also adds a filter_line3 method which on aarch64 neon yields approx 30%
>> speedup over 2xfilter_line and a memcpy
>>
>> Differences from v1:
>> .align 16 corrected to .balign 16
>> SXTW tolower
>> Mac ABI (hopefully) fixed
>> V register pop/push macroed & prettified
>>
>> John Cox (15):
>> avfilter/vf_bwdif: Add outline for aarch neon functions
>> avfilter/vf_bwdif: Add common macros and consts for aarch64 neon
>> avfilter/vf_bwdif: Export C filter_intra
>> avfilter/vf_bwdif: Add neon for filter_intra
>> tests/checkasm: Add test for vf_bwdif filter_intra
>> avfilter/vf_bwdif: Add clip and spatial macros for aarch64 neon
>> avfilter/vf_bwdif: Export C filter_edge
>> avfilter/vf_bwdif: Add neon for filter_edge
>> tests/checkasm: Add test for vf_bwdif filter_edge
>> avfilter/vf_bwdif: Export C filter_line
>> avfilter/vf_bwdif: Add neon for filter_line
>> avfilter/vf_bwdif: Add a filter_line3 method for optimisation
>> avfilter/vf_bwdif: Add neon for filter_line3
>> tests/checkasm: Add test for vf_bwdif filter_line3
>> avfilter/vf_bwdif: Block filter slices into a multiple of 4 lines
>
>Overall, I'd suggest squashing/reordering the patches like this:
>
>- tests/checkasm: Add test for vf_bwdif filter_intra
>- avfilter/vf_bwdif: Add neon for filter_intra
> (With the preceding patches squashed. For extra common macros, only add
> the ones you use in this patch here.)
>- tests/checkasm: Add test for vf_bwdif filter_edge
>- avfilter/vf_bwdif: Add neon for filter_edge (with other dependencies
> squashed)
>- avfilter/vf_bwdif: Add neon for filter_line
>- avfilter/vf_bwdif: Add a filter_line3 method for optimisation
> + checkasm test squashed
>- avfilter/vf_bwdif: Add neon for filter_line3

I'm happy with that if everyone else is - it is easy to merge patches -
harder to take them apart.

JC

>// Martin
[FFmpeg-devel] [PATCH v3 0/7] avfilter/vf_bwdif: Add aarch64 neon functions
Also adds a filter_line3 method which on aarch64 neon yields approx 30%
speedup over 2xfilter_line and a memcpy

Differences from v2:
coeffs moved into const segment
number of patches reduced

John Cox (7):
  tests/checkasm: Add test for vf_bwdif filter_intra
  avfilter/vf_bwdif: Add neon for filter_intra
  tests/checkasm: Add test for vf_bwdif filter_edge
  avfilter/vf_bwdif: Add neon for filter_edge
  avfilter/vf_bwdif: Add neon for filter_line Exports C filter_line
    needed for tail fixup of neon code
  avfilter/vf_bwdif: Add a filter_line3 method for optimisation
  avfilter/vf_bwdif: Add neon for filter_line3

 libavfilter/aarch64/Makefile                |   2 +
 libavfilter/aarch64/vf_bwdif_init_aarch64.c | 125 +++
 libavfilter/aarch64/vf_bwdif_neon.S         | 793
 libavfilter/bwdif.h                         |  20 +
 libavfilter/vf_bwdif.c                      |  70 +-
 tests/checkasm/vf_bwdif.c                   | 172 +
 6 files changed, 1167 insertions(+), 15 deletions(-)
 create mode 100644 libavfilter/aarch64/vf_bwdif_init_aarch64.c
 create mode 100644 libavfilter/aarch64/vf_bwdif_neon.S

-- 
2.39.2
[FFmpeg-devel] [PATCH v3 1/7] tests/checkasm: Add test for vf_bwdif filter_intra
Signed-off-by: John Cox
---
 tests/checkasm/vf_bwdif.c | 37 +
 1 file changed, 37 insertions(+)

diff --git a/tests/checkasm/vf_bwdif.c b/tests/checkasm/vf_bwdif.c
index 46224bb575..034bbabb4c 100644
--- a/tests/checkasm/vf_bwdif.c
+++ b/tests/checkasm/vf_bwdif.c
@@ -20,6 +20,7 @@
 #include "checkasm.h"
 #include "libavcodec/internal.h"
 #include "libavfilter/bwdif.h"
+#include "libavutil/mem_internal.h"
 
 #define WIDTH 256
 
@@ -81,4 +82,40 @@ void checkasm_check_vf_bwdif(void)
         BODY(uint16_t, 10);
         report("bwdif10");
     }
+
+    if (check_func(ctx_8.filter_intra, "bwdif8.intra")) {
+        LOCAL_ALIGNED_16(uint8_t, cur0, [11*WIDTH]);
+        LOCAL_ALIGNED_16(uint8_t, cur1, [11*WIDTH]);
+        LOCAL_ALIGNED_16(uint8_t, dst0, [WIDTH*3]);
+        LOCAL_ALIGNED_16(uint8_t, dst1, [WIDTH*3]);
+        const int stride = WIDTH;
+        const int mask = (1<<8)-1;
+
+        declare_func(void, void *dst1, void *cur1, int w, int prefs, int mrefs,
+                     int prefs3, int mrefs3, int parity, int clip_max);
+
+        randomize_buffers( cur0, cur1, mask, 11*WIDTH);
+        memset(dst0, 0xba, WIDTH * 3);
+        memset(dst1, 0xba, WIDTH * 3);
+
+        call_ref(dst0 + stride,
+                 cur0 + stride * 4, WIDTH,
+                 stride, -stride, stride * 3, -stride * 3,
+                 0, mask);
+        call_new(dst1 + stride,
+                 cur0 + stride * 4, WIDTH,
+                 stride, -stride, stride * 3, -stride * 3,
+                 0, mask);
+
+        if (memcmp(dst0, dst1, WIDTH*3)
+                || memcmp( cur0, cur1, WIDTH*11))
+            fail();
+
+        bench_new(dst1 + stride,
+                  cur0 + stride * 4, WIDTH,
+                  stride, -stride, stride * 3, -stride * 3,
+                  0, mask);
+
+        report("bwdif8.intra");
+    }
 }
-- 
2.39.2
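The test in this patch follows a common shape: fill both output buffers with a sentinel byte, run the reference and the new implementation on the middle row only, then memcmp the whole buffer so that a write into the rows above or below shows up as a mismatch. A minimal standalone sketch of that shape, not using checkasm itself; avg_ref, avg_new and outputs_match are hypothetical names invented for illustration and are not part of FFmpeg:

```c
#include <stdint.h>
#include <string.h>

#define WIDTH 256

/* Hypothetical reference: average of the rows above and below. */
static void avg_ref(uint8_t *dst, const uint8_t *cur, int w, int stride)
{
    for (int x = 0; x < w; x++)
        dst[x] = (uint8_t)((cur[x - stride] + cur[x + stride]) >> 1);
}

/* Hypothetical "optimised" version under test: same result, different form. */
static void avg_new(uint8_t *dst, const uint8_t *cur, int w, int stride)
{
    const uint8_t *a = cur - stride;
    const uint8_t *b = cur + stride;

    for (int x = 0; x < w; x++)
        dst[x] = (uint8_t)((a[x] & b[x]) + ((a[x] ^ b[x]) >> 1));
}

/* Returns 1 if both versions leave identical 3-row output buffers. */
static int outputs_match(void)
{
    static uint8_t cur[3 * WIDTH], dst0[3 * WIDTH], dst1[3 * WIDTH];

    /* Deterministic stand-in for randomize_buffers() */
    for (int i = 0; i < 3 * WIDTH; i++)
        cur[i] = (uint8_t)(i * 37 + 11);

    /* Sentinel value in every row, including the guard rows */
    memset(dst0, 0xba, sizeof(dst0));
    memset(dst1, 0xba, sizeof(dst1));

    /* Only the middle row is a legitimate write target */
    avg_ref(dst0 + WIDTH, cur + WIDTH, WIDTH, WIDTH);
    avg_new(dst1 + WIDTH, cur + WIDTH, WIDTH, WIDTH);

    /* Compare all three rows so writes outside the middle row are caught */
    return memcmp(dst0, dst1, sizeof(dst0)) == 0;
}
```

Comparing the full buffer rather than just the written row is what makes the 0xba fill useful.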
[FFmpeg-devel] [PATCH v3 2/7] avfilter/vf_bwdif: Add neon for filter_intra
Adds an outline for aarch neon functions Adds common macros and consts for aarch64 neon Exports C filter_intra needed for tail fixup of neon code Adds neon for filter_intra Signed-off-by: John Cox --- libavfilter/aarch64/Makefile| 2 + libavfilter/aarch64/vf_bwdif_init_aarch64.c | 56 libavfilter/aarch64/vf_bwdif_neon.S | 136 libavfilter/bwdif.h | 4 + libavfilter/vf_bwdif.c | 8 +- 5 files changed, 203 insertions(+), 3 deletions(-) create mode 100644 libavfilter/aarch64/vf_bwdif_init_aarch64.c create mode 100644 libavfilter/aarch64/vf_bwdif_neon.S diff --git a/libavfilter/aarch64/Makefile b/libavfilter/aarch64/Makefile index b58daa3a3f..b68209bc94 100644 --- a/libavfilter/aarch64/Makefile +++ b/libavfilter/aarch64/Makefile @@ -1,3 +1,5 @@ +OBJS-$(CONFIG_BWDIF_FILTER) += aarch64/vf_bwdif_init_aarch64.o OBJS-$(CONFIG_NLMEANS_FILTER)+= aarch64/vf_nlmeans_init.o +NEON-OBJS-$(CONFIG_BWDIF_FILTER) += aarch64/vf_bwdif_neon.o NEON-OBJS-$(CONFIG_NLMEANS_FILTER) += aarch64/vf_nlmeans_neon.o diff --git a/libavfilter/aarch64/vf_bwdif_init_aarch64.c b/libavfilter/aarch64/vf_bwdif_init_aarch64.c new file mode 100644 index 00..3ffaa07ab3 --- /dev/null +++ b/libavfilter/aarch64/vf_bwdif_init_aarch64.c @@ -0,0 +1,56 @@ +/* + * bwdif aarch64 NEON optimisations + * + * Copyright (c) 2023 John Cox + * + * This file is part of FFmpeg. + * + * FFmpeg is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2.1 of the License, or (at your option) any later version. + * + * FFmpeg is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details. 
+ * + * You should have received a copy of the GNU Lesser General Public + * License along with FFmpeg; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + */ + +#include "libavutil/common.h" +#include "libavfilter/bwdif.h" +#include "libavutil/aarch64/cpu.h" + +void ff_bwdif_filter_intra_neon(void *dst1, void *cur1, int w, int prefs, int mrefs, +int prefs3, int mrefs3, int parity, int clip_max); + + +static void filter_intra_helper(void *dst1, void *cur1, int w, int prefs, int mrefs, +int prefs3, int mrefs3, int parity, int clip_max) +{ +const int w0 = clip_max != 255 ? 0 : w & ~15; + +ff_bwdif_filter_intra_neon(dst1, cur1, w0, prefs, mrefs, prefs3, mrefs3, parity, clip_max); + +if (w0 < w) +ff_bwdif_filter_intra_c((char *)dst1 + w0, (char *)cur1 + w0, +w - w0, prefs, mrefs, prefs3, mrefs3, parity, clip_max); +} + +void +ff_bwdif_init_aarch64(BWDIFContext *s, int bit_depth) +{ +const int cpu_flags = av_get_cpu_flags(); + +if (bit_depth != 8) +return; + +if (!have_neon(cpu_flags)) +return; + +s->filter_intra = filter_intra_helper; +} + diff --git a/libavfilter/aarch64/vf_bwdif_neon.S b/libavfilter/aarch64/vf_bwdif_neon.S new file mode 100644 index 00..e288efbe6c --- /dev/null +++ b/libavfilter/aarch64/vf_bwdif_neon.S @@ -0,0 +1,136 @@ +/* + * bwdif aarch64 NEON optimisations + * + * Copyright (c) 2023 John Cox + * + * This file is part of FFmpeg. + * + * FFmpeg is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2.1 of the License, or (at your option) any later version. + * + * FFmpeg is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details. 
+ * + * You should have received a copy of the GNU Lesser General Public + * License along with FFmpeg; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + */ + + +#include "libavutil/aarch64/asm.S" + +// Space taken on the stack by an int (32-bit) +#ifdef __APPLE__ +.setSP_INT, 4 +#else +.setSP_INT, 8 +#endif + +.macro SQSHRUNN b, s0, s1, s2, s3, n +sqshrun \s0\().4h, \s0\().4s, #\n - 8 +sqshrun2\s0\().8h, \s1\().4s, #\n - 8 +sqshrun \s1\().4h, \s2\().4s, #\n - 8 +sqshrun2\s1\().8h, \s3\().4s, #\n - 8 +uzp2\b\().16b, \s0\().16b, \s1\().16b +.endm + +.macro SMULL4K a0, a1, a2, a3, s0, s1, k +smull \a0\().4s
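The helper in this patch gives the NEON routine the largest multiple-of-16 prefix of the row and leaves the remainder to the exported C function. A small sketch of just that arithmetic; neon_width and tail_width are hypothetical names used here for illustration, the expression itself is the one from filter_intra_helper:

```c
/* Pixels the 16-at-a-time asm is asked to process.
 * Anything other than 8-bit data (clip_max != 255) is left entirely to C. */
static int neon_width(int w, int clip_max)
{
    return clip_max != 255 ? 0 : w & ~15;
}

/* Pixels left over for the C tail fixup. */
static int tail_width(int w, int clip_max)
{
    return w - neon_width(w, clip_max);
}
```

For example a 1918-pixel 8-bit row would be split into 1904 pixels for the asm and a 14-pixel C tail, while a 1920-pixel row needs no tail at all.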
[FFmpeg-devel] [PATCH v3 3/7] tests/checkasm: Add test for vf_bwdif filter_edge
Signed-off-by: John Cox
---
 tests/checkasm/vf_bwdif.c | 54 +++
 1 file changed, 54 insertions(+)

diff --git a/tests/checkasm/vf_bwdif.c b/tests/checkasm/vf_bwdif.c
index 034bbabb4c..5fdba09fdc 100644
--- a/tests/checkasm/vf_bwdif.c
+++ b/tests/checkasm/vf_bwdif.c
@@ -83,6 +83,60 @@ void checkasm_check_vf_bwdif(void)
         report("bwdif10");
     }
 
+    {
+        LOCAL_ALIGNED_16(uint8_t, prev0, [11*WIDTH]);
+        LOCAL_ALIGNED_16(uint8_t, prev1, [11*WIDTH]);
+        LOCAL_ALIGNED_16(uint8_t, next0, [11*WIDTH]);
+        LOCAL_ALIGNED_16(uint8_t, next1, [11*WIDTH]);
+        LOCAL_ALIGNED_16(uint8_t, cur0, [11*WIDTH]);
+        LOCAL_ALIGNED_16(uint8_t, cur1, [11*WIDTH]);
+        LOCAL_ALIGNED_16(uint8_t, dst0, [WIDTH*3]);
+        LOCAL_ALIGNED_16(uint8_t, dst1, [WIDTH*3]);
+        const int stride = WIDTH;
+        const int mask = (1<<8)-1;
+        int spat;
+        int parity;
+
+        for (spat = 0; spat != 2; ++spat) {
+            for (parity = 0; parity != 2; ++parity) {
+                if (check_func(ctx_8.filter_edge, "bwdif8.edge.s%d.p%d", spat, parity)) {
+
+                    declare_func(void, void *dst1, void *prev1, void *cur1, void *next1,
+                                 int w, int prefs, int mrefs, int prefs2, int mrefs2,
+                                 int parity, int clip_max, int spat);
+
+                    randomize_buffers(prev0, prev1, mask, 11*WIDTH);
+                    randomize_buffers(next0, next1, mask, 11*WIDTH);
+                    randomize_buffers( cur0, cur1, mask, 11*WIDTH);
+                    memset(dst0, 0xba, WIDTH * 3);
+                    memset(dst1, 0xba, WIDTH * 3);
+
+                    call_ref(dst0 + stride,
+                             prev0 + stride * 4, cur0 + stride * 4, next0 + stride * 4, WIDTH,
+                             stride, -stride, stride * 2, -stride * 2,
+                             parity, mask, spat);
+                    call_new(dst1 + stride,
+                             prev1 + stride * 4, cur1 + stride * 4, next1 + stride * 4, WIDTH,
+                             stride, -stride, stride * 2, -stride * 2,
+                             parity, mask, spat);
+
+                    if (memcmp(dst0, dst1, WIDTH*3)
+                            || memcmp(prev0, prev1, WIDTH*11)
+                            || memcmp(next0, next1, WIDTH*11)
+                            || memcmp( cur0, cur1, WIDTH*11))
+                        fail();
+
+                    bench_new(dst1 + stride,
+                              prev1 + stride * 4, cur1 + stride * 4, next1 + stride * 4, WIDTH,
+                              stride, -stride, stride * 2, -stride * 2,
+                              parity, mask, spat);
+                }
+            }
+        }
+
+        report("bwdif8.edge");
+    }
+
     if (check_func(ctx_8.filter_intra, "bwdif8.intra")) {
         LOCAL_ALIGNED_16(uint8_t, cur0, [11*WIDTH]);
         LOCAL_ALIGNED_16(uint8_t, cur1, [11*WIDTH]);
-- 
2.39.2
[FFmpeg-devel] [PATCH v3 4/7] avfilter/vf_bwdif: Add neon for filter_edge
Adds clip and spatial macros for aarch64 neon Exports C filter_edge needed for tail fixup of neon code Adds neon for filter_edge Signed-off-by: John Cox --- libavfilter/aarch64/vf_bwdif_init_aarch64.c | 20 +++ libavfilter/aarch64/vf_bwdif_neon.S | 177 libavfilter/bwdif.h | 4 + libavfilter/vf_bwdif.c | 8 +- 4 files changed, 205 insertions(+), 4 deletions(-) diff --git a/libavfilter/aarch64/vf_bwdif_init_aarch64.c b/libavfilter/aarch64/vf_bwdif_init_aarch64.c index 3ffaa07ab3..e75cf2f204 100644 --- a/libavfilter/aarch64/vf_bwdif_init_aarch64.c +++ b/libavfilter/aarch64/vf_bwdif_init_aarch64.c @@ -24,10 +24,29 @@ #include "libavfilter/bwdif.h" #include "libavutil/aarch64/cpu.h" +void ff_bwdif_filter_edge_neon(void *dst1, void *prev1, void *cur1, void *next1, + int w, int prefs, int mrefs, int prefs2, int mrefs2, + int parity, int clip_max, int spat); + void ff_bwdif_filter_intra_neon(void *dst1, void *cur1, int w, int prefs, int mrefs, int prefs3, int mrefs3, int parity, int clip_max); +static void filter_edge_helper(void *dst1, void *prev1, void *cur1, void *next1, + int w, int prefs, int mrefs, int prefs2, int mrefs2, + int parity, int clip_max, int spat) +{ +const int w0 = clip_max != 255 ? 
0 : w & ~15; + +ff_bwdif_filter_edge_neon(dst1, prev1, cur1, next1, w0, prefs, mrefs, prefs2, mrefs2, + parity, clip_max, spat); + +if (w0 < w) +ff_bwdif_filter_edge_c((char *)dst1 + w0, (char *)prev1 + w0, (char *)cur1 + w0, (char *)next1 + w0, + w - w0, prefs, mrefs, prefs2, mrefs2, + parity, clip_max, spat); +} + static void filter_intra_helper(void *dst1, void *cur1, int w, int prefs, int mrefs, int prefs3, int mrefs3, int parity, int clip_max) { @@ -52,5 +71,6 @@ ff_bwdif_init_aarch64(BWDIFContext *s, int bit_depth) return; s->filter_intra = filter_intra_helper; +s->filter_edge = filter_edge_helper; } diff --git a/libavfilter/aarch64/vf_bwdif_neon.S b/libavfilter/aarch64/vf_bwdif_neon.S index e288efbe6c..389302b813 100644 --- a/libavfilter/aarch64/vf_bwdif_neon.S +++ b/libavfilter/aarch64/vf_bwdif_neon.S @@ -66,6 +66,79 @@ umlsl2 \a3\().4s, \s1\().8h, \k .endm +// int b = m2s1 - m1; +// int f = p2s1 - p1; +// int dc = c0s1 - m1; +// int de = c0s1 - p1; +// int sp_max = FFMIN(p1 - c0s1, m1 - c0s1); +// sp_max = FFMIN(sp_max, FFMAX(-b,-f)); +// int sp_min = FFMIN(c0s1 - p1, c0s1 - m1); +// sp_min = FFMIN(sp_min, FFMAX(b,f)); +// diff = diff == 0 ? 
0 : FFMAX3(diff, sp_min, sp_max); +.macro SPAT_CHECK diff, m2s1, m1, c0s1, p1, p2s1, t0, t1, t2, t3 +uqsub \t0\().16b, \p1\().16b, \c0s1\().16b +uqsub \t2\().16b, \m1\().16b, \c0s1\().16b +umin\t2\().16b, \t0\().16b, \t2\().16b + +uqsub \t1\().16b, \m1\().16b, \m2s1\().16b +uqsub \t3\().16b, \p1\().16b, \p2s1\().16b +umax\t3\().16b, \t3\().16b, \t1\().16b +umin\t3\().16b, \t3\().16b, \t2\().16b + +uqsub \t0\().16b, \c0s1\().16b, \p1\().16b +uqsub \t2\().16b, \c0s1\().16b, \m1\().16b +umin\t2\().16b, \t0\().16b, \t2\().16b + +uqsub \t1\().16b, \m2s1\().16b, \m1\().16b +uqsub \t0\().16b, \p2s1\().16b, \p1\().16b +umax\t0\().16b, \t0\().16b, \t1\().16b +umin\t2\().16b, \t2\().16b, \t0\().16b + +cmeq\t1\().16b, \diff\().16b, #0 +umax\diff\().16b, \diff\().16b, \t3\().16b +umax\diff\().16b, \diff\().16b, \t2\().16b +bic \diff\().16b, \diff\().16b, \t1\().16b +.endm + +// i0 = s0; +// if (i0 > d0 + diff0) +// i0 = d0 + diff0; +// else if (i0 < d0 - diff0) +// i0 = d0 - diff0; +// +// i0 = s0 is safe +.macro DIFF_CLIP i0, s0, d0, diff, t0, t1 +uqadd \t0\().16b, \d0\().16b, \diff\().16b +uqsub \t1\().16b, \d0\().16b, \diff\().16b +umin\i0\().16b, \s0\().16b, \t0\().16b +umax\i0\().16b, \i0\().16b, \t1\().16b +.endm + +// i0 = FFABS(m1 - p1) > td0 ? i1 : i2; +// DIFF_CLIP +// +// i0 = i1 is safe +.macro INTERPOL i0, i1, i2, m1, d0, p1, td0, diff, t0, t1, t2 +uabd\t0\().16b, \m1\().16b, \p1\().16b +cmhi\t0\().16b, \t0\().16b, \td0\().16b +bsl \t0\().16b, \i1\().16b, \i2\().16b +DIFF_CLIP \i0, \t0, \d0, \diff, \t1, \t
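The DIFF_CLIP macro replaces the two-sided scalar clamp from its comment with saturating add/subtract followed by umin/umax. The following scalar sketch models one 8-bit lane to show why the two forms should agree for unsigned 8-bit inputs; uqadd8, uqsub8, diff_clip_ref, diff_clip_sat and count_mismatches are names made up for this illustration and only model the instructions, they are not FFmpeg functions:

```c
#include <stdint.h>

/* One-lane models of the NEON saturating unsigned add and subtract. */
static uint8_t uqadd8(uint8_t a, uint8_t b)
{
    int s = a + b;
    return (uint8_t)(s > 255 ? 255 : s);
}

static uint8_t uqsub8(uint8_t a, uint8_t b)
{
    return (uint8_t)(a > b ? a - b : 0);
}

/* Scalar form from the comment above the macro, computed in int. */
static int diff_clip_ref(int s0, int d0, int diff0)
{
    int i0 = s0;

    if (i0 > d0 + diff0)
        i0 = d0 + diff0;
    else if (i0 < d0 - diff0)
        i0 = d0 - diff0;
    return i0;
}

/* Lane form of the macro: umin against d0 + diff, then umax against d0 - diff. */
static uint8_t diff_clip_sat(uint8_t s0, uint8_t d0, uint8_t diff)
{
    const uint8_t hi = uqadd8(d0, diff);
    const uint8_t lo = uqsub8(d0, diff);
    const uint8_t i0 = s0 < hi ? s0 : hi;

    return i0 > lo ? i0 : lo;
}

/* Counts the (s0, d0, diff) triples in 0..255 where the two forms disagree. */
static long count_mismatches(void)
{
    long bad = 0;

    for (int s = 0; s < 256; s++)
        for (int d = 0; d < 256; d++)
            for (int k = 0; k < 256; k++)
                bad += diff_clip_ref(s, d, k) !=
                       diff_clip_sat((uint8_t)s, (uint8_t)d, (uint8_t)k);
    return bad;
}
```

Saturation is harmless here because s0 itself lies in 0..255: a bound that saturates at 255 or 0 is one that s0 could never have crossed anyway.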
[FFmpeg-devel] [PATCH v3 6/7] avfilter/vf_bwdif: Add a filter_line3 method for optimisation
Add an optional filter_line3 to the available optimisations.

filter_line3 is equivalent to filter_line, memcpy, filter_line

filter_line shares quite a number of loads and some calculations in
common with its next iteration and testing shows that using aarch64 neon
filter_line3's performance is 30% better than two filter_lines
and a memcpy.

Adds a test for vf_bwdif filter_line3 to checkasm

Rounds job start lines down to a multiple of 4. This means that if
filter_line3 exists then filter_line will not sometimes be called
once at the end of a slice depending on thread count. The final slice
may do up to 3 extra lines but filter_edge is faster than filter_line
so it is unlikely to create any noticeable thread load variation.

Signed-off-by: John Cox
---
 libavfilter/bwdif.h       |  7
 libavfilter/vf_bwdif.c    | 44 +++--
 tests/checkasm/vf_bwdif.c | 81 +++
 3 files changed, 129 insertions(+), 3 deletions(-)

diff --git a/libavfilter/bwdif.h b/libavfilter/bwdif.h
index cce99953f3..496cec72ef 100644
--- a/libavfilter/bwdif.h
+++ b/libavfilter/bwdif.h
@@ -35,6 +35,9 @@ typedef struct BWDIFContext {
     void (*filter_edge)(void *dst, void *prev, void *cur, void *next,
                         int w, int prefs, int mrefs, int prefs2, int mrefs2,
                         int parity, int clip_max, int spat);
+    void (*filter_line3)(void *dst, int dstride,
+                         const void *prev, const void *cur, const void *next, int prefs,
+                         int w, int parity, int clip_max);
 } BWDIFContext;
 
 void ff_bwdif_init_filter_line(BWDIFContext *bwdif, int bit_depth);
@@ -53,4 +56,8 @@ void ff_bwdif_filter_line_c(void *dst1, void *prev1, void *cur1, void *next1,
                             int prefs3, int mrefs3, int prefs4, int mrefs4,
                             int parity, int clip_max);
 
+void ff_bwdif_filter_line3_c(void * dst1, int d_stride,
+                             const void * prev1, const void * cur1, const void * next1, int s_stride,
+                             int w, int parity, int clip_max);
+
 #endif /* AVFILTER_BWDIF_H */
diff --git a/libavfilter/vf_bwdif.c b/libavfilter/vf_bwdif.c
index 26349da1fd..6701208efe 100644
--- a/libavfilter/vf_bwdif.c
+++
b/libavfilter/vf_bwdif.c @@ -150,6 +150,31 @@ void ff_bwdif_filter_line_c(void *dst1, void *prev1, void *cur1, void *next1, FILTER2() } +#define NEXT_LINE()\ +dst += d_stride; \ +prev += prefs; \ +cur += prefs; \ +next += prefs; + +void ff_bwdif_filter_line3_c(void * dst1, int d_stride, + const void * prev1, const void * cur1, const void * next1, int s_stride, + int w, int parity, int clip_max) +{ +const int prefs = s_stride; +uint8_t * dst = dst1; +const uint8_t * prev = prev1; +const uint8_t * cur = cur1; +const uint8_t * next = next1; + +ff_bwdif_filter_line_c(dst, (void*)prev, (void*)cur, (void*)next, w, + prefs, -prefs, prefs * 2, - prefs * 2, prefs * 3, -prefs * 3, prefs * 4, -prefs * 4, parity, clip_max); +NEXT_LINE(); +memcpy(dst, cur, w); +NEXT_LINE(); +ff_bwdif_filter_line_c(dst, (void*)prev, (void*)cur, (void*)next, w, + prefs, -prefs, prefs * 2, - prefs * 2, prefs * 3, -prefs * 3, prefs * 4, -prefs * 4, parity, clip_max); +} + void ff_bwdif_filter_edge_c(void *dst1, void *prev1, void *cur1, void *next1, int w, int prefs, int mrefs, int prefs2, int mrefs2, int parity, int clip_max, int spat) @@ -212,6 +237,13 @@ static void filter_edge_16bit(void *dst1, void *prev1, void *cur1, void *next1, FILTER2() } +// Round job start line down to multiple of 4 so that if filter_line3 exists +// and the frame is a multiple of 4 high then filter_line will never be called +static inline int job_start(const int jobnr, const int nb_jobs, const int h) +{ +return jobnr >= nb_jobs ? 
h : ((h * jobnr) / nb_jobs) & ~3; +} + static int filter_slice(AVFilterContext *ctx, void *arg, int jobnr, int nb_jobs) { BWDIFContext *s = ctx->priv; @@ -221,8 +253,8 @@ static int filter_slice(AVFilterContext *ctx, void *arg, int jobnr, int nb_jobs) int clip_max = (1 << (yadif->csp->comp[td->plane].depth)) - 1; int df = (yadif->csp->comp[td->plane].depth + 7) / 8; int refs = linesize / df; -int slice_start = (td->h * jobnr ) / nb_jobs; -int slice_end = (td->h * (jobnr+1)) / nb_jobs; +int slice_start = job_start(jobnr, nb_jobs, td->h); +int slice_end = job_start(jobnr + 1, nb_jobs, td->h); int y; for (y = slice_start; y < slice_end; y++) { @@ -244,6 +276,11 @@ static int filter_slice(AVFilterContext *ctx, void *arg, int jobnr, int nb_jobs) refs << 1, -(refs << 1),
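To see what the slice rounding in this patch does to the per-thread split, here is a standalone copy of the job_start() expression together with a small helper; job_lines is a hypothetical name added for illustration and is not in the patch:

```c
/* Same expression as job_start() in the patch: every slice start is rounded
 * down to a multiple of 4, and the end of the last job is the frame height. */
static int job_start(int jobnr, int nb_jobs, int h)
{
    return jobnr >= nb_jobs ? h : ((h * jobnr) / nb_jobs) & ~3;
}

/* Hypothetical helper: number of lines handed to job jobnr. */
static int job_lines(int jobnr, int nb_jobs, int h)
{
    return job_start(jobnr + 1, nb_jobs, h) - job_start(jobnr, nb_jobs, h);
}
```

With h = 1080 and 7 jobs the slice starts come out as 0, 152, 308, 460, 616, 768 and 924, so every slice begins on a multiple of 4 and the last one picks up the remaining 156 lines.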
[FFmpeg-devel] [PATCH v3 5/7] avfilter/vf_bwdif: Add neon for filter_line Exports C filter_line needed for tail fixup of neon code
Signed-off-by: John Cox --- libavfilter/aarch64/vf_bwdif_init_aarch64.c | 21 ++ libavfilter/aarch64/vf_bwdif_neon.S | 208 libavfilter/bwdif.h | 5 + libavfilter/vf_bwdif.c | 10 +- 4 files changed, 239 insertions(+), 5 deletions(-) diff --git a/libavfilter/aarch64/vf_bwdif_init_aarch64.c b/libavfilter/aarch64/vf_bwdif_init_aarch64.c index e75cf2f204..21e67884ab 100644 --- a/libavfilter/aarch64/vf_bwdif_init_aarch64.c +++ b/libavfilter/aarch64/vf_bwdif_init_aarch64.c @@ -31,6 +31,26 @@ void ff_bwdif_filter_edge_neon(void *dst1, void *prev1, void *cur1, void *next1, void ff_bwdif_filter_intra_neon(void *dst1, void *cur1, int w, int prefs, int mrefs, int prefs3, int mrefs3, int parity, int clip_max); +void ff_bwdif_filter_line_neon(void *dst1, void *prev1, void *cur1, void *next1, + int w, int prefs, int mrefs, int prefs2, int mrefs2, + int prefs3, int mrefs3, int prefs4, int mrefs4, + int parity, int clip_max); + + +static void filter_line_helper(void *dst1, void *prev1, void *cur1, void *next1, + int w, int prefs, int mrefs, int prefs2, int mrefs2, + int prefs3, int mrefs3, int prefs4, int mrefs4, + int parity, int clip_max) +{ +const int w0 = clip_max != 255 ? 
0 : w & ~15; + +ff_bwdif_filter_line_neon(dst1, prev1, cur1, next1, + w0, prefs, mrefs, prefs2, mrefs2, prefs3, mrefs3, prefs4, mrefs4, parity, clip_max); + +if (w0 < w) +ff_bwdif_filter_line_c((char *)dst1 + w0, (char *)prev1 + w0, (char *)cur1 + w0, (char *)next1 + w0, + w - w0, prefs, mrefs, prefs2, mrefs2, prefs3, mrefs3, prefs4, mrefs4, parity, clip_max); +} static void filter_edge_helper(void *dst1, void *prev1, void *cur1, void *next1, int w, int prefs, int mrefs, int prefs2, int mrefs2, @@ -71,6 +91,7 @@ ff_bwdif_init_aarch64(BWDIFContext *s, int bit_depth) return; s->filter_intra = filter_intra_helper; +s->filter_line = filter_line_helper; s->filter_edge = filter_edge_helper; } diff --git a/libavfilter/aarch64/vf_bwdif_neon.S b/libavfilter/aarch64/vf_bwdif_neon.S index 389302b813..ae5f09c511 100644 --- a/libavfilter/aarch64/vf_bwdif_neon.S +++ b/libavfilter/aarch64/vf_bwdif_neon.S @@ -154,6 +154,214 @@ const coeffs, align=4 // align 4 means align on 2^4 boundry .hword 5077, 981 // sp[0] = v0.h[6] endconst +// === +// +// void filter_line( +// void *dst1, // x0 +// void *prev1,// x1 +// void *cur1, // x2 +// void *next1,// x3 +// int w, // w4 +// int prefs, // w5 +// int mrefs, // w6 +// int prefs2, // w7 +// int mrefs2, // [sp, #0] +// int prefs3, // [sp, #SP_INT] +// int mrefs3, // [sp, #SP_INT*2] +// int prefs4, // [sp, #SP_INT*3] +// int mrefs4, // [sp, #SP_INT*4] +// int parity, // [sp, #SP_INT*5] +// int clip_max) // [sp, #SP_INT*6] + +function ff_bwdif_filter_line_neon, export=1 +// Sanity check w +cmp w4, #0 +ble 99f + +// Rearrange regs to be the same as line3 for ease of debug! 
+mov w10, w4 // w10 = loop count +mov w9, w6 // w9 = mref +mov w12, w7 // w12 = pref2 +mov w11, w5 // w11 = pref +ldr w8, [sp, #0] // w8 = mref2 +ldr w7, [sp, #SP_INT*2]// w7 = mref3 +ldr w6, [sp, #SP_INT*4]// w6 = mref4 +ldr w13, [sp, #SP_INT] // w13 = pref3 +ldr w14, [sp, #SP_INT*3]// w14 = pref4 + +mov x4, x3 +mov x3, x2 +mov x2, x1 + +LDR_COEFFS v0, x17 + +// #define prev2 cur +//const uint8_t * restrict next2 = parity ? prev : next; +ldr w17, [sp, #SP_INT*5]// parity +cmp w17, #0 +cselx17, x2, x4, ne + +PUSH_VREGS + +// for (x = 0; x < w; x++) { +// int diff0, diff2; +// int d0, d2; +// int temporal_diff0, temporal_diff2; +// +// int i1, i2; +// int j1, j2; +// int p6, p5, p4, p3, p2, p1, c0, m1, m2, m3, m4; + +10: +// c0 = prev2[0] + next2[0];// c0 = v20, v21 +//
[FFmpeg-devel] [PATCH v3 7/7] avfilter/vf_bwdif: Add neon for filter_line3
Signed-off-by: John Cox --- libavfilter/aarch64/vf_bwdif_init_aarch64.c | 28 ++ libavfilter/aarch64/vf_bwdif_neon.S | 272 2 files changed, 300 insertions(+) diff --git a/libavfilter/aarch64/vf_bwdif_init_aarch64.c b/libavfilter/aarch64/vf_bwdif_init_aarch64.c index 21e67884ab..f52bc4b9b4 100644 --- a/libavfilter/aarch64/vf_bwdif_init_aarch64.c +++ b/libavfilter/aarch64/vf_bwdif_init_aarch64.c @@ -36,6 +36,33 @@ void ff_bwdif_filter_line_neon(void *dst1, void *prev1, void *cur1, void *next1, int prefs3, int mrefs3, int prefs4, int mrefs4, int parity, int clip_max); +void ff_bwdif_filter_line3_neon(void * dst1, int d_stride, +const void * prev1, const void * cur1, const void * next1, int s_stride, +int w, int parity, int clip_max); + + +static void filter_line3_helper(void * dst1, int d_stride, +const void * prev1, const void * cur1, const void * next1, int s_stride, +int w, int parity, int clip_max) +{ +// Asm works on 16 byte chunks +// If w is a multiple of 16 then all is good - if not then if width rounded +// up to nearest 16 will fit in both src & dst strides then allow the asm +// to write over the padding bytes as that is almost certainly faster than +// having to invoke the C version to clean up the tail. +const int w1 = FFALIGN(w, 16); +const int w0 = clip_max != 255 ? 0 : + d_stride <= w1 && s_stride <= w1 ? 
w : w & ~15; + +ff_bwdif_filter_line3_neon(dst1, d_stride, + prev1, cur1, next1, s_stride, + w0, parity, clip_max); + +if (w0 < w) +ff_bwdif_filter_line3_c((char *)dst1 + w0, d_stride, +(const char *)prev1 + w0, (const char *)cur1 + w0, (const char *)next1 + w0, s_stride, +w - w0, parity, clip_max); +} static void filter_line_helper(void *dst1, void *prev1, void *cur1, void *next1, int w, int prefs, int mrefs, int prefs2, int mrefs2, @@ -93,5 +120,6 @@ ff_bwdif_init_aarch64(BWDIFContext *s, int bit_depth) s->filter_intra = filter_intra_helper; s->filter_line = filter_line_helper; s->filter_edge = filter_edge_helper; +s->filter_line3 = filter_line3_helper; } diff --git a/libavfilter/aarch64/vf_bwdif_neon.S b/libavfilter/aarch64/vf_bwdif_neon.S index ae5f09c511..bc092477b9 100644 --- a/libavfilter/aarch64/vf_bwdif_neon.S +++ b/libavfilter/aarch64/vf_bwdif_neon.S @@ -154,6 +154,278 @@ const coeffs, align=4 // align 4 means align on 2^4 boundry .hword 5077, 981 // sp[0] = v0.h[6] endconst +// === +// +// void ff_bwdif_filter_line3_neon( +// void * dst1, // x0 +// int d_stride,// w1 +// const void * prev1, // x2 +// const void * cur1, // x3 +// const void * next1, // x4 +// int s_stride,// w5 +// int w, // w6 +// int parity, // w7 +// int clip_max); // [sp, #0] (Ignored) + +function ff_bwdif_filter_line3_neon, export=1 +// Sanity check w +cmp w6, #0 +ble 99f + +LDR_COEFFS v0, x17 + +// #define prev2 cur +//const uint8_t * restrict next2 = parity ? 
prev : next; +cmp w7, #0 +cselx17, x2, x4, ne + +// We want all the V registers - save all the ones we must +PUSH_VREGS + +// Some rearrangement of initial values for nice layout of refs in regs +mov w10, w6 // w10 = loop count +neg w9, w5 // w9 = mref +lsl w8, w9, #1// w8 = mref2 +add w7, w9, w9, LSL #1// w7 = mref3 +lsl w6, w9, #2// w6 = mref4 +mov w11, w5 // w11 = pref +lsl w12, w5, #1// w12 = pref2 +add w13, w5, w5, LSL #1// w13 = pref3 +lsl w14, w5, #2// w14 = pref4 +add w15, w5, w5, LSL #2// w15 = pref5 +add w16, w14, w12 // w16 = pref6 + +lsl w5, w1, #1// w5 = d_stride * 2 + +// for (x = 0; x < w; x++) { +// int diff0, diff2; +// int d0, d2; +// int temporal_diff0, temporal_diff2; +// +// int i1, i2; +// int j1, j2; +//
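The comment in filter_line3_helper describes letting the asm run over the row padding when the width rounded up to 16 still fits inside both strides. A scalar sketch of that decision as the prose describes it, assuming positive strides; asm_width and ALIGN16 are hypothetical names for illustration. Note that the comparison direction below follows the wording of the comment, whereas the patch text as posted compares d_stride <= w1 and s_stride <= w1, so treat the direction here as an assumption rather than as a description of the submitted code:

```c
/* Round x up to the next multiple of 16 (stand-in for FFALIGN(x, 16)). */
#define ALIGN16(x) (((x) + 15) & ~15)

/* Returns how many pixels to hand to a routine that works in 16-byte chunks.
 * Assumption: strides are positive and measured in bytes of 8-bit pixels. */
static int asm_width(int w, int d_stride, int s_stride, int clip_max)
{
    const int w1 = ALIGN16(w);

    if (clip_max != 255)
        return 0;               /* not 8-bit: leave the whole row to C */
    if (w1 <= d_stride && w1 <= s_stride)
        return w;               /* overrun lands in row padding, no C tail */
    return w & ~15;             /* otherwise stop at the last full chunk */
}
```

Under this reading a 1918-pixel row with 1920-byte strides goes to the asm in one piece, while the same row with 1918-byte strides is cut at 1904 and the last 14 pixels are left for the C version.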
Re: [FFmpeg-devel] [PATCH v2 05/15] tests/checkasm: Add test for vf_bwdif filter_intra
On Mon, 3 Jul 2023 00:14:16 +0300 (EEST), you wrote:

>[snip]
>It's a bit of a shame that this only tests things for 8 bit, not 10, but I
>guess that's better than nothing. The way the current code is set up to
>template both variants of the tests isn't very neat either...

Is there actually >8-bit interlaced content out in the wild? I've never
seen a single clip. If so where does it come from?

Just curious

JC

>// Martin
[FFmpeg-devel] [PATCH v4 0/7] avfilter/vf_bwdif: Add aarch64 neon functions
Also adds a filter_line3 method which on aarch64 neon yields approx 30%
speedup over 2xfilter_line and a memcpy

Differences from v3:
Remove a few lines of neon in filter_line that should have been removed
when copying from line3

Sorry about the two patch sets in quick succession, but I think I've
applied all the requested changes and I didn't want this mistake in the
final patchset. (The mistake was benign - it just wasted a few cycles.)

John Cox (7):
  tests/checkasm: Add test for vf_bwdif filter_intra
  avfilter/vf_bwdif: Add neon for filter_intra
  tests/checkasm: Add test for vf_bwdif filter_edge
  avfilter/vf_bwdif: Add neon for filter_edge
  avfilter/vf_bwdif: Add neon for filter_line
  avfilter/vf_bwdif: Add a filter_line3 method for optimisation
  avfilter/vf_bwdif: Add neon for filter_line3

 libavfilter/aarch64/Makefile                |   2 +
 libavfilter/aarch64/vf_bwdif_init_aarch64.c | 125
 libavfilter/aarch64/vf_bwdif_neon.S         | 788
 libavfilter/bwdif.h                         |  20 +
 libavfilter/vf_bwdif.c                      |  70 +-
 tests/checkasm/vf_bwdif.c                   | 172 +
 6 files changed, 1162 insertions(+), 15 deletions(-)
 create mode 100644 libavfilter/aarch64/vf_bwdif_init_aarch64.c
 create mode 100644 libavfilter/aarch64/vf_bwdif_neon.S

-- 
2.39.2
[FFmpeg-devel] [PATCH v4 1/7] tests/checkasm: Add test for vf_bwdif filter_intra
Signed-off-by: John Cox
---
 tests/checkasm/vf_bwdif.c | 37 +
 1 file changed, 37 insertions(+)

diff --git a/tests/checkasm/vf_bwdif.c b/tests/checkasm/vf_bwdif.c
index 46224bb575..034bbabb4c 100644
--- a/tests/checkasm/vf_bwdif.c
+++ b/tests/checkasm/vf_bwdif.c
@@ -20,6 +20,7 @@
 #include "checkasm.h"
 #include "libavcodec/internal.h"
 #include "libavfilter/bwdif.h"
+#include "libavutil/mem_internal.h"
 
 #define WIDTH 256
 
@@ -81,4 +82,40 @@ void checkasm_check_vf_bwdif(void)
         BODY(uint16_t, 10);
         report("bwdif10");
     }
+
+    if (check_func(ctx_8.filter_intra, "bwdif8.intra")) {
+        LOCAL_ALIGNED_16(uint8_t, cur0, [11*WIDTH]);
+        LOCAL_ALIGNED_16(uint8_t, cur1, [11*WIDTH]);
+        LOCAL_ALIGNED_16(uint8_t, dst0, [WIDTH*3]);
+        LOCAL_ALIGNED_16(uint8_t, dst1, [WIDTH*3]);
+        const int stride = WIDTH;
+        const int mask = (1<<8)-1;
+
+        declare_func(void, void *dst1, void *cur1, int w, int prefs, int mrefs,
+                     int prefs3, int mrefs3, int parity, int clip_max);
+
+        randomize_buffers( cur0, cur1, mask, 11*WIDTH);
+        memset(dst0, 0xba, WIDTH * 3);
+        memset(dst1, 0xba, WIDTH * 3);
+
+        call_ref(dst0 + stride,
+                 cur0 + stride * 4, WIDTH,
+                 stride, -stride, stride * 3, -stride * 3,
+                 0, mask);
+        call_new(dst1 + stride,
+                 cur0 + stride * 4, WIDTH,
+                 stride, -stride, stride * 3, -stride * 3,
+                 0, mask);
+
+        if (memcmp(dst0, dst1, WIDTH*3)
+                || memcmp( cur0, cur1, WIDTH*11))
+            fail();
+
+        bench_new(dst1 + stride,
+                  cur0 + stride * 4, WIDTH,
+                  stride, -stride, stride * 3, -stride * 3,
+                  0, mask);
+
+        report("bwdif8.intra");
+    }
 }
-- 
2.39.2
[FFmpeg-devel] [PATCH v4 2/7] avfilter/vf_bwdif: Add neon for filter_intra
Adds an outline for aarch neon functions Adds common macros and consts for aarch64 neon Exports C filter_intra needed for tail fixup of neon code Adds neon for filter_intra Signed-off-by: John Cox --- libavfilter/aarch64/Makefile| 2 + libavfilter/aarch64/vf_bwdif_init_aarch64.c | 56 libavfilter/aarch64/vf_bwdif_neon.S | 136 libavfilter/bwdif.h | 4 + libavfilter/vf_bwdif.c | 8 +- 5 files changed, 203 insertions(+), 3 deletions(-) create mode 100644 libavfilter/aarch64/vf_bwdif_init_aarch64.c create mode 100644 libavfilter/aarch64/vf_bwdif_neon.S diff --git a/libavfilter/aarch64/Makefile b/libavfilter/aarch64/Makefile index b58daa3a3f..b68209bc94 100644 --- a/libavfilter/aarch64/Makefile +++ b/libavfilter/aarch64/Makefile @@ -1,3 +1,5 @@ +OBJS-$(CONFIG_BWDIF_FILTER) += aarch64/vf_bwdif_init_aarch64.o OBJS-$(CONFIG_NLMEANS_FILTER)+= aarch64/vf_nlmeans_init.o +NEON-OBJS-$(CONFIG_BWDIF_FILTER) += aarch64/vf_bwdif_neon.o NEON-OBJS-$(CONFIG_NLMEANS_FILTER) += aarch64/vf_nlmeans_neon.o diff --git a/libavfilter/aarch64/vf_bwdif_init_aarch64.c b/libavfilter/aarch64/vf_bwdif_init_aarch64.c new file mode 100644 index 00..3ffaa07ab3 --- /dev/null +++ b/libavfilter/aarch64/vf_bwdif_init_aarch64.c @@ -0,0 +1,56 @@ +/* + * bwdif aarch64 NEON optimisations + * + * Copyright (c) 2023 John Cox + * + * This file is part of FFmpeg. + * + * FFmpeg is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2.1 of the License, or (at your option) any later version. + * + * FFmpeg is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details. 
+ * + * You should have received a copy of the GNU Lesser General Public + * License along with FFmpeg; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + */ + +#include "libavutil/common.h" +#include "libavfilter/bwdif.h" +#include "libavutil/aarch64/cpu.h" + +void ff_bwdif_filter_intra_neon(void *dst1, void *cur1, int w, int prefs, int mrefs, +int prefs3, int mrefs3, int parity, int clip_max); + + +static void filter_intra_helper(void *dst1, void *cur1, int w, int prefs, int mrefs, +int prefs3, int mrefs3, int parity, int clip_max) +{ +const int w0 = clip_max != 255 ? 0 : w & ~15; + +ff_bwdif_filter_intra_neon(dst1, cur1, w0, prefs, mrefs, prefs3, mrefs3, parity, clip_max); + +if (w0 < w) +ff_bwdif_filter_intra_c((char *)dst1 + w0, (char *)cur1 + w0, +w - w0, prefs, mrefs, prefs3, mrefs3, parity, clip_max); +} + +void +ff_bwdif_init_aarch64(BWDIFContext *s, int bit_depth) +{ +const int cpu_flags = av_get_cpu_flags(); + +if (bit_depth != 8) +return; + +if (!have_neon(cpu_flags)) +return; + +s->filter_intra = filter_intra_helper; +} + diff --git a/libavfilter/aarch64/vf_bwdif_neon.S b/libavfilter/aarch64/vf_bwdif_neon.S new file mode 100644 index 00..e288efbe6c --- /dev/null +++ b/libavfilter/aarch64/vf_bwdif_neon.S @@ -0,0 +1,136 @@ +/* + * bwdif aarch64 NEON optimisations + * + * Copyright (c) 2023 John Cox + * + * This file is part of FFmpeg. + * + * FFmpeg is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2.1 of the License, or (at your option) any later version. + * + * FFmpeg is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details. 
+ * + * You should have received a copy of the GNU Lesser General Public + * License along with FFmpeg; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + */ + + +#include "libavutil/aarch64/asm.S" + +// Space taken on the stack by an int (32-bit) +#ifdef __APPLE__ +.set SP_INT, 4 +#else +.set SP_INT, 8 +#endif + +.macro SQSHRUNN b, s0, s1, s2, s3, n +sqshrun \s0\().4h, \s0\().4s, #\n - 8 +sqshrun2 \s0\().8h, \s1\().4s, #\n - 8 +sqshrun \s1\().4h, \s2\().4s, #\n - 8 +sqshrun2 \s1\().8h, \s3\().4s, #\n - 8 +uzp2 \b\().16b, \s0\().16b, \s1\().16b +.endm + +.macro SMULL4K a0, a1, a2, a3, s0, s1, k +smull \a0\().4s
[FFmpeg-devel] [PATCH v4 3/7] tests/checkasm: Add test for vf_bwdif filter_edge
Signed-off-by: John Cox --- tests/checkasm/vf_bwdif.c | 54 +++ 1 file changed, 54 insertions(+) diff --git a/tests/checkasm/vf_bwdif.c b/tests/checkasm/vf_bwdif.c index 034bbabb4c..5fdba09fdc 100644 --- a/tests/checkasm/vf_bwdif.c +++ b/tests/checkasm/vf_bwdif.c @@ -83,6 +83,60 @@ void checkasm_check_vf_bwdif(void) report("bwdif10"); } +{ +LOCAL_ALIGNED_16(uint8_t, prev0, [11*WIDTH]); +LOCAL_ALIGNED_16(uint8_t, prev1, [11*WIDTH]); +LOCAL_ALIGNED_16(uint8_t, next0, [11*WIDTH]); +LOCAL_ALIGNED_16(uint8_t, next1, [11*WIDTH]); +LOCAL_ALIGNED_16(uint8_t, cur0, [11*WIDTH]); +LOCAL_ALIGNED_16(uint8_t, cur1, [11*WIDTH]); +LOCAL_ALIGNED_16(uint8_t, dst0, [WIDTH*3]); +LOCAL_ALIGNED_16(uint8_t, dst1, [WIDTH*3]); +const int stride = WIDTH; +const int mask = (1<<8)-1; +int spat; +int parity; + +for (spat = 0; spat != 2; ++spat) { +for (parity = 0; parity != 2; ++parity) { +if (check_func(ctx_8.filter_edge, "bwdif8.edge.s%d.p%d", spat, parity)) { + +declare_func(void, void *dst1, void *prev1, void *cur1, void *next1, +int w, int prefs, int mrefs, int prefs2, int mrefs2, +int parity, int clip_max, int spat); + +randomize_buffers(prev0, prev1, mask, 11*WIDTH); +randomize_buffers(next0, next1, mask, 11*WIDTH); +randomize_buffers( cur0, cur1, mask, 11*WIDTH); +memset(dst0, 0xba, WIDTH * 3); +memset(dst1, 0xba, WIDTH * 3); + +call_ref(dst0 + stride, + prev0 + stride * 4, cur0 + stride * 4, next0 + stride * 4, WIDTH, + stride, -stride, stride * 2, -stride * 2, + parity, mask, spat); +call_new(dst1 + stride, + prev1 + stride * 4, cur1 + stride * 4, next1 + stride * 4, WIDTH, + stride, -stride, stride * 2, -stride * 2, + parity, mask, spat); + +if (memcmp(dst0, dst1, WIDTH*3) +|| memcmp(prev0, prev1, WIDTH*11) +|| memcmp(next0, next1, WIDTH*11) +|| memcmp( cur0, cur1, WIDTH*11)) +fail(); + +bench_new(dst1 + stride, + prev1 + stride * 4, cur1 + stride * 4, next1 + stride * 4, WIDTH, + stride, -stride, stride * 2, -stride * 2, + parity, mask, spat); +} +} +} + +report("bwdif8.edge"); +} + 
 if (check_func(ctx_8.filter_intra, "bwdif8.intra")) { LOCAL_ALIGNED_16(uint8_t, cur0, [11*WIDTH]); LOCAL_ALIGNED_16(uint8_t, cur1, [11*WIDTH]); -- 2.39.2
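For anyone reading the helpers in vf_bwdif_init_aarch64.c without the asm to hand, the width split they perform can be sketched in plain C as below. This is only an illustration, not code from the patch: copy_blocks16, copy_scalar, vector_width and copy_row are hypothetical names, and the "vector" routine is an ordinary loop standing in for a NEON function that handles only whole 16-byte blocks. The one thing taken from the patch is the expression for w0.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a SIMD routine that only handles whole
 * 16-byte blocks. A plain loop is used so the sketch stays portable. */
static void copy_blocks16(uint8_t *dst, const uint8_t *src, int w)
{
    assert((w & 15) == 0);          /* caller must pass a multiple of 16 */
    for (int i = 0; i < w; i++)
        dst[i] = src[i];
}

/* Hypothetical scalar fallback that accepts any width. */
static void copy_scalar(uint8_t *dst, const uint8_t *src, int w)
{
    for (int i = 0; i < w; i++)
        dst[i] = src[i];
}

/* Number of leading bytes given to the vector path: w rounded down to a
 * multiple of 16, or 0 when clip_max is not 255 (same expression as the
 * w0 computation in the patch's helpers). */
static int vector_width(int w, int clip_max)
{
    return clip_max != 255 ? 0 : w & ~15;
}

/* Vector path for the head of the row, scalar path for whatever is left. */
static void copy_row(uint8_t *dst, const uint8_t *src, int w, int clip_max)
{
    const int w0 = vector_width(w, clip_max);

    copy_blocks16(dst, src, w0);
    if (w0 < w)
        copy_scalar(dst + w0, src + w0, w - w0);
}
```

With w = 100 and clip_max = 255 the first 96 bytes go to the block routine and the last 4 to the scalar one; with any other clip_max the whole row goes to the scalar routine.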
[FFmpeg-devel] [PATCH v4 4/7] avfilter/vf_bwdif: Add neon for filter_edge
Adds clip and spatial macros for aarch64 neon Exports C filter_edge needed for tail fixup of neon code Adds neon for filter_edge Signed-off-by: John Cox --- libavfilter/aarch64/vf_bwdif_init_aarch64.c | 20 +++ libavfilter/aarch64/vf_bwdif_neon.S | 177 libavfilter/bwdif.h | 4 + libavfilter/vf_bwdif.c | 8 +- 4 files changed, 205 insertions(+), 4 deletions(-) diff --git a/libavfilter/aarch64/vf_bwdif_init_aarch64.c b/libavfilter/aarch64/vf_bwdif_init_aarch64.c index 3ffaa07ab3..e75cf2f204 100644 --- a/libavfilter/aarch64/vf_bwdif_init_aarch64.c +++ b/libavfilter/aarch64/vf_bwdif_init_aarch64.c @@ -24,10 +24,29 @@ #include "libavfilter/bwdif.h" #include "libavutil/aarch64/cpu.h" +void ff_bwdif_filter_edge_neon(void *dst1, void *prev1, void *cur1, void *next1, + int w, int prefs, int mrefs, int prefs2, int mrefs2, + int parity, int clip_max, int spat); + void ff_bwdif_filter_intra_neon(void *dst1, void *cur1, int w, int prefs, int mrefs, int prefs3, int mrefs3, int parity, int clip_max); +static void filter_edge_helper(void *dst1, void *prev1, void *cur1, void *next1, + int w, int prefs, int mrefs, int prefs2, int mrefs2, + int parity, int clip_max, int spat) +{ +const int w0 = clip_max != 255 ? 
0 : w & ~15; + +ff_bwdif_filter_edge_neon(dst1, prev1, cur1, next1, w0, prefs, mrefs, prefs2, mrefs2, + parity, clip_max, spat); + +if (w0 < w) +ff_bwdif_filter_edge_c((char *)dst1 + w0, (char *)prev1 + w0, (char *)cur1 + w0, (char *)next1 + w0, + w - w0, prefs, mrefs, prefs2, mrefs2, + parity, clip_max, spat); +} + static void filter_intra_helper(void *dst1, void *cur1, int w, int prefs, int mrefs, int prefs3, int mrefs3, int parity, int clip_max) { @@ -52,5 +71,6 @@ ff_bwdif_init_aarch64(BWDIFContext *s, int bit_depth) return; s->filter_intra = filter_intra_helper; +s->filter_edge = filter_edge_helper; } diff --git a/libavfilter/aarch64/vf_bwdif_neon.S b/libavfilter/aarch64/vf_bwdif_neon.S index e288efbe6c..389302b813 100644 --- a/libavfilter/aarch64/vf_bwdif_neon.S +++ b/libavfilter/aarch64/vf_bwdif_neon.S @@ -66,6 +66,79 @@ umlsl2 \a3\().4s, \s1\().8h, \k .endm +// int b = m2s1 - m1; +// int f = p2s1 - p1; +// int dc = c0s1 - m1; +// int de = c0s1 - p1; +// int sp_max = FFMIN(p1 - c0s1, m1 - c0s1); +// sp_max = FFMIN(sp_max, FFMAX(-b,-f)); +// int sp_min = FFMIN(c0s1 - p1, c0s1 - m1); +// sp_min = FFMIN(sp_min, FFMAX(b,f)); +// diff = diff == 0 ? 
0 : FFMAX3(diff, sp_min, sp_max); +.macro SPAT_CHECK diff, m2s1, m1, c0s1, p1, p2s1, t0, t1, t2, t3 +uqsub \t0\().16b, \p1\().16b, \c0s1\().16b +uqsub \t2\().16b, \m1\().16b, \c0s1\().16b +umin \t2\().16b, \t0\().16b, \t2\().16b + +uqsub \t1\().16b, \m1\().16b, \m2s1\().16b +uqsub \t3\().16b, \p1\().16b, \p2s1\().16b +umax \t3\().16b, \t3\().16b, \t1\().16b +umin \t3\().16b, \t3\().16b, \t2\().16b + +uqsub \t0\().16b, \c0s1\().16b, \p1\().16b +uqsub \t2\().16b, \c0s1\().16b, \m1\().16b +umin \t2\().16b, \t0\().16b, \t2\().16b + +uqsub \t1\().16b, \m2s1\().16b, \m1\().16b +uqsub \t0\().16b, \p2s1\().16b, \p1\().16b +umax \t0\().16b, \t0\().16b, \t1\().16b +umin \t2\().16b, \t2\().16b, \t0\().16b + +cmeq \t1\().16b, \diff\().16b, #0 +umax \diff\().16b, \diff\().16b, \t3\().16b +umax \diff\().16b, \diff\().16b, \t2\().16b +bic \diff\().16b, \diff\().16b, \t1\().16b +.endm + +// i0 = s0; +// if (i0 > d0 + diff0) +// i0 = d0 + diff0; +// else if (i0 < d0 - diff0) +// i0 = d0 - diff0; +// +// i0 = s0 is safe +.macro DIFF_CLIP i0, s0, d0, diff, t0, t1 +uqadd \t0\().16b, \d0\().16b, \diff\().16b +uqsub \t1\().16b, \d0\().16b, \diff\().16b +umin \i0\().16b, \s0\().16b, \t0\().16b +umax \i0\().16b, \i0\().16b, \t1\().16b +.endm + +// i0 = FFABS(m1 - p1) > td0 ? i1 : i2; +// DIFF_CLIP +// +// i0 = i1 is safe +.macro INTERPOL i0, i1, i2, m1, d0, p1, td0, diff, t0, t1, t2 +uabd \t0\().16b, \m1\().16b, \p1\().16b +cmhi \t0\().16b, \t0\().16b, \td0\().16b +bsl \t0\().16b, \i1\().16b, \i2\().16b +DIFF_CLIP \i0, \t0, \d0, \diff, \t1, \t