from:"John Cox"

Re: [FFmpeg-devel] [PATCH V2] avutil/tx: add check against (*ctx)

2019-05-16 Thread John Cox

>Ruiling Song (12019-05-16):
>> ctx is a pointer to pointer here.
>> 
>> Signed-off-by: Ruiling Song 
>> ---
>>  libavutil/tx.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>> 
>> diff --git a/libavutil/tx.c b/libavutil/tx.c
>> index 934ef27c81..1690604040 100644
>> --- a/libavutil/tx.c
>> +++ b/libavutil/tx.c
>> @@ -697,7 +697,7 @@ static int gen_mdct_exptab(AVTXContext *s, int len4, double scale)
>>  
>>  av_cold void av_tx_uninit(AVTXContext **ctx)
>>  {
>
>> -    if (!ctx)
>> +    if (!ctx || !(*ctx))
>
>That would protect somebody stupid enough to call av_tx_uninit(NULL)
>instead of av_tx_uninit(&var). A hard crash is completely warranted in
>this case. An assert would be acceptable.

Actually that is what the original code does.  What you appear to want
is

  if (!*ctx)

which protects against multi-free and is useful in that it can be called
unconditionally in cleanup code (assuming initial null assignments) and
crashes in what you describe as the "stupid" case.
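
As a rough illustration of that cleanup pattern, here is a minimal sketch. DemoTXContext and demo_tx_uninit are hypothetical stand-ins rather than the real libavutil code, and only the pfatab field mentioned in the patch is modelled:

```c
#include <stdlib.h>

/* Hypothetical stand-in for AVTXContext; only the field named in the
 * patch is modelled here. */
typedef struct DemoTXContext {
    int *pfatab;
} DemoTXContext;

/* Sketch of the "if (!*ctx)" form: safe to call repeatedly on the same
 * variable, but deliberately offers no protection against ctx == NULL. */
static void demo_tx_uninit(DemoTXContext **ctx)
{
    if (!*ctx)
        return;

    free((*ctx)->pfatab);
    free(*ctx);
    *ctx = NULL; /* a second call on the same variable is now a no-op */
}
```

With the variable initialised to NULL, cleanup code can call demo_tx_uninit(&s) unconditionally, whether or not s was ever allocated, and calling it twice does not double free.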

>>          return;
>>  
>>      av_free((*ctx)->pfatab);
>
>Regards,

Regards

John Cox
___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".

[FFmpeg-devel] HEVC decoder for Raspberry Pi

2018-11-13 Thread John Cox

Hi

I have been developing a hevc decoder for Raspberry Pi for some time
now. As active development has now pretty much ceased and the code is
believed stable it seems a good time to try presenting it to the group.

You can find the current code on branch test/4.1.0/rpi_main in repo
https://github.com/jc-kynesim/rpi-ffmpeg.git. It is based off tag n4.1
so if you diff it against n4.1 you should get a patch.

This code has been in use by the Raspberry Pi version of Kodi for over
two years now.

If you think it would be a good idea to add this to the main ffmpeg
distribution then I am willing to put reasonable effort into beating it
into an appropriate shape.

If not then it contains a reasonable number of ARM asm functions and
other code that you might like to take/adapt for the current decoder.

You will find the config scripts I have been using and a few notes in
the pi-util directory if you wish to try building it for yourself.

Just in case it isn't obvious: this will only run on a Pi.  Slightly
less obviously you need a Pi2 or better as the Pi0 & Pi1 don't have neon
and are just too slow anyway.



Notes on the hevc_rpi decoder & associated support code
-------------------------------------------------------

There are 3 main parts to the existing code:

1) The decoder - this is all in libavcodec as rpi_hevc*.

2) A few filters to deal with Sand frames and a small patch to
automatically select the sand->i420 converter when required.

3) A kludge in ffmpeg.c to display the decoded video. This could &
should be converted into a proper ffmpeg display module.


Decoder
-------

The decoder is a modified version of the existing ffmpeg hevc decoder.
Generally it is ~100% faster than the existing ffmpeg hevc s/w decoder.
More complex bitstreams can be up to ~200% faster but particularly easy
streams can cut its advantage down to ~50%.  This means that a Pi3+ can
display nearly all 8-bit 1080p30 streams and with some overclocking it
can display most lower bitrate 10-bit 1080p30 streams - this latter case
is not helped by the requirement to downsample to 8-bit before display
on a Pi.

It has had co-processor offload added for inter-pred and large block
residual transform.  Various parts have had optimized ARM NEON assembler
added and the existing ARM asm sections have been profiled and
re-optimized for A53. The main C code has been substantially reworked at
its lower levels in an attempt to optimize it and minimize memory
bandwidth. To some extent code paths that deal with frame types that it
doesn't support have been pruned.

It outputs frames in Broadcom Sand format. This is a somewhat annoying
layout that doesn't fit into ffmpeg's standard frame descriptions. It has
vertical stripes of 128 horizontal pixels (64 in 10 bit forms) with Y
for the stripe followed by interleaved U & V, that is then followed by
the Y for the next stripe, etc. The final stripe is always padded to
stripe-width. This is used in an attempt to help with cache locality and
cut down on the number of dram bank switches. It is annoying to use for
inter-pred with conventional processing but the way the Pi QPU (which is
used for inter-pred) works means that it has negligible downsides here
and the improved memory performance exceeds the overhead of the
increased complexity in the rest of the code.
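
A minimal sketch of the addressing this layout implies, assuming 8-bit 4:2:0 frames, 128-byte stripes, U stored before V in each interleaved pair, and no padding between a stripe's Y and UV blocks (the real decoder may well pad). SandLayout and the sand_* helpers are hypothetical names, not taken from the decoder source:

```c
#include <stddef.h>

#define SAND_STRIPE_W 128 /* bytes per stripe row in the 8-bit case */

/* Hypothetical description of one frame's stripe geometry. */
typedef struct SandLayout {
    size_t luma_rows;   /* rows of Y stored in each stripe */
    size_t chroma_rows; /* rows of interleaved UV stored in each stripe */
} SandLayout;

/* Bytes occupied by one stripe: its Y block followed by its UV block. */
static size_t sand_stripe_size(const SandLayout *l)
{
    return SAND_STRIPE_W * (l->luma_rows + l->chroma_rows);
}

/* Byte offset of the luma sample at pixel (x, y). */
static size_t sand_luma_offset(const SandLayout *l, size_t x, size_t y)
{
    size_t stripe = x / SAND_STRIPE_W;
    size_t col    = x % SAND_STRIPE_W;

    return stripe * sand_stripe_size(l) + y * SAND_STRIPE_W + col;
}

/* Byte offset of the U sample at chroma coordinate (cx, cy); under the
 * stated assumption the matching V sample is the next byte. */
static size_t sand_chroma_u_offset(const SandLayout *l, size_t cx, size_t cy)
{
    size_t byte_x = cx * 2; /* U and V are interleaved */
    size_t stripe = byte_x / SAND_STRIPE_W;
    size_t col    = byte_x % SAND_STRIPE_W;

    return stripe * sand_stripe_size(l)
         + l->luma_rows * SAND_STRIPE_W /* skip this stripe's Y block */
         + cy * SAND_STRIPE_W + col;
}
```

Under these assumptions, stepping down a column moves by only 128 bytes, while crossing from x = 127 to x = 128 jumps by a whole stripe.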

Frames must be allocated out of GPU memory (as otherwise they can't be
accessed by the co-processors). Utility functions (in rpi_zc.c) have
been written to make this easier. As the frames are already in GPU
memory they can be displayed by the Pi h/w without any further copying.


Known non-features
------------------

Frame allocation should probably be done in some other way in order to
fit into the standard framework better.

Sand frames are currently declared as software frames. There is an
argument that they should be hardware frames but they aren't really.

There must be a better way of auto-selecting the hevc_rpi decoder over
the normal s/w hevc decoder, but I became confused by the existing h/w
acceleration framework and what I wanted to do didn't seem to fit in
neatly.

Display should be a proper device rather than a kludge in ffmpeg.c


Regards

John Cox

Re: [FFmpeg-devel] HEVC decoder for Raspberry Pi

2018-11-14 Thread John Cox

Hi

>Hi
>
>On Tue, Nov 13, 2018 at 03:52:18PM +0000, John Cox wrote:
>> Hi
>> 
>> I have been developing a hevc decoder for Raspberry Pi for some time
>> now. As active development has now pretty much ceased and the code is
>> believed stable it seems a good time to try presenting it to the group.
>> 
>> You can find the current code on branch test/4.1.0/rpi_main in repo
>> https://github.com/jc-kynesim/rpi-ffmpeg.git. It is based off tag n4.1
>> so if you diff it against n4.1 you should get a patch.
>> 
>> This code has been in use by the Raspberry Pi version of Kodi for over
>> two years now.
>> 
>> If you think it would be a good idea to add this to the main ffmpeg
>> distribution then I am willing to put reasonable effort into beating it
>> into an appropriate shape.
>> 
>> If not then it contains a reasonable number of ARM asm functions and
>> other code that you might like to take/adapt for the current decoder.
>> 
>> You will find the config scripts I have been using and a few notes in
>> the pi-util directory if you wish to try building it for yourself.
>> 
>> Just in case it isn't obvious: this will only run on a Pi.  Slightly
>> less obviously you need a Pi2 or better as the Pi0 & Pi1 don't have neon
>> and are just too slow anyway.
>
>others may have other opinions, but i think optimized code in FFmpeg
>for Pi would be a good idea.
>How to integrate this best though i do not know. And i can't know as
>i have just quickly scrolled over the changes not really looked in detail

Well if you want help with understanding what I've done feel free to
email me and I'll do my best to explain.

>But it's certainly better to have hw optimizations in main git and
>not have a separate repository that needs to be maintained separately
>for each platform ... and that the user has to find also ... and then
>3rd party apps could have even more issues here  if they wanted to use
>optimized libs ...

As I said I'm happy to put in reasonable amounts of work to make this
happen. If we do want to go ahead then may I suggest that the most
efficient way of proceeding would be that I take advice from one
experienced person who understands the current hevc code (Michael?) by
email until the work is mostly done and then return to the list for
final polish?

Regards

JC

Re: [FFmpeg-devel] HEVC decoder for Raspberry Pi

2018-11-15 Thread John Cox

Hi

>On Wed, Nov 14, 2018 at 11:35:50AM +0000, John Cox wrote:
>> Hi
>> 
>> >Hi
>> >
>> >On Tue, Nov 13, 2018 at 03:52:18PM +0000, John Cox wrote:
>> >> Hi
>> >> 
>> >> I have been developing a hevc decoder for Raspberry Pi for some time
>> >> now. As active development has now pretty much ceased and the code is
>> >> believed stable it seems a good time to try presenting it to the group.
>> >> 
>> >> You can find the current code on branch test/4.1.0/rpi_main in repo
>> >> https://github.com/jc-kynesim/rpi-ffmpeg.git. It is based off tag n4.1
>> >> so if you diff it against n4.1 you should get a patch.
>> >> 
>> >> This code has been in use by the Raspberry Pi version of Kodi for over
>> >> two years now.
>> >> 
>> >> If you think it would be a good idea to add this to the main ffmpeg
>> >> distribution then I am willing to put reasonable effort into beating it
>> >> into an appropriate shape.
>> >> 
>> >> If not then it contains a reasonable number of ARM asm functions and
>> >> other code that you might like to take/adapt for the current decoder.
>> >> 
>> >> You will find the config scripts I have been using and a few notes in
>> >> the pi-util directory if you wish to try building it for yourself.
>> >> 
>> >> Just in case it isn't obvious: this will only run on a Pi.  Slightly
>> >> less obviously you need a Pi2 or better as the Pi0 & Pi1 don't have neon
>> >> and are just too slow anyway.
>> >
>> >others may have other opinions, but i think optimized code in FFmpeg
>> >for Pi would be a good idea.
>> >How to integrate this best though i do not know. And i can't know as
>> >i have just quickly scrolled over the changes not really looked in detail
>> 
>> Well if you want help with understanding what I've done feel free to
>> email me and I'll do my best to explain.
>> 
>> >But it's certainly better to have hw optimizations in main git and
>> >not have a separate repository that needs to be maintained separately
>> >for each platform ... and that the user has to find also ... and then
>> >3rd party apps could have even more issues here  if they wanted to use
>> >optimized libs ...
>> 
>> As I said I'm happy to put in reasonable amounts of work to make this
>> happen. If we do want to go ahead then may I suggest that the most
>> efficient way of proceeding would be that I take advice from one
>> experienced person who understands the current hevc code (Michael?) by
>> email until the work is mostly done and then return to the list for
>> final polish?
>
>well, there are multiple ways this could be integrated, and it's not
>really my decision which way to go. What's important is that before
>doing substantial work you ensure that there's no one around who has
>an issue with the choice before.
>
>Now one way it could be integrated would be as a separate decoder
That is how I've currently built it and therefore probably the easiest
option.

>another is inside the hevc decoder
It started life there but became a very uneasy fit with too many ifdefs.
>a 3rd is, similar to the hwaccel stuff
>and a 4th would be that the decoder could be an external lib that
>is used through hwaccel similar to other hwaccel libs
Possibly - this is where I wanted advice as my attempts to understand
how that lot is meant to work simply ended in confusion or a feeling
that what I wanted to do was a very bad fit with the current framework -
some of the issue with that is in vps/sps/pps setup where I build
somewhat different tables to the common code that is used by most other
h/w decodes.

>you need to obtain the community's preference here not just my
>opinion ...
>especially comments from people actively working on hwaccel stuff
>are needed here
I welcome their comments

>But there is surely code in this change which can be integrated
>and which would not change depending on the higher level integration
>design. An example would be the asm that you already mentioned
>You could split that out into patches and submit these
I'd prefer to get the whole thing in, but if someone else wants to
cherry-pick my changes then they are completely welcome.

>another thing that can be worked on may be to reduce code duplication.
Yup

Regards

JC

[FFmpeg-devel] How to do (HEVC) decoder fallback?

2017-11-13 Thread John Cox

Hi

I have an HEVC decoder built from the standard ffmpeg hevc decoder.  It
has been heavily optimised for the Raspberry Pi and uses the support
processors (QPU & VPU) of that chip to achieve plausible speed (on a Pi3
it can normally decode 10Mbit/sec 30fps 8-bit 4:2:0 1080p and has a
decent go at 10-bit 1080p but you will need some overclock to get
reliable 30fps)

It only supports 8 & 10bit, 4:2:0 HEVC with a max width of 2048, and
outputs frames in a somewhat odd Broadcom format (sand) which doesn't fit
any of the existing FFmpeg models as it is arranged in 128 byte wide
vertical stripes rather than any sort of planar format.  I also have a
few functions that deal with sand conversion to raw 420 for conformance
testing.

What I want to do is to add this in such a way that ffmpeg will use it
if the incoming stream is one it can deal with but will fall back to the
standard hevc decoder if it can't.

I've looked at the h/w accel route, but at first sight (I'll admit to
becoming quite confused here) that appears to (a) want the hwaccel to
produce the same format frames as the base decoder would (which it
doesn't) and (b) to use the same vps/sps/pps processing as the base
decoder (and I've modified that a bit).  What I would really like is for
there to be some sort of fallback route for software decoders that share
the same AVCodecID s.t. if one fails init then the next one is tried but
that doesn't seem to be possible with the current setup.  Am I missing
something?
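
To make the requested behaviour concrete, here is a minimal sketch of "try each decoder registered for a codec ID until one initialises". It deliberately does not use the real libavcodec API: DemoDecoder, DemoStreamInfo, demo_decoders and demo_open_with_fallback are all hypothetical, and the init check only encodes the limits listed above (8 or 10 bit, 4:2:0, width up to 2048):

```c
#include <stddef.h>

#define DEMO_CODEC_HEVC 1 /* hypothetical codec ID, not AVCodecID */

/* Hypothetical summary of the stream properties an init function checks. */
typedef struct DemoStreamInfo {
    int bit_depth;
    int chroma_420; /* non-zero if the stream is 4:2:0 */
    int width;
} DemoStreamInfo;

typedef struct DemoDecoder {
    const char *name;
    int codec_id;
    int (*init)(const DemoStreamInfo *si); /* 0 on success, <0 on failure */
} DemoDecoder;

/* Restricted decoder: refuses streams outside the limits given above. */
static int demo_rpi_init(const DemoStreamInfo *si)
{
    if ((si->bit_depth != 8 && si->bit_depth != 10) ||
        !si->chroma_420 || si->width > 2048)
        return -1;
    return 0;
}

/* General software decoder: accepts everything in this sketch. */
static int demo_sw_init(const DemoStreamInfo *si)
{
    (void)si;
    return 0;
}

/* Priority order: the restricted decoder is tried first. */
static const DemoDecoder demo_decoders[] = {
    { "hevc_rpi", DEMO_CODEC_HEVC, demo_rpi_init },
    { "hevc",     DEMO_CODEC_HEVC, demo_sw_init  },
};

/* Return the first decoder for codec_id whose init succeeds, or NULL. */
static const DemoDecoder *demo_open_with_fallback(int codec_id,
                                                  const DemoStreamInfo *si)
{
    size_t n = sizeof(demo_decoders) / sizeof(demo_decoders[0]);

    for (size_t i = 0; i < n; i++) {
        const DemoDecoder *d = &demo_decoders[i];
        if (d->codec_id == codec_id && d->init(si) == 0)
            return d;
    }
    return NULL;
}
```

In this sketch a 1920-wide 8-bit 4:2:0 stream selects the first entry, while a 3840-wide or 4:4:4 stream falls through to the second.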

As it stands the code is built into the main hevc decoder code with a
lot of ifdefs & if (rpi_enable), but I think it would be better off in
its own decoder.  If you want to look at the current state of the art
then you can find it in https://github.com/jc-kynesim/rpi-ffmpeg.git on
branch test/wpp_1 - I do have a separated decoder version but I'd like
to find out how I should integrate it before I commit it.

Many thanks

John Cox

[FFmpeg-devel] [RFC] swscale RGB24->YUV420P

2023-08-16 Thread John Cox

Hi

The Pi has a use for a fast RGB24->YUV420P path for encoding camera
video. There is an existing BGR24 converter but if I build a RGB24
converter using the same logic (rearrange the conversion matrix and use
the same code) I get a fate fail on filter-fps-cfr (and possibly others)
which appears to decode a file to RGB24, convert to YUV420P and take the
CRC of that so almost any change to the conversion breaks this
(unrelated?) test.

My initial assumption was that if the conversion code in
libswscale/rgb2rgb_template:bgr24toyv12_c was good enough for BGR24->YUV
then it should be good enough for RGB24->YUV too. However it breaks this
fate case - what is an acceptable way to resolve this?

A further question assuming that the above can be resolved - I have also
written aarch64 asm for this (RGB24/BGR24->YUV420P). It costs nothing in
the asm to round the output values to nearest rather than just rounding
down as the C template does and doing so improves the accuracy reported
by tests/swscale - however at that point the asm and the C don't produce
identical results. I assume that the x86 asm for BGR24 conversion does
match the C. What is the best thing to do here?
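
The difference between the two rounding modes can be sketched as follows. The coefficients are illustrative BT.601 limited-range luma weights scaled by 2^15, chosen for this example and not copied from the swscale tables, and the demo_* names are hypothetical:

```c
/* Illustrative 15-bit fixed-point luma weights (assumed values:
 * 0.257, 0.504 and 0.098 scaled by 32768). */
#define DEMO_SHIFT 15
#define DEMO_RY 8421
#define DEMO_GY 16515
#define DEMO_BY 3211

/* Round down: the fractional part of the weighted sum is discarded. */
static int demo_y_floor(int r, int g, int b)
{
    return ((DEMO_RY * r + DEMO_GY * g + DEMO_BY * b) >> DEMO_SHIFT) + 16;
}

/* Round to nearest: add half of the divisor before shifting. */
static int demo_y_nearest(int r, int g, int b)
{
    int half = 1 << (DEMO_SHIFT - 1);

    return ((DEMO_RY * r + DEMO_GY * g + DEMO_BY * b + half)
            >> DEMO_SHIFT) + 16;
}
```

For the pixel (r, g, b) = (2, 0, 0) the weighted sum is 16842, about 0.514 after scaling, so rounding down gives 16 while rounding to nearest gives 17. Any implementation that rounds differently from the C template will therefore disagree with it on some pixels.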

I've tested by hand with libswscale/test/swscale but fate integration
would be obviously better - I'm currently a bit lost in fate, where/how
should I do this?

Many thanks

John Cox

Re: [FFmpeg-devel] [RFC] swscale RGB24->YUV420P

2023-08-17 Thread John Cox

On Wed, 16 Aug 2023 19:37:02 +0200, you wrote:

>On Wed, Aug 16, 2023 at 05:15:23PM +0100, John Cox wrote:
>> Hi
>> 
>> The Pi has a use for a fast RGB24->YUV420P path for encoding camera
>> video. There is an existing BGR24 converter but if I build a RGB24
>> converter using the same logic (rearrange the conversion matrix and use
>> the same code) I get a fate fail on filter-fps-cfr (and possibly others)
>> which appears to decode a file to RGB24, convert to YUV420P and take the
>> CRC of that so almost any change to the conversion breaks this
>> (unrelated?) test.
>> 
>> My initial assumption was that if the code to conversion in
>> libswscale/rgb2rgb_template:bgr24toyv12_c was good enough for BGR24->YUV
>> then it should be good enough for RGB24->YUV too. However it breaks this
>> fate case - what is an acceptable way to resolve this?
>
>update the checksum (if needed), and put the code under appropriate bitexact 
>flags checks
>(there may be remaining issues but hard to say without seeing and being
>able to test the code)

Thanks for the prompt answer. The current test invocation goes:

 /home/jc/work/rpi/ffmpeg2/out/x86/ffmpeg -nostdin -nostats
-noauto_conversion_filters -cpuflags all -auto_conversion_filters
-hwaccel none -threads 1 -thread_type frame+slice -i
/home/jc/rpi/conform/fate-suite/qtrle/apple-animation-variable-fps-bug.mov
-r 30 -vsync cfr -pix_fmt yuv420p -bitexact -f framecrc -

Which appears, at first sight, to already have the required bitexact
flag in it, however it doesn't get passed to the swscale context - in
order for that to happen I need something like:

 /home/jc/work/rpi/ffmpeg2/out/x86/ffmpeg -fflags bitexact -nostdin
-nostats -noauto_conversion_filters -cpuflags all
-auto_conversion_filters -hwaccel none -threads 1 -thread_type
frame+slice -i
/home/jc/rpi/conform/fate-suite/qtrle/apple-animation-variable-fps-bug.mov
-r 30 -vsync cfr -vf scale=sws_flags=bitexact -pix_fmt yuv420p -bitexact
-f framecrc -

i.e. adding an explicit "-vf scale=sws_flags=bitexact". Is this the
correct answer or is it a bug that the auto conversion fails to respect
the existing bitexact flag?

>> A further question assuming that the above can be resolved - I have also
>> written aarch64 asm for this (RGB24/BGR24->YUV420P). It costs nothing in
>> the asm to round the output values to nearest rather than just rounding
>> down as the C template does and doing so improves the accuracy reported
>> by tests/swscale - however at that point the asm and the C don't produce
>> identical results. I assume that the x86 asm for BGR24 conversion does
>> match the C. What is the best thing to do here?
>
>The more differences there are between implementations the more annoying
>it is but there is a bitexact flag that allows differences

Thanks

John Cox

>thx
>
>[...]

[FFmpeg-devel] [PATCH v1 0/6] swscale: Add dedicated RGB->YUV unscaled functions & aarch64 asm

2023-08-20 Thread John Cox

This patch set expands the set of dedicated RGB->YUV unscaled functions
to help with encoding camera output on a Pi. Obviously there are other
uses but that was the motivation.

It enforces the general bitexact path for the fate tests that depend on
it.
It renames the existing BGR function from rgb... to bgr... so we don't
end up with the counterintuitive situation where BGR is handled by
rgb... once RGB functions are added.
It adds RGB functions.
It improves the rounding in the dedicated function, as that improves its
score when tested with test/swscale, and fixes it to allow any width
(contrary to the comment, any height was already allowed).
It adds XRGB->YUV functions to complete the set.
It adds aarch64 NEON for BGR24 & RGB24.

I haven't built fate tests for this as I'm not quite sure what the
appropriate tests would be. The x86 asm doesn't match either the C
template with improved rounding or the previous template (I'm not quite
sure what it does but it produces a different score out of tests/swscale
to either method) so a simple results match isn't going to work.

Regards

John Cox

John Cox (6):
  fate-filter-fps: Set swscale bitexact for tests that do conversions
  swscale: Rename BGR24->YUV conversion functions as bgr...
  swscale: Add explicit rgb24->yv12 conversion
  swscale: RGB24->YUV allow odd widths & improve C rounding
  swscale: Add unscaled XRGB->YUV420P functions
  swscale: Add aarch64 functions for RGB24->YUV420P

 libswscale/aarch64/rgb2rgb.c  |   8 +
 libswscale/aarch64/rgb2rgb_neon.S | 356 ++
 libswscale/bayer_template.c   |   2 +-
 libswscale/rgb2rgb.c  |  25 +++
 libswscale/rgb2rgb.h  |  23 ++
 libswscale/rgb2rgb_template.c | 174 +--
 libswscale/swscale_unscaled.c | 114 +-
 libswscale/x86/rgb2rgb_template.c |  13 +-
 tests/fate/filter-video.mak   |   4 +-
 9 files changed, 694 insertions(+), 25 deletions(-)

-- 
2.39.2


[FFmpeg-devel] [PATCH v1 1/6] fate-filter-fps: Set swscale bitexact for tests that do conversions

2023-08-20 Thread John Cox

-bitexact as a general flag doesn't affect swscale so add swscale option
too to get correct CRCs in all circumstances.

Signed-off-by: John Cox 
---
 tests/fate/filter-video.mak | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tests/fate/filter-video.mak b/tests/fate/filter-video.mak
index 789ec6414c..811a96d124 100644
--- a/tests/fate/filter-video.mak
+++ b/tests/fate/filter-video.mak
@@ -391,8 +391,8 @@ fate-filter-fps-start-drop: CMD = framecrc -lavfi testsrc2=r=7:d=3.5,fps=3:start
 fate-filter-fps-start-fill: CMD = framecrc -lavfi testsrc2=r=7:d=1.5,setpts=PTS+14,fps=3:start_time=1.5
 
 FATE_FILTER_SAMPLES-$(call FILTERDEMDEC, FPS SCALE, MOV, QTRLE) += fate-filter-fps-cfr fate-filter-fps
-fate-filter-fps-cfr: CMD = framecrc -auto_conversion_filters -i $(TARGET_SAMPLES)/qtrle/apple-animation-variable-fps-bug.mov -r 30 -vsync cfr -pix_fmt yuv420p
-fate-filter-fps: CMD = framecrc -auto_conversion_filters -i $(TARGET_SAMPLES)/qtrle/apple-animation-variable-fps-bug.mov -vf fps=30 -pix_fmt yuv420p
+fate-filter-fps-cfr: CMD = framecrc -auto_conversion_filters -i $(TARGET_SAMPLES)/qtrle/apple-animation-variable-fps-bug.mov -r 30 -vsync cfr -vf scale=sws_flags=bitexact -pix_fmt yuv420p
+fate-filter-fps: CMD = framecrc -auto_conversion_filters -i $(TARGET_SAMPLES)/qtrle/apple-animation-variable-fps-bug.mov -vf fps=30,scale=sws_flags=bitexact -pix_fmt yuv420p
 
 FATE_FILTER_ALPHAEXTRACT_ALPHAMERGE := $(addprefix fate-filter-alphaextract_alphamerge_, rgb yuv)
 FATE_FILTER_VSYNTH_PGMYUV-$(call ALLYES, SCALE_FILTER FORMAT_FILTER SPLIT_FILTER ALPHAEXTRACT_FILTER ALPHAMERGE_FILTER) += $(FATE_FILTER_ALPHAEXTRACT_ALPHAMERGE)
-- 
2.39.2


[FFmpeg-devel] [PATCH v1 2/6] swscale: Rename BGR24->YUV conversion functions as bgr...

2023-08-20 Thread John Cox

Rename the swscale functions that convert BGR24 frames to YUV from
rgb24toyv12 to bgr24toyv12. The old name is confusing and would be even
more confusing with the addition of RGB24 converters.

Signed-off-by: John Cox 
---
 libswscale/bayer_template.c   | 2 +-
 libswscale/rgb2rgb.c  | 2 +-
 libswscale/rgb2rgb.h  | 4 ++--
 libswscale/rgb2rgb_template.c | 4 ++--
 libswscale/swscale_unscaled.c | 2 +-
 libswscale/x86/rgb2rgb_template.c | 8 
 6 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/libswscale/bayer_template.c b/libswscale/bayer_template.c
index 46b5a4984d..06d917c97f 100644
--- a/libswscale/bayer_template.c
+++ b/libswscale/bayer_template.c
@@ -188,7 +188,7 @@
  * invoke ff_rgb24toyv12 for 2x2 pixels
  */
 #define rgb24toyv12_2x2(src, dstY, dstU, dstV, luma_stride, src_stride, rgb2yuv) \
-    ff_rgb24toyv12(src, dstY, dstV, dstU, 2, 2, luma_stride, 0, src_stride, rgb2yuv)
+    ff_bgr24toyv12(src, dstY, dstV, dstU, 2, 2, luma_stride, 0, src_stride, rgb2yuv)
 
 static void BAYER_RENAME(rgb24_copy)(const uint8_t *src, int src_stride, uint8_t *dst, int dst_stride, int width)
 {
diff --git a/libswscale/rgb2rgb.c b/libswscale/rgb2rgb.c
index e98fdac8ea..8707917800 100644
--- a/libswscale/rgb2rgb.c
+++ b/libswscale/rgb2rgb.c
@@ -78,7 +78,7 @@ void (*yuy2toyv12)(const uint8_t *src, uint8_t *ydst,
uint8_t *udst, uint8_t *vdst,
int width, int height,
int lumStride, int chromStride, int srcStride);
-void (*ff_rgb24toyv12)(const uint8_t *src, uint8_t *ydst,
+void (*ff_bgr24toyv12)(const uint8_t *src, uint8_t *ydst,
uint8_t *udst, uint8_t *vdst,
int width, int height,
int lumStride, int chromStride, int srcStride,
diff --git a/libswscale/rgb2rgb.h b/libswscale/rgb2rgb.h
index f3951d523e..305b830920 100644
--- a/libswscale/rgb2rgb.h
+++ b/libswscale/rgb2rgb.h
@@ -76,7 +76,7 @@ void rgb15tobgr15(const uint8_t *src, uint8_t *dst, int src_size);
 void rgb12tobgr12(const uint8_t *src, uint8_t *dst, int src_size);
 void    rgb12to15(const uint8_t *src, uint8_t *dst, int src_size);
 
-void ff_rgb24toyv12_c(const uint8_t *src, uint8_t *ydst, uint8_t *udst,
+void ff_bgr24toyv12_c(const uint8_t *src, uint8_t *ydst, uint8_t *udst,
   uint8_t *vdst, int width, int height, int lumStride,
   int chromStride, int srcStride, int32_t *rgb2yuv);
 
@@ -124,7 +124,7 @@ extern void (*yuv422ptouyvy)(const uint8_t *ysrc, const uint8_t *usrc, const uin
  * Chrominance data is only taken from every second line, others are ignored.
  * FIXME: Write high quality version.
  */
-extern void (*ff_rgb24toyv12)(const uint8_t *src, uint8_t *ydst, uint8_t *udst, uint8_t *vdst,
+extern void (*ff_bgr24toyv12)(const uint8_t *src, uint8_t *ydst, uint8_t *udst, uint8_t *vdst,
   int width, int height,
   int lumStride, int chromStride, int srcStride,
   int32_t *rgb2yuv);
diff --git a/libswscale/rgb2rgb_template.c b/libswscale/rgb2rgb_template.c
index 42c69801ba..8ef4a2cf5d 100644
--- a/libswscale/rgb2rgb_template.c
+++ b/libswscale/rgb2rgb_template.c
@@ -646,7 +646,7 @@ static inline void uyvytoyv12_c(const uint8_t *src, uint8_t *ydst,
  * others are ignored in the C version.
  * FIXME: Write HQ version.
  */
-void ff_rgb24toyv12_c(const uint8_t *src, uint8_t *ydst, uint8_t *udst,
+void ff_bgr24toyv12_c(const uint8_t *src, uint8_t *ydst, uint8_t *udst,
uint8_t *vdst, int width, int height, int lumStride,
int chromStride, int srcStride, int32_t *rgb2yuv)
 {
@@ -979,7 +979,7 @@ static av_cold void rgb2rgb_init_c(void)
 yuv422ptouyvy  = yuv422ptouyvy_c;
 yuy2toyv12 = yuy2toyv12_c;
 planar2x   = planar2x_c;
-ff_rgb24toyv12 = ff_rgb24toyv12_c;
+ff_bgr24toyv12 = ff_bgr24toyv12_c;
 interleaveBytes= interleaveBytes_c;
 deinterleaveBytes  = deinterleaveBytes_c;
 vu9_to_vu12= vu9_to_vu12_c;
diff --git a/libswscale/swscale_unscaled.c b/libswscale/swscale_unscaled.c
index 9af2e7ecc3..32e0d7f63c 100644
--- a/libswscale/swscale_unscaled.c
+++ b/libswscale/swscale_unscaled.c
@@ -1641,7 +1641,7 @@ static int bgr24ToYv12Wrapper(SwsContext *c, const uint8_t *src[],
   int srcStride[], int srcSliceY, int srcSliceH,
   uint8_t *dst[], int dstStride[])
 {
-ff_rgb24toyv12(
+ff_bgr24toyv12(
 src[0],
 dst[0] +  srcSliceY   * dstStride[0],
 dst[1] + (srcSliceY >> 1) * dstStride[1],
diff --git a/libswscale/x86/rgb2rgb_template.c b/libswscale/x86/rgb2rgb_template.c
index 4aba25dd51..dc2b4e205a 100644
--- a/libswscale/x86/rgb2rgb_template.c
+++ b/libswscale/x86/rgb2rgb_template.c
@@ -1544,7 +1544,7 @@ static inline void RENAME(uyvy

[FFmpeg-devel] [PATCH v1 3/6] swscale: Add explicit rgb24->yv12 conversion

2023-08-20 Thread John Cox

Add a rgb24->yuv420p conversion. Uses the same code as the existing
bgr24->yuv converter but permutes the conversion array to swap R & B
coefficients.
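
The idea can be sketched in isolation as below. The index values, the coefficient table and the demo_* names are assumptions made for this example (the real *_IDX definitions and tables live in swscale's internal headers); only the permutation trick itself is taken from the patch:

```c
#include <stdint.h>

/* Assumed index layout for the 9-entry coefficient table. */
enum { RY_IDX, GY_IDX, BY_IDX, RU_IDX, GU_IDX, BU_IDX, RV_IDX, GV_IDX, BV_IDX };

/* Illustrative 15-bit fixed-point BT.601 coefficients (assumed values). */
static const int32_t demo_rgb2yuv[9] = {
     8421,  16515,  3211,
    -4850,  -9535, 14385,
    14385, -12059, -2327,
};

/* Identity permutation: bytes are b,g,r and coefficients are used as-is. */
static const uint8_t demo_x_bgr[9] = {
    RY_IDX, GY_IDX, BY_IDX, RU_IDX, GU_IDX, BU_IDX, RV_IDX, GV_IDX, BV_IDX,
};

/* Swapped permutation: the "r" slot picks up the B coefficient and vice
 * versa, so unchanged byte-reading code handles r,g,b input. */
static const uint8_t demo_x_rgb[9] = {
    BY_IDX, GY_IDX, RY_IDX, BU_IDX, GU_IDX, RU_IDX, BV_IDX, GV_IDX, RV_IDX,
};

/* Luma of one packed pixel. The bytes are always read in b,g,r order, as
 * in the shared template; only the coefficient lookup is permuted. */
static int demo_luma(const uint8_t px[3], const int32_t *rgb2yuv,
                     const uint8_t x[9])
{
    int b = px[0], g = px[1], r = px[2];
    int32_t ry = rgb2yuv[x[0]], gy = rgb2yuv[x[1]], by = rgb2yuv[x[2]];

    return ((ry * r + gy * g + by * b) >> 15) + 16;
}
```

The same colour stored as B,G,R bytes and converted with demo_x_bgr, or stored as R,G,B bytes and converted with demo_x_rgb, yields the same luma value, with no change to the pixel-reading code.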

Signed-off-by: John Cox 
---
 libswscale/rgb2rgb.c  |  5 +
 libswscale/rgb2rgb.h  |  7 +++
 libswscale/rgb2rgb_template.c | 38 ++-
 libswscale/swscale_unscaled.c | 24 +-
 4 files changed, 68 insertions(+), 6 deletions(-)

diff --git a/libswscale/rgb2rgb.c b/libswscale/rgb2rgb.c
index 8707917800..de90e5193f 100644
--- a/libswscale/rgb2rgb.c
+++ b/libswscale/rgb2rgb.c
@@ -83,6 +83,11 @@ void (*ff_bgr24toyv12)(const uint8_t *src, uint8_t *ydst,
int width, int height,
int lumStride, int chromStride, int srcStride,
int32_t *rgb2yuv);
+void (*ff_rgb24toyv12)(const uint8_t *src, uint8_t *ydst,
+   uint8_t *udst, uint8_t *vdst,
+   int width, int height,
+   int lumStride, int chromStride, int srcStride,
+   int32_t *rgb2yuv);
 void (*planar2x)(const uint8_t *src, uint8_t *dst, int width, int height,
  int srcStride, int dstStride);
 void (*interleaveBytes)(const uint8_t *src1, const uint8_t *src2, uint8_t *dst,
diff --git a/libswscale/rgb2rgb.h b/libswscale/rgb2rgb.h
index 305b830920..f7a76a92ba 100644
--- a/libswscale/rgb2rgb.h
+++ b/libswscale/rgb2rgb.h
@@ -79,6 +79,9 @@ void    rgb12to15(const uint8_t *src, uint8_t *dst, int src_size);
 void ff_bgr24toyv12_c(const uint8_t *src, uint8_t *ydst, uint8_t *udst,
   uint8_t *vdst, int width, int height, int lumStride,
   int chromStride, int srcStride, int32_t *rgb2yuv);
+void ff_rgb24toyv12_c(const uint8_t *src, uint8_t *ydst, uint8_t *udst,
+  uint8_t *vdst, int width, int height, int lumStride,
+  int chromStride, int srcStride, int32_t *rgb2yuv);
 
 /**
  * Height should be a multiple of 2 and width should be a multiple of 16.
@@ -128,6 +131,10 @@ extern void (*ff_bgr24toyv12)(const uint8_t *src, uint8_t *ydst, uint8_t *udst,
   int width, int height,
   int lumStride, int chromStride, int srcStride,
   int32_t *rgb2yuv);
+extern void (*ff_rgb24toyv12)(const uint8_t *src, uint8_t *ydst, uint8_t *udst, uint8_t *vdst,
+  int width, int height,
+  int lumStride, int chromStride, int srcStride,
+  int32_t *rgb2yuv);
 extern void (*planar2x)(const uint8_t *src, uint8_t *dst, int width, int height,
 int srcStride, int dstStride);
 
diff --git a/libswscale/rgb2rgb_template.c b/libswscale/rgb2rgb_template.c
index 8ef4a2cf5d..e57bfa6545 100644
--- a/libswscale/rgb2rgb_template.c
+++ b/libswscale/rgb2rgb_template.c
@@ -646,13 +646,14 @@ static inline void uyvytoyv12_c(const uint8_t *src, uint8_t *ydst,
  * others are ignored in the C version.
  * FIXME: Write HQ version.
  */
-void ff_bgr24toyv12_c(const uint8_t *src, uint8_t *ydst, uint8_t *udst,
+static void rgb24toyv12_x(const uint8_t *src, uint8_t *ydst, uint8_t *udst,
uint8_t *vdst, int width, int height, int lumStride,
-   int chromStride, int srcStride, int32_t *rgb2yuv)
+   int chromStride, int srcStride, int32_t *rgb2yuv,
+   const uint8_t x[9])
 {
-int32_t ry = rgb2yuv[RY_IDX], gy = rgb2yuv[GY_IDX], by = rgb2yuv[BY_IDX];
-int32_t ru = rgb2yuv[RU_IDX], gu = rgb2yuv[GU_IDX], bu = rgb2yuv[BU_IDX];
-int32_t rv = rgb2yuv[RV_IDX], gv = rgb2yuv[GV_IDX], bv = rgb2yuv[BV_IDX];
+int32_t ry = rgb2yuv[x[0]], gy = rgb2yuv[x[1]], by = rgb2yuv[x[2]];
+int32_t ru = rgb2yuv[x[3]], gu = rgb2yuv[x[4]], bu = rgb2yuv[x[5]];
+int32_t rv = rgb2yuv[x[6]], gv = rgb2yuv[x[7]], bv = rgb2yuv[x[8]];
 int y;
 const int chromWidth = width >> 1;
 
@@ -707,6 +708,32 @@ void ff_bgr24toyv12_c(const uint8_t *src, uint8_t *ydst, uint8_t *udst,
 }
 }
 
+static const uint8_t x_bgr[9] = {
+RY_IDX, GY_IDX, BY_IDX,
+RU_IDX, GU_IDX, BU_IDX,
+RV_IDX, GV_IDX, BV_IDX,
+};
+
+static const uint8_t x_rgb[9] = {
+ BY_IDX, GY_IDX, RY_IDX,
+ BU_IDX, GU_IDX, RU_IDX,
+ BV_IDX, GV_IDX, RV_IDX,
+};
+
+void ff_bgr24toyv12_c(const uint8_t *src, uint8_t *ydst, uint8_t *udst,
+   uint8_t *vdst, int width, int height, int lumStride,
+   int chromStride, int srcStride, int32_t *rgb2yuv)
+{
+rgb24toyv12_x(src, ydst, udst, vdst, width, height, lumStride, chromStride, srcStride, rgb2yuv, x_bgr);
+}
+
+void ff_rgb24toyv12_c(const uint8_t *src, uint8_t *ydst, uint8_t *udst,
+   uint8_t *vdst, int width, int height, int lumStride,
+   int chromStride, int srcStride, int32_t *rgb2yuv)
+{
+rgb24toyv12_x

[FFmpeg-devel] [PATCH v1 4/6] swscale: RGB24->YUV allow odd widths & improve C rounding

2023-08-20 Thread John Cox

Allow odd widths for conversion; it costs very little and simplifies
setup slightly. x86 asm will fall back to the C code if width is odd.
Round to nearest rather than just down. This reduces the Y error
reported by tests/swscale from 3 to 1. x86 asm doesn't mirror the C so
exact correspondence isn't an issue there.
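
The rounding change can be illustrated with a minimal, self-contained sketch. It assumes a 15-bit fixed-point shift (as RGB2YUV_SHIFT is understood to be in libswscale) and uses illustrative BT.601 limited-range luma weights; the SK_* names and helper functions are hypothetical and are not part of the patch.

```c
#include <stdint.h>

/* Assumption: 15-bit fixed point, mirroring what RGB2YUV_SHIFT is believed
 * to be. The weights are illustrative BT.601 limited-range luma
 * coefficients, not values read from a real SwsContext. */
#define SK_SHIFT 15
enum { SK_RY = 8414, SK_GY = 16519, SK_BY = 3208 };

/* Old form: shift (truncating), then add the luma offset of 16. */
static unsigned sk_luma_trunc(unsigned r, unsigned g, unsigned b)
{
    return ((SK_RY * r + SK_GY * g + SK_BY * b) >> SK_SHIFT) + 16;
}

/* New form: the offset and half an LSB are folded into one constant,
 * ky = (16 << SHIFT) + (1 << (SHIFT - 1)), so the shift rounds to nearest. */
static unsigned sk_luma_round(unsigned r, unsigned g, unsigned b)
{
    const int32_t ky = ((16 << 1) + 1) << (SK_SHIFT - 1);
    return (SK_RY * r + SK_GY * g + SK_BY * b + ky) >> SK_SHIFT;
}
```

With these weights, full white comes out as 234 in the truncating form and 235 (the nominal limited-range peak) in the rounding form, while black stays at 16 in both.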

Signed-off-by: John Cox 
---
 libswscale/rgb2rgb_template.c | 42 ++-
 libswscale/swscale_unscaled.c |  5 ++--
 libswscale/x86/rgb2rgb_template.c |  5 
 3 files changed, 32 insertions(+), 20 deletions(-)

diff --git a/libswscale/rgb2rgb_template.c b/libswscale/rgb2rgb_template.c
index e57bfa6545..5503e58a29 100644
--- a/libswscale/rgb2rgb_template.c
+++ b/libswscale/rgb2rgb_template.c
@@ -656,6 +656,8 @@ static void rgb24toyv12_x(const uint8_t *src, uint8_t *ydst, uint8_t *udst,
 int32_t rv = rgb2yuv[x[6]], gv = rgb2yuv[x[7]], bv = rgb2yuv[x[8]];
 int y;
 const int chromWidth = width >> 1;
+const int32_t ky = ((16 << 1) + 1) << (RGB2YUV_SHIFT - 1);
+const int32_t kc = ((128 << 1) + 1) << (RGB2YUV_SHIFT - 1);
 
 for (y = 0; y < height; y += 2) {
 int i;
@@ -664,9 +666,9 @@ static void rgb24toyv12_x(const uint8_t *src, uint8_t *ydst, uint8_t *udst,
 unsigned int g = src[6 * i + 1];
 unsigned int r = src[6 * i + 2];
 
-unsigned int Y = ((ry * r + gy * g + by * b) >> RGB2YUV_SHIFT) +  16;
-unsigned int V = ((rv * r + gv * g + bv * b) >> RGB2YUV_SHIFT) + 128;
-unsigned int U = ((ru * r + gu * g + bu * b) >> RGB2YUV_SHIFT) + 128;
+unsigned int Y = (ry * r + gy * g + by * b + ky) >> RGB2YUV_SHIFT;
+unsigned int V = (rv * r + gv * g + bv * b + kc) >> RGB2YUV_SHIFT;
+unsigned int U = (ru * r + gu * g + bu * b + kc) >> RGB2YUV_SHIFT;
 
 udst[i] = U;
 vdst[i] = V;
@@ -676,30 +678,36 @@ static void rgb24toyv12_x(const uint8_t *src, uint8_t *ydst, uint8_t *udst,
 g = src[6 * i + 4];
 r = src[6 * i + 5];
 
-Y = ((ry * r + gy * g + by * b) >> RGB2YUV_SHIFT) + 16;
+Y = ((ry * r + gy * g + by * b + ky) >> RGB2YUV_SHIFT);
 ydst[2 * i + 1] = Y;
 }
-ydst += lumStride;
-src  += srcStride;
-
-if (y+1 == height)
-break;
-
-for (i = 0; i < chromWidth; i++) {
+if ((width & 1) != 0) {
 unsigned int b = src[6 * i + 0];
 unsigned int g = src[6 * i + 1];
 unsigned int r = src[6 * i + 2];
 
-unsigned int Y = ((ry * r + gy * g + by * b) >> RGB2YUV_SHIFT) + 16;
+unsigned int Y = (ry * r + gy * g + by * b + ky) >> RGB2YUV_SHIFT;
+unsigned int V = (rv * r + gv * g + bv * b + kc) >> RGB2YUV_SHIFT;
+unsigned int U = (ru * r + gu * g + bu * b + kc) >> RGB2YUV_SHIFT;
 
+udst[i] = U;
+vdst[i] = V;
 ydst[2 * i] = Y;
+}
+ydst += lumStride;
+src  += srcStride;
 
-b = src[6 * i + 3];
-g = src[6 * i + 4];
-r = src[6 * i + 5];
+if (y+1 == height)
+break;
 
-Y = ((ry * r + gy * g + by * b) >> RGB2YUV_SHIFT) + 16;
-ydst[2 * i + 1] = Y;
+for (i = 0; i < width; i++) {
+unsigned int b = src[3 * i + 0];
+unsigned int g = src[3 * i + 1];
+unsigned int r = src[3 * i + 2];
+
+unsigned int Y = (ry * r + gy * g + by * b + ky) >> RGB2YUV_SHIFT;
+
+ydst[i] = Y;
 }
 udst += chromStride;
 vdst += chromStride;
diff --git a/libswscale/swscale_unscaled.c b/libswscale/swscale_unscaled.c
index 751bdcb2e4..e10f967755 100644
--- a/libswscale/swscale_unscaled.c
+++ b/libswscale/swscale_unscaled.c
@@ -1994,7 +1994,6 @@ void ff_get_unscaled_swscale(SwsContext *c)
 const enum AVPixelFormat dstFormat = c->dstFormat;
 const int flags = c->flags;
 const int dstH = c->dstH;
-const int dstW = c->dstW;
 int needsDither;
 
 needsDither = isAnyRGB(dstFormat) &&
@@ -2052,12 +2051,12 @@ void ff_get_unscaled_swscale(SwsContext *c)
 /* bgr24toYV12 */
 if (srcFormat == AV_PIX_FMT_BGR24 &&
 (dstFormat == AV_PIX_FMT_YUV420P || dstFormat == AV_PIX_FMT_YUVA420P) &&
-!(flags & (SWS_ACCURATE_RND | SWS_BITEXACT)) && !(dstW&1))
+!(flags & (SWS_ACCURATE_RND | SWS_BITEXACT)))
 c->convert_unscaled = bgr24ToYv12Wrapper;
 /* rgb24toYV12 */
 if (srcFormat == AV_PIX_FMT_RGB24 &&
 (dstFormat == AV_PIX_FMT_YUV420P || dstFormat == AV_PIX_FMT_YUVA420P) &&
-!(flags & (SWS_ACCURATE_RND | SWS_BITEXACT)) && !(dstW&1))
+!(flags & (

[FFmpeg-devel] [PATCH v1 5/6] swscale: Add unscaled XRGB->YUV420P functions

2023-08-20 Thread John Cox

Add simple C functions for converting XRGB to YUV420P. Same logic as the
RGB24 functions but dropping the A channel.
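
As a rough illustration of "same logic, dropping the fourth channel", the sketch below computes one row of luma from 4-byte pixels. It is not the code from this patch: sk_row_luma_x32, its `first` parameter and the 15-bit shift are assumptions made for the example.

```c
#include <stdint.h>

#define SK32_SHIFT 15  /* assumption: same fixed-point shift as the RGB24 path */

/* Hypothetical helper: luma for one row of 4-byte pixels.
 * first   = byte offset of the first colour component inside a pixel:
 *           0 for RGBX/BGRX style layouts, 1 for XRGB/XBGR style layouts.
 * coef[3] = weights for the three colour bytes, in memory order.
 * The padding (X/A) byte is simply never read. */
static void sk_row_luma_x32(const uint8_t *src, uint8_t *ydst, int width,
                            int first, const int32_t coef[3])
{
    const int32_t ky = ((16 << 1) + 1) << (SK32_SHIFT - 1);

    for (int i = 0; i < width; i++) {
        const uint8_t *p = src + 4 * i + first;  /* step 4 bytes, not 3 */
        ydst[i] = (coef[0] * p[0] + coef[1] * p[1] + coef[2] * p[2] + ky)
                  >> SK32_SHIFT;
    }
}
```

The only differences from a 3-byte loop are the pixel stride and the starting offset, which is why the four layouts can share one body.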

Signed-off-by: John Cox 
---
 libswscale/rgb2rgb.c  |  20 +++
 libswscale/rgb2rgb.h  |  16 +
 libswscale/rgb2rgb_template.c | 106 ++
 libswscale/swscale_unscaled.c |  89 
 4 files changed, 231 insertions(+)

diff --git a/libswscale/rgb2rgb.c b/libswscale/rgb2rgb.c
index de90e5193f..b976341e70 100644
--- a/libswscale/rgb2rgb.c
+++ b/libswscale/rgb2rgb.c
@@ -88,6 +88,26 @@ void (*ff_rgb24toyv12)(const uint8_t *src, uint8_t *ydst,
int width, int height,
int lumStride, int chromStride, int srcStride,
int32_t *rgb2yuv);
+void (*ff_rgbxtoyv12)(const uint8_t *src, uint8_t *ydst,
+ uint8_t *udst, uint8_t *vdst,
+ int width, int height,
+ int lumStride, int chromStride, int srcStride,
+ int32_t *rgb2yuv);
+void (*ff_bgrxtoyv12)(const uint8_t *src, uint8_t *ydst,
+ uint8_t *udst, uint8_t *vdst,
+ int width, int height,
+ int lumStride, int chromStride, int srcStride,
+ int32_t *rgb2yuv);
+void (*ff_xrgbtoyv12)(const uint8_t *src, uint8_t *ydst,
+ uint8_t *udst, uint8_t *vdst,
+ int width, int height,
+ int lumStride, int chromStride, int srcStride,
+ int32_t *rgb2yuv);
+void (*ff_xbgrtoyv12)(const uint8_t *src, uint8_t *ydst,
+ uint8_t *udst, uint8_t *vdst,
+ int width, int height,
+ int lumStride, int chromStride, int srcStride,
+ int32_t *rgb2yuv);
 void (*planar2x)(const uint8_t *src, uint8_t *dst, int width, int height,
  int srcStride, int dstStride);
 void (*interleaveBytes)(const uint8_t *src1, const uint8_t *src2, uint8_t *dst,
diff --git a/libswscale/rgb2rgb.h b/libswscale/rgb2rgb.h
index f7a76a92ba..0015b1568a 100644
--- a/libswscale/rgb2rgb.h
+++ b/libswscale/rgb2rgb.h
@@ -135,6 +135,22 @@ extern void (*ff_rgb24toyv12)(const uint8_t *src, uint8_t *ydst, uint8_t *udst,
   int width, int height,
   int lumStride, int chromStride, int srcStride,
   int32_t *rgb2yuv);
+extern void (*ff_rgbxtoyv12)(const uint8_t *src, uint8_t *ydst, uint8_t *udst, uint8_t *vdst,
+ int width, int height,
+ int lumStride, int chromStride, int srcStride,
+ int32_t *rgb2yuv);
+extern void (*ff_bgrxtoyv12)(const uint8_t *src, uint8_t *ydst, uint8_t *udst, uint8_t *vdst,
+ int width, int height,
+ int lumStride, int chromStride, int srcStride,
+ int32_t *rgb2yuv);
+extern void (*ff_xrgbtoyv12)(const uint8_t *src, uint8_t *ydst, uint8_t *udst, uint8_t *vdst,
+ int width, int height,
+ int lumStride, int chromStride, int srcStride,
+ int32_t *rgb2yuv);
+extern void (*ff_xbgrtoyv12)(const uint8_t *src, uint8_t *ydst, uint8_t *udst, uint8_t *vdst,
+ int width, int height,
+ int lumStride, int chromStride, int srcStride,
+ int32_t *rgb2yuv);
 extern void (*planar2x)(const uint8_t *src, uint8_t *dst, int width, int height,
 int srcStride, int dstStride);
 
diff --git a/libswscale/rgb2rgb_template.c b/libswscale/rgb2rgb_template.c
index 5503e58a29..22326807c5 100644
--- a/libswscale/rgb2rgb_template.c
+++ b/libswscale/rgb2rgb_template.c
@@ -742,6 +742,108 @@ void ff_rgb24toyv12_c(const uint8_t *src, uint8_t *ydst, uint8_t *udst,
 rgb24toyv12_x(src, ydst, udst, vdst, width, height, lumStride, chromStride, srcStride, rgb2yuv, x_rgb);
 }
 
+static void rgbxtoyv12_x(const uint8_t *src, uint8_t *ydst, uint8_t *udst,
+   uint8_t *vdst, int width, int height, int lumStride,
+   int chromStride, int srcStride, int32_t *rgb2yuv,
+   const uint8_t x[9])
+{
+int32_t ry = rgb2yuv[x[0]], gy = rgb2yuv[x[1]], by = rgb2yuv[x[2]];
+int32_t ru = rgb2yuv[x[3]], gu = rgb2yuv[x[4]], bu = rgb2yuv[x[5]];
+int32_t rv = rgb2yuv[x[6]], gv = rgb2yuv[x[7]], bv = rgb2yuv[x[8]];
+int y;
+const int chromWidth = width >

[FFmpeg-devel] [PATCH v1 6/6] swscale: Add aarch64 functions for RGB24->YUV420P

2023-08-20 Thread John Cox

Neon RGB24->YUV420P and BGR24->YUV420P functions. Works on 16 pixel
blocks and can do any width or height, though for widths less than 32 or
so the C is likely faster.

Signed-off-by: John Cox 
---
 libswscale/aarch64/rgb2rgb.c  |   8 +
 libswscale/aarch64/rgb2rgb_neon.S | 356 ++
 2 files changed, 364 insertions(+)

diff --git a/libswscale/aarch64/rgb2rgb.c b/libswscale/aarch64/rgb2rgb.c
index a9bf6ff9e0..b2d68c1df3 100644
--- a/libswscale/aarch64/rgb2rgb.c
+++ b/libswscale/aarch64/rgb2rgb.c
@@ -30,6 +30,12 @@
 void ff_interleave_bytes_neon(const uint8_t *src1, const uint8_t *src2,
   uint8_t *dest, int width, int height,
   int src1Stride, int src2Stride, int dstStride);
+void ff_bgr24toyv12_neon(const uint8_t *src, uint8_t *ydst, uint8_t *udst,
+ uint8_t *vdst, int width, int height, int lumStride,
+ int chromStride, int srcStride, int32_t *rgb2yuv);
+void ff_rgb24toyv12_neon(const uint8_t *src, uint8_t *ydst, uint8_t *udst,
+ uint8_t *vdst, int width, int height, int lumStride,
+ int chromStride, int srcStride, int32_t *rgb2yuv);
 
 av_cold void rgb2rgb_init_aarch64(void)
 {
@@ -37,5 +43,7 @@ av_cold void rgb2rgb_init_aarch64(void)
 
 if (have_neon(cpu_flags)) {
 interleaveBytes = ff_interleave_bytes_neon;
+ff_rgb24toyv12 = ff_rgb24toyv12_neon;
+ff_bgr24toyv12 = ff_bgr24toyv12_neon;
 }
 }
diff --git a/libswscale/aarch64/rgb2rgb_neon.S b/libswscale/aarch64/rgb2rgb_neon.S
index d81110ec57..b15e69a3bd 100644
--- a/libswscale/aarch64/rgb2rgb_neon.S
+++ b/libswscale/aarch64/rgb2rgb_neon.S
@@ -77,3 +77,359 @@ function ff_interleave_bytes_neon, export=1
 0:
 ret
 endfunc
+
+// Expand rgb2 into r0+r1/g0+g1/b0+b1
+.macro XRGB3Y r0, g0, b0, r1, g1, b1, r2, g2, b2
+uxtl    \r0\().8h, \r2\().8b
+uxtl    \g0\().8h, \g2\().8b
+uxtl    \b0\().8h, \b2\().8b
+
+uxtl2   \r1\().8h, \r2\().16b
+uxtl2   \g1\().8h, \g2\().16b
+uxtl2   \b1\().8h, \b2\().16b
+.endm
+
+// Expand rgb2 into r0+r1/g0+g1/b0+b1
+// and pick every other el to put back into rgb2 for chroma
+.macro XRGB3YC r0, g0, b0, r1, g1, b1, r2, g2, b2
+XRGB3Y  \r0, \g0, \b0, \r1, \g1, \b1, \r2, \g2, \b2
+
+bic \r2\().8h, #0xff, LSL #8
+bic \g2\().8h, #0xff, LSL #8
+bic \b2\().8h, #0xff, LSL #8
+.endm
+
+.macro SMLAL3 d0, d1, s0, s1, s2, c0, c1, c2
+smull   \d0\().4s, \s0\().4h, \c0
+smlal   \d0\().4s, \s1\().4h, \c1
+smlal   \d0\().4s, \s2\().4h, \c2
+smull2  \d1\().4s, \s0\().8h, \c0
+smlal2  \d1\().4s, \s1\().8h, \c1
+smlal2  \d1\().4s, \s2\().8h, \c2
+.endm
+
+// d0 may be s0
+// s0, s2 corrupted
+.macro SHRN_Y d0, s0, s1, s2, s3, k128h
+shrn    \s0\().4h, \s0\().4s, #12
+shrn2   \s0\().8h, \s1\().4s, #12
+add \s0\().8h, \s0\().8h, \k128h\().8h // +128 (>> 3 = 16)
+sqrshrun    \d0\().8b, \s0\().8h, #3
+shrn    \s2\().4h, \s2\().4s, #12
+shrn2   \s2\().8h, \s3\().4s, #12
+add \s2\().8h, \s2\().8h, \k128h\().8h
+sqrshrun2   \d0\().16b, v28.8h, #3
+.endm
+
+.macro SHRN_C d0, s0, s1, k128b
+shrn    \s0\().4h, \s0\().4s, #14
+shrn2   \s0\().8h, \s1\().4s, #14
+sqrshrn \s0\().8b, \s0\().8h, #1
+add \d0\().8b, \s0\().8b, \k128b\().8b // +128
+.endm
+
+.macro STB2V s0, n, a
+st1 {\s0\().b}[(\n+0)], [\a], #1
+st1 {\s0\().b}[(\n+1)], [\a], #1
+.endm
+
+.macro STB4V s0, n, a
+STB2V   \s0, (\n+0), \a
+STB2V   \s0, (\n+2), \a
+.endm
+
+
+// void ff_bgr24toyv12_neon(
+//  const uint8_t *src, // x0
+//  uint8_t *ydst,  // x1
+//  uint8_t *udst,  // x2
+//  uint8_t *vdst,  // x3
+//  int width,  // w4
+//  int height, // w5
+//  int lumStride,  // w6
+//  int chromStride,// w7
+//  int srcStr, // [sp, #0]
+//  int32_t *rgb2yuv);  // [sp, #8]
+
+function ff_bgr24toyv12_neon, export=1
+ldr x15, [sp, #8]
+ld3 {v3.s, v4.s, v5.s}[0], [x15], #12
+ld3 {v3.s, v4.s, v5.s}[1], [x15], #12
+ld3 {v3.s, v4.s, v5.s}[2], [x15]
+mov v6.16b, v3.16b
+mov v3.16b, v5.16b
+mov v5.16b, v6.16b
+b

Re: [FFmpeg-devel] [PATCH v1 3/6] swscale: Add explicit rgb24->yv12 conversion

2023-08-20 Thread John Cox

On Sun, 20 Aug 2023 19:16:14 +0200, you wrote:

>On Sun, Aug 20, 2023 at 03:10:19PM +0000, John Cox wrote:
>> Add a rgb24->yuv420p conversion. Uses the same code as the existing
>> bgr24->yuv converter but permutes the conversion array to swap R & B
>> coefficients.
>> 
>> Signed-off-by: John Cox 
>> ---
>>  libswscale/rgb2rgb.c  |  5 +
>>  libswscale/rgb2rgb.h  |  7 +++
>>  libswscale/rgb2rgb_template.c | 38 ++-
>>  libswscale/swscale_unscaled.c | 24 +-
>>  4 files changed, 68 insertions(+), 6 deletions(-)
>> 
>> diff --git a/libswscale/rgb2rgb.c b/libswscale/rgb2rgb.c
>> index 8707917800..de90e5193f 100644
>> --- a/libswscale/rgb2rgb.c
>> +++ b/libswscale/rgb2rgb.c
>> @@ -83,6 +83,11 @@ void (*ff_bgr24toyv12)(const uint8_t *src, uint8_t *ydst,
>> int width, int height,
>> int lumStride, int chromStride, int srcStride,
>> int32_t *rgb2yuv);
>> +void (*ff_rgb24toyv12)(const uint8_t *src, uint8_t *ydst,
>> +   uint8_t *udst, uint8_t *vdst,
>> +   int width, int height,
>> +   int lumStride, int chromStride, int srcStride,
>> +   int32_t *rgb2yuv);
>>  void (*planar2x)(const uint8_t *src, uint8_t *dst, int width, int height,
>>   int srcStride, int dstStride);
>>  void (*interleaveBytes)(const uint8_t *src1, const uint8_t *src2, uint8_t 
>> *dst,
>> diff --git a/libswscale/rgb2rgb.h b/libswscale/rgb2rgb.h
>> index 305b830920..f7a76a92ba 100644
>> --- a/libswscale/rgb2rgb.h
>> +++ b/libswscale/rgb2rgb.h
>> @@ -79,6 +79,9 @@ void rgb12to15(const uint8_t *src, uint8_t *dst, int src_size);
>>  void ff_bgr24toyv12_c(const uint8_t *src, uint8_t *ydst, uint8_t *udst,
>>uint8_t *vdst, int width, int height, int lumStride,
>>int chromStride, int srcStride, int32_t *rgb2yuv);
>> +void ff_rgb24toyv12_c(const uint8_t *src, uint8_t *ydst, uint8_t *udst,
>> +  uint8_t *vdst, int width, int height, int lumStride,
>> +  int chromStride, int srcStride, int32_t *rgb2yuv);
>>  
>>  /**
>>   * Height should be a multiple of 2 and width should be a multiple of 16.
>> @@ -128,6 +131,10 @@ extern void (*ff_bgr24toyv12)(const uint8_t *src, 
>> uint8_t *ydst, uint8_t *udst,
>>int width, int height,
>>int lumStride, int chromStride, int srcStride,
>>int32_t *rgb2yuv);
>> +extern void (*ff_rgb24toyv12)(const uint8_t *src, uint8_t *ydst, uint8_t 
>> *udst, uint8_t *vdst,
>> +  int width, int height,
>> +  int lumStride, int chromStride, int srcStride,
>> +  int32_t *rgb2yuv);
>>  extern void (*planar2x)(const uint8_t *src, uint8_t *dst, int width, int 
>> height,
>>  int srcStride, int dstStride);
>>  
>> diff --git a/libswscale/rgb2rgb_template.c b/libswscale/rgb2rgb_template.c
>> index 8ef4a2cf5d..e57bfa6545 100644
>> --- a/libswscale/rgb2rgb_template.c
>> +++ b/libswscale/rgb2rgb_template.c
>
>
>> @@ -646,13 +646,14 @@ static inline void uyvytoyv12_c(const uint8_t *src, 
>> uint8_t *ydst,
>>   * others are ignored in the C version.
>>   * FIXME: Write HQ version.
>>   */
>> -void ff_bgr24toyv12_c(const uint8_t *src, uint8_t *ydst, uint8_t *udst,
>> +static void rgb24toyv12_x(const uint8_t *src, uint8_t *ydst, uint8_t *udst,
>
>this probably should be inline

Could do, and I will if you deem it important, but the only bit that
inline is going to help is the matrix coefficient loading and that
happens once outside the main loops.
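
To make that concrete, here is a small sketch of the index-permutation idea, reduced to luma. The SK_* enum values, the sk_ names and the 15-bit shift are assumptions for illustration only; the real code uses the RY_IDX ... BV_IDX constants and converts whole images.

```c
#include <stdint.h>

/* Hypothetical stand-ins for RY_IDX ... BV_IDX; the numeric values are
 * assumptions chosen only so that the sketch is self-contained. */
enum { SK_RY, SK_GY, SK_BY, SK_RU, SK_GU, SK_BU, SK_RV, SK_GV, SK_BV };

static const uint8_t sk_x_bgr[9] = {
    SK_RY, SK_GY, SK_BY, SK_RU, SK_GU, SK_BU, SK_RV, SK_GV, SK_BV,
};
static const uint8_t sk_x_rgb[9] = {
    SK_BY, SK_GY, SK_RY, SK_BU, SK_GU, SK_RU, SK_BV, SK_GV, SK_RV,
};

/* Luma of one 3-byte pixel. The body always calls byte 0 "b" and byte 2
 * "r"; for RGB input the table swaps which coefficient each one meets. */
static unsigned sk_luma(const uint8_t *px, const int32_t *rgb2yuv,
                        const uint8_t x[9])
{
    /* Table lookups: done once per call here, once per image in the patch. */
    const int32_t ry = rgb2yuv[x[0]], gy = rgb2yuv[x[1]], by = rgb2yuv[x[2]];
    const unsigned b = px[0], g = px[1], r = px[2];

    return ((ry * r + gy * g + by * b) >> 15) + 16;
}
```

Since the table is only consulted while loading the nine coefficients, the per-pixel loop is identical for both byte orders, which is why inlining has little to gain.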

>also i see now "FIXME: Write HQ version." above here. Do you really want to
>add a low quality rgb24toyv12 ?
>(it is vissible on the diagonal border (cyan / red )) in
> ./ffmpeg -f lavfi -i testsrc=size=5632x3168 -pix_fmt yuv420p -vframes 1 
> -qscale 1 -strict -1 new.jpg
>
> also on smaller sizes but for some reason its clearer on the big one zoomed 
> in 400% with gimp
>(the gimp test was done with the whole patchset not after this patch)

On the whole - yes - in the encode path on the Pi that I'm writing this
for speed is more important than quality - the existing path is too slow
to be usable. And honestly - using your example above comparing (Windows
photo viewer zoomed in s.t. pixels are clearly

Re: [FFmpeg-devel] [PATCH v1 3/6] swscale: Add explicit rgb24->yv12 conversion

2023-08-20 Thread John Cox

On Sun, 20 Aug 2023 19:45:11 +0200, you wrote:

>On Sun, Aug 20, 2023 at 07:16:14PM +0200, Michael Niedermayer wrote:
>> On Sun, Aug 20, 2023 at 03:10:19PM +0000, John Cox wrote:
>> > Add a rgb24->yuv420p conversion. Uses the same code as the existing
>> > bgr24->yuv converter but permutes the conversion array to swap R & B
>> > coefficients.
>> > 
>> > Signed-off-by: John Cox 
>> > ---
>> >  libswscale/rgb2rgb.c  |  5 +
>> >  libswscale/rgb2rgb.h  |  7 +++
>> >  libswscale/rgb2rgb_template.c | 38 ++-
>> >  libswscale/swscale_unscaled.c | 24 +-
>> >  4 files changed, 68 insertions(+), 6 deletions(-)
>> > 
>> > diff --git a/libswscale/rgb2rgb.c b/libswscale/rgb2rgb.c
>> > index 8707917800..de90e5193f 100644
>> > --- a/libswscale/rgb2rgb.c
>> > +++ b/libswscale/rgb2rgb.c
>> > @@ -83,6 +83,11 @@ void (*ff_bgr24toyv12)(const uint8_t *src, uint8_t 
>> > *ydst,
>> > int width, int height,
>> > int lumStride, int chromStride, int srcStride,
>> > int32_t *rgb2yuv);
>> > +void (*ff_rgb24toyv12)(const uint8_t *src, uint8_t *ydst,
>> > +   uint8_t *udst, uint8_t *vdst,
>> > +   int width, int height,
>> > +   int lumStride, int chromStride, int srcStride,
>> > +   int32_t *rgb2yuv);
>> >  void (*planar2x)(const uint8_t *src, uint8_t *dst, int width, int height,
>> >   int srcStride, int dstStride);
>> >  void (*interleaveBytes)(const uint8_t *src1, const uint8_t *src2, uint8_t 
>> > *dst,
>> > diff --git a/libswscale/rgb2rgb.h b/libswscale/rgb2rgb.h
>> > index 305b830920..f7a76a92ba 100644
>> > --- a/libswscale/rgb2rgb.h
>> > +++ b/libswscale/rgb2rgb.h
>> > @@ -79,6 +79,9 @@ void rgb12to15(const uint8_t *src, uint8_t *dst, int src_size);
>> >  void ff_bgr24toyv12_c(const uint8_t *src, uint8_t *ydst, uint8_t *udst,
>> >uint8_t *vdst, int width, int height, int lumStride,
>> >int chromStride, int srcStride, int32_t *rgb2yuv);
>> > +void ff_rgb24toyv12_c(const uint8_t *src, uint8_t *ydst, uint8_t *udst,
>> > +  uint8_t *vdst, int width, int height, int lumStride,
>> > +  int chromStride, int srcStride, int32_t *rgb2yuv);
>> >  
>> >  /**
>> >   * Height should be a multiple of 2 and width should be a multiple of 16.
>> > @@ -128,6 +131,10 @@ extern void (*ff_bgr24toyv12)(const uint8_t *src, 
>> > uint8_t *ydst, uint8_t *udst,
>> >int width, int height,
>> >int lumStride, int chromStride, int 
>> > srcStride,
>> >int32_t *rgb2yuv);
>> > +extern void (*ff_rgb24toyv12)(const uint8_t *src, uint8_t *ydst, uint8_t 
>> > *udst, uint8_t *vdst,
>> > +  int width, int height,
>> > +  int lumStride, int chromStride, int 
>> > srcStride,
>> > +  int32_t *rgb2yuv);
>> >  extern void (*planar2x)(const uint8_t *src, uint8_t *dst, int width, int 
>> > height,
>> >  int srcStride, int dstStride);
>> >  
>> > diff --git a/libswscale/rgb2rgb_template.c b/libswscale/rgb2rgb_template.c
>> > index 8ef4a2cf5d..e57bfa6545 100644
>> > --- a/libswscale/rgb2rgb_template.c
>> > +++ b/libswscale/rgb2rgb_template.c
>> 
>> 
>> > @@ -646,13 +646,14 @@ static inline void uyvytoyv12_c(const uint8_t *src, 
>> > uint8_t *ydst,
>> >   * others are ignored in the C version.
>> >   * FIXME: Write HQ version.
>> >   */
>> > -void ff_bgr24toyv12_c(const uint8_t *src, uint8_t *ydst, uint8_t *udst,
>> > +static void rgb24toyv12_x(const uint8_t *src, uint8_t *ydst, uint8_t 
>> > *udst,
>> 
>> this probably should be inline
>> 
>> also i see now "FIXME: Write HQ version." above here. Do you really want to
>> add a low quality rgb24toyv12 ?
>> (it is vissible on the diagonal border (cyan / red )) in
>>  ./ffmpeg -f lavfi -i testsrc=size=5632x3168 -pix_fmt yuv420p -vframes 1 
>> -qscale 1 -strict -1 new.jpg
>> 
>>  also on smaller sizes but for some reason its clearer on the big one z

Re: [FFmpeg-devel] [PATCH v1 3/6] swscale: Add explicit rgb24->yv12 conversion

2023-08-22 Thread John Cox

On Mon, 21 Aug 2023 21:15:37 +0200, you wrote:

>On Sun, Aug 20, 2023 at 07:28:40PM +0100, John Cox wrote:
>> On Sun, 20 Aug 2023 19:45:11 +0200, you wrote:
>> 
>> >On Sun, Aug 20, 2023 at 07:16:14PM +0200, Michael Niedermayer wrote:
>> >> On Sun, Aug 20, 2023 at 03:10:19PM +0000, John Cox wrote:
>> >> > Add a rgb24->yuv420p conversion. Uses the same code as the existing
>> >> > bgr24->yuv converter but permutes the conversion array to swap R & B
>> >> > coefficients.
>> >> > 
>> >> > Signed-off-by: John Cox 
>> >> > ---
>> >> >  libswscale/rgb2rgb.c  |  5 +
>> >> >  libswscale/rgb2rgb.h  |  7 +++
>> >> >  libswscale/rgb2rgb_template.c | 38 ++-
>> >> >  libswscale/swscale_unscaled.c | 24 +-
>> >> >  4 files changed, 68 insertions(+), 6 deletions(-)
>> >> > 
>> >> > diff --git a/libswscale/rgb2rgb.c b/libswscale/rgb2rgb.c
>> >> > index 8707917800..de90e5193f 100644
>> >> > --- a/libswscale/rgb2rgb.c
>> >> > +++ b/libswscale/rgb2rgb.c
>> >> > @@ -83,6 +83,11 @@ void (*ff_bgr24toyv12)(const uint8_t *src, uint8_t 
>> >> > *ydst,
>> >> > int width, int height,
>> >> > int lumStride, int chromStride, int srcStride,
>> >> > int32_t *rgb2yuv);
>> >> > +void (*ff_rgb24toyv12)(const uint8_t *src, uint8_t *ydst,
>> >> > +   uint8_t *udst, uint8_t *vdst,
>> >> > +   int width, int height,
>> >> > +   int lumStride, int chromStride, int srcStride,
>> >> > +   int32_t *rgb2yuv);
>> >> >  void (*planar2x)(const uint8_t *src, uint8_t *dst, int width, int 
>> >> > height,
>> >> >   int srcStride, int dstStride);
>> >> >  void (*interleaveBytes)(const uint8_t *src1, const uint8_t *src2, 
>> >> > uint8_t *dst,
>> >> > diff --git a/libswscale/rgb2rgb.h b/libswscale/rgb2rgb.h
>> >> > index 305b830920..f7a76a92ba 100644
>> >> > --- a/libswscale/rgb2rgb.h
>> >> > +++ b/libswscale/rgb2rgb.h
>> >> > @@ -79,6 +79,9 @@ void rgb12to15(const uint8_t *src, uint8_t *dst, int src_size);
>> >> >  void ff_bgr24toyv12_c(const uint8_t *src, uint8_t *ydst, uint8_t *udst,
>> >> >uint8_t *vdst, int width, int height, int 
>> >> > lumStride,
>> >> >int chromStride, int srcStride, int32_t 
>> >> > *rgb2yuv);
>> >> > +void ff_rgb24toyv12_c(const uint8_t *src, uint8_t *ydst, uint8_t *udst,
>> >> > +  uint8_t *vdst, int width, int height, int 
>> >> > lumStride,
>> >> > +  int chromStride, int srcStride, int32_t 
>> >> > *rgb2yuv);
>> >> >  
>> >> >  /**
>> >> >   * Height should be a multiple of 2 and width should be a multiple of 
>> >> > 16.
>> >> > @@ -128,6 +131,10 @@ extern void (*ff_bgr24toyv12)(const uint8_t *src, 
>> >> > uint8_t *ydst, uint8_t *udst,
>> >> >int width, int height,
>> >> >int lumStride, int chromStride, int 
>> >> > srcStride,
>> >> >int32_t *rgb2yuv);
>> >> > +extern void (*ff_rgb24toyv12)(const uint8_t *src, uint8_t *ydst, 
>> >> > uint8_t *udst, uint8_t *vdst,
>> >> > +  int width, int height,
>> >> > +  int lumStride, int chromStride, int 
>> >> > srcStride,
>> >> > +  int32_t *rgb2yuv);
>> >> >  extern void (*planar2x)(const uint8_t *src, uint8_t *dst, int width, 
>> >> > int height,
>> >> >  int srcStride, int dstStride);
>> >> >  
>> >> > diff --git a/libswscale/rgb2rgb_template.c 
>> >> > b/libswscale/rgb2rgb_template.c
>> >> > index 8ef4a2cf5d..e57bfa6545 100644
>> >> > --- a/libswscale/rgb2rgb_template.c
>> >> > +++ b/libswscale/rgb2rgb_template.c
>> >> 

[FFmpeg-devel] Does rtspenc actually support AVFMT_GLOBALHEADER?

2024-08-19 Thread John Cox

Hi

Does rtspenc actually support AVFMT_GLOBALHEADER? It is specified in the
FFOutputFormat flags but I can't see anywhere in the code where
extradata is referenced like it is in other output formats which support
that flag.

I ask because I have an encoder that supports the flag and when set
removes SPS/PPS from the stream and puts them in extradata instead which
I believe is the correct behavior - if it isn't then that is my problem
and I'd appreciate clarification of what is meant to occur. The
transmitted RTSP stream then doesn't contain SPS/PPS.

Removal of AVFMT_GLOBALHEADER from the flags in rtspenc.c fixes my
problem and I'll very happily submit a patch to that effect, but first
I'd like to know if that is in fact the root of my problem - my
understanding of the RTSP code is very limited and I'd appreciate advice
from someone who knows something about it.

Many thanks

John Cox

Re: [FFmpeg-devel] Does rtspenc actually support AVFMT_GLOBALHEADER?

2024-08-20 Thread John Cox

On Mon, 19 Aug 2024 at 19:32, Martin Storsjö  wrote:
>
> On Mon, 19 Aug 2024, John Cox wrote:
>
> > Does rtspenc actually support AVFMT_GLOBALHEADER? It is specified in the
> > FFOutputFormat flags but I can't see anywhere in the code where
> > extradata is referenced like it is in other output formats which support
> > that flag.
> >
> > I ask because I have an encoder that supports the flag and when set
> > removes SPS/PPS from the stream and puts them in extradata instead which
> > I believe is the correct behavior - if it isn't then that is my problem
> > and I'd appreciate clarification of what is meant to occur. The
> > transmitted RTSP stream then doesn't contain SPS/PPS.
>
> That's correct, the SPS/PPS gets transmitted in the SDP description, not
> in-band.

Many thanks for the info. I thought something like that should occur
but I couldn't find it.
Now I know where I should be looking.
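
For reference, a minimal sketch of what "in the SDP description" can look like for H.264, assuming the RTP payload format of RFC 6184, where the parameter sets are carried base64-encoded in an sprop-parameter-sets attribute. The sk_ helpers are hypothetical and are not FFmpeg functions; how FFmpeg's own SDP code derives this from extradata is not shown in this thread.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Plain base64 encoder; returns the encoded length, or 0 if the input is
 * empty or the output buffer is too small. */
static size_t sk_b64_encode(const uint8_t *in, size_t len,
                            char *out, size_t out_size)
{
    static const char tbl[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    size_t o = 0;

    if (out_size < 4 * ((len + 2) / 3) + 1)
        return 0;
    for (size_t i = 0; i < len; i += 3) {
        uint32_t v = (uint32_t)in[i] << 16;
        if (i + 1 < len) v |= (uint32_t)in[i + 1] << 8;
        if (i + 2 < len) v |= in[i + 2];
        out[o++] = tbl[(v >> 18) & 63];
        out[o++] = tbl[(v >> 12) & 63];
        out[o++] = (i + 1 < len) ? tbl[(v >> 6) & 63] : '=';
        out[o++] = (i + 2 < len) ? tbl[v & 63] : '=';
    }
    out[o] = '\0';
    return o;
}

/* Build "sprop-parameter-sets=<base64 SPS>,<base64 PPS>".
 * Returns the string length, or -1 on failure. */
static int sk_make_sprop(const uint8_t *sps, size_t sps_len,
                         const uint8_t *pps, size_t pps_len,
                         char *out, size_t out_size)
{
    char a[256], b[256];
    int n;

    if (!sk_b64_encode(sps, sps_len, a, sizeof(a)) ||
        !sk_b64_encode(pps, pps_len, b, sizeof(b)))
        return -1;
    n = snprintf(out, out_size, "sprop-parameter-sets=%s,%s", a, b);
    return (n < 0 || (size_t)n >= out_size) ? -1 : n;
}
```

A receiver that reads this attribute can configure its decoder before the first packet arrives, so the stream itself does not need to repeat SPS/PPS.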

John Cox

Re: [FFmpeg-devel] [PATCH] libavdevice: Add KMS/DRM output device

2021-01-19 Thread John Cox

On Mon, 18 Jan 2021 23:37:09 +0000, you wrote:
>On 16/01/2021 22:12, Nicolas Caramelli wrote:
>> This patch adds KMS/DRM output device for rendering a video stream
>> using KMS/DRM dumb buffer.
>> The proposed implementation is very basic, only bgr0 pixel format is
>> currently supported (the most common format with KMS/DRM).
>> To enable this output device you need to configure FFmpeg with 
>> --enable-libdrm.
>> Example: ffmpeg -re -i INPUT -pix_fmt bgr0 -f kmsdumb /dev/dri/card0
>
>If you want to render things to a normal display device why not use a normal 
>video player?  Or even ffplay?
>
>IMO something like this would be of more value as a simple video player 
>example with the documentation rather than including it as weirdly constrained 
>library code which will see very little use.
>
>(Note that I would argue against adding more general display output devices 
>which are already present, like fb and xv, because they are of essentially no 
>value to libavdevice users.  Removing legacy code is harder, though.)

I take your point but I personally have found it very useful to have
simple display devices on the output of ffmpeg for testing purposes.
Though I guess that if I want that then the device should be bundled
with the application rather than in a library.

John Cox

[FFmpeg-devel] [PATCH]levc/hevc_cabac Optimise ff_hevc_hls_residual_coding (especially ARM)

2016-01-19 Thread John Cox

Hi

I've just done a fair bit of work on hevc_cabac decode for the Raspberry
Pi 2 and I think that the patch is generally applicable.  Patch is
attached but you may prefer to take it from git:

https://github.com/jc-kynesim/rpi-ffmpeg.git
branch: test/ff_hevc_cabac_3
commit: 423e160e639d301feb2b4ba220199d112def0164

On the Pi2 playing a 10Mbit 1080p H.265 clip (A bit of the Hobbit) it
reduces the time in ff_hevc_hls_residual_coding (until transform) from
~26Gcycles to ~18Gcycles and it almost halves the time spent in the
"core" bit of the function (from decoding the greater1 bits to the end
of decode).  This was measured using the CPU cycle counter.  Tests done
at Raspberry Pi suggest that on their ffmpeg branch it reduces overall
CPU loading by ~10% whilst playing H.265.  I haven't profiled it on any
other platform - but I would expect useful improvements on most streams
on most platforms.

I have not yet run fate over it as I haven't yet finished downloading
the samples (the internet connection here isn't wildly fast), but I have
run it against the H265.1 conformance streams on both x86 and ARM and it
causes no regressions.

Known unknowns / possible issues:
1) I haven't tested it on anything with 64-bit ints (I don't have an
appropriate m/c) - whilst I've coded in a manner that should hopefully
be OK there I can see that there might be issues.

2) Only tested on gcc 4.8 and later (5.1 & 5.3).  I've used an anonymous
union to avoid changing other cabac code - I could believe this was a
no-no and I'll have to change that.

3) Uses clz which doesn't seem to exist in the ffmpeg int libs (though
ctz does)

I'll happily accept suggestions as to what is considered better practice
for these points.
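
On point 3, one possible shape for a portable count-leading-zeros helper is sketched below. sk_clz32 is a hypothetical name, not an existing FFmpeg function; the GCC/Clang builtin is used where available and a plain loop otherwise.

```c
#include <stdint.h>

/* Count leading zero bits of a 32-bit value; returns 32 for x == 0,
 * where the compiler builtin on its own would be undefined. */
static inline int sk_clz32(uint32_t x)
{
#if defined(__GNUC__) || defined(__clang__)
    return x ? __builtin_clz(x) : 32;
#else
    int n = 0;

    if (!x)
        return 32;
    while (!(x & 0x80000000u)) {
        x <<= 1;
        n++;
    }
    return n;
#endif
}
```

For nonzero x this equals 31 - floor(log2(x)), so an existing integer log2 helper (libavutil's av_log2(), for example) could serve the same purpose without adding a new primitive.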

Regards

John Cox


0001-H.265-residual-decode-performance-improvements.patch
Description: Binary data

Re: [FFmpeg-devel] [PATCH]levc/hevc_cabac Optimise ff_hevc_hls_residual_coding (especially ARM)

2016-01-19 Thread John Cox

Hi

>On Tue, Jan 19, 2016 at 7:46 AM, John Cox  wrote:
>
>> Hi
>>
>> I've just done a fair bit of work on hevc_cabac decode for the Rasberry
>> Pi2 and I think that the patch is generally applicable.  Patch is
>> attached but you may prefer to take it from git:
>
>
>Cool! Two non-technical comments first, I'll try to make time to review
>in-depth/technically soon:
>
>1:
>
>> +#define UNCHECKED_BITSTREAM_READER 1
>
>I don't think that's right, and is a security issue.

I added that line as (nearly) every other decoder in libavcodec has it -
in particular h264_cabac.c has it.

Going forward: Checking bitstream position on every load is terribly
wasteful - if at all possible a better idea is to allocate more space
than is required in the input bitstream buffer so some overrun is
permissible without a segfault and only check position at the end of every
block or other medium sized unit. (You can nearly always work out what
the worst case overread can be.)
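
A minimal sketch of that approach, under stated assumptions: SK_PADDING stands for a codec-specific worst-case overread per block, and the SkReader type and sk_ functions are hypothetical, not FFmpeg's bitstream reader API.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SK_PADDING 8  /* hypothetical worst-case overread per block, bytes */

typedef struct {
    uint8_t *buf;    /* size + SK_PADDING bytes; the padding is zeroed */
    size_t   size;   /* real payload size in bytes */
    size_t   bitpos; /* current read position in bits */
} SkReader;

static int sk_init(SkReader *r, const uint8_t *data, size_t size)
{
    r->buf = calloc(size + SK_PADDING, 1);
    if (!r->buf)
        return -1;
    memcpy(r->buf, data, size);
    r->size   = size;
    r->bitpos = 0;
    return 0;
}

/* Hot path: no bounds check. Safe only while a block never reads more
 * than SK_PADDING bytes past the end of the payload. */
static unsigned sk_get_bit(SkReader *r)
{
    unsigned bit = (r->buf[r->bitpos >> 3] >> (7 - (r->bitpos & 7))) & 1;
    r->bitpos++;
    return bit;
}

/* Cold path: called once per block rather than once per read. */
static int sk_overran(const SkReader *r)
{
    return r->bitpos > r->size * 8;
}

static void sk_free(SkReader *r)
{
    free(r->buf);
    r->buf = NULL;
}
```

The caller decodes a whole block with sk_get_bit() and then tests sk_overran() once; a block that ran into the padding is discarded as corrupt.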

>2: your indentation of function declarations is weird. E.g.:
>
>+static inline uint32_t get_greaterx_bits(HEVCContext * const s, const
>unsigned int n_end, int * const levels,
>+int * const pprev_subset_coded, int * const psum,
>+const unsigned int idx0_gt1, const unsigned int idx_gt2)
>
>We tend to indent the second line so it aligns with the opening bracket of
>the first line.

Fair enough

>Alike, your indentation of const variable declarations:
>
>+uint8_t * const state0 = s->HEVClc->cabac_state + idx0_gt1;
>
>doesn't need a space between * and const.

If that is required style then I'll abide by it, but I think that
detracts noticeably from readability.

>Like I said, all non-technical, I'll do technical bits soon if I find time.
>If other people like it and I haven't responded yet, just commit it and we
>can address them post-push.

Thanks

JC

>Ronald

Re: [FFmpeg-devel] [PATCH]levc/hevc_cabac Optimise ff_hevc_hls_residual_coding (especially ARM)

2016-01-19 Thread John Cox

On Tue, 19 Jan 2016 15:59:39 +0000 (UTC), you wrote:

>John Cox  kynesim.co.uk> writes:
>
>> >> +#define UNCHECKED_BITSTREAM_READER 1
>> >
>> >I don't think that's right, and is a security issue.
>> 
>> I added that line as (nearly) every other decoder in 
>> libavcodec has it -
>
>Sure?

OK - not all:

h263dec.c
h264.c
h264_cabac.c
h264_cavlc.c
huffyuvdec.c
ituh263dec.c
mpeg12dec.c
mpeg12.c
mpeg4videodec.c
mpeg4video_parser.c

But that probably covers 90% of the video streams decoded with ffmpeg

>> in particular h264_cabac.c has it.
>
>Extensive testing was done before it was added.

Testing that it doesn't seg-fault no matter what the input or some other
sort of testing?

>Could you confirm how much of the speedup comes 
>only from this change?

Not an awful lot - a few % of the total improvement, but I was looking
for everything I can get.  I'll happily take it out of this patch if it
is controversial.

>While we definitely all welcome a noticeable speedup 
>of hevc decoding (and while my opinion on your patch 
>has limited relevance) I believe that the patch 
>absolutely has to be split: First step would be to 
>have a split between changes in the general code and 
>changes to arm assembly, I believe the first patch 
>then may be split further.

Happy to split out the arm asm.  Splitting the rest of it will be harder
if you want it to continue working at all intermediate points.

>I am a little surprised that you wrote some asm 
>functions that are slower than what the compiler 
>produces: Did you analyze this?

Yeah - they aren't much, if at all, slower but unless they are actively
faster it seems silly to use difficult to maintain asm where the C will
do.  In the end it came down to the asm constraining the order in which
stuff happens in the surrounding code and that wasn't always good.

Regards

JC

>Carl Eugen
>

Re: [FFmpeg-devel] [PATCH]levc/hevc_cabac Optimise ff_hevc_hls_residual_coding (especially ARM)

2016-01-19 Thread John Cox

>John Cox  kynesim.co.uk> writes:
>
>> On Tue, 19 Jan 2016 15:59:39 +0000 (UTC), you wrote:
>> 
>> >John Cox  kynesim.co.uk> writes:
>> >
>> >> >> +#define UNCHECKED_BITSTREAM_READER 1
>> >> >
>> >> >I don't think that's right, and is a security issue.
>> >> 
>> >> I added that line as (nearly) every other decoder in 
>> >> libavcodec has it -
>> >
>> >Sure?
>> 
>> OK - not all:
>> 
>> h263dec.c
>> h264.c
>> h264_cabac.c
>> h264_cavlc.c
>> huffyuvdec.c
>> ituh263dec.c
>> mpeg12dec.c
>> mpeg12.c
>> mpeg4videodec.c
>> mpeg4video_parser.c
>> 
>> But that probably covers 90% of the video streams 
>> decoded with ffmpeg
>
>The three decoders mpegvideo, h263/asp and h264 are 
>not "(nearly) every other decoder", sorry!

Sorry - I (obviously) misremembered the number of hits I got when I last
did that search.

>> >> in particular h264_cabac.c has it.
>> >
>> >Extensive testing was done before it was added.
>> 
>> Testing that it doesn't seg-fault no matter what the 
>> input or some other sort of testing?
>
>Yes, tests that show that fuzzed input does not crash 
>the decoder are needed.
>
>But afaict, the change is unrelated to the rest of your 
>patch and should be discussed separately (imo).

Yup - perfectly happy to put that can of worms to one side.

>> >Could you confirm how much of the speedup comes 
>> >only from this change?
>> 
>> Not an awful lot - a few % of the total improvement, but 
>> I was looking for everything I can get.  I'll happily 
>> take it out of this patch if it is controversial.
>
>I wouldn't say controversial (I am all for it, sorry if 
>this wasn't clear) but I think it can be discussed after 
>your speedup was committed.

Yup - at this point it is simply a distraction

>> >While we definitely all welcome a noticeable speedup 
>> >of hevc decoding (and while my opinion on your patch 
>> >has limited relevance) I believe that the patch 
>> >absolutely has to be split: First step would be to 
>> >have a split between changes in the general code and 
>> >changes to arm assembly, I believe the first patch 
>> >then may be split further.
>> 
>> Happy to split out the arm asm.
>
>Please do, my suggestion would be to start with the 
>changes to the C code. But it may be wise to wait for a 
>real review first.

I've done enough review processes to know that waiting till the comments
die down before doing anything is the way to go :-)

JC

>Carl Eugen
>

Re: [FFmpeg-devel] [PATCH]levc/hevc_cabac Optimise ff_hevc_hls_residual_coding (especially ARM)

2016-01-19 Thread John Cox

>On 1/19/2016 9:46 AM, John Cox wrote:
>> +// Helper fns
>> +#ifndef hevc_mem_bits32
>> +static av_always_inline uint32_t hevc_mem_bits32(const void * buf, const 
>> unsigned int offset)
>> +{
>> +return AV_RB32((const uint8_t *)buf + (offset >> 3)) << (offset & 7);
>> +}
>> +#endif
>> +
>> +#if AV_GCC_VERSION_AT_LEAST(3,4) && !defined(hevc_clz32)
>> +#define hevc_clz32 hevc_clz32_builtin
>> +static av_always_inline unsigned int hevc_clz32_builtin(const uint32_t x)
>> +{
>> +// __builtin_clz says it works on ints - so adjust if int is >32 bits 
>> long
>> +return __builtin_clz(x) - (sizeof(int) * 8 - 32);
>
>Why aren't you simply using ff_clz?

Because it doesn't exist? or at least I can't find it.

>> +}
>> +#endif
>> +
>> +// It is unlikely that we will ever need this but include for completeness
>
>There are at least two compilers we support that don't define __GNUC__, so
>it would be used.
>And in any case, isn't all this duplicating ff_clz, which is available in
>libavutil/intmath.h?

Are you sure of that?  I can find ff_ctz but no ff_clz...
I would happily be wrong.

[snip]

JC

Re: [FFmpeg-devel] [PATCH]levc/hevc_cabac Optimise ff_hevc_hls_residual_coding (especially ARM)

2016-01-19 Thread John Cox

On Tue, 19 Jan 2016 14:09:22 -0300, you wrote:

>On 1/19/2016 2:05 PM, John Cox wrote:
>>> On 1/19/2016 9:46 AM, John Cox wrote:
>>>> +// Helper fns
>>>> +#ifndef hevc_mem_bits32
>>>> +static av_always_inline uint32_t hevc_mem_bits32(const void * buf, const 
>>>> unsigned int offset)
>>>> +{
>>>> +return AV_RB32((const uint8_t *)buf + (offset >> 3)) << (offset & 7);
>>>> +}
>>>> +#endif
>>>> +
>>>> +#if AV_GCC_VERSION_AT_LEAST(3,4) && !defined(hevc_clz32)
>>>> +#define hevc_clz32 hevc_clz32_builtin
>>>> +static av_always_inline unsigned int hevc_clz32_builtin(const uint32_t x)
>>>> +{
>>>> +// __builtin_clz says it works on ints - so adjust if int is >32 bits 
>>>> long
>>>> +return __builtin_clz(x) - (sizeof(int) * 8 - 32);
>>>
>>> Why aren't you simply using ff_clz?
>> 
>> Because it doesn't exist? or at least I can't find it.
>> 
>>>> +}
>>>> +#endif
>>>> +
>>>> +// It is unlikely that we will ever need this but include for completeness
>>>
>>> There are at least two compilers we support that don't define __GNUC__, so
>>> it would be used.
>>> And in any case, isn't all this duplicating ff_clz, which is available in
>>> libavutil/intmath.h?
>> 
>> Are you sure of that?  I can find ff_ctz but no ff_clz...
>> I would happily be wrong.
>
>I assume you're writing this patch for the ffmpeg 2.8 branch or older, which 
>you shouldn't.
>Always use the master branch. You'll find ff_clz there.

Yes/no - the code I wrote had to work against 2.8 as that is what
Raspberry Pi are using at the moment.  This patch is meant to be against
master so I can/will happily remove that code. (And I had the wrong
version checked out when commenting previously)

By the way - can you tell me what the behaviour of ff_clz is when ints
are 64 bits long or is that never the case?  Does it count up to 63 (I
am aware that the behaviour applied against 0 may be undefined) or does
it just work on the low 32 bits?  (I assume the former)
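
For reference, a sketch of a count-leading-zeros helper whose result does
not depend on how wide int is. clz32 and clz32_generic are hypothetical
names for illustration, not the ff_clz from libavutil, and returning 32
for a zero input is a choice made here (the GCC builtin leaves that case
undefined).

```c
#include <limits.h>
#include <stdint.h>

/* Portable fallback: binary search for the top set bit of a 32-bit value. */
static unsigned clz32_generic(uint32_t x)
{
    unsigned n = 0;
    if (x == 0)
        return 32;
    if (!(x & 0xFFFF0000u)) { n += 16; x <<= 16; }
    if (!(x & 0xFF000000u)) { n +=  8; x <<=  8; }
    if (!(x & 0xF0000000u)) { n +=  4; x <<=  4; }
    if (!(x & 0xC0000000u)) { n +=  2; x <<=  2; }
    if (!(x & 0x80000000u)) { n +=  1; }
    return n;
}

static unsigned clz32(uint32_t x)
{
#if defined(__GNUC__)
    /* __builtin_clz counts within unsigned int, so subtract however many
     * bits unsigned int has beyond 32 (zero on the usual platforms). */
    if (x == 0)
        return 32;
    return (unsigned)__builtin_clz(x) -
           (unsigned)(sizeof(unsigned) * CHAR_BIT - 32);
#else
    return clz32_generic(x);
#endif
}
```

Both paths give the same answer for any 32-bit input, so the fallback is
what compilers that do not define __GNUC__ would end up using.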

Thanks

JC



Re: [FFmpeg-devel] [PATCH]levc/hevc_cabac Optimise ff_hevc_hls_residual_coding (especially ARM)

2016-01-19 Thread John Cox

>On 1/19/2016 2:24 PM, John Cox wrote:
>> On Tue, 19 Jan 2016 14:09:22 -0300, you wrote:
>> 
>>> On 1/19/2016 2:05 PM, John Cox wrote:
>>>>> On 1/19/2016 9:46 AM, John Cox wrote:
>>>>>> +// Helper fns
>>>>>> +#ifndef hevc_mem_bits32
>>>>>> +static av_always_inline uint32_t hevc_mem_bits32(const void * buf, 
>>>>>> const unsigned int offset)
>>>>>> +{
>>>>>> +return AV_RB32((const uint8_t *)buf + (offset >> 3)) << (offset & 
>>>>>> 7);
>>>>>> +}
>>>>>> +#endif
>>>>>> +
>>>>>> +#if AV_GCC_VERSION_AT_LEAST(3,4) && !defined(hevc_clz32)
>>>>>> +#define hevc_clz32 hevc_clz32_builtin
>>>>>> +static av_always_inline unsigned int hevc_clz32_builtin(const uint32_t 
>>>>>> x)
>>>>>> +{
>>>>>> +// __builtin_clz says it works on ints - so adjust if int is >32 
>>>>>> bits long
>>>>>> +return __builtin_clz(x) - (sizeof(int) * 8 - 32);
>>>>>
>>>>> Why aren't you simply using ff_clz?
>>>>
>>>> Because it doesn't exist? or at least I can't find it.
>>>>
>>>>>> +}
>>>>>> +#endif
>>>>>> +
>>>>>> +// It is unlikely that we will ever need this but include for 
>>>>>> completeness
>>>>>
>>>>> There are at least two compilers we support that don't define __GNUC__, so
>>>>> it would be used.
>>>>> And in any case, isn't all this duplicating ff_clz, which is available in
>>>>> libavutil/intmath.h?
>>>>
>>>> Are you sure of that?  I can find ff_ctz but no ff_clz...
>>>> I would happily be wrong.
>>>
>>> I assume you're writing this patch for the ffmpeg 2.8 branch or older, 
>>> which you shouldn't.
>>> Always use the master branch. You'll find ff_clz there.
>> 
>> Yes/no - the code I wrote had to work against 2.8 as that is what
>> Raspberry Pi are using at the moment.  This patch is meant to be against
>> master so I can/will happily remove that code. (And I had the wrong
>> version checked out when commenting previously)
>> 
>> By the way - can you tell me what the behaviour of ff_clz is when ints
>> are 64 bits long or is that never the case?  Does it count up to 63 (I
>> am aware that the behaviour applied against 0 may be undefined) or does
>> it just work on the low 32 bits?  (I assume the former)
>
>The generic version checks sizeof(unsigned), so the former.
>The GNU specific version using the builtin is meant to work with an unsigned
>int and not a fixed width data type, so it's probably safe to assume it will.

In that case then it would appear that the definition of ff_log2 is
wrong as that seems to assume a max 31:

#if HAVE_FAST_CLZ
#if AV_GCC_VERSION_AT_LEAST(3,4)
#ifndef ff_log2
#   define ff_log2(x) (31 - __builtin_clz((x)|1))
#   ifndef ff_log2_16bit
#  define ff_log2_16bit av_log2
#   endif
#endif /* ff_log2 */
#endif /* AV_GCC_VERSION_AT_LEAST(3,4) */
#endif
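
To illustrate the concern: the builtin counts leading zeros within an
unsigned int, so a hard-coded 31 is only right when unsigned int is 32
bits wide. A sketch of a width-independent form follows; log2_uint is a
hypothetical name, not a proposed replacement for ff_log2.

```c
#include <limits.h>

/* Integer log2 with the constant derived from the width of unsigned int
 * rather than fixed at 31. (x | 1) sidesteps the undefined clz(0) case,
 * so log2_uint(0) yields 0, the same trick the macro above uses. */
static int log2_uint(unsigned x)
{
#if defined(__GNUC__)
    return (int)(sizeof(unsigned) * CHAR_BIT - 1) - __builtin_clz(x | 1);
#else
    /* Plain shift loop for compilers without the builtin. */
    int n = 0;
    while (x >>= 1)
        n++;
    return n;
#endif
}
```

With a 32-bit unsigned int the expression reduces to exactly
31 - __builtin_clz(x | 1), so nothing changes on current targets.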

Regards

JC

Re: [FFmpeg-devel] [PATCH]levc/hevc_cabac Optimise ff_hevc_hls_residual_coding (especially ARM)

2016-01-20 Thread John Cox

On Wed, 20 Jan 2016 13:26:05 +0100, you wrote:

>Hi,
>
>2016-01-19 13:46 GMT+01:00 John Cox :
>> I've just done a fair bit of work on hevc_cabac decode for the Raspberry
>> Pi2 and I think that the patch is generally applicable.  Patch is
>> attached but you may prefer to take it from git:
>
>This work is certainly impressive, and most people would have come
>only with some of the "tricks" you used.
>Although it already represents quite a bit of work, I echo others'
>suggestions to have more incremental changes.
>
>> I have not yet run fate over it as I haven't yet finished downloading
>> the samples (the internet connection here isn't wildly fast), but I have
>> run it against the H265.1 conformance streams on both x86 and ARM and it
>> causes no regressions.
>
>Your patch fails on the later fate tests linked to range extensions
>(RExt sequences) on Win64. I didn't investigate why. Random thoughts:
>transform_skip, cross-channel residual, some bypass-coded elements (eg
>SAO).

Yeah - that does fail (and I'm not sure why either at the moment) - I
only tested against the published H.265.1 conformance suite and that
doesn't contain the RExt tests.

Do you believe that master ffmpeg produces the right answer for these
tests?  I didn't spot any RExt logic in the scale code when I rewrote it
(it does affect  how numbers are processed there) and it warns that it
isn't supported when ffmpeg runs.  Having said that I would still have
expected my code to produce the same result as the old code so I'll look
into it.

>> 3) Uses clz which doesn't seem to exist in the ffmpeg int libs (though
>> ctz does)
>
>That could be a patch in and by itself.

Apparently ff_clz is now on master - but wasn't in 2.8 (which is what
RPi need)

>So, referring to your changes, it would be nice to have the following
>changes split in their own patches:
>1) significant coeff flag decoding, which probably is the largest gain
>(and therefore would be even nicer if further sliced):
>  a) for instance, you avoid an indirection by flattening/merging
>context tables;
>  b) other parts, which I fear may not translate that well for other
>platforms (at least without matching x86 code), or sequences
>2) you use native sized integers in some places (not sure if that can
>cause issues);
>3) bypass-coded stuff is a fairly large change (both in terms of code,
>review and impacting the cabac struct also used by h264); it would be
>nice knowing how much you gain here
>4) the replacing of !!something by something when the flag is already 0/1
>5) coefficient saturation

I don't have formal numbers for everything but from the profiling I did
in development:

The by22 code gained me an overall factor of two in the abs level decode
- the gains do depend a lot on the quantity of residual - you gain a lot
more on I-frames than you do otherwise as they tend to have much longer
residuals.  The higher the bitrate the more useful this code is.  But as
you note it didn't use vast amounts of time relative to everything else
anyway.

The reworking / simplification of the loop(s) around the abs level
decode and the scaling gave me the biggest single improvement.

After that the reworking of get_sig_ceoff_flag_idxs was a useful gain

Special-casing the single coeff path gave a similar gain

After that the scale rework - now probably 75% faster than it was
previously but it wasn't taking a huge amount of time.

And after that all the other bits - my experience with optimising this
sort of code (I did a lot of work on a TI H.264 implementation in the
past) is that no single change is going to do everything, you just have
to polish everything until it goes fast enough.

>3) is indeed the largest chunk. I don't know what your profiling
>indicated, but the original code didn't seem that high-profile. But I
>haven't split it to see what it actually provided, but overall numbers
>look good:
>
>I quickly hacked (quickly being the keyword as it also means poor and
>potentially resulting in faulty conclusion) something that is close to
>2) + 4) for reference.
>Benching REF+1)a) vs REF+1), it did seem slower on Win64/Haswell for
>significant flag decoding by a few cycles (around 1% of the codeblock)
>Benching REF+1)a) vs your patch, I see around 3% improvement with
>something that is fairly more optimized overall than ffmpeg's master,
>ie ff_hevc_hls_residual_coding is a lot more prevalent, which is
>probably also the case in your rpi2 benchmarks.

Sorry - I don't quite understand what you've said here.

>Note: I don't think I'll review next iterations of the patch(set) with
>any shape of diligence, but some of the above parts (1.a, 4 and 5) are
>ok if not the cause of the fate issues.
>
>Best regards,

Thanks

JC

Re: [FFmpeg-devel] [PATCH]levc/hevc_cabac Optimise ff_hevc_hls_residual_coding (especially ARM)

2016-01-20 Thread John Cox

On Wed, 20 Jan 2016 13:26:05 +0100, you wrote:

>Hi,
>
>2016-01-19 13:46 GMT+01:00 John Cox :
>> I've just done a fair bit of work on hevc_cabac decode for the Raspberry
>> Pi2 and I think that the patch is generally applicable.  Patch is
>> attached but you may prefer to take it from git:
>
>This work is certainly impressive, and most people would have come
>only with some of the "tricks" you used.
>Although it already represents quite a bit of work, I echo others'
>suggestions to have more incremental changes.
>
>> I have not yet run fate over it as I haven't yet finished downloading
>> the samples (the internet connection here isn't wildly fast), but I have
>> run it against the H265.1 conformance streams on both x86 and ARM and it
>> causes no regressions.
>
>Your patch fails on the later fate tests linked to range extensions
>(RExt sequences) on Win64. I didn't investigate why. Random thoughts:
>transform_skip, cross-channel residual, some bypass-coded elements (eg
>SAO).

Thanks for that - bug in my persistent rice processing.  Apparently
untested by the main conformance suite.  Code now passes fate (x86
anyway).

[snip]

Regards

JC

[FFmpeg-devel] [PATCH]levc/hevc_cabac Optimise ff_hevc_hls_residual_coding (v2)

2016-01-21 Thread John Cox

Hi

v2 of my hevc residual patch

I've fixed the fate regression
I've split it into more pieces
Now uses ff_clz
Some reformatting of function headers

The patches can also be found on
https://github.com/jc-kynesim/rpi-ffmpeg.git on branch
test/ff_hevc_cabac_4 from tag ff_hevc_cabac_4_base

Note that I will be going on holiday from the end of Friday (UK time)
till the 1st Feb and will be unable to edit code or read this list
during that period.

Regards

JC


0001-cabac-Ensure-2-byte-cabac-loads-are-on-2-byte-boundr.patch
Description: Binary data


0002-cabac_functions-Cound-zeros-with-ctz-if-it-is-fast.patch
Description: Binary data


0003-cabac_functions-Allow-more-functions-to-be-overridde.patch
Description: Binary data


0004-hevc_cabac-Optimize-ff_hevc_hls_residual_coding.patch
Description: Binary data


0005-hevc_cabac-Add-bulk-bypass-decoding.patch
Description: Binary data


0006-hevc_cabac-Add-ARM-asm-functions.patch
Description: Binary data

Re: [FFmpeg-devel] [PATCH]levc/hevc_cabac Optimise ff_hevc_hls_residual_coding (v2)

2016-01-22 Thread John Cox

>On Fri, Jan 22, 2016 at 01:41:11AM +0100, Michael Niedermayer wrote:
>> On Thu, Jan 21, 2016 at 10:45:55AM +0000, John Cox wrote:
>> > Hi
>> > 
>> > v2 of my hevc residual patch
>> > 
>> > I've fixed the fate regression
>> > I've split it into more pieces
>> > Now uses ff_clz
>> > Some reformatting of function headers
>> > 
>> > The patches can also be found on
>> > https://github.com/jc-kynesim/rpi-ffmpeg.git on branch
>> > test/ff_hevc_cabac_4 from tag ff_hevc_cabac_4_base
>> > 
>> > Note that I will be going on holiday from the end of Friday (UK time)
>> > till the 1st Feb and will be unable to edit code or read this list
>> > during that period.
>> 
>> seems failing here (with qemu)
>>  --cc='ccache arm-linux-gnueabi-gcc-4.5' --extra-cflags='-mfpu=neon 
>> -mfloat-abi=softfp' --cpu=cortex-a8 --arch=armv7 --target-os=linux 
>> --enable-cross-compile --disable-iconv --disable-pthreads 
>> --enable-neon-clobber-test
>> tried without --enable-neon-clobber-test too
>> 
>> qemu-arm version 1.1.0, Copyright (c) 2003-2008
>> also tried qemu-arm version 1.6.50
>> 
>> arm-linux-gnueabi-gcc-4.5 (Ubuntu/Linaro 4.5.3-12ubuntu2) 4.5.3
>> 
>> also tried your branch
>
>fate-hevc passes with patch 1-5, so the issue is likely in the last
>
>[...]

Thanks - I'll fix it

JC

Re: [FFmpeg-devel] [PATCH]levc/hevc_cabac Optimise ff_hevc_hls_residual_coding (v2)

2016-01-22 Thread John Cox

On Fri, 22 Jan 2016 01:57:58 +0100, you wrote:

>On Fri, Jan 22, 2016 at 01:41:11AM +0100, Michael Niedermayer wrote:
>> On Thu, Jan 21, 2016 at 10:45:55AM +0000, John Cox wrote:
>> > Hi
>> > 
>> > v2 of my hevc residual patch
>> > 
>> > I've fixed the fate regression
>> > I've split it into more pieces
>> > Now uses ff_clz
>> > Some reformatting of function headers
>> > 
>> > The patches can also be found on
>> > https://github.com/jc-kynesim/rpi-ffmpeg.git on branch
>> > test/ff_hevc_cabac_4 from tag ff_hevc_cabac_4_base
>> > 
>> > Note that I will be going on holiday from the end of Friday (UK time)
>> > till the 1st Feb and will be unable to edit code or read this list
>> > during that period.
>> 
>> seems failing here (with qemu)
>>  --cc='ccache arm-linux-gnueabi-gcc-4.5' --extra-cflags='-mfpu=neon 
>> -mfloat-abi=softfp' --cpu=cortex-a8 --arch=armv7 --target-os=linux 
>> --enable-cross-compile --disable-iconv --disable-pthreads 
>> --enable-neon-clobber-test
>> tried without --enable-neon-clobber-test too
>> 
>> qemu-arm version 1.1.0, Copyright (c) 2003-2008
>> also tried qemu-arm version 1.6.50
>> 
>> arm-linux-gnueabi-gcc-4.5 (Ubuntu/Linaro 4.5.3-12ubuntu2) 4.5.3
>> 
>> also tried your branch
>
>fate-hevc passes with patch 1-5, so the issue is likely in the last
>
>[...]

Yup - bug in the arm update_rice (again - sorry).  Now passes fate on
ARM too (now I've learnt how to run fate on my Pi in a finite time).

New version of patch 6 attached - all others should still be good

Regards

JC


0006-hevc_cabac-Add-ARM-asm-functions-v2.patch
Description: Binary data

Re: [FFmpeg-devel] [PATCH]levc/hevc_cabac Optimise ff_hevc_hls_residual_coding (especially ARM)

2016-01-22 Thread John Cox

On Fri, 22 Jan 2016 12:18:29 +0100, you wrote:

>Hi,
>
>2016-01-20 15:27 GMT+01:00 John Cox :
>> The by22 code gained me an overall factor of two in the abs level decode
>> - the gains do depend a lot on the quantity of residual - you gain a lot
>> more on I-frames than you do otherwise as they tend to have much longer
>> residuals.  The higher the bitrate the more useful this code is.  But as
>> you note it didn't use vast amounts of time relative to everything else
>> anyway.
>>
>> The reworking / simplification of the loop(s) around the abs level
>> decode and the scaling gave me the biggest single improvement.
>
>The thing is, it provided no gain on no Win64 system I had at hand. Or
>very minor, once I switched off things. The amount of new/changed code
>would make it worth discussing, were it not for actual gains on arm.

I think on ARM that things fitted with its register limit more often -
either way it was useful.  Much of the simplification work was structural
so it was possible for me to extract simple functions to code in asm.

>> After that the reworking of get_sig_ceoff_flag_idxs was a useful gain
>
>Yes, this is the most agreeable part of the non-applied parts.
>
>> Special-casing the single coeff path gave a similar gain
>
>This is a big slowdown on Win64 and UHD-bluray like sequences, but
>that can be switched off in that case.

I'm a bit surprised that it generated a big slowdown - some cache must
be running just on the edge, but yes if you normally have hi-bitrate
stuff then it isn't wanted.  On my test streams the bitrates were
normally quite low - quite unlike what I would expect from blu-ray
sequences.

Default it to off on x86 but on on ARM?

>> After that the scale rework - now probably 75% faster than it was
>> previously but it wasn't taking a huge amount of time.
>
>The work is done, I don't mind.
>
>> And after that all the other bits - my experience with optimising this
>> sort of code (I did a lot of work on a TI H.264 implementation in the
>> past) is that no single change is going to do everything, you just have
>> to polish everything until it goes fast enough.
>
>Sure. There may be positive interactions, but my own figures showed
>the sigmap/greater than flags were the only ones worth optimizing on
>Win64.

Very plausibly

>> Sorry - I don't quite understand what you've said here.
>
>Doesn't matter anymore, I think I have just laid out the parts
>actually mattering, and for haswell/Win64 (ie x86_64).

I think you've cleared up my misunderstanding in the expanded comments
above.

>I'll reply more in depth to the new patchset, but not until you're on
>holidays. Which should leave me more time for reviewing it, so all the
>better.

Good oh.

JC

Re: [FFmpeg-devel] [PATCH]levc/hevc_cabac Optimise ff_hevc_hls_residual_coding (v2)

2016-01-22 Thread John Cox

On Fri, 22 Jan 2016 14:42:27 +0100, you wrote:

> [snip]
>> >fate-hevc passes with patch 1-5, so the issue is likely in the last
>> >
>> >[...]
>> 
>> Yup - bug in the arm update_rice (again - sorry).  Now passes fate on
>> ARM too (now I've learnt how to run fate on my Pi in a finite time).
>> 
>> New version of patch 6 attached - all others should still be good
>
>fate passes on qemu now

Hurrah! Many thanks. Sorry about the false starts.

>also you may want to add yourself to the MAINTAINERs file (in a patch)
>for the parts you added

I'll happily add myself once I have some substantial code on master to
maintain.

Regards

JC

Re: [FFmpeg-devel] [PATCH]levc/hevc_cabac Optimise ff_hevc_hls_residual_coding (especially ARM)

2016-01-22 Thread John Cox

Hi

>Hi,
>
>2016-01-22 14:29 GMT+01:00 John Cox :
>>>This is a big slowdown on Win64 and UHD-bluray like sequences, but
>>>that can be switched off in that case.
>>
>> I'm a bit surprised that it generated a big slowdown - some cache must
>> be running just on the edge, but yes if you normally have hi-bitrate
>> stuff then it isn't wanted.  On my test streams the bitrates were
>> normally quite low - quite unlike what I would expect from blu-ray
>> sequences.
>
>Initial (4 sequences):
>6553 decicycles in g, 8387110 runs,   1498 skips
>5916 decicycles in g,33546118 runs,   8314 skips
>5028 decicycles in g,67101499 runs,   7365 skips
>4729 decicycles in g,33548420 runs,   6012 skips
>
>Deactivating USE_N_END_1:
>4746 decicycles in g,16774296 runs,   2920 skips
>5373 decicycles in g,33545629 runs,   8803 skips
>4141 decicycles in g,67098928 runs,   9936 skips
>3869 decicycles in g,33544593 runs,   9839 skips
>
>But I see the first one surprisingly having half the iterations (but
>this has almost converged at this point).
>So 10-20%.

Coo - that is big.
How are you profiling that and with what streams?
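
For scale, the quoted figures can be turned into percentages with a
one-line helper. The arrays below just restate the decicycle counts from
the two runs above; pct_reduction is a hypothetical name. The first
sequence is not a like-for-like comparison since its run counts differ
between the two measurements.

```c
/* Decicycle counts quoted above, one entry per test sequence. */
static const double initial_dc[4]  = { 6553, 5916, 5028, 4729 };
static const double disabled_dc[4] = { 4746, 5373, 4141, 3869 };

/* Relative reduction, in percent, going from `before` to `after`. */
static double pct_reduction(double before, double after)
{
    return 100.0 * (before - after) / before;
}
```

Applied to the last three sequences this gives roughly 9%, 18% and 18%,
which is where the 10-20% estimate comes from.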

>I think it has more to do with cache pressure, both code, which
>increases from 8 to 9.5KB, and data, with already "large" tables in a
>loop that may need to be tight.

I agree (and it is what I was trying to suggest in my previous
comment).  It also suggests that on x86 you might benefit from
non-inlined cabac_gets to keep the code size small.

>> Default it to off on x86 but on on ARM?
>
>Yes, I think so.
Is ARCH_X86/ARM an appropriate switch for this?
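
One possible shape for such a switch, as a sketch only: ARCH_ARM and
ARCH_X86 are the 0-or-1 macros from the generated config.h and are
defaulted here purely so the fragment compiles on its own. decode_path
is a hypothetical stand-in for the real residual loop, and treating
USE_N_END_1 as the single-coefficient special case is inferred from the
discussion above.

```c
/* Normally supplied by config.h; defaulted here for a standalone build. */
#ifndef ARCH_ARM
#define ARCH_ARM 0
#endif
#ifndef ARCH_X86
#define ARCH_X86 0
#endif

/* Single-coefficient special case: on for ARM, off elsewhere, and still
 * overridable from the build (e.g. -DUSE_N_END_1=1). */
#ifndef USE_N_END_1
#define USE_N_END_1 ARCH_ARM
#endif

/* Returns 1 if the special-cased path would be taken, 0 otherwise. */
static int decode_path(int n_end)
{
#if USE_N_END_1
    if (n_end == 1)
        return 1;   /* single-coefficient fast path */
#endif
    (void)n_end;
    return 0;       /* general path */
}
```

Keying on ARCH_ARM alone rather than on !ARCH_X86 leaves other
architectures on the general path until someone has measured them.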

Regards

JC

Re: [FFmpeg-devel] [PATCH]levc/hevc_cabac Optimise ff_hevc_hls_residual_coding (v2)

2016-01-22 Thread John Cox

On Fri, 22 Jan 2016 18:52:23 +0100, you wrote:

>Hi,
>
>2016-01-21 11:45 GMT+01:00 John Cox :
>> Hi
>>
>> v2 of my hevc residual patch
>
>I'll review the bit not related to significant coeffs first, because I
>think it is the most performance-sensitive. Also there are bits that
>could be moved to other patches, at least some are related to the
>later bypass patch. Here's a list you'll see detailed below:
>- coefficient saturation, which I think is OK to commit
>- bypass-related stuff
>- boolean stuff (!!stuff), which I think is OK to commit
>- cosmetics (like renaming a variable or introducing a shorthand)
>- sig(nificant coefficients )map
>
>The fact is I've benchmarked parts of the code and seeing slowdowns as
>well as speedups on x86_64, hence why it would be nice to be able to
>test and evaluate each of those parts separately.

Fair enough - though given that your slowdowns are almost certainly
cache-related the whole may be quite different from the sum of the
parts.

>> +// Helper fns
>> +#ifndef hevc_mem_bits32
>> +static av_always_inline uint32_t hevc_mem_bits32(const void * buf,
>> const unsigned int offset)
>> +{
>> +return AV_RB32((const uint8_t *)buf + (offset >> 3)) << (offset & 7);
>> +}
>> +#endif
>> +
>> +#if !defined(hevc_clz32)
>> +#define hevc_clz32 hevc_clz32_builtin
>> +static av_always_inline unsigned int hevc_clz32_builtin(const uint32_t x)
>> +{
>> +// ff_clz says works on ints (probably) - so adjust if int is >32 bits 
>> long
>> +// the fact that x is passed in as uint32_t will have cleared the top 
>> bits
>> +return ff_clz(x) - (sizeof(int) * 8 - 32);
>> +}
>> +#endif
>> +
>> +#define bypass_start(s)
>> +#define bypass_finish(s)
>
>bypass-related?
>
>>  void ff_hevc_save_states(HEVCContext *s, int ctb_addr_ts)
>>  {
>>  if (s->ps.pps->entropy_coding_sync_enabled_flag &&
>> @@ -863,19 +928,19 @@ int ff_hevc_cbf_luma_decode(HEVCContext *s, int
>> trafo_depth)
>>  return GET_CABAC(elem_offset[CBF_LUMA] + !trafo_depth);
>>  }
>>
>> -static int hevc_transform_skip_flag_decode(HEVCContext *s, int c_idx)
>> +static int hevc_transform_skip_flag_decode(HEVCContext *s, int c_idx_nz)
>>  {
>> -return GET_CABAC(elem_offset[TRANSFORM_SKIP_FLAG] + !!c_idx);
>> +return GET_CABAC(elem_offset[TRANSFORM_SKIP_FLAG] + c_idx_nz);
>>  }
>>
>> -static int explicit_rdpcm_flag_decode(HEVCContext *s, int c_idx)
>> +static int explicit_rdpcm_flag_decode(HEVCContext *s, int c_idx_nz)
>>  {
>> -return GET_CABAC(elem_offset[EXPLICIT_RDPCM_FLAG] + !!c_idx);
>> +return GET_CABAC(elem_offset[EXPLICIT_RDPCM_FLAG] + c_idx_nz);
>>  }
>>
>> -static int explicit_rdpcm_dir_flag_decode(HEVCContext *s, int c_idx)
>> +static int explicit_rdpcm_dir_flag_decode(HEVCContext *s, int c_idx_nz)
>>  {
>> -return GET_CABAC(elem_offset[EXPLICIT_RDPCM_DIR_FLAG] + !!c_idx);
>> +return GET_CABAC(elem_offset[EXPLICIT_RDPCM_DIR_FLAG] + c_idx_nz);
>>  }
>
>Boolean stuff. Ideally, the whole boolean stuff topic would be better
>as a separate patch, with which I would be OK.
>
>>  int ff_hevc_log2_res_scale_abs(HEVCContext *s, int idx) {
>> @@ -891,14 +956,14 @@ int ff_hevc_res_scale_sign_flag(HEVCContext *s, int 
>> idx) {
>>  return GET_CABAC(elem_offset[RES_SCALE_SIGN_FLAG] + idx);
>>  }
>>
>> -static av_always_inline void
>> last_significant_coeff_xy_prefix_decode(HEVCContext *s, int c_idx,
>> +static av_always_inline void
>> last_significant_coeff_xy_prefix_decode(HEVCContext *s, int c_idx_nz,
>> int log2_size, int
>> *last_scx_prefix, int *last_scy_prefix)
>>  {
>>  int i = 0;
>>  int max = (log2_size << 1) - 1;
>>  int ctx_offset, ctx_shift;
>>
>> -if (!c_idx) {
>> +if (!c_idx_nz) {
>>  ctx_offset = 3 * (log2_size - 2)  + ((log2_size - 1) >> 2);
>>  ctx_shift = (log2_size + 1) >> 2;
>>  } else {
>> @@ -929,22 +994,16 @@ static av_always_inline int
>> last_significant_coeff_suffix_decode(HEVCContext *s,
>>  return value;
>>  }
>>
>> -static av_always_inline int
>> significant_coeff_group_flag_decode(HEVCContext *s, int c_idx, int
>> ctx_cg)
>> +static av_always_inline int
>> significant_coeff_group_flag_decode(HEVCContext *s, int c_idx_nz, int
>> ctx_cg)
>
>cosmetics?

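I renamed the variable, because c_idx can have values 0..2 and c_idx_nz
is a boolean with 0..1 and in the rewrite of the inc var it is important
that we are using the _nz variant so having the var named appropriately
seemed sensible.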
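
A small sketch of the distinction, with made-up names: the base offset
value and the three helpers below are hypothetical and only mirror the
shape of the hunks quoted above. The caller converts the 0..2 component
index to a 0/1 flag once, and the helpers then add it directly instead
of each applying !! themselves.

```c
/* Hypothetical context-table base, for illustration only. */
enum { TRANSFORM_SKIP_FLAG_BASE = 40 };

/* Before: every helper normalises the 0..2 component index itself. */
static int ctx_index_old(int c_idx)
{
    return TRANSFORM_SKIP_FLAG_BASE + !!c_idx;
}

/* After: the helper trusts its argument to be 0 or 1 already. */
static int ctx_index_new(int c_idx_nz)
{
    return TRANSFORM_SKIP_FLAG_BASE + c_idx_nz;
}

/* Done once by the caller: 0 for luma, 1 for either chroma plane. */
static int c_idx_to_nz(int c_idx)
{
    return c_idx != 0;
}
```

Passing a raw c_idx of 2 to ctx_index_new would select the wrong
context, which is why the parameter name carries the _nz suffix.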

[FFmpeg-devel] Allocating a single YUV buffer rather than 3?

2016-02-01 Thread John Cox

Hi

In order to get a copy-free display on my target h/w I need to have my
decoder's output YUV planes contiguous.  The default allocator allocates
each plane separately (so they aren't contiguous, or at least not always).
Is there a simple preferred way of getting this to work?  I've got slightly
lost in the maze of twisty little frame/buffer allocation functions and a
pointer to the right place would be extremely helpful.

If methods vary by decoder/format then I'm only really interested in
H.265 8-bit 4:2:0 at the moment.
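
To make the requirement concrete, here is a minimal sketch of the layout
arithmetic for one contiguous 8-bit 4:2:0 buffer.  It is only an
illustration: the Yuv420Layout type, the function name and the align
parameter are hypothetical and not part of FFmpeg, and it does not answer
which allocation hook should use it (a custom get_buffer2 callback on the
codec context is one place such a layout might be applied, but that is an
assumption).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layout descriptor; not an FFmpeg type. */
typedef struct Yuv420Layout {
    size_t stride_y, stride_c;   /* bytes per row for luma / chroma */
    size_t off_y, off_u, off_v;  /* byte offsets from the start of the buffer */
    size_t total;                /* bytes needed for the whole frame */
} Yuv420Layout;

/* Round v up to the next multiple of align. */
static size_t round_up(size_t v, size_t align)
{
    return (v + align - 1) / align * align;
}

/* Place Y, then U, then V back to back in a single allocation. */
static Yuv420Layout yuv420_contiguous_layout(int width, int height, size_t align)
{
    Yuv420Layout l;
    size_t cw, ch;

    assert(width > 0 && height > 0 && align > 0);

    cw = ((size_t)width  + 1) / 2;   /* chroma width, rounded up  */
    ch = ((size_t)height + 1) / 2;   /* chroma height, rounded up */

    l.stride_y = round_up((size_t)width, align);
    l.stride_c = round_up(cw, align);
    l.off_y    = 0;
    l.off_u    = l.stride_y * (size_t)height;
    l.off_v    = l.off_u + l.stride_c * ch;
    l.total    = l.off_v + l.stride_c * ch;
    return l;
}
```

With such a layout the three plane pointers would simply be base + off_y,
base + off_u and base + off_v within one allocation of l.total bytes.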

Many thanks

JC

Re: [FFmpeg-devel] [PATCH]levc/hevc_cabac Optimise ff_hevc_hls_residual_coding (v2)

2016-02-02 Thread John Cox

Hi

On Tue, 2 Feb 2016 12:52:15 +0100, you wrote:

>Hi,
>
>as a modus operandi for this review, I have no time for a proper one,
>or at least not fitting with John's timeframe. I'll try to close as
>many pending discussions, and would prefer if someone else completed
>the review/validation/commit.
Thanks

>2016-01-22 19:33 GMT+01:00 John Cox :
>> Fair enough - though given that your slowdowns are almost certainly
>> cache-related the whole may be quite different from the sum of the
>> parts.
>
>True, they don't always translate to anything noticeable, but that's
>the best tool we have to objectively decide.
Yes, but it isn't always a good one. I have spent substantial time in
the past optimising TI DSP based codecs and it was not uncommon that
some patches would make life slightly slower until enough of them were
applied and then the whole thing suddenly gained a jump in speed.

Either way I'm not averse to splitting stuff up and, at least on ARM,
none of the patches caused a slowdown.
 
>>>cosmetics?
>>
>> I renamed the variable, because c_idx can have values 0..2 and c_idx_nz
>> is a boolean with 0..1 and in the rewrite of the inc var it is important
>> that we are using the _nz variant so having the var named appropriately
>> seemed sensible.
>
>I don't really mind mixing some form of cosmetics (=supposedly without
>code generation consequences) although other people prefer splitting
>for easier review and regression testing.
>
>This is not a blocking item for me, just that finding the most
>appropriate commit would be nice.

My point was that I changed the inputs to that fn and so I changed the
variable's name to make that clearer - it should be part of the c_idx_nz
patch.
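
As a small sketch of the distinction (the CTX_BASE constant and both
function names are hypothetical stand-ins, not the real hevc_cabac code):
c_idx is the component index 0..2, while c_idx_nz is the boolean "c_idx is
non-zero", computed once by the caller instead of with !! in every helper.

```c
#include <assert.h>

/* Hypothetical stand-in for elem_offset[SOME_FLAG]. */
enum { CTX_BASE = 40 };

/* Old shape: the helper takes c_idx (0..2) and normalises it itself. */
static int ctx_index_from_c_idx(int c_idx)
{
    assert(c_idx >= 0 && c_idx <= 2);
    return CTX_BASE + !!c_idx;
}

/* New shape: the caller passes the already-normalised flag (0..1). */
static int ctx_index_from_c_idx_nz(int c_idx_nz)
{
    assert(c_idx_nz == 0 || c_idx_nz == 1);
    return CTX_BASE + c_idx_nz;
}
```

Both give the same context index as long as the caller passes
c_idx_nz = (c_idx != 0); the name just documents which of the two value
ranges the parameter is expected to carry.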

>>>I suppose branch prediction could help here, but less likely than for
>>>get_cabac_sig_coeff_flag_idxs.
>>>
>>>Also for this and some others: why inline over av_always_inline?
>> No particularly good reason for this one - though for any fn that might
>> be called from multiple places there is a strong argument for just
>> "inline" as it allows the compiler to make a judgment call based on how
>> big L1 cache is and how bad the call penalty.
>
>Anyway, those kinds of micro-optimizations I'm suggesting need more
>testing (sequences, platforms), so let's roll with this.
>
>>>AV_WN64 of 0x0101010101010101ULL, or even a memset, as it would be
>>>inlined by the compiler, given its size, and done the best possible
>>>way.
>>
>> levels is int *, not char *
>
>Ok, sorry, then 0x00010001ULL. But you can ignore this, it'll
>probably make no difference outside of a micro-benchmark.

My experience with compilers is that this is the sort of thing that they
can and will do off their own bat. (Certainly MS C has been unrolling
this sort of memset loop for the past two decades and I'd be stunned if
gcc doesn't too).
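
For illustration only, a sketch of the two ways of writing such a fill,
assuming the elements are 32-bit ints (the thread only says the array is
int *).  The function names are hypothetical.  Under that assumption a
single 64-bit store covering two elements needs the constant
0x0000000100000001, and the plain loop produces exactly the same bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Plain loop: the form a compiler is free to unroll or vectorise. */
static void fill_ones_loop(int32_t *dst, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = 1;
}

/* Hand-packed form: one 64-bit store per pair of 32-bit elements.
 * Both halves of the constant are equal, so byte order does not matter. */
static void fill_ones_packed(int32_t *dst, size_t n)
{
    const uint64_t two_ones = UINT64_C(0x0000000100000001);

    assert(n % 2 == 0);
    for (size_t i = 0; i < n; i += 2)
        memcpy(dst + i, &two_ones, sizeof(two_ones));
}
```

Whether either form is actually faster is not something this sketch can
show; it only demonstrates that the two are interchangeable.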

>>>Saturation, could be a separate patch, with which I would be ok.
>
>btw and iirc, a comment indicated assumptions on what is a "legit"
>(instead of conforming ) bitstream/coeffs, making a conscious
>decision.
>
>I know Ronald, ffvp9's author, specifically decided to handle
>equivalent cases in bitstreams (hint) from Argon Designs. I have no
>opinion, but others might.
>
>>>Related to but not strictly bypass ?
>>
>> Not bypass per se, more the general optimisation of abs_level_remaining
>> - it is pulled out because I had a slightly better arm asm version of
>> the fn.  So it could go in that patch, but this allows other asm to
>> override it if they so desire.
>
>What I meant: would better be there than in another commit.
>
>>>Doing:
>>>if (get_cabac(c, state0 + ctx_map[n]))
>>>*p++ = n;
>>>while (--n != 0) {
>>>if (get_cabac(c, state0 + ctx_map[n]))
>>>*p++ = n;
>>>}
>>>is most likely faster, probably also on arm, if the branch prediction is 
>>>good.
>>
>> Not convinced.  That will increase code size (as get_cabac will inline)
>> which can be pure poison as you have found out with USE_N_END_1.
>
>That's 300B, not 1.5KB. And I *know* it can help, just not on all
>platforms and sequences. The same decision was made for ffh264's
>equivalent, iirc.

I'll have to take your word for it but it seems very strange to me that

fn(x);
while(cond)
  fn(x);

is faster than

do {
  fn(x);
} while (cond);

I guess that it might be a branch prediction thing, but the second form
uses no more conditions than the first and is shorter. (And the compiler
always has the option of unrolling
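
A small sketch of the two loop shapes being compared, using a toy
predicate in place of get_cabac() (toy_flag and both collect_* names are
hypothetical).  It only shows that the two forms visit the same values in
the same order when n >= 1; it says nothing about which is faster.

```c
#include <assert.h>

/* Hypothetical stand-in for get_cabac(): even positions count as set. */
static int toy_flag(int n)
{
    return (n & 1) == 0;
}

/* Peeled form: the first iteration is written out ahead of the loop. */
static int collect_peeled(int n, int *out)
{
    int *p = out;

    assert(n >= 1);
    if (toy_flag(n))
        *p++ = n;
    while (--n != 0) {
        if (toy_flag(n))
            *p++ = n;
    }
    return (int)(p - out);
}

/* do/while form: one copy of the body, same sequence of calls. */
static int collect_do_while(int n, int *out)
{
    int *p = out;

    assert(n >= 1);
    do {
        if (toy_flag(n))
            *p++ = n;
    } while (--n != 0);
    return (int)(p - out);
}
```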

Re: [FFmpeg-devel] [PATCH]levc/hevc_cabac Optimise ff_hevc_hls_residual_coding (v2)

2016-02-03 Thread John Cox

On Tue, 2 Feb 2016 12:52:15 +0100, you wrote:

>Hi,
>
>as a modus operandi for this review, I have no time for a proper one,
>or at least not fitting with John's timeframe. I'll try to close as
>many pending discussions, and would prefer if someone else completed
>the review/validation/commit.

Do we have another volunteer?

>[snip]

Many thanks

JC

[FFmpeg-devel] [PATCH] configure fix arm inline defines

2018-05-30 Thread John Cox

Hi

I believe there is a bug in the arm feature detection for inline asm in
configure and I have a patch for it.

Currently using a command line like:

./configure --enable-cross-compile --arch=arm --cpu=cortex-a7
--target-os=linux --cross-prefix=arm-linux-gnueabihf-

gives in config.h:

#define HAVE_ARMV5TE 1
#define HAVE_ARMV6 1
#define HAVE_ARMV6T2 1
#define HAVE_ARMV8 0
#define HAVE_NEON 1
#define HAVE_VFP 1
#define HAVE_VFPV3 1
#define HAVE_SETEND 1
...
#define HAVE_ARMV5TE_EXTERNAL 1
#define HAVE_ARMV6_EXTERNAL 1
#define HAVE_ARMV6T2_EXTERNAL 1
#define HAVE_ARMV8_EXTERNAL 0
#define HAVE_NEON_EXTERNAL 0
#define HAVE_VFP_EXTERNAL 1
#define HAVE_VFPV3_EXTERNAL 1
#define HAVE_SETEND_EXTERNAL 1
...
#define HAVE_ARMV5TE_INLINE 0
#define HAVE_ARMV6_INLINE 0
#define HAVE_ARMV6T2_INLINE 0
#define HAVE_ARMV8_INLINE 0
#define HAVE_NEON_INLINE 0
#define HAVE_VFP_INLINE 0
#define HAVE_VFPV3_INLINE 0
#define HAVE_SETEND_INLINE 0

With the patch below you get

...
#define HAVE_ARMV5TE 1
#define HAVE_ARMV6 1
#define HAVE_ARMV6T2 1
#define HAVE_ARMV8 0
#define HAVE_NEON 1
#define HAVE_VFP 1
#define HAVE_VFPV3 1
#define HAVE_SETEND 1
...
#define HAVE_ARMV5TE_EXTERNAL 1
#define HAVE_ARMV6_EXTERNAL 1
#define HAVE_ARMV6T2_EXTERNAL 1
#define HAVE_ARMV8_EXTERNAL 0
#define HAVE_NEON_EXTERNAL 0
#define HAVE_VFP_EXTERNAL 1
#define HAVE_VFPV3_EXTERNAL 1
#define HAVE_SETEND_EXTERNAL 1
...
#define HAVE_ARMV5TE_INLINE 1
#define HAVE_ARMV6_INLINE 1
#define HAVE_ARMV6T2_INLINE 1
#define HAVE_ARMV8_INLINE 0
#define HAVE_NEON_INLINE 0
#define HAVE_VFP_INLINE 1
#define HAVE_VFPV3_INLINE 1
#define HAVE_SETEND_INLINE 1

If I want to get Neon enabled as well then I need to have -mfpu=neon
on the command line too.  I'm not sure how to get it there unless I pass
it as extra flags.

This patch adds quotes around the asm that is in the __asm__ statement

Regards

John Cox

diff --git a/configure b/configure
index 22eeca22a5..4dbee8d349 100755
--- a/configure
+++ b/configure
@@ -1040,7 +1040,7 @@ EOF

 check_insn(){
 log check_insn "$@"
-check_inline_asm ${1}_inline "$2"
+check_inline_asm ${1}_inline "\"$2\""
 check_as ${1}_external "$2"
 }

[FFmpeg-devel] [PATCH] use av_clip_uintp2_c where clip is variable

2018-05-31 Thread John Cox

Hi

I enclose a patch that changes av_clip_uintp2 to av_clip_uintp2_c where
the bit depth is variable.  This fixes compilation issues if
HAVE_ARMV6_INLINE is 1 and therefore allows arm inline detection to be
fixed too.
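
For readers without the attachment, a sketch of what a generic clip to an
unsigned p-bit range looks like.  This is not the FFmpeg source and the
function name is hypothetical; it just illustrates why a plain C version
can accept a bit depth that is only known at run time, whereas (as noted
elsewhere in this thread) the ARM inline asm variant needs an immediate.

```c
#include <assert.h>

/* Clip a to the range 0 .. (2^p - 1); p may be a run-time value. */
static unsigned clip_uintp2_generic(int a, int p)
{
    int max;

    assert(p >= 0 && p < 31);
    max = (1 << p) - 1;

    if (a < 0)
        return 0;
    if (a > max)
        return (unsigned)max;
    return (unsigned)a;
}
```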

Regards

John Cox


variable_clip.patch
Description: Binary data

Re: [FFmpeg-devel] Patch: Replace quotes for inline asm detection.

2018-05-31 Thread John Cox

>On 5/30/2018 10:32 PM, Michael Niedermayer wrote:
>> On Wed, May 30, 2018 at 09:48:51AM -0700, Frank Liberato wrote:
>>> Please find attached a one line patch:
>>>
>>>
>>>> Commit 8c893aa3cd5 removed quotes that were required to detect
>>>> inline asm in clang:
>>>>
>>>> check_insn armv5te qadd r0, r0, r0
>>>> .../test.c:1:34: error: expected string literal in 'asm'
>>>> void foo(void){ __asm__ volatile(qadd r0, r0, r0); }
>>>>
>>>> The correct code is:
>>>>
>>>> void foo(void){ __asm__ volatile("qadd r0, r0, r0"); }
>>>
>>>
>>> Thanks
>>> Frank
>> 
>>>  configure |2 +-
>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>> 2d51797903ad2f3cab321e72bf5e7209116c3dae  
>>> 0001-Replace-quotes-for-inline-asm-detection.patch
>>> From 58c96127b6f1510b956b2280049d1c3778e3cab4 Mon Sep 17 00:00:00 2001
>>> From: "liber...@chromium.org" 
>>> Date: Tue, 29 May 2018 11:35:04 -0700
>>> Subject: [PATCH] Replace quotes for inline asm detection.
>>>
>>> Commit 8c893aa3cd5 removed quotes that were required to detect
>>> inline asm in clang:
>>>
>>> check_insn armv5te qadd r0, r0, r0
>>> .../test.c:1:34: error: expected string literal in 'asm'
>>> void foo(void){ __asm__ volatile(qadd r0, r0, r0); }
>>>
>>> The correct code is:
>>>
>>> void foo(void){ __asm__ volatile("qadd r0, r0, r0"); }
>>> ---
>>>  configure | 2 +-
>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/configure b/configure
>>> index 22eeca22a5..4dbee8d349 100755
>>> --- a/configure
>>> +++ b/configure
>>> @@ -1040,7 +1040,7 @@ EOF
>>>  
>>>  check_insn(){
>>>  log check_insn "$@"
>>> -check_inline_asm ${1}_inline "$2"
>>> +check_inline_asm ${1}_inline "\"$2\""
>>>  check_as ${1}_external "$2"
>>>  }
>> 
>> This seems to break my arm qemu build:
>
>That'd be because vf_amplify is calling av_clip_uintp2() with a non
>immediate value. The arm optimized function makes an immediate value as
>second argument a requirement, so av_clip_uintp2_c() should be used
>there instead.
>
>This means 3c56d673418/8c893aa3cd5 broke detection of arm inline asm
>features for your qemu builds as well, and this patch restores that
>functionality.
>
>> 
>> In file included from src/libavutil/intmath.h:30:0,
>>  from src/libavutil/common.h:106,
>>  from src/libavutil/avutil.h:296,
>>  from src/libavutil/imgutils.h:30,
>>  from src/libavfilter/vf_amplify.c:21:
>> src/libavutil/arm/intmath.h: In function ‘amplify_frame’:
>> src/libavutil/arm/intmath.h:77:5: warning: asm operand 2 probably doesn’t 
>> match constraints [enabled by default]
>> src/libavutil/arm/intmath.h:77:5: error: impossible constraint in ‘asm’
>> make: *** [libavfilter/vf_amplify.o] Error 1
>> make: *** Waiting for unfinished jobs
>> src/libavfilter/src_movie.c: In function ‘open_stream’:
>> src/libavfilter/src_movie.c:175:5: warning: ‘refcounted_frames’ is 
>> deprecated (declared at src/libavcodec/avcodec.h:2345) 
>> [-Wdeprecated-declarations]
>> src/libavfilter/src_movie.c: In function ‘movie_push_frame’:
>> src/libavfilter/src_movie.c:529:9: warning: ‘avcodec_decode_video2’ is 
>> deprecated (declared at src/libavcodec/avcodec.h:4756) 
>> [-Wdeprecated-declarations]
>> src/libavfilter/src_movie.c:532:9: warning: ‘avcodec_decode_audio4’ is 
>> deprecated (declared at src/libavcodec/avcodec.h:4707) 
>> [-Wdeprecated-declarations]
>> src/libavfilter/vaf_spectrumsynth.c: In function ‘try_push_frame’:
>> src/libavfilter/vaf_spectrumsynth.c:429:12: warning: ‘end’ may be used 
>> uninitialized in this function [-Wuninitialized]
>> src/libavfilter/vaf_spectrumsynth.c:428:14: warning: ‘start’ may be used 
>> uninitialized in this function [-Wuninitialized]
>> src/libavfilter/vaf_spectrumsynth.c: In function ‘try_push_frames’:
>> src/libavfilter/vaf_spectrumsynth.c:437:9: warning: ‘ret’ may be used 
>> uninitialized in this function [-Wuninitialized]
>> 
>> arm-linux-gnueabi-gcc-4.6 (Debian 4.6.3-15) 4.6.3

Master is now patched so that these should compile with
HAVE_ARMV6_INLINE set.

Regards

John Cox

[FFmpeg-devel] [PATCH v2] configure fix arm inline defines

2018-06-04 Thread John Cox

Hi

Actually this is the same patch as before but master has been fixed s.t.
enabling arm inline asm no longer breaks it:

I believe there is a bug in the arm feature detection for inline asm in
configure and I have a patch for it.

Currently using a command line like:

./configure --enable-cross-compile --arch=arm --cpu=cortex-a7
--target-os=linux --cross-prefix=arm-linux-gnueabihf-

gives in config.h:

#define HAVE_ARMV5TE 1
#define HAVE_ARMV6 1
#define HAVE_ARMV6T2 1
#define HAVE_ARMV8 0
#define HAVE_NEON 1
#define HAVE_VFP 1
#define HAVE_VFPV3 1
#define HAVE_SETEND 1
...
#define HAVE_ARMV5TE_EXTERNAL 1
#define HAVE_ARMV6_EXTERNAL 1
#define HAVE_ARMV6T2_EXTERNAL 1
#define HAVE_ARMV8_EXTERNAL 0
#define HAVE_NEON_EXTERNAL 0
#define HAVE_VFP_EXTERNAL 1
#define HAVE_VFPV3_EXTERNAL 1
#define HAVE_SETEND_EXTERNAL 1
...
#define HAVE_ARMV5TE_INLINE 0
#define HAVE_ARMV6_INLINE 0
#define HAVE_ARMV6T2_INLINE 0
#define HAVE_ARMV8_INLINE 0
#define HAVE_NEON_INLINE 0
#define HAVE_VFP_INLINE 0
#define HAVE_VFPV3_INLINE 0
#define HAVE_SETEND_INLINE 0

With the patch below you get

...
#define HAVE_ARMV5TE 1
#define HAVE_ARMV6 1
#define HAVE_ARMV6T2 1
#define HAVE_ARMV8 0
#define HAVE_NEON 1
#define HAVE_VFP 1
#define HAVE_VFPV3 1
#define HAVE_SETEND 1
...
#define HAVE_ARMV5TE_EXTERNAL 1
#define HAVE_ARMV6_EXTERNAL 1
#define HAVE_ARMV6T2_EXTERNAL 1
#define HAVE_ARMV8_EXTERNAL 0
#define HAVE_NEON_EXTERNAL 0
#define HAVE_VFP_EXTERNAL 1
#define HAVE_VFPV3_EXTERNAL 1
#define HAVE_SETEND_EXTERNAL 1
...
#define HAVE_ARMV5TE_INLINE 1
#define HAVE_ARMV6_INLINE 1
#define HAVE_ARMV6T2_INLINE 1
#define HAVE_ARMV8_INLINE 0
#define HAVE_NEON_INLINE 0
#define HAVE_VFP_INLINE 1
#define HAVE_VFPV3_INLINE 1
#define HAVE_SETEND_INLINE 1

If I want to get Neon enabled as well then I need to have -mfpu=neon
on the command line too.  I'm not sure how to get it there unless I pass
it as extra flags.

This patch adds quotes around the asm that is in the __asm__ statement

Regards

John Cox

diff --git a/configure b/configure
index 22eeca22a5..4dbee8d349 100755
--- a/configure
+++ b/configure
@@ -1040,7 +1040,7 @@ EOF

 check_insn(){
 log check_insn "$@"
-check_inline_asm ${1}_inline "$2"
+check_inline_asm ${1}_inline "\"$2\""
 check_as ${1}_external "$2"
 }

Re: [FFmpeg-devel] [PATCH v2] configure fix arm inline defines

2018-06-06 Thread John Cox

>Hi
>
>Actually this is the same patch as before but master has been fixed s.t.
>enabling arm inline asm no longer breaks it:
>
>I believe there is a bug in the arm feature detection for inline asm in
>configure and I have a patch for it.
>
>Currently using a command line like:
>
>./configure --enable-cross-compile --arch=arm --cpu=cortex-a7
>--target-os=linux --cross-prefix=arm-linux-gnueabihf-
>
>gives in config.h:
>
>#define HAVE_ARMV5TE 1
>#define HAVE_ARMV6 1
>#define HAVE_ARMV6T2 1
>#define HAVE_ARMV8 0
>#define HAVE_NEON 1
>#define HAVE_VFP 1
>#define HAVE_VFPV3 1
>#define HAVE_SETEND 1
>...
>#define HAVE_ARMV5TE_EXTERNAL 1
>#define HAVE_ARMV6_EXTERNAL 1
>#define HAVE_ARMV6T2_EXTERNAL 1
>#define HAVE_ARMV8_EXTERNAL 0
>#define HAVE_NEON_EXTERNAL 0
>#define HAVE_VFP_EXTERNAL 1
>#define HAVE_VFPV3_EXTERNAL 1
>#define HAVE_SETEND_EXTERNAL 1
>...
>#define HAVE_ARMV5TE_INLINE 0
>#define HAVE_ARMV6_INLINE 0
>#define HAVE_ARMV6T2_INLINE 0
>#define HAVE_ARMV8_INLINE 0
>#define HAVE_NEON_INLINE 0
>#define HAVE_VFP_INLINE 0
>#define HAVE_VFPV3_INLINE 0
>#define HAVE_SETEND_INLINE 0
>
>With the patch below you get
>
>...
>#define HAVE_ARMV5TE 1
>#define HAVE_ARMV6 1
>#define HAVE_ARMV6T2 1
>#define HAVE_ARMV8 0
>#define HAVE_NEON 1
>#define HAVE_VFP 1
>#define HAVE_VFPV3 1
>#define HAVE_SETEND 1
>...
>#define HAVE_ARMV5TE_EXTERNAL 1
>#define HAVE_ARMV6_EXTERNAL 1
>#define HAVE_ARMV6T2_EXTERNAL 1
>#define HAVE_ARMV8_EXTERNAL 0
>#define HAVE_NEON_EXTERNAL 0
>#define HAVE_VFP_EXTERNAL 1
>#define HAVE_VFPV3_EXTERNAL 1
>#define HAVE_SETEND_EXTERNAL 1
>...
>#define HAVE_ARMV5TE_INLINE 1
>#define HAVE_ARMV6_INLINE 1
>#define HAVE_ARMV6T2_INLINE 1
>#define HAVE_ARMV8_INLINE 0
>#define HAVE_NEON_INLINE 0
>#define HAVE_VFP_INLINE 1
>#define HAVE_VFPV3_INLINE 1
>#define HAVE_SETEND_INLINE 1
>
>If I want to get Neon enabled as well then I need to have -mfpu=neon
>on the command line too.  I'm not sure how to get it there unless I pass
>it as extra flags.
>
>This patch adds quotes around the asm that is in the __asm__ statement
>
>Regards
>
>John Cox
>
>diff --git a/configure b/configure
>index 22eeca22a5..4dbee8d349 100755
>--- a/configure
>+++ b/configure
>@@ -1040,7 +1040,7 @@ EOF
>
> check_insn(){
> log check_insn "$@"
>-check_inline_asm ${1}_inline "$2"
>+check_inline_asm ${1}_inline "\"$2\""
> check_as ${1}_external "$2"
> }

Ping

This fixes the regression whereby no arm inline asm is ever enabled.

There is still the neon inline regression, but that will be another
patch.
Master now compiles OK with arm inline asm enabled (which it didn't the
first time this patch was suggested).

JC

[FFmpeg-devel] [PATCH] configure: fix inline neon regression

2018-06-07 Thread John Cox

Hi

This patch fixes the regression whereby inline neon is not enabled

Actually I'm a bit unsure about this patch (despite the fact I'm
submitting it).  It does do its job in that if you specify an armv7a cpu
then it will try to enable neon, but it is a bit mucky due to
uncertainties about exactly what capabilities each cpu actually has.

Really configure probably wants a --fpu= option, but my understanding of
how it is meant to work isn't up to that, so for the moment if the fpu
type is specified by the user then I expect it to turn up in
cextra_flags.

I'll also note that probe_arm_arch ends up setting subarch to armv7-a
when the other bits of the script expect armv7a (although gcc wants
armv7-a in -march).  Again I am confused by this but I'm not sure what
the right answer is let alone the correct fix.  Maybe whoever wrote this
bit of configure could revisit it?

Regards

John Cox


neon_inline.patch
Description: Binary data

[FFmpeg-devel] [PATCH] avfilter/vf_bwdif: Add capability to deinterlace NV12

2024-01-12 Thread John Cox

As bwdif takes no account of horizontally adjacent pixels the same
code can be used on planes that have multiple components as is used
on single component planes. Update the filtering code to cope with
multi-component planes and add NV12 to the list of supported formats.

Signed-off-by: John Cox 
---
 libavfilter/vf_bwdif.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/libavfilter/vf_bwdif.c b/libavfilter/vf_bwdif.c
index 353cd0b61a..e07783ff70 100644
--- a/libavfilter/vf_bwdif.c
+++ b/libavfilter/vf_bwdif.c
@@ -115,19 +115,28 @@ static void filter(AVFilterContext *ctx, AVFrame *dstpic,
 YADIFContext *yadif = &bwdif->yadif;
 ThreadData td = { .frame = dstpic, .parity = parity, .tff = tff };
 int i;
+int last_plane = -1;
 
 for (i = 0; i < yadif->csp->nb_components; i++) {
 int w = dstpic->width;
 int h = dstpic->height;
+const AVComponentDescriptor * const comp = yadif->csp->comp + i;
+
+// If the last plane was the same as this plane assume we've dealt
+// with all the pels already
+if (last_plane == comp->plane)
+continue;
+last_plane = comp->plane;
 
 if (i == 1 || i == 2) {
 w = AV_CEIL_RSHIFT(w, yadif->csp->log2_chroma_w);
 h = AV_CEIL_RSHIFT(h, yadif->csp->log2_chroma_h);
 }
 
-td.w = w;
-td.h = h;
-td.plane = i;
+// comp step is in bytes but td.w is in pels
+td.w   = w * comp->step / ((comp->depth + 7) / 8);
+td.h   = h;
+td.plane   = comp->plane;
 
 ff_filter_execute(ctx, filter_slice, &td, NULL,
   FFMIN((h+3)/4, ff_filter_get_nb_threads(ctx)));
@@ -162,6 +171,7 @@ static const enum AVPixelFormat pix_fmts[] = {
 AV_PIX_FMT_YUVA420P9, AV_PIX_FMT_YUVA422P9, AV_PIX_FMT_YUVA444P9,
 AV_PIX_FMT_YUVA420P10, AV_PIX_FMT_YUVA422P10, AV_PIX_FMT_YUVA444P10,
 AV_PIX_FMT_YUVA420P16, AV_PIX_FMT_YUVA422P16, AV_PIX_FMT_YUVA444P16,
+AV_PIX_FMT_NV12,
 AV_PIX_FMT_GBRP, AV_PIX_FMT_GBRP9, AV_PIX_FMT_GBRP10,
 AV_PIX_FMT_GBRP12, AV_PIX_FMT_GBRP14, AV_PIX_FMT_GBRP16,
 AV_PIX_FMT_GBRAP, AV_PIX_FMT_GBRAP16,
-- 
2.40.1
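
As a worked illustration of the width calculation in the patch above
(td.w = w * comp->step / ((comp->depth + 7) / 8)), here is a standalone
sketch.  CompInfo is a hypothetical mirror of the two component descriptor
fields the patch reads, and the step/depth values used in the examples are
what I understand them to be for these formats, so treat them as
assumptions.

```c
#include <assert.h>

/* Hypothetical mirror of the descriptor fields used by the patch. */
typedef struct CompInfo {
    int step;   /* bytes between two horizontally adjacent samples */
    int depth;  /* bits per sample */
} CompInfo;

/* Same rounding as AV_CEIL_RSHIFT: divide by 2^b, rounding up. */
static int ceil_rshift(int a, int b)
{
    return (a + (1 << b) - 1) >> b;
}

/* Number of samples per line that the filter has to process on the plane. */
static int plane_width_in_samples(int w, CompInfo c)
{
    const int bytes_per_sample = (c.depth + 7) / 8;

    assert(bytes_per_sample > 0 && c.step > 0);
    return w * c.step / bytes_per_sample;
}
```

For a 1920-wide NV12 frame the chroma plane has 960 UV pairs per line, each
pair taking 2 bytes with 1 byte per sample, so the filter is asked to
process 1920 samples.  For a planar 10-bit format the step and the sample
size are both 2 bytes, so the width is unchanged at 960.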


[FFmpeg-devel] [PATCH 00/15] avfilter/vf_bwdif: Add aarch64 neon functions

2023-06-29 Thread John Cox

Also adds a filter_line3 method which on aarch64 neon yields approx 30%
speedup over 2xfilter_line and a memcpy

John Cox (15):
  avfilter/vf_bwdif: Add outline for aarch neon functions
  avfilter/vf_bwdif: Add common macros and consts for aarch64 neon
  avfilter/vf_bwdif: Export C filter_intra
  avfilter/vf_bwdif: Add neon for filter_intra
  tests/checkasm: Add test for vf_bwdif filter_intra
  avfilter/vf_bwdif: Add clip and spatial macros for aarch64 neon
  avfilter/vf_bwdif: Export C filter_edge
  avfilter/vf_bwdif: Add neon for filter_edge
  tests/checkasm: Add test for vf_bwdif filter_edge
  avfilter/vf_bwdif: Export C filter_line
  avfilter/vf_bwdif: Add neon for filter_line
  avfilter/vf_bwdif: Add a filter_line3 method for optimisation
  avfilter/vf_bwdif: Add neon for filter_line3
  tests/checkasm: Add test for vf_bwdif filter_line3
  avfilter/vf_bwdif: Block filter slices into a multiple of 4 lines

 libavfilter/aarch64/Makefile|   2 +
 libavfilter/aarch64/vf_bwdif_init_aarch64.c | 125 
 libavfilter/aarch64/vf_bwdif_neon.S | 780 
 libavfilter/bwdif.h |  20 +
 libavfilter/vf_bwdif.c  |  70 +-
 tests/checkasm/vf_bwdif.c   | 172 +
 6 files changed, 1154 insertions(+), 15 deletions(-)
 create mode 100644 libavfilter/aarch64/vf_bwdif_init_aarch64.c
 create mode 100644 libavfilter/aarch64/vf_bwdif_neon.S

-- 
2.39.2


[FFmpeg-devel] [PATCH 01/15] avfilter/vf_bwdif: Add outline for aarch neon functions

2023-06-29 Thread John Cox

Outline but no actual functions.

Signed-off-by: John Cox 
---
 libavfilter/aarch64/Makefile|  2 ++
 libavfilter/aarch64/vf_bwdif_init_aarch64.c | 39 +
 libavfilter/aarch64/vf_bwdif_neon.S | 25 +
 libavfilter/bwdif.h |  1 +
 libavfilter/vf_bwdif.c  |  2 ++
 5 files changed, 69 insertions(+)
 create mode 100644 libavfilter/aarch64/vf_bwdif_init_aarch64.c
 create mode 100644 libavfilter/aarch64/vf_bwdif_neon.S

diff --git a/libavfilter/aarch64/Makefile b/libavfilter/aarch64/Makefile
index b58daa3a3f..b68209bc94 100644
--- a/libavfilter/aarch64/Makefile
+++ b/libavfilter/aarch64/Makefile
@@ -1,3 +1,5 @@
+OBJS-$(CONFIG_BWDIF_FILTER)  += aarch64/vf_bwdif_init_aarch64.o
 OBJS-$(CONFIG_NLMEANS_FILTER)+= aarch64/vf_nlmeans_init.o
 
+NEON-OBJS-$(CONFIG_BWDIF_FILTER) += aarch64/vf_bwdif_neon.o
 NEON-OBJS-$(CONFIG_NLMEANS_FILTER)   += aarch64/vf_nlmeans_neon.o
diff --git a/libavfilter/aarch64/vf_bwdif_init_aarch64.c b/libavfilter/aarch64/vf_bwdif_init_aarch64.c
new file mode 100644
index 0000000000..86d53b2ca1
--- /dev/null
+++ b/libavfilter/aarch64/vf_bwdif_init_aarch64.c
@@ -0,0 +1,39 @@
+/*
+ * bwdif aarch64 NEON optimisations
+ *
+ * Copyright (c) 2023 John Cox 
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "libavutil/common.h"
+#include "libavfilter/bwdif.h"
+#include "libavutil/aarch64/cpu.h"
+
+void
+ff_bwdif_init_aarch64(BWDIFContext *s, int bit_depth)
+{
+const int cpu_flags = av_get_cpu_flags();
+
+if (bit_depth != 8)
+return;
+
+if (!have_neon(cpu_flags))
+return;
+
+}
+
diff --git a/libavfilter/aarch64/vf_bwdif_neon.S b/libavfilter/aarch64/vf_bwdif_neon.S
new file mode 100644
index 0000000000..639ab22998
--- /dev/null
+++ b/libavfilter/aarch64/vf_bwdif_neon.S
@@ -0,0 +1,25 @@
+/*
+ * bwdif aarch64 NEON optimisations
+ *
+ * Copyright (c) 2023 John Cox 
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+
+#include "libavutil/aarch64/asm.S"
+
diff --git a/libavfilter/bwdif.h b/libavfilter/bwdif.h
index 5749345f78..6a0f70487a 100644
--- a/libavfilter/bwdif.h
+++ b/libavfilter/bwdif.h
@@ -39,5 +39,6 @@ typedef struct BWDIFContext {
 
 void ff_bwdif_init_filter_line(BWDIFContext *bwdif, int bit_depth);
 void ff_bwdif_init_x86(BWDIFContext *bwdif, int bit_depth);
+void ff_bwdif_init_aarch64(BWDIFContext *bwdif, int bit_depth);
 
 #endif /* AVFILTER_BWDIF_H */
diff --git a/libavfilter/vf_bwdif.c b/libavfilter/vf_bwdif.c
index e278cf1217..39a51429ac 100644
--- a/libavfilter/vf_bwdif.c
+++ b/libavfilter/vf_bwdif.c
@@ -369,6 +369,8 @@ av_cold void ff_bwdif_init_filter_line(BWDIFContext *s, int bit_depth)
 
 #if ARCH_X86
 ff_bwdif_init_x86(s, bit_depth);
+#elif ARCH_AARCH64
+ff_bwdif_init_aarch64(s, bit_depth);
 #endif
 }
 
-- 
2.39.2


[FFmpeg-devel] [PATCH 02/15] avfilter/vf_bwdif: Add common macros and consts for aarch64 neon

2023-06-29 Thread John Cox

Add macros for dual scalar half->single multiply and accumulate
Add macro for shift, saturate and shorten single to byte
Add filter constants

Signed-off-by: John Cox 
---
 libavfilter/aarch64/vf_bwdif_neon.S | 46 +
 1 file changed, 46 insertions(+)

diff --git a/libavfilter/aarch64/vf_bwdif_neon.S b/libavfilter/aarch64/vf_bwdif_neon.S
index 639ab22998..a8f0ed525a 100644
--- a/libavfilter/aarch64/vf_bwdif_neon.S
+++ b/libavfilter/aarch64/vf_bwdif_neon.S
@@ -23,3 +23,49 @@
 
 #include "libavutil/aarch64/asm.S"
 
+.macro SQSHRUNN b, s0, s1, s2, s3, n
+sqshrun \s0\().4h, \s0\().4s, #\n - 8
+sqshrun2\s0\().8h, \s1\().4s, #\n - 8
+sqshrun \s1\().4h, \s2\().4s, #\n - 8
+sqshrun2\s1\().8h, \s3\().4s, #\n - 8
+uzp2\b\().16b, \s0\().16b, \s1\().16b
+.endm
+
+.macro SMULL4K a0, a1, a2, a3, s0, s1, k
+smull   \a0\().4s, \s0\().4h, \k
+smull2  \a1\().4s, \s0\().8h, \k
+smull   \a2\().4s, \s1\().4h, \k
+smull2  \a3\().4s, \s1\().8h, \k
+.endm
+
+.macro UMULL4K a0, a1, a2, a3, s0, s1, k
+umull   \a0\().4s, \s0\().4h, \k
+umull2  \a1\().4s, \s0\().8h, \k
+umull   \a2\().4s, \s1\().4h, \k
+umull2  \a3\().4s, \s1\().8h, \k
+.endm
+
+.macro UMLAL4K a0, a1, a2, a3, s0, s1, k
+umlal   \a0\().4s, \s0\().4h, \k
+umlal2  \a1\().4s, \s0\().8h, \k
+umlal   \a2\().4s, \s1\().4h, \k
+umlal2  \a3\().4s, \s1\().8h, \k
+.endm
+
+.macro UMLSL4K a0, a1, a2, a3, s0, s1, k
+umlsl   \a0\().4s, \s0\().4h, \k
+umlsl2  \a1\().4s, \s0\().8h, \k
+umlsl   \a2\().4s, \s1\().4h, \k
+umlsl2  \a3\().4s, \s1\().8h, \k
+.endm
+
+// static const uint16_t coef_lf[2] = { 4309, 213 };
+// static const uint16_t coef_hf[3] = { 5570, 3801, 1016 };
+// static const uint16_t coef_sp[2] = { 5077, 981 };
+
+.align 16
+coeffs:
+.hword  4309 * 4, 213 * 4   // lf[0]*4 = v0.h[0]
+.hword  5570, 3801, 1016, -3801 // hf[0] = v0.h[2], -hf[1] = v0.h[5]
+.hword  5077, 981   // sp[0] = v0.h[6]
+
-- 
2.39.2
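
For anyone reading the asm without an aarch64 manual to hand, here is a
scalar C model of what one lane of the SQSHRUNN macro above appears to
compute.  This is my reading of the macro rather than anything stated in
the patch, and the function names are hypothetical: sqshrun narrows a
signed 32-bit value to an unsigned 16-bit one with saturation after a
truncating shift by n - 8, and uzp2 on bytes then keeps the high byte of
each 16-bit result.  Taken together that behaves like a saturating
"shift right by n and narrow to a byte".

```c
#include <assert.h>
#include <stdint.h>

/* Model of sqshrun: truncating shift, saturate to unsigned 16 bit. */
static uint16_t sat_shift_u16(int32_t x, int shift)
{
    int32_t v;

    if (x < 0)
        return 0;
    v = x >> shift;
    return v > 0xFFFF ? 0xFFFF : (uint16_t)v;
}

/* One lane of the macro: narrow with shift n - 8, keep the high byte. */
static uint8_t sqshrunn_lane(int32_t x, int n)
{
    assert(n >= 9 && n <= 24);   /* sqshrun 32->16 accepts shifts 1..16 */
    return (uint8_t)(sat_shift_u16(x, n - 8) >> 8);
}

/* Reference: saturate (x >> n) directly to an unsigned byte. */
static uint8_t sat_shift_u8(int32_t x, int n)
{
    int32_t v;

    if (x < 0)
        return 0;
    v = x >> n;
    return v > 255 ? 255 : (uint8_t)v;
}
```

The two-step form lets a single uzp2 pack sixteen results from two vectors
of halfwords, which is presumably why the macro is written that way.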


[FFmpeg-devel] [PATCH 03/15] avfilter/vf_bwdif: Export C filter_intra

2023-06-29 Thread John Cox

Needed for tail fixup of neon code

Signed-off-by: John Cox 
---
 libavfilter/bwdif.h| 3 +++
 libavfilter/vf_bwdif.c | 6 +++---
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/libavfilter/bwdif.h b/libavfilter/bwdif.h
index 6a0f70487a..ae6f6ce223 100644
--- a/libavfilter/bwdif.h
+++ b/libavfilter/bwdif.h
@@ -41,4 +41,7 @@ void ff_bwdif_init_filter_line(BWDIFContext *bwdif, int bit_depth);
 void ff_bwdif_init_x86(BWDIFContext *bwdif, int bit_depth);
 void ff_bwdif_init_aarch64(BWDIFContext *bwdif, int bit_depth);
 
+void ff_bwdif_filter_intra_c(void *dst1, void *cur1, int w, int prefs, int mrefs,
+ int prefs3, int mrefs3, int parity, int clip_max);
+
 #endif /* AVFILTER_BWDIF_H */
diff --git a/libavfilter/vf_bwdif.c b/libavfilter/vf_bwdif.c
index 39a51429ac..035fc58670 100644
--- a/libavfilter/vf_bwdif.c
+++ b/libavfilter/vf_bwdif.c
@@ -122,8 +122,8 @@ typedef struct ThreadData {
 next2++; \
 }
 
-static void filter_intra(void *dst1, void *cur1, int w, int prefs, int mrefs,
- int prefs3, int mrefs3, int parity, int clip_max)
+void ff_bwdif_filter_intra_c(void *dst1, void *cur1, int w, int prefs, int mrefs,
+ int prefs3, int mrefs3, int parity, int clip_max)
 {
 uint8_t *dst = dst1;
 uint8_t *cur = cur1;
@@ -362,7 +362,7 @@ av_cold void ff_bwdif_init_filter_line(BWDIFContext *s, int bit_depth)
 s->filter_line  = filter_line_c_16bit;
 s->filter_edge  = filter_edge_16bit;
 } else {
-s->filter_intra = filter_intra;
+s->filter_intra = ff_bwdif_filter_intra_c;
 s->filter_line  = filter_line_c;
 s->filter_edge  = filter_edge;
 }
-- 
2.39.2
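
The "tail fixup" pattern this export is needed for can be sketched as
follows.  The per-pixel operation here is a toy (byte inversion), and all
names are hypothetical; the real filter is of course more involved.  The
idea is only that a routine which handles whole blocks of 16 pixels does
the bulk of the line and the plain C version finishes the remainder.

```c
#include <assert.h>
#include <stdint.h>

/* Toy per-pixel operation standing in for the C filter. */
static void invert_scalar(uint8_t *dst, const uint8_t *src, int w)
{
    for (int x = 0; x < w; x++)
        dst[x] = (uint8_t)(255 - src[x]);
}

/* Stand-in for a vector routine that only accepts multiples of 16. */
static void invert_blocks16(uint8_t *dst, const uint8_t *src, int w)
{
    assert((w & 15) == 0);
    for (int x = 0; x < w; x += 16)
        for (int i = 0; i < 16; i++)
            dst[x + i] = (uint8_t)(255 - src[x + i]);
}

/* Bulk of the line via the block routine, remainder via the scalar one. */
static void invert_with_tail(uint8_t *dst, const uint8_t *src, int w)
{
    const int w0 = w & ~15;   /* largest multiple of 16 not above w */

    assert(w >= 0);
    invert_blocks16(dst, src, w0);
    if (w0 < w)
        invert_scalar(dst + w0, src + w0, w - w0);
}
```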


[FFmpeg-devel] [PATCH 04/15] avfilter/vf_bwdif: Add neon for filter_intra

2023-06-29 Thread John Cox

Signed-off-by: John Cox 
---
 libavfilter/aarch64/vf_bwdif_init_aarch64.c | 17 +++
 libavfilter/aarch64/vf_bwdif_neon.S | 53 +
 2 files changed, 70 insertions(+)

diff --git a/libavfilter/aarch64/vf_bwdif_init_aarch64.c b/libavfilter/aarch64/vf_bwdif_init_aarch64.c
index 86d53b2ca1..3ffaa07ab3 100644
--- a/libavfilter/aarch64/vf_bwdif_init_aarch64.c
+++ b/libavfilter/aarch64/vf_bwdif_init_aarch64.c
@@ -24,6 +24,22 @@
 #include "libavfilter/bwdif.h"
 #include "libavutil/aarch64/cpu.h"
 
+void ff_bwdif_filter_intra_neon(void *dst1, void *cur1, int w, int prefs, int mrefs,
+int prefs3, int mrefs3, int parity, int clip_max);
+
+
+static void filter_intra_helper(void *dst1, void *cur1, int w, int prefs, int mrefs,
+int prefs3, int mrefs3, int parity, int clip_max)
+{
+const int w0 = clip_max != 255 ? 0 : w & ~15;
+
+ff_bwdif_filter_intra_neon(dst1, cur1, w0, prefs, mrefs, prefs3, mrefs3, parity, clip_max);
+
+if (w0 < w)
+ff_bwdif_filter_intra_c((char *)dst1 + w0, (char *)cur1 + w0,
+w - w0, prefs, mrefs, prefs3, mrefs3, parity, clip_max);
+}
+
 void
 ff_bwdif_init_aarch64(BWDIFContext *s, int bit_depth)
 {
@@ -35,5 +51,6 @@ ff_bwdif_init_aarch64(BWDIFContext *s, int bit_depth)
 if (!have_neon(cpu_flags))
 return;
 
+s->filter_intra = filter_intra_helper;
 }
 
diff --git a/libavfilter/aarch64/vf_bwdif_neon.S b/libavfilter/aarch64/vf_bwdif_neon.S
index a8f0ed525a..b863b3447d 100644
--- a/libavfilter/aarch64/vf_bwdif_neon.S
+++ b/libavfilter/aarch64/vf_bwdif_neon.S
@@ -69,3 +69,56 @@ coeffs:
 .hword  5570, 3801, 1016, -3801 // hf[0] = v0.h[2], -hf[1] = v0.h[5]
 .hword  5077, 981   // sp[0] = v0.h[6]
 
+// 
+//
+// void ff_bwdif_filter_intra_neon(
+//  void *dst1, // x0
+//  void *cur1, // x1
+//  int w,  // w2
+//  int prefs,  // w3
+//  int mrefs,  // w4
+//  int prefs3, // w5
+//  int mrefs3, // w6
+//  int parity, // w7   unused
+//  int clip_max)   // [sp, #0] unused
+
+function ff_bwdif_filter_intra_neon, export=1
+cmp w2, #0
+ble 99f
+
+ldr q0, coeffs
+
+//for (x = 0; x < w; x++) {
+10:
+
+//interpol = (coef_sp[0] * (cur[mrefs] + cur[prefs]) - coef_sp[1] * (cur[mrefs3] + cur[prefs3])) >> 13;
+ldr q31, [x1, w4, SXTW]
+ldr q30, [x1, w3, SXTW]
+ldr q29, [x1, w6, SXTW]
+ldr q28, [x1, w5, SXTW]
+
+uaddl   v20.8h,  v31.8b,  v30.8b
+uaddl2  v21.8h,  v31.16b, v30.16b
+
+UMULL4K v2, v3, v4, v5, v20, v21, v0.h[6]
+
+uaddl   v20.8h,  v29.8b,  v28.8b
+uaddl2  v21.8h,  v29.16b, v28.16b
+
+UMLSL4K v2, v3, v4, v5, v20, v21, v0.h[7]
+
+//dst[0] = av_clip(interpol, 0, clip_max);
+SQSHRUNN v2, v2, v3, v4, v5, 13
+str q2, [x0], #16
+
+//dst++;
+//cur++;
+//}
+
+subs w2,  w2,  #16
+add x1,  x1,  #16
+bgt 10b
+
+99:
+ret
+endfunc
-- 
2.39.2
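
For reference, the scalar formula quoted in the asm comments can be sketched in C as below. This is a minimal sketch rather than the FILTER_INTRA code from vf_bwdif.c: the names clip_int and intra_pixel_sketch are hypothetical, 8-bit samples are assumed, and the coefficients 5077 and 981 are taken from the coeffs table above. Because 5077 - 981 = 4096, a flat area passes through unchanged after the shift by 13.

```c
#include <stdint.h>

/* Hypothetical helper: clamp v into [lo, hi]. */
static int clip_int(int v, int lo, int hi)
{
    return v < lo ? lo : v > hi ? hi : v;
}

/* Scalar sketch of the per-pixel intra interpolation.
 * cur points at the pixel being interpolated; mrefs/prefs are byte offsets
 * to the lines one above/below, mrefs3/prefs3 to the lines three above/below. */
static uint8_t intra_pixel_sketch(const uint8_t *cur, int prefs, int mrefs,
                                  int prefs3, int mrefs3, int clip_max)
{
    static const int coef_sp[2] = { 5077, 981 };
    const int sum = coef_sp[0] * (cur[mrefs] + cur[prefs])
                  - coef_sp[1] * (cur[mrefs3] + cur[prefs3]);
    /* A negative sum clips to 0 whatever the shift yields, so the sketch
     * avoids right-shifting a negative value (implementation-defined in C). */
    const int interpol = sum < 0 ? 0 : sum >> 13;

    return (uint8_t)clip_int(interpol, 0, clip_max);
}
```

With near lines at 255 and far lines at 0 the unclipped result is (5077 * 510) >> 13 = 316, which is why the final clip (the saturating narrow in the asm) is needed.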

[FFmpeg-devel] [PATCH 05/15] tests/checkasm: Add test for vf_bwdif filter_intra

2023-06-29 Thread John Cox

Signed-off-by: John Cox 
---
 tests/checkasm/vf_bwdif.c | 37 +
 1 file changed, 37 insertions(+)

diff --git a/tests/checkasm/vf_bwdif.c b/tests/checkasm/vf_bwdif.c
index 46224bb575..034bbabb4c 100644
--- a/tests/checkasm/vf_bwdif.c
+++ b/tests/checkasm/vf_bwdif.c
@@ -20,6 +20,7 @@
 #include "checkasm.h"
 #include "libavcodec/internal.h"
 #include "libavfilter/bwdif.h"
+#include "libavutil/mem_internal.h"
 
 #define WIDTH 256
 
@@ -81,4 +82,40 @@ void checkasm_check_vf_bwdif(void)
 BODY(uint16_t, 10);
 report("bwdif10");
 }
+
+if (check_func(ctx_8.filter_intra, "bwdif8.intra")) {
+LOCAL_ALIGNED_16(uint8_t, cur0,  [11*WIDTH]);
+LOCAL_ALIGNED_16(uint8_t, cur1,  [11*WIDTH]);
+LOCAL_ALIGNED_16(uint8_t, dst0,  [WIDTH*3]);
+LOCAL_ALIGNED_16(uint8_t, dst1,  [WIDTH*3]);
+const int stride = WIDTH;
+const int mask = (1<<8)-1;
+
+declare_func(void, void *dst1, void *cur1, int w, int prefs, int mrefs,
+ int prefs3, int mrefs3, int parity, int clip_max);
+
+randomize_buffers( cur0,  cur1, mask, 11*WIDTH);
+memset(dst0, 0xba, WIDTH * 3);
+memset(dst1, 0xba, WIDTH * 3);
+
+call_ref(dst0 + stride,
+ cur0 + stride * 4, WIDTH,
+ stride, -stride, stride * 3, -stride * 3,
+ 0, mask);
+call_new(dst1 + stride,
+ cur0 + stride * 4, WIDTH,
+ stride, -stride, stride * 3, -stride * 3,
+ 0, mask);
+
+if (memcmp(dst0, dst1, WIDTH*3)
+|| memcmp( cur0,  cur1, WIDTH*11))
+fail();
+
+bench_new(dst1 + stride,
+  cur0 + stride * 4, WIDTH,
+  stride, -stride, stride * 3, -stride * 3,
+  0, mask);
+
+report("bwdif8.intra");
+}
 }
-- 
2.39.2
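
The test above fills a three-row dst buffer with 0xba, hands the function a pointer to the middle row, and then compares all three rows, so a write outside the requested row shows up as a mismatch. The sketch below illustrates that guard-row idea in isolation. It is an assumption-laden illustration, not checkasm code: ROW, write_row_ok, write_row_overrun and guard_rows_intact are hypothetical names, and it inspects the guard bytes directly, whereas the real test compares the reference and new output buffers with memcmp.

```c
#include <stdint.h>
#include <string.h>

#define ROW   16    /* hypothetical row width for this sketch */
#define GUARD 0xba  /* same fill byte as the test above */

/* Well-behaved stand-in: writes exactly w bytes. */
static void write_row_ok(uint8_t *dst, int w)
{
    memset(dst, 0x11, (size_t)w);
}

/* Faulty stand-in: writes one byte past the requested width. */
static void write_row_overrun(uint8_t *dst, int w)
{
    memset(dst, 0x11, (size_t)w + 1);
}

/* Runs fn on the middle row of a 3-row buffer and returns 1 if the rows
 * above and below still contain only the guard byte, 0 otherwise. */
static int guard_rows_intact(void (*fn)(uint8_t *dst, int w))
{
    uint8_t buf[3 * ROW];

    memset(buf, GUARD, sizeof(buf));
    fn(buf + ROW, ROW);

    for (int i = 0; i < ROW; i++)
        if (buf[i] != GUARD || buf[2 * ROW + i] != GUARD)
            return 0;
    return 1;
}
```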

[FFmpeg-devel] [PATCH 06/15] avfilter/vf_bwdif: Add clip and spatial macros for aarch64 neon

2023-06-29 Thread John Cox

Signed-off-by: John Cox 
---
 libavfilter/aarch64/vf_bwdif_neon.S | 59 +
 1 file changed, 59 insertions(+)

diff --git a/libavfilter/aarch64/vf_bwdif_neon.S b/libavfilter/aarch64/vf_bwdif_neon.S
index b863b3447d..6c5d1598f4 100644
--- a/libavfilter/aarch64/vf_bwdif_neon.S
+++ b/libavfilter/aarch64/vf_bwdif_neon.S
@@ -59,6 +59,65 @@
 umlsl2  \a3\().4s, \s1\().8h, \k
 .endm
 
+//  int b = m2s1 - m1;
+//  int f = p2s1 - p1;
+//  int dc = c0s1 - m1;
+//  int de = c0s1 - p1;
+//  int sp_max = FFMIN(p1 - c0s1, m1 - c0s1);
+//  sp_max = FFMIN(sp_max, FFMAX(-b,-f));
+//  int sp_min = FFMIN(c0s1 - p1, c0s1 - m1);
+//  sp_min = FFMIN(sp_min, FFMAX(b,f));
+//  diff = diff == 0 ? 0 : FFMAX3(diff, sp_min, sp_max);
+.macro SPAT_CHECK diff, m2s1, m1, c0s1, p1, p2s1, t0, t1, t2, t3
+uqsub   \t0\().16b, \p1\().16b, \c0s1\().16b
+uqsub   \t2\().16b, \m1\().16b, \c0s1\().16b
+umin    \t2\().16b, \t0\().16b, \t2\().16b
+
+uqsub   \t1\().16b, \m1\().16b, \m2s1\().16b
+uqsub   \t3\().16b, \p1\().16b, \p2s1\().16b
+umax    \t3\().16b, \t3\().16b, \t1\().16b
+umin    \t3\().16b, \t3\().16b, \t2\().16b
+
+uqsub   \t0\().16b, \c0s1\().16b, \p1\().16b
+uqsub   \t2\().16b, \c0s1\().16b, \m1\().16b
+umin    \t2\().16b, \t0\().16b, \t2\().16b
+
+uqsub   \t1\().16b, \m2s1\().16b, \m1\().16b
+uqsub   \t0\().16b, \p2s1\().16b, \p1\().16b
+umax    \t0\().16b, \t0\().16b, \t1\().16b
+umin    \t2\().16b, \t2\().16b, \t0\().16b
+
+cmeq    \t1\().16b, \diff\().16b, #0
+umax    \diff\().16b, \diff\().16b, \t3\().16b
+umax    \diff\().16b, \diff\().16b, \t2\().16b
+bic     \diff\().16b, \diff\().16b, \t1\().16b
+.endm
+
+//  i0 = s0;
+//  if (i0 > d0 + diff0)
+//  i0 = d0 + diff0;
+//  else if (i0 < d0 - diff0)
+//  i0 = d0 - diff0;
+//
+// i0 = s0 is safe
+.macro DIFF_CLIP i0, s0, d0, diff, t0, t1
+uqadd   \t0\().16b, \d0\().16b, \diff\().16b
+uqsub   \t1\().16b, \d0\().16b, \diff\().16b
+umin    \i0\().16b, \s0\().16b, \t0\().16b
+umax    \i0\().16b, \i0\().16b, \t1\().16b
+.endm
+
+//  i0 = FFABS(m1 - p1) > td0 ? i1 : i2;
+//  DIFF_CLIP
+//
+// i0 = i1 is safe
+.macro INTERPOL i0, i1, i2, m1, d0, p1, td0, diff, t0, t1, t2
+uabd    \t0\().16b, \m1\().16b, \p1\().16b
+cmhi    \t0\().16b, \t0\().16b, \td0\().16b
+bsl \t0\().16b, \i1\().16b, \i2\().16b
+DIFF_CLIP   \i0, \t0, \d0, \diff, \t1, \t2
+.endm
+
 // static const uint16_t coef_lf[2] = { 4309, 213 };
 // static const uint16_t coef_hf[3] = { 5570, 3801, 1016 };
 // static const uint16_t coef_sp[2] = { 5077, 981 };
-- 
2.39.2
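
The DIFF_CLIP macro replaces the if/else clamp in its comment with saturating add/subtract followed by min/max. The sketch below models one 8-bit lane of that sequence in C and checks it against the scalar form for every input. It is only a sketch under stated assumptions: all function names are hypothetical, and s0, d0 and diff are each assumed to lie in 0..255, as they do for 8-bit samples.

```c
#include <stdint.h>

/* Per-lane model of NEON uqadd on unsigned bytes (saturates at 255). */
static uint8_t sat_add_u8(uint8_t a, uint8_t b)
{
    const int s = a + b;
    return (uint8_t)(s > 255 ? 255 : s);
}

/* Per-lane model of NEON uqsub on unsigned bytes (saturates at 0). */
static uint8_t sat_sub_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(a > b ? a - b : 0);
}

/* The scalar form from the macro's comment, in plain int arithmetic. */
static int diff_clip_scalar(int s0, int d0, int diff)
{
    int i0 = s0;
    if (i0 > d0 + diff)
        i0 = d0 + diff;
    else if (i0 < d0 - diff)
        i0 = d0 - diff;
    return i0;
}

/* Per-lane model of the DIFF_CLIP instruction sequence. */
static uint8_t diff_clip_sat(uint8_t s0, uint8_t d0, uint8_t diff)
{
    const uint8_t hi = sat_add_u8(d0, diff);  /* uqadd */
    const uint8_t lo = sat_sub_u8(d0, diff);  /* uqsub */
    const uint8_t i0 = s0 < hi ? s0 : hi;     /* umin  */
    return i0 > lo ? i0 : lo;                 /* umax  */
}

/* Returns 1 if both forms agree for every 8-bit input triple. */
static int diff_clip_forms_agree(void)
{
    for (int s0 = 0; s0 < 256; s0++)
        for (int d0 = 0; d0 < 256; d0++)
            for (int diff = 0; diff < 256; diff++)
                if (diff_clip_scalar(s0, d0, diff) !=
                    diff_clip_sat((uint8_t)s0, (uint8_t)d0, (uint8_t)diff))
                    return 0;
    return 1;
}
```

The saturation is harmless because s0 itself never leaves 0..255: whenever d0 + diff would exceed 255 or d0 - diff would drop below 0, the corresponding bound cannot be the one that clamps s0.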

[FFmpeg-devel] [PATCH 13/15] avfilter/vf_bwdif: Add neon for filter_line3

2023-06-29 Thread John Cox

Signed-off-by: John Cox 
---
 libavfilter/aarch64/vf_bwdif_init_aarch64.c |  28 ++
 libavfilter/aarch64/vf_bwdif_neon.S | 278 
 2 files changed, 306 insertions(+)

diff --git a/libavfilter/aarch64/vf_bwdif_init_aarch64.c b/libavfilter/aarch64/vf_bwdif_init_aarch64.c
index 21e67884ab..f52bc4b9b4 100644
--- a/libavfilter/aarch64/vf_bwdif_init_aarch64.c
+++ b/libavfilter/aarch64/vf_bwdif_init_aarch64.c
@@ -36,6 +36,33 @@ void ff_bwdif_filter_line_neon(void *dst1, void *prev1, void *cur1, void *next1,
int prefs3, int mrefs3, int prefs4, int mrefs4,
int parity, int clip_max);
 
+void ff_bwdif_filter_line3_neon(void * dst1, int d_stride,
+const void * prev1, const void * cur1, const void * next1, int s_stride,
+int w, int parity, int clip_max);
+
+
+static void filter_line3_helper(void * dst1, int d_stride,
+const void * prev1, const void * cur1, const void * next1, int s_stride,
+int w, int parity, int clip_max)
+{
+// Asm works on 16 byte chunks
+// If w is a multiple of 16 then all is good - if not then if width rounded
+// up to nearest 16 will fit in both src & dst strides then allow the asm
+// to write over the padding bytes as that is almost certainly faster than
+// having to invoke the C version to clean up the tail.
+const int w1 = FFALIGN(w, 16);
+const int w0 = clip_max != 255 ? 0 :
+   d_stride >= w1 && s_stride >= w1 ? w : w & ~15;
+
+ff_bwdif_filter_line3_neon(dst1, d_stride,
+   prev1, cur1, next1, s_stride,
+   w0, parity, clip_max);
+
+if (w0 < w)
+ff_bwdif_filter_line3_c((char *)dst1 + w0, d_stride,
+(const char *)prev1 + w0, (const char *)cur1 + w0, (const char *)next1 + w0, s_stride,
+w - w0, parity, clip_max);
+}
 
 static void filter_line_helper(void *dst1, void *prev1, void *cur1, void *next1,
 int w, int prefs, int mrefs, int prefs2, int mrefs2,
@@ -93,5 +120,6 @@ ff_bwdif_init_aarch64(BWDIFContext *s, int bit_depth)
 s->filter_intra = filter_intra_helper;
 s->filter_line  = filter_line_helper;
 s->filter_edge  = filter_edge_helper;
+s->filter_line3 = filter_line3_helper;
 }
 
diff --git a/libavfilter/aarch64/vf_bwdif_neon.S b/libavfilter/aarch64/vf_bwdif_neon.S
index 675e97d966..bcffbe5793 100644
--- a/libavfilter/aarch64/vf_bwdif_neon.S
+++ b/libavfilter/aarch64/vf_bwdif_neon.S
@@ -128,6 +128,284 @@ coeffs:
 .hword  5570, 3801, 1016, -3801 // hf[0] = v0.h[2], -hf[1] = v0.h[5]
 .hword  5077, 981   // sp[0] = v0.h[6]
 
+// ===
+//
+// void ff_bwdif_filter_line3_neon(
+// void * dst1, // x0
+// int d_stride,// w1
+// const void * prev1,  // x2
+// const void * cur1,   // x3
+// const void * next1,  // x4
+// int s_stride,// w5
+// int w,   // w6
+// int parity,  // w7
+// int clip_max);   // [sp, #0] (Ignored)
+
+function ff_bwdif_filter_line3_neon, export=1
+// Sanity check w
+cmp w6, #0
+ble 99f
+
+// #define prev2 cur
+//const uint8_t * restrict next2 = parity ? prev : next;
+cmp w7, #0
+csel x17, x2, x4, ne
+
+// We want all the V registers - save all the ones we must
+stp d14, d15, [sp, #-64]!
+stp d8,  d9,  [sp, #48]
+stp d10, d11, [sp, #32]
+stp d12, d13, [sp, #16]
+
+ldr q0, coeffs
+
+// Some rearrangement of initial values for nice layout of refs in regs
+mov w10, w6 // w10 = loop count
+neg w9,  w5 // w9  = mref
+lsl w8,  w9,  #1// w8 =  mref2
+add w7,  w9,  w9, LSL #1// w7  = mref3
+lsl w6,  w9,  #2// w6  = mref4
+mov w11, w5 // w11 = pref
+lsl w12, w5,  #1// w12 = pref2
+add w13, w5,  w5, LSL #1// w13 = pref3
+lsl w14, w5,  #2// w14 = pref4
+add w15, w5,  w5, LSL #2// w15 = pref5
+add w16, w14, w12   // w16 = pref6
+
+lsl w5,  w1,  #1// w5 = d_stride * 2
+
+// for (x = 0; x

[FFmpeg-devel] [PATCH 07/15] avfilter/vf_bwdif: Export C filter_edge

2023-06-29 Thread John Cox

Needed for tail fixup of neon code

Signed-off-by: John Cox 
---
 libavfilter/bwdif.h| 4 
 libavfilter/vf_bwdif.c | 8 
 2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/libavfilter/bwdif.h b/libavfilter/bwdif.h
index ae6f6ce223..ae1616d366 100644
--- a/libavfilter/bwdif.h
+++ b/libavfilter/bwdif.h
@@ -41,6 +41,10 @@ void ff_bwdif_init_filter_line(BWDIFContext *bwdif, int bit_depth);
 void ff_bwdif_init_x86(BWDIFContext *bwdif, int bit_depth);
 void ff_bwdif_init_aarch64(BWDIFContext *bwdif, int bit_depth);
 
+void ff_bwdif_filter_edge_c(void *dst1, void *prev1, void *cur1, void *next1,
+int w, int prefs, int mrefs, int prefs2, int mrefs2,
+int parity, int clip_max, int spat);
+
 void ff_bwdif_filter_intra_c(void *dst1, void *cur1, int w, int prefs, int mrefs,
  int prefs3, int mrefs3, int parity, int clip_max);
 
diff --git a/libavfilter/vf_bwdif.c b/libavfilter/vf_bwdif.c
index 035fc58670..bec83111b4 100644
--- a/libavfilter/vf_bwdif.c
+++ b/libavfilter/vf_bwdif.c
@@ -150,9 +150,9 @@ static void filter_line_c(void *dst1, void *prev1, void *cur1, void *next1,
 FILTER2()
 }
 
-static void filter_edge(void *dst1, void *prev1, void *cur1, void *next1,
-int w, int prefs, int mrefs, int prefs2, int mrefs2,
-int parity, int clip_max, int spat)
+void ff_bwdif_filter_edge_c(void *dst1, void *prev1, void *cur1, void *next1,
+int w, int prefs, int mrefs, int prefs2, int mrefs2,
+int parity, int clip_max, int spat)
 {
 uint8_t *dst   = dst1;
 uint8_t *prev  = prev1;
@@ -364,7 +364,7 @@ av_cold void ff_bwdif_init_filter_line(BWDIFContext *s, int bit_depth)
 } else {
 s->filter_intra = ff_bwdif_filter_intra_c;
 s->filter_line  = filter_line_c;
-s->filter_edge  = filter_edge;
+s->filter_edge  = ff_bwdif_filter_edge_c;
 }
 
 #if ARCH_X86
-- 
2.39.2
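
The "tail fixup" mentioned in the commit message refers to the pattern used by the aarch64 helper functions in this series: the neon routine handles the part of the width that is a whole number of 16-byte chunks, and the exported C function finishes the remaining bytes. The sketch below shows that split with a trivial operation. It is only an illustration: invert_chunks16, invert_c and invert_with_tail are hypothetical stand-ins, not FFmpeg functions, and the real helpers additionally pass a width of 0 to the neon code when clip_max != 255.

```c
#include <stdint.h>

/* Hypothetical stand-in for a SIMD routine that only handles whole
 * 16-byte chunks: w must be a multiple of 16. */
static void invert_chunks16(uint8_t *dst, const uint8_t *src, int w)
{
    for (int x = 0; x < w; x += 16)
        for (int i = 0; i < 16; i++)
            dst[x + i] = (uint8_t)~src[x + i];
}

/* Hypothetical scalar version that handles any width. */
static void invert_c(uint8_t *dst, const uint8_t *src, int w)
{
    for (int x = 0; x < w; x++)
        dst[x] = (uint8_t)~src[x];
}

/* Whole chunks go to the chunked routine, the remaining w - w0 bytes go
 * to the scalar routine. Returns w0 so the split can be inspected. */
static int invert_with_tail(uint8_t *dst, const uint8_t *src, int w)
{
    const int w0 = w & ~15;  /* round down to a multiple of 16 */

    invert_chunks16(dst, src, w0);

    if (w0 < w)
        invert_c(dst + w0, src + w0, w - w0);
    return w0;
}
```

For a width of 37 the chunked routine covers bytes 0..31 and the scalar routine covers the last 5, so nothing beyond byte 36 is written.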

[FFmpeg-devel] [PATCH 14/15] tests/checkasm: Add test for vf_bwdif filter_line3

2023-06-29 Thread John Cox

Signed-off-by: John Cox 
---
 tests/checkasm/vf_bwdif.c | 81 +++
 1 file changed, 81 insertions(+)

diff --git a/tests/checkasm/vf_bwdif.c b/tests/checkasm/vf_bwdif.c
index 5fdba09fdc..3399cacdf7 100644
--- a/tests/checkasm/vf_bwdif.c
+++ b/tests/checkasm/vf_bwdif.c
@@ -28,6 +28,10 @@
 for (size_t i = 0; i < count; i++) \
 buf0[i] = buf1[i] = rnd() & mask
 
+#define randomize_overflow_check(buf0, buf1, mask, count) \
+for (size_t i = 0; i < count; i++) \
+buf0[i] = buf1[i] = (rnd() & 1) != 0 ? mask : 0;
+
 #define BODY(type, depth)  \
 do {   \
 type prev0[9*WIDTH], prev1[9*WIDTH];   \
@@ -83,6 +87,83 @@ void checkasm_check_vf_bwdif(void)
 report("bwdif10");
 }
 
+if (!ctx_8.filter_line3)
+ctx_8.filter_line3 = ff_bwdif_filter_line3_c;
+
+{
+LOCAL_ALIGNED_16(uint8_t, prev0, [11*WIDTH]);
+LOCAL_ALIGNED_16(uint8_t, prev1, [11*WIDTH]);
+LOCAL_ALIGNED_16(uint8_t, next0, [11*WIDTH]);
+LOCAL_ALIGNED_16(uint8_t, next1, [11*WIDTH]);
+LOCAL_ALIGNED_16(uint8_t, cur0,  [11*WIDTH]);
+LOCAL_ALIGNED_16(uint8_t, cur1,  [11*WIDTH]);
+LOCAL_ALIGNED_16(uint8_t, dst0,  [WIDTH*3]);
+LOCAL_ALIGNED_16(uint8_t, dst1,  [WIDTH*3]);
+const int stride = WIDTH;
+const int mask = (1<<8)-1;
+int parity;
+
+for (parity = 0; parity != 2; ++parity) {
+if (check_func(ctx_8.filter_line3, "bwdif8.line3.rnd.p%d", parity)) {
+
+declare_func(void, void * dst1, int d_stride,
+  const void * prev1, const void * cur1, const void * next1, int prefs,
+  int w, int parity, int clip_max);
+
+randomize_buffers(prev0, prev1, mask, 11*WIDTH);
+randomize_buffers(next0, next1, mask, 11*WIDTH);
+randomize_buffers( cur0,  cur1, mask, 11*WIDTH);
+
+call_ref(dst0, stride,
+ prev0 + stride * 4, cur0 + stride * 4, next0 + stride * 4, stride,
+ WIDTH, parity, mask);
+call_new(dst1, stride,
+ prev1 + stride * 4, cur1 + stride * 4, next1 + stride * 4, stride,
+ WIDTH, parity, mask);
+
+if (memcmp(dst0, dst1, WIDTH*3)
+|| memcmp(prev0, prev1, WIDTH*11)
+|| memcmp(next0, next1, WIDTH*11)
+|| memcmp( cur0,  cur1, WIDTH*11))
+fail();
+
+bench_new(dst1, stride,
+ prev1 + stride * 4, cur1 + stride * 4, next1 + stride * 4, stride,
+ WIDTH, parity, mask);
+}
+}
+
+// Use just 0s and ~0s to try to provoke bad cropping or overflow
+// Parity makes no difference to this test so just test 0
+if (check_func(ctx_8.filter_line3, "bwdif8.line3.overflow")) {
+
+declare_func(void, void * dst1, int d_stride,
+  const void * prev1, const void * cur1, const void * next1, int prefs,
+  int w, int parity, int clip_max);
+
+randomize_overflow_check(prev0, prev1, mask, 11*WIDTH);
+randomize_overflow_check(next0, next1, mask, 11*WIDTH);
+randomize_overflow_check( cur0,  cur1, mask, 11*WIDTH);
+
+call_ref(dst0, stride,
+ prev0 + stride * 4, cur0 + stride * 4, next0 + stride * 4, stride,
+ WIDTH, 0, mask);
+call_new(dst1, stride,
+ prev1 + stride * 4, cur1 + stride * 4, next1 + stride * 4, stride,
+ WIDTH, 0, mask);
+
+if (memcmp(dst0, dst1, WIDTH*3)
+|| memcmp(prev0, prev1, WIDTH*11)
+|| memcmp(next0, next1, WIDTH*11)
+|| memcmp( cur0,  cur1, WIDTH*11))
+fail();
+
+// No point to benching
+}
+
+report("bwdif8.line3");
+}
+
 {
 LOCAL_ALIGNED_16(uint8_t, prev0, [11*WIDTH]);
 LOCAL_ALIGNED_16(uint8_t, prev1, [11*WIDTH]);
-- 
2.39.2
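
The overflow test above feeds the filter only the two extreme sample values, which pushes the intermediate sums as far out of range as they can go. A minimal sketch of such a fill routine follows. It mirrors the randomize_overflow_check() macro but is not checkasm code: lcg_next is a hypothetical stand-in for checkasm's rnd(), and fill_extremes is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for checkasm's rnd(): a small LCG. */
static uint32_t lcg_next(uint32_t *state)
{
    *state = *state * 1664525u + 1013904223u;
    return *state >> 16;
}

/* Fill both buffers identically using only the values 0 and mask. */
static void fill_extremes(uint8_t *buf0, uint8_t *buf1, int mask,
                          size_t count, uint32_t seed)
{
    for (size_t i = 0; i < count; i++)
        buf0[i] = buf1[i] = (uint8_t)((lcg_next(&seed) & 1) != 0 ? mask : 0);
}
```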

[FFmpeg-devel] [PATCH 08/15] avfilter/vf_bwdif: Add neon for filter_edge

2023-06-29 Thread John Cox

Signed-off-by: John Cox 
---
 libavfilter/aarch64/vf_bwdif_init_aarch64.c |  20 
 libavfilter/aarch64/vf_bwdif_neon.S | 104 
 2 files changed, 124 insertions(+)

diff --git a/libavfilter/aarch64/vf_bwdif_init_aarch64.c b/libavfilter/aarch64/vf_bwdif_init_aarch64.c
index 3ffaa07ab3..e75cf2f204 100644
--- a/libavfilter/aarch64/vf_bwdif_init_aarch64.c
+++ b/libavfilter/aarch64/vf_bwdif_init_aarch64.c
@@ -24,10 +24,29 @@
 #include "libavfilter/bwdif.h"
 #include "libavutil/aarch64/cpu.h"
 
+void ff_bwdif_filter_edge_neon(void *dst1, void *prev1, void *cur1, void *next1,
+   int w, int prefs, int mrefs, int prefs2, int mrefs2,
+   int parity, int clip_max, int spat);
+
 void ff_bwdif_filter_intra_neon(void *dst1, void *cur1, int w, int prefs, int mrefs,
 int prefs3, int mrefs3, int parity, int clip_max);
 
 
+static void filter_edge_helper(void *dst1, void *prev1, void *cur1, void *next1,
+   int w, int prefs, int mrefs, int prefs2, int mrefs2,
+   int parity, int clip_max, int spat)
+{
+const int w0 = clip_max != 255 ? 0 : w & ~15;
+
+ff_bwdif_filter_edge_neon(dst1, prev1, cur1, next1, w0, prefs, mrefs, prefs2, mrefs2,
+  parity, clip_max, spat);
+
+if (w0 < w)
+ff_bwdif_filter_edge_c((char *)dst1 + w0, (char *)prev1 + w0, (char *)cur1 + w0, (char *)next1 + w0,
+   w - w0, prefs, mrefs, prefs2, mrefs2,
+   parity, clip_max, spat);
+}
+
 static void filter_intra_helper(void *dst1, void *cur1, int w, int prefs, int mrefs,
 int prefs3, int mrefs3, int parity, int clip_max)
 {
@@ -52,5 +71,6 @@ ff_bwdif_init_aarch64(BWDIFContext *s, int bit_depth)
 return;
 
 s->filter_intra = filter_intra_helper;
+s->filter_edge  = filter_edge_helper;
 }
 
diff --git a/libavfilter/aarch64/vf_bwdif_neon.S b/libavfilter/aarch64/vf_bwdif_neon.S
index 6c5d1598f4..a33b235882 100644
--- a/libavfilter/aarch64/vf_bwdif_neon.S
+++ b/libavfilter/aarch64/vf_bwdif_neon.S
@@ -128,6 +128,110 @@ coeffs:
 .hword  5570, 3801, 1016, -3801 // hf[0] = v0.h[2], -hf[1] = v0.h[5]
 .hword  5077, 981   // sp[0] = v0.h[6]
 
+// 
+//
+// void ff_bwdif_filter_edge_neon(
+//  void *dst1, // x0
+//  void *prev1,// x1
+//  void *cur1, // x2
+//  void *next1,// x3
+//  int w,  // w4
+//  int prefs,  // w5
+//  int mrefs,  // w6
+//  int prefs2, // w7
+//  int mrefs2, // [sp, #0]
+//  int parity, // [sp, #8]
+//  int clip_max,   // [sp, #16]  unused
+//  int spat);  // [sp, #24]
+
+function ff_bwdif_filter_edge_neon, export=1
+// Sanity check w
+cmp w4, #0
+ble 99f
+
+// #define prev2 cur
+// const uint8_t * restrict next2 = parity ? prev : next;
+
+ldr w8,  [sp, #0]   // mrefs2
+
+ldr w17, [sp, #8]   // parity
+ldr w16,  [sp, #24] // spat
+cmp w17, #0
+csel x17, x1, x3, ne
+
+// for (x = 0; x < w; x++) {
+
+10:
+//int m1 = cur[mrefs];
+//int d = (prev2[0] + next2[0]) >> 1;
+//int p1 = cur[prefs];
+//int temporal_diff0 = FFABS(prev2[0] - next2[0]);
+//int temporal_diff1 =(FFABS(prev[mrefs] - m1) + FFABS(prev[prefs] - p1)) >> 1;
+//int temporal_diff2 =(FFABS(next[mrefs] - m1) + FFABS(next[prefs] - p1)) >> 1;
+//int diff = FFMAX3(temporal_diff0 >> 1, temporal_diff1, temporal_diff2);
+ldr q31, [x2]
+ldr q21, [x17]
+uhadd   v16.16b, v31.16b, v21.16b   // d0 = v16
+uabd    v17.16b, v31.16b, v21.16b   // td0 = v17
+ldr q24, [x2, w6, SXTW] // m1 = v24
+ldr q22, [x2, w5, SXTW] // p1 = v22
+
+ldr q0,  [x1, w6, SXTW] // prev[mrefs]
+ldr q2,  [x1, w5, SXTW] // prev[prefs]
+ldr q1,  [x3, w6, SXTW] // next[mrefs]
+ldr q3,  [x3, w5, SXTW] // next[prefs]
+
+ushr    v29.16b, v17.16b, #1
+
+uabd    v31.16b, v0.16b,  v24.16b
+uabd    v30.16b, v2.16b,  v22.16b
+uhadd   v0.16b,  v31.16b, v30.16b   // td1 = q0
+
+uabd    v31.16b, v1.16b,  v24.16b
+uabd    v30.16b, v3.16b,  v22.16b
+uhadd   v1.16b,  v31.16b, v30.16b

[FFmpeg-devel] [PATCH 15/15] avfilter/vf_bwdif: Block filter slices into a multiple of 4 lines

2023-06-29 Thread John Cox

Round job start lines down to a multiple of 4. This means that if
filter_line3 exists, filter_line is no longer called once at the end
of some slices, which previously happened depending on thread count.
The final slice may do up to 3 extra lines, but filter_edge is faster
than filter_line, so this is unlikely to create any noticeable
variation in thread load.

Signed-off-by: John Cox 
---
 libavfilter/vf_bwdif.c | 13 ++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/libavfilter/vf_bwdif.c b/libavfilter/vf_bwdif.c
index 52bc676cf8..6701208efe 100644
--- a/libavfilter/vf_bwdif.c
+++ b/libavfilter/vf_bwdif.c
@@ -237,6 +237,13 @@ static void filter_edge_16bit(void *dst1, void *prev1, void *cur1, void *next1,
 FILTER2()
 }
 
+// Round job start line down to multiple of 4 so that if filter_line3 exists
+// and the frame is a multiple of 4 high then filter_line will never be called
+static inline int job_start(const int jobnr, const int nb_jobs, const int h)
+{
+return jobnr >= nb_jobs ? h : ((h * jobnr) / nb_jobs) & ~3;
+}
+
 static int filter_slice(AVFilterContext *ctx, void *arg, int jobnr, int nb_jobs)
 {
 BWDIFContext *s = ctx->priv;
@@ -246,8 +253,8 @@ static int filter_slice(AVFilterContext *ctx, void *arg, int jobnr, int nb_jobs)
 int clip_max = (1 << (yadif->csp->comp[td->plane].depth)) - 1;
 int df = (yadif->csp->comp[td->plane].depth + 7) / 8;
 int refs = linesize / df;
-int slice_start = (td->h *  jobnr   ) / nb_jobs;
-int slice_end   = (td->h * (jobnr+1)) / nb_jobs;
+int slice_start = job_start(jobnr, nb_jobs, td->h);
+int slice_end   = job_start(jobnr + 1, nb_jobs, td->h);
 int y;
 
 for (y = slice_start; y < slice_end; y++) {
@@ -310,7 +317,7 @@ static void filter(AVFilterContext *ctx, AVFrame *dstpic,
 td.plane = i;
 
 ff_filter_execute(ctx, filter_slice, &td, NULL,
-  FFMIN(h, ff_filter_get_nb_threads(ctx)));
+  FFMIN((h+3)/4, ff_filter_get_nb_threads(ctx)));
 }
 if (yadif->current_field == YADIF_FIELD_END) {
 yadif->current_field = YADIF_FIELD_NORMAL;
-- 
2.39.2
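
To see what the rounding does to slice boundaries, the job_start() logic from the patch can be tried on its own. The sketch below copies that function and adds a hypothetical job_lines() helper that returns how many lines a given job would process; the frame heights and job counts used with it are made-up examples, not figures from the patch.

```c
/* Copy of the job_start() logic from the patch above. */
static int job_start(const int jobnr, const int nb_jobs, const int h)
{
    return jobnr >= nb_jobs ? h : ((h * jobnr) / nb_jobs) & ~3;
}

/* Hypothetical helper: number of lines job jobnr would process. */
static int job_lines(int jobnr, int nb_jobs, int h)
{
    return job_start(jobnr + 1, nb_jobs, h) - job_start(jobnr, nb_jobs, h);
}
```

For h = 1080 and 7 jobs the start lines are 0, 152, 308, 460, 616, 768 and 924, all multiples of 4, and the slices are 152 or 156 lines each. For very small frames the rounding can leave a job with nothing to do: with h = 10 and 3 jobs the slices are 0..0, 0..4 and 4..10.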

[FFmpeg-devel] [PATCH 09/15] tests/checkasm: Add test for vf_bwdif filter_edge

2023-06-29 Thread John Cox

Signed-off-by: John Cox 
---
 tests/checkasm/vf_bwdif.c | 54 +++
 1 file changed, 54 insertions(+)

diff --git a/tests/checkasm/vf_bwdif.c b/tests/checkasm/vf_bwdif.c
index 034bbabb4c..5fdba09fdc 100644
--- a/tests/checkasm/vf_bwdif.c
+++ b/tests/checkasm/vf_bwdif.c
@@ -83,6 +83,60 @@ void checkasm_check_vf_bwdif(void)
 report("bwdif10");
 }
 
+{
+LOCAL_ALIGNED_16(uint8_t, prev0, [11*WIDTH]);
+LOCAL_ALIGNED_16(uint8_t, prev1, [11*WIDTH]);
+LOCAL_ALIGNED_16(uint8_t, next0, [11*WIDTH]);
+LOCAL_ALIGNED_16(uint8_t, next1, [11*WIDTH]);
+LOCAL_ALIGNED_16(uint8_t, cur0,  [11*WIDTH]);
+LOCAL_ALIGNED_16(uint8_t, cur1,  [11*WIDTH]);
+LOCAL_ALIGNED_16(uint8_t, dst0,  [WIDTH*3]);
+LOCAL_ALIGNED_16(uint8_t, dst1,  [WIDTH*3]);
+const int stride = WIDTH;
+const int mask = (1<<8)-1;
+int spat;
+int parity;
+
+for (spat = 0; spat != 2; ++spat) {
+for (parity = 0; parity != 2; ++parity) {
+if (check_func(ctx_8.filter_edge, "bwdif8.edge.s%d.p%d", spat, parity)) {
+
+declare_func(void, void *dst1, void *prev1, void *cur1, void *next1,
+int w, int prefs, int mrefs, int prefs2, int mrefs2,
+int parity, int clip_max, int spat);
+
+randomize_buffers(prev0, prev1, mask, 11*WIDTH);
+randomize_buffers(next0, next1, mask, 11*WIDTH);
+randomize_buffers( cur0,  cur1, mask, 11*WIDTH);
+memset(dst0, 0xba, WIDTH * 3);
+memset(dst1, 0xba, WIDTH * 3);
+
+call_ref(dst0 + stride,
+ prev0 + stride * 4, cur0 + stride * 4, next0 + stride * 4, WIDTH,
+ stride, -stride, stride * 2, -stride * 2,
+ parity, mask, spat);
+call_new(dst1 + stride,
+ prev1 + stride * 4, cur1 + stride * 4, next1 + stride * 4, WIDTH,
+ stride, -stride, stride * 2, -stride * 2,
+ parity, mask, spat);
+
+if (memcmp(dst0, dst1, WIDTH*3)
+|| memcmp(prev0, prev1, WIDTH*11)
+|| memcmp(next0, next1, WIDTH*11)
+|| memcmp( cur0,  cur1, WIDTH*11))
+fail();
+
+bench_new(dst1 + stride,
+ prev1 + stride * 4, cur1 + stride * 4, next1 + stride * 4, WIDTH,
+ stride, -stride, stride * 2, -stride * 2,
+ parity, mask, spat);
+}
+}
+}
+
+report("bwdif8.edge");
+}
+
 if (check_func(ctx_8.filter_intra, "bwdif8.intra")) {
 LOCAL_ALIGNED_16(uint8_t, cur0,  [11*WIDTH]);
 LOCAL_ALIGNED_16(uint8_t, cur1,  [11*WIDTH]);
-- 
2.39.2

[FFmpeg-devel] [PATCH 10/15] avfilter/vf_bwdif: Export C filter_line

2023-06-29 Thread John Cox

Needed for tail fixup of neon code

Signed-off-by: John Cox 
---
 libavfilter/bwdif.h|  5 +
 libavfilter/vf_bwdif.c | 10 +-
 2 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/libavfilter/bwdif.h b/libavfilter/bwdif.h
index ae1616d366..cce99953f3 100644
--- a/libavfilter/bwdif.h
+++ b/libavfilter/bwdif.h
@@ -48,4 +48,9 @@ void ff_bwdif_filter_edge_c(void *dst1, void *prev1, void *cur1, void *next1,
 void ff_bwdif_filter_intra_c(void *dst1, void *cur1, int w, int prefs, int mrefs,
  int prefs3, int mrefs3, int parity, int clip_max);
 
+void ff_bwdif_filter_line_c(void *dst1, void *prev1, void *cur1, void *next1,
+int w, int prefs, int mrefs, int prefs2, int mrefs2,
+int prefs3, int mrefs3, int prefs4, int mrefs4,
+int parity, int clip_max);
+
 #endif /* AVFILTER_BWDIF_H */
diff --git a/libavfilter/vf_bwdif.c b/libavfilter/vf_bwdif.c
index bec83111b4..26349da1fd 100644
--- a/libavfilter/vf_bwdif.c
+++ b/libavfilter/vf_bwdif.c
@@ -132,10 +132,10 @@ void ff_bwdif_filter_intra_c(void *dst1, void *cur1, int w, int prefs, int mrefs
 FILTER_INTRA()
 }
 
-static void filter_line_c(void *dst1, void *prev1, void *cur1, void *next1,
-  int w, int prefs, int mrefs, int prefs2, int mrefs2,
-  int prefs3, int mrefs3, int prefs4, int mrefs4,
-  int parity, int clip_max)
+void ff_bwdif_filter_line_c(void *dst1, void *prev1, void *cur1, void *next1,
+int w, int prefs, int mrefs, int prefs2, int mrefs2,
+int prefs3, int mrefs3, int prefs4, int mrefs4,
+int parity, int clip_max)
 {
 uint8_t *dst   = dst1;
 uint8_t *prev  = prev1;
@@ -363,7 +363,7 @@ av_cold void ff_bwdif_init_filter_line(BWDIFContext *s, int bit_depth)
 s->filter_edge  = filter_edge_16bit;
 } else {
 s->filter_intra = ff_bwdif_filter_intra_c;
-s->filter_line  = filter_line_c;
+s->filter_line  = ff_bwdif_filter_line_c;
 s->filter_edge  = ff_bwdif_filter_edge_c;
 }
 
-- 
2.39.2

[FFmpeg-devel] [PATCH 11/15] avfilter/vf_bwdif: Add neon for filter_line

2023-06-29 Thread John Cox

Signed-off-by: John Cox 
---
 libavfilter/aarch64/vf_bwdif_init_aarch64.c |  21 ++
 libavfilter/aarch64/vf_bwdif_neon.S | 215 
 2 files changed, 236 insertions(+)

diff --git a/libavfilter/aarch64/vf_bwdif_init_aarch64.c b/libavfilter/aarch64/vf_bwdif_init_aarch64.c
index e75cf2f204..21e67884ab 100644
--- a/libavfilter/aarch64/vf_bwdif_init_aarch64.c
+++ b/libavfilter/aarch64/vf_bwdif_init_aarch64.c
@@ -31,6 +31,26 @@ void ff_bwdif_filter_edge_neon(void *dst1, void *prev1, void *cur1, void *next1,
 void ff_bwdif_filter_intra_neon(void *dst1, void *cur1, int w, int prefs, int mrefs,
 int prefs3, int mrefs3, int parity, int clip_max);
 
+void ff_bwdif_filter_line_neon(void *dst1, void *prev1, void *cur1, void *next1,
+   int w, int prefs, int mrefs, int prefs2, int mrefs2,
+   int prefs3, int mrefs3, int prefs4, int mrefs4,
+   int parity, int clip_max);
+
+
+static void filter_line_helper(void *dst1, void *prev1, void *cur1, void *next1,
+   int w, int prefs, int mrefs, int prefs2, int mrefs2,
+   int prefs3, int mrefs3, int prefs4, int mrefs4,
+   int parity, int clip_max)
+{
+const int w0 = clip_max != 255 ? 0 : w & ~15;
+
+ff_bwdif_filter_line_neon(dst1, prev1, cur1, next1,
+  w0, prefs, mrefs, prefs2, mrefs2, prefs3, mrefs3, prefs4, mrefs4, parity, clip_max);
+
+if (w0 < w)
+ff_bwdif_filter_line_c((char *)dst1 + w0, (char *)prev1 + w0, (char *)cur1 + w0, (char *)next1 + w0,
+   w - w0, prefs, mrefs, prefs2, mrefs2, prefs3, mrefs3, prefs4, mrefs4, parity, clip_max);
+}
 
 static void filter_edge_helper(void *dst1, void *prev1, void *cur1, void *next1,
 int w, int prefs, int mrefs, int prefs2, int mrefs2,
@@ -71,6 +91,7 @@ ff_bwdif_init_aarch64(BWDIFContext *s, int bit_depth)
 return;
 
 s->filter_intra = filter_intra_helper;
+s->filter_line  = filter_line_helper;
 s->filter_edge  = filter_edge_helper;
 }
 
diff --git a/libavfilter/aarch64/vf_bwdif_neon.S b/libavfilter/aarch64/vf_bwdif_neon.S
index a33b235882..675e97d966 100644
--- a/libavfilter/aarch64/vf_bwdif_neon.S
+++ b/libavfilter/aarch64/vf_bwdif_neon.S
@@ -128,6 +128,221 @@ coeffs:
 .hword  5570, 3801, 1016, -3801 // hf[0] = v0.h[2], -hf[1] = v0.h[5]
 .hword  5077, 981   // sp[0] = v0.h[6]
 
+// ===
+//
+// void ff_bwdif_filter_line_neon(
+//  void *dst1, // x0
+//  void *prev1,// x1
+//  void *cur1, // x2
+//  void *next1,// x3
+//  int w,  // w4
+//  int prefs,  // w5
+//  int mrefs,  // w6
+//  int prefs2, // w7
+//  int mrefs2, // [sp, #0]
+//  int prefs3, // [sp, #8]
+//  int mrefs3, // [sp, #16]
+//  int prefs4, // [sp, #24]
+//  int mrefs4, // [sp, #32]
+//  int parity, // [sp, #40]
+//  int clip_max)   // [sp, #48]
+
+function ff_bwdif_filter_line_neon, export=1
+// Sanity check w
+cmp w4, #0
+ble 99f
+
+// Rearrange regs to be the same as line3 for ease of debug!
+mov w10, w4 // w10 = loop count
+mov w9,  w6 // w9  = mref
+mov w12, w7 // w12 = pref2
+mov w11, w5 // w11 = pref
+ldr w8,  [sp, #0]   // w8 =  mref2
+ldr w7,  [sp, #16]  // w7  = mref3
+ldr w6,  [sp, #32]  // w6  = mref4
+ldr w13, [sp, #8]   // w13 = pref3
+ldr w14, [sp, #24]  // w14 = pref4
+
+mov x4,  x3
+mov x3,  x2
+mov x2,  x1
+
+// #define prev2 cur
+//const uint8_t * restrict next2 = parity ? prev : next;
+ldr w17, [sp, #40]  // parity
+cmp w17, #0
+csel x17, x2, x4, ne
+
+// We want all the V registers - save all the ones we must
+stp d14, d15, [sp, #-64]!
+stp d8,  d9,  [sp, #48]
+stp d10, d11, [sp, #32]
+stp d12, d13, [sp, #16]
+
+ldr q0, coeffs
+
+// for (x = 0; x < w; x++) {
+// int diff0, diff2;
+// int d0, d2;
+// int temporal_diff0, temporal_diff2;
+//
+// int i1, i2;
+// int j1, j2;
+// int p6, p5, p4, p3, p2, p1,

[FFmpeg-devel] [PATCH 12/15] avfilter/vf_bwdif: Add a filter_line3 method for optimisation

2023-06-29 Thread John Cox

Add an optional filter_line3 to the available optimisations.

filter_line3 is equivalent to the sequence filter_line, memcpy,
filter_line.

filter_line shares quite a number of loads and some calculations with
its next iteration, and testing shows that with aarch64 neon
filter_line3's performance is 30% better than two filter_lines
and a memcpy.
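
The "filter_line, memcpy, filter_line" sequence can be sketched as follows: one call produces three output lines, of which the first and third are interpolated and the middle one is copied from the current field. This is a minimal sketch with a deliberately trivial filter, not the bwdif algorithm: line_sketch and line3_sketch are hypothetical names, and the stand-in filter only averages the lines directly above and below.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for filter_line: averages the lines directly
 * above and below in cur. The real filter is far more involved. */
static void line_sketch(uint8_t *dst, const uint8_t *cur, int w, int stride)
{
    for (int x = 0; x < w; x++)
        dst[x] = (uint8_t)((cur[x - stride] + cur[x + stride] + 1) >> 1);
}

/* Three output lines per call: filtered, copied, filtered. */
static void line3_sketch(uint8_t *dst, int d_stride,
                         const uint8_t *cur, int s_stride, int w)
{
    line_sketch(dst, cur, w, s_stride);
    dst += d_stride;
    cur += s_stride;
    memcpy(dst, cur, (size_t)w);
    dst += d_stride;
    cur += s_stride;
    line_sketch(dst, cur, w, s_stride);
}
```

The second filtered line reads some of the same source lines as the first, which is where a combined implementation can save loads.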

Signed-off-by: John Cox 
---
 libavfilter/bwdif.h|  7 +++
 libavfilter/vf_bwdif.c | 31 +++
 2 files changed, 38 insertions(+)

diff --git a/libavfilter/bwdif.h b/libavfilter/bwdif.h
index cce99953f3..496cec72ef 100644
--- a/libavfilter/bwdif.h
+++ b/libavfilter/bwdif.h
@@ -35,6 +35,9 @@ typedef struct BWDIFContext {
 void (*filter_edge)(void *dst, void *prev, void *cur, void *next,
 int w, int prefs, int mrefs, int prefs2, int mrefs2,
 int parity, int clip_max, int spat);
+void (*filter_line3)(void *dst, int dstride,
+ const void *prev, const void *cur, const void *next, int prefs,
+ int w, int parity, int clip_max);
 } BWDIFContext;
 
 void ff_bwdif_init_filter_line(BWDIFContext *bwdif, int bit_depth);
@@ -53,4 +56,8 @@ void ff_bwdif_filter_line_c(void *dst1, void *prev1, void *cur1, void *next1,
 int prefs3, int mrefs3, int prefs4, int mrefs4,
 int parity, int clip_max);
 
+void ff_bwdif_filter_line3_c(void * dst1, int d_stride,
+ const void * prev1, const void * cur1, const void * next1, int s_stride,
+ int w, int parity, int clip_max);
+
 #endif /* AVFILTER_BWDIF_H */
diff --git a/libavfilter/vf_bwdif.c b/libavfilter/vf_bwdif.c
index 26349da1fd..52bc676cf8 100644
--- a/libavfilter/vf_bwdif.c
+++ b/libavfilter/vf_bwdif.c
@@ -150,6 +150,31 @@ void ff_bwdif_filter_line_c(void *dst1, void *prev1, void *cur1, void *next1,
 FILTER2()
 }
 
+#define NEXT_LINE()\
+dst += d_stride; \
+prev += prefs; \
+cur  += prefs; \
+next += prefs;
+
+void ff_bwdif_filter_line3_c(void * dst1, int d_stride,
+ const void * prev1, const void * cur1, const void * next1, int s_stride,
+ int w, int parity, int clip_max)
+{
+const int prefs = s_stride;
+uint8_t * dst  = dst1;
+const uint8_t * prev = prev1;
+const uint8_t * cur  = cur1;
+const uint8_t * next = next1;
+
+ff_bwdif_filter_line_c(dst, (void*)prev, (void*)cur, (void*)next, w,
+   prefs, -prefs, prefs * 2, - prefs * 2, prefs * 3, -prefs * 3, prefs * 4, -prefs * 4, parity, clip_max);
+NEXT_LINE();
+memcpy(dst, cur, w);
+NEXT_LINE();
+ff_bwdif_filter_line_c(dst, (void*)prev, (void*)cur, (void*)next, w,
+   prefs, -prefs, prefs * 2, - prefs * 2, prefs * 3, -prefs * 3, prefs * 4, -prefs * 4, parity, clip_max);
+}
+
 void ff_bwdif_filter_edge_c(void *dst1, void *prev1, void *cur1, void *next1,
 int w, int prefs, int mrefs, int prefs2, int mrefs2,
 int parity, int clip_max, int spat)
@@ -244,6 +269,11 @@ static int filter_slice(AVFilterContext *ctx, void *arg, int jobnr, int nb_jobs)
refs << 1, -(refs << 1),
td->parity ^ td->tff, clip_max,
(y < 2) || ((y + 3) > td->h) ? 0 : 1);
+} else if (s->filter_line3 && y + 2 < slice_end && y + 6 < td->h) {
+s->filter_line3(dst, td->frame->linesize[td->plane],
+prev, cur, next, linesize, td->w,
+td->parity ^ td->tff, clip_max);
+y += 2;
 } else {
 s->filter_line(dst, prev, cur, next, td->w,
refs, -refs, refs << 1, -(refs << 1),
@@ -357,6 +387,7 @@ static int config_props(AVFilterLink *link)
 
 av_cold void ff_bwdif_init_filter_line(BWDIFContext *s, int bit_depth)
 {
+s->filter_line3 = 0;
 if (bit_depth > 8) {
 s->filter_intra = filter_intra_16bit;
 s->filter_line  = filter_line_c_16bit;
-- 
2.39.2
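
The commit message above states that filter_line3 is equivalent to filter_line, memcpy, filter_line. The sketch below only illustrates that contract and is not the bwdif code: toy_filter_line is a hypothetical stand-in with a much simpler kernel (average of the rows above and below), and every toy_* name is an assumption made for this example.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for filter_line: average of the rows above and below. */
static void toy_filter_line(uint8_t *dst, const uint8_t *cur, int w, int stride)
{
    for (int x = 0; x < w; x++)
        dst[x] = (uint8_t)((cur[x - stride] + cur[x + stride] + 1) >> 1);
}

/* One call produces three output rows: filtered, copied, filtered. */
static void toy_filter_line3(uint8_t *dst, int d_stride,
                             const uint8_t *cur, int s_stride, int w)
{
    toy_filter_line(dst, cur, w, s_stride);
    memcpy(dst + d_stride, cur + s_stride, w);
    toy_filter_line(dst + 2 * d_stride, cur + 2 * s_stride, w, s_stride);
}

/* Returns 1 if the fused call matches the three separate steps. */
static int toy_line3_matches_unfused(void)
{
    enum { W = 4, ROWS = 7 };
    uint8_t src[ROWS * W], fused[3 * W], unfused[3 * W];

    for (int i = 0; i < ROWS * W; i++)
        src[i] = (uint8_t)(i * 7);

    const uint8_t *cur = src + 2 * W;   /* start at row 2 so that row 1 exists */

    toy_filter_line3(fused, W, cur, W, W);

    toy_filter_line(unfused, cur, W, W);
    memcpy(unfused + W, cur + W, W);
    toy_filter_line(unfused + 2 * W, cur + 2 * W, W, W);

    return memcmp(fused, unfused, sizeof(fused)) == 0;
}
```

A caller that uses the fused variant has to advance its row counter by two extra rows per call, which is what the `y += 2` in the filter_slice hunk does.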


Re: [FFmpeg-devel] [PATCH 00/15] avfilter/vf_bwdif: Add aarch64 neon functions

2023-07-02 Thread John Cox

Hi

>On Thu, 29 Jun 2023, John Cox wrote:
>
>> Also adds a filter_line3 method which on aarch64 neon yields approx 30%
>> speedup over 2xfilter_line and a memcpy
>>
>> John Cox (15):
>>  avfilter/vf_bwdif: Add outline for aarch neon functions
>>  avfilter/vf_bwdif: Add common macros and consts for aarch64 neon
>>  avfilter/vf_bwdif: Export C filter_intra
>>  avfilter/vf_bwdif: Add neon for filter_intra
>>  tests/checkasm: Add test for vf_bwdif filter_intra
>>  avfilter/vf_bwdif: Add clip and spatial macros for aarch64 neon
>>  avfilter/vf_bwdif: Export C filter_edge
>>  avfilter/vf_bwdif: Add neon for filter_edge
>>  tests/checkasm: Add test for vf_bwdif filter_edge
>>  avfilter/vf_bwdif: Export C filter_line
>>  avfilter/vf_bwdif: Add neon for filter_line
>>  avfilter/vf_bwdif: Add a filter_line3 method for optimisation
>>  avfilter/vf_bwdif: Add neon for filter_line3
>>  tests/checkasm: Add test for vf_bwdif filter_line3
>>  avfilter/vf_bwdif: Block filter slices into a multiple of 4 lines
>
>It's nice to have this split up in small easily checkable patches, but 
>this is perhaps a bit more finegrained than what's usual. But I guess 
>that's ok...

I normally find that people ask me to split patches so I thought I'd cut
stuff down to the minimum plausible unit.

I'm more than happy to coalesce stuff if wanted.

JC

>I'll comment on the patches that need commenting on.
>
>// Martin

Re: [FFmpeg-devel] [PATCH 02/15] avfilter/vf_bwdif: Add common macros and consts for aarch64 neon

2023-07-02 Thread John Cox

On Sun, 2 Jul 2023 00:35:14 +0300 (EEST), you wrote:

>On Thu, 29 Jun 2023, John Cox wrote:
>
>> Add macros for dual scalar half->single multiply and accumulate
>> Add macro for shift, saturate and shorten single to byte
>> Add filter constants
>>
>> Signed-off-by: John Cox 
>> ---
>> libavfilter/aarch64/vf_bwdif_neon.S | 46 +
>> 1 file changed, 46 insertions(+)
>>
>> diff --git a/libavfilter/aarch64/vf_bwdif_neon.S 
>> b/libavfilter/aarch64/vf_bwdif_neon.S
>> index 639ab22998..a8f0ed525a 100644
>> --- a/libavfilter/aarch64/vf_bwdif_neon.S
>> +++ b/libavfilter/aarch64/vf_bwdif_neon.S
>> @@ -23,3 +23,49 @@
>>
>> #include "libavutil/aarch64/asm.S"
>>
>> +.macro SQSHRUNN b, s0, s1, s2, s3, n
>> +sqshrun \s0\().4h, \s0\().4s, #\n - 8
>> +sqshrun2 \s0\().8h, \s1\().4s, #\n - 8
>> +sqshrun \s1\().4h, \s2\().4s, #\n - 8
>> +sqshrun2 \s1\().8h, \s3\().4s, #\n - 8
>> +uzp2 \b\().16b, \s0\().16b, \s1\().16b
>> +.endm
>> +
>> +.macro SMULL4K a0, a1, a2, a3, s0, s1, k
>> +smull   \a0\().4s, \s0\().4h, \k
>> +smull2  \a1\().4s, \s0\().8h, \k
>> +smull   \a2\().4s, \s1\().4h, \k
>> +smull2  \a3\().4s, \s1\().8h, \k
>> +.endm
>> +
>> +.macro UMULL4K a0, a1, a2, a3, s0, s1, k
>> +umull   \a0\().4s, \s0\().4h, \k
>> +umull2  \a1\().4s, \s0\().8h, \k
>> +umull   \a2\().4s, \s1\().4h, \k
>> +umull2  \a3\().4s, \s1\().8h, \k
>> +.endm
>> +
>> +.macro UMLAL4K a0, a1, a2, a3, s0, s1, k
>> +umlal   \a0\().4s, \s0\().4h, \k
>> +umlal2  \a1\().4s, \s0\().8h, \k
>> +umlal   \a2\().4s, \s1\().4h, \k
>> +umlal2  \a3\().4s, \s1\().8h, \k
>> +.endm
>> +
>> +.macro UMLSL4K a0, a1, a2, a3, s0, s1, k
>> +umlsl   \a0\().4s, \s0\().4h, \k
>> +umlsl2  \a1\().4s, \s0\().8h, \k
>> +umlsl   \a2\().4s, \s1\().4h, \k
>> +umlsl2  \a3\().4s, \s1\().8h, \k
>> +.endm
>> +
>> +// static const uint16_t coef_lf[2] = { 4309, 213 };
>> +// static const uint16_t coef_hf[3] = { 5570, 3801, 1016 };
>> +// static const uint16_t coef_sp[2] = { 5077, 981 };
>> +
>> +.align 16
>
>Note that .align for arm is power of two; this triggers a 2^16 byte 
>alignment here, which certainly isn't intended.

Yikes! I'll swap that for a .balign now that I've looked it up.

>But in general, the usual way of defining constants is with a 
>const/endconst block, which places them in the right rdata section instead 
>of in the text section. But that then requires you to use a movrel macro 
>for accessing the data, instead of a plain ldr instruction.

Yeah - arm has a history of mixing text & const - I went with the
simpler code. Is this a deal breaker or can I leave it as is?

JC

>> +coeffs:
>> +.hword  4309 * 4, 213 * 4   // lf[0]*4 = v0.h[0]
>> +.hword  5570, 3801, 1016, -3801 // hf[0] = v0.h[2], 
>> -hf[1] = v0.h[5]
>> +.hword  5077, 981   // sp[0] = v0.h[6]
>> +
>> --
>
>
>// Martin
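
To put numbers on the .align point raised above: as the review says, on arm the operand of .align is a power-of-two exponent, whereas .balign takes a byte count. The helper names below are made up for illustration; this is only arithmetic, not assembler behaviour being executed.

```c
#include <stddef.h>

/* Bytes of alignment requested by ".balign n": n itself. */
static size_t balign_bytes(unsigned n)
{
    return n;
}

/* Bytes of alignment requested by ".align n" when n is an exponent: 2^n. */
static size_t align_pow2_bytes(unsigned n)
{
    return (size_t)1 << n;
}

/* Padding needed to move `offset` up to the next multiple of `alignment`. */
static size_t pad_to(size_t offset, size_t alignment)
{
    return (alignment - offset % alignment) % alignment;
}
```

With these, `.align 16` asks for 65536-byte alignment while `.balign 16` asks for 16 bytes; under the exponent interpretation `.align 4` would have given the same 16 bytes.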

Re: [FFmpeg-devel] [PATCH 04/15] avfilter/vf_bwdif: Add neon for filter_intra

2023-07-02 Thread John Cox

On Sun, 2 Jul 2023 00:37:35 +0300 (EEST), you wrote:

>On Thu, 29 Jun 2023, John Cox wrote:
>
>> Signed-off-by: John Cox 
>> ---
>> libavfilter/aarch64/vf_bwdif_init_aarch64.c | 17 +++
>> libavfilter/aarch64/vf_bwdif_neon.S | 53 +
>> 2 files changed, 70 insertions(+)
>>
>> diff --git a/libavfilter/aarch64/vf_bwdif_init_aarch64.c 
>> b/libavfilter/aarch64/vf_bwdif_init_aarch64.c
>> index 86d53b2ca1..3ffaa07ab3 100644
>> --- a/libavfilter/aarch64/vf_bwdif_init_aarch64.c
>> +++ b/libavfilter/aarch64/vf_bwdif_init_aarch64.c
>> @@ -24,6 +24,22 @@
>> #include "libavfilter/bwdif.h"
>> #include "libavutil/aarch64/cpu.h"
>>
>> +void ff_bwdif_filter_intra_neon(void *dst1, void *cur1, int w, int prefs, 
>> int mrefs,
>> +int prefs3, int mrefs3, int parity, int 
>> clip_max);
>> +
>> +
>> +static void filter_intra_helper(void *dst1, void *cur1, int w, int prefs, 
>> int mrefs,
>> +int prefs3, int mrefs3, int parity, int 
>> clip_max)
>> +{
>> +const int w0 = clip_max != 255 ? 0 : w & ~15;
>> +
>> +ff_bwdif_filter_intra_neon(dst1, cur1, w0, prefs, mrefs, prefs3, 
>> mrefs3, parity, clip_max);
>> +
>> +if (w0 < w)
>> +ff_bwdif_filter_intra_c((char *)dst1 + w0, (char *)cur1 + w0,
>> +w - w0, prefs, mrefs, prefs3, mrefs3, 
>> parity, clip_max);
>> +}
>> +
>> void
>> ff_bwdif_init_aarch64(BWDIFContext *s, int bit_depth)
>> {
>> @@ -35,5 +51,6 @@ ff_bwdif_init_aarch64(BWDIFContext *s, int bit_depth)
>> if (!have_neon(cpu_flags))
>> return;
>>
>> +s->filter_intra = filter_intra_helper;
>> }
>>
>> diff --git a/libavfilter/aarch64/vf_bwdif_neon.S 
>> b/libavfilter/aarch64/vf_bwdif_neon.S
>> index a8f0ed525a..b863b3447d 100644
>> --- a/libavfilter/aarch64/vf_bwdif_neon.S
>> +++ b/libavfilter/aarch64/vf_bwdif_neon.S
>> @@ -69,3 +69,56 @@ coeffs:
>> .hword  5570, 3801, 1016, -3801 // hf[0] = v0.h[2], 
>> -hf[1] = v0.h[5]
>> .hword  5077, 981   // sp[0] = v0.h[6]
>>
>> +// 
>> 
>> +//
>> +// void ff_bwdif_filter_intra_neon(
>> +//  void *dst1, // x0
>> +//  void *cur1, // x1
>> +//  int w,  // w2
>> +//  int prefs,  // w3
>> +//  int mrefs,  // w4
>> +//  int prefs3, // w5
>> +//  int mrefs3, // w6
>> +//  int parity, // w7   unused
>> +//  int clip_max)   // [sp, #0] unused
>
>This bit is great to have
>
>> +
>> +function ff_bwdif_filter_intra_neon, export=1
>> +cmp w2, #0
>> +ble 99f
>> +
>> +ldr q0, coeffs
>> +
>> +//for (x = 0; x < w; x++) {
>> +10:
>> +
>> +//interpol = (coef_sp[0] * (cur[mrefs] + cur[prefs]) - coef_sp[1] * 
>> (cur[mrefs3] + cur[prefs3])) >> 13;
>
>I guess the style with intermixed C code is a bit uncommon in our 
>assembly, but as long as it doesn't affect the overall code style I guess 
>we can keep it.

I needed it to track where I was whilst writing the code, and if I ever
need to change it I'll be lost without it. So I, at least, rate it as
valuable, and in line3, where I am very tight on registers, it was
invaluable for keeping track of what referred to what.

>> +ldr q31, [x1, w4, SXTW]
>> +ldr q30, [x1, w3, SXTW]
>> +ldr q29, [x1, w6, SXTW]
>> +ldr q28, [x1, w5, SXTW]
>
>Don't use shouty uppercase SXTW here

Will change.

>> +
>> +uaddl   v20.8h,  v31.8b,  v30.8b
>> +uaddl2  v21.8h,  v31.16b, v30.16b
>> +
>> +UMULL4K v2, v3, v4, v5, v20, v21, v0.h[6]
>> +
>> +uaddl   v20.8h,  v29.8b,  v28.8b
>> +uaddl2  v21.8h,  v29.16b, v28.16b
>> +
>> +UMLSL4K v2, v3, v4, v5, v20, v21, v0.h[7]
>> +
>> +//dst[0] = av_clip(interpol, 0, clip_max);
>> +SQSHRUNN v2, v2, v3, v4, v5, 13
>> +str q2, [x0], #16
>> +
>> +//dst++;
>> +//cur++;
>> +//}
>> +
>> +subs w2,  w2,  #16
>> +add x1,  x1,  #

Re: [FFmpeg-devel] [PATCH 08/15] avfilter/vf_bwdif: Add neon for filter_edge

2023-07-02 Thread John Cox

On Sun, 2 Jul 2023 00:40:09 +0300 (EEST), you wrote:

>On Thu, 29 Jun 2023, John Cox wrote:
>
>> Signed-off-by: John Cox 
>> ---
>> libavfilter/aarch64/vf_bwdif_init_aarch64.c |  20 
>> libavfilter/aarch64/vf_bwdif_neon.S | 104 
>> 2 files changed, 124 insertions(+)
>>
>> diff --git a/libavfilter/aarch64/vf_bwdif_init_aarch64.c 
>> b/libavfilter/aarch64/vf_bwdif_init_aarch64.c
>> index 3ffaa07ab3..e75cf2f204 100644
>> --- a/libavfilter/aarch64/vf_bwdif_init_aarch64.c
>> +++ b/libavfilter/aarch64/vf_bwdif_init_aarch64.c
>> @@ -24,10 +24,29 @@
>> #include "libavfilter/bwdif.h"
>> #include "libavutil/aarch64/cpu.h"
>>
>> +void ff_bwdif_filter_edge_neon(void *dst1, void *prev1, void *cur1, void 
>> *next1,
>> +   int w, int prefs, int mrefs, int prefs2, int 
>> mrefs2,
>> +   int parity, int clip_max, int spat);
>> +
>> void ff_bwdif_filter_intra_neon(void *dst1, void *cur1, int w, int prefs, 
>> int mrefs,
>> int prefs3, int mrefs3, int parity, int 
>> clip_max);
>>
>>
>> +static void filter_edge_helper(void *dst1, void *prev1, void *cur1, void 
>> *next1,
>> +   int w, int prefs, int mrefs, int prefs2, int 
>> mrefs2,
>> +   int parity, int clip_max, int spat)
>> +{
>> +const int w0 = clip_max != 255 ? 0 : w & ~15;
>> +
>> +ff_bwdif_filter_edge_neon(dst1, prev1, cur1, next1, w0, prefs, mrefs, 
>> prefs2, mrefs2,
>> +  parity, clip_max, spat);
>> +
>> +if (w0 < w)
>> +ff_bwdif_filter_edge_c((char *)dst1 + w0, (char *)prev1 + w0, (char 
>> *)cur1 + w0, (char *)next1 + w0,
>> +   w - w0, prefs, mrefs, prefs2, mrefs2,
>> +   parity, clip_max, spat);
>> +}
>> +
>> static void filter_intra_helper(void *dst1, void *cur1, int w, int prefs, 
>> int mrefs,
>> int prefs3, int mrefs3, int parity, int 
>> clip_max)
>> {
>> @@ -52,5 +71,6 @@ ff_bwdif_init_aarch64(BWDIFContext *s, int bit_depth)
>> return;
>>
>> s->filter_intra = filter_intra_helper;
>> +s->filter_edge  = filter_edge_helper;
>> }
>>
>> diff --git a/libavfilter/aarch64/vf_bwdif_neon.S 
>> b/libavfilter/aarch64/vf_bwdif_neon.S
>> index 6c5d1598f4..a33b235882 100644
>> --- a/libavfilter/aarch64/vf_bwdif_neon.S
>> +++ b/libavfilter/aarch64/vf_bwdif_neon.S
>> @@ -128,6 +128,110 @@ coeffs:
>> .hword  5570, 3801, 1016, -3801 // hf[0] = v0.h[2], 
>> -hf[1] = v0.h[5]
>> .hword  5077, 981   // sp[0] = v0.h[6]
>>
>> +// 
>> 
>> +//
>> +// void ff_bwdif_filter_edge_neon(
>> +//  void *dst1, // x0
>> +//  void *prev1,// x1
>> +//  void *cur1, // x2
>> +//  void *next1,// x3
>> +//  int w,  // w4
>> +//  int prefs,  // w5
>> +//  int mrefs,  // w6
>> +//  int prefs2, // w7
>> +//  int mrefs2, // [sp, #0]
>> +//  int parity, // [sp, #8]
>> +//  int clip_max,   // [sp, #16]  unused
>> +//  int spat);  // [sp, #24]
>
>This doesn't hold for macOS targets (and the checkasm tests fail on that 
>platform).
>
>On macOS, arguments that aren't passed in registers but on the stack, are 
>tightly packed. So since parity is 32 bit and mrefs2 also was 32 bit, 
>parity is available at [sp, #4].
>
>Therefore, it's usually simplest for portability reasons, to pass any 
>arguments after the first 8, as intptr_t or ptrdiff_t, as that makes them 
>consistent across platforms.

Not my interface - this is already existing code. What do you suggest I
do?

I'm happy either to change the interface or fix my stack offsets if
there is any clue that lets me detect this ABI. As personal preference
I'd choose the latter.

I don't have easy access to a mac. Is there any easy way of getting this
tested before resubmission?

Thanks

JC

>// Martin
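
The stack-offset difference described in the review can be worked through with a small model. This is a simplified sketch under stated assumptions: it only handles scalar arguments whose size equals their alignment, "packed" models the macOS behaviour described above, and the non-packed mode models the 8-byte slots assumed by the comments in the patch. The function and array names are hypothetical.

```c
#include <stddef.h>

/* The four stack-passed arguments of filter_edge (mrefs2, parity,
 * clip_max, spat), all declared as int. */
static const size_t edge_stack_ints[4] = { 4, 4, 4, 4 };

/* The same four arguments if they were declared as intptr_t/ptrdiff_t. */
static const size_t edge_stack_ptrs[4] = { 8, 8, 8, 8 };

/* Offset from sp of the idx-th stack-passed argument.
 * packed != 0: each argument takes only its own size (macOS model).
 * packed == 0: each argument takes at least an 8-byte slot. */
static size_t stack_arg_offset(const size_t *sizes, size_t idx, int packed)
{
    size_t off = 0;

    for (size_t i = 0; ; i++) {
        size_t slot = packed ? sizes[i] : (sizes[i] < 8 ? 8 : sizes[i]);

        off = (off + slot - 1) & ~(slot - 1);   /* align to the slot size */
        if (i == idx)
            return off;
        off += slot;
    }
}
```

In this model parity (index 1) sits at [sp, #8] with 8-byte slots and at [sp, #4] when packed, matching the review comment, while pointer-sized arguments land at the same offsets in both modes.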

Re: [FFmpeg-devel] [PATCH 11/15] avfilter/vf_bwdif: Add neon for filter_line

2023-07-02 Thread John Cox

On Sun, 2 Jul 2023 00:44:10 +0300 (EEST), you wrote:

>On Thu, 29 Jun 2023, John Cox wrote:
>
>> Signed-off-by: John Cox 
>> ---
>> libavfilter/aarch64/vf_bwdif_init_aarch64.c |  21 ++
>> libavfilter/aarch64/vf_bwdif_neon.S | 215 
>> 2 files changed, 236 insertions(+)
>>
>> diff --git a/libavfilter/aarch64/vf_bwdif_init_aarch64.c 
>> b/libavfilter/aarch64/vf_bwdif_init_aarch64.c
>> index e75cf2f204..21e67884ab 100644
>> --- a/libavfilter/aarch64/vf_bwdif_init_aarch64.c
>> +++ b/libavfilter/aarch64/vf_bwdif_init_aarch64.c
>> @@ -31,6 +31,26 @@ void ff_bwdif_filter_edge_neon(void *dst1, void *prev1, 
>> void *cur1, void *next1,
>> void ff_bwdif_filter_intra_neon(void *dst1, void *cur1, int w, int prefs, 
>> int mrefs,
>> int prefs3, int mrefs3, int parity, int 
>> clip_max);
>>
>> +void ff_bwdif_filter_line_neon(void *dst1, void *prev1, void *cur1, void 
>> *next1,
>> +   int w, int prefs, int mrefs, int prefs2, int 
>> mrefs2,
>> +   int prefs3, int mrefs3, int prefs4, int 
>> mrefs4,
>> +   int parity, int clip_max);
>> +
>> +
>> +static void filter_line_helper(void *dst1, void *prev1, void *cur1, void 
>> *next1,
>> +   int w, int prefs, int mrefs, int prefs2, int 
>> mrefs2,
>> +   int prefs3, int mrefs3, int prefs4, int 
>> mrefs4,
>> +   int parity, int clip_max)
>> +{
>> +const int w0 = clip_max != 255 ? 0 : w & ~15;
>> +
>> +ff_bwdif_filter_line_neon(dst1, prev1, cur1, next1,
>> +  w0, prefs, mrefs, prefs2, mrefs2, prefs3, 
>> mrefs3, prefs4, mrefs4, parity, clip_max);
>> +
>> +if (w0 < w)
>> +ff_bwdif_filter_line_c((char *)dst1 + w0, (char *)prev1 + w0, (char 
>> *)cur1 + w0, (char *)next1 + w0,
>> +   w - w0, prefs, mrefs, prefs2, mrefs2, 
>> prefs3, mrefs3, prefs4, mrefs4, parity, clip_max);
>> +}
>>
>> static void filter_edge_helper(void *dst1, void *prev1, void *cur1, void 
>> *next1,
>>int w, int prefs, int mrefs, int prefs2, int 
>> mrefs2,
>> @@ -71,6 +91,7 @@ ff_bwdif_init_aarch64(BWDIFContext *s, int bit_depth)
>> return;
>>
>> s->filter_intra = filter_intra_helper;
>> +s->filter_line  = filter_line_helper;
>> s->filter_edge  = filter_edge_helper;
>> }
>>
>> diff --git a/libavfilter/aarch64/vf_bwdif_neon.S 
>> b/libavfilter/aarch64/vf_bwdif_neon.S
>> index a33b235882..675e97d966 100644
>> --- a/libavfilter/aarch64/vf_bwdif_neon.S
>> +++ b/libavfilter/aarch64/vf_bwdif_neon.S
>> @@ -128,6 +128,221 @@ coeffs:
>> .hword  5570, 3801, 1016, -3801 // hf[0] = v0.h[2], 
>> -hf[1] = v0.h[5]
>> .hword  5077, 981   // sp[0] = v0.h[6]
>>
>> +// 
>> ===
>> +//
>> +// void filter_line(
>> +//  void *dst1, // x0
>> +//  void *prev1,// x1
>> +//  void *cur1, // x2
>> +//  void *next1,// x3
>> +//  int w,  // w4
>> +//  int prefs,  // w5
>> +//  int mrefs,  // w6
>> +//  int prefs2, // w7
>> +//  int mrefs2, // [sp, #0]
>> +//  int prefs3, // [sp, #8]
>> +//  int mrefs3, // [sp, #16]
>> +//  int prefs4, // [sp, #24]
>> +//  int mrefs4, // [sp, #32]
>> +//  int parity, // [sp, #40]
>> +//  int clip_max)   // [sp, #48]
>> +
>> +function ff_bwdif_filter_line_neon, export=1
>> +// Sanity check w
>> +cmp w4, #0
>> +ble 99f
>> +
>> +// Rearrange regs to be the same as line3 for ease of debug!
>> +mov w10, w4 // w10 = loop count
>> +mov w9,  w6 // w9  = mref
>> +mov w12, w7 // w12 = pref2
>> +mov w11, w5 // w11 = pref
>> +ldr w8,  [sp, #0]   // w8 =  mref2
>> +ldr w7,  [sp, #16]  // w7  = mref3
>> +ldr w6,  [sp, #32]  // w6  = mref4
>> +ld

[FFmpeg-devel] [PATCH v2 00/15] avfilter/vf_bwdif: Add aarch64 neon functions

2023-07-02 Thread John Cox

Also adds a filter_line3 method which on aarch64 neon yields approx 30%
speedup over 2xfilter_line and a memcpy

Differences from v1:
.align 16 corrected to .balign 16
SXTW tolower
Mac ABI (hopefully) fixed
V register pop/push macroed & prettified

John Cox (15):
  avfilter/vf_bwdif: Add outline for aarch neon functions
  avfilter/vf_bwdif: Add common macros and consts for aarch64 neon
  avfilter/vf_bwdif: Export C filter_intra
  avfilter/vf_bwdif: Add neon for filter_intra
  tests/checkasm: Add test for vf_bwdif filter_intra
  avfilter/vf_bwdif: Add clip and spatial macros for aarch64 neon
  avfilter/vf_bwdif: Export C filter_edge
  avfilter/vf_bwdif: Add neon for filter_edge
  tests/checkasm: Add test for vf_bwdif filter_edge
  avfilter/vf_bwdif: Export C filter_line
  avfilter/vf_bwdif: Add neon for filter_line
  avfilter/vf_bwdif: Add a filter_line3 method for optimisation
  avfilter/vf_bwdif: Add neon for filter_line3
  tests/checkasm: Add test for vf_bwdif filter_line3
  avfilter/vf_bwdif: Block filter slices into a multiple of 4 lines

 libavfilter/aarch64/Makefile|   2 +
 libavfilter/aarch64/vf_bwdif_init_aarch64.c | 125 
 libavfilter/aarch64/vf_bwdif_neon.S | 788 
 libavfilter/bwdif.h |  20 +
 libavfilter/vf_bwdif.c  |  70 +-
 tests/checkasm/vf_bwdif.c   | 172 +
 6 files changed, 1162 insertions(+), 15 deletions(-)
 create mode 100644 libavfilter/aarch64/vf_bwdif_init_aarch64.c
 create mode 100644 libavfilter/aarch64/vf_bwdif_neon.S

-- 
2.39.2


[FFmpeg-devel] [PATCH v2 01/15] avfilter/vf_bwdif: Add outline for aarch neon functions

2023-07-02 Thread John Cox

Outline but no actual functions.

Signed-off-by: John Cox 
---
 libavfilter/aarch64/Makefile|  2 ++
 libavfilter/aarch64/vf_bwdif_init_aarch64.c | 39 +
 libavfilter/aarch64/vf_bwdif_neon.S | 25 +
 libavfilter/bwdif.h |  1 +
 libavfilter/vf_bwdif.c  |  2 ++
 5 files changed, 69 insertions(+)
 create mode 100644 libavfilter/aarch64/vf_bwdif_init_aarch64.c
 create mode 100644 libavfilter/aarch64/vf_bwdif_neon.S

diff --git a/libavfilter/aarch64/Makefile b/libavfilter/aarch64/Makefile
index b58daa3a3f..b68209bc94 100644
--- a/libavfilter/aarch64/Makefile
+++ b/libavfilter/aarch64/Makefile
@@ -1,3 +1,5 @@
+OBJS-$(CONFIG_BWDIF_FILTER)  += aarch64/vf_bwdif_init_aarch64.o
 OBJS-$(CONFIG_NLMEANS_FILTER)+= aarch64/vf_nlmeans_init.o
 
+NEON-OBJS-$(CONFIG_BWDIF_FILTER) += aarch64/vf_bwdif_neon.o
 NEON-OBJS-$(CONFIG_NLMEANS_FILTER)   += aarch64/vf_nlmeans_neon.o
diff --git a/libavfilter/aarch64/vf_bwdif_init_aarch64.c b/libavfilter/aarch64/vf_bwdif_init_aarch64.c
new file mode 100644
index 00..86d53b2ca1
--- /dev/null
+++ b/libavfilter/aarch64/vf_bwdif_init_aarch64.c
@@ -0,0 +1,39 @@
+/*
+ * bwdif aarch64 NEON optimisations
+ *
+ * Copyright (c) 2023 John Cox 
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "libavutil/common.h"
+#include "libavfilter/bwdif.h"
+#include "libavutil/aarch64/cpu.h"
+
+void
+ff_bwdif_init_aarch64(BWDIFContext *s, int bit_depth)
+{
+const int cpu_flags = av_get_cpu_flags();
+
+if (bit_depth != 8)
+return;
+
+if (!have_neon(cpu_flags))
+return;
+
+}
+
diff --git a/libavfilter/aarch64/vf_bwdif_neon.S b/libavfilter/aarch64/vf_bwdif_neon.S
new file mode 100644
index 00..639ab22998
--- /dev/null
+++ b/libavfilter/aarch64/vf_bwdif_neon.S
@@ -0,0 +1,25 @@
+/*
+ * bwdif aarch64 NEON optimisations
+ *
+ * Copyright (c) 2023 John Cox 
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+
+#include "libavutil/aarch64/asm.S"
+
diff --git a/libavfilter/bwdif.h b/libavfilter/bwdif.h
index 5749345f78..6a0f70487a 100644
--- a/libavfilter/bwdif.h
+++ b/libavfilter/bwdif.h
@@ -39,5 +39,6 @@ typedef struct BWDIFContext {
 
 void ff_bwdif_init_filter_line(BWDIFContext *bwdif, int bit_depth);
 void ff_bwdif_init_x86(BWDIFContext *bwdif, int bit_depth);
+void ff_bwdif_init_aarch64(BWDIFContext *bwdif, int bit_depth);
 
 #endif /* AVFILTER_BWDIF_H */
diff --git a/libavfilter/vf_bwdif.c b/libavfilter/vf_bwdif.c
index e278cf1217..39a51429ac 100644
--- a/libavfilter/vf_bwdif.c
+++ b/libavfilter/vf_bwdif.c
@@ -369,6 +369,8 @@ av_cold void ff_bwdif_init_filter_line(BWDIFContext *s, int bit_depth)
 
 #if ARCH_X86
 ff_bwdif_init_x86(s, bit_depth);
+#elif ARCH_AARCH64
+ff_bwdif_init_aarch64(s, bit_depth);
 #endif
 }
 
-- 
2.39.2


[FFmpeg-devel] [PATCH v2 02/15] avfilter/vf_bwdif: Add common macros and consts for aarch64 neon

2023-07-02 Thread John Cox

Add macros for dual scalar half->single multiply and accumulate
Add macro for shift, saturate and shorten single to byte
Add filter constants

Signed-off-by: John Cox 
---
 libavfilter/aarch64/vf_bwdif_neon.S | 53 +
 1 file changed, 53 insertions(+)

diff --git a/libavfilter/aarch64/vf_bwdif_neon.S b/libavfilter/aarch64/vf_bwdif_neon.S
index 639ab22998..c2f5eb1f73 100644
--- a/libavfilter/aarch64/vf_bwdif_neon.S
+++ b/libavfilter/aarch64/vf_bwdif_neon.S
@@ -23,3 +23,56 @@
 
 #include "libavutil/aarch64/asm.S"
 
+// Space taken on the stack by an int (32-bit)
+#ifdef __APPLE__
+.set SP_INT, 4
+#else
+.set SP_INT, 8
+#endif
+
+.macro SQSHRUNN b, s0, s1, s2, s3, n
+sqshrun \s0\().4h, \s0\().4s, #\n - 8
+sqshrun2 \s0\().8h, \s1\().4s, #\n - 8
+sqshrun \s1\().4h, \s2\().4s, #\n - 8
+sqshrun2 \s1\().8h, \s3\().4s, #\n - 8
+uzp2 \b\().16b, \s0\().16b, \s1\().16b
+.endm
+
+.macro SMULL4K a0, a1, a2, a3, s0, s1, k
+smull   \a0\().4s, \s0\().4h, \k
+smull2  \a1\().4s, \s0\().8h, \k
+smull   \a2\().4s, \s1\().4h, \k
+smull2  \a3\().4s, \s1\().8h, \k
+.endm
+
+.macro UMULL4K a0, a1, a2, a3, s0, s1, k
+umull   \a0\().4s, \s0\().4h, \k
+umull2  \a1\().4s, \s0\().8h, \k
+umull   \a2\().4s, \s1\().4h, \k
+umull2  \a3\().4s, \s1\().8h, \k
+.endm
+
+.macro UMLAL4K a0, a1, a2, a3, s0, s1, k
+umlal   \a0\().4s, \s0\().4h, \k
+umlal2  \a1\().4s, \s0\().8h, \k
+umlal   \a2\().4s, \s1\().4h, \k
+umlal2  \a3\().4s, \s1\().8h, \k
+.endm
+
+.macro UMLSL4K a0, a1, a2, a3, s0, s1, k
+umlsl   \a0\().4s, \s0\().4h, \k
+umlsl2  \a1\().4s, \s0\().8h, \k
+umlsl   \a2\().4s, \s1\().4h, \k
+umlsl2  \a3\().4s, \s1\().8h, \k
+.endm
+
+// static const uint16_t coef_lf[2] = { 4309, 213 };
+// static const uint16_t coef_hf[3] = { 5570, 3801, 1016 };
+// static const uint16_t coef_sp[2] = { 5077, 981 };
+
+.balign 16
+coeffs:
+.hword  4309 * 4, 213 * 4   // lf[0]*4 = v0.h[0]
+.hword  5570, 3801, 1016, -3801 // hf[0] = v0.h[2], -hf[1] = v0.h[5]
+.hword  5077, 981   // sp[0] = v0.h[6]
+
-- 
2.39.2
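
As a reading aid for the SQSHRUNN macro above ("shift, saturate and shorten single to byte"), here is a scalar model of what one lane appears to compute: a saturating narrowing shift by n - 8 to unsigned 16 bits, followed by keeping the high byte (the uzp2 step). The function names are hypothetical, and the model assumes that >> on a negative int is an arithmetic shift, which C leaves implementation-defined.

```c
#include <stdint.h>

/* Model of one sqshrun lane: shift right, then saturate to 0..65535. */
static uint16_t sqshrun_u16(int32_t x, int shift)
{
    int32_t v = x >> shift;   /* assumed arithmetic shift for negative x */

    if (v < 0)
        return 0;
    if (v > 65535)
        return 65535;
    return (uint16_t)v;
}

/* Model of one SQSHRUNN lane: narrow by n - 8, then keep the high byte. */
static uint8_t sqshrunn_model(int32_t x, int n)
{
    return (uint8_t)(sqshrun_u16(x, n - 8) >> 8);
}

/* Reference: shift by n and clip to 0..255. */
static uint8_t clip_shift(int32_t x, int n)
{
    int32_t v = x >> n;

    return (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v);
}

/* Returns 1 if the two agree over a range that covers both saturation ends. */
static int sqshrunn_matches_clip(int n)
{
    for (int32_t x = -(1 << 20); x <= (1 << 22); x += 37)
        if (sqshrunn_model(x, n) != clip_shift(x, n))
            return 0;
    return 1;
}
```

Under these assumptions the two-step narrowing gives the same result as av_clip(x >> n, 0, 255), which matches how the macro is used with n = 13 in the filter_intra patch.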


[FFmpeg-devel] [PATCH v2 03/15] avfilter/vf_bwdif: Export C filter_intra

2023-07-02 Thread John Cox

Needed for tail fixup of neon code

Signed-off-by: John Cox 
---
 libavfilter/bwdif.h| 3 +++
 libavfilter/vf_bwdif.c | 6 +++---
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/libavfilter/bwdif.h b/libavfilter/bwdif.h
index 6a0f70487a..ae6f6ce223 100644
--- a/libavfilter/bwdif.h
+++ b/libavfilter/bwdif.h
@@ -41,4 +41,7 @@ void ff_bwdif_init_filter_line(BWDIFContext *bwdif, int bit_depth);
 void ff_bwdif_init_x86(BWDIFContext *bwdif, int bit_depth);
 void ff_bwdif_init_aarch64(BWDIFContext *bwdif, int bit_depth);
 
+void ff_bwdif_filter_intra_c(void *dst1, void *cur1, int w, int prefs, int mrefs,
+ int prefs3, int mrefs3, int parity, int clip_max);
+
 #endif /* AVFILTER_BWDIF_H */
diff --git a/libavfilter/vf_bwdif.c b/libavfilter/vf_bwdif.c
index 39a51429ac..035fc58670 100644
--- a/libavfilter/vf_bwdif.c
+++ b/libavfilter/vf_bwdif.c
@@ -122,8 +122,8 @@ typedef struct ThreadData {
 next2++; \
 }
 
-static void filter_intra(void *dst1, void *cur1, int w, int prefs, int mrefs,
- int prefs3, int mrefs3, int parity, int clip_max)
+void ff_bwdif_filter_intra_c(void *dst1, void *cur1, int w, int prefs, int mrefs,
+ int prefs3, int mrefs3, int parity, int clip_max)
 {
 uint8_t *dst = dst1;
 uint8_t *cur = cur1;
@@ -362,7 +362,7 @@ av_cold void ff_bwdif_init_filter_line(BWDIFContext *s, int bit_depth)
 s->filter_line  = filter_line_c_16bit;
 s->filter_edge  = filter_edge_16bit;
 } else {
-s->filter_intra = filter_intra;
+s->filter_intra = ff_bwdif_filter_intra_c;
 s->filter_line  = filter_line_c;
 s->filter_edge  = filter_edge;
 }
-- 
2.39.2


[FFmpeg-devel] [PATCH v2 04/15] avfilter/vf_bwdif: Add neon for filter_intra

2023-07-02 Thread John Cox

Signed-off-by: John Cox 
---
 libavfilter/aarch64/vf_bwdif_init_aarch64.c | 17 +++
 libavfilter/aarch64/vf_bwdif_neon.S | 53 +
 2 files changed, 70 insertions(+)

diff --git a/libavfilter/aarch64/vf_bwdif_init_aarch64.c b/libavfilter/aarch64/vf_bwdif_init_aarch64.c
index 86d53b2ca1..3ffaa07ab3 100644
--- a/libavfilter/aarch64/vf_bwdif_init_aarch64.c
+++ b/libavfilter/aarch64/vf_bwdif_init_aarch64.c
@@ -24,6 +24,22 @@
 #include "libavfilter/bwdif.h"
 #include "libavutil/aarch64/cpu.h"
 
+void ff_bwdif_filter_intra_neon(void *dst1, void *cur1, int w, int prefs, int mrefs,
+int prefs3, int mrefs3, int parity, int clip_max);
+
+
+static void filter_intra_helper(void *dst1, void *cur1, int w, int prefs, int mrefs,
+int prefs3, int mrefs3, int parity, int clip_max)
+{
+const int w0 = clip_max != 255 ? 0 : w & ~15;
+
+ff_bwdif_filter_intra_neon(dst1, cur1, w0, prefs, mrefs, prefs3, mrefs3, parity, clip_max);
+
+if (w0 < w)
+ff_bwdif_filter_intra_c((char *)dst1 + w0, (char *)cur1 + w0,
+w - w0, prefs, mrefs, prefs3, mrefs3, parity, clip_max);
+}
+
 void
 ff_bwdif_init_aarch64(BWDIFContext *s, int bit_depth)
 {
@@ -35,5 +51,6 @@ ff_bwdif_init_aarch64(BWDIFContext *s, int bit_depth)
 if (!have_neon(cpu_flags))
 return;
 
+s->filter_intra = filter_intra_helper;
 }
 
diff --git a/libavfilter/aarch64/vf_bwdif_neon.S b/libavfilter/aarch64/vf_bwdif_neon.S
index c2f5eb1f73..6a614f8d6e 100644
--- a/libavfilter/aarch64/vf_bwdif_neon.S
+++ b/libavfilter/aarch64/vf_bwdif_neon.S
@@ -76,3 +76,56 @@ coeffs:
 .hword  5570, 3801, 1016, -3801 // hf[0] = v0.h[2], -hf[1] = v0.h[5]
 .hword  5077, 981   // sp[0] = v0.h[6]
 
+// 
+//
+// void ff_bwdif_filter_intra_neon(
+//  void *dst1, // x0
+//  void *cur1, // x1
+//  int w,  // w2
+//  int prefs,  // w3
+//  int mrefs,  // w4
+//  int prefs3, // w5
+//  int mrefs3, // w6
+//  int parity, // w7   unused
+//  int clip_max)   // [sp, #0] unused
+
+function ff_bwdif_filter_intra_neon, export=1
+cmp w2, #0
+ble 99f
+
+ldr q0, coeffs
+
+//for (x = 0; x < w; x++) {
+10:
+
+//interpol = (coef_sp[0] * (cur[mrefs] + cur[prefs]) - coef_sp[1] * (cur[mrefs3] + cur[prefs3])) >> 13;
+ldr q31, [x1, w4, sxtw]
+ldr q30, [x1, w3, sxtw]
+ldr q29, [x1, w6, sxtw]
+ldr q28, [x1, w5, sxtw]
+
+uaddl   v20.8h,  v31.8b,  v30.8b
+uaddl2  v21.8h,  v31.16b, v30.16b
+
+UMULL4K v2, v3, v4, v5, v20, v21, v0.h[6]
+
+uaddl   v20.8h,  v29.8b,  v28.8b
+uaddl2  v21.8h,  v29.16b, v28.16b
+
+UMLSL4K v2, v3, v4, v5, v20, v21, v0.h[7]
+
+//dst[0] = av_clip(interpol, 0, clip_max);
+SQSHRUNN v2, v2, v3, v4, v5, 13
+str q2, [x0], #16
+
+//dst++;
+//cur++;
+//}
+
+subs w2,  w2,  #16
+add x1,  x1,  #16
+bgt 10b
+
+99:
+ret
+endfunc
-- 
2.39.2
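
For comparison with the assembly above, the per-pixel arithmetic quoted in its C comments can be written as a standalone scalar function. The formula and the coef_sp values come from those comments; clip_int, intra_pixel and intra_on_flat_column are hypothetical names, and the shift of a possibly negative value is assumed to be arithmetic.

```c
#include <stdint.h>

static const uint16_t coef_sp[2] = { 5077, 981 };

static int clip_int(int v, int lo, int hi)
{
    return v < lo ? lo : v > hi ? hi : v;
}

/* One output pixel; cur points at the pixel being interpolated and the
 * refs arguments are byte offsets to the rows 1 and 3 lines away. */
static uint8_t intra_pixel(const uint8_t *cur, int prefs, int mrefs,
                           int prefs3, int mrefs3, int clip_max)
{
    int interpol = (coef_sp[0] * (cur[mrefs] + cur[prefs]) -
                    coef_sp[1] * (cur[mrefs3] + cur[prefs3])) >> 13;

    return (uint8_t)clip_int(interpol, 0, clip_max);
}

/* Applies intra_pixel to the centre of a 7-pixel column of constant value v. */
static uint8_t intra_on_flat_column(uint8_t v)
{
    uint8_t col[7];

    for (int i = 0; i < 7; i++)
        col[i] = v;
    return intra_pixel(col + 3, 1, -1, 3, -3, 255);
}
```

Since 5077 - 981 = 4096 and the result is shifted right by 13 after summing two pixels per tap, a flat area passes through unchanged, which makes a convenient sanity check for any reimplementation.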


[FFmpeg-devel] [PATCH v2 08/15] avfilter/vf_bwdif: Add neon for filter_edge

2023-07-02 Thread John Cox

Signed-off-by: John Cox 
---
 libavfilter/aarch64/vf_bwdif_init_aarch64.c |  20 
 libavfilter/aarch64/vf_bwdif_neon.S | 104 
 2 files changed, 124 insertions(+)

diff --git a/libavfilter/aarch64/vf_bwdif_init_aarch64.c 
b/libavfilter/aarch64/vf_bwdif_init_aarch64.c
index 3ffaa07ab3..e75cf2f204 100644
--- a/libavfilter/aarch64/vf_bwdif_init_aarch64.c
+++ b/libavfilter/aarch64/vf_bwdif_init_aarch64.c
@@ -24,10 +24,29 @@
 #include "libavfilter/bwdif.h"
 #include "libavutil/aarch64/cpu.h"
 
+void ff_bwdif_filter_edge_neon(void *dst1, void *prev1, void *cur1, void 
*next1,
+   int w, int prefs, int mrefs, int prefs2, int 
mrefs2,
+   int parity, int clip_max, int spat);
+
 void ff_bwdif_filter_intra_neon(void *dst1, void *cur1, int w, int prefs, int 
mrefs,
 int prefs3, int mrefs3, int parity, int 
clip_max);
 
 
+static void filter_edge_helper(void *dst1, void *prev1, void *cur1, void *next1,
+   int w, int prefs, int mrefs, int prefs2, int mrefs2,
+   int parity, int clip_max, int spat)
+{
+const int w0 = clip_max != 255 ? 0 : w & ~15;
+
+ff_bwdif_filter_edge_neon(dst1, prev1, cur1, next1, w0, prefs, mrefs, prefs2, mrefs2,
+  parity, clip_max, spat);
+
+if (w0 < w)
+ff_bwdif_filter_edge_c((char *)dst1 + w0, (char *)prev1 + w0, (char *)cur1 + w0, (char *)next1 + w0,
+   w - w0, prefs, mrefs, prefs2, mrefs2,
+   parity, clip_max, spat);
+}
+
 static void filter_intra_helper(void *dst1, void *cur1, int w, int prefs, int 
mrefs,
 int prefs3, int mrefs3, int parity, int 
clip_max)
 {
@@ -52,5 +71,6 @@ ff_bwdif_init_aarch64(BWDIFContext *s, int bit_depth)
 return;
 
 s->filter_intra = filter_intra_helper;
+s->filter_edge  = filter_edge_helper;
 }
 
diff --git a/libavfilter/aarch64/vf_bwdif_neon.S 
b/libavfilter/aarch64/vf_bwdif_neon.S
index 48dc7bcd9d..d6e7d109f5 100644
--- a/libavfilter/aarch64/vf_bwdif_neon.S
+++ b/libavfilter/aarch64/vf_bwdif_neon.S
@@ -149,6 +149,110 @@ coeffs:
 .hword  5570, 3801, 1016, -3801 // hf[0] = v0.h[2], 
-hf[1] = v0.h[5]
 .hword  5077, 981   // sp[0] = v0.h[6]
 
+// 
+//
+// void ff_bwdif_filter_edge_neon(
+//  void *dst1, // x0
+//  void *prev1,// x1
+//  void *cur1, // x2
+//  void *next1,// x3
+//  int w,  // w4
+//  int prefs,  // w5
+//  int mrefs,  // w6
+//  int prefs2, // w7
+//  int mrefs2, // [sp, #0]
+//  int parity, // [sp, #SP_INT]
+//  int clip_max,   // [sp, #SP_INT*2]  unused
+//  int spat);  // [sp, #SP_INT*3]
+
+function ff_bwdif_filter_edge_neon, export=1
+// Sanity check w
+cmp w4, #0
+ble 99f
+
+// #define prev2 cur
+// const uint8_t * restrict next2 = parity ? prev : next;
+
+ldr w8,  [sp, #0]   // mrefs2
+
+ldr w17, [sp, #SP_INT]  // parity
+ldr w16, [sp, #SP_INT*3]// spat
+cmp w17, #0
+csel    x17, x1, x3, ne
+
+// for (x = 0; x < w; x++) {
+
+10:
+//int m1 = cur[mrefs];
+//int d = (prev2[0] + next2[0]) >> 1;
+//int p1 = cur[prefs];
+//int temporal_diff0 = FFABS(prev2[0] - next2[0]);
+//int temporal_diff1 =(FFABS(prev[mrefs] - m1) + FFABS(prev[prefs] - p1)) >> 1;
+//int temporal_diff2 =(FFABS(next[mrefs] - m1) + FFABS(next[prefs] - p1)) >> 1;
+//int diff = FFMAX3(temporal_diff0 >> 1, temporal_diff1, temporal_diff2);
+ldr q31, [x2]
+ldr q21, [x17]
+uhadd   v16.16b, v31.16b, v21.16b   // d0 = v16
+uabd    v17.16b, v31.16b, v21.16b   // td0 = v17
+ldr q24, [x2, w6, sxtw] // m1 = v24
+ldr q22, [x2, w5, sxtw] // p1 = v22
+
+ldr q0,  [x1, w6, sxtw] // prev[mrefs]
+ldr q2,  [x1, w5, sxtw] // prev[prefs]
+ldr q1,  [x3, w6, sxtw] // next[mrefs]
+ldr q3,  [x3, w5, sxtw] // next[prefs]
+
+ushr    v29.16b, v17.16b, #1
+
+uabd    v31.16b, v0.16b,  v24.16b
+uabd    v30.16b, v2.16b,  v22.16b
+uhadd   v0.16b,  v31.16b, v30.16b   // td1 = q0
+
+uabd    v31.16b, v1.16b,  v24.16b
+uabd    v30.16b, v3.16b,  v22.16b
+uhadd   v1.16b,  v31.16b,

[FFmpeg-devel] [PATCH v2 09/15] tests/checkasm: Add test for vf_bwdif filter_edge

2023-07-02 Thread John Cox

Signed-off-by: John Cox 
---
 tests/checkasm/vf_bwdif.c | 54 +++
 1 file changed, 54 insertions(+)

diff --git a/tests/checkasm/vf_bwdif.c b/tests/checkasm/vf_bwdif.c
index 034bbabb4c..5fdba09fdc 100644
--- a/tests/checkasm/vf_bwdif.c
+++ b/tests/checkasm/vf_bwdif.c
@@ -83,6 +83,60 @@ void checkasm_check_vf_bwdif(void)
 report("bwdif10");
 }
 
+{
+LOCAL_ALIGNED_16(uint8_t, prev0, [11*WIDTH]);
+LOCAL_ALIGNED_16(uint8_t, prev1, [11*WIDTH]);
+LOCAL_ALIGNED_16(uint8_t, next0, [11*WIDTH]);
+LOCAL_ALIGNED_16(uint8_t, next1, [11*WIDTH]);
+LOCAL_ALIGNED_16(uint8_t, cur0,  [11*WIDTH]);
+LOCAL_ALIGNED_16(uint8_t, cur1,  [11*WIDTH]);
+LOCAL_ALIGNED_16(uint8_t, dst0,  [WIDTH*3]);
+LOCAL_ALIGNED_16(uint8_t, dst1,  [WIDTH*3]);
+const int stride = WIDTH;
+const int mask = (1<<8)-1;
+int spat;
+int parity;
+
+for (spat = 0; spat != 2; ++spat) {
+for (parity = 0; parity != 2; ++parity) {
+if (check_func(ctx_8.filter_edge, "bwdif8.edge.s%d.p%d", spat, 
parity)) {
+
+declare_func(void, void *dst1, void *prev1, void *cur1, 
void *next1,
+int w, int prefs, int mrefs, int 
prefs2, int mrefs2,
+int parity, int clip_max, int 
spat);
+
+randomize_buffers(prev0, prev1, mask, 11*WIDTH);
+randomize_buffers(next0, next1, mask, 11*WIDTH);
+randomize_buffers( cur0,  cur1, mask, 11*WIDTH);
+memset(dst0, 0xba, WIDTH * 3);
+memset(dst1, 0xba, WIDTH * 3);
+
+call_ref(dst0 + stride,
+ prev0 + stride * 4, cur0 + stride * 4, next0 + 
stride * 4, WIDTH,
+ stride, -stride, stride * 2, -stride * 2,
+ parity, mask, spat);
+call_new(dst1 + stride,
+ prev1 + stride * 4, cur1 + stride * 4, next1 + 
stride * 4, WIDTH,
+ stride, -stride, stride * 2, -stride * 2,
+ parity, mask, spat);
+
+if (memcmp(dst0, dst1, WIDTH*3)
+|| memcmp(prev0, prev1, WIDTH*11)
+|| memcmp(next0, next1, WIDTH*11)
+|| memcmp( cur0,  cur1, WIDTH*11))
+fail();
+
+bench_new(dst1 + stride,
+ prev1 + stride * 4, cur1 + stride * 4, next1 + 
stride * 4, WIDTH,
+ stride, -stride, stride * 2, -stride * 2,
+ parity, mask, spat);
+}
+}
+}
+
+report("bwdif8.edge");
+}
+
 if (check_func(ctx_8.filter_intra, "bwdif8.intra")) {
 LOCAL_ALIGNED_16(uint8_t, cur0,  [11*WIDTH]);
 LOCAL_ALIGNED_16(uint8_t, cur1,  [11*WIDTH]);
-- 
2.39.2


[FFmpeg-devel] [PATCH v2 05/15] tests/checkasm: Add test for vf_bwdif filter_intra

2023-07-02 Thread John Cox

Signed-off-by: John Cox 
---
 tests/checkasm/vf_bwdif.c | 37 +
 1 file changed, 37 insertions(+)

diff --git a/tests/checkasm/vf_bwdif.c b/tests/checkasm/vf_bwdif.c
index 46224bb575..034bbabb4c 100644
--- a/tests/checkasm/vf_bwdif.c
+++ b/tests/checkasm/vf_bwdif.c
@@ -20,6 +20,7 @@
 #include "checkasm.h"
 #include "libavcodec/internal.h"
 #include "libavfilter/bwdif.h"
+#include "libavutil/mem_internal.h"
 
 #define WIDTH 256
 
@@ -81,4 +82,40 @@ void checkasm_check_vf_bwdif(void)
 BODY(uint16_t, 10);
 report("bwdif10");
 }
+
+if (check_func(ctx_8.filter_intra, "bwdif8.intra")) {
+LOCAL_ALIGNED_16(uint8_t, cur0,  [11*WIDTH]);
+LOCAL_ALIGNED_16(uint8_t, cur1,  [11*WIDTH]);
+LOCAL_ALIGNED_16(uint8_t, dst0,  [WIDTH*3]);
+LOCAL_ALIGNED_16(uint8_t, dst1,  [WIDTH*3]);
+const int stride = WIDTH;
+const int mask = (1<<8)-1;
+
+declare_func(void, void *dst1, void *cur1, int w, int prefs, int mrefs,
+ int prefs3, int mrefs3, int parity, int clip_max);
+
+randomize_buffers( cur0,  cur1, mask, 11*WIDTH);
+memset(dst0, 0xba, WIDTH * 3);
+memset(dst1, 0xba, WIDTH * 3);
+
+call_ref(dst0 + stride,
+ cur0 + stride * 4, WIDTH,
+ stride, -stride, stride * 3, -stride * 3,
+ 0, mask);
+call_new(dst1 + stride,
+ cur0 + stride * 4, WIDTH,
+ stride, -stride, stride * 3, -stride * 3,
+ 0, mask);
+
+if (memcmp(dst0, dst1, WIDTH*3)
+|| memcmp( cur0,  cur1, WIDTH*11))
+fail();
+
+bench_new(dst1 + stride,
+  cur0 + stride * 4, WIDTH,
+  stride, -stride, stride * 3, -stride * 3,
+  0, mask);
+
+report("bwdif8.intra");
+}
 }
-- 
2.39.2


[FFmpeg-devel] [PATCH v2 06/15] avfilter/vf_bwdif: Add clip and spatial macros for aarch64 neon

2023-07-02 Thread John Cox

Signed-off-by: John Cox 
---
 libavfilter/aarch64/vf_bwdif_neon.S | 73 +
 1 file changed, 73 insertions(+)

diff --git a/libavfilter/aarch64/vf_bwdif_neon.S 
b/libavfilter/aarch64/vf_bwdif_neon.S
index 6a614f8d6e..48dc7bcd9d 100644
--- a/libavfilter/aarch64/vf_bwdif_neon.S
+++ b/libavfilter/aarch64/vf_bwdif_neon.S
@@ -66,6 +66,79 @@
 umlsl2  \a3\().4s, \s1\().8h, \k
 .endm
 
+//  int b = m2s1 - m1;
+//  int f = p2s1 - p1;
+//  int dc = c0s1 - m1;
+//  int de = c0s1 - p1;
+//  int sp_max = FFMIN(p1 - c0s1, m1 - c0s1);
+//  sp_max = FFMIN(sp_max, FFMAX(-b,-f));
+//  int sp_min = FFMIN(c0s1 - p1, c0s1 - m1);
+//  sp_min = FFMIN(sp_min, FFMAX(b,f));
+//  diff = diff == 0 ? 0 : FFMAX3(diff, sp_min, sp_max);
+.macro SPAT_CHECK diff, m2s1, m1, c0s1, p1, p2s1, t0, t1, t2, t3
+uqsub   \t0\().16b, \p1\().16b, \c0s1\().16b
+uqsub   \t2\().16b, \m1\().16b, \c0s1\().16b
+umin    \t2\().16b, \t0\().16b, \t2\().16b
+
+uqsub   \t1\().16b, \m1\().16b, \m2s1\().16b
+uqsub   \t3\().16b, \p1\().16b, \p2s1\().16b
+umax    \t3\().16b, \t3\().16b, \t1\().16b
+umin    \t3\().16b, \t3\().16b, \t2\().16b
+
+uqsub   \t0\().16b, \c0s1\().16b, \p1\().16b
+uqsub   \t2\().16b, \c0s1\().16b, \m1\().16b
+umin    \t2\().16b, \t0\().16b, \t2\().16b
+
+uqsub   \t1\().16b, \m2s1\().16b, \m1\().16b
+uqsub   \t0\().16b, \p2s1\().16b, \p1\().16b
+umax    \t0\().16b, \t0\().16b, \t1\().16b
+umin    \t2\().16b, \t2\().16b, \t0\().16b
+
+cmeq    \t1\().16b, \diff\().16b, #0
+umax    \diff\().16b, \diff\().16b, \t3\().16b
+umax    \diff\().16b, \diff\().16b, \t2\().16b
+bic \diff\().16b, \diff\().16b, \t1\().16b
+.endm
+
+//  i0 = s0;
+//  if (i0 > d0 + diff0)
+//  i0 = d0 + diff0;
+//  else if (i0 < d0 - diff0)
+//  i0 = d0 - diff0;
+//
+// i0 = s0 is safe
+.macro DIFF_CLIP i0, s0, d0, diff, t0, t1
+uqadd   \t0\().16b, \d0\().16b, \diff\().16b
+uqsub   \t1\().16b, \d0\().16b, \diff\().16b
+umin    \i0\().16b, \s0\().16b, \t0\().16b
+umax    \i0\().16b, \i0\().16b, \t1\().16b
+.endm
+
+//  i0 = FFABS(m1 - p1) > td0 ? i1 : i2;
+//  DIFF_CLIP
+//
+// i0 = i1 is safe
+.macro INTERPOL i0, i1, i2, m1, d0, p1, td0, diff, t0, t1, t2
+uabd    \t0\().16b, \m1\().16b, \p1\().16b
+cmhi    \t0\().16b, \t0\().16b, \td0\().16b
+bsl \t0\().16b, \i1\().16b, \i2\().16b
+DIFF_CLIP   \i0, \t0, \d0, \diff, \t1, \t2
+.endm
+
+.macro PUSH_VREGS
+stp d8,  d9,  [sp, #-64]!
+stp d10, d11, [sp, #16]
+stp d12, d13, [sp, #32]
+stp d14, d15, [sp, #48]
+.endm
+
+.macro POP_VREGS
+ldp d14, d15, [sp, #48]
+ldp d12, d13, [sp, #32]
+ldp d10, d11, [sp, #16]
+ldp d8,  d9,  [sp], #64
+.endm
+
 // static const uint16_t coef_lf[2] = { 4309, 213 };
 // static const uint16_t coef_hf[3] = { 5570, 3801, 1016 };
 // static const uint16_t coef_sp[2] = { 5077, 981 };
-- 
2.39.2
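
Note on DIFF_CLIP above: the macro does the clamp entirely in unsigned
8-bit lanes, so d0 + diff and d0 - diff are computed with saturating
uqadd/uqsub rather than widened. The sketch below is a scalar model, one
lane at a time, that compares this against the C comment evaluated in
plain int arithmetic over every byte triple. The helper names (uqadd8,
diff_clip_lane, diff_clip_mismatches and so on) are invented for
illustration and do not exist in FFmpeg.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar models of the NEON byte operations used by DIFF_CLIP. */
static uint8_t uqadd8(uint8_t a, uint8_t b) { int s = a + b; return s > 255 ? 255 : (uint8_t)s; }
static uint8_t uqsub8(uint8_t a, uint8_t b) { return a > b ? (uint8_t)(a - b) : 0; }
static uint8_t umin8(uint8_t a, uint8_t b)  { return a < b ? a : b; }
static uint8_t umax8(uint8_t a, uint8_t b)  { return a > b ? a : b; }

/* One lane of DIFF_CLIP, following the macro instruction by instruction. */
static uint8_t diff_clip_lane(uint8_t s0, uint8_t d0, uint8_t diff)
{
    uint8_t t0 = uqadd8(d0, diff);   /* upper bound, saturated at 255 */
    uint8_t t1 = uqsub8(d0, diff);   /* lower bound, saturated at 0   */
    uint8_t i0 = umin8(s0, t0);
    return umax8(i0, t1);
}

/* The C comment above the macro, evaluated in plain int arithmetic. */
static int diff_clip_ref(int s0, int d0, int diff)
{
    int i0 = s0;

    assert(diff >= 0);
    if (i0 > d0 + diff)
        i0 = d0 + diff;
    else if (i0 < d0 - diff)
        i0 = d0 - diff;
    return i0;
}

/* Counts mismatching (s0, d0, diff) triples over all byte values. */
static long diff_clip_mismatches(void)
{
    long bad = 0;

    for (int s0 = 0; s0 < 256; s0++)
        for (int d0 = 0; d0 < 256; d0++)
            for (int diff = 0; diff < 256; diff++)
                bad += diff_clip_lane((uint8_t)s0, (uint8_t)d0, (uint8_t)diff)
                       != diff_clip_ref(s0, d0, diff);
    return bad;
}
```

The saturation is harmless here because s0 is itself a byte: a bound that
saturates at 0 or 255 can never be crossed by s0 anyway.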


[FFmpeg-devel] [PATCH v2 10/15] avfilter/vf_bwdif: Export C filter_line

2023-07-02 Thread John Cox

Needed for tail fixup of neon code

Signed-off-by: John Cox 
---
 libavfilter/bwdif.h|  5 +
 libavfilter/vf_bwdif.c | 10 +-
 2 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/libavfilter/bwdif.h b/libavfilter/bwdif.h
index ae1616d366..cce99953f3 100644
--- a/libavfilter/bwdif.h
+++ b/libavfilter/bwdif.h
@@ -48,4 +48,9 @@ void ff_bwdif_filter_edge_c(void *dst1, void *prev1, void 
*cur1, void *next1,
 void ff_bwdif_filter_intra_c(void *dst1, void *cur1, int w, int prefs, int 
mrefs,
  int prefs3, int mrefs3, int parity, int clip_max);
 
+void ff_bwdif_filter_line_c(void *dst1, void *prev1, void *cur1, void *next1,
+int w, int prefs, int mrefs, int prefs2, int 
mrefs2,
+int prefs3, int mrefs3, int prefs4, int mrefs4,
+int parity, int clip_max);
+
 #endif /* AVFILTER_BWDIF_H */
diff --git a/libavfilter/vf_bwdif.c b/libavfilter/vf_bwdif.c
index bec83111b4..26349da1fd 100644
--- a/libavfilter/vf_bwdif.c
+++ b/libavfilter/vf_bwdif.c
@@ -132,10 +132,10 @@ void ff_bwdif_filter_intra_c(void *dst1, void *cur1, int 
w, int prefs, int mrefs
 FILTER_INTRA()
 }
 
-static void filter_line_c(void *dst1, void *prev1, void *cur1, void *next1,
-  int w, int prefs, int mrefs, int prefs2, int mrefs2,
-  int prefs3, int mrefs3, int prefs4, int mrefs4,
-  int parity, int clip_max)
+void ff_bwdif_filter_line_c(void *dst1, void *prev1, void *cur1, void *next1,
+int w, int prefs, int mrefs, int prefs2, int 
mrefs2,
+int prefs3, int mrefs3, int prefs4, int mrefs4,
+int parity, int clip_max)
 {
 uint8_t *dst   = dst1;
 uint8_t *prev  = prev1;
@@ -363,7 +363,7 @@ av_cold void ff_bwdif_init_filter_line(BWDIFContext *s, int 
bit_depth)
 s->filter_edge  = filter_edge_16bit;
 } else {
 s->filter_intra = ff_bwdif_filter_intra_c;
-s->filter_line  = filter_line_c;
+s->filter_line  = ff_bwdif_filter_line_c;
 s->filter_edge  = ff_bwdif_filter_edge_c;
 }
 
-- 
2.39.2


[FFmpeg-devel] [PATCH v2 07/15] avfilter/vf_bwdif: Export C filter_edge

2023-07-02 Thread John Cox

Needed for tail fixup of neon code

Signed-off-by: John Cox 
---
 libavfilter/bwdif.h| 4 
 libavfilter/vf_bwdif.c | 8 
 2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/libavfilter/bwdif.h b/libavfilter/bwdif.h
index ae6f6ce223..ae1616d366 100644
--- a/libavfilter/bwdif.h
+++ b/libavfilter/bwdif.h
@@ -41,6 +41,10 @@ void ff_bwdif_init_filter_line(BWDIFContext *bwdif, int 
bit_depth);
 void ff_bwdif_init_x86(BWDIFContext *bwdif, int bit_depth);
 void ff_bwdif_init_aarch64(BWDIFContext *bwdif, int bit_depth);
 
+void ff_bwdif_filter_edge_c(void *dst1, void *prev1, void *cur1, void *next1,
+int w, int prefs, int mrefs, int prefs2, int 
mrefs2,
+int parity, int clip_max, int spat);
+
 void ff_bwdif_filter_intra_c(void *dst1, void *cur1, int w, int prefs, int 
mrefs,
  int prefs3, int mrefs3, int parity, int clip_max);
 
diff --git a/libavfilter/vf_bwdif.c b/libavfilter/vf_bwdif.c
index 035fc58670..bec83111b4 100644
--- a/libavfilter/vf_bwdif.c
+++ b/libavfilter/vf_bwdif.c
@@ -150,9 +150,9 @@ static void filter_line_c(void *dst1, void *prev1, void 
*cur1, void *next1,
 FILTER2()
 }
 
-static void filter_edge(void *dst1, void *prev1, void *cur1, void *next1,
-int w, int prefs, int mrefs, int prefs2, int mrefs2,
-int parity, int clip_max, int spat)
+void ff_bwdif_filter_edge_c(void *dst1, void *prev1, void *cur1, void *next1,
+int w, int prefs, int mrefs, int prefs2, int 
mrefs2,
+int parity, int clip_max, int spat)
 {
 uint8_t *dst   = dst1;
 uint8_t *prev  = prev1;
@@ -364,7 +364,7 @@ av_cold void ff_bwdif_init_filter_line(BWDIFContext *s, int 
bit_depth)
 } else {
 s->filter_intra = ff_bwdif_filter_intra_c;
 s->filter_line  = filter_line_c;
-s->filter_edge  = filter_edge;
+s->filter_edge  = ff_bwdif_filter_edge_c;
 }
 
 #if ARCH_X86
-- 
2.39.2


[FFmpeg-devel] [PATCH v2 11/15] avfilter/vf_bwdif: Add neon for filter_line

2023-07-02 Thread John Cox

Signed-off-by: John Cox 
---
 libavfilter/aarch64/vf_bwdif_init_aarch64.c |  21 ++
 libavfilter/aarch64/vf_bwdif_neon.S | 208 
 2 files changed, 229 insertions(+)

diff --git a/libavfilter/aarch64/vf_bwdif_init_aarch64.c 
b/libavfilter/aarch64/vf_bwdif_init_aarch64.c
index e75cf2f204..21e67884ab 100644
--- a/libavfilter/aarch64/vf_bwdif_init_aarch64.c
+++ b/libavfilter/aarch64/vf_bwdif_init_aarch64.c
@@ -31,6 +31,26 @@ void ff_bwdif_filter_edge_neon(void *dst1, void *prev1, void 
*cur1, void *next1,
 void ff_bwdif_filter_intra_neon(void *dst1, void *cur1, int w, int prefs, int 
mrefs,
 int prefs3, int mrefs3, int parity, int 
clip_max);
 
+void ff_bwdif_filter_line_neon(void *dst1, void *prev1, void *cur1, void 
*next1,
+   int w, int prefs, int mrefs, int prefs2, int 
mrefs2,
+   int prefs3, int mrefs3, int prefs4, int mrefs4,
+   int parity, int clip_max);
+
+
+static void filter_line_helper(void *dst1, void *prev1, void *cur1, void *next1,
+   int w, int prefs, int mrefs, int prefs2, int mrefs2,
+   int prefs3, int mrefs3, int prefs4, int mrefs4,
+   int parity, int clip_max)
+{
+const int w0 = clip_max != 255 ? 0 : w & ~15;
+
+ff_bwdif_filter_line_neon(dst1, prev1, cur1, next1,
+  w0, prefs, mrefs, prefs2, mrefs2, prefs3, mrefs3, prefs4, mrefs4, parity, clip_max);
+
+if (w0 < w)
+ff_bwdif_filter_line_c((char *)dst1 + w0, (char *)prev1 + w0, (char *)cur1 + w0, (char *)next1 + w0,
+   w - w0, prefs, mrefs, prefs2, mrefs2, prefs3, mrefs3, prefs4, mrefs4, parity, clip_max);
+}
 
 static void filter_edge_helper(void *dst1, void *prev1, void *cur1, void 
*next1,
int w, int prefs, int mrefs, int prefs2, int 
mrefs2,
@@ -71,6 +91,7 @@ ff_bwdif_init_aarch64(BWDIFContext *s, int bit_depth)
 return;
 
 s->filter_intra = filter_intra_helper;
+s->filter_line  = filter_line_helper;
 s->filter_edge  = filter_edge_helper;
 }
 
diff --git a/libavfilter/aarch64/vf_bwdif_neon.S 
b/libavfilter/aarch64/vf_bwdif_neon.S
index d6e7d109f5..abc050565c 100644
--- a/libavfilter/aarch64/vf_bwdif_neon.S
+++ b/libavfilter/aarch64/vf_bwdif_neon.S
@@ -149,6 +149,214 @@ coeffs:
 .hword  5570, 3801, 1016, -3801 // hf[0] = v0.h[2], 
-hf[1] = v0.h[5]
 .hword  5077, 981   // sp[0] = v0.h[6]
 
+// ===
+//
+// void filter_line(
+//  void *dst1, // x0
+//  void *prev1,// x1
+//  void *cur1, // x2
+//  void *next1,// x3
+//  int w,  // w4
+//  int prefs,  // w5
+//  int mrefs,  // w6
+//  int prefs2, // w7
+//  int mrefs2, // [sp, #0]
+//  int prefs3, // [sp, #SP_INT]
+//  int mrefs3, // [sp, #SP_INT*2]
+//  int prefs4, // [sp, #SP_INT*3]
+//  int mrefs4, // [sp, #SP_INT*4]
+//  int parity, // [sp, #SP_INT*5]
+//  int clip_max)   // [sp, #SP_INT*6]
+
+function ff_bwdif_filter_line_neon, export=1
+// Sanity check w
+cmp w4, #0
+ble 99f
+
+// Rearrange regs to be the same as line3 for ease of debug!
+mov w10, w4 // w10 = loop count
+mov w9,  w6 // w9  = mref
+mov w12, w7 // w12 = pref2
+mov w11, w5 // w11 = pref
+ldr w8,  [sp, #0]   // w8 =  mref2
+ldr w7,  [sp, #SP_INT*2]// w7  = mref3
+ldr w6,  [sp, #SP_INT*4]// w6  = mref4
+ldr w13, [sp, #SP_INT]  // w13 = pref3
+ldr w14, [sp, #SP_INT*3]// w14 = pref4
+
+mov x4,  x3
+mov x3,  x2
+mov x2,  x1
+
+// #define prev2 cur
+//const uint8_t * restrict next2 = parity ? prev : next;
+ldr w17, [sp, #SP_INT*5]// parity
+cmp w17, #0
+csel    x17, x2, x4, ne
+
+PUSH_VREGS
+
+ldr q0, coeffs
+
+// for (x = 0; x < w; x++) {
+// int diff0, diff2;
+// int d0, d2;
+// int temporal_diff0, temporal_diff2;
+//
+// int i1, i2;
+// int j1, j2;
+// int p6, p5, p4, p3, p2, p1, c0, m1, m2, m3, m4;
+
+10:
+// c0 = prev2[0] + next2[0];// c0 = v20, v21
+// d0  = c0 >> 1;   // d0 = v10
+// temporal_diff0

[FFmpeg-devel] [PATCH v2 12/15] avfilter/vf_bwdif: Add a filter_line3 method for optimisation

2023-07-02 Thread John Cox

Add an optional filter_line3 to the available optimisations.

filter_line3 is equivalent to filter_line, memcpy, filter_line.

filter_line shares a number of loads and some calculations with its
next iteration, and testing shows that the aarch64 neon filter_line3's
performance is 30% better than two filter_lines and a memcpy.

Signed-off-by: John Cox 
---
 libavfilter/bwdif.h|  7 +++
 libavfilter/vf_bwdif.c | 31 +++
 2 files changed, 38 insertions(+)

diff --git a/libavfilter/bwdif.h b/libavfilter/bwdif.h
index cce99953f3..496cec72ef 100644
--- a/libavfilter/bwdif.h
+++ b/libavfilter/bwdif.h
@@ -35,6 +35,9 @@ typedef struct BWDIFContext {
 void (*filter_edge)(void *dst, void *prev, void *cur, void *next,
 int w, int prefs, int mrefs, int prefs2, int mrefs2,
 int parity, int clip_max, int spat);
+void (*filter_line3)(void *dst, int dstride,
+ const void *prev, const void *cur, const void *next, 
int prefs,
+ int w, int parity, int clip_max);
 } BWDIFContext;
 
 void ff_bwdif_init_filter_line(BWDIFContext *bwdif, int bit_depth);
@@ -53,4 +56,8 @@ void ff_bwdif_filter_line_c(void *dst1, void *prev1, void 
*cur1, void *next1,
 int prefs3, int mrefs3, int prefs4, int mrefs4,
 int parity, int clip_max);
 
+void ff_bwdif_filter_line3_c(void * dst1, int d_stride,
+ const void * prev1, const void * cur1, const void 
* next1, int s_stride,
+ int w, int parity, int clip_max);
+
 #endif /* AVFILTER_BWDIF_H */
diff --git a/libavfilter/vf_bwdif.c b/libavfilter/vf_bwdif.c
index 26349da1fd..52bc676cf8 100644
--- a/libavfilter/vf_bwdif.c
+++ b/libavfilter/vf_bwdif.c
@@ -150,6 +150,31 @@ void ff_bwdif_filter_line_c(void *dst1, void *prev1, void 
*cur1, void *next1,
 FILTER2()
 }
 
+#define NEXT_LINE()\
+dst += d_stride; \
+prev += prefs; \
+cur  += prefs; \
+next += prefs;
+
+void ff_bwdif_filter_line3_c(void * dst1, int d_stride,
+ const void * prev1, const void * cur1, const void * next1, int s_stride,
+ int w, int parity, int clip_max)
+{
+const int prefs = s_stride;
+uint8_t * dst  = dst1;
+const uint8_t * prev = prev1;
+const uint8_t * cur  = cur1;
+const uint8_t * next = next1;
+
+ff_bwdif_filter_line_c(dst, (void*)prev, (void*)cur, (void*)next, w,
+   prefs, -prefs, prefs * 2, - prefs * 2, prefs * 3, -prefs * 3, prefs * 4, -prefs * 4, parity, clip_max);
+NEXT_LINE();
+memcpy(dst, cur, w);
+NEXT_LINE();
+ff_bwdif_filter_line_c(dst, (void*)prev, (void*)cur, (void*)next, w,
+   prefs, -prefs, prefs * 2, - prefs * 2, prefs * 3, -prefs * 3, prefs * 4, -prefs * 4, parity, clip_max);
+}
+
 void ff_bwdif_filter_edge_c(void *dst1, void *prev1, void *cur1, void *next1,
 int w, int prefs, int mrefs, int prefs2, int 
mrefs2,
 int parity, int clip_max, int spat)
@@ -244,6 +269,11 @@ static int filter_slice(AVFilterContext *ctx, void *arg, 
int jobnr, int nb_jobs)
refs << 1, -(refs << 1),
td->parity ^ td->tff, clip_max,
(y < 2) || ((y + 3) > td->h) ? 0 : 1);
+} else if (s->filter_line3 && y + 2 < slice_end && y + 6 < td->h) {
+s->filter_line3(dst, td->frame->linesize[td->plane],
+prev, cur, next, linesize, td->w,
+td->parity ^ td->tff, clip_max);
+y += 2;
 } else {
 s->filter_line(dst, prev, cur, next, td->w,
refs, -refs, refs << 1, -(refs << 1),
@@ -357,6 +387,7 @@ static int config_props(AVFilterLink *link)
 
 av_cold void ff_bwdif_init_filter_line(BWDIFContext *s, int bit_depth)
 {
+s->filter_line3 = 0;
 if (bit_depth > 8) {
 s->filter_intra = filter_intra_16bit;
 s->filter_line  = filter_line_c_16bit;
-- 
2.39.2


[FFmpeg-devel] [PATCH v2 13/15] avfilter/vf_bwdif: Add neon for filter_line3

2023-07-02 Thread John Cox

Signed-off-by: John Cox 
---
 libavfilter/aarch64/vf_bwdif_init_aarch64.c |  28 ++
 libavfilter/aarch64/vf_bwdif_neon.S | 272 
 2 files changed, 300 insertions(+)

diff --git a/libavfilter/aarch64/vf_bwdif_init_aarch64.c 
b/libavfilter/aarch64/vf_bwdif_init_aarch64.c
index 21e67884ab..f52bc4b9b4 100644
--- a/libavfilter/aarch64/vf_bwdif_init_aarch64.c
+++ b/libavfilter/aarch64/vf_bwdif_init_aarch64.c
@@ -36,6 +36,33 @@ void ff_bwdif_filter_line_neon(void *dst1, void *prev1, void 
*cur1, void *next1,
int prefs3, int mrefs3, int prefs4, int mrefs4,
int parity, int clip_max);
 
+void ff_bwdif_filter_line3_neon(void * dst1, int d_stride,
+const void * prev1, const void * cur1, const 
void * next1, int s_stride,
+int w, int parity, int clip_max);
+
+
+static void filter_line3_helper(void * dst1, int d_stride,
+const void * prev1, const void * cur1, const void * next1, int s_stride,
+int w, int parity, int clip_max)
+{
+// Asm works on 16 byte chunks
+// If w is a multiple of 16 then all is good - if not then if width rounded
+// up to nearest 16 will fit in both src & dst strides then allow the asm
+// to write over the padding bytes as that is almost certainly faster than
+// having to invoke the C version to clean up the tail.
+const int w1 = FFALIGN(w, 16);
+const int w0 = clip_max != 255 ? 0 :
+   d_stride <= w1 && s_stride <= w1 ? w : w & ~15;
+
+ff_bwdif_filter_line3_neon(dst1, d_stride,
+   prev1, cur1, next1, s_stride,
+   w0, parity, clip_max);
+
+if (w0 < w)
+ff_bwdif_filter_line3_c((char *)dst1 + w0, d_stride,
+(const char *)prev1 + w0, (const char *)cur1 + w0, (const char *)next1 + w0, s_stride,
+w - w0, parity, clip_max);
+}
 
 static void filter_line_helper(void *dst1, void *prev1, void *cur1, void 
*next1,
int w, int prefs, int mrefs, int prefs2, int 
mrefs2,
@@ -93,5 +120,6 @@ ff_bwdif_init_aarch64(BWDIFContext *s, int bit_depth)
 s->filter_intra = filter_intra_helper;
 s->filter_line  = filter_line_helper;
 s->filter_edge  = filter_edge_helper;
+s->filter_line3 = filter_line3_helper;
 }
 
diff --git a/libavfilter/aarch64/vf_bwdif_neon.S 
b/libavfilter/aarch64/vf_bwdif_neon.S
index abc050565c..1405ea10fb 100644
--- a/libavfilter/aarch64/vf_bwdif_neon.S
+++ b/libavfilter/aarch64/vf_bwdif_neon.S
@@ -149,6 +149,278 @@ coeffs:
 .hword  5570, 3801, 1016, -3801 // hf[0] = v0.h[2], 
-hf[1] = v0.h[5]
 .hword  5077, 981   // sp[0] = v0.h[6]
 
+// ===
+//
+// void ff_bwdif_filter_line3_neon(
+// void * dst1, // x0
+// int d_stride,// w1
+// const void * prev1,  // x2
+// const void * cur1,   // x3
+// const void * next1,  // x4
+// int s_stride,// w5
+// int w,   // w6
+// int parity,  // w7
+// int clip_max);   // [sp, #0] (Ignored)
+
+function ff_bwdif_filter_line3_neon, export=1
+// Sanity check w
+cmp w6, #0
+ble 99f
+
+// #define prev2 cur
+//const uint8_t * restrict next2 = parity ? prev : next;
+cmp w7, #0
+csel    x17, x2, x4, ne
+
+// We want all the V registers - save all the ones we must
+PUSH_VREGS
+
+ldr q0, coeffs
+
+// Some rearrangement of initial values for nice layout of refs in regs
+mov w10, w6 // w10 = loop count
+neg w9,  w5 // w9  = mref
+lsl w8,  w9,  #1// w8 =  mref2
+add w7,  w9,  w9, LSL #1// w7  = mref3
+lsl w6,  w9,  #2// w6  = mref4
+mov w11, w5 // w11 = pref
+lsl w12, w5,  #1// w12 = pref2
+add w13, w5,  w5, LSL #1// w13 = pref3
+lsl w14, w5,  #2// w14 = pref4
+add w15, w5,  w5, LSL #2// w15 = pref5
+add w16, w14, w12   // w16 = pref6
+
+lsl w5,  w1,  #1// w5 = d_stride * 2
+
+// for (x = 0; x < w; x++) {
+// int diff0, diff2;
+// int d0, d2;
+// int temporal_diff0, temporal_diff2;
+//
+//

[FFmpeg-devel] [PATCH v2 14/15] tests/checkasm: Add test for vf_bwdif filter_line3

2023-07-02 Thread John Cox

Signed-off-by: John Cox 
---
 tests/checkasm/vf_bwdif.c | 81 +++
 1 file changed, 81 insertions(+)

diff --git a/tests/checkasm/vf_bwdif.c b/tests/checkasm/vf_bwdif.c
index 5fdba09fdc..3399cacdf7 100644
--- a/tests/checkasm/vf_bwdif.c
+++ b/tests/checkasm/vf_bwdif.c
@@ -28,6 +28,10 @@
 for (size_t i = 0; i < count; i++) \
 buf0[i] = buf1[i] = rnd() & mask
 
+#define randomize_overflow_check(buf0, buf1, mask, count) \
+for (size_t i = 0; i < count; i++) \
+buf0[i] = buf1[i] = (rnd() & 1) != 0 ? mask : 0;
+
 #define BODY(type, depth)  
\
 do {   
\
 type prev0[9*WIDTH], prev1[9*WIDTH];   
\
@@ -83,6 +87,83 @@ void checkasm_check_vf_bwdif(void)
 report("bwdif10");
 }
 
+if (!ctx_8.filter_line3)
+ctx_8.filter_line3 = ff_bwdif_filter_line3_c;
+
+{
+LOCAL_ALIGNED_16(uint8_t, prev0, [11*WIDTH]);
+LOCAL_ALIGNED_16(uint8_t, prev1, [11*WIDTH]);
+LOCAL_ALIGNED_16(uint8_t, next0, [11*WIDTH]);
+LOCAL_ALIGNED_16(uint8_t, next1, [11*WIDTH]);
+LOCAL_ALIGNED_16(uint8_t, cur0,  [11*WIDTH]);
+LOCAL_ALIGNED_16(uint8_t, cur1,  [11*WIDTH]);
+LOCAL_ALIGNED_16(uint8_t, dst0,  [WIDTH*3]);
+LOCAL_ALIGNED_16(uint8_t, dst1,  [WIDTH*3]);
+const int stride = WIDTH;
+const int mask = (1<<8)-1;
+int parity;
+
+for (parity = 0; parity != 2; ++parity) {
+if (check_func(ctx_8.filter_line3, "bwdif8.line3.rnd.p%d", 
parity)) {
+
+declare_func(void, void * dst1, int d_stride,
+  const void * prev1, const void * 
cur1, const void * next1, int prefs,
+  int w, int parity, int clip_max);
+
+randomize_buffers(prev0, prev1, mask, 11*WIDTH);
+randomize_buffers(next0, next1, mask, 11*WIDTH);
+randomize_buffers( cur0,  cur1, mask, 11*WIDTH);
+
+call_ref(dst0, stride,
+ prev0 + stride * 4, cur0 + stride * 4, next0 + stride 
* 4, stride,
+ WIDTH, parity, mask);
+call_new(dst1, stride,
+ prev1 + stride * 4, cur1 + stride * 4, next1 + stride 
* 4, stride,
+ WIDTH, parity, mask);
+
+if (memcmp(dst0, dst1, WIDTH*3)
+|| memcmp(prev0, prev1, WIDTH*11)
+|| memcmp(next0, next1, WIDTH*11)
+|| memcmp( cur0,  cur1, WIDTH*11))
+fail();
+
+bench_new(dst1, stride,
+ prev1 + stride * 4, cur1 + stride * 4, next1 + stride 
* 4, stride,
+ WIDTH, parity, mask);
+}
+}
+
+// Use just 0s and ~0s to try to provoke bad cropping or overflow
+// Parity makes no difference to this test so just test 0
+if (check_func(ctx_8.filter_line3, "bwdif8.line3.overflow")) {
+
+declare_func(void, void * dst1, int d_stride,
+  const void * prev1, const void * cur1, 
const void * next1, int prefs,
+  int w, int parity, int clip_max);
+
+randomize_overflow_check(prev0, prev1, mask, 11*WIDTH);
+randomize_overflow_check(next0, next1, mask, 11*WIDTH);
+randomize_overflow_check( cur0,  cur1, mask, 11*WIDTH);
+
+call_ref(dst0, stride,
+ prev0 + stride * 4, cur0 + stride * 4, next0 + stride * 
4, stride,
+ WIDTH, 0, mask);
+call_new(dst1, stride,
+ prev1 + stride * 4, cur1 + stride * 4, next1 + stride * 
4, stride,
+ WIDTH, 0, mask);
+
+if (memcmp(dst0, dst1, WIDTH*3)
+|| memcmp(prev0, prev1, WIDTH*11)
+|| memcmp(next0, next1, WIDTH*11)
+|| memcmp( cur0,  cur1, WIDTH*11))
+fail();
+
+// No point to benching
+}
+
+report("bwdif8.line3");
+}
+
 {
 LOCAL_ALIGNED_16(uint8_t, prev0, [11*WIDTH]);
 LOCAL_ALIGNED_16(uint8_t, prev1, [11*WIDTH]);
-- 
2.39.2


[FFmpeg-devel] [PATCH v2 15/15] avfilter/vf_bwdif: Block filter slices into a multiple of 4 lines

2023-07-02 Thread John Cox

Round job start lines down to a multiple of 4. This means that if
filter_line3 exists, filter_line is no longer occasionally called once
at the end of a slice, depending on thread count. The final slice may
process up to 3 extra lines, but filter_edge is faster than filter_line,
so this is unlikely to create any noticeable variation in thread load.

Signed-off-by: John Cox 
---
 libavfilter/vf_bwdif.c | 13 ++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/libavfilter/vf_bwdif.c b/libavfilter/vf_bwdif.c
index 52bc676cf8..6701208efe 100644
--- a/libavfilter/vf_bwdif.c
+++ b/libavfilter/vf_bwdif.c
@@ -237,6 +237,13 @@ static void filter_edge_16bit(void *dst1, void *prev1, 
void *cur1, void *next1,
 FILTER2()
 }
 
+// Round job start line down to multiple of 4 so that if filter_line3 exists
+// and the frame is a multiple of 4 high then filter_line will never be called
+static inline int job_start(const int jobnr, const int nb_jobs, const int h)
+{
+return jobnr >= nb_jobs ? h : ((h * jobnr) / nb_jobs) & ~3;
+}
+
 static int filter_slice(AVFilterContext *ctx, void *arg, int jobnr, int 
nb_jobs)
 {
 BWDIFContext *s = ctx->priv;
@@ -246,8 +253,8 @@ static int filter_slice(AVFilterContext *ctx, void *arg, 
int jobnr, int nb_jobs)
 int clip_max = (1 << (yadif->csp->comp[td->plane].depth)) - 1;
 int df = (yadif->csp->comp[td->plane].depth + 7) / 8;
 int refs = linesize / df;
-int slice_start = (td->h *  jobnr   ) / nb_jobs;
-int slice_end   = (td->h * (jobnr+1)) / nb_jobs;
+int slice_start = job_start(jobnr, nb_jobs, td->h);
+int slice_end   = job_start(jobnr + 1, nb_jobs, td->h);
 int y;
 
 for (y = slice_start; y < slice_end; y++) {
@@ -310,7 +317,7 @@ static void filter(AVFilterContext *ctx, AVFrame *dstpic,
 td.plane = i;
 
 ff_filter_execute(ctx, filter_slice, &td, NULL,
-  FFMIN(h, ff_filter_get_nb_threads(ctx)));
+  FFMIN((h+3)/4, ff_filter_get_nb_threads(ctx)));
 }
 if (yadif->current_field == YADIF_FIELD_END) {
 yadif->current_field = YADIF_FIELD_NORMAL;
-- 
2.39.2
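
The slice rounding described in the commit message above can be tried
outside of FFmpeg. What follows is only a minimal sketch: job_start() is
copied from the patch (with an added sanity assert), while job_count() is
a hypothetical name for the FFMIN((h+3)/4, nb_threads) expression that
the patch passes to ff_filter_execute().

```c
#include <assert.h>

/* Stand-alone copy of the patch's job_start(): every slice boundary
 * except the final one is rounded down to a multiple of 4 lines. */
static int job_start(int jobnr, int nb_jobs, int h)
{
    assert(nb_jobs > 0 && h >= 0);
    return jobnr >= nb_jobs ? h : ((h * jobnr) / nb_jobs) & ~3;
}

/* Hypothetical helper: never request more jobs than there are
 * blocks of 4 lines, so that no slice ends up empty by design. */
static int job_count(int h, int nb_threads)
{
    const int blocks = (h + 3) / 4;
    return blocks < nb_threads ? blocks : nb_threads;
}
```

For h = 1080 and 7 jobs the first boundary moves from line 154 to line
152, and the last slice always ends at h regardless of rounding.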


Re: [FFmpeg-devel] [PATCH v2 12/15] avfilter/vf_bwdif: Add a filter_line3 method for optimisation

2023-07-03 Thread John Cox

On Mon, 3 Jul 2023 00:12:46 +0300 (EEST), you wrote:

>On Sun, 2 Jul 2023, Thomas Mundt wrote:
>
>> On Sun, 2 July 2023 at 14:34, John Cox wrote:
>>   Add an optional filter_line3 to the available optimisations.
>>
>>   filter_line3 is equivalent to filter_line, memcpy, filter_line
>>
>>   filter_line shares quite a number of loads and some calculations
>>   in common with its next iteration and testing shows that using
>>   aarch64 neon filter_line3's performance is 30% better than two
>>   filter_lines and a memcpy.
>>
>>   Signed-off-by: John Cox 
>>   ---
>>    libavfilter/bwdif.h    |  7 +++
>>    libavfilter/vf_bwdif.c | 31 +++
>>    2 files changed, 38 insertions(+)
>>
>>   diff --git a/libavfilter/bwdif.h b/libavfilter/bwdif.h
>>   index cce99953f3..496cec72ef 100644
>>   --- a/libavfilter/bwdif.h
>>   +++ b/libavfilter/bwdif.h
>>   @@ -35,6 +35,9 @@ typedef struct BWDIFContext {
>>        void (*filter_edge)(void *dst, void *prev, void *cur, void
>>   *next,
>>                            int w, int prefs, int mrefs, int
>>   prefs2, int mrefs2,
>>                            int parity, int clip_max, int spat);
>>   +    void (*filter_line3)(void *dst, int dstride,
>>   +                         const void *prev, const void *cur,
>>   const void *next, int prefs,
>>   +                         int w, int parity, int clip_max);
>>    } BWDIFContext;
>>
>>    void ff_bwdif_init_filter_line(BWDIFContext *bwdif, int
>>   bit_depth);
>>   @@ -53,4 +56,8 @@ void ff_bwdif_filter_line_c(void *dst1, void
>>   *prev1, void *cur1, void *next1,
>>                                int prefs3, int mrefs3, int prefs4,
>>   int mrefs4,
>>                                int parity, int clip_max);
>>
>>   +void ff_bwdif_filter_line3_c(void * dst1, int d_stride,
>>   +                             const void * prev1, const void *
>>   cur1, const void * next1, int s_stride,
>>   +                             int w, int parity, int clip_max);
>>   +
>>    #endif /* AVFILTER_BWDIF_H */
>>   diff --git a/libavfilter/vf_bwdif.c b/libavfilter/vf_bwdif.c
>>   index 26349da1fd..52bc676cf8 100644
>>   --- a/libavfilter/vf_bwdif.c
>>   +++ b/libavfilter/vf_bwdif.c
>>   @@ -150,6 +150,31 @@ void ff_bwdif_filter_line_c(void *dst1,
>>   void *prev1, void *cur1, void *next1,
>>        FILTER2()
>>    }
>>
>>   +#define NEXT_LINE()\
>>   +    dst += d_stride; \
>>   +    prev += prefs; \
>>   +    cur  += prefs; \
>>   +    next += prefs;
>>   +
>>   +void ff_bwdif_filter_line3_c(void * dst1, int d_stride,
>>   +                             const void * prev1, const void *
>>   cur1, const void * next1, int s_stride,
>>   +                             int w, int parity, int clip_max)
>>   +{
>>   +    const int prefs = s_stride;
>>   +    uint8_t * dst  = dst1;
>>   +    const uint8_t * prev = prev1;
>>   +    const uint8_t * cur  = cur1;
>>   +    const uint8_t * next = next1;
>>   +
>>   +    ff_bwdif_filter_line_c(dst, (void*)prev, (void*)cur,
>>   (void*)next, w,
>>   +                           prefs, -prefs, prefs * 2, - prefs *
>>   2, prefs * 3, -prefs * 3, prefs * 4, -prefs * 4, parity,
>>   clip_max);
>>   +    NEXT_LINE();
>>   +    memcpy(dst, cur, w);
>>   +    NEXT_LINE();
>>   +    ff_bwdif_filter_line_c(dst, (void*)prev, (void*)cur,
>>   (void*)next, w,
>>   +                           prefs, -prefs, prefs * 2, - prefs *
>>   2, prefs * 3, -prefs * 3, prefs * 4, -prefs * 4, parity,
>>   clip_max);
>>   +}
>>   +
>>    void ff_bwdif_filter_edge_c(void *dst1, void *prev1, void
>>   *cur1, void *next1,
>>                                int w, int prefs, int mrefs, int
>>   prefs2, int mrefs2,
>>                                int parity, int clip_max, int spat)
>>   @@ -244,6 +269,11 @@ static int filter_slice(AVFilterContext
>>   *ctx, void *arg, int jobnr, int nb_jobs)
>>                                   refs << 1, -(refs << 1),
>>

Re: [FFmpeg-devel] [PATCH 02/15] avfilter/vf_bwdif: Add common macros and consts for aarch64 neon

2023-07-03 Thread John Cox

On Mon, 3 Jul 2023 00:02:27 +0300 (EEST), you wrote:

>On Sun, 2 Jul 2023, Martin Storsjö wrote:
>
>> On Sun, 2 Jul 2023, John Cox wrote:
>>
>>> On Sun, 2 Jul 2023 00:35:14 +0300 (EEST), you wrote:
>>> 
>>>> On Thu, 29 Jun 2023, John Cox wrote:
>>>> 
>>>>> Add macros for dual scalar half->single multiply and accumulate
>>>>> Add macro for shift, saturate and shorten single to byte
>>>>> Add filter constants
>>>>> 
>>>>> Signed-off-by: John Cox 
>>>>> ---
>>>>> libavfilter/aarch64/vf_bwdif_neon.S | 46 +
>>>>> 1 file changed, 46 insertions(+)
>>>>> 
>>>>> +
>>>>> +.align 16
>>>> 
>>>> Note that .align for arm is power of two; this triggers a 2^16 byte
>>>> alignment here, which certainly isn't intended.
>>> 
>>> Yikes! I'll swap that for a .balign now I've looked that up
>>> 
>>>> But in general, the usual way of defining constants is with a
>>>> const/endconst block, which places them in the right rdata section instead
>>>> of in the text section. But that then requires you to use a movrel macro
>>>> for accessing the data, instead of a plain ldr instruction.
>>> 
>>> Yeah - arm has a history of mixing text & const - I went with the
>>> simpler code. Is this a deal breaker or can I leave it as is?
>>
>> I wouldn't treat it as a deal breaker as long as it works across all
>> platforms - even if consistency with the code style elsewhere is preferred,
>> but IIRC there may be issues with MS armasm (after being passed through
>> gas-preprocessor). IIRC there might be issues with starting out with
>> straight up content without the full setup made by the function/const
>> macros. OTOH I might have fixed that in gas-preprocessor too...
>>
>> Last time around, the patchset failed building in that configuration due to
>> the out of range alignment, I'll see how it fares now.
>
>I'm sorry, but I'd just recommend you to go with the const macros.
>
>Your current patch fails because gas-preprocessor, 
>https://github.com/ffmpeg/gas-preprocessor, doesn't support the .balign 
>directive, it only recognizes .align and .p2align. (Extending it to 
>support it would be trivial though.)
>
>If I change your code to ".align 4", I get the following warning:
>
>\home\martin\code\ffmpeg-msvc-arm64\libavfilter\aarch64\vf_bwdif_neon.o.asm(1011) : warning A4228: Alignment value exceeds AREA alignment; alignment not guaranteed
> ALIGN 16
>
>Since we haven't started any section, apparently armasm defaults to a 
>section with 4 byte alignment.
>
>But anyway, regardless of the alignment, it later fails with this error:
>
>\home\martin\code\ffmpeg-msvc-arm64\libavfilter\aarch64\vf_bwdif_neon.o.asm(1051) : error A2504: operand 2: Expected address
> ldr q0, coeffs
>
>
>So I would request you to just go with the macros we use elsewhere. The 
>gas-preprocessor/armasm setup doesn't support/expect any random assembly, 
>but the disciplined subset we normally write. In most cases, we 
>essentially never write bare directives in the code, but only use the 
>macros from asm.S, which are set up to handle portability across all 
>supported platforms and their toolchains.

OK - will do.

JC

>// Martin
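
For anyone skimming the thread, the alignment mix-up discussed above comes
down to how the directive argument is interpreted. A minimal sketch in C
(the function names are made up for illustration; only the arithmetic
reflects GNU as behaviour on arm/aarch64, where .align n and .p2align n
take an exponent and .balign n takes a byte count):

```c
#include <assert.h>

/* ".align n" / ".p2align n" on arm and aarch64: n is a power-of-two
 * exponent, so the requested alignment is 2^n bytes. */
static unsigned long align_exponent_bytes(unsigned n)
{
    assert(n < 32);
    return 1UL << n;
}

/* ".balign n": n is already the alignment in bytes. */
static unsigned long balign_bytes(unsigned n)
{
    return n;
}
```

So ".align 16" asks for 65536-byte alignment, while ".align 4" and
".balign 16" both ask for 16 bytes.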

Re: [FFmpeg-devel] [PATCH v2 00/15] avfilter/vf_bwdif: Add aarch64 neon functions

2023-07-03 Thread John Cox

On Mon, 3 Jul 2023 00:09:52 +0300 (EEST), you wrote:

>On Sun, 2 Jul 2023, John Cox wrote:
>
>> Also adds a filter_line3 method which on aarch64 neon yields approx 30%
>> speedup over 2xfilter_line and a memcpy
>>
>> Differences from v1:
>> .align 16 corrected to .balign 16
>> SXTW tolower
>> Mac ABI (hopefully) fixed
>> V register pop/push macroed & prettified
>>
>> John Cox (15):
>>  avfilter/vf_bwdif: Add outline for aarch neon functions
>>  avfilter/vf_bwdif: Add common macros and consts for aarch64 neon
>>  avfilter/vf_bwdif: Export C filter_intra
>>  avfilter/vf_bwdif: Add neon for filter_intra
>>  tests/checkasm: Add test for vf_bwdif filter_intra
>>  avfilter/vf_bwdif: Add clip and spatial macros for aarch64 neon
>>  avfilter/vf_bwdif: Export C filter_edge
>>  avfilter/vf_bwdif: Add neon for filter_edge
>>  tests/checkasm: Add test for vf_bwdif filter_edge
>>  avfilter/vf_bwdif: Export C filter_line
>>  avfilter/vf_bwdif: Add neon for filter_line
>>  avfilter/vf_bwdif: Add a filter_line3 method for optimisation
>>  avfilter/vf_bwdif: Add neon for filter_line3
>>  tests/checkasm: Add test for vf_bwdif filter_line3
>>  avfilter/vf_bwdif: Block filter slices into a multiple of 4 lines
>
>Overall, I'd suggest squashing/reordering the patches like this:
>
>- tests/checkasm: Add test for vf_bwdif filter_intra
>- avfilter/vf_bwdif: Add neon for filter_intra
>   (With the preceding patches squashed. For extra common macros, only add
>   the ones you use in this patch here.)
>- tests/checkasm: Add test for vf_bwdif filter_edge
>- avfilter/vf_bwdif: Add neon for filter_edge (with other dependencies
>   squashed)
>- avfilter/vf_bwdif: Add neon for filter_line
>- avfilter/vf_bwdif: Add a filter_line3 method for optimisation
>   + checkasm test squashed
>- avfilter/vf_bwdif: Add neon for filter_line3

I'm happy with that if everyone else is - it is easy to merge patches -
harder to take them apart.

JC

>// Martin

[FFmpeg-devel] [PATCH v3 0/7] avfilter/vf_bwdif: Add aarch64 neon functions

2023-07-03 Thread John Cox

Also adds a filter_line3 method which on aarch64 neon yields approx 30%
speedup over 2xfilter_line and a memcpy

Differences from v2:
coeffs moved into const segment
number of patches reduced

John Cox (7):
  tests/checkasm: Add test for vf_bwdif filter_intra
  avfilter/vf_bwdif: Add neon for filter_intra
  tests/checkasm: Add test for vf_bwdif filter_edge
  avfilter/vf_bwdif: Add neon for filter_edge
  avfilter/vf_bwdif: Add neon for filter_line Exports C filter_line
needed for tail fixup of neon code
  avfilter/vf_bwdif: Add a filter_line3 method for optimisation
  avfilter/vf_bwdif: Add neon for filter_line3

 libavfilter/aarch64/Makefile|   2 +
 libavfilter/aarch64/vf_bwdif_init_aarch64.c | 125 +++
 libavfilter/aarch64/vf_bwdif_neon.S | 793 
 libavfilter/bwdif.h |  20 +
 libavfilter/vf_bwdif.c  |  70 +-
 tests/checkasm/vf_bwdif.c   | 172 +
 6 files changed, 1167 insertions(+), 15 deletions(-)
 create mode 100644 libavfilter/aarch64/vf_bwdif_init_aarch64.c
 create mode 100644 libavfilter/aarch64/vf_bwdif_neon.S

-- 
2.39.2


[FFmpeg-devel] [PATCH v3 1/7] tests/checkasm: Add test for vf_bwdif filter_intra

2023-07-03 Thread John Cox

Signed-off-by: John Cox 
---
 tests/checkasm/vf_bwdif.c | 37 +
 1 file changed, 37 insertions(+)

diff --git a/tests/checkasm/vf_bwdif.c b/tests/checkasm/vf_bwdif.c
index 46224bb575..034bbabb4c 100644
--- a/tests/checkasm/vf_bwdif.c
+++ b/tests/checkasm/vf_bwdif.c
@@ -20,6 +20,7 @@
 #include "checkasm.h"
 #include "libavcodec/internal.h"
 #include "libavfilter/bwdif.h"
+#include "libavutil/mem_internal.h"
 
 #define WIDTH 256
 
@@ -81,4 +82,40 @@ void checkasm_check_vf_bwdif(void)
 BODY(uint16_t, 10);
 report("bwdif10");
 }
+
+if (check_func(ctx_8.filter_intra, "bwdif8.intra")) {
+LOCAL_ALIGNED_16(uint8_t, cur0,  [11*WIDTH]);
+LOCAL_ALIGNED_16(uint8_t, cur1,  [11*WIDTH]);
+LOCAL_ALIGNED_16(uint8_t, dst0,  [WIDTH*3]);
+LOCAL_ALIGNED_16(uint8_t, dst1,  [WIDTH*3]);
+const int stride = WIDTH;
+const int mask = (1<<8)-1;
+
+declare_func(void, void *dst1, void *cur1, int w, int prefs, int mrefs,
+ int prefs3, int mrefs3, int parity, int clip_max);
+
+randomize_buffers( cur0,  cur1, mask, 11*WIDTH);
+memset(dst0, 0xba, WIDTH * 3);
+memset(dst1, 0xba, WIDTH * 3);
+
+call_ref(dst0 + stride,
+ cur0 + stride * 4, WIDTH,
+ stride, -stride, stride * 3, -stride * 3,
+ 0, mask);
+call_new(dst1 + stride,
+ cur0 + stride * 4, WIDTH,
+ stride, -stride, stride * 3, -stride * 3,
+ 0, mask);
+
+if (memcmp(dst0, dst1, WIDTH*3)
+|| memcmp( cur0,  cur1, WIDTH*11))
+fail();
+
+bench_new(dst1 + stride,
+  cur0 + stride * 4, WIDTH,
+  stride, -stride, stride * 3, -stride * 3,
+  0, mask);
+
+report("bwdif8.intra");
+}
 }
-- 
2.39.2
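
The test in the patch above pre-fills both destination buffers with 0xba
and compares all three rows, so a write outside the middle row shows up as
a mismatch. A minimal stand-alone sketch of that idea, without the
checkasm macros; ROW, copy_ref, copy_overrun and outputs_match are
hypothetical names used only for illustration:

```c
#include <stdint.h>
#include <string.h>

#define ROW 16  /* hypothetical row width for this sketch */

typedef void (*row_fn)(uint8_t *dst, const uint8_t *src, int w);

/* Reference implementation: writes exactly w bytes. */
static void copy_ref(uint8_t *dst, const uint8_t *src, int w)
{
    memcpy(dst, src, (size_t)w);
}

/* Deliberately broken implementation: writes one byte past the row. */
static void copy_overrun(uint8_t *dst, const uint8_t *src, int w)
{
    memcpy(dst, src, (size_t)w);
    dst[w] = 0;
}

/* Run both implementations into the middle row of guard-filled
 * buffers and compare the whole buffers, guard rows included. */
static int outputs_match(row_fn ref, row_fn impl, const uint8_t *src)
{
    uint8_t dst0[ROW * 3], dst1[ROW * 3];

    memset(dst0, 0xba, sizeof(dst0));
    memset(dst1, 0xba, sizeof(dst1));
    ref(dst0 + ROW, src, ROW);
    impl(dst1 + ROW, src, ROW);
    return memcmp(dst0, dst1, sizeof(dst0)) == 0;
}
```

Comparing the full buffers rather than just the written row is what
catches the overrun in copy_overrun.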


[FFmpeg-devel] [PATCH v3 2/7] avfilter/vf_bwdif: Add neon for filter_intra

2023-07-03 Thread John Cox

Adds an outline for aarch neon functions
Adds common macros and consts for aarch64 neon
Exports C filter_intra needed for tail fixup of neon code
Adds neon for filter_intra

Signed-off-by: John Cox 
---
 libavfilter/aarch64/Makefile|   2 +
 libavfilter/aarch64/vf_bwdif_init_aarch64.c |  56 
 libavfilter/aarch64/vf_bwdif_neon.S | 136 
 libavfilter/bwdif.h |   4 +
 libavfilter/vf_bwdif.c  |   8 +-
 5 files changed, 203 insertions(+), 3 deletions(-)
 create mode 100644 libavfilter/aarch64/vf_bwdif_init_aarch64.c
 create mode 100644 libavfilter/aarch64/vf_bwdif_neon.S

diff --git a/libavfilter/aarch64/Makefile b/libavfilter/aarch64/Makefile
index b58daa3a3f..b68209bc94 100644
--- a/libavfilter/aarch64/Makefile
+++ b/libavfilter/aarch64/Makefile
@@ -1,3 +1,5 @@
+OBJS-$(CONFIG_BWDIF_FILTER)  += aarch64/vf_bwdif_init_aarch64.o
 OBJS-$(CONFIG_NLMEANS_FILTER)+= aarch64/vf_nlmeans_init.o
 
+NEON-OBJS-$(CONFIG_BWDIF_FILTER) += aarch64/vf_bwdif_neon.o
 NEON-OBJS-$(CONFIG_NLMEANS_FILTER)   += aarch64/vf_nlmeans_neon.o
diff --git a/libavfilter/aarch64/vf_bwdif_init_aarch64.c b/libavfilter/aarch64/vf_bwdif_init_aarch64.c
new file mode 100644
index 00..3ffaa07ab3
--- /dev/null
+++ b/libavfilter/aarch64/vf_bwdif_init_aarch64.c
@@ -0,0 +1,56 @@
+/*
+ * bwdif aarch64 NEON optimisations
+ *
+ * Copyright (c) 2023 John Cox 
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "libavutil/common.h"
+#include "libavfilter/bwdif.h"
+#include "libavutil/aarch64/cpu.h"
+
+void ff_bwdif_filter_intra_neon(void *dst1, void *cur1, int w, int prefs, int mrefs,
+int prefs3, int mrefs3, int parity, int clip_max);
+
+
+static void filter_intra_helper(void *dst1, void *cur1, int w, int prefs, int mrefs,
+int prefs3, int mrefs3, int parity, int clip_max)
+{
+const int w0 = clip_max != 255 ? 0 : w & ~15;
+
+ff_bwdif_filter_intra_neon(dst1, cur1, w0, prefs, mrefs, prefs3, mrefs3, parity, clip_max);
+
+if (w0 < w)
+ff_bwdif_filter_intra_c((char *)dst1 + w0, (char *)cur1 + w0,
+w - w0, prefs, mrefs, prefs3, mrefs3, parity, clip_max);
+}
+
+void
+ff_bwdif_init_aarch64(BWDIFContext *s, int bit_depth)
+{
+const int cpu_flags = av_get_cpu_flags();
+
+if (bit_depth != 8)
+return;
+
+if (!have_neon(cpu_flags))
+return;
+
+s->filter_intra = filter_intra_helper;
+}
+
diff --git a/libavfilter/aarch64/vf_bwdif_neon.S b/libavfilter/aarch64/vf_bwdif_neon.S
new file mode 100644
index 00..e288efbe6c
--- /dev/null
+++ b/libavfilter/aarch64/vf_bwdif_neon.S
@@ -0,0 +1,136 @@
+/*
+ * bwdif aarch64 NEON optimisations
+ *
+ * Copyright (c) 2023 John Cox 
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+
+#include "libavutil/aarch64/asm.S"
+
+// Space taken on the stack by an int (32-bit)
+#ifdef __APPLE__
+.set SP_INT, 4
+#else
+.set SP_INT, 8
+#endif
+
+.macro SQSHRUNN b, s0, s1, s2, s3, n
+sqshrun  \s0\().4h, \s0\().4s, #\n - 8
+sqshrun2 \s0\().8h, \s1\().4s, #\n - 8
+sqshrun  \s1\().4h, \s2\().4s, #\n - 8
+sqshrun2 \s1\().8h, \s3\().4s, #\n - 8
+uzp2     \b\().16b, \s0\().16b, \s1\().16b
+.endm
+
+.macro SMULL4K a0, a1, a2, a3, s0, s1, k
+smull   \a0\().4s
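
The helper in the init file above splits each row into a part whose width
is a multiple of 16, handled by the NEON routine, and a scalar tail
handled by the exported C function. A minimal sketch of that split in
plain C; row_scalar, row_blocks16 and row_helper are hypothetical
stand-ins, and the "SIMD" routine here is just the scalar loop with a
width check:

```c
#include <assert.h>
#include <stdint.h>

/* Scalar reference: any width. The operation itself is arbitrary. */
static void row_scalar(uint8_t *dst, const uint8_t *src, int w)
{
    for (int x = 0; x < w; x++)
        dst[x] = (uint8_t)(255 - src[x]);
}

/* Stand-in for a SIMD routine that only accepts multiples of 16. */
static void row_blocks16(uint8_t *dst, const uint8_t *src, int w)
{
    assert((w & 15) == 0);
    row_scalar(dst, src, w);
}

/* Same shape as the patch's helpers: the 8-bit case (clip_max == 255)
 * uses the block routine for w0 pixels, anything else goes scalar. */
static void row_helper(uint8_t *dst, const uint8_t *src, int w, int clip_max)
{
    const int w0 = clip_max != 255 ? 0 : w & ~15;

    row_blocks16(dst, src, w0);
    if (w0 < w)
        row_scalar(dst + w0, src + w0, w - w0);
}
```

With w = 37 the block routine sees 32 pixels and the scalar tail the
remaining 5; with a clip_max other than 255 the whole row goes to the
scalar path.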

[FFmpeg-devel] [PATCH v3 3/7] tests/checkasm: Add test for vf_bwdif filter_edge

2023-07-03 Thread John Cox

Signed-off-by: John Cox 
---
 tests/checkasm/vf_bwdif.c | 54 +++
 1 file changed, 54 insertions(+)

diff --git a/tests/checkasm/vf_bwdif.c b/tests/checkasm/vf_bwdif.c
index 034bbabb4c..5fdba09fdc 100644
--- a/tests/checkasm/vf_bwdif.c
+++ b/tests/checkasm/vf_bwdif.c
@@ -83,6 +83,60 @@ void checkasm_check_vf_bwdif(void)
 report("bwdif10");
 }
 
+{
+LOCAL_ALIGNED_16(uint8_t, prev0, [11*WIDTH]);
+LOCAL_ALIGNED_16(uint8_t, prev1, [11*WIDTH]);
+LOCAL_ALIGNED_16(uint8_t, next0, [11*WIDTH]);
+LOCAL_ALIGNED_16(uint8_t, next1, [11*WIDTH]);
+LOCAL_ALIGNED_16(uint8_t, cur0,  [11*WIDTH]);
+LOCAL_ALIGNED_16(uint8_t, cur1,  [11*WIDTH]);
+LOCAL_ALIGNED_16(uint8_t, dst0,  [WIDTH*3]);
+LOCAL_ALIGNED_16(uint8_t, dst1,  [WIDTH*3]);
+const int stride = WIDTH;
+const int mask = (1<<8)-1;
+int spat;
+int parity;
+
+for (spat = 0; spat != 2; ++spat) {
+for (parity = 0; parity != 2; ++parity) {
+if (check_func(ctx_8.filter_edge, "bwdif8.edge.s%d.p%d", spat, parity)) {
+
+declare_func(void, void *dst1, void *prev1, void *cur1, void *next1,
+int w, int prefs, int mrefs, int prefs2, int mrefs2,
+int parity, int clip_max, int spat);
+randomize_buffers(prev0, prev1, mask, 11*WIDTH);
+randomize_buffers(next0, next1, mask, 11*WIDTH);
+randomize_buffers( cur0,  cur1, mask, 11*WIDTH);
+memset(dst0, 0xba, WIDTH * 3);
+memset(dst1, 0xba, WIDTH * 3);
+
+call_ref(dst0 + stride,
+ prev0 + stride * 4, cur0 + stride * 4, next0 + stride * 4, WIDTH,
+ stride, -stride, stride * 2, -stride * 2,
+ parity, mask, spat);
+call_new(dst1 + stride,
+ prev1 + stride * 4, cur1 + stride * 4, next1 + stride * 4, WIDTH,
+ stride, -stride, stride * 2, -stride * 2,
+ parity, mask, spat);
+
+if (memcmp(dst0, dst1, WIDTH*3)
+|| memcmp(prev0, prev1, WIDTH*11)
+|| memcmp(next0, next1, WIDTH*11)
+|| memcmp( cur0,  cur1, WIDTH*11))
+fail();
+
+bench_new(dst1 + stride,
+ prev1 + stride * 4, cur1 + stride * 4, next1 + stride * 4, WIDTH,
+ stride, -stride, stride * 2, -stride * 2,
+ parity, mask, spat);
+}
+}
+}
+
+report("bwdif8.edge");
+}
+
 if (check_func(ctx_8.filter_intra, "bwdif8.intra")) {
 LOCAL_ALIGNED_16(uint8_t, cur0,  [11*WIDTH]);
 LOCAL_ALIGNED_16(uint8_t, cur1,  [11*WIDTH]);
-- 
2.39.2


[FFmpeg-devel] [PATCH v3 4/7] avfilter/vf_bwdif: Add neon for filter_edge

2023-07-03 Thread John Cox

Adds clip and spatial macros for aarch64 neon
Exports C filter_edge needed for tail fixup of neon code
Adds neon for filter_edge

Signed-off-by: John Cox 
---
 libavfilter/aarch64/vf_bwdif_init_aarch64.c |  20 +++
 libavfilter/aarch64/vf_bwdif_neon.S | 177 
 libavfilter/bwdif.h |   4 +
 libavfilter/vf_bwdif.c  |   8 +-
 4 files changed, 205 insertions(+), 4 deletions(-)

diff --git a/libavfilter/aarch64/vf_bwdif_init_aarch64.c b/libavfilter/aarch64/vf_bwdif_init_aarch64.c
index 3ffaa07ab3..e75cf2f204 100644
--- a/libavfilter/aarch64/vf_bwdif_init_aarch64.c
+++ b/libavfilter/aarch64/vf_bwdif_init_aarch64.c
@@ -24,10 +24,29 @@
 #include "libavfilter/bwdif.h"
 #include "libavutil/aarch64/cpu.h"
 
+void ff_bwdif_filter_edge_neon(void *dst1, void *prev1, void *cur1, void *next1,
+   int w, int prefs, int mrefs, int prefs2, int mrefs2,
+   int parity, int clip_max, int spat);
+
 void ff_bwdif_filter_intra_neon(void *dst1, void *cur1, int w, int prefs, int mrefs,
 int prefs3, int mrefs3, int parity, int clip_max);
 
 
+static void filter_edge_helper(void *dst1, void *prev1, void *cur1, void *next1,
+   int w, int prefs, int mrefs, int prefs2, int mrefs2,
+   int parity, int clip_max, int spat)
+{
+const int w0 = clip_max != 255 ? 0 : w & ~15;
+
+ff_bwdif_filter_edge_neon(dst1, prev1, cur1, next1, w0, prefs, mrefs, prefs2, mrefs2,
+  parity, clip_max, spat);
+
+if (w0 < w)
+ff_bwdif_filter_edge_c((char *)dst1 + w0, (char *)prev1 + w0, (char *)cur1 + w0, (char *)next1 + w0,
+   w - w0, prefs, mrefs, prefs2, mrefs2,
+   parity, clip_max, spat);
+}
+
 static void filter_intra_helper(void *dst1, void *cur1, int w, int prefs, int mrefs,
 int prefs3, int mrefs3, int parity, int clip_max)
 {
@@ -52,5 +71,6 @@ ff_bwdif_init_aarch64(BWDIFContext *s, int bit_depth)
 return;
 
 s->filter_intra = filter_intra_helper;
+s->filter_edge  = filter_edge_helper;
 }
 
diff --git a/libavfilter/aarch64/vf_bwdif_neon.S b/libavfilter/aarch64/vf_bwdif_neon.S
index e288efbe6c..389302b813 100644
--- a/libavfilter/aarch64/vf_bwdif_neon.S
+++ b/libavfilter/aarch64/vf_bwdif_neon.S
@@ -66,6 +66,79 @@
 umlsl2  \a3\().4s, \s1\().8h, \k
 .endm
 
+//  int b = m2s1 - m1;
+//  int f = p2s1 - p1;
+//  int dc = c0s1 - m1;
+//  int de = c0s1 - p1;
+//  int sp_max = FFMIN(p1 - c0s1, m1 - c0s1);
+//  sp_max = FFMIN(sp_max, FFMAX(-b,-f));
+//  int sp_min = FFMIN(c0s1 - p1, c0s1 - m1);
+//  sp_min = FFMIN(sp_min, FFMAX(b,f));
+//  diff = diff == 0 ? 0 : FFMAX3(diff, sp_min, sp_max);
+.macro SPAT_CHECK diff, m2s1, m1, c0s1, p1, p2s1, t0, t1, t2, t3
+uqsub   \t0\().16b, \p1\().16b, \c0s1\().16b
+uqsub   \t2\().16b, \m1\().16b, \c0s1\().16b
+umin    \t2\().16b, \t0\().16b, \t2\().16b
+
+uqsub   \t1\().16b, \m1\().16b, \m2s1\().16b
+uqsub   \t3\().16b, \p1\().16b, \p2s1\().16b
+umax    \t3\().16b, \t3\().16b, \t1\().16b
+umin    \t3\().16b, \t3\().16b, \t2\().16b
+
+uqsub   \t0\().16b, \c0s1\().16b, \p1\().16b
+uqsub   \t2\().16b, \c0s1\().16b, \m1\().16b
+umin    \t2\().16b, \t0\().16b, \t2\().16b
+
+uqsub   \t1\().16b, \m2s1\().16b, \m1\().16b
+uqsub   \t0\().16b, \p2s1\().16b, \p1\().16b
+umax    \t0\().16b, \t0\().16b, \t1\().16b
+umin    \t2\().16b, \t2\().16b, \t0\().16b
+
+cmeq    \t1\().16b, \diff\().16b, #0
+umax    \diff\().16b, \diff\().16b, \t3\().16b
+umax    \diff\().16b, \diff\().16b, \t2\().16b
+bic     \diff\().16b, \diff\().16b, \t1\().16b
+.endm
+
+//  i0 = s0;
+//  if (i0 > d0 + diff0)
+//  i0 = d0 + diff0;
+//  else if (i0 < d0 - diff0)
+//  i0 = d0 - diff0;
+//
+// i0 = s0 is safe
+.macro DIFF_CLIP i0, s0, d0, diff, t0, t1
+uqadd   \t0\().16b, \d0\().16b, \diff\().16b
+uqsub   \t1\().16b, \d0\().16b, \diff\().16b
+umin    \i0\().16b, \s0\().16b, \t0\().16b
+umax    \i0\().16b, \i0\().16b, \t1\().16b
+.endm
+
+//  i0 = FFABS(m1 - p1) > td0 ? i1 : i2;
+//  DIFF_CLIP
+//
+// i0 = i1 is safe
+.macro INTERPOL i0, i1, i2, m1, d0, p1, td0, diff, t0, t1, t2
+uabd    \t0\().16b, \m1\().16b, \p1\().16b
+cmhi    \t0\().16b, \t0\().16b, \td0\().16b
+bsl     \t0\().16b, \i1\().16b, \i2\().16b
+DIFF_CLIP   \i0, \t0, \d0, \diff, \t1, \t
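
As a reading aid for the DIFF_CLIP macro above, here is a scalar model of
the four instructions it uses, written for a single 8-bit lane. This is a
sketch only: uqadd8, uqsub8 and diff_clip are hypothetical names, and the
model assumes unsigned 8-bit pixels as in the NEON code. Note that the
saturating add and subtract keep the clamp bounds inside 0..255, which is
where it can differ from the plain-int pseudo-code in the comment.

```c
#include <stdint.h>

/* Unsigned saturating add, one lane of "uqadd". */
static uint8_t uqadd8(uint8_t a, uint8_t b)
{
    const unsigned s = (unsigned)a + b;
    return (uint8_t)(s > 255 ? 255 : s);
}

/* Unsigned saturating subtract, one lane of "uqsub". */
static uint8_t uqsub8(uint8_t a, uint8_t b)
{
    return (uint8_t)(a > b ? a - b : 0);
}

/* Clamp s0 into [d0 - diff, d0 + diff] with saturated bounds. */
static uint8_t diff_clip(uint8_t s0, uint8_t d0, uint8_t diff)
{
    const uint8_t hi = uqadd8(d0, diff);      /* uqadd t0, d0, diff */
    const uint8_t lo = uqsub8(d0, diff);      /* uqsub t1, d0, diff */
    const uint8_t i0 = s0 < hi ? s0 : hi;     /* umin  i0, s0, t0   */
    return i0 > lo ? i0 : lo;                 /* umax  i0, i0, t1   */
}
```

A value already inside the window passes through unchanged; values
outside are pulled to the nearest bound.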

[FFmpeg-devel] [PATCH v3 6/7] avfilter/vf_bwdif: Add a filter_line3 method for optimisation

2023-07-03 Thread John Cox

Add an optional filter_line3 to the available optimisations.

filter_line3 is equivalent to filter_line, memcpy, filter_line

filter_line shares quite a number of loads and some calculations in
common with its next iteration and testing shows that using aarch64
neon filter_line3's performance is 30% better than two filter_lines
and a memcpy.

Adds a test for vf_bwdif filter_line3 to checkasm

Rounds job start lines down to a multiple of 4. This avoids the case
where, with filter_line3 available, filter_line is still occasionally
called once at the end of a slice depending on thread count. The final
slice may do up to 3 extra lines, but filter_edge is faster than
filter_line, so this is unlikely to create any noticeable variation in
thread load.

Signed-off-by: John Cox 
---
 libavfilter/bwdif.h   |  7 
 libavfilter/vf_bwdif.c| 44 +++--
 tests/checkasm/vf_bwdif.c | 81 +++
 3 files changed, 129 insertions(+), 3 deletions(-)

diff --git a/libavfilter/bwdif.h b/libavfilter/bwdif.h
index cce99953f3..496cec72ef 100644
--- a/libavfilter/bwdif.h
+++ b/libavfilter/bwdif.h
@@ -35,6 +35,9 @@ typedef struct BWDIFContext {
 void (*filter_edge)(void *dst, void *prev, void *cur, void *next,
 int w, int prefs, int mrefs, int prefs2, int mrefs2,
 int parity, int clip_max, int spat);
+void (*filter_line3)(void *dst, int dstride,
+ const void *prev, const void *cur, const void *next, int prefs,
+ int w, int parity, int clip_max);
 } BWDIFContext;
 
 void ff_bwdif_init_filter_line(BWDIFContext *bwdif, int bit_depth);
@@ -53,4 +56,8 @@ void ff_bwdif_filter_line_c(void *dst1, void *prev1, void *cur1, void *next1,
 int prefs3, int mrefs3, int prefs4, int mrefs4,
 int parity, int clip_max);
 
+void ff_bwdif_filter_line3_c(void * dst1, int d_stride,
+ const void * prev1, const void * cur1, const void * next1, int s_stride,
+ int w, int parity, int clip_max);
+
 #endif /* AVFILTER_BWDIF_H */
diff --git a/libavfilter/vf_bwdif.c b/libavfilter/vf_bwdif.c
index 26349da1fd..6701208efe 100644
--- a/libavfilter/vf_bwdif.c
+++ b/libavfilter/vf_bwdif.c
@@ -150,6 +150,31 @@ void ff_bwdif_filter_line_c(void *dst1, void *prev1, void *cur1, void *next1,
 FILTER2()
 }
 
+#define NEXT_LINE()\
+dst += d_stride; \
+prev += prefs; \
+cur  += prefs; \
+next += prefs;
+
+void ff_bwdif_filter_line3_c(void * dst1, int d_stride,
+ const void * prev1, const void * cur1, const void * next1, int s_stride,
+ int w, int parity, int clip_max)
+{
+const int prefs = s_stride;
+uint8_t * dst  = dst1;
+const uint8_t * prev = prev1;
+const uint8_t * cur  = cur1;
+const uint8_t * next = next1;
+
+ff_bwdif_filter_line_c(dst, (void*)prev, (void*)cur, (void*)next, w,
+   prefs, -prefs, prefs * 2, - prefs * 2, prefs * 3, -prefs * 3, prefs * 4, -prefs * 4, parity, clip_max);
+NEXT_LINE();
+memcpy(dst, cur, w);
+NEXT_LINE();
+ff_bwdif_filter_line_c(dst, (void*)prev, (void*)cur, (void*)next, w,
+   prefs, -prefs, prefs * 2, - prefs * 2, prefs * 3, -prefs * 3, prefs * 4, -prefs * 4, parity, clip_max);
+}
+
 void ff_bwdif_filter_edge_c(void *dst1, void *prev1, void *cur1, void *next1,
 int w, int prefs, int mrefs, int prefs2, int mrefs2,
 int parity, int clip_max, int spat)
@@ -212,6 +237,13 @@ static void filter_edge_16bit(void *dst1, void *prev1, void *cur1, void *next1,
 FILTER2()
 }
 
+// Round job start line down to multiple of 4 so that if filter_line3 exists
+// and the frame is a multiple of 4 high then filter_line will never be called
+static inline int job_start(const int jobnr, const int nb_jobs, const int h)
+{
+return jobnr >= nb_jobs ? h : ((h * jobnr) / nb_jobs) & ~3;
+}
+
 static int filter_slice(AVFilterContext *ctx, void *arg, int jobnr, int nb_jobs)
 {
 BWDIFContext *s = ctx->priv;
@@ -221,8 +253,8 @@ static int filter_slice(AVFilterContext *ctx, void *arg, int jobnr, int nb_jobs)
 int clip_max = (1 << (yadif->csp->comp[td->plane].depth)) - 1;
 int df = (yadif->csp->comp[td->plane].depth + 7) / 8;
 int refs = linesize / df;
-int slice_start = (td->h *  jobnr   ) / nb_jobs;
-int slice_end   = (td->h * (jobnr+1)) / nb_jobs;
+int slice_start = job_start(jobnr, nb_jobs, td->h);
+int slice_end   = job_start(jobnr + 1, nb_jobs, td->h);
 int y;
 
 for (y = slice_start; y < slice_end; y++) {
@@ -244,6 +276,11 @@ static int filter_slice(AVFilterContext *ctx, void *arg, int jobnr, int nb_jobs)
refs << 1, -(refs << 1),

[FFmpeg-devel] [PATCH v3 5/7] avfilter/vf_bwdif: Add neon for filter_line Exports C filter_line needed for tail fixup of neon code

2023-07-03 Thread John Cox

Signed-off-by: John Cox 
---
 libavfilter/aarch64/vf_bwdif_init_aarch64.c |  21 ++
 libavfilter/aarch64/vf_bwdif_neon.S | 208 
 libavfilter/bwdif.h |   5 +
 libavfilter/vf_bwdif.c  |  10 +-
 4 files changed, 239 insertions(+), 5 deletions(-)

diff --git a/libavfilter/aarch64/vf_bwdif_init_aarch64.c b/libavfilter/aarch64/vf_bwdif_init_aarch64.c
index e75cf2f204..21e67884ab 100644
--- a/libavfilter/aarch64/vf_bwdif_init_aarch64.c
+++ b/libavfilter/aarch64/vf_bwdif_init_aarch64.c
@@ -31,6 +31,26 @@ void ff_bwdif_filter_edge_neon(void *dst1, void *prev1, void *cur1, void *next1,
 void ff_bwdif_filter_intra_neon(void *dst1, void *cur1, int w, int prefs, int mrefs,
 int prefs3, int mrefs3, int parity, int clip_max);
 
+void ff_bwdif_filter_line_neon(void *dst1, void *prev1, void *cur1, void *next1,
+   int w, int prefs, int mrefs, int prefs2, int mrefs2,
+   int prefs3, int mrefs3, int prefs4, int mrefs4,
+   int parity, int clip_max);
+
+
+static void filter_line_helper(void *dst1, void *prev1, void *cur1, void *next1,
+   int w, int prefs, int mrefs, int prefs2, int mrefs2,
+   int prefs3, int mrefs3, int prefs4, int mrefs4,
+   int parity, int clip_max)
+{
+const int w0 = clip_max != 255 ? 0 : w & ~15;
+
+ff_bwdif_filter_line_neon(dst1, prev1, cur1, next1,
+  w0, prefs, mrefs, prefs2, mrefs2, prefs3, mrefs3, prefs4, mrefs4, parity, clip_max);
+
+if (w0 < w)
+ff_bwdif_filter_line_c((char *)dst1 + w0, (char *)prev1 + w0, (char *)cur1 + w0, (char *)next1 + w0,
+   w - w0, prefs, mrefs, prefs2, mrefs2, prefs3, mrefs3, prefs4, mrefs4, parity, clip_max);
+}
 
 static void filter_edge_helper(void *dst1, void *prev1, void *cur1, void *next1,
    int w, int prefs, int mrefs, int prefs2, int mrefs2,
@@ -71,6 +91,7 @@ ff_bwdif_init_aarch64(BWDIFContext *s, int bit_depth)
 return;
 
 s->filter_intra = filter_intra_helper;
+s->filter_line  = filter_line_helper;
 s->filter_edge  = filter_edge_helper;
 }
 
diff --git a/libavfilter/aarch64/vf_bwdif_neon.S b/libavfilter/aarch64/vf_bwdif_neon.S
index 389302b813..ae5f09c511 100644
--- a/libavfilter/aarch64/vf_bwdif_neon.S
+++ b/libavfilter/aarch64/vf_bwdif_neon.S
@@ -154,6 +154,214 @@ const coeffs, align=4   // align 4 means align on 2^4 boundry
 .hword  5077, 981   // sp[0] = v0.h[6]
 endconst
 
+// ===
+//
+// void filter_line(
+//      void *dst1,     // x0
+//      void *prev1,    // x1
+//      void *cur1,     // x2
+//      void *next1,    // x3
+//      int w,          // w4
+//      int prefs,      // w5
+//      int mrefs,      // w6
+//      int prefs2,     // w7
+//      int mrefs2,     // [sp, #0]
+//      int prefs3,     // [sp, #SP_INT]
+//      int mrefs3,     // [sp, #SP_INT*2]
+//      int prefs4,     // [sp, #SP_INT*3]
+//      int mrefs4,     // [sp, #SP_INT*4]
+//      int parity,     // [sp, #SP_INT*5]
+//      int clip_max)   // [sp, #SP_INT*6]
+
+function ff_bwdif_filter_line_neon, export=1
+// Sanity check w
+cmp w4, #0
+ble 99f
+
+// Rearrange regs to be the same as line3 for ease of debug!
+mov w10, w4 // w10 = loop count
+mov w9,  w6 // w9  = mref
+mov w12, w7 // w12 = pref2
+mov w11, w5 // w11 = pref
+ldr w8,  [sp, #0]   // w8 =  mref2
+ldr w7,  [sp, #SP_INT*2]// w7  = mref3
+ldr w6,  [sp, #SP_INT*4]// w6  = mref4
+ldr w13, [sp, #SP_INT]  // w13 = pref3
+ldr w14, [sp, #SP_INT*3]// w14 = pref4
+
+mov x4,  x3
+mov x3,  x2
+mov x2,  x1
+
+LDR_COEFFS  v0, x17
+
+// #define prev2 cur
+//const uint8_t * restrict next2 = parity ? prev : next;
+ldr w17, [sp, #SP_INT*5]// parity
+cmp w17, #0
+csel x17, x2, x4, ne
+
+PUSH_VREGS
+
+// for (x = 0; x < w; x++) {
+// int diff0, diff2;
+// int d0, d2;
+// int temporal_diff0, temporal_diff2;
+//
+// int i1, i2;
+// int j1, j2;
+// int p6, p5, p4, p3, p2, p1, c0, m1, m2, m3, m4;
+
+10:
+// c0 = prev2[0] + next2[0];// c0 = v20, v21
+//

[FFmpeg-devel] [PATCH v3 7/7] avfilter/vf_bwdif: Add neon for filter_line3

2023-07-03 Thread John Cox

Signed-off-by: John Cox 
---
 libavfilter/aarch64/vf_bwdif_init_aarch64.c |  28 ++
 libavfilter/aarch64/vf_bwdif_neon.S | 272 
 2 files changed, 300 insertions(+)

diff --git a/libavfilter/aarch64/vf_bwdif_init_aarch64.c b/libavfilter/aarch64/vf_bwdif_init_aarch64.c
index 21e67884ab..f52bc4b9b4 100644
--- a/libavfilter/aarch64/vf_bwdif_init_aarch64.c
+++ b/libavfilter/aarch64/vf_bwdif_init_aarch64.c
@@ -36,6 +36,33 @@ void ff_bwdif_filter_line_neon(void *dst1, void *prev1, void *cur1, void *next1,
                                int prefs3, int mrefs3, int prefs4, int mrefs4,
                                int parity, int clip_max);
 
+void ff_bwdif_filter_line3_neon(void * dst1, int d_stride,
+                                const void * prev1, const void * cur1, const void * next1, int s_stride,
+                                int w, int parity, int clip_max);
+
+
+static void filter_line3_helper(void * dst1, int d_stride,
+                                const void * prev1, const void * cur1, const void * next1, int s_stride,
+                                int w, int parity, int clip_max)
+{
+    // Asm works on 16 byte chunks
+    // If w is a multiple of 16 then all is good - if not then if width rounded
+    // up to nearest 16 will fit in both src & dst strides then allow the asm
+    // to write over the padding bytes as that is almost certainly faster than
+    // having to invoke the C version to clean up the tail.
+    const int w1 = FFALIGN(w, 16);
+    const int w0 = clip_max != 255 ? 0 :
+                   d_stride <= w1 && s_stride <= w1 ? w : w & ~15;
+
+    ff_bwdif_filter_line3_neon(dst1, d_stride,
+                               prev1, cur1, next1, s_stride,
+                               w0, parity, clip_max);
+
+    if (w0 < w)
+        ff_bwdif_filter_line3_c((char *)dst1 + w0, d_stride,
+                                (const char *)prev1 + w0, (const char *)cur1 + w0, (const char *)next1 + w0, s_stride,
+                                w - w0, parity, clip_max);
+}
 
 static void filter_line_helper(void *dst1, void *prev1, void *cur1, void *next1,
                                int w, int prefs, int mrefs, int prefs2, int mrefs2,
@@ -93,5 +120,6 @@ ff_bwdif_init_aarch64(BWDIFContext *s, int bit_depth)
     s->filter_intra = filter_intra_helper;
     s->filter_line  = filter_line_helper;
     s->filter_edge  = filter_edge_helper;
+    s->filter_line3 = filter_line3_helper;
 }
 
diff --git a/libavfilter/aarch64/vf_bwdif_neon.S b/libavfilter/aarch64/vf_bwdif_neon.S
index ae5f09c511..bc092477b9 100644
--- a/libavfilter/aarch64/vf_bwdif_neon.S
+++ b/libavfilter/aarch64/vf_bwdif_neon.S
@@ -154,6 +154,278 @@ const coeffs, align=4   // align 4 means align on 2^4 boundary
 .hword  5077, 981   // sp[0] = v0.h[6]
 endconst
 
+// ===
+//
+// void ff_bwdif_filter_line3_neon(
+//      void * dst1,            // x0
+//      int d_stride,           // w1
+//      const void * prev1,     // x2
+//      const void * cur1,      // x3
+//      const void * next1,     // x4
+//      int s_stride,           // w5
+//      int w,                  // w6
+//      int parity,             // w7
+//      int clip_max);          // [sp, #0] (Ignored)
+
+function ff_bwdif_filter_line3_neon, export=1
+// Sanity check w
+cmp w6, #0
+ble 99f
+
+LDR_COEFFS  v0, x17
+
+// #define prev2 cur
+//const uint8_t * restrict next2 = parity ? prev : next;
+cmp w7, #0
+csel x17, x2, x4, ne
+
+// We want all the V registers - save all the ones we must
+PUSH_VREGS
+
+// Some rearrangement of initial values for nice layout of refs in regs
+mov w10, w6 // w10 = loop count
+neg w9,  w5 // w9  = mref
+lsl w8,  w9,  #1// w8 =  mref2
+add w7,  w9,  w9, LSL #1// w7  = mref3
+lsl w6,  w9,  #2// w6  = mref4
+mov w11, w5 // w11 = pref
+lsl w12, w5,  #1// w12 = pref2
+add w13, w5,  w5, LSL #1// w13 = pref3
+lsl w14, w5,  #2// w14 = pref4
+add w15, w5,  w5, LSL #2// w15 = pref5
+add w16, w14, w12   // w16 = pref6
+
+lsl w5,  w1,  #1// w5 = d_stride * 2
+
+// for (x = 0; x < w; x++) {
+// int diff0, diff2;
+// int d0, d2;
+// int temporal_diff0, temporal_diff2;
+//
+// int i1, i2;
+// int j1, j2;
+//

Re: [FFmpeg-devel] [PATCH v2 05/15] tests/checkasm: Add test for vf_bwdif filter_intra

2023-07-04 Thread John Cox

On Mon, 3 Jul 2023 00:14:16 +0300 (EEST), you wrote:

>[snip]
>It's a bit of a shame that this only tests things for 8 bit, not 10, but I
>guess that's better than nothing. The way the current code is set up to
>template both variants of the tests isn't very neat either...

Is there actually >8-bit interlaced content out in the wild? I've never
seen a single clip. If so where does it come from?

Just curious

JC

>// Martin

[FFmpeg-devel] [PATCH v4 0/7] avfilter/vf_bwdif: Add aarch64 neon functions

2023-07-04 Thread John Cox

Also adds a filter_line3 method which on aarch64 neon yields approx 30%
speedup over 2xfilter_line and a memcpy

Differences from v3:
Remove a few lines of neon in filter_line that should have been removed
when copying from line3

Sorry about the two patch sets in quick succession, but I think I've
applied all the requested changes and I didn't want this mistake in the
final patchset. (The mistake was benign - it just wasted a few cycles.)
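
To picture what "2xfilter_line and a memcpy" means, here is a minimal sketch in plain C. It is not the code from this series: line_fn, average_line and three_lines are hypothetical names, the per-line filter takes far fewer arguments than the real filter_line, and the assumption that the middle line of the three is the one copied through unchanged is made for illustration only.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, much simplified per-line filter signature. */
typedef void (*line_fn)(uint8_t *dst, const uint8_t *cur, int w, int stride);

/* Toy filter for illustration: rounded average of the lines above and
 * below the line being produced. Not the bwdif filter. */
static void average_line(uint8_t *dst, const uint8_t *cur, int w, int stride)
{
    for (int x = 0; x < w; x++)
        dst[x] = (uint8_t)((cur[x - stride] + cur[x + stride] + 1) >> 1);
}

/* Three output lines per call: filter, copy, filter.
 * Assumption: the middle line already belongs to the output field and is
 * copied unchanged. */
static void three_lines(line_fn filter, uint8_t *dst, int d_stride,
                        const uint8_t *cur, int s_stride, int w)
{
    assert(w >= 0);
    filter(dst, cur, w, s_stride);
    memcpy(dst + d_stride, cur + s_stride, (size_t)w);
    filter(dst + 2 * d_stride, cur + 2 * s_stride, w, s_stride);
}
```

A call such as three_lines(average_line, dst, d_stride, cur, s_stride, w) fills three consecutive output rows. A fused routine produces the same three rows in a single call.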

John Cox (7):
  tests/checkasm: Add test for vf_bwdif filter_intra
  avfilter/vf_bwdif: Add neon for filter_intra
  tests/checkasm: Add test for vf_bwdif filter_edge
  avfilter/vf_bwdif: Add neon for filter_edge
  avfilter/vf_bwdif: Add neon for filter_line
  avfilter/vf_bwdif: Add a filter_line3 method for optimisation
  avfilter/vf_bwdif: Add neon for filter_line3

 libavfilter/aarch64/Makefile|   2 +
 libavfilter/aarch64/vf_bwdif_init_aarch64.c | 125 
 libavfilter/aarch64/vf_bwdif_neon.S | 788 
 libavfilter/bwdif.h |  20 +
 libavfilter/vf_bwdif.c  |  70 +-
 tests/checkasm/vf_bwdif.c   | 172 +
 6 files changed, 1162 insertions(+), 15 deletions(-)
 create mode 100644 libavfilter/aarch64/vf_bwdif_init_aarch64.c
 create mode 100644 libavfilter/aarch64/vf_bwdif_neon.S

-- 
2.39.2


[FFmpeg-devel] [PATCH v4 1/7] tests/checkasm: Add test for vf_bwdif filter_intra

2023-07-04 Thread John Cox

Signed-off-by: John Cox 
---
 tests/checkasm/vf_bwdif.c | 37 +
 1 file changed, 37 insertions(+)

diff --git a/tests/checkasm/vf_bwdif.c b/tests/checkasm/vf_bwdif.c
index 46224bb575..034bbabb4c 100644
--- a/tests/checkasm/vf_bwdif.c
+++ b/tests/checkasm/vf_bwdif.c
@@ -20,6 +20,7 @@
 #include "checkasm.h"
 #include "libavcodec/internal.h"
 #include "libavfilter/bwdif.h"
+#include "libavutil/mem_internal.h"
 
 #define WIDTH 256
 
@@ -81,4 +82,40 @@ void checkasm_check_vf_bwdif(void)
         BODY(uint16_t, 10);
         report("bwdif10");
     }
+
+    if (check_func(ctx_8.filter_intra, "bwdif8.intra")) {
+        LOCAL_ALIGNED_16(uint8_t, cur0,  [11*WIDTH]);
+        LOCAL_ALIGNED_16(uint8_t, cur1,  [11*WIDTH]);
+        LOCAL_ALIGNED_16(uint8_t, dst0,  [WIDTH*3]);
+        LOCAL_ALIGNED_16(uint8_t, dst1,  [WIDTH*3]);
+        const int stride = WIDTH;
+        const int mask = (1<<8)-1;
+
+        declare_func(void, void *dst1, void *cur1, int w, int prefs, int mrefs,
+                     int prefs3, int mrefs3, int parity, int clip_max);
+
+        randomize_buffers( cur0,  cur1, mask, 11*WIDTH);
+        memset(dst0, 0xba, WIDTH * 3);
+        memset(dst1, 0xba, WIDTH * 3);
+
+        call_ref(dst0 + stride,
+                 cur0 + stride * 4, WIDTH,
+                 stride, -stride, stride * 3, -stride * 3,
+                 0, mask);
+        call_new(dst1 + stride,
+                 cur0 + stride * 4, WIDTH,
+                 stride, -stride, stride * 3, -stride * 3,
+                 0, mask);
+
+        if (memcmp(dst0, dst1, WIDTH*3)
+                || memcmp( cur0,  cur1, WIDTH*11))
+            fail();
+
+        bench_new(dst1 + stride,
+                  cur0 + stride * 4, WIDTH,
+                  stride, -stride, stride * 3, -stride * 3,
+                  0, mask);
+
+        report("bwdif8.intra");
+    }
 }
-- 
2.39.2


[FFmpeg-devel] [PATCH v4 2/7] avfilter/vf_bwdif: Add neon for filter_intra

2023-07-04 Thread John Cox

Adds an outline for aarch64 neon functions
Adds common macros and consts for aarch64 neon
Exports C filter_intra needed for tail fixup of neon code
Adds neon for filter_intra
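
The "tail fixup" mentioned above can be sketched in plain C as follows. This is only an illustration under stated assumptions: process16 and process_scalar are hypothetical stand-ins for the neon routine and the exported C routine, and the operation they perform (adding one to each byte) is invented for the example. Only the w & ~15 split mirrors the helper in this patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a SIMD routine: it only accepts widths that
 * are a multiple of 16 bytes. */
static void process16(uint8_t *dst, const uint8_t *src, int w)
{
    assert((w & 15) == 0);
    for (int x = 0; x < w; x++)
        dst[x] = (uint8_t)(src[x] + 1);
}

/* Hypothetical stand-in for the scalar C fallback: accepts any width. */
static void process_scalar(uint8_t *dst, const uint8_t *src, int w)
{
    for (int x = 0; x < w; x++)
        dst[x] = (uint8_t)(src[x] + 1);
}

/* Split the row: the SIMD path gets the largest multiple of 16 that fits,
 * the scalar path cleans up the remaining 0..15 bytes. */
static void process_row(uint8_t *dst, const uint8_t *src, int w)
{
    const int w0 = w & ~15;

    assert(w >= 0);
    if (w0 > 0)
        process16(dst, src, w0);
    if (w0 < w)
        process_scalar(dst + w0, src + w0, w - w0);
}
```

With w = 37, for example, process16 handles bytes 0..31 and process_scalar handles bytes 32..36, so nothing past the requested width is written.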

Signed-off-by: John Cox 
---
 libavfilter/aarch64/Makefile|   2 +
 libavfilter/aarch64/vf_bwdif_init_aarch64.c |  56 
 libavfilter/aarch64/vf_bwdif_neon.S | 136 
 libavfilter/bwdif.h |   4 +
 libavfilter/vf_bwdif.c  |   8 +-
 5 files changed, 203 insertions(+), 3 deletions(-)
 create mode 100644 libavfilter/aarch64/vf_bwdif_init_aarch64.c
 create mode 100644 libavfilter/aarch64/vf_bwdif_neon.S

diff --git a/libavfilter/aarch64/Makefile b/libavfilter/aarch64/Makefile
index b58daa3a3f..b68209bc94 100644
--- a/libavfilter/aarch64/Makefile
+++ b/libavfilter/aarch64/Makefile
@@ -1,3 +1,5 @@
+OBJS-$(CONFIG_BWDIF_FILTER)  += aarch64/vf_bwdif_init_aarch64.o
 OBJS-$(CONFIG_NLMEANS_FILTER)+= aarch64/vf_nlmeans_init.o
 
+NEON-OBJS-$(CONFIG_BWDIF_FILTER) += aarch64/vf_bwdif_neon.o
 NEON-OBJS-$(CONFIG_NLMEANS_FILTER)   += aarch64/vf_nlmeans_neon.o
diff --git a/libavfilter/aarch64/vf_bwdif_init_aarch64.c b/libavfilter/aarch64/vf_bwdif_init_aarch64.c
new file mode 100644
index 0000000000..3ffaa07ab3
--- /dev/null
+++ b/libavfilter/aarch64/vf_bwdif_init_aarch64.c
@@ -0,0 +1,56 @@
+/*
+ * bwdif aarch64 NEON optimisations
+ *
+ * Copyright (c) 2023 John Cox 
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "libavutil/common.h"
+#include "libavfilter/bwdif.h"
+#include "libavutil/aarch64/cpu.h"
+
+void ff_bwdif_filter_intra_neon(void *dst1, void *cur1, int w, int prefs, int mrefs,
+                                int prefs3, int mrefs3, int parity, int clip_max);
+
+
+static void filter_intra_helper(void *dst1, void *cur1, int w, int prefs, int mrefs,
+                                int prefs3, int mrefs3, int parity, int clip_max)
+{
+    const int w0 = clip_max != 255 ? 0 : w & ~15;
+
+    ff_bwdif_filter_intra_neon(dst1, cur1, w0, prefs, mrefs, prefs3, mrefs3, parity, clip_max);
+
+    if (w0 < w)
+        ff_bwdif_filter_intra_c((char *)dst1 + w0, (char *)cur1 + w0,
+                                w - w0, prefs, mrefs, prefs3, mrefs3, parity, clip_max);
+}
+
+void
+ff_bwdif_init_aarch64(BWDIFContext *s, int bit_depth)
+{
+    const int cpu_flags = av_get_cpu_flags();
+
+    if (bit_depth != 8)
+        return;
+
+    if (!have_neon(cpu_flags))
+        return;
+
+    s->filter_intra = filter_intra_helper;
+}
+
diff --git a/libavfilter/aarch64/vf_bwdif_neon.S b/libavfilter/aarch64/vf_bwdif_neon.S
new file mode 100644
index 0000000000..e288efbe6c
--- /dev/null
+++ b/libavfilter/aarch64/vf_bwdif_neon.S
@@ -0,0 +1,136 @@
+/*
+ * bwdif aarch64 NEON optimisations
+ *
+ * Copyright (c) 2023 John Cox 
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+
+#include "libavutil/aarch64/asm.S"
+
+// Space taken on the stack by an int (32-bit)
+#ifdef __APPLE__
+.set SP_INT, 4
+#else
+.set SP_INT, 8
+#endif
+
+.macro SQSHRUNN b, s0, s1, s2, s3, n
+sqshrun  \s0\().4h, \s0\().4s, #\n - 8
+sqshrun2 \s0\().8h, \s1\().4s, #\n - 8
+sqshrun  \s1\().4h, \s2\().4s, #\n - 8
+sqshrun2 \s1\().8h, \s3\().4s, #\n - 8
+uzp2     \b\().16b, \s0\().16b, \s1\().16b
+.endm
+
+.macro SMULL4K a0, a1, a2, a3, s0, s1, k
+smull   \a0\().4s

[FFmpeg-devel] [PATCH v4 3/7] tests/checkasm: Add test for vf_bwdif filter_edge

2023-07-04 Thread John Cox

Signed-off-by: John Cox 
---
 tests/checkasm/vf_bwdif.c | 54 +++
 1 file changed, 54 insertions(+)

diff --git a/tests/checkasm/vf_bwdif.c b/tests/checkasm/vf_bwdif.c
index 034bbabb4c..5fdba09fdc 100644
--- a/tests/checkasm/vf_bwdif.c
+++ b/tests/checkasm/vf_bwdif.c
@@ -83,6 +83,60 @@ void checkasm_check_vf_bwdif(void)
         report("bwdif10");
     }
 
+    {
+        LOCAL_ALIGNED_16(uint8_t, prev0, [11*WIDTH]);
+        LOCAL_ALIGNED_16(uint8_t, prev1, [11*WIDTH]);
+        LOCAL_ALIGNED_16(uint8_t, next0, [11*WIDTH]);
+        LOCAL_ALIGNED_16(uint8_t, next1, [11*WIDTH]);
+        LOCAL_ALIGNED_16(uint8_t, cur0,  [11*WIDTH]);
+        LOCAL_ALIGNED_16(uint8_t, cur1,  [11*WIDTH]);
+        LOCAL_ALIGNED_16(uint8_t, dst0,  [WIDTH*3]);
+        LOCAL_ALIGNED_16(uint8_t, dst1,  [WIDTH*3]);
+        const int stride = WIDTH;
+        const int mask = (1<<8)-1;
+        int spat;
+        int parity;
+
+        for (spat = 0; spat != 2; ++spat) {
+            for (parity = 0; parity != 2; ++parity) {
+                if (check_func(ctx_8.filter_edge, "bwdif8.edge.s%d.p%d", spat, parity)) {
+
+                    declare_func(void, void *dst1, void *prev1, void *cur1, void *next1,
+                                 int w, int prefs, int mrefs, int prefs2, int mrefs2,
+                                 int parity, int clip_max, int spat);
+
+                    randomize_buffers(prev0, prev1, mask, 11*WIDTH);
+                    randomize_buffers(next0, next1, mask, 11*WIDTH);
+                    randomize_buffers( cur0,  cur1, mask, 11*WIDTH);
+                    memset(dst0, 0xba, WIDTH * 3);
+                    memset(dst1, 0xba, WIDTH * 3);
+
+                    call_ref(dst0 + stride,
+                             prev0 + stride * 4, cur0 + stride * 4, next0 + stride * 4, WIDTH,
+                             stride, -stride, stride * 2, -stride * 2,
+                             parity, mask, spat);
+                    call_new(dst1 + stride,
+                             prev1 + stride * 4, cur1 + stride * 4, next1 + stride * 4, WIDTH,
+                             stride, -stride, stride * 2, -stride * 2,
+                             parity, mask, spat);
+
+                    if (memcmp(dst0, dst1, WIDTH*3)
+                            || memcmp(prev0, prev1, WIDTH*11)
+                            || memcmp(next0, next1, WIDTH*11)
+                            || memcmp( cur0,  cur1, WIDTH*11))
+                        fail();
+
+                    bench_new(dst1 + stride,
+                              prev1 + stride * 4, cur1 + stride * 4, next1 + stride * 4, WIDTH,
+                              stride, -stride, stride * 2, -stride * 2,
+                              parity, mask, spat);
+                }
+            }
+        }
+
+        report("bwdif8.edge");
+    }
+
     if (check_func(ctx_8.filter_intra, "bwdif8.intra")) {
         LOCAL_ALIGNED_16(uint8_t, cur0,  [11*WIDTH]);
         LOCAL_ALIGNED_16(uint8_t, cur1,  [11*WIDTH]);
-- 
2.39.2


[FFmpeg-devel] [PATCH v4 4/7] avfilter/vf_bwdif: Add neon for filter_edge

2023-07-04 Thread John Cox

Adds clip and spatial macros for aarch64 neon
Exports C filter_edge needed for tail fixup of neon code
Adds neon for filter_edge
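
For readers checking the clip macro against the C it replaces, the sketch below models it in scalar C. It assumes 8-bit samples and a non-negative diff. uqadd8, uqsub8, diff_clip_ref and diff_clip_sat are hypothetical names used for illustration and are not part of the patch; the helpers only model the unsigned saturating byte add and subtract that the macro relies on.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar models of unsigned saturating byte add and subtract. */
static uint8_t uqadd8(uint8_t a, uint8_t b)
{
    const unsigned s = (unsigned)a + b;
    return (uint8_t)(s > 255 ? 255 : s);
}

static uint8_t uqsub8(uint8_t a, uint8_t b)
{
    return (uint8_t)(a > b ? a - b : 0);
}

/* Reference form, following the comment above DIFF_CLIP, in plain ints. */
static int diff_clip_ref(int s0, int d0, int diff)
{
    int i0 = s0;

    assert(diff >= 0);
    if (i0 > d0 + diff)
        i0 = d0 + diff;
    else if (i0 < d0 - diff)
        i0 = d0 - diff;
    return i0;
}

/* Saturating form: min against d0 + diff, then max against d0 - diff. */
static uint8_t diff_clip_sat(uint8_t s0, uint8_t d0, uint8_t diff)
{
    const uint8_t hi = uqadd8(d0, diff);
    const uint8_t lo = uqsub8(d0, diff);
    const uint8_t t  = s0 < hi ? s0 : hi;

    return t > lo ? t : lo;
}
```

Saturation does not change the result for byte inputs: when d0 + diff exceeds 255 the upper bound cannot bite because s0 is at most 255, and when d0 - diff is negative the lower bound cannot bite because s0 is at least 0.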

Signed-off-by: John Cox 
---
 libavfilter/aarch64/vf_bwdif_init_aarch64.c |  20 +++
 libavfilter/aarch64/vf_bwdif_neon.S | 177 
 libavfilter/bwdif.h |   4 +
 libavfilter/vf_bwdif.c  |   8 +-
 4 files changed, 205 insertions(+), 4 deletions(-)

diff --git a/libavfilter/aarch64/vf_bwdif_init_aarch64.c b/libavfilter/aarch64/vf_bwdif_init_aarch64.c
index 3ffaa07ab3..e75cf2f204 100644
--- a/libavfilter/aarch64/vf_bwdif_init_aarch64.c
+++ b/libavfilter/aarch64/vf_bwdif_init_aarch64.c
@@ -24,10 +24,29 @@
 #include "libavfilter/bwdif.h"
 #include "libavutil/aarch64/cpu.h"
 
+void ff_bwdif_filter_edge_neon(void *dst1, void *prev1, void *cur1, void *next1,
+                               int w, int prefs, int mrefs, int prefs2, int mrefs2,
+                               int parity, int clip_max, int spat);
+
 void ff_bwdif_filter_intra_neon(void *dst1, void *cur1, int w, int prefs, int mrefs,
                                 int prefs3, int mrefs3, int parity, int clip_max);
 
 
+static void filter_edge_helper(void *dst1, void *prev1, void *cur1, void *next1,
+                               int w, int prefs, int mrefs, int prefs2, int mrefs2,
+                               int parity, int clip_max, int spat)
+{
+    const int w0 = clip_max != 255 ? 0 : w & ~15;
+
+    ff_bwdif_filter_edge_neon(dst1, prev1, cur1, next1, w0, prefs, mrefs, prefs2, mrefs2,
+                              parity, clip_max, spat);
+
+    if (w0 < w)
+        ff_bwdif_filter_edge_c((char *)dst1 + w0, (char *)prev1 + w0, (char *)cur1 + w0, (char *)next1 + w0,
+                               w - w0, prefs, mrefs, prefs2, mrefs2,
+                               parity, clip_max, spat);
+}
+
 static void filter_intra_helper(void *dst1, void *cur1, int w, int prefs, int mrefs,
                                 int prefs3, int mrefs3, int parity, int clip_max)
 {
@@ -52,5 +71,6 @@ ff_bwdif_init_aarch64(BWDIFContext *s, int bit_depth)
         return;
 
     s->filter_intra = filter_intra_helper;
+    s->filter_edge  = filter_edge_helper;
 }
 
diff --git a/libavfilter/aarch64/vf_bwdif_neon.S b/libavfilter/aarch64/vf_bwdif_neon.S
index e288efbe6c..389302b813 100644
--- a/libavfilter/aarch64/vf_bwdif_neon.S
+++ b/libavfilter/aarch64/vf_bwdif_neon.S
@@ -66,6 +66,79 @@
 umlsl2  \a3\().4s, \s1\().8h, \k
 .endm
 
+//  int b = m2s1 - m1;
+//  int f = p2s1 - p1;
+//  int dc = c0s1 - m1;
+//  int de = c0s1 - p1;
+//  int sp_max = FFMIN(p1 - c0s1, m1 - c0s1);
+//  sp_max = FFMIN(sp_max, FFMAX(-b,-f));
+//  int sp_min = FFMIN(c0s1 - p1, c0s1 - m1);
+//  sp_min = FFMIN(sp_min, FFMAX(b,f));
+//  diff = diff == 0 ? 0 : FFMAX3(diff, sp_min, sp_max);
+.macro SPAT_CHECK diff, m2s1, m1, c0s1, p1, p2s1, t0, t1, t2, t3
+uqsub   \t0\().16b, \p1\().16b, \c0s1\().16b
+uqsub   \t2\().16b, \m1\().16b, \c0s1\().16b
+umin    \t2\().16b, \t0\().16b, \t2\().16b
+
+uqsub   \t1\().16b, \m1\().16b, \m2s1\().16b
+uqsub   \t3\().16b, \p1\().16b, \p2s1\().16b
+umax    \t3\().16b, \t3\().16b, \t1\().16b
+umin    \t3\().16b, \t3\().16b, \t2\().16b
+
+uqsub   \t0\().16b, \c0s1\().16b, \p1\().16b
+uqsub   \t2\().16b, \c0s1\().16b, \m1\().16b
+umin    \t2\().16b, \t0\().16b, \t2\().16b
+
+uqsub   \t1\().16b, \m2s1\().16b, \m1\().16b
+uqsub   \t0\().16b, \p2s1\().16b, \p1\().16b
+umax    \t0\().16b, \t0\().16b, \t1\().16b
+umin    \t2\().16b, \t2\().16b, \t0\().16b
+
+cmeq    \t1\().16b, \diff\().16b, #0
+umax    \diff\().16b, \diff\().16b, \t3\().16b
+umax    \diff\().16b, \diff\().16b, \t2\().16b
+bic     \diff\().16b, \diff\().16b, \t1\().16b
+.endm
+
+//  i0 = s0;
+//  if (i0 > d0 + diff0)
+//  i0 = d0 + diff0;
+//  else if (i0 < d0 - diff0)
+//  i0 = d0 - diff0;
+//
+// i0 = s0 is safe
+.macro DIFF_CLIP i0, s0, d0, diff, t0, t1
+uqadd   \t0\().16b, \d0\().16b, \diff\().16b
+uqsub   \t1\().16b, \d0\().16b, \diff\().16b
+umin    \i0\().16b, \s0\().16b, \t0\().16b
+umax    \i0\().16b, \i0\().16b, \t1\().16b
+.endm
+
+//  i0 = FFABS(m1 - p1) > td0 ? i1 : i2;
+//  DIFF_CLIP
+//
+// i0 = i1 is safe
+.macro INTERPOL i0, i1, i2, m1, d0, p1, td0, diff, t0, t1, t2
+uabd    \t0\().16b, \m1\().16b, \p1\().16b
+cmhi    \t0\().16b, \t0\().16b, \td0\().16b
+bsl     \t0\().16b, \i1\().16b, \i2\().16b
+DIFF_CLIP   \i0, \t0, \d0, \diff, \t1, \t
