[FFmpeg-devel] [PATCH] swscale: rgb_to_yuv neon optimizations

2025-05-19 Thread Dmitriy Kovalenko
Signed-off-by: Dmitriy Kovalenko --- libswscale/aarch64/input.S | 166 + 1 file changed, 112 insertions(+), 54 deletions(-) diff --git a/libswscale/aarch64/input.S b/libswscale/aarch64/input.S index c1c0adffc8..ee8eb24c14 100644 --- a/libswscale/aarch64/input.S +

[FFmpeg-devel] [PATCH 1/2] swscale: rgb_to_yuv neon optimizations

2025-05-27 Thread Dmitriy Kovalenko
I've found quite a few ways to optimize ffmpeg's existing rgb to yuv subsampled conversion. In this patch stack I'll try to improve the performance. This particular set of changes is a small improvement to all the existing functions and macros. The biggest performance gain comes from post loadi

[FFmpeg-devel] [PATCH 2/2] swscale: Neon rgb_to_yuv_half process 16 pixels at a time

2025-05-27 Thread Dmitriy Kovalenko
This patch integrates so-called double buffering, where we load 2 batches of elements at a time and then process them in parallel. On modern Arm processors, especially Apple Silicon, it gives a visible benefit; for subsampled pixel processing it is especially nice because it allows to read e
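
The idea of loading two batches up front and then doing the arithmetic on both can be sketched in plain C. This is only an illustration, not the patch's NEON code: the patch works on the subsampled `rgb_to_yuv_half` path in assembly, while this sketch converts packed RGB24 to luma only. The function names, the batch size of 8, and the BT.601-style coefficients are all assumptions chosen for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { BATCH = 8 };                  /* pixels per batch (hypothetical choice) */
enum { RY = 66, GY = 129, BY = 25 }; /* illustrative BT.601-style coefficients */

/* Fixed-point luma for one pixel; the result stays within 16..235. */
static uint8_t luma(uint8_t r, uint8_t g, uint8_t b)
{
    return (uint8_t)(((RY * r + GY * g + BY * b + 128) >> 8) + 16);
}

/* Reference version: one pixel per iteration. */
static void rgb24_to_y_ref(uint8_t *dst, const uint8_t *src, size_t width)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < width; i++)
        dst[i] = luma(src[3 * i], src[3 * i + 1], src[3 * i + 2]);
}

/* Two batches per iteration: both loads are issued before any arithmetic,
 * so the two independent batches can overlap on a wide core. */
static void rgb24_to_y_two_batches(uint8_t *dst, const uint8_t *src, size_t width)
{
    assert(dst != NULL && src != NULL);
    size_t i = 0;
    for (; i + 2 * BATCH <= width; i += 2 * BATCH) {
        uint8_t a[3 * BATCH], b[3 * BATCH];
        memcpy(a, src + 3 * i, sizeof a);           /* load batch 0 */
        memcpy(b, src + 3 * (i + BATCH), sizeof b); /* load batch 1 */
        for (size_t k = 0; k < BATCH; k++) {
            dst[i + k]         = luma(a[3 * k], a[3 * k + 1], a[3 * k + 2]);
            dst[i + BATCH + k] = luma(b[3 * k], b[3 * k + 1], b[3 * k + 2]);
        }
    }
    /* Scalar tail for widths that are not a multiple of 2 * BATCH. */
    for (; i < width; i++)
        dst[i] = luma(src[3 * i], src[3 * i + 1], src[3 * i + 2]);
}
```

Both functions should produce identical output for any width; whether the two-batch form is actually faster depends on the compiler and the core, which is why the thread relies on measurements rather than on the structure alone.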

Re: [FFmpeg-devel] [PATCH] swscale: rgb_to_yuv neon optimizations

2025-05-21 Thread Dmitriy Kovalenko
Bumping the review for this one. On 19/05/2025 21:50, Dmitriy Kovalenko wrote: I've found quite a few ways to optimize existing ffmpeg's rgb to yuv subsampled conversion. In this patch stack I'll try to improve the performance. This particular set of changes is a small imp

Re: [FFmpeg-devel] [PATCH v4 2/2] swscale: Neon rgb_to_yuv_half process 32 pixels at a time

2025-05-31 Thread Dmitriy Kovalenko
Great. I send another version with the reverted change for the asr register change. What is the correct process to reply for the inline changes then? Inline email answer or cover letter? > On May 30, 2025, at 11:10, Martin Storsjö wrote: > > On Fri, 30 May 2025, Dmitriy Kovale

Re: [FFmpeg-devel] [PATCH 2/2] swscale: Neon rgb_to_yuv_half process 32 pixels at a time

2025-05-31 Thread Dmitriy Kovalenko
Correct. I meant dual issue https://developer.arm.com/documentation/ddi0460/d/Cycle-Timings-and-Interlock-Behavior/Dual-issue Best regards, Dmitriy Kovalenko On May 31, 2025, at 12:32, Kieran Kunhya wrote:  On Sat, 31 May 2025, 10:17 Dmitriy Kovalenko, mailto:dmtr.kovale...@outlook.com

Re: [FFmpeg-devel] [PATCH v4 2/2] swscale: Neon rgb_to_yuv_half process 32 pixels at a time

2025-05-31 Thread Dmitriy Kovalenko
the quoted message. You can still use “>” to make a partial quote (hope it works lol) Best regards, Dmitriy Kovalenko > On May 31, 2025, at 12:43, Christopher Snowhill wrote: > > by > not allowing one to insert text into the middle of

Re: [FFmpeg-devel] [PATCH 2/2] swscale: Neon rgb_to_yuv_half process 32 pixels at a time

2025-05-31 Thread Dmitriy Kovalenko
> On May 31, 2025, at 14:13, Martin Storsjö wrote: > > On Sat, 31 May 2025, Dmitriy Kovalenko wrote: > >> Correct. I meant dual issue >> https://developer.arm.com/documentation/ddi0460/d/Cycle-Timings-and-Interlock-Behavior/Dual-issue > > D

[FFmpeg-devel] [PATCH 0/2] swscale: rgb_to_yuv neon optimizations

2025-05-31 Thread Dmitriy Kovalenko
ich does detect such issues. I managed to rewrite the function to avoid using any callee saved registers. The only register I keep using is v7 which is not callee saved. Dmitriy Kovalenko (2): swscale: rgb_to_yuv neon optimizations swscale: Neon rgb_to_yuv_half process 32 pixe

[FFmpeg-devel] [PATCH 1/2] swscale: rgb_to_yuv neon optimizations

2025-05-31 Thread Dmitriy Kovalenko
I've found quite a few ways to optimize ffmpeg's existing rgb to yuv subsampled conversion. In this patch stack I'll try to improve the performance. This particular set of changes is a small improvement to all the existing functions and macros. The biggest performance gain comes from post loadi

[FFmpeg-devel] [PATCH 2/2] swscale: Neon rgb_to_yuv_half process 32 pixels at a time

2025-05-31 Thread Dmitriy Kovalenko
This patch integrates so-called double buffering, where we load 2 batches of elements at a time and then process them in parallel. On modern Arm processors, especially Apple Silicon, it gives a visible benefit; for subsampled pixel processing it is especially nice because it allows to read

Re: [FFmpeg-devel] [PATCH 1/2] swscale: rgb_to_yuv neon optimizations

2025-05-29 Thread Dmitriy Kovalenko
I appreciate the review for both the commits. I did fix all the unrelated changes and iterated in the new version, would appreciate the re-review. > On May 29, 2025, at 20:53, Martin Storsjö wrote: > > On Tue, 27 May 2025, Dmitriy Kovalenko wrote: > >> This particular s

[FFmpeg-devel] [PATCH v2 1/2] swscale: rgb_to_yuv neon optimizations

2025-05-29 Thread Dmitriy Kovalenko
I've found quite a few ways to optimize ffmpeg's existing rgb to yuv subsampled conversion. In this patch stack I'll try to improve the performance. This particular set of changes is a small improvement to all the existing functions and macros. The biggest performance gain comes from post loadi

[FFmpeg-devel] [PATCH v2 2/2] swscale: Neon rgb_to_yuv_half process 32 pixels at a time

2025-05-29 Thread Dmitriy Kovalenko
This patch integrates so-called double buffering, where we load 2 batches of elements at a time and then process them in parallel. On modern Arm processors, especially Apple Silicon, it gives a visible benefit; for subsampled pixel processing it is especially nice because it allows to read

Re: [FFmpeg-devel] [PATCH 1/2] swscale: rgb_to_yuv neon optimizations

2025-05-30 Thread Dmitriy Kovalenko
, Martin Storsjö wrote: > > On Thu, 29 May 2025, Dmitriy Kovalenko wrote: > >> I appreciate the review for both the commits. I did fix all the unrelated >> changes and iterated in the new version, would appreciate the re-review. > > Don't top post. > > There a

[FFmpeg-devel] [PATCH v4 2/2] swscale: Neon rgb_to_yuv_half process 32 pixels at a time

2025-05-30 Thread Dmitriy Kovalenko
=== Feedback response === > Also, with that fixed, this fails to properly back up and restore registers > v8-v15; checkasm doesn't notice this on macOS, but on Linux and windows, > checkasm has a call wrapper which does detect such issues. I managed to rewrite the function to avoid using any ca

[FFmpeg-devel] [PATCH v4 1/2] swscale: rgb_to_yuv neon optimizations

2025-05-30 Thread Dmitriy Kovalenko
I'm sorry about the previous patch; it seems something went wrong at the Outlook step and a corrupted patch got sent. I'll keep using send-email. === __every single__ inline comment response === > This is an unrelated change Fixed and resolved > The patch adds trailing whitespace here

[FFmpeg-devel] [PATCH v2 0/2] swscale: neon aarch64 rgb_to_yuv optimizations

2025-05-29 Thread Dmitriy Kovalenko
macos nor linux arm builds so why not to keep them? Dmitriy Kovalenko (2): swscale: rgb_to_yuv neon optimizations swscale: Neon rgb_to_yuv_half process 32 pixels at a time libswscale/aarch64/input.S | 212 +++-- 1 file changed, 155 insertions(+), 57 deletions

[FFmpeg-devel] [PATCH v3 1/2] swscale: rgb_to_yuv neon optimizations

2025-05-30 Thread Dmitriy Kovalenko
I've found quite a few ways to optimize ffmpeg's existing rgb to yuv subsampled conversion. In this patch stack I'll try to improve the performance. This particular set of changes is a small improvement to all the existing functions and macros. The biggest performance gain comes from post loadi

[FFmpeg-devel] [PATCH v3 2/2] swscale: Neon rgb_to_yuv_half process 32 pixels at a time

2025-05-30 Thread Dmitriy Kovalenko
This patch integrates so-called double buffering, where we load 2 batches of elements at a time and then process them in parallel. On modern Arm processors, especially Apple Silicon, it gives a visible benefit; for subsampled pixel processing it is especially nice because it allows to read

[FFmpeg-devel] [PATCH] configure: Ignore nullability-completeness apple clang warnings

2025-06-05 Thread Dmitriy Kovalenko
Some versions of Apple Clang produce a ton of warnings about missing nullability specifiers in the existing ffmpeg codebase, which significantly slows down compilation because of the size of the produced output (especially on CI as a part of external build systems because they us