On Mon, 25 Jul 2022, Hubert Mazur wrote:
Provide optimized implementation of pix_abs16_y2 function for arm64.
Performance comparison tests are shown below.
pix_abs_0_2_c: 308.5
pix_abs_0_2_neon: 39.2
Benchmarks and tests run with checkasm tool on AWS Graviton 3.
Signed-off-by: Hubert Mazur <h...@semihalf.com>
---
libavcodec/aarch64/me_cmp_init_aarch64.c | 3 +
libavcodec/aarch64/me_cmp_neon.S | 73 ++++++++++++++++++++++++
2 files changed, 76 insertions(+)
+// iterate by one
+2:
+
+ ld1 {v1.16b}, [x2], x3 // Load pix2
+ ld1 {v2.16b}, [x5], x3 // Load pix3
+ urhadd v30.16b, v1.16b, v2.16b // Rounding halving add
+ ld1 {v0.16b}, [x1], x3 // Load pix1
+ uabd v30.16b, v30.16b, v30.16b
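This should be "uabd v30.16b, v30.16b, v0.16b" here too; as written, the
instruction takes the absolute difference of v30 with itself, which is
always zero, so pix1 never contributes to the sum. Please check the
uncommon codepaths too (until we can make checkasm test them by default).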
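
For illustration, here is a minimal scalar sketch of what this loop is
expected to compute, next to a model of what the quoted instruction does.
The function names are hypothetical and are not taken from the patch; the
sketch assumes that urhadd corresponds to (a + b + 1) >> 1 per byte and
that pix3 is pix2 advanced by one stride.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Scalar equivalent of URHADD on one byte pair: rounding halving add. */
static inline int avg2(int a, int b)
{
    return (a + b + 1) >> 1;
}

/* Hypothetical reference: sum of absolute differences between pix1 and
 * the vertical half-pel interpolation of pix2, over a 16 x h block. */
static int pix_abs16_y2_sketch(const uint8_t *pix1, const uint8_t *pix2,
                               ptrdiff_t stride, int h)
{
    const uint8_t *pix3 = pix2 + stride; /* row below pix2 */
    int sum = 0;

    assert(h > 0);
    for (int i = 0; i < h; i++) {
        for (int j = 0; j < 16; j++)
            sum += abs(pix1[j] - avg2(pix2[j], pix3[j]));
        pix1 += stride;
        pix2 += stride;
        pix3 += stride;
    }
    return sum;
}

/* Model of the quoted "uabd v30, v30, v30": the averaged row is compared
 * against itself, so every per-byte difference is zero. */
static int pix_abs16_y2_selfdiff(const uint8_t *pix1, const uint8_t *pix2,
                                 ptrdiff_t stride, int h)
{
    const uint8_t *pix3 = pix2 + stride;
    int sum = 0;

    (void)pix1; /* pix1 is loaded but never used in the difference */
    for (int i = 0; i < h; i++) {
        for (int j = 0; j < 16; j++) {
            int avg = avg2(pix2[j], pix3[j]);
            sum += abs(avg - avg);
        }
        pix2 += stride;
        pix3 += stride;
    }
    return sum;
}
```

With any input where pix1 differs from the interpolated rows, the first
function returns a nonzero sum and the second returns 0, which is the kind
of mismatch a test covering this codepath would catch.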
// Martin