On Tue, Aug 18, 2020 at 01:11:30PM -0500, Sebastian Pop wrote:
> Hi,
> 
> Unrolling the outer loop in yuv2planeX by 4 reduces the number of cache
> accesses by 7.5%.
> The values loaded for the filter are reused across the 4 unrolled
> iterations, which avoids reloading the same values 3 times.
> The performance was measured on an Arm64 Neoverse-N1 Graviton2 c6g.metal
> instance with the following command:
> $ perf stat -e cache-references ./ffmpeg_g -nostats -f lavfi -i testsrc2=4k:d=2 -vf bench=start,scale=1024x1024,bench=stop -f null -
> 
> before: 1551591469      cache-references
> after:  1436140431      cache-references
> 
> before: [bench @ 0xaaaac62b7d30] t:0.013226 avg:0.013219 max:0.013537 min:0.012975
> after:  [bench @ 0xaaaad84f3d30] t:0.012355 avg:0.012381 max:0.013164 min:0.012158
> 

> Ok to commit?

Faster is obviously better, so if it's tested with odd sizes and the Arm
developers have had a chance to comment, it should be OK.

One potential improvement is to use the unrolled code for odd widths
too, and use the non-unrolled code only for the remaining pixels at the end.
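
For illustration, the idea could look roughly like the sketch below. It is a
simplified scalar stand-in for yuv2planeX, not the actual swscale code or the
patch: vscale_ref, vscale_unroll4, clip_u8 and the shift parameter are all
hypothetical names, and dithering is left out.

```c
#include <stdint.h>

/* Hypothetical helper standing in for av_clip_uint8(). */
static uint8_t clip_u8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Reference version: one output pixel per outer iteration.
 * filter[j] is reloaded for every pixel. */
static void vscale_ref(const int16_t *filter, int filter_size,
                       const int16_t **src, uint8_t *dst, int w, int shift)
{
    for (int i = 0; i < w; i++) {
        int val = 0;
        for (int j = 0; j < filter_size; j++)
            val += src[j][i] * filter[j];
        dst[i] = clip_u8(val >> shift);
    }
}

/* Unrolled version: the main loop handles 4 pixels at a time for any
 * width, and a non-unrolled tail loop handles the remaining 0..3 pixels. */
static void vscale_unroll4(const int16_t *filter, int filter_size,
                           const int16_t **src, uint8_t *dst, int w, int shift)
{
    int i = 0;

    for (; i + 4 <= w; i += 4) {
        int v0 = 0, v1 = 0, v2 = 0, v3 = 0;
        for (int j = 0; j < filter_size; j++) {
            const int f = filter[j];          /* loaded once, used 4 times */
            const int16_t *s = src[j] + i;
            v0 += s[0] * f;
            v1 += s[1] * f;
            v2 += s[2] * f;
            v3 += s[3] * f;
        }
        dst[i + 0] = clip_u8(v0 >> shift);
        dst[i + 1] = clip_u8(v1 >> shift);
        dst[i + 2] = clip_u8(v2 >> shift);
        dst[i + 3] = clip_u8(v3 >> shift);
    }

    /* Tail: same computation as the reference loop. */
    for (; i < w; i++) {
        int val = 0;
        for (int j = 0; j < filter_size; j++)
            val += src[j][i] * filter[j];
        dst[i] = clip_u8(val >> shift);
    }
}
```

With w = 7, for example, the unrolled loop covers pixels 0..3 and the tail
loop covers pixels 4..6, so widths that are not a multiple of 4 still get
most of the benefit.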

thx


[...]

-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

Many things microsoft did are stupid, but not doing something just because
microsoft did it is even more stupid. If everything ms did were stupid they
would be bankrupt already.
