Hi,

Unrolling the outer loop in yuv2planeX by 4 reduces the number of cache
accesses by about 7.4%.
Each filter value is loaded once and used in all 4 unrolled iterations,
which avoids reloading the same value 3 more times.
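
To illustrate the idea, here is a minimal scalar C sketch. It is not the
attached patch, which changes the aarch64 NEON assembly; it only models the
reuse pattern. The names planeX_ref, planeX_unroll4, clip_uint8 and
unroll4_matches_ref are hypothetical and exist only in this example, and the
shifts (<< 12, >> 19) assume the usual 8-bit output path of the C reference.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static uint8_t clip_uint8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Baseline model: one output pixel per outer iteration, so every
 * filter[j] is loaded again for each pixel. */
static void planeX_ref(const int16_t *filter, int filterSize,
                       const int16_t **src, uint8_t *dest, int dstW,
                       const uint8_t *dither, int offset)
{
    for (int i = 0; i < dstW; i++) {
        int val = dither[(i + offset) & 7] << 12;
        for (int j = 0; j < filterSize; j++)
            val += src[j][i] * filter[j];
        dest[i] = clip_uint8(val >> 19);
    }
}

/* Outer loop unrolled by 4: each filter[j] is loaded once and feeds
 * four accumulators, which saves three reloads per tap. */
static void planeX_unroll4(const int16_t *filter, int filterSize,
                           const int16_t **src, uint8_t *dest, int dstW,
                           const uint8_t *dither, int offset)
{
    int i = 0;

    assert(dstW >= 0 && filterSize >= 0);
    for (; i + 4 <= dstW; i += 4) {
        int v0 = dither[(i + 0 + offset) & 7] << 12;
        int v1 = dither[(i + 1 + offset) & 7] << 12;
        int v2 = dither[(i + 2 + offset) & 7] << 12;
        int v3 = dither[(i + 3 + offset) & 7] << 12;
        for (int j = 0; j < filterSize; j++) {
            const int f = filter[j];        /* single load, four uses */
            const int16_t *s = src[j] + i;
            v0 += s[0] * f;
            v1 += s[1] * f;
            v2 += s[2] * f;
            v3 += s[3] * f;
        }
        dest[i + 0] = clip_uint8(v0 >> 19);
        dest[i + 1] = clip_uint8(v1 >> 19);
        dest[i + 2] = clip_uint8(v2 >> 19);
        dest[i + 3] = clip_uint8(v3 >> 19);
    }
    /* Tail: fewer than 4 pixels left, handled one at a time. */
    for (; i < dstW; i++) {
        int val = dither[(i + offset) & 7] << 12;
        for (int j = 0; j < filterSize; j++)
            val += src[j][i] * filter[j];
        dest[i] = clip_uint8(val >> 19);
    }
}

/* Self-check on made-up data: returns 1 when both versions agree. */
static int unroll4_matches_ref(void)
{
    enum { W = 19, TAPS = 3 };  /* W is not a multiple of 4: hits the tail */
    static const int16_t filter[TAPS] = { 1024, 2048, 1024 }; /* sum 1 << 12 */
    static const uint8_t dither[8] = { 64, 64, 64, 64, 64, 64, 64, 64 };
    int16_t rows[TAPS][W];
    const int16_t *src[TAPS];
    uint8_t a[W], b[W];

    for (int j = 0; j < TAPS; j++) {
        src[j] = rows[j];
        for (int i = 0; i < W; i++)
            rows[j][i] = (int16_t)(((i * 13 + j * 7) & 255) << 7);
    }
    planeX_ref(filter, TAPS, src, a, W, dither, 0);
    planeX_unroll4(filter, TAPS, src, b, W, dither, 0);
    return memcmp(a, b, W) == 0;
}
```

Calling unroll4_matches_ref() compares the two versions on a width that is
not a multiple of 4, so the tail loop is exercised as well.
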
The performance was measured on an Arm64 Neoverse-N1 Graviton2 c6g.metal
instance with the following command:
$ perf stat -e cache-references ./ffmpeg_g -nostats -f lavfi \
    -i testsrc2=4k:d=2 -vf bench=start,scale=1024x1024,bench=stop -f null -

before: 1551591469      cache-references
after:  1436140431      cache-references

before: [bench @ 0xaaaac62b7d30] t:0.013226 avg:0.013219 max:0.013537 min:0.012975
after:  [bench @ 0xaaaad84f3d30] t:0.012355 avg:0.012381 max:0.013164 min:0.012158

Ok to commit?

Thanks,
Sebastian

Attachment: 0001-aarch64-yuv2planeX-unroll-outer-loop-by-4-increases-.patch