Hi,

Unrolling the outer loop of yuv2planeX by 4 reduces the number of cache accesses by about 7.5%. The filter values are loaded once and reused across the 4 unrolled iterations, which avoids reloading the same values 3 more times.

The performance was measured on an Arm64 Neoverse-N1 (Graviton2, c6g.metal instance) with the following command:

$ perf stat -e cache-references ./ffmpeg_g -nostats -f lavfi -i testsrc2=4k:d=2 -vf bench=start,scale=1024x1024,bench=stop -f null -
before: 1551591469 cache-references
after:  1436140431 cache-references

before: [bench @ 0xaaaac62b7d30] t:0.013226 avg:0.013219 max:0.013537 min:0.012975
after:  [bench @ 0xaaaad84f3d30] t:0.012355 avg:0.012381 max:0.013164 min:0.012158

Ok to commit?

Thanks,
Sebastian
0001-aarch64-yuv2planeX-unroll-outer-loop-by-4-increases-.patch
Description: Binary data