A first draft of a patch set adding AVX functions for 8-bit H.264 IDCT. Unfortunately they only provide a small speedup: 8-bit data usually isn't large enough to take advantage of wider registers. That said, I may have missed places where only MMX code exists but 16-byte registers would be useful, or I just haven't reached them yet.
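To make the register-size point a bit more concrete, here is a rough sketch in C. It is only an illustration, not code from the patches. It assumes the coefficients are stored as int16_t (as they are for 8-bit H.264), and the helper names row_bytes, rows_per_reg and regs_per_block are made up for this example.

```c
#include <stdint.h>

/* Register widths in bytes: MMX is 8 bytes, XMM (SSE2/AVX) is 16 bytes. */
enum { MMX_BYTES = 8, XMM_BYTES = 16 };

/* Bytes in one row of an n x n block of 16-bit coefficients (assumed layout). */
static unsigned row_bytes(unsigned n)
{
    return n * (unsigned)sizeof(int16_t);
}

/* Number of whole coefficient rows that fit in one register. */
static unsigned rows_per_reg(unsigned n, unsigned reg_bytes)
{
    return reg_bytes / row_bytes(n);
}

/* Number of registers needed to hold the whole n x n block (rounded up). */
static unsigned regs_per_block(unsigned n, unsigned reg_bytes)
{
    return (n * row_bytes(n) + reg_bytes - 1) / reg_bytes;
}
```

Under these assumptions, a 4x4 row is 8 bytes, so it fills an MMX register exactly and the whole block fits in just two XMM registers (regs_per_block(4, XMM_BYTES) is 2). An 8x8 row is 16 bytes, so it fills one XMM register exactly (rows_per_reg(8, XMM_BYTES) is 1). Either way there is little data left over for anything wider to work on.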
Regarding these patches: I still need to check that they work on 32-bit builds and on Windows (both 32- and 64-bit). 64-bit Linux was fine. I also need to write a proper subject line for most of them. Finally, h264_idct_add16intra does not work yet, so I won't push it unless I can get it working. I have still included it here for completeness, for future reference, and for fresh eyes.

Initial timing data, Skylake-U:

h264_idct_add avx: 1.20x faster (658±0.8 vs. 547±0.2 decicycles) compared with mmxext
h264_idct_dc_add avx: 1.04x faster (521±1.7 vs. 501±1.1 decicycles) compared with mmxext
h264_idct8_add avx: 1.01x faster (1069±1.9 vs. 1060±0.7 decicycles) compared with sse2
h264_idct8_dc_add avx: 1.12x faster (638±12.7 vs. 568±4.3 decicycles) compared with mmxext
h264_idct_add16 avx: 1.01x faster (2150±46.1 vs. 2118±29.0 decicycles) compared with sse2
h264_idct8_add4 avx: 1.00x faster (2884±63.9 vs. 2880±21.1 decicycles) compared with sse2
h264_idct_add16intra avx: 1.02x faster (1580±4.8 vs. 1555±3.9 decicycles) compared with sse2