Hello,

Attached are patches adding an AVX2 version of each 8-bit function (except divide).

001: avutil: add ABS2 for AVX2
002: avfilter: add AVX2 versions for most of the functions

When the processing stays in 8 bits, the AVX2 version is a simple modification (VBROADCASTi128 for constant loading).

When the processing uses 16-bit intermediates (where the load uses movh, a 64-bit load), I created new macros (someone will probably have a better idea for their names).

The idea in AVX2 is to load 128 bits of data (2 x 64 bits), then shuffle across lanes so that each 64-bit half sits in the low part of its own lane. This keeps the rest of the processing similar to the SSE version.

For the store, the idea is the same in the opposite direction (shuffle before store).

The speed improvement is not very significant for these functions: grainextract, multiply, screen, average, grainmerge. I'm not sure the AVX2 version is needed for them (except for screen).

Checkasm result (x86_64, Kaby Lake):

./tests/checkasm/checkasm --test=vf_blend --bench
benchmarking with native FFmpeg timers
nop: 36.2
checkasm: using random seed 2027036350
SSE2:
 - vf_blend.8bit [OK]
SSSE3:
 - vf_blend.8bit [OK]
AVX2:
 - vf_blend.8bit [OK]
checkasm: all 37 tests passed
addition_c: 21882.7
addition_sse2: 483.9
addition_avx2: 250.9
and_c: 15336.7
and_sse2: 421.9
and_avx2: 196.7
average_c: 15640.7
average_sse2: 1160.7
average_avx2: 1155.7
darken_c: 27204.7
darken_sse2: 486.7
darken_avx2: 251.9
difference_c: 17101.9
difference_sse2: 981.2
difference_ssse3: 965.4
difference_avx2: 514.2
extremity_c: 27748.9
extremity_sse2: 1174.4
extremity_ssse3: 983.7
extremity_avx2: 520.4
grainextract_c: 22755.9
grainextract_sse2: 1158.2
grainextract_avx2: 1152.9
grainmerge_c: 26173.9
grainmerge_sse2: 1156.9
grainmerge_avx2: 1153.9
hardmix_c: 15676.9
hardmix_sse2: 458.4
hardmix_avx2: 268.7
lighten_c: 27137.4
lighten_sse2: 422.2
lighten_avx2: 194.2
multiply_c: 16449.9
multiply_sse2: 1378.9
multiply_avx2: 1158.7
negation_c: 17372.9
negation_sse2: 1439.4
negation_ssse3: 1172.4
negation_avx2: 520.4
or_c: 14116.2
or_sse2: 483.9
or_avx2: 236.4
phoenix_c: 30905.9
phoenix_sse2: 553.7
phoenix_avx2: 388.7
screen_c: 20414.7
screen_sse2: 1803.9
screen_avx2: 1257.4
subtract_c: 20596.2
subtract_sse2: 439.7
subtract_avx2: 403.7
xor_c: 15380.7
xor_sse2: 445.7
xor_avx2: 405.2

Comments welcome.

Martin
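
For readers following the load/shuffle/store description above, here is a minimal scalar sketch of the data movement in portable C. It is only a model under stated assumptions, not the macros from the patch: the type `reg256` and the helpers `load_split`, `widen_low`, `narrow_sat`, `store_merge` and `blend_average16` are hypothetical names, and a rounded average is used as a stand-in for the 16-bit intermediate operation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Scalar stand-in for a 256-bit register: two independent 128-bit lanes. */
typedef struct { uint8_t lane[2][16]; } reg256;

/* Hypothetical load helper: read 16 bytes (2 x 64 bits) and place each
 * 64-bit half in the low half of its own lane; upper halves are zeroed. */
static reg256 load_split(const uint8_t *src)
{
    reg256 r;
    assert(src != NULL);
    memset(&r, 0, sizeof(r));
    memcpy(r.lane[0], src, 8);
    memcpy(r.lane[1], src + 8, 8);
    return r;
}

/* Per-lane widening of the low 8 bytes to 16-bit words, as a lane-local
 * unpack against zero would do. */
static void widen_low(const reg256 *r, uint16_t w[2][8])
{
    for (int l = 0; l < 2; l++)
        for (int i = 0; i < 8; i++)
            w[l][i] = r->lane[l][i];
}

/* Per-lane narrowing with unsigned saturation; the results land in the
 * low half of each lane. */
static reg256 narrow_sat(uint16_t w[2][8])
{
    reg256 r;
    memset(&r, 0, sizeof(r));
    for (int l = 0; l < 2; l++)
        for (int i = 0; i < 8; i++)
            r.lane[l][i] = w[l][i] > 255 ? 255 : (uint8_t)w[l][i];
    return r;
}

/* Hypothetical store helper: gather the low halves of both lanes back
 * into 16 contiguous bytes (the "shuffle before store" step). */
static void store_merge(uint8_t *dst, const reg256 *r)
{
    memcpy(dst, r->lane[0], 8);
    memcpy(dst + 8, r->lane[1], 8);
}

/* Rounded average of 16 pixels using 16-bit intermediates. */
void blend_average16(uint8_t *dst, const uint8_t *top, const uint8_t *bottom)
{
    reg256 a = load_split(top), b = load_split(bottom), out;
    uint16_t wa[2][8], wb[2][8];

    widen_low(&a, wa);
    widen_low(&b, wb);
    for (int l = 0; l < 2; l++)
        for (int i = 0; i < 8; i++)
            wa[l][i] = (uint16_t)((wa[l][i] + wb[l][i] + 1) >> 1);
    out = narrow_sat(wa);
    store_merge(dst, &out);
}
```

Calling `blend_average16(dst, top, bottom)` on three 16-byte buffers processes 16 pixels per call, where the 64-bit movh load described above handles 8. Because each lane holds its own 8 source bytes in its low half, the widening and narrowing steps never cross lanes, which is why the per-lane processing can stay close to the SSE code.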
0001-avutil-x86-x86util-add-ABS2-for-AVX2.patch
Description: Binary data
0002-avfilter-x86-vf_blend-add-AVX2-version.patch
Description: Binary data
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel