On Thu, Oct 1, 2015 at 8:42 PM, Paul B Mahol <one...@gmail.com> wrote:
> diff --git a/libavfilter/vf_maskedmerge.c b/libavfilter/vf_maskedmerge.c
>      if (desc->comp[0].depth == 8)
>          s->maskedmerge = maskedmerge8;
>      else
>          s->maskedmerge = maskedmerge16;
>
> +    if (ARCH_X86)
> +        ff_maskedmerge_init_x86(s);
> +

Create a new function ff_maskedmerge_init() and move the above code there, that will make it easier to add a unit test.

> diff --git a/libavfilter/x86/vf_maskedmerge.asm b/libavfilter/x86/vf_maskedmerge.asm
> +    mova         m5, [pw_128]
> +    mova         m2, [pw_256]
> +    pxor         m6, m6

Nit: Reorganize your registers so you get those constants in m4, m5, m6. It will make the code easier to follow IMO.

> +    mov        r10q, 0

Xor a register with itself instead of using mov to zero a register. There's also no need to use the q suffix for plain register names, r10 is enough.

> +    movh         m0, [bsrcq + x]
> +    movh         m1, [osrcq + x]
> +    movh         m3, [msrcq + x]
[...]
> +    punpcklbw    m0, m6
> +    punpcklbw    m1, m6
> +    punpcklbw    m3, m6

You could also make an SSE4 version that uses pmovzxbw.

> +    paddw        m1, m5
> +    psrlw        m1, 8

I believe you could also make an SSSE3 version that uses pmulhrsw instead of add + shift.

> +    add        r10q, mmsize / 2
> +    cmp        r10q, wq
> +    jl .loop

There's a trick you could do here that might be faster:
1) Add w to bsrc, osrc, msrc and dst and then negate w in the beginning of the function.
2) Initialize r10 to w instead of 0 at the beginning of each .nextrow iteration
3) You can now drop the cmp, the add will be enough to set the right flags for the branch

I also encourage you to write a checkasm unit test, that will make it easier to both benchmark and verify the correctness of your code.
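
To illustrate the loop trick described above (steps 1 to 3), here is a rough sketch in plain C rather than asm. It rests on several assumptions: blend8(), merge_row_forward() and merge_row_negidx() are hypothetical names, the blend formula is an assumed 8-bit merge that may not match the patch's exact arithmetic, and the loop advances one pixel at a time instead of mmsize / 2.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed 8-bit blend, for illustration only (not taken from the patch). */
static uint8_t blend8(uint8_t b, uint8_t o, uint8_t m)
{
    return (uint8_t)(((256 - m) * b + m * o + 128) >> 8);
}

/* Baseline: index counts up from 0 and is compared against w every
 * iteration (the add + cmp + jl pattern). */
static void merge_row_forward(const uint8_t *bsrc, const uint8_t *osrc,
                              const uint8_t *msrc, uint8_t *dst, ptrdiff_t w)
{
    for (ptrdiff_t x = 0; x < w; x++)
        dst[x] = blend8(bsrc[x], osrc[x], msrc[x]);
}

/* Hypothetical variant using the negative-index trick. */
static void merge_row_negidx(const uint8_t *bsrc, const uint8_t *osrc,
                             const uint8_t *msrc, uint8_t *dst, ptrdiff_t w)
{
    assert(w >= 0);

    /* Step 1: advance every pointer to the end of the row ... */
    bsrc += w;
    osrc += w;
    msrc += w;
    dst  += w;

    /* Step 2: ... and start the index at -w instead of 0. */
    /* Step 3: the loop ends when the index reaches zero, so the only
     * comparison is against 0. In asm the add already sets the flags
     * for that, which is why the separate cmp can go away. */
    for (ptrdiff_t x = -w; x < 0; x++)
        dst[x] = blend8(bsrc[x], osrc[x], msrc[x]);
}
```

Both functions should write the same output for the same row. In the asm version the pointer adjustment and the negation of w would be done once at function entry, and only the index reset would happen per row.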