On Fri, Oct 2, 2015 at 6:57 PM, Paul B Mahol <one...@gmail.com> wrote:
> +INIT_XMM sse2
> +cglobal blend_xor, 9, 10, 2, 0, top, top_linesize, bottom, bottom_linesize, dst, dst_linesize, width, start, end
[...]
> +cglobal blend_or, 9, 10, 2, 0, top, top_linesize, bottom, bottom_linesize, dst, dst_linesize, width, start, end
[...]
> +cglobal blend_and, 9, 10, 2, 0, top, top_linesize, bottom, bottom_linesize, dst, dst_linesize, width, start, end

You could do those using floating-point operations (xorps, orps, andps); then
you only need SSE instead of SSE2 (and AVX instead of AVX2 if you want to make
versions using ymm registers).

> +cglobal blend_addition, 9, 10, 3, 0, top, top_linesize, bottom, bottom_linesize, dst, dst_linesize, width, start, end
[...]
> +    punpcklbw m0, m2
> +    punpcklbw m1, m2
> +    paddw m0, m1
> +    packuswb m0, m0
> +    movh [dstq + x], m0
> +    add r10q, mmsize / 2

Use paddusb.

> +cglobal blend_subtract, 9, 10, 3, 0, top, top_linesize, bottom, bottom_linesize, dst, dst_linesize, width, start, end
[...]
> +    punpcklbw m0, m2
> +    punpcklbw m1, m2
> +    psubw m0, m1
> +    packuswb m0, m0

Use psubusb.

> +cglobal blend_darken, 9, 10, 2, 0, top, top_linesize, bottom, bottom_linesize, dst, dst_linesize, width, start, end
[...]
> +    movh m0, [topq + x]
> +    movh m1, [bottomq + x]
> +    pminub m0, m1
> +    movh [dstq + x], m0
[...]
> +cglobal blend_lighten, 9, 10, 2, 0, top, top_linesize, bottom, bottom_linesize, dst, dst_linesize, width, start, end
[...]
> +    movh m0, [topq + x]
> +    movh m1, [bottomq + x]
> +    pmaxub m0, m1
> +    movh [dstq + x], m0

You're only utilizing the lower half of the registers here.
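
The reason paddusb and psubusb can replace the unpack/add/pack and unpack/subtract/pack sequences is that both forms saturate each byte to the 0..255 range. A minimal scalar sketch in C that checks this for every pair of byte values follows. It models a single byte lane of each instruction rather than using the real SIMD code, and all function names are made up for illustration; nothing here is taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Model of packuswb for one lane: signed 16-bit word -> unsigned byte,
 * saturating to 0..255. */
uint8_t pack_usw(int16_t w)
{
    if (w < 0)
        return 0;
    if (w > 255)
        return 255;
    return (uint8_t)w;
}

/* Quoted addition path: punpcklbw with zero (widen), paddw, packuswb. */
uint8_t add_widen_pack(uint8_t a, uint8_t b)
{
    return pack_usw((int16_t)(a + b));   /* sum is 0..510, fits in a word */
}

/* Quoted subtraction path: punpcklbw with zero (widen), psubw, packuswb. */
uint8_t sub_widen_pack(uint8_t a, uint8_t b)
{
    return pack_usw((int16_t)(a - b));   /* difference is -255..255 */
}

/* Model of paddusb for one lane: unsigned saturating byte add. */
uint8_t add_usb(uint8_t a, uint8_t b)
{
    unsigned sum = (unsigned)a + b;
    return sum > 255 ? 255 : (uint8_t)sum;
}

/* Model of psubusb for one lane: unsigned saturating byte subtract. */
uint8_t sub_usb(uint8_t a, uint8_t b)
{
    return a > b ? (uint8_t)(a - b) : 0;
}

/* Count the (a, b) pairs where the long and short forms disagree. */
int blend_mismatches(void)
{
    int mismatches = 0;
    for (int a = 0; a < 256; a++) {
        for (int b = 0; b < 256; b++) {
            if (add_widen_pack((uint8_t)a, (uint8_t)b) != add_usb((uint8_t)a, (uint8_t)b))
                mismatches++;
            if (sub_widen_pack((uint8_t)a, (uint8_t)b) != sub_usb((uint8_t)a, (uint8_t)b))
                mismatches++;
        }
    }
    return mismatches;
}

/* Convenience wrapper: aborts if the two forms ever differ. */
void check_blend_models(void)
{
    assert(blend_mismatches() == 0);
}
```

Calling check_blend_models() from a test program exercises all 65536 input pairs. Since the byte-wise instructions need no widening, they would also process a full register of pixels per iteration instead of half of one.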