Hi,

On Wed, Oct 7, 2015 at 5:38 AM, Paul B Mahol <one...@gmail.com> wrote:
> Signed-off-by: Paul B Mahol <one...@gmail.com>
> ---
>  libavfilter/x86/vf_blend.asm    | 62 +++++++++++++++++++++++++++++++++++++++++
>  libavfilter/x86/vf_blend_init.c | 14 ++++++++++
>  2 files changed, 76 insertions(+)
>
> diff --git a/libavfilter/x86/vf_blend.asm b/libavfilter/x86/vf_blend.asm
> index 167e72b..7180817 100644
> --- a/libavfilter/x86/vf_blend.asm
> +++ b/libavfilter/x86/vf_blend.asm
> @@ -27,6 +27,8 @@ SECTION_RODATA
>
>  pw_128: times 8 dw 128
>  pw_255: times 8 dw 255
> +pb_128: times 16 db 128
> +pb_255: times 16 db 255
>
>  SECTION .text
>
> @@ -273,6 +275,36 @@ cglobal blend_darken, 9, 10, 2, 0, top, top_linesize, bottom, bottom_linesize, d
>      jg .nextrow
>      REP_RET
>
> +cglobal blend_hardmix, 9, 10, 3, 0, top, top_linesize, bottom, bottom_linesize, dst, dst_linesize, width, start, end
> +    add      topq, widthq
> +    add   bottomq, widthq
> +    add      dstq, widthq
> +    sub      endq, startq
> +    neg    widthq
> +.nextrow:
> +    mov      r10q, widthq
> +    %define     x  r10q
>

You're saying that you use 10 regs, but you're using r10, which is the 11th. Use r9 here, or specify that you use 11.

Now, more generally, you're using a lot of regs in all your simd, and some aren't necessary, so some lessons about arguments: most arguments come on stack. On x86-64, the first 4 (win64) or 6 (unix64) come in registers, but the rest (width, start, end) come on stack. On x86-32, all arguments come on stack. So, if you get 9 arguments, you have at least 3 arguments on stack, including width.

That means you don't have to move width into r10q; you can move widthmp (the stack version of this argument) into widthq at the start of each row, since the system already put width on stack for you. x86inc.asm moves it from stack into a register for you when you say cglobal name, %d and %d >= 7 (where width is the 7th argument).

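To make the counting above concrete, here is a small C sketch (not part of the patch; the enum and function names are made up for illustration) that tallies how many of N integer/pointer arguments land on the stack for each ABI. It assumes every argument is an integer or pointer, which is the case for the 9 arguments of blend_hardmix.

```c
/* Hypothetical helper, for illustration only. */
enum abi { ABI_X86_32, ABI_WIN64, ABI_UNIX64 };

/* Number of integer/pointer arguments passed in registers. */
static int reg_args(enum abi a)
{
    switch (a) {
    case ABI_WIN64:  return 4;
    case ABI_UNIX64: return 6;
    default:         return 0; /* x86-32: everything on stack */
    }
}

/* Number of arguments that the caller already placed on the stack. */
static int stack_args(int nargs, enum abi a)
{
    int in_regs = reg_args(a);
    return nargs > in_regs ? nargs - in_regs : 0;
}
```

With 9 arguments this gives 3 on unix64, 5 on win64 and 9 on x86-32, so width (the 7th) is on stack in every case.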
Then, you can also sub startmp from endq, which you can then store back into endmp on x86-32, and suddenly on x86-32 you only need 7 regs (for x86-64, you keep using endd since that's faster). And now, your simd works on 32-bit systems as well.

> +    .loop:
> +        movu     m0, [topq + x]
> +        movu     m1, [bottomq + x]
> +        mova     m2, [pb_255]
> +        psubusb  m2, m1

pxor m1, [pb_255] should be the same as mova reg, [pb_255] and psubusb reg, m1.

Now, you're using pb_255 a lot inside your inner loop, and with pxor, you only use it non-destructively, so why not move it into a register (m3) outside the loop so you only load it from mem once?

Ronald
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
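
A footnote on the pxor remark: the equivalence is easy to check with a scalar model of one byte lane. The C sketch below is not part of the patch and its function names are made up for illustration; it compares the unsigned saturating subtract from 255 (what psubusb does per byte) with XOR against 0xFF (what pxor does per byte) for all 256 input values.

```c
#include <stdint.h>

/* Scalar model of one byte lane of: mova reg, [pb_255] / psubusb reg, m1 */
static uint8_t sat_sub_from_255(uint8_t x)
{
    int diff = 255 - (int)x;
    return (uint8_t)(diff < 0 ? 0 : diff); /* unsigned saturation at 0 */
}

/* Scalar model of one byte lane of: pxor m1, [pb_255] */
static uint8_t xor_with_255(uint8_t x)
{
    return (uint8_t)(x ^ 0xFF);
}

/* Counts the byte values for which the two forms disagree. */
static int count_mismatches(void)
{
    int mismatches = 0;
    for (int x = 0; x <= 255; x++)
        if (sat_sub_from_255((uint8_t)x) != xor_with_255((uint8_t)x))
            mismatches++;
    return mismatches;
}
```

The saturation never kicks in because no byte exceeds 255, so both forms reduce to 255 - x and the result lands directly in m1 without a scratch register.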