On Thu, Aug 21, 2014 at 12:42 AM, James Almer <jamr...@gmail.com> wrote:
> * Reduced xmm register count to 7 (As such they are now enabled for x86_32).
> * Removed four movdqa (affects the sse2 version only).
> * pxor is now used to clear m0 only once.
>
> ~5% faster.
>
> Signed-off-by: James Almer <jamr...@gmail.com>
> ---
Good job, faster and 32-bit compat!

>  libavcodec/x86/hevc_res_add.asm | 122 ++++++++++++++++------------------------
>  libavcodec/x86/hevcdsp_init.c   |  10 ++--
>  2 files changed, 51 insertions(+), 81 deletions(-)
>
> diff --git a/libavcodec/x86/hevc_res_add.asm b/libavcodec/x86/hevc_res_add.asm
> index feea50c..7238fb3 100644
> --- a/libavcodec/x86/hevc_res_add.asm
> +++ b/libavcodec/x86/hevc_res_add.asm
> @@ -88,71 +88,41 @@ cglobal hevc_transform_add4_8, 3, 4, 6
>      movhps [r0+r3 ], m1
>  %endmacro
>
> -%macro TR_ADD_INIT_SSE_8 0
> -    pxor m0, m0
> -
> -    mova m4, [r1]
> -    mova m1, [r1+16]
> -    psubw m2, m0, m1
> -    psubw m5, m0, m4
> -    packuswb m4, m1
> -    packuswb m5, m2
> -
> -    mova m6, [r1+32]
> -    mova m1, [r1+48]
> -    psubw m2, m0, m1
> -    psubw m7, m0, m6
> -    packuswb m6, m1
> -    packuswb m7, m2
> -
> -    mova m8, [r1+64]
> -    mova m1, [r1+80]
> -    psubw m2, m0, m1
> -    psubw m9, m0, m8
> -    packuswb m8, m1
> -    packuswb m9, m2
> -
> -    mova m10, [r1+96]
> -    mova m1, [r1+112]
> -    psubw m2, m0, m1
> -    psubw m11, m0, m10
> -    packuswb m10, m1
> -    packuswb m11, m2
> -%endmacro
> -
> -
> -%macro TR_ADD_SSE_16_8 0
> -    TR_ADD_INIT_SSE_8
> -
> -    paddusb m0, m4, [r0 ]
> -    paddusb m1, m6, [r0+r2 ]
> -    paddusb m2, m8, [r0+r2*2]
> -    paddusb m3, m10,[r0+r3 ]
> -    psubusb m0, m5
> -    psubusb m1, m7
> -    psubusb m2, m9
> -    psubusb m3, m11
> -    mova [r0 ], m0
> -    mova [r0+r2 ], m1
> -    mova [r0+2*r2], m2
> -    mova [r0+r3 ], m3
> -%endmacro
> -
> -%macro TR_ADD_SSE_32_8 0
> -    TR_ADD_INIT_SSE_8
> -
> -    paddusb m0, m4, [r0 ]
> -    paddusb m1, m6, [r0+16 ]
> -    paddusb m2, m8, [r0+r2 ]
> -    paddusb m3, m10,[r0+r2+16]
> -    psubusb m0, m5
> -    psubusb m1, m7
> -    psubusb m2, m9
> -    psubusb m3, m11
> -    mova [r0 ], m0
> -    mova [r0+16 ], m1
> -    mova [r0+r2 ], m2
> -    mova [r0+r2+16], m3
> +%macro TR_ADD_SSE_16_32_8 3
> +    mova m2, [r1+%1 ]
> +    mova m6, [r1+%1+16]
> +%if cpuflag(avx)
> +    psubw m1, m0, m2
> +    psubw m5, m0, m6
> +%else
> +    mova m1, m0
> +    mova m5, m0
> +    psubw m1, m2
> +    psubw m5, m6
> +%endif

I was wondering about these blocks - doesn't the x264asm layer
automatically add the movas when you just use the 3-arg form on SSE2?
Or is there a speed benefit to grouping the movs?

- Hendrik
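For reference, here is a rough sketch of the kind of 3-operand emulation
the x264asm layer (x86inc.asm) performs when AVX is not available. This
is a simplified, hypothetical stand-in, not the actual macro from
x86inc.asm; the name PSUBW_3ARG is invented for illustration, and the
real emulation is more general (it also covers commutative instructions
and the case where the destination aliases the second source).

    ; Hypothetical, simplified version of x86inc's AVX emulation for a
    ; non-commutative instruction such as psubw.
    %macro PSUBW_3ARG 3 ; dst, src1, src2
    %if cpuflag(avx)
        vpsubw  %1, %2, %3      ; native 3-operand form
    %else
    %ifnidn %1, %2              ; dst != src1: copy src1 into dst first
        mova    %1, %2
    %endif
        psubw   %1, %3          ; 2-operand SSE2 form: dst -= src2
    %endif
    %endmacro

Under that assumption, writing "psubw m1, m0, m2" once would expand on
an SSE2 build to "mova m1, m0" followed by "psubw m1, m2" (as long as
the destination does not alias the second source), so the two branches
in the patch should produce the same instructions; any measurable
difference would come from how the movas are grouped, not from whether
they are emitted.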