On Wed, Sep 03, 2014 at 02:04:43PM -0700, Pascal Massimino wrote:
> Clément
>
> On Wed, Sep 3, 2014 at 12:37 PM, Clément Bœsch <u...@pkh.me> wrote:
>
> > On Wed, Sep 03, 2014 at 07:05:48PM +0200, Pascal Massimino wrote:
> > [...]
> > > > > +    punpcklbw m3, m_zero
> > > > > +    punpckhbw m4, m_zero
> > > > > +
> > > > > +    paddsw m0, m3
> > > > > +    paddsw m1, m4
> > > > > +
> > > > > +    movq m3, [bq+indexq*1]
> > > > > +    movq m4, m3
> > > > > +    punpcklbw m3, m_zero
> > > > > +    punpckhbw m4, m_zero
> > > > > +
> > > > > +    paddw m3, m3
> > > > > +    paddw m4, m4
> > > > > +    psubsw m0, m3
> > > > > +    psubsw m1, m4
> > > > > +
> > > >
> > > > > +    ABS1 m0, m5
> > > > > +    ABS1 m1, m5
> > > >
> > > > ABS2?
> > > >
> > >
> > > ABS2 requires the two tmp registers to be different (can't use m5 for both).
> > >
> >
> > Aren't m3 and m4 available at that point?
> >
>
> Indeed. We can actually get rid of m5 use totally.
>
> >
> > (ABS2 has the benefit of doing some pairing so is faster than doing two
> > ABS1)
> >
> > >
> > > >
> > > > > +    paddw m0, m1
> > > > > +    movq m1, m0
> > > > > +    punpcklwd m0, m_zero
> > > > > +    punpckhwd m1, m_zero
> > > > > +    paddd m0, m1
> > > > > +    paddd m_sum, m0
> > > > > +
> > > > > +    add indexq, 0x8
> > > >
> > > > > +    CMP widthq, indexq
> > > >
> > > > Someone needs to confirm this, but I think you'll need to make width a
> > > > ptrdiff_t and not an int
> > > >
> > >
> > > changed to widthd/indexd, that's enough.
> > >
> >
> > Hopefully...
> >
> > >
> > > >
> > > > Also... stupid question but what's CMP?
> > > >
> > >
> > > it's equivalent to 'cmp DWORD' here iirc.
> > >
> >
> > I believe you can keep it lowercase. I thought it was a macro but didn't
> > see anything like this.
> >
>
> In any case, it doesn't compile here with lowercase here (yasm 1.3.0).
> Not sure why. Any idea welcome...
>
> >
> > >
> > > >

[...]
> > > diff --git a/libavfilter/x86/vf_idet.asm b/libavfilter/x86/vf_idet.asm
> > > new file mode 100644
> > > index 0000000..19b7f3b
> > > --- /dev/null
> > > +++ b/libavfilter/x86/vf_idet.asm
> > > @@ -0,0 +1,116 @@
> > > +;; *****************************************************************************
> > > +;; * x86-optimized functions for idet filter
> > > +;; *
> > > +;; * This file is part of FFmpeg.
> > > +;; *
> > > +;; * FFmpeg is free software; you can redistribute it and/or modify
> > > +;; * it under the terms of the GNU General Public License as published by
> > > +;; * the Free Software Foundation; either version 2 of the License, or
> > > +;; * (at your option) any later version.
> > > +;; *
> > > +;; * FFmpeg is distributed in the hope that it will be useful,
> > > +;; * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > > +;; * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> > > +;; * GNU General Public License for more details.
> > > +;; *
> > > +;; * You should have received a copy of the GNU General Public License along
> > > +;; * with FFmpeg; if not, write to the Free Software Foundation, Inc.,
> > > +;; * 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
> > > +;; ******************************************************************************
> > > +
> > > +%include "libavutil/x86/x86util.asm"
> > > +
> > > +SECTION_TEXT
> > > +
> > > +; Implementation that does 8-bytes at a time using single-word operations.
> > > +%macro IDET_FILTER_LINE 0
> >
> > > +cglobal idet_filter_line, 4, 8, 6, a, b, c, width, index
> >
> > Shouldn't this be 4,5,8? 4 args, 5 regs (4 args + 1 local var), 8 xmm
> > (m0..m7)
> >
> > And BTW, you don't seem to use m2, so you can have 4,5,7 ; you probably
> > want to define m_zero on m2 to avoid any confusion.
> >
>
> Actually, m5 m6 and m7 are now unused, so it's 4,5,6 now, eventually.
> >
> > Now, this macro is only used for MMX, so you can probably do 4,5,0 (no XMM
> > reg). If you do 4,5,0, please move the INIT_MMX into the macro:
> >
> > %macro IDET_FILTER_LINE_MMX 1
> > INIT_MMX %1
> > ...
> > %end macro
> >
>
> all done.
>
>
> >
> > ...
> >
> > INIT_MMX mmx
> > INIT_MMX mmxext
> >
> >
> > > +    xor indexq, indexq
> > > +%define m_zero m7
> > > +%define m_sum m6
> > > +    pxor m_sum, m_sum
> > > +    pxor m_zero, m_zero
> > > +
> > > +.loop:
> > > +    movu m0, [aq+indexq*1]
> > > +    punpckhbw m1, m0, m_zero
> > > +    punpcklbw m0, m_zero
> > > +
> > > +    movu m3, [cq+indexq*1]
> > > +    punpckhbw m4, m3, m_zero
> > > +    punpcklbw m3, m_zero
> > > +
> > > +    paddsw m1, m4
> > > +    paddsw m0, m3
> > > +
> > > +    movu m3, [bq+indexq*1]
> > > +    punpckhbw m4, m3, m_zero
> > > +    punpcklbw m3, m_zero
> > > +
> > > +    paddw m4, m4
> > > +    paddw m3, m3
> > > +    psubsw m1, m4
> > > +    psubsw m0, m3
> > > +
> > > +    ABS1 m1, m5
> > > +    ABS1 m0, m5
> > > +
> > > +    paddw m0, m1
> > > +    punpckhwd m1, m0, m_zero
> > > +    punpcklwd m0, m_zero
> > > +
> > > +    paddd m0, m1
> > > +    paddd m_sum, m0
> > > +
> > > +    add indexq, 0x8
> > > +    CMP widthd, indexd
> > > +    jg .loop
> > > +
> > > +    mova m0, m_sum
> > > +    psrlq m_sum, 0x20
> > > +    paddq m0, m_sum
> > > +    movd eax, m0
> > > +    RET
> > > +%endmacro
> > > +
> >
> > > +%if ARCH_X86_32
> > > +INIT_MMX mmxext
> > > +IDET_FILTER_LINE
> > > +
> > > +INIT_MMX mmx
> > > +IDET_FILTER_LINE
> > > +%endif
> > > +
> > > +;; SSE2 8-bit implementation that does 16-bytes at a time:
> > > +INIT_XMM sse2
> >
> > > +cglobal idet_filter_line, 4, 8, 6, a, b, c, width, index, total
> >
> > 4,6,7, AFAICT
> >
>
> done
>
>
> >
> > > +    xor indexq, indexq
> > > +    pxor m0, m0
> > > +    pxor m1, m1
> > > +
> > > +.sse2_loop:
> > > +    movu m2, [bq+indexq*1] ; B
> > > +    movu m3, [aq+indexq*1] ; A
> > > +    mova m6, m2
> > > +    mova m4, m3
> > > +    psubusb m5, m2, m3 ; ba
> > > +
> > > +    movu m3, [cq+indexq*1] ; C
> > > +    add indexq, 0x10
> > > +    psubusb m4, m2 ; ab
> > > +    CMP indexd, widthd
> > > +
> > > +    psubusb m6, m3 ; bc
> > > +    psubusb m3, m2 ; cb
> > > +
> > > +    psadbw m4, m6 ; |ab - bc|
> > > +    paddq m0, m4
> > > +    psadbw m5, m3 ; |ba - cb|
> > > +    paddq m1, m5
> > > +    jl .sse2_loop
> > > +
> > > +    paddq m0, m1
> > > +    movhlps m1, m0
> > > +    paddq m0, m1
> > > +    movd eax, m0
> > > +    RET
> > [...]
> >
> > --
> > Clément B.
> >
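For readers following the two quoted loops: the MMX path widens bytes to words and accumulates |a + c - 2*b|, while the SSE2 path stays in bytes and accumulates |ab - bc| + |ba - cb| from unsigned saturating differences. The C sketch below is only a model of that arithmetic, not code from the patch; the names filter_line_ref, filter_line_sat, subus and count_mismatches are made up for illustration. It checks that the two formulations agree for every byte triple, which is why psubusb/psadbw can stand in for the 16-bit arithmetic.

```c
#include <stdint.h>
#include <stdlib.h>

/* Scalar model of the word-based (MMX) loop: sum of |a + c - 2*b|.
 * Hypothetical helper name, not taken from the patch. */
static int filter_line_ref(const uint8_t *a, const uint8_t *b,
                           const uint8_t *c, int w)
{
    int sum = 0;
    for (int x = 0; x < w; x++)
        sum += abs(a[x] + c[x] - 2 * b[x]);
    return sum;
}

/* Unsigned saturating byte subtract: what psubusb does per byte. */
static uint8_t subus(uint8_t x, uint8_t y)
{
    return x > y ? (uint8_t)(x - y) : 0;
}

/* Byte-wise model of the SSE2 loop: psadbw sums |ab - bc| and
 * |ba - cb|, where each term is a saturating difference. */
static int filter_line_sat(const uint8_t *a, const uint8_t *b,
                           const uint8_t *c, int w)
{
    int sum = 0;
    for (int x = 0; x < w; x++) {
        int ab = subus(a[x], b[x]), ba = subus(b[x], a[x]);
        int bc = subus(b[x], c[x]), cb = subus(c[x], b[x]);
        sum += abs(ab - bc) + abs(ba - cb);
    }
    return sum;
}

/* Exhaustively compare both models over all 2^24 byte triples and
 * return how many triples disagree. */
static long count_mismatches(void)
{
    long bad = 0;
    for (int a = 0; a < 256; a++)
        for (int b = 0; b < 256; b++)
            for (int c = 0; c < 256; c++) {
                uint8_t va = (uint8_t)a, vb = (uint8_t)b, vc = (uint8_t)c;
                if (filter_line_ref(&va, &vb, &vc, 1) !=
                    filter_line_sat(&va, &vb, &vc, 1))
                    bad++;
            }
    return bad;
}
```

Calling count_mismatches() from a small main() is enough to run the check. Writing d1 = a - b and d2 = b - c, the reference is |d1 - d2|, and splitting each difference into its positive and negative parts gives exactly the two saturating terms, whatever the signs of d1 and d2.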
>  MAINTAINERS                    |    1
>  libavfilter/vf_idet.c          |   38 ++-----------
>  libavfilter/vf_idet.h          |   58 ++++++++++++++++++++
>  libavfilter/x86/Makefile       |    2
>  libavfilter/x86/vf_idet.asm    |  114 +++++++++++++++++++++++++++++++++++++++++
>  libavfilter/x86/vf_idet_init.c |   70 +++++++++++++++++++++++++
>  6 files changed, 251 insertions(+), 32 deletions(-)
> 38cb38b276a49f553af990d4cf9b2f21e4f30ee4  0001-MMX-MMXEXT-SSE2-implementation-of-idet-s-filter_line.patch
> From 7eee81afae83a09d39659639b2ee6cadd67a42d6 Mon Sep 17 00:00:00 2001
> From: skal <pascal.massim...@gmail.com>
> Date: Wed, 3 Sep 2014 11:02:32 +0200
> Subject: [PATCH] MMX/MMXEXT/SSE2 implementation of idet's filter_line()
>
> integration by Neil Birkbeck, with help from Vitor Sessak.
> core SSE2 loop by Skal (pascal.massim...@gmail.com)

patch applied

thanks

[...]
--
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

The real ebay dictionary, page 3
"Rare item" - "Common item with rare defect or maybe just a lie"
"Professional" - "'Toy' made in china, not functional except as doorstop"
"Experts will know" - "The seller hopes you are not an expert"
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel