On Tue, 16 Jun 2015 14:16:11 +0200 Gwenole Beauchesne <gb.de...@gmail.com> wrote:
> Hi, > > 2015-06-16 14:03 GMT+02:00 Michael Niedermayer <michae...@gmx.at>: > > On Tue, Jun 16, 2015 at 10:35:52AM +0200, Stefano Sabatini wrote: > >> On date Tuesday 2015-06-16 10:20:31 +0200, wm4 encoded: > >> > On Mon, 15 Jun 2015 17:55:35 +0200 > >> > Stefano Sabatini <stefa...@gmail.com> wrote: > >> > > >> > > On date Monday 2015-06-15 11:56:13 +0200, Stefano Sabatini encoded: > >> > > [...] > >> > > > From 3a75ef1e86360cd6f30b8e550307404d0d1c1dba Mon Sep 17 00:00:00 > >> > > > 2001 > >> > > > From: Stefano Sabatini <stefa...@gmail.com> > >> > > > Date: Mon, 15 Jun 2015 11:02:50 +0200 > >> > > > Subject: [PATCH] lavu/mem: add av_memcpynt() function with x86 > >> > > > optimizations > >> > > > > >> > > > Assembly based on code from vlc dxva2.c, commit 62107e56 by Laurent > >> > > > Aimar > >> > > > <fen...@videolan.org>. > >> > > > > >> > > > TODO: bump minor, update APIchanges > >> > > > --- > >> > > > libavutil/mem.c | 9 +++++ > >> > > > libavutil/mem.h | 14 ++++++++ > >> > > > libavutil/mem_internal.h | 26 +++++++++++++++ > >> > > > libavutil/x86/Makefile | 1 + > >> > > > libavutil/x86/mem.c | 85 > >> > > > ++++++++++++++++++++++++++++++++++++++++++++++++ > >> > > > 5 files changed, 135 insertions(+) > >> > > > create mode 100644 libavutil/mem_internal.h > >> > > > create mode 100644 libavutil/x86/mem.c > >> > > > > >> > > > diff --git a/libavutil/mem.c b/libavutil/mem.c > >> > > > index da291fb..0e1eb01 100644 > >> > > > --- a/libavutil/mem.c > >> > > > +++ b/libavutil/mem.c > >> > > > @@ -42,6 +42,7 @@ > >> > > > #include "dynarray.h" > >> > > > #include "intreadwrite.h" > >> > > > #include "mem.h" > >> > > > +#include "mem_internal.h" > >> > > > > >> > > > #ifdef MALLOC_PREFIX > >> > > > > >> > > > @@ -515,3 +516,11 @@ void av_fast_malloc(void *ptr, unsigned int > >> > > > *size, size_t min_size) > >> > > > ff_fast_malloc(ptr, size, min_size, 0); > >> > > > } > >> > > > > >> > > > +void av_memcpynt(void *dst, const void *src, size_t size, int > >> > > > cpu_flags) > >> > > > +{ > >> > > > +#if ARCH_X86 > >> > > > + ff_memcpynt_x86(dst, src, size, cpu_flags); > >> > > > +#else > >> > > > + memcpy(dst, src, size, cpu_flags); > >> > > > +#endif > >> > > > +} > >> > > > >> > > Alternatively, what about something like: > >> > > > >> > > av_memcpynt_fn av_memcpynt_get_fn(void); > >> > > > >> > > modeled after av_pixelutils_get_sad_fn()? This would skip the need for > >> > > a wrapper calling the right function. > >> > > >> > >> > I don't see much value in this, unless determining the right function > >> > causes too much overhead. > >> > >> I see two advantages, 1. no branch and function call when the function > >> is called, 2. the cpu_flags must not be passed around, so it's somehow > >> safer. > >> > >> I have no strong preference though, updated (untested patch) in > >> attachment. > >> -- > >> FFmpeg = Fierce and Forgiving Merciless Powered Extroverse Gargoyle > > > >> mem.c | 9 +++++ > >> mem.h | 13 +++++++ > >> mem_internal.h | 26 +++++++++++++++ > >> x86/Makefile | 1 > >> x86/mem.c | 98 > >> +++++++++++++++++++++++++++++++++++++++++++++++++++++++++ > >> 5 files changed, 147 insertions(+) > >> f536b25834e0927b8cab5c996042aae697b8d773 > >> 0003-lavu-mem-add-av_memcpynt_get_fn.patch > >> From c005ff5405dd48e6b0fed24ed94947f69bfe2783 Mon Sep 17 00:00:00 2001 > >> From: Stefano Sabatini <stefa...@gmail.com> > >> Date: Mon, 15 Jun 2015 11:02:50 +0200 > >> Subject: [PATCH] lavu/mem: add av_memcpynt_get_fn() > >> > >> Assembly based on code from vlc dxva2.c, commit 62107e56 by Laurent Aimar > >> <fen...@videolan.org>. > >> > >> TODO: remove use of inline assembly, bump minor, update APIchanges > >> --- > >> libavutil/mem.c | 9 +++++ > >> libavutil/mem.h | 13 +++++++ > >> libavutil/mem_internal.h | 26 +++++++++++++ > >> libavutil/x86/Makefile | 1 + > >> libavutil/x86/mem.c | 98 > >> ++++++++++++++++++++++++++++++++++++++++++++++++ > >> 5 files changed, 147 insertions(+) > >> create mode 100644 libavutil/mem_internal.h > >> create mode 100644 libavutil/x86/mem.c > >> > >> diff --git a/libavutil/mem.c b/libavutil/mem.c > >> index da291fb..325bfc9 100644 > >> --- a/libavutil/mem.c > >> +++ b/libavutil/mem.c > >> @@ -42,6 +42,7 @@ > >> #include "dynarray.h" > >> #include "intreadwrite.h" > >> #include "mem.h" > >> +#include "mem_internal.h" > >> > >> #ifdef MALLOC_PREFIX > >> > >> @@ -515,3 +516,11 @@ void av_fast_malloc(void *ptr, unsigned int *size, > >> size_t min_size) > >> ff_fast_malloc(ptr, size, min_size, 0); > >> } > >> > >> +av_memcpynt_fn av_memcpynt_get_fn(void) > >> +{ > >> +#if ARCH_X86 > >> + return ff_memcpynt_get_fn_x86(); > >> +#else > >> + return memcpy; > >> +#endif > >> +} > >> diff --git a/libavutil/mem.h b/libavutil/mem.h > >> index 2a1e36d..d9f1b7a 100644 > >> --- a/libavutil/mem.h > >> +++ b/libavutil/mem.h > >> @@ -382,6 +382,19 @@ void *av_fast_realloc(void *ptr, unsigned int *size, > >> size_t min_size); > >> */ > >> void av_fast_malloc(void *ptr, unsigned int *size, size_t min_size); > >> > >> +typedef void* (*av_memcpynt_fn)(void *dst, const void *src, size_t size); > >> + > >> +/** > >> + * Return possibly optimized function to copy size bytes from from src > >> + * to dst, using non-temporal copy. > >> + * > >> + * The returned function works as memcpy, but adopts non-temporal > >> + * instructios when available. This can lead to better performances > >> + * when transferring data from source to destination is expensive, for > >> + * example when reading from GPU memory. > >> + */ > >> +av_memcpynt_fn av_memcpynt_get_fn(void); > >> + > >> /** > >> * @} > >> */ > >> diff --git a/libavutil/mem_internal.h b/libavutil/mem_internal.h > >> new file mode 100644 > >> index 0000000..de61cba > >> --- /dev/null > >> +++ b/libavutil/mem_internal.h > >> @@ -0,0 +1,26 @@ > >> +/* > >> + * This file is part of FFmpeg. > >> + * > >> + * FFmpeg is free software; you can redistribute it and/or > >> + * modify it under the terms of the GNU Lesser General Public > >> + * License as published by the Free Software Foundation; either > >> + * version 2.1 of the License, or (at your option) any later version. > >> + * > >> + * FFmpeg is distributed in the hope that it will be useful, > >> + * but WITHOUT ANY WARRANTY; without even the implied warranty of > >> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > >> + * Lesser General Public License for more details. > >> + * > >> + * You should have received a copy of the GNU Lesser General Public > >> + * License along with FFmpeg; if not, write to the Free Software > >> + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA > >> 02110-1301 USA > >> + */ > >> + > >> +#ifndef AVUTIL_MEM_INTERNAL_H > >> +#define AVUTIL_MEM_INTERNAL_H > >> + > >> +#include "mem.h" > >> + > >> +av_memcpynt_fn ff_memcpynt_get_fn_x86(void); > >> + > >> +#endif /* AVUTIL_MEM_INTERNAL_H */ > >> diff --git a/libavutil/x86/Makefile b/libavutil/x86/Makefile > >> index a719c00..171c351 100644 > >> --- a/libavutil/x86/Makefile > >> +++ b/libavutil/x86/Makefile > >> @@ -2,6 +2,7 @@ OBJS += x86/cpu.o > >> \ > >> x86/float_dsp_init.o \ > >> x86/imgutils.o \ > >> x86/lls_init.o \ > >> + x86/mem.o \ > >> > >> OBJS-$(CONFIG_PIXELUTILS) += x86/pixelutils_init.o \ > >> > >> diff --git a/libavutil/x86/mem.c b/libavutil/x86/mem.c > >> new file mode 100644 > >> index 0000000..6326c90 > >> --- /dev/null > >> +++ b/libavutil/x86/mem.c > >> @@ -0,0 +1,98 @@ > >> +/* > >> + * This file is part of FFmpeg. > >> + * > >> + * FFmpeg is free software; you can redistribute it and/or > >> + * modify it under the terms of the GNU Lesser General Public > >> + * License as published by the Free Software Foundation; either > >> + * version 2.1 of the License, or (at your option) any later version. > >> + * > >> + * FFmpeg is distributed in the hope that it will be useful, > >> + * but WITHOUT ANY WARRANTY; without even the implied warranty of > >> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > >> + * Lesser General Public License for more details. > >> + * > >> + * You should have received a copy of the GNU Lesser General Public > >> + * License along with FFmpeg; if not, write to the Free Software > >> + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA > >> 02110-1301 USA > >> + */ > >> + > >> +#include <inttypes.h> > >> +#include "config.h" > >> +#include "libavutil/avassert.h" > >> +#include "libavutil/mem_internal.h" > >> + > >> +#if HAVE_SSE2 > >> +/* Copy 16/64 bytes from srcp to dstp loading data with the SSE>=2 > >> instruction > >> + * load and storing data with the SSE>=2 instruction store. > >> + */ > >> +#define COPY16(dstp, srcp, load, store) \ > >> + __asm__ volatile ( \ > >> + load " 0(%[src]), %%xmm1\n" \ > >> + store " %%xmm1, 0(%[dst])\n" \ > >> + : : [dst]"r"(dstp), [src]"r"(srcp) : "memory", "xmm1") > >> + > >> +#define COPY64(dstp, srcp, load, store) \ > >> + __asm__ volatile ( \ > >> + load " 0(%[src]), %%xmm1\n" \ > >> + load " 16(%[src]), %%xmm2\n" \ > >> + load " 32(%[src]), %%xmm3\n" \ > >> + load " 48(%[src]), %%xmm4\n" \ > >> + store " %%xmm1, 0(%[dst])\n" \ > >> + store " %%xmm2, 16(%[dst])\n" \ > >> + store " %%xmm3, 32(%[dst])\n" \ > >> + store " %%xmm4, 48(%[dst])\n" \ > >> + : : [dst]"r"(dstp), [src]"r"(srcp) : "memory", "xmm1", "xmm2", > >> "xmm3", "xmm4") > >> +#endif > >> + > >> +#define COPY_LINE(dstp, srcp, size, load) \ > >> + const unsigned unaligned = (-(uintptr_t)srcp) & 0x0f; \ > >> + unsigned x = unaligned; \ > >> + \ > >> + av_assert0(((intptr_t)dstp & 0x0f) == 0); \ > >> + \ > >> + __asm__ volatile ("mfence"); \ > >> + if (!unaligned) { \ > >> + for (; x+63 < size; x += 64) \ > >> + COPY64(&dstp[x], &srcp[x], load, "movdqa"); \ > >> + } else { \ > >> + COPY16(dst, src, "movdqu", "movdqa"); \ > >> + for (; x+63 < size; x += 64) \ > >> + COPY64(&dstp[x], &srcp[x], load, "movdqu"); \ > > > > to use SSE registers in inline asm operands or clobber list you need > > to build with -msse (which probably is default on on x86-64) > > > > files build with -msse will result in undefined behavior if anything > > in them is executed on a pre SSE cpu, as these allow gcc to put > > SSE instructions directly in the code where it likes > > > > The way out of this "design" is not to tell gcc that it passes > > a string with SSE code to the assembler > > that is not to use SSE registers in operands and not to put them > > on the clobber list unless gcc actually is in SSE mode and can use and > > need them there. > > see XMM_CLOBBERS* > > Well, from past experience, lying to gcc is generally not a good thing > either. There are multiple interesting ways it could fail from time to > time. :) > > Other approaches: > - With GCC >= 4.4, you can use __attribute__((target(T))) where T = > "ssse3", "sse4.1", etc. This is the easiest way ; > - Split into several separate files per target. Though, one would then > argue that while we are at it why not just start moving to yasm. > > The former approach looks more appealing to me, considering there may > be an effort to migrate to yasm afterwards. Or how about: - use yasm, which is used for all other (new) x86 SIMD usage exactly for avoiding such issues _______________________________________________ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel