[FFmpeg-devel] Which additional files need to be modified when adding PPC-specific version of libswscale/input.c
I am working on a patch to improve ffmpeg performance on PPC by using vector SIMD facilities for those processor versions that have the capability. There are 50 functions in libswscale/input.c that I am modifying. Adding #ifdefs in all the functions probably isn't the way to go. Is there a preferred method to implement such a change? Thanks. Dan Parrot. ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
[FFmpeg-devel] Is this the expected behavior in libswscale/input.c
Line 72 of libswscale/input.c is:

    dstU[i] = (ru*r + gu*g + bu*b + (0x10001<<(RGB2YUV_SHIFT-1))) >> RGB2YUV_SHIFT;

The definition of macro RGB2YUV_SHIFT in libswscale/swscale_internal.h is on line 417:

    #define RGB2YUV_SHIFT 15

By examining the result of executing line 72 in input.c it appears that the radix used for macro RGB2YUV_SHIFT is hexadecimal. So that 0x10001 is shifted left by 20 in subexpression 0x10001<<(RGB2YUV_SHIFT-1) with the result being 0x10.

I was expecting a left-shift of 14, given the macro definition.

Should the macro be interpreted as decimal 15 or hexadecimal 15?

Thanks.
Dan.
Re: [FFmpeg-devel] Is this the expected behavior in libswscale/input.c
On Thu, 2016-06-09 at 17:01 -0400, Ronald S. Bultje wrote:
> Hi,
>
> On Thu, Jun 9, 2016 at 4:02 PM, Dan Parrot wrote:
>
> > Line 72 of libswscale/input.c is:
> > dstU[i] = (ru*r + gu*g + bu*b + (0x10001<<(RGB2YUV_SHIFT-1))) >>
> > RGB2YUV_SHIFT;
> >
> > The definition of macro RGB2YUV_SHIFT in libswscale/swscale_internal.h
> > is on line 417:
> > #define RGB2YUV_SHIFT 15
> >
> > By examining the result of executing line 72 in input.c it appears that
> > the radix used for macro RGB2YUV_SHIFT is hexadecimal. So that 0x10001
> > is shifted left by 20 in subexpression 0x10001<<(RGB2YUV_SHIFT-1) with
> > the result being 0x10.
> >
> > I was expecting a left-shift of 14, given the macro definition.
> >
> > Should the macro be interpreted as decimal 15 or hexadecimal 15?
>
> It's decimal 15, it should shift by 14. Check disassembly, though... It's
> almost impossible that your compiler would read this as hex and still be
> able to produce runnable applications. Maybe submit a bug report to your
> compiler's bug tracker?
>
> Ronald

There is no issue with the software. I have been examining the code in gdb and what confused me was having changed the default gdb radix with "set radix 16".

Thanks Ronald.
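For readers following the arithmetic, here is a minimal C sketch of the two readings of "15". It assumes the value was typed into gdb by hand while "set radix 16" was in effect; the thread does not say exactly how the expression was evaluated, so this is only one plausible explanation. The helper names rounding_constant and literal_15_read_as_hex are made up for illustration and do not exist in FFmpeg.

```c
#include <stdint.h>
#include <stdlib.h>

#define RGB2YUV_SHIFT 15   /* decimal, as quoted from swscale_internal.h */

/* Rounding constant from the quoted line 72 of input.c,
 * evaluated the way the compiler sees it: a shift by 14. */
static uint32_t rounding_constant(void)
{
    return (uint32_t)0x10001 << (RGB2YUV_SHIFT - 1);
}

/* What a tool that parses its input in base 16 makes of the text "15".
 * (Illustrative only: this is strtol, not gdb.) */
static long literal_15_read_as_hex(void)
{
    return strtol("15", NULL, 16);
}
```

rounding_constant() evaluates to 0x40004000, the expected shift by 14. literal_15_read_as_hex() returns 21, and 21 - 1 = 20 matches the shift of 20 reported in the original question.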
[FFmpeg-devel] [PATCH] PPC64: IBM POWER8 SIMD Implementation
From e38eb7af05be27d8f36058373557d86e5a481db8 Mon Sep 17 00:00:00 2001 From: Dan Parrot Date: Tue, 14 Jun 2016 23:19:21 + Subject: [PATCH] PPC64: IBM POWER8 SIMD Implementation This is the first commit addressing Trac ticket #5570. Functions defined in libswscale/input.c have corresponding definitions in libswscale/ppc/input_vsx.h The corresponding function names in the latter contain the suffix "_vsx". --- libswscale/input.c | 38 ++- libswscale/ppc/input_vsx.h | 831 + 2 files changed, 853 insertions(+), 16 deletions(-) create mode 100644 libswscale/ppc/input_vsx.h diff --git a/libswscale/input.c b/libswscale/input.c index eed0f49..de4347e 100644 --- a/libswscale/input.c +++ b/libswscale/input.c @@ -40,6 +40,13 @@ #define r ((origin == AV_PIX_FMT_BGR48BE || origin == AV_PIX_FMT_BGR48LE || origin == AV_PIX_FMT_BGRA64BE || origin == AV_PIX_FMT_BGRA64LE) ? b_r : r_b) #define b ((origin == AV_PIX_FMT_BGR48BE || origin == AV_PIX_FMT_BGR48LE || origin == AV_PIX_FMT_BGRA64BE || origin == AV_PIX_FMT_BGRA64LE) ? 
r_b : b_r) +#ifdef HAVE_VSX +#include "ppc/input_vsx.h" +#define RENAME_SIMD(fname) fname ## _vsx +#elif +#define RENAME_SIMD(fname) fname +#endif + static av_always_inline void rgb64ToY_c_template(uint16_t *dst, const uint16_t *src, int width, enum AVPixelFormat origin, int32_t *rgb2yuv) @@ -99,7 +106,7 @@ static void pattern ## 64 ## BE_LE ## ToY_c(uint8_t *_dst, const uint8_t *_src, { \ const uint16_t *src = (const uint16_t *) _src; \ uint16_t *dst = (uint16_t *) _dst; \ -rgb64ToY_c_template(dst, src, width, origin, rgb2yuv); \ +RENAME_SIMD(rgb64ToY_c_template)(dst, src, width, origin, rgb2yuv); \ } \ \ static void pattern ## 64 ## BE_LE ## ToUV_c(uint8_t *_dstU, uint8_t *_dstV, \ @@ -109,7 +116,7 @@ static void pattern ## 64 ## BE_LE ## ToUV_c(uint8_t *_dstU, uint8_t *_dstV, \ const uint16_t *src1 = (const uint16_t *) _src1, \ *src2 = (const uint16_t *) _src2; \ uint16_t *dstU = (uint16_t *) _dstU, *dstV = (uint16_t *) _dstV; \ -rgb64ToUV_c_template(dstU, dstV, src1, src2, width, origin, rgb2yuv); \ +RENAME_SIMD(rgb64ToUV_c_template)(dstU, dstV, src1, src2, width, origin, rgb2yuv); \ } \ \ static void pattern ## 64 ## BE_LE ## ToUV_half_c(uint8_t *_dstU, uint8_t *_dstV, \ @@ -119,7 +126,7 @@ static void pattern ## 64 ## BE_LE ## ToUV_half_c(uint8_t *_dstU, uint8_t *_dstV const uint16_t *src1 = (const uint16_t *) _src1, \ *src2 = (const uint16_t *) _src2; \ uint16_t *dstU = (uint16_t *) _dstU, *dstV = (uint16_t *) _dstV; \ -rgb64ToUV_half_c_template(dstU, dstV, src1, src2, width, origin, rgb2yuv); \ +RENAME_SIMD(rgb64ToUV_half_c_template)(dstU, dstV, src1, src2, width, origin, rgb2yuv); \ } rgb64funcs(rgb, LE, AV_PIX_FMT_RGBA64LE) @@ -203,7 +210,7 @@ static void pattern ## 48 ## BE_LE ## ToY_c(uint8_t *_dst, \ { \ const uint16_t *src = (const uint16_t *)_src; \ uint16_t *dst = (uint16_t *)_dst; \ -rgb48ToY_c_template(dst, src, width, origin, rgb2yuv); \ +RENAME_SIMD(rgb48ToY_c_template)(dst, src, width, origin, rgb2yuv); \ } \ \ static void pattern ## 48 ## 
BE_LE ## ToUV_c(uint8_t *_dstU,\ @@ -218,7 +225,7 @@ static void pattern ## 48 ## BE_LE ## ToUV_c(uint8_t *_dstU,\ *src2 = (const uint16_t *)_src2; \ uint16_t *dstU = (uint16_t *)_dstU, \ *dstV = (uint16_t *)_dstV; \ -rgb48ToUV_c_template(dstU, dstV, src1, src2, width, origin, rgb2yuv);\ +RENAME_SIMD(rgb48ToUV_c_template)(dstU, dstV, src1, src2, width, origin, rgb2yuv);\ } \ \ static void pattern ## 48 ## BE_LE ## ToUV_half_c(uint8_t *_dstU, \ @@ -233,7 +240,7 @@ static void pattern ## 48 ## BE_LE ## ToUV_half_c(uint8_t *_dstU, \ *src2 = (const uint16_t *)_src2; \ uint16_t *dstU = (uint16_t *)_dstU, \ *dstV = (uint16_t *)_dstV; \ -rgb48ToUV_half_c_template(dstU, dstV, src1, src2, width, origin, rgb2yuv); \ +RENAME_SIMD(rgb48ToUV_half_c_template)(dstU, dstV, src1, src2, width, origin, rgb2yuv); \ } rgb48funcs(rgb, LE, AV_PIX_FMT_RGB48LE) @@ -273,7 +280,6 @@ static av_always_inline void rgb16_32ToY_c_template(int16_t *dst, dst[i] = (ry * r + g
Re: [FFmpeg-devel] [PATCH] PPC64: IBM POWER8 SIMD Implementation
On Tue, 2016-06-14 at 18:56 -0500, Dan Parrot wrote:
> [...]

Please disregard this attempted patch. I made the wrong choice of using an email client rather than git send-email.

Apologies.
Dan.
[FFmpeg-devel] [PATCH] PPC64: Add IBM POWER8 SIMD Implementation
int16_t *dstV, const uint8_t *src, @@ -351,17 +357,17 @@ static av_always_inline void rgb16_32ToUV_half_c_template(int16_t *dstU, static void name ## ToY_c(uint8_t *dst, const uint8_t *src, const uint8_t *unused1, const uint8_t *unused2,\ int width, uint32_t *tab) \ { \ -rgb16_32ToY_c_template((int16_t*)dst, src, width, fmt, shr, shg, shb, shp, \ - maskr, maskg, maskb, rsh, gsh, bsh, S, tab); \ +RENAME_SIMD(rgb16_32ToY_c_template)((int16_t*)dst, src, width, fmt, shr, shg, shb, shp,\ +maskr, maskg, maskb, rsh, gsh, bsh, S, tab); \ } \ \ static void name ## ToUV_c(uint8_t *dstU, uint8_t *dstV,\ const uint8_t *unused0, const uint8_t *src, const uint8_t *dummy,\ int width, uint32_t *tab)\ { \ -rgb16_32ToUV_c_template((int16_t*)dstU, (int16_t*)dstV, src, width, fmt, \ -shr, shg, shb, shp, \ -maskr, maskg, maskb, rsh, gsh, bsh, S, tab);\ +RENAME_SIMD(rgb16_32ToUV_c_template)((int16_t*)dstU, (int16_t*)dstV, src, width, fmt,\ + shr, shg, shb, shp, \ + maskr, maskg, maskb, rsh, gsh, bsh, S, tab);\ } \ \ static void name ## ToUV_half_c(uint8_t *dstU, uint8_t *dstV, \ @@ -369,10 +375,10 @@ static void name ## ToUV_half_c(uint8_t *dstU, uint8_t *dstV, \ const uint8_t *dummy, \ int width, uint32_t *tab) \ { \ -rgb16_32ToUV_half_c_template((int16_t*)dstU, (int16_t*)dstV, src, width, fmt, \ - shr, shg, shb, shp,\ - maskr, maskg, maskb, \ - rsh, gsh, bsh, S, tab);\ +RENAME_SIMD(rgb16_32ToUV_half_c_template)((int16_t*)dstU, (int16_t*)dstV, src, width, fmt, \ + shr, shg, shb, shp, \ + maskr, maskg, maskb, \ + rsh, gsh, bsh, S, tab); \ } rgb16_32_wrapper(AV_PIX_FMT_BGR32,bgr32, 16, 0, 0, 0, 0xFF, 0xFF00, 0x00FF, 8, 0, 8, RGB2YUV_SHIFT + 8) @@ -978,7 +984,6 @@ av_cold void ff_sws_init_input_funcs(SwsContext *c) case AV_PIX_FMT_GBRP9LE: c->readChrPlanar = planar_rgb9le_to_uv; break; -case AV_PIX_FMT_GBRAP10LE: case AV_PIX_FMT_GBRP10LE: c->readChrPlanar = planar_rgb10le_to_uv; break; @@ -996,7 +1001,6 @@ av_cold void ff_sws_init_input_funcs(SwsContext *c) case AV_PIX_FMT_GBRP9BE: 
c->readChrPlanar = planar_rgb9be_to_uv; break; -case AV_PIX_FMT_GBRAP10BE: case AV_PIX_FMT_GBRP10BE: c->readChrPlanar = planar_rgb10be_to_uv; break; @@ -1260,8 +1264,6 @@ av_cold void ff_sws_init_input_funcs(SwsContext *c) case AV_PIX_FMT_GBRP9LE: c->readLumPlanar = planar_rgb9le_to_y; break; -case AV_PIX_FMT_GBRAP10LE: -c->readAlpPlanar = planar_rgb10le_to_a; case AV_PIX_FMT_GBRP10LE: c->readLumPlanar = planar_rgb10le_to_y; break; @@ -1281,8 +1283,6 @@ av_cold void ff_sws_init_input_funcs(SwsContext *c) case AV_PIX_FMT_GBRP9BE: c->readLumPlanar = planar_rgb9be_to_y; break; -case AV_PIX_FMT_GBRAP10BE: -c->readAlpPlanar = planar_rgb10be_to_a; case AV_PIX_FMT_GBRP10BE: c->readLumPlanar = planar_rgb10be_to_y; break; diff --git a/libswscale/ppc/input_vsx.h b/libswscale/ppc/input_vsx.h new file mode 100644 index 000..09fe8c1 --- /dev/null +++ b/libswscale/ppc/input_vsx.h @@ -0,0 +1,831 @@ +/* + * Copyright (C) 2016 Dan Parrot + * + * This file is part of FFmpeg. + * + * FFmpeg is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2.1 of the License, or (at your option)
Re: [FFmpeg-devel] [PATCH] PPC64: Add IBM POWER8 SIMD Implementation
On Wed, 2016-06-15 at 10:15 +0200, Michael Niedermayer wrote:
> On Wed, Jun 15, 2016 at 04:25:11AM +0000, Dan Parrot wrote:
> > This is the first commit addressing Trac ticket #5570. Functions defined in
> > libswscale/input.c have corresponding definitions in
> > libswscale/ppc/input_vsx.h
> > The corresponding function names in the latter contain the suffix "_vsx".
> > ---
> >  libswscale/input.c         |  44 +--
> >  libswscale/ppc/input_vsx.h | 831
> > +
> >  2 files changed, 853 insertions(+), 22 deletions(-)
> >  create mode 100644 libswscale/ppc/input_vsx.h
>
> breaks build on x86
> ./configure && make -j12
> In file included from libswscale/input.c:44:0:
> libswscale/ppc/input_vsx.h: In function ‘rgb64ToY_c_template_vsx’:
> libswscale/ppc/input_vsx.h:34:5: error: ‘vector’ undeclared (first use in this function)
> libswscale/ppc/input_vsx.h:34:5: note: each undeclared identifier is reported only once for each function it appears in
> libswscale/ppc/input_vsx.h:34:12: error: expected ‘;’ before ‘int’
> libswscale/ppc/input_vsx.h:35:12: error: expected ‘;’ before ‘int’
> libswscale/ppc/input_vsx.h:36:12: error: expected ‘;’ before ‘int’
>
> [...]
> > @@ -978,7 +984,6 @@ av_cold void ff_sws_init_input_funcs(SwsContext *c)
> >      case AV_PIX_FMT_GBRP9LE:
> >          c->readChrPlanar = planar_rgb9le_to_uv;
> >          break;
> > -    case AV_PIX_FMT_GBRAP10LE:
> >      case AV_PIX_FMT_GBRP10LE:
> >          c->readChrPlanar = planar_rgb10le_to_uv;
> >          break;
> > @@ -996,7 +1001,6 @@ av_cold void ff_sws_init_input_funcs(SwsContext *c)
> >      case AV_PIX_FMT_GBRP9BE:
> >          c->readChrPlanar = planar_rgb9be_to_uv;
> >          break;
> > -    case AV_PIX_FMT_GBRAP10BE:
> >      case AV_PIX_FMT_GBRP10BE:
> >          c->readChrPlanar = planar_rgb10be_to_uv;
> >          break;
> > @@ -1260,8 +1264,6 @@ av_cold void ff_sws_init_input_funcs(SwsContext *c)
> >      case AV_PIX_FMT_GBRP9LE:
> >          c->readLumPlanar = planar_rgb9le_to_y;
> >          break;
> > -    case AV_PIX_FMT_GBRAP10LE:
> > -        c->readAlpPlanar = planar_rgb10le_to_a;
> >      case AV_PIX_FMT_GBRP10LE:
> >          c->readLumPlanar = planar_rgb10le_to_y;
> >          break;
> > @@ -1281,8 +1283,6 @@ av_cold void ff_sws_init_input_funcs(SwsContext *c)
> >      case AV_PIX_FMT_GBRP9BE:
> >          c->readLumPlanar = planar_rgb9be_to_y;
> >          break;
> > -    case AV_PIX_FMT_GBRAP10BE:
> > -        c->readAlpPlanar = planar_rgb10be_to_a;
> >      case AV_PIX_FMT_GBRP10BE:
> >          c->readLumPlanar = planar_rgb10be_to_y;
> >          break;
>
> why do you remove these ?
> thats not ppc related
>
> [...]

Did not intend to touch anything non-PPC. I didn't review git diff carefully enough. I'll resend patch with modifications only for PPC.
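A likely contributor to the x86 build failure quoted above is the "#ifdef HAVE_VSX" guard in the patch. FFmpeg's generated config.h defines feature macros such as HAVE_VSX to 0 or 1 rather than leaving them undefined, so "#ifdef" is true on every platform while "#if" tests the value. The sketch below assumes a non-PPC build where the macro is defined as 0; SEEN_BY_IFDEF and SEEN_BY_IF are made-up names used only to show which branch the preprocessor takes. Note also that "#elif" requires an expression, so the fallback branch of such a guard wants "#else".

```c
/* Stand-in for a generated config.h on a non-PPC build
 * (assumption: the feature macro is defined, with value 0). */
#define HAVE_VSX 0

#ifdef HAVE_VSX
#define SEEN_BY_IFDEF 1   /* taken: the macro exists, even though it is 0 */
#else
#define SEEN_BY_IFDEF 0
#endif

#if HAVE_VSX
#define SEEN_BY_IF 1
#else
#define SEEN_BY_IF 0      /* taken: the macro's value is 0 */
#endif

/* Returns 1 when the two guards disagree, as they do for a 0-valued macro. */
static int guards_disagree(void)
{
    return SEEN_BY_IFDEF != SEEN_BY_IF;
}
```

With the value-testing form, a PPC-only header would be skipped on x86 and the 'vector' keyword would never reach the x86 compiler.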
Re: [FFmpeg-devel] [PATCH] PPC64: Add IBM POWER8 SIMD Implementation
On Wed, 2016-06-15 at 11:19 +, Carl Eugen Hoyos wrote:
> Dan Parrot writes:
>
> [...]
>
> I know this isn't completely related but do you have time
> to look at ticket #5508?
> https://trac.ffmpeg.org/ticket/5508
> No active developer has hardware and knowledge to look into
> this issue;-(
>
> Carl Eugen

I hope to have some time this weekend. We'll see how much progress I can make on the ticket then.
Re: [FFmpeg-devel] [PATCH] PPC64: Add IBM POWER8 SIMD Implementation
On Wed, 2016-06-15 at 16:51 +0200, Hendrik Leppkes wrote: > On Wed, Jun 15, 2016 at 6:25 AM, Dan Parrot wrote: > > This is the first commit addressing Trac ticket #5570. Functions defined in > > libswscale/input.c have corresponding definitions in > > libswscale/ppc/input_vsx.h > > The corresponding function names in the latter contain the suffix "_vsx". > > --- > > libswscale/input.c | 44 +-- > > libswscale/ppc/input_vsx.h | 831 > > + > > 2 files changed, 853 insertions(+), 22 deletions(-) > > create mode 100644 libswscale/ppc/input_vsx.h > > > > diff --git a/libswscale/input.c b/libswscale/input.c > > index 14ab5ab..de4347e 100644 > > --- a/libswscale/input.c > > +++ b/libswscale/input.c > > @@ -40,6 +40,13 @@ > > #define r ((origin == AV_PIX_FMT_BGR48BE || origin == AV_PIX_FMT_BGR48LE > > || origin == AV_PIX_FMT_BGRA64BE || origin == AV_PIX_FMT_BGRA64LE) ? b_r : > > r_b) > > #define b ((origin == AV_PIX_FMT_BGR48BE || origin == AV_PIX_FMT_BGR48LE > > || origin == AV_PIX_FMT_BGRA64BE || origin == AV_PIX_FMT_BGRA64LE) ? r_b : > > b_r) > > > > +#ifdef HAVE_VSX > > +#include "ppc/input_vsx.h" > > +#define RENAME_SIMD(fname) fname ## _vsx > > +#elif > > +#define RENAME_SIMD(fname) fname > > +#endif > > + > > static av_always_inline void > > rgb64ToY_c_template(uint16_t *dst, const uint16_t *src, int width, > > enum AVPixelFormat origin, int32_t *rgb2yuv) > > @@ -99,7 +106,7 @@ static void pattern ## 64 ## BE_LE ## ToY_c(uint8_t > > *_dst, const uint8_t *_src, > > { \ > > const uint16_t *src = (const uint16_t *) _src; \ > > uint16_t *dst = (uint16_t *) _dst; \ > > -rgb64ToY_c_template(dst, src, width, origin, rgb2yuv); \ > > +RENAME_SIMD(rgb64ToY_c_template)(dst, src, width, origin, rgb2yuv); \ > > } \ > > This is not how we integrate SIMD optimizations. These are the C > functions, they are not meant to perform the SIMD. 
> What you should do is provide SIMD functions and then provide a
> SIMD-specific init function that overwrites the function pointers with
> your SIMD functions.
> ie. just how it is done on x86. But do not touch the C functions by
> overriding them right in the code with SIMD variants, making the C
> variants inaccessible.

1. The #ifdef HAVE_VSX at the top of the file should actually be #if HAVE_VSX.
The intent is for C variants of functions to be replaced by SIMD versions. But only for PowerPC machines that have Vector-Scalar hardware. All other machines will retain exactly the same function names they currently possess.

Would that change eliminate this particular objection?

2. To achieve the same effect using the style done for x86 requires more code. Just considering the first function in input.c, one must provide:

  i. A SIMD implementation of rgb64ToY_c_template
  ii. Definitions for results of the 4 macro expansions of rgb64funcs. These are the functions that actually instantiate rgb64ToY_c_template
  iii. Replication of all the code in ff_sws_init_input_funcs that assigns the results of item ii. above to function pointers.

Why is the x86 approach preferred over the seemingly simpler preprocessor renaming?
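The init-function approach described in the review can be sketched in plain C as follows. This is a reduced illustration, not FFmpeg code: ScaleContext, halve_c, halve_simd and the init_funcs_* helpers are hypothetical names, and the "SIMD" routine is just an unrolled loop standing in for real vector code. The point is only that the C function stays compiled and reachable, and the architecture-specific init overwrites a pointer when the CPU supports the faster variant.

```c
#include <stdint.h>

/* Hypothetical, much-reduced stand-in for SwsContext: one function pointer. */
typedef struct ScaleContext {
    void (*halve)(uint16_t *dst, const uint16_t *src, int width);
} ScaleContext;

/* Reference C implementation: always built, always reachable. */
static void halve_c(uint16_t *dst, const uint16_t *src, int width)
{
    for (int i = 0; i < width; i++)
        dst[i] = (uint16_t)(src[i] >> 1);
}

/* Stand-in for a SIMD implementation: 4 elements per step, scalar tail. */
static void halve_simd(uint16_t *dst, const uint16_t *src, int width)
{
    int i = 0;
    for (; i + 4 <= width; i += 4) {
        dst[i + 0] = (uint16_t)(src[i + 0] >> 1);
        dst[i + 1] = (uint16_t)(src[i + 1] >> 1);
        dst[i + 2] = (uint16_t)(src[i + 2] >> 1);
        dst[i + 3] = (uint16_t)(src[i + 3] >> 1);
    }
    for (; i < width; i++)
        dst[i] = (uint16_t)(src[i] >> 1);
}

/* Generic init: install the C functions first. */
static void init_funcs_c(ScaleContext *c)
{
    c->halve = halve_c;
}

/* Arch-specific init: overwrite pointers only when the CPU supports it. */
static void init_funcs_simd(ScaleContext *c, int cpu_has_simd)
{
    if (cpu_has_simd)
        c->halve = halve_simd;
}

static void init_funcs(ScaleContext *c, int cpu_has_simd)
{
    init_funcs_c(c);
    init_funcs_simd(c, cpu_has_simd);
}
```

Because both variants exist in the same binary, the C version can still be selected at run time as a fallback or as a reference when checking the output of the optimized one, which preprocessor renaming does not allow.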
[FFmpeg-devel] [PATCH] PPC64: Add IBM POWER8 SIMD Implementation
First commit addressing Trac ticket #5570. Functions defined in libswscale/input.c have corresponding SIMD definitions in libswscale/ppc/input_vsx.c --- libswscale/ppc/Makefile |1 + libswscale/ppc/input_vsx.c| 1070 + libswscale/swscale.c |3 + libswscale/swscale_internal.h |1 + 4 files changed, 1075 insertions(+) create mode 100644 libswscale/ppc/input_vsx.c diff --git a/libswscale/ppc/Makefile b/libswscale/ppc/Makefile index d1b596e..2482893 100644 --- a/libswscale/ppc/Makefile +++ b/libswscale/ppc/Makefile @@ -1,3 +1,4 @@ OBJS += ppc/swscale_altivec.o \ +ppc/input_vsx.o \ ppc/yuv2rgb_altivec.o \ ppc/yuv2yuv_altivec.o \ diff --git a/libswscale/ppc/input_vsx.c b/libswscale/ppc/input_vsx.c new file mode 100644 index 000..adb0e38 --- /dev/null +++ b/libswscale/ppc/input_vsx.c @@ -0,0 +1,1070 @@ +/* + * Copyright (C) 2016 Dan Parrot + * + * This file is part of FFmpeg. + * + * FFmpeg is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2.1 of the License, or (at your option) any later version. + * + * FFmpeg is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details. 
+ * + * You should have received a copy of the GNU Lesser General Public + * License along with FFmpeg; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + */ + +#include +#include +#include +#include + +#include "libavutil/avutil.h" +#include "libavutil/bswap.h" +#include "libavutil/cpu.h" +#include "libavutil/intreadwrite.h" +#include "libavutil/mathematics.h" +#include "libavutil/pixdesc.h" +#include "libavutil/avassert.h" +#include "config.h" +#include "libswscale/rgb2rgb.h" +#include "libswscale/swscale.h" +#include "libswscale/swscale_internal.h" + +#define input_pixel(pos) (isBE(origin) ? AV_RB16(pos) : AV_RL16(pos)) + +#define r ((origin == AV_PIX_FMT_BGR48BE || origin == AV_PIX_FMT_BGR48LE || origin == AV_PIX_FMT_BGRA64BE || origin == AV_PIX_FMT_BGRA64LE) ? b_r : r_b) +#define b ((origin == AV_PIX_FMT_BGR48BE || origin == AV_PIX_FMT_BGR48LE || origin == AV_PIX_FMT_BGRA64BE || origin == AV_PIX_FMT_BGRA64LE) ? r_b : b_r) + +#if HAVE_VSX + +// This is a SIMD version for IBM POWER8 of function rgb64ToY_c_template +// in file libswscale/input.c +static av_always_inline void +rgb64ToY_c_template_vsx(uint16_t *dst, const uint16_t *src, int width, +enum AVPixelFormat origin, int32_t *rgb2yuv) +{ +int32_t ry = rgb2yuv[RY_IDX], gy = rgb2yuv[GY_IDX], by = rgb2yuv[BY_IDX]; +int i, j; +int num_vec, frag; + +num_vec = width / 8; +frag= width % 8; + +vector int v_ry = vec_splats((int)ry); +vector int v_gy = vec_splats((int)gy); +vector int v_by = vec_splats((int)by); + +int s_opr2; +s_opr2 = (int)(0x2001 << (RGB2YUV_SHIFT-1)); + +vector int v_opr1 = vec_splats((int)RGB2YUV_SHIFT); +vector int v_opr2 = vec_splats((int)s_opr2); + +vector int v_r, v_g, v_b, v_tmp; +vector short v_tmpi, v_dst; + +for (i = 0; i < num_vec; i++) { +for (j = 7; j >= 0 ; j--) { +int r_b = input_pixel(&src[(i*8+j)*4+0]); +int g = input_pixel(&src[(i*8+j)*4+1]); +int b_r = input_pixel(&src[(i*8+j)*4+2]); + +v_r[j % 4] = r; +v_g[j % 
4] = g; +v_b[j % 4] = b; + +if (!(j % 4)) { +v_tmp = v_ry * v_r; +v_tmp = v_tmp + v_gy * v_g; +v_tmp = v_tmp + v_by * v_b; +v_tmp = v_tmp + v_opr2; +v_tmp = vec_sr(v_tmp, (vector unsigned int)v_opr1); + +v_tmpi = (vector short)v_tmp; +v_dst[(j / 4) * 4 + 3] = v_tmpi[6]; +v_dst[(j / 4) * 4 + 2] = v_tmpi[4]; +v_dst[(j / 4) * 4 + 1] = v_tmpi[2]; +v_dst[(j / 4) * 4 + 0] = v_tmpi[0]; +} +} +vec_vsx_st(v_dst, 0, (short *)&dst[i*8]); +} + +// computation for any less than vector-length items at tail end +if( frag ) { +for (i = 0; i < frag; i++) { +unsigned int r_b = input_pixel(&src[(num_vec*8+i)*4+0]); +unsigned int g = input_pixel(&src[(num_vec*8+i)*4+1]); +unsigned int b_r = input_pixel(&src[(num_vec*8+i)*4+2]); + +
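The patch splits each row into width / 8 full blocks plus a width % 8 tail. A portable sketch of that blocking is shown here, with a plain inner loop standing in for the VSX vector operations. It assumes RGBA64 input already in native byte order and a fixed R, G, B, A component order; luma_one and rgb64_to_y_blocked are hypothetical names, not functions from the patch.

```c
#include <stdint.h>

#define RGB2YUV_SHIFT 15   /* value quoted earlier in the thread */
#define BLOCK 8            /* pixels per block, as in the patch */

/* Scalar luma for one RGBA64 pixel (4 x 16-bit components). */
static uint16_t luma_one(const uint16_t *px, int32_t ry, int32_t gy, int32_t by)
{
    unsigned int r = px[0], g = px[1], b = px[2];
    return (uint16_t)((ry * r + gy * g + by * b
                       + (0x2001 << (RGB2YUV_SHIFT - 1))) >> RGB2YUV_SHIFT);
}

static void rgb64_to_y_blocked(uint16_t *dst, const uint16_t *src, int width,
                               int32_t ry, int32_t gy, int32_t by)
{
    int num_blocks = width / BLOCK;   /* full blocks of 8 pixels */
    int frag       = width % BLOCK;   /* leftover pixels at the end of the row */
    int i, j;

    /* Full blocks: a SIMD version would compute these in vector registers. */
    for (i = 0; i < num_blocks; i++)
        for (j = 0; j < BLOCK; j++)
            dst[i * BLOCK + j] = luma_one(&src[(i * BLOCK + j) * 4], ry, gy, by);

    /* Tail: fewer than BLOCK pixels, handled one at a time. */
    for (i = 0; i < frag; i++)
        dst[num_blocks * BLOCK + i] =
            luma_one(&src[(num_blocks * BLOCK + i) * 4], ry, gy, by);
}
```

Comparing such a blocked routine against a single straightforward loop over all pixels is a simple way to check that the block and tail indexing cover every pixel exactly once.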
Re: [FFmpeg-devel] [PATCH] PPC64: Add IBM POWER8 SIMD Implementation
Could a PPC maintainer verify that this patch integrates cleanly into the design? I would like to proceed with the remaining changes to close out ticket #5570 but since the first patch was rejected, I am unsure on whether I'll have to rewrite the code.

Thanks.
Dan.

On Sun, 2016-06-19 at 21:57 +0000, Dan Parrot wrote:
> First commit addressing Trac ticket #5570. Functions defined in
> libswscale/input.c
> have corresponding SIMD definitions in libswscale/ppc/input_vsx.c
> [...]
Re: [FFmpeg-devel] [PATCH] PPC64: Add IBM POWER8 SIMD Implementation
On Tue, 2016-06-21 at 00:45 +0200, Michael Niedermayer wrote:
> On Sun, Jun 19, 2016 at 09:57:42PM +0000, Dan Parrot wrote:
> > First commit addressing Trac ticket #5570. Functions defined in
> > libswscale/input.c
> > have corresponding SIMD definitions in libswscale/ppc/input_vsx.c
> > [...]
Re: [FFmpeg-devel] [PATCH] PPC64: Add IBM POWER8 SIMD Implementation
On Tue, 2016-06-21 at 01:06 +0200, Michael Niedermayer wrote:
> On Mon, Jun 20, 2016 at 05:55:47PM -0500, Dan Parrot wrote:
> > On Tue, 2016-06-21 at 00:45 +0200, Michael Niedermayer wrote:
> > > On Sun, Jun 19, 2016 at 09:57:42PM +0000, Dan Parrot wrote:
> > > > First commit addressing Trac ticket #5570. Functions defined in
> > > > libswscale/input.c
> > > > have corresponding SIMD definitions in libswscale/ppc/input_vsx.c
> > > > [...]
Re: [FFmpeg-devel] [PATCH] PPC64: Add IBM POWER8 SIMD Implementation
On Tue, 2016-06-21 at 02:22 +0200, Michael Niedermayer wrote: > On Mon, Jun 20, 2016 at 06:38:18PM -0500, Dan Parrot wrote: > > On Tue, 2016-06-21 at 01:06 +0200, Michael Niedermayer wrote: > > > On Mon, Jun 20, 2016 at 05:55:47PM -0500, Dan Parrot wrote: > > > > On Tue, 2016-06-21 at 00:45 +0200, Michael Niedermayer wrote: > > > > > On Sun, Jun 19, 2016 at 09:57:42PM +, Dan Parrot wrote: > > > > > > First commit addressing Trac ticket #5570. Functions defined in > > > > > > libswscale/input.c > > > > > > have corresponding SIMD definitions in libswscale/ppc/input_vsx.c > > > > > > --- > > > > > > libswscale/ppc/Makefile |1 + > > > > > > libswscale/ppc/input_vsx.c| 1070 > > > > > > + > > > > > > libswscale/swscale.c |3 + > > > > > > libswscale/swscale_internal.h |1 + > > > > > > 4 files changed, 1075 insertions(+) > > > > > > create mode 100644 libswscale/ppc/input_vsx.c > > > > > > > > > > > > diff --git a/libswscale/ppc/Makefile b/libswscale/ppc/Makefile > > > > > > index d1b596e..2482893 100644 > > > > > > --- a/libswscale/ppc/Makefile > > > > > > +++ b/libswscale/ppc/Makefile > > > > > > @@ -1,3 +1,4 @@ > > > > > > OBJS += ppc/swscale_altivec.o > > > > > > \ > > > > > > +ppc/input_vsx.o > > > > > > \ > > > > > > ppc/yuv2rgb_altivec.o > > > > > > \ > > > > > > ppc/yuv2yuv_altivec.o > > > > > > \ > > > > > > diff --git a/libswscale/ppc/input_vsx.c b/libswscale/ppc/input_vsx.c > > > > > > new file mode 100644 > > > > > > index 000..adb0e38 > > > > > > --- /dev/null > > > > > > +++ b/libswscale/ppc/input_vsx.c > > > > > > @@ -0,0 +1,1070 @@ > > > > > > +/* > > > > > > + * Copyright (C) 2016 Dan Parrot > > > > > > + * > > > > > > + * This file is part of FFmpeg. 
> > > > > > + * > > > > > > + * FFmpeg is free software; you can redistribute it and/or > > > > > > + * modify it under the terms of the GNU Lesser General Public > > > > > > + * License as published by the Free Software Foundation; either > > > > > > + * version 2.1 of the License, or (at your option) any later > > > > > > version. > > > > > > + * > > > > > > + * FFmpeg is distributed in the hope that it will be useful, > > > > > > + * but WITHOUT ANY WARRANTY; without even the implied warranty of > > > > > > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the > > > > > > GNU > > > > > > + * Lesser General Public License for more details. > > > > > > + * > > > > > > + * You should have received a copy of the GNU Lesser General Public > > > > > > + * License along with FFmpeg; if not, write to the Free Software > > > > > > + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA > > > > > > 02110-1301 USA > > > > > > + */ > > > > > > + > > > > > > +#include > > > > > > +#include > > > > > > +#include > > > > > > +#include > > > > > > + > > > > > > +#include "libavutil/avutil.h" > > > > > > +#include "libavutil/bswap.h" > > > > > > +#include "libavutil/cpu.h" > > > > > > +#include "libavutil/intreadwrite.h" > > > > > > +#include "libavutil/mathematics.h" > > > > > > +#include "libavutil/pixdesc.h" > > > > > > +#include "libavutil/avassert.h" > > > > > > +#include "config.h" > > > > > > +#include "libswscale/rgb2rgb.h" > > > > > > +#include "libswscale/swscale.h" > > > > > > +#include "libswscale/swscale_internal.h" > > > >
Re: [FFmpeg-devel] [PATCH] PPC64: Add IBM POWER8 SIMD Implementation
On Tue, 2016-06-21 at 00:04 -0500, Dan Parrot wrote: > On Tue, 2016-06-21 at 02:22 +0200, Michael Niedermayer wrote: > > On Mon, Jun 20, 2016 at 06:38:18PM -0500, Dan Parrot wrote: > > > On Tue, 2016-06-21 at 01:06 +0200, Michael Niedermayer wrote: > > > > On Mon, Jun 20, 2016 at 05:55:47PM -0500, Dan Parrot wrote: > > > > > On Tue, 2016-06-21 at 00:45 +0200, Michael Niedermayer wrote: > > > > > > On Sun, Jun 19, 2016 at 09:57:42PM +, Dan Parrot wrote: > > > > > > > First commit addressing Trac ticket #5570. Functions defined in > > > > > > > libswscale/input.c > > > > > > > have corresponding SIMD definitions in libswscale/ppc/input_vsx.c > > > > > > > --- > > > > > > > libswscale/ppc/Makefile |1 + > > > > > > > libswscale/ppc/input_vsx.c| 1070 > > > > > > > + > > > > > > > libswscale/swscale.c |3 + > > > > > > > libswscale/swscale_internal.h |1 + > > > > > > > 4 files changed, 1075 insertions(+) > > > > > > > create mode 100644 libswscale/ppc/input_vsx.c > > > > > > > > > > > > > > diff --git a/libswscale/ppc/Makefile b/libswscale/ppc/Makefile > > > > > > > index d1b596e..2482893 100644 > > > > > > > --- a/libswscale/ppc/Makefile > > > > > > > +++ b/libswscale/ppc/Makefile > > > > > > > @@ -1,3 +1,4 @@ > > > > > > > OBJS += ppc/swscale_altivec.o > > > > > > >\ > > > > > > > +ppc/input_vsx.o > > > > > > >\ > > > > > > > ppc/yuv2rgb_altivec.o > > > > > > >\ > > > > > > > ppc/yuv2yuv_altivec.o > > > > > > >\ > > > > > > > diff --git a/libswscale/ppc/input_vsx.c > > > > > > > b/libswscale/ppc/input_vsx.c > > > > > > > new file mode 100644 > > > > > > > index 000..adb0e38 > > > > > > > --- /dev/null > > > > > > > +++ b/libswscale/ppc/input_vsx.c > > > > > > > @@ -0,0 +1,1070 @@ > > > > > > > +/* > > > > > > > + * Copyright (C) 2016 Dan Parrot > > > > > > > + * > > > > > > > + * This file is part of FFmpeg. 
> > > > > > > + * > > > > > > > + * FFmpeg is free software; you can redistribute it and/or > > > > > > > + * modify it under the terms of the GNU Lesser General Public > > > > > > > + * License as published by the Free Software Foundation; either > > > > > > > + * version 2.1 of the License, or (at your option) any later > > > > > > > version. > > > > > > > + * > > > > > > > + * FFmpeg is distributed in the hope that it will be useful, > > > > > > > + * but WITHOUT ANY WARRANTY; without even the implied warranty of > > > > > > > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the > > > > > > > GNU > > > > > > > + * Lesser General Public License for more details. > > > > > > > + * > > > > > > > + * You should have received a copy of the GNU Lesser General > > > > > > > Public > > > > > > > + * License along with FFmpeg; if not, write to the Free Software > > > > > > > + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA > > > > > > > 02110-1301 USA > > > > > > > + */ > > > > > > > + > > > > > > > +#include > > > > > > > +#include > > > > > > > +#include > > > > > > > +#include > > > > > > > + > > > > > > > +#include "libavutil/avutil.h" > > > > > > > +#include "libavutil/bswap.h" > > > > > > > +#include "libavutil/cpu.h" > > > > > > > +#include "libavutil/intreadwrite.h" > > > > > >
[FFmpeg-devel] PPC64: PowerPC Maintainer information is incorrect
The MAINTAINERS file lists Luca Barbato for Linux/PowerPC. You can see from
his response below how he feels about that.

Forwarded Message
> From: Luca Barbato
> To: Dan Parrot
> Subject: Re: [FFmpeg-devel] [PATCH] PPC64: Add IBM POWER8 SIMD
> Implementation
> Date: Tue, 21 Jun 2016 00:56:20 +0200
>
> On 21/06/16 00:42, Dan Parrot wrote:
> > You are listed as the responsible party in ffmpeg MAINTAINERS. Could you
> > please indicate whether or not the patch is acceptable?
>
> I have no idea why FFmpeg people keep that, I contribute to Libav[1].
>
> You are more than welcome to send the patches and to contribute to it if
> you like =)
>
> lu
>
> [1]: http://libav.org
Re: [FFmpeg-devel] [PATCH] PPC64: Add IBM POWER8 SIMD Implementation
Could I get a yes or no answer on whether the patch will be applied?
Re: [FFmpeg-devel] [PATCH] PPC64: Add IBM POWER8 SIMD Implementation
On Wed, 2016-06-22 at 21:02 +, Carl Eugen Hoyos wrote:
> Dan Parrot mail.com> writes:
>
> > Could I get a yes or no answer on whether the patch will be applied?
>
> Please comment on my email: time make fate can be used to show
> large performance changes (although it isn't optimal), I don't
> think it can show the difference between a division and a shift.

I was not trying to show the difference in relative cost between a division
and a shift using "time make fate". What I was trying to show is that ffmpeg
compiled with code that uses the division versus ffmpeg compiled with code
that uses shifts resulted in running times for fate that were essentially
equal.

Now, if you want me to demonstrate the relative speeds of the division and
shift, I can do that. I admit to not seeing the point in the exercise, but if
that is what is holding up patch acceptance, I will do it.

And yes, the "time" utility is imperfect, but so is any technique which does
not turn off all hardware interrupts in order to guarantee that the code being
timed runs to completion without ever being paused.

The goal of Trac ticket #5570 as best I understand it is to use SIMD hardware
on PPC64 to reduce execution time of the functions in libswscale/input.c.
Right? So, the comparison we should be talking about is whether ffmpeg
currently in the repository is faster/slower than ffmpeg compiled using the
patch. Timing the fate suite when run with the two different versions of
ffmpeg should be an acceptable indicator of which is faster.

If I am mistaken in any of the above, I am always willing to learn.
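For readers who want to see what an isolated division-versus-shift
measurement could look like, here is a rough sketch. It is not the FFmpeg
TIMER macros and not code from the patch: the names sum_div, sum_shift and
time_calls are hypothetical, the divisor 8 is an assumed example, and clock()
is used only because it is standard C. Note that for unsigned operands an
optimizing compiler commonly emits the same shift for both forms, which would
be one plausible reason for whole-suite times coming out essentially equal.

```c
#include <stdint.h>
#include <time.h>

/* Sum of v[i] / 8 (assumed example divisor). Returning the sum gives the
 * caller a value to use, so the loop is not trivially dead code. */
static uint32_t sum_div(const uint32_t *v, int n)
{
    uint32_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += v[i] / 8;
    return acc;
}

/* The same computation written as a right shift by 3. */
static uint32_t sum_shift(const uint32_t *v, int n)
{
    uint32_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += v[i] >> 3;
    return acc;
}

/* Hypothetical helper: CPU seconds spent in `reps` calls of fn(v, n).
 * The accumulated result is stored through *out so the calls stay live. */
static double time_calls(uint32_t (*fn)(const uint32_t *, int),
                         const uint32_t *v, int n, int reps, uint32_t *out)
{
    clock_t t0 = clock();
    uint32_t r = 0;
    for (int k = 0; k < reps; k++)
        r += fn(v, n);
    *out = r;
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}
```

Calling time_calls(sum_div, ...) and time_calls(sum_shift, ...) on the same
buffer with a large reps value gives two numbers to compare. clock() has
coarse resolution, so the repeat count has to be large enough for the
difference, if any, to rise above the noise.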
Re: [FFmpeg-devel] [PATCH] PPC64: Add IBM POWER8 SIMD Implementation
On Wed, 2016-06-22 at 22:36 +, Carl Eugen Hoyos wrote:
> Dan Parrot mail.com> writes:
>
> [...]
>
> Did you already test the TIMER macros?

No, I did not test with the TIMER macros. I don't see what that has to do with
Trac ticket #5570.

> I don't know if they work on ppc64le but we won't know
> until you tested them.
>
> Thank you, Carl Eugen
Re: [FFmpeg-devel] [PATCH] PPC64: Add IBM POWER8 SIMD Implementation
On Thu, 2016-06-23 at 01:03 +0200, Michael Niedermayer wrote: > On Tue, Jun 21, 2016 at 12:04:42AM -0500, Dan Parrot wrote: > > On Tue, 2016-06-21 at 02:22 +0200, Michael Niedermayer wrote: > > > On Mon, Jun 20, 2016 at 06:38:18PM -0500, Dan Parrot wrote: > > > > On Tue, 2016-06-21 at 01:06 +0200, Michael Niedermayer wrote: > > > > > On Mon, Jun 20, 2016 at 05:55:47PM -0500, Dan Parrot wrote: > > > > > > On Tue, 2016-06-21 at 00:45 +0200, Michael Niedermayer wrote: > > > > > > > On Sun, Jun 19, 2016 at 09:57:42PM +, Dan Parrot wrote: > > > > > > > > First commit addressing Trac ticket #5570. Functions defined in > > > > > > > > libswscale/input.c > > > > > > > > have corresponding SIMD definitions in > > > > > > > > libswscale/ppc/input_vsx.c > > > > > > > > --- > > > > > > > > libswscale/ppc/Makefile |1 + > > > > > > > > libswscale/ppc/input_vsx.c| 1070 > > > > > > > > + > > > > > > > > libswscale/swscale.c |3 + > > > > > > > > libswscale/swscale_internal.h |1 + > > > > > > > > 4 files changed, 1075 insertions(+) > > > > > > > > create mode 100644 libswscale/ppc/input_vsx.c > > > > > > > > > > > > > > > > diff --git a/libswscale/ppc/Makefile b/libswscale/ppc/Makefile > > > > > > > > index d1b596e..2482893 100644 > > > > > > > > --- a/libswscale/ppc/Makefile > > > > > > > > +++ b/libswscale/ppc/Makefile > > > > > > > > @@ -1,3 +1,4 @@ > > > > > > > > OBJS += ppc/swscale_altivec.o > > > > > > > > \ > > > > > > > > +ppc/input_vsx.o > > > > > > > > \ > > > > > > > > ppc/yuv2rgb_altivec.o > > > > > > > > \ > > > > > > > > ppc/yuv2yuv_altivec.o > > > > > > > > \ > > > > > > > > diff --git a/libswscale/ppc/input_vsx.c > > > > > > > > b/libswscale/ppc/input_vsx.c > > > > > > > > new file mode 100644 > > > > > > > > index 000..adb0e38 > > > > > > > > --- /dev/null > > > > > > > > +++ b/libswscale/ppc/input_vsx.c > > > > > > > > @@ -0,0 +1,1070 @@ > > > > > > > > +/* > > > > > > > > + * Copyright (C) 2016 Dan Parrot > > > > > > > > + * > > > > > > > > + * This file is part 
of FFmpeg. > > > > > > > > + * > > > > > > > > + * FFmpeg is free software; you can redistribute it and/or > > > > > > > > + * modify it under the terms of the GNU Lesser General Public > > > > > > > > + * License as published by the Free Software Foundation; either > > > > > > > > + * version 2.1 of the License, or (at your option) any later > > > > > > > > version. > > > > > > > > + * > > > > > > > > + * FFmpeg is distributed in the hope that it will be useful, > > > > > > > > + * but WITHOUT ANY WARRANTY; without even the implied warranty > > > > > > > > of > > > > > > > > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See > > > > > > > > the GNU > > > > > > > > + * Lesser General Public License for more details. > > > > > > > > + * > > > > > > > > + * You should have received a copy of the GNU Lesser General > > > > > > > > Public > > > > > > > > + * License along with FFmpeg; if not, write to the Free > > > > > > > > Software > > > > > > > > + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, > > > > > > > > MA 02110-1301 USA > > > > > > > > + */ > > > > > > > > + > > > > > > > > +#include > > > > >
Re: [FFmpeg-devel] [PATCH] PPC64: Add IBM POWER8 SIMD Implementation
On Wed, 2016-06-22 at 20:33 -0300, James Almer wrote: > On 6/22/2016 8:15 PM, Dan Parrot wrote: > > On Thu, 2016-06-23 at 01:03 +0200, Michael Niedermayer wrote: > >> On Tue, Jun 21, 2016 at 12:04:42AM -0500, Dan Parrot wrote: > >>> On Tue, 2016-06-21 at 02:22 +0200, Michael Niedermayer wrote: > >>>> On Mon, Jun 20, 2016 at 06:38:18PM -0500, Dan Parrot wrote: > >>>>> On Tue, 2016-06-21 at 01:06 +0200, Michael Niedermayer wrote: > >>>>>> On Mon, Jun 20, 2016 at 05:55:47PM -0500, Dan Parrot wrote: > >>>>>>> On Tue, 2016-06-21 at 00:45 +0200, Michael Niedermayer wrote: > >>>>>>>> On Sun, Jun 19, 2016 at 09:57:42PM +, Dan Parrot wrote: > >>>>>>>>> First commit addressing Trac ticket #5570. Functions defined in > >>>>>>>>> libswscale/input.c > >>>>>>>>> have corresponding SIMD definitions in libswscale/ppc/input_vsx.c > >>>>>>>>> --- > >>>>>>>>> libswscale/ppc/Makefile |1 + > >>>>>>>>> libswscale/ppc/input_vsx.c| 1070 > >>>>>>>>> + > >>>>>>>>> libswscale/swscale.c |3 + > >>>>>>>>> libswscale/swscale_internal.h |1 + > >>>>>>>>> 4 files changed, 1075 insertions(+) > >>>>>>>>> create mode 100644 libswscale/ppc/input_vsx.c > >>>>>>>>> > >>>>>>>>> diff --git a/libswscale/ppc/Makefile b/libswscale/ppc/Makefile > >>>>>>>>> index d1b596e..2482893 100644 > >>>>>>>>> --- a/libswscale/ppc/Makefile > >>>>>>>>> +++ b/libswscale/ppc/Makefile > >>>>>>>>> @@ -1,3 +1,4 @@ > >>>>>>>>> OBJS += ppc/swscale_altivec.o > >>>>>>>>> \ > >>>>>>>>> +ppc/input_vsx.o > >>>>>>>>> \ > >>>>>>>>> ppc/yuv2rgb_altivec.o > >>>>>>>>> \ > >>>>>>>>> ppc/yuv2yuv_altivec.o > >>>>>>>>> \ > >>>>>>>>> diff --git a/libswscale/ppc/input_vsx.c b/libswscale/ppc/input_vsx.c > >>>>>>>>> new file mode 100644 > >>>>>>>>> index 000..adb0e38 > >>>>>>>>> --- /dev/null > >>>>>>>>> +++ b/libswscale/ppc/input_vsx.c > >>>>>>>>> @@ -0,0 +1,1070 @@ > >>>>>>>>> +/* > >>>>>>>>> + * Copyright (C) 2016 Dan Parrot > >>>>>>>>> + * > >>>>>>>>> + * This file is part of FFmpeg. 
> >>>>>>>>> + * > >>>>>>>>> + * FFmpeg is free software; you can redistribute it and/or > >>>>>>>>> + * modify it under the terms of the GNU Lesser General Public > >>>>>>>>> + * License as published by the Free Software Foundation; either > >>>>>>>>> + * version 2.1 of the License, or (at your option) any later > >>>>>>>>> version. > >>>>>>>>> + * > >>>>>>>>> + * FFmpeg is distributed in the hope that it will be useful, > >>>>>>>>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of > >>>>>>>>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the > >>>>>>>>> GNU > >>>>>>>>> + * Lesser General Public License for more details. > >>>>>>>>> + * > >>>>>>>>> + * You should have received a copy of the GNU Lesser General Public > >>>>>>>>> + * License along with FFmpeg; if not, write to the Free Software > >>>>>>>>> + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA > >>>>>>>>> 02110-1301 USA > >>>>>>>>> + */ > >>>>>>>>> + > >>>>>>>
[FFmpeg-devel] [PATCH] PPC64: Add versions of functions in libswscale/input.c optimized for POWER8 VSX SIMD.
This patch addresses Trac ticket #5570. The optimized functions are in file
libswscale/ppc/input_vsx.c. Each optimized function name is a concatenation of
the corresponding name in libswscale/input.c with suffix _vsx.
---
 libswscale/ppc/Makefile       |   1 +
 libswscale/ppc/input_vsx.c    | 437 ++
 libswscale/swscale.c          |   3 +
 libswscale/swscale_internal.h |   1 +
 4 files changed, 442 insertions(+)
 create mode 100644 libswscale/ppc/input_vsx.c

diff --git a/libswscale/ppc/Makefile b/libswscale/ppc/Makefile
index d1b596e..2482893 100644
--- a/libswscale/ppc/Makefile
+++ b/libswscale/ppc/Makefile
@@ -1,3 +1,4 @@
 OBJS += ppc/swscale_altivec.o                                           \
+        ppc/input_vsx.o                                                 \
         ppc/yuv2rgb_altivec.o                                           \
         ppc/yuv2yuv_altivec.o                                           \
diff --git a/libswscale/ppc/input_vsx.c b/libswscale/ppc/input_vsx.c
new file mode 100644
index 000..d977a32
--- /dev/null
+++ b/libswscale/ppc/input_vsx.c
@@ -0,0 +1,437 @@
+/*
+ * Copyright (C) 2016 Dan Parrot
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include
+#include
+#include
+#include
+
+#include "libavutil/avutil.h"
+#include "libavutil/bswap.h"
+#include "libavutil/cpu.h"
+#include "libavutil/intreadwrite.h"
+#include "libavutil/mathematics.h"
+#include "libavutil/pixdesc.h"
+#include "libavutil/avassert.h"
+#include "config.h"
+#include "libswscale/rgb2rgb.h"
+#include "libswscale/swscale.h"
+#include "libswscale/swscale_internal.h"
+
+#if HAVE_VSX
+
+static void abgrToA_c_vsx(uint8_t *_dst, const uint8_t *src, const uint8_t *unused1, const uint8_t *unused2,
+                          int width, uint32_t *unused)
+{
+    int16_t *dst = (int16_t *)_dst;
+    int i, width_adj, frag_len;
+
+    uintptr_t src_addr = (uintptr_t)src;
+    uintptr_t dst_addr = (uintptr_t)dst;
+
+    // compute integral number of vector-length items and length of final fragment
+    width_adj = width >> 3;
+    width_adj = width_adj << 3;
+    frag_len = width - width_adj;
+
+    for ( i = 0; i < width_adj; i += 8) {
+        vector int v_rd0 = vec_vsx_ld(0, (int *)src_addr);
+        vector int v_rd1 = vec_vsx_ld(0, (int *)(src_addr + 16));
+
+        v_rd0 = vec_and(v_rd0, vec_splats(0x0ff));
+        v_rd1 = vec_and(v_rd1, vec_splats(0x0ff));
+
+        v_rd0 = vec_sl(v_rd0, vec_splats((unsigned)6));
+        v_rd1 = vec_sl(v_rd1, vec_splats((unsigned)6));
+
+        vector int v_dst = vec_perm(v_rd0, v_rd1, ((vector unsigned char)
+                                    {0, 1, 4, 5, 8, 9, 12, 13, 16, 17, 20, 21, 24, 25, 28, 29}));
+        vec_vsx_st((vector unsigned char)v_dst, 0, (unsigned char *)dst_addr);
+
+        src_addr += 32;
+        dst_addr += 16;
+    }
+
+    for (i=width_adj; i< width_adj + frag_len; i++) {
+        dst[i]= src[4*i]<<6;
+    }
+}
+
+static void rgbaToA_c_vsx(uint8_t *_dst, const uint8_t *src, const uint8_t *unused1, const uint8_t *unused2,
+                          int width, uint32_t *unused)
+{
+    int16_t *dst = (int16_t *)_dst;
+    int i, width_adj, frag_len;
+
+    uintptr_t src_addr = (uintptr_t)src;
+    uintptr_t dst_addr = (uintptr_t)dst;
+
+    // compute integral number of vector-length items and length of final fragment
+    width_adj = width >> 3;
+    width_adj = width_adj << 3;
+    frag_len = width - width_adj;
+
+    for ( i = 0; i < width_adj; i += 8) {
+        vector int v_rd0 = vec_vsx_ld(0, (int *)src_addr);
+        vector int v_rd1 = vec_vsx_ld(0, (int *)(src_addr + 16));
+
+        v_rd0 = vec_sld(v_rd0, v_rd0, 13);
+        v_rd1 = vec_sld(v_rd1, v_rd1, 13);
+
+        v_rd0 = vec_and(v_rd0, vec_splats(0x0ff));
+        v_rd1 = vec_and(v_rd1, vec_splats(0x0ff));
+
+        v_rd0 = vec_sl(v_rd0, vec_splats((unsigned)6));
+        v_rd1 = vec_sl(v_rd1, vec_splats((unsigned)6));
+
+        vector int v_dst = vec_perm(v_rd0, v_rd1, ((vector unsigned char)
+
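The loop structure used by abgrToA_c_vsx in the patch above (an 8-pixel main
loop followed by a scalar loop over the leftover fragment) can be sketched in
portable C. This is only an illustration under stated assumptions: the name
alpha_blocks_of_8 is hypothetical, the inner k loop merely stands in for one
vector iteration, and the choice of byte 0 of each 4-byte pixel follows the
scalar tail loop in the patch (dst[i] = src[4*i] << 6).

```c
#include <stdint.h>

/* Portable sketch of the width split: the first width_adj pixels are handled
 * in groups of 8, the remaining frag_len pixels one at a time.
 * src holds 4 bytes per pixel; byte 0 of each pixel is scaled by << 6. */
static void alpha_blocks_of_8(int16_t *dst, const uint8_t *src, int width)
{
    int i, k;
    int width_adj = (width >> 3) << 3;  /* largest multiple of 8 <= width */
    int frag_len  = width - width_adj;  /* 0..7 leftover pixels */

    for (i = 0; i < width_adj; i += 8)
        for (k = 0; k < 8; k++)         /* stands in for one vector iteration */
            dst[i + k] = (int16_t)(src[4 * (i + k)] << 6);

    for (i = width_adj; i < width_adj + frag_len; i++)
        dst[i] = (int16_t)(src[4 * i] << 6);
}
```

With width = 11, for example, width_adj is 8 and frag_len is 3, so one group
of 8 is processed in the main loop and the last 3 pixels in the tail loop.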
Re: [FFmpeg-devel] [PATCH] PPC64: Add versions of functions in libswscale/input.c optimized for POWER8 VSX SIMD.
Here are execution times of SIMD and non-SIMD functions. The times were
obtained using SystemTap probes at functions' entry and return points. The
dataset used was fate-filter-pixfmts-scale. SIMD versions have suffix _vsx:

function         calls  min (ns)  avg (ns)  max (ns)  total (ns)
yuy2ToY_c_vsx      864      1880      2014     29844     1740366
yuy2ToY_c          864      2326      2451     15950     2118226
yvy2ToUV_c_vsx     288      1891      1989     13644      573038
yvy2ToUV_c         288      2089      2131      2462      613813
rgbaToA_c_vsx     1152      1975      2123     31356     2446276
rgbaToA_c         1152      2368      2448     12496     2820401
uyvyToUV_c_vsx     288      1901      1932      2122      556697
uyvyToUV_c         288      2088      2129      2370      613202
uyvyToY_c_vsx      576      1877      1956     15821     1127222
uyvyToY_c          576      2325      2408     15332     1387168
nv12ToUV_c_vsx     144      1869      2006     15480      288867
nv12ToUV_c         144      2101      2273     19774      327432
abgrToA_c_vsx     1152      1949      2060     15496     2373206
abgrToA_c         1152      2374      2471     52452     2847044
yuy2ToUV_c_vsx     288      1873      1972     16608      568154
yuy2ToUV_c         288      2087      2123      2252      611621
nv21ToUV_c_vsx     144      1879      2019     14290      290860
nv21ToUV_c         144      2098      2233     14750      321692
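To make the SystemTap figures above easier to compare, the per-function
averages can be turned into a ratio or a percentage. The sketch below is only
arithmetic on numbers already listed; the function names speedup and
percent_saved are hypothetical, and the average is used rather than the total
because both versions were called the same number of times.

```c
/* Ratio of scalar to SIMD average time; values above 1.0 favor SIMD. */
static double speedup(double scalar_avg_ns, double simd_avg_ns)
{
    return scalar_avg_ns / simd_avg_ns;
}

/* Percentage of the scalar average time saved by the SIMD version. */
static double percent_saved(double scalar_avg_ns, double simd_avg_ns)
{
    return 100.0 * (scalar_avg_ns - simd_avg_ns) / scalar_avg_ns;
}
```

For yuy2ToY, speedup(2451, 2014) comes to about 1.22 and
percent_saved(2451, 2014) to about 17.8. Across the functions listed, the
ratio of averages ranges from roughly 1.07 (yvy2ToUV) to 1.23 (uyvyToY). The
max column is dominated by occasional outliers, so it is probably not a good
basis for comparison.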
[FFmpeg-devel] [PATCH] PPC64: Add versions of functions in libswscale/input.c optimized for POWER8 VSX SIMD.
Finish providing SIMD versions for POWER8 VSX of functions in libswscale/input.c
That should allow trac ticket #5570 to be closed.
---
 libswscale/ppc/input_vsx.c | 1018 +++-
 1 file changed, 1014 insertions(+), 4 deletions(-)

diff --git a/libswscale/ppc/input_vsx.c b/libswscale/ppc/input_vsx.c
index d977a32..2c6f0ce 100644
--- a/libswscale/ppc/input_vsx.c
+++ b/libswscale/ppc/input_vsx.c
@@ -54,6 +54,7 @@ static void abgrToA_c_vsx(uint8_t *_dst, const uint8_t *src, const uint8_t *unus
     for ( i = 0; i < width_adj; i += 8) {
         vector int v_rd0 = vec_vsx_ld(0, (int *)src_addr);
         vector int v_rd1 = vec_vsx_ld(0, (int *)(src_addr + 16));
+        vector int v_dst;
 
         v_rd0 = vec_and(v_rd0, vec_splats(0x0ff));
         v_rd1 = vec_and(v_rd1, vec_splats(0x0ff));
@@ -61,8 +62,8 @@ static void abgrToA_c_vsx(uint8_t *_dst, const uint8_t *src, const uint8_t *unus
         v_rd0 = vec_sl(v_rd0, vec_splats((unsigned)6));
         v_rd1 = vec_sl(v_rd1, vec_splats((unsigned)6));
 
-        vector int v_dst = vec_perm(v_rd0, v_rd1, ((vector unsigned char)
-                                    {0, 1, 4, 5, 8, 9, 12, 13, 16, 17, 20, 21, 24, 25, 28, 29}));
+        v_dst = vec_perm(v_rd0, v_rd1, ((vector unsigned char)
+                         {0, 1, 4, 5, 8, 9, 12, 13, 16, 17, 20, 21, 24, 25, 28, 29}));
         vec_vsx_st((vector unsigned char)v_dst, 0, (unsigned char *)dst_addr);
 
         src_addr += 32;
@@ -91,6 +92,7 @@ static void rgbaToA_c_vsx(uint8_t *_dst, const uint8_t *src, const uint8_t *unus
     for ( i = 0; i < width_adj; i += 8) {
         vector int v_rd0 = vec_vsx_ld(0, (int *)src_addr);
         vector int v_rd1 = vec_vsx_ld(0, (int *)(src_addr + 16));
+        vector int v_dst;
 
         v_rd0 = vec_sld(v_rd0, v_rd0, 13);
         v_rd1 = vec_sld(v_rd1, v_rd1, 13);
@@ -101,8 +103,8 @@ static void rgbaToA_c_vsx(uint8_t *_dst, const uint8_t *src, const uint8_t *unus
         v_rd0 = vec_sl(v_rd0, vec_splats((unsigned)6));
         v_rd1 = vec_sl(v_rd1, vec_splats((unsigned)6));
 
-        vector int v_dst = vec_perm(v_rd0, v_rd1, ((vector unsigned char)
-                                    {0, 1, 4, 5, 8, 9, 12, 13, 16, 17, 20, 21, 24, 25, 28, 29}));
+        v_dst = vec_perm(v_rd0, v_rd1, ((vector unsigned char)
+                         {0, 1, 4, 5, 8, 9, 12, 13, 16, 17, 20, 21, 24, 25, 28, 29}));
         vec_vsx_st((vector unsigned char)v_dst, 0, (unsigned char *)dst_addr);
 
         src_addr += 32;
@@ -114,6 +116,175 @@ static void rgbaToA_c_vsx(uint8_t *_dst, const uint8_t *src, const uint8_t *unus
     }
 }
 
+static void monoblack2Y_c_vsx(uint8_t *_dst, const uint8_t *src, const uint8_t *unused1, const uint8_t *unused2,
+                              int width, uint32_t *unused)
+{
+    int16_t *dst = (int16_t *)_dst;
+    int i, j, width_adj, frag_len;
+
+    vector unsigned char v_rd;
+    vector signed short v_din, v_d, v_dst;
+    vector unsigned short v_opr;
+
+    uintptr_t src_addr = (uintptr_t)src;
+    uintptr_t dst_addr = (uintptr_t)dst;
+
+    width = (width + 7) >> 3;
+
+    // compute integral number of vector-length items and length of final fragment
+    width_adj = width >> 3;
+    width_adj = width_adj << 3;
+    frag_len = width - width_adj;
+
+    v_opr = (vector unsigned short) {7, 6, 5, 4, 3, 2, 1, 0};
+
+    for (i = 0; i < width_adj; i += 8) {
+        if (i & 0x0f) {
+            v_rd = vec_sld(v_rd, v_rd, 8);
+        } else {
+            v_rd = vec_vsx_ld(0, (unsigned char *)src_addr);
+            src_addr += 16;
+        }
+
+        v_din = vec_unpackh((vector signed char)v_rd);
+        v_din = vec_and(v_din, vec_splats((short)0x00ff));
+
+        for (j = 0; j < 8; j++) {
+            switch(j) {
+            case 0:
+                v_d = vec_splat(v_din, 0);
+                break;
+            case 1:
+                v_d = vec_splat(v_din, 1);
+                break;
+            case 2:
+                v_d = vec_splat(v_din, 2);
+                break;
+            case 3:
+                v_d = vec_splat(v_din, 3);
+                break;
+            case 4:
+                v_d = vec_splat(v_din, 4);
+                break;
+            case 5:
+                v_d = vec_splat(v_din, 5);
+                break;
+            case 6:
+                v_d = vec_splat(v_din, 6);
+                break;
+            case 7:
+                v_d = vec_splat(v_din, 7);
+                break;
+            }
+
+            v_dst = vec_sr(v_d, v_opr);
+            v_dst = vec_and(v_dst, vec_splats((short)1));
+            v_dst = v_dst * vec_splats((short)16383);
+
+            vec_vsx_st(v_dst, 0, (short *)dst_addr);
+            dst_addr += 16;
+        }
+    }
+
+    for (i = width_adj; i < width_adj + frag_len; i++) {
+        int d = src[i];
+        for (j = 0; j < 8; j++)
+            dst[8*i+j]= ((d>>(7-j))&1
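The scalar tail loop of monoblack2Y_c_vsx is cut off in this archive. As a
self-contained illustration of the per-bit expansion it performs, here is a
minimal sketch. The function name mono_bits_to_y is hypothetical, and the
scale factor 16383 is an assumption taken from the vector path of the same
function (v_dst * vec_splats((short)16383)), not from the truncated line.

```c
#include <stdint.h>

/* Expand each source byte into eight 16-bit samples, most significant bit
 * first: a set bit becomes 16383, a clear bit becomes 0. */
static void mono_bits_to_y(int16_t *dst, const uint8_t *src, int nbytes)
{
    for (int i = 0; i < nbytes; i++) {
        int d = src[i];
        for (int j = 0; j < 8; j++)
            dst[8 * i + j] = (int16_t)(((d >> (7 - j)) & 1) * 16383);
    }
}
```

The vector path in the patch follows the same pattern: each byte is splatted
across eight lanes, shifted right by {7, 6, 5, 4, 3, 2, 1, 0}, masked with 1,
and multiplied by 16383.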
Re: [FFmpeg-devel] [PATCH] PPC64: Add versions of functions in libswscale/input.c optimized for POWER8 VSX SIMD.
On Mon, 2016-07-04 at 06:22 +, Carl Eugen Hoyos wrote:
> Dan Parrot mail.com> writes:
>
> > Finish providing SIMD versions for POWER8 VSX of functions
> > in libswscale/input.c
> > That should allow trac ticket #5570 to be closed.
>
> Please add some numbers:
> Either for single functions or for a single ffmpeg command.
> (for rgb/bgr, mono is irrelevant)
>
> Carl Eugen

The data below show the running times, for each of the functions, obtained
using SystemTap. The dataset used was the entire FATE regression suite. Only
the first calls are used in obtaining the data (for functions called more
often). SIMD functions have suffix "_vsx". The unit of time used is
nanosecond.

---
name: abgrToA_c_vsx.
no. of calls: 1408. min: 2772 ns. avg: 3106 ns. max: 44282 ns. total: 4373993 ns.
name: abgrToA_c.
no. of calls: 1408. min: 3088 ns. avg: 3385 ns. max: 24698 ns. total: 4766911 ns.
---
name: bgr24ToUV_c_vsx.
no. of calls: 288. min: 5213 ns. avg: 5452 ns. max: 26635 ns. total: 1570338 ns.
name: bgr24ToUV_c.
no. of calls: 288. min: 5351 ns. avg: 5636 ns. max: 27284 ns. total: 1623277 ns.
---
name: bgr24ToUV_half_c_vsx.
no. of calls: . min: 4792 ns. avg: 4941 ns. max: 34340 ns. total: 49411622 ns.
name: bgr24ToUV_half_c.
no. of calls: . min: 4795 ns. avg: 6012 ns. max: 66135 ns. total: 60122454 ns.
---
name: bgr24ToY_c_vsx.
no. of calls: . min: 4475 ns. avg: 4654 ns. max: 28739 ns. total: 46539077 ns.
name: bgr24ToY_c.
no. of calls: . min: 4551 ns. avg: 5974 ns. max: 218357 ns. total: 59741865 ns.
---
name: monoblack2Y_c_vsx.
no. of calls: 288. min: 2902 ns. avg: 3102 ns. max: 25454 ns. total: 893490 ns.
name: monoblack2Y_c.
no. of calls: 288. min: 3011 ns. avg: 3203 ns. max: 26008 ns. total: 922515 ns.
---
name: monowhite2Y_c_vsx.
no. of calls: . min: 2813 ns. avg: 3025 ns. max: 81510 ns. total: 30248113 ns.
name: monowhite2Y_c.
no. of calls: . min: 2692 ns. avg: 2891 ns. max: 43653 ns. total: 28911676 ns.
---
name: nv12ToUV_c_vsx.
no. of calls: 144. min: 2709 ns. avg: 2960 ns. max: 26249 ns. total: 426364 ns.
name: nv12ToUV_c.
no. of calls: 144. min: 2930 ns. avg: 3169 ns. max: 24483 ns. total: 456353 ns.
---
name: nv21ToUV_c_vsx.
no. of calls: 144. min: 2707 ns. avg: 3001 ns. max: 26050 ns. total: 432150 ns.
name: nv21ToUV_c.
no. of calls: 144. min: 2887 ns. avg: 3141 ns. max: 24704 ns. total: 452426 ns.
---
name: planar_rgb_to_a_vsx.
no. of calls: 288. min: 2977 ns. avg: 3223 ns. max: 24993 ns. total: 928305 ns.
name: planar_rgb_to_a.
no. of calls: 288. min: 3306 ns. avg: 3538 ns. max: 24350 ns. total: 1019154 ns.
---
name: planar_rgb_to_uv_vsx.
no. of calls: 576. min: 5092 ns. avg: 5295 ns. max: 27170 ns. total: 3050431 ns.
name: planar_rgb_to_uv.
no. of calls: 576. min: 5605 ns. avg: 5864 ns. max: 26177 ns. total: 3377983 ns.
---
name: planar_rgb_to_y_vsx.
no. of calls: 576. min: 4459 ns. avg: 4666 ns. max: 27760 ns. total: 2688039 ns.
name: planar_rgb_to_y.
no. of calls: 576. min: 4877 ns. avg: 5149 ns. max: 27879 ns. total: 2965982 ns.
---
name: rgb24ToUV_c_vsx.
no. of calls: 688. min: 4090 ns. avg: 4791 ns. max: 25602 ns. total: 3296223 ns.
name: rgb24ToUV_c.
no. of calls: 688. min: 4077 ns. avg: 4891 ns. max: 26629 ns. total: 3365385 ns.
---
name: rgb24ToUV_half_c_vsx.
no. of calls: . min: 4062 ns. avg: 5074 ns.
Re: [FFmpeg-devel] [PATCH] PPC64: Add versions of functions in libswscale/input.c optimized for POWER8 VSX SIMD.
On Mon, 2016-07-04 at 09:20 +, Carl Eugen Hoyos wrote:
> Dan Parrot writes:
>
> > The dataset used was the entire FATE regression suite.
>
> I don't think this is a particularly useful testcase:
> It takes very long but mostly tests other things.
>
> Did you test if using ffmpeg -benchmark -f rawvideo -i /dev/zero...
> showed different results?
> I believe this should be both easier and faster to test.

Sorry, I don't understand what that command line just above is trying to
achieve. Could you elaborate?

> > name: rgb24ToY_c_vsx.
> > no. of calls: . min: 3832 ns. avg: 4709 ns. max: 37550 ns.
> > total: 47093533 ns.
> >
> > name: rgb24ToY_c.
> > no. of calls: . min: 3809 ns. avg: 4707 ns. max: 29041 ns.
> > total: 47072923 ns.
>
> Without any data, I would have thought that this is the most
> important function (and "no. of calls" seems to confirm this).
>
> Why is this not faster?

Surprisingly, gcc is producing some badly suboptimal assembly. I need to
follow up with IBM's Linux Technology Center. The major issue is that
multiplication of vector quantities in C is generating as many
multiplications in assembly as would scalar multiplication in a loop. No
way that should be occurring.

> Can you confirm with START_TIMER / STOP_TIMER that there is no
> gain?

SystemTap probes provide identical functionality by measuring deltas
between function entry and function return.
Re: [FFmpeg-devel] [PATCH] PPC64: Add versions of functions in libswscale/input.c optimized for POWER8 VSX SIMD.
On Mon, 2016-07-04 at 16:30 +, Carl Eugen Hoyos wrote:
> Dan Parrot writes:
>
> > > Did you test if using ffmpeg -benchmark -f rawvideo -i /dev/zero...
> > > showed different results?
> > > I believe this should be both easier and faster to test.
> >
> > Sorry, I don't understand what that command line just above
> > is trying to achieve. Could you elaborate?
>
> Instead of running the whole fate suite that takes long and
> does not test libswscale for most commands, just test an
> ffmpeg command line that only tests libswscale:
> $ ffmpeg -benchmark -f rawvideo -pix_fmt rgb24
> -i /dev/zero -pix_fmt yuv420p -f null -vframes 1 -
> vs
>
> $ ffmpeg -cpuflags 0 -benchmark -f rawvideo -pix_fmt rgb24
> -i /dev/zero -pix_fmt yuv420p -f null -vframes 1 -
>

Ok. Thanks for the explanation. I will run those commands and post the
reported results.

> [...]
>
> > Surprisingly, gcc is producing some badly suboptimal assembly.
>
> Just to make sure I don't misunderstand:
> Does this mean intrinsics are suboptimal to write assembly
> code?

Here's what I mean: All variables below are of type "vector int"

1. v0 = v2 * v3
2. v0 = v4 * v5 + v6 * v7 + v8 * v9

The first statement produces 1 multiply, 1 multiply-sum and 1 addition
instruction in assembly.

The second produces 6 multiply, 6 multiply-sum, and 10 addition
instructions in assembly! I expected 3, 3, 3 of each respective
operations from (1) plus 2 additions.

>
> > > Can you confirm with START_TIMER / STOP_TIMER that there is no
> > > gain?
> >
> > SystemTap probes provide identical functionality by measuring
> > deltas between function entry and function return.
>
> Sorry, I don't understand:
> Did you test with both methods to verify that they provide
> the same results?
>
> Note that if it turns out that START_TIMER / STOP_TIMER
> cannot be used on ppc64 (le) this would be important
> information for us.
>

I'll insert these macros and inform of the results if the code compiles
and runs.
Re: [FFmpeg-devel] [PATCH] PPC64: Add versions of functions in libswscale/input.c optimized for POWER8 VSX SIMD.
> > Just to make sure I don't misunderstand:
> > Does this mean intrinsics are suboptimal to write assembly
> > code?
> Here's what I mean: All variables below are of type "vector int"
>
> 1. v0 = v2 * v3
> 2. v0 = v4 * v5 + v6 * v7 + v8 * v9
>
> The first statement produces 1 multiply, 1 multiply-sum and 1 addition
> instruction in assembly.
>
> The second produces 6 multiply, 6 multiply-sum, and 10 addition
> instructions in assembly! I expected 3, 3, 3 of each respective
> operations from (1) plus 2 additions.

The operation counts given above were obtained using gcc 5.3.1 on Fedora 22.

I just created a simple test with those same statements and compiled it
using gcc 6.1.1 on Fedora 24. The assembly operation counts are what I had
expected initially and are more reasonable.

So, I'm going to move my ffmpeg development onto the Fedora 24 cloud image
and see if the SIMD performance there is better than it was on Fedora 22.

The reason I'm moving to Fedora 24 instead of trying to upgrade gcc on
Fedora 22 is that I've learned to prefer standard pre-installed images to
the wrecks I've managed to create doing my own sysadmin on the POWER8 cloud.
Re: [FFmpeg-devel] [PATCH] PPC64: Add versions of functions in libswscale/input.c optimized for POWER8 VSX SIMD.
On Mon, 2016-07-04 at 20:55 +0200, Hendrik Leppkes wrote:
> On Mon, Jul 4, 2016 at 5:20 PM, Dan Parrot wrote:
> >> Why is this not faster?
> > Surprisingly, gcc is producing some badly suboptimal assembly. I need to
> > follow up with IBM's Linux Technology Center. The major issue is that
> > multiplication of vector quantities in C is generating as many
> > multiplications in assembly as would scalar multiplication in a loop. No
> > way that should be occurring.
> >
>
> This is the reason why we generally don't allow intrinsic
> optimizations and instead ask people to write full assembly instead.
> It behaves more consistently everywhere.

Is this then a requirement to abandon the use of intrinsics for PPC64 SIMD
and instead re-implement in assembly?
Re: [FFmpeg-devel] [PATCH] PPC64: Add versions of functions in libswscale/input.c optimized for POWER8 VSX SIMD.
On Mon, 2016-07-04 at 16:30 +, Carl Eugen Hoyos wrote:
> Dan Parrot writes:
>
> > > Did you test if using ffmpeg -benchmark -f rawvideo -i /dev/zero...
> > > showed different results?
> > > I believe this should be both easier and faster to test.
> >
> > Sorry, I don't understand what that command line just above
> > is trying to achieve. Could you elaborate?
>
> Instead of running the whole fate suite that takes long and
> does not test libswscale for most commands, just test an
> ffmpeg command line that only tests libswscale:
> $ ffmpeg -benchmark -f rawvideo -pix_fmt rgb24
> -i /dev/zero -pix_fmt yuv420p -f null -vframes 1 -

$ ./ffmpeg -benchmark -f rawvideo -pix_fmt rgb24 -s hd1080 -i /dev/zero -pix_fmt yuv420p -f null -vframes 1000 -
frame= 1000 fps= 16 q=-0.0 Lsize=N/A time=00:00:40.00 bitrate=N/A speed=0.632x
video:477kB audio:0kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: unknown
bench: utime=62.794s
bench: maxrss=21184kB

> vs
>
> $ ffmpeg -cpuflags 0 -benchmark -f rawvideo -pix_fmt rgb24
> -i /dev/zero -pix_fmt yuv420p -f null -vframes 1 -

$ ./ffmpeg -cpuflags 0 -benchmark -f rawvideo -pix_fmt rgb24 -s hd1080 -i /dev/zero -pix_fmt yuv420p -f null -vframes 1000 -
frame= 1000 fps= 12 q=-0.0 Lsize=N/A time=00:00:40.00 bitrate=N/A speed=0.479x
video:477kB audio:0kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: unknown
bench: utime=82.918s
bench: maxrss=21120kB

> [...]
>
> > Surprisingly, gcc is producing some badly suboptimal assembly.
>
> Just to make sure I don't misunderstand:
> Does this mean intrinsics are suboptimal to write assembly
> code?

So, the latest version of GCC does produce more efficient assembly. To
recap: GCC 5.3.1 produces assembly that does not take full advantage of
PPC64 POWER8 SIMD instructions. GCC 6.1.1 is much better and produces
shorter sequences that do use SIMD assembly instructions.

> > > Can you confirm with START_TIMER / STOP_TIMER that there is no
> > > gain?
> >
> > SystemTap probes provide identical functionality by measuring
> > deltas between function entry and function return.
>
> Sorry, I don't understand:
> Did you test with both methods to verify that they provide
> the same results?
> Note that if it turns out that START_TIMER / STOP_TIMER
> cannot be used on ppc64 (le) this would be important
> information for us.

These start/stop macros are the last issue I have outstanding. I hope to be
done in a few hours.
Re: [FFmpeg-devel] [PATCH] PPC64: Add versions of functions in libswscale/input.c optimized for POWER8 VSX SIMD.
> > > > Can you confirm with START_TIMER / STOP_TIMER that there is no
> > > > gain?
> > >
> > > SystemTap probes provide identical functionality by measuring
> > > deltas between function entry and function return.
> >
> > Sorry, I don't understand:
> > Did you test with both methods to verify that they provide
> > the same results?
> >
> > Note that if it turns out that START_TIMER / STOP_TIMER
> > cannot be used on ppc64 (le) this would be important
> > information for us.
> >
> I'll insert these macros and inform of the results if the code compiles
> and runs.

These results for START_TIMER/STOP_TIMER are with ffmpeg compiled using GCC 6.1.1

==
Existing non-SIMD version:

./ffmpeg -report -cpuflags 0 -benchmark -f rawvideo -pix_fmt rgb24 -s hd1080 -i /dev/zero -pix_fmt yuv420p -f null -vframes 1000 -

33770 UNITS in rgb24ToY_c, 1 runs, 0 skips
33430 UNITS in rgb24ToY_c, 2 runs, 0 skips
33292 UNITS in rgb24ToY_c, 4 runs, 0 skips
33128 UNITS in rgb24ToY_c, 8 runs, 0 skips
32848 UNITS in rgb24ToY_c, 16 runs, 0 skips
32347 UNITS in rgb24ToY_c, 32 runs, 0 skips
31831 UNITS in rgb24ToY_c, 64 runs, 0 skips
31594 UNITS in rgb24ToY_c, 128 runs, 0 skips
31513 UNITS in rgb24ToY_c, 256 runs, 0 skips
31628 UNITS in rgb24ToY_c, 512 runs, 0 skips
31466 UNITS in rgb24ToY_c, 1024 runs, 0 skips
31390 UNITS in rgb24ToY_c, 2048 runs, 0 skips
31411 UNITS in rgb24ToY_c, 4096 runs, 0 skips
31411 UNITS in rgb24ToY_c, 8192 runs, 0 skips   bitrate=N/A speed=0.522x
31399 UNITS in rgb24ToY_c, 16384 runs, 0 skips   bitrate=N/A speed=0.486x
31416 UNITS in rgb24ToY_c, 32763 runs, 5 skips   bitrate=N/A speed=0.467x
31413 UNITS in rgb24ToY_c, 65530 runs, 6 skips   bitrate=N/A speed=0.458x
31421 UNITS in rgb24ToY_c, 131064 runs, 8 skips   bitrate=N/A speed=0.454x
31430 UNITS in rgb24ToY_c, 262131 runs, 13 skips   bitrate=N/A speed=0.449x
31422 UNITS in rgb24ToY_c, 524264 runs, 24 skips   bitrate=N/A speed=0.449x
31424 UNITS in rgb24ToY_c, 1048532 runs, 44 skips   bitrate=N/A speed=0.45x
frame= 1000 fps= 11 q=-0.0 Lsize=N/A time=00:00:40.00 bitrate=N/A speed=0.449x
video:477kB audio:0kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: unknown
bench: utime=88.212s
bench: maxrss=21120kB

--

./ffmpeg -report -benchmark -f rawvideo -pix_fmt rgb24 -s hd1080 -i /dev/zero -pix_fmt yuv420p -f null -vframes 1000 -

35440 UNITS in rgb24ToY_c, 1 runs, 0 skips
34290 UNITS in rgb24ToY_c, 2 runs, 0 skips
33670 UNITS in rgb24ToY_c, 4 runs, 0 skips
33387 UNITS in rgb24ToY_c, 8 runs, 0 skips
32786 UNITS in rgb24ToY_c, 16 runs, 0 skips
32317 UNITS in rgb24ToY_c, 32 runs, 0 skips
32008 UNITS in rgb24ToY_c, 64 runs, 0 skips
31944 UNITS in rgb24ToY_c, 128 runs, 0 skips
32049 UNITS in rgb24ToY_c, 256 runs, 0 skips
31913 UNITS in rgb24ToY_c, 512 runs, 0 skips
31822 UNITS in rgb24ToY_c, 1024 runs, 0 skips
31805 UNITS in rgb24ToY_c, 2048 runs, 0 skips
31841 UNITS in rgb24ToY_c, 4096 runs, 0 skips
31825 UNITS in rgb24ToY_c, 8192 runs, 0 skips
31803 UNITS in rgb24ToY_c, 16383 runs, 1 skips   bitrate=N/A speed=0.649x
31822 UNITS in rgb24ToY_c, 32766 runs, 2 skips   bitrate=N/A speed=0.602x
31816 UNITS in rgb24ToY_c, 65532 runs, 4 skips   bitrate=N/A speed=0.59x
31811 UNITS in rgb24ToY_c, 131064 runs, 8 skips   bitrate=N/A speed=0.584x
31810 UNITS in rgb24ToY_c, 262133 runs, 11 skips   bitrate=N/A speed=0.583x
31811 UNITS in rgb24ToY_c, 524266 runs, 22 skips   bitrate=N/A speed=0.583x
31822 UNITS in rgb24ToY_c, 1048527 runs, 49 skips   bitrate=N/A speed=0.582x
frame= 1000 fps= 15 q=-0.0 Lsize=N/A time=00:00:40.00 bitrate=N/A speed=0.581x
video:477kB audio:0kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: unknown
bench: utime=68.211s
bench: maxrss=21120kB

SIMD version in patch submitted earlier in this message thread:

./ffmpeg -report -cpuflags 0 -benchmark -f rawvideo -pix_fmt rgb24 -s hd1080 -i /dev/zero -pix_fmt yuv420p -f null -vframes 1000 - && ./ffmpeg -report -benchmark -f rawvideo -pix_fmt rgb24 -s hd1080 -i /dev/zero -pix_fmt yuv420p -f null -vframes 1000

23950 UNITS in rgb24ToY_c_vsx, 1 runs, 0 skips
23175 UNITS in rgb24ToY_c_vsx, 2 runs, 0 skips
22752 UNITS in rgb24ToY_c_vsx, 4 runs, 0 skips
22401 UNITS in rgb24ToY_c_vsx, 8 runs, 0 skips
22106 UNITS in rgb24ToY_c_vsx, 16 runs, 0 skips
21585 UNITS in rgb24ToY_c_vsx, 32 runs, 0 skips
21126 UNITS in rgb24ToY_c_vsx, 64 runs, 0 skips
2
Re: [FFmpeg-devel] [PATCH] PPC64: Add versions of functions in libswscale/input.c optimized for POWER8 VSX SIMD.
On Mon, 2016-07-04 at 09:20 +, Carl Eugen Hoyos wrote:
> Dan Parrot writes:
>
> > The dataset used was the entire FATE regression suite.
>
> I don't think this is a particularly useful testcase:
> It takes very long but mostly tests other things.
>
> Did you test if using ffmpeg -benchmark -f rawvideo -i /dev/zero...
> showed different results?
> I believe this should be both easier and faster to test.
>
> > name: rgb24ToY_c_vsx.
> > no. of calls: . min: 3832 ns. avg: 4709 ns. max: 37550 ns.
> > total: 47093533 ns.
> >
> > name: rgb24ToY_c.
> > no. of calls: . min: 3809 ns. avg: 4707 ns. max: 29041 ns.
> > total: 47072923 ns.
>
> Without any data, I would have thought that this is the most
> important function (and "no. of calls" seems to confirm this).
>
> Why is this not faster?
> Can you confirm with START_TIMER / STOP_TIMER that there is no
> gain?
>
> Thank you, Carl Eugen
Re: [FFmpeg-devel] [PATCH] PPC64: Add versions of functions in libswscale/input.c optimized for POWER8 VSX SIMD.
On Mon, 2016-07-04 at 23:31 -0500, Dan Parrot wrote:
> On Mon, 2016-07-04 at 09:20 +, Carl Eugen Hoyos wrote:
> > Dan Parrot writes:
> >
> > > The dataset used was the entire FATE regression suite.
> >
> > I don't think this is a particularly useful testcase:
> > It takes very long but mostly tests other things.
> >
> > Did you test if using ffmpeg -benchmark -f rawvideo -i /dev/zero...
> > showed different results?
> > I believe this should be both easier and faster to test.
> >
> > > name: rgb24ToY_c_vsx.
> > > no. of calls: . min: 3832 ns. avg: 4709 ns. max: 37550 ns.
> > > total: 47093533 ns.
> > >
> > > name: rgb24ToY_c.
> > > no. of calls: . min: 3809 ns. avg: 4707 ns. max: 29041 ns.
> > > total: 47072923 ns.
> >
> > Without any data, I would have thought that this is the most
> > important function (and "no. of calls" seems to confirm this).
> >
> > Why is this not faster?

I believe I have answered, in earlier posts, all the questions you raised.

Finally, just to satisfy my curiosity, I used SystemTap to probe during a
run of the entire FATE regression. Here are the same two functions, this
time with GCC 6.1.1 instead of 5.3.1 (it is representative of all other
functions):

name: rgb24ToY_c_vsx.
no. of calls: . min: 3053 ns. avg: 3298 ns. max: 69359 ns. total: 32983050 ns.

name: rgb24ToY_c.
no. of calls: . min: 3040 ns. avg: 4056 ns. max: 79159 ns. total: 40561568 ns.

Non-trivial improvement is seen for the SIMD code.

So: would you accept and apply the patch?
Re: [FFmpeg-devel] [PATCH] PPC64: Add versions of functions in libswscale/input.c optimized for POWER8 VSX SIMD.
On Mon, 2016-07-04 at 06:22 +, Carl Eugen Hoyos wrote:
> Dan Parrot writes:
>
> > Finish providing SIMD versions for POWER8 VSX of functions
> > in libswscale/input.c
> > That should allow trac ticket #5570 to be closed.
>
> Please add some numbers:
> Either for single functions or for a single ffmpeg command.
> (for rgb/bgr, mono is irrelevant)
>
> Carl Eugen

All questions you posed have now been answered in the thread. If there are
no new issues, could you apply the patch and close trac ticket #5570?
Re: [FFmpeg-devel] [PATCH] PPC64: Add versions of functions in libswscale/input.c optimized for POWER8 VSX SIMD.
On Tue, 2016-07-05 at 15:45 +, Carl Eugen Hoyos wrote:
> Dan Parrot writes:
>
> > These results for START_TIMER/STOP_TIMER are with ffmpeg
> > compiled using GCC 6.1.1
>
> I believe your results indicate that -cpuflags 0 has no
> effect on vsx.
> On x86, this would be a blocker.
>
> While I would prefer if you had tested with vanilla
> gcc instead of a version patched by a distributor, no
> more comments from me.

Just to be clear, I am not making any claims about any special patching in
GCC 6.1.1. All I am saying about GCC is that version 5.3.1 produces
suboptimal SIMD sequences; 6.1.1 gives much better SIMD sequences.

Both versions are preinstalled on Fedora images in the IBM cloud. I have no
way of knowing how those versions of GCC behave on different machines.
[FFmpeg-devel] [PATCH] PPC64: Add versions of functions in libswscale/input.c optimized for POWER8 VSX SIMD.
Finish providing SIMD versions for POWER8 VSX of functions in libswscale/input.c
That should allow trac ticket #5570 to be closed.

The speedups obtained for the functions are:

abgrToA_c           1.19
bgr24ToUV_c         1.23
bgr24ToUV_half_c    1.37
bgr24ToY_c_vsx      1.43
nv12ToUV_c          1.05
nv21ToUV_c          1.06
planar_rgb_to_uv    1.25
planar_rgb_to_y     1.26
rgb24ToUV_c         1.11
rgb24ToUV_half_c    1.10
rgb24ToY_c          0.92
rgbaToA_c           0.88
uyvyToUV_c          1.05
uyvyToY_c           1.15
yuy2ToUV_c          1.07
yuy2ToY_c           1.17
yvy2ToUV_c          1.05
---
 libswscale/ppc/input_vsx.c | 1021 +++-
 1 file changed, 1017 insertions(+), 4 deletions(-)

diff --git a/libswscale/ppc/input_vsx.c b/libswscale/ppc/input_vsx.c
index d977a32..35edd5e 100644
--- a/libswscale/ppc/input_vsx.c
+++ b/libswscale/ppc/input_vsx.c
@@ -30,6 +30,7 @@
 #include "libavutil/mathematics.h"
 #include "libavutil/pixdesc.h"
 #include "libavutil/avassert.h"
+#include "libavutil/timer.h"
 #include "config.h"
 #include "libswscale/rgb2rgb.h"
 #include "libswscale/swscale.h"
@@ -54,6 +55,7 @@ static void abgrToA_c_vsx(uint8_t *_dst, const uint8_t *src, const uint8_t *unus
     for ( i = 0; i < width_adj; i += 8) {
         vector int v_rd0 = vec_vsx_ld(0, (int *)src_addr);
         vector int v_rd1 = vec_vsx_ld(0, (int *)(src_addr + 16));
+        vector int v_dst;

         v_rd0 = vec_and(v_rd0, vec_splats(0x0ff));
         v_rd1 = vec_and(v_rd1, vec_splats(0x0ff));
@@ -61,8 +63,8 @@ static void abgrToA_c_vsx(uint8_t *_dst, const uint8_t *src, const uint8_t *unus
         v_rd0 = vec_sl(v_rd0, vec_splats((unsigned)6));
         v_rd1 = vec_sl(v_rd1, vec_splats((unsigned)6));

-        vector int v_dst = vec_perm(v_rd0, v_rd1, ((vector unsigned char)
-            {0, 1, 4, 5, 8, 9, 12, 13, 16, 17, 20, 21, 24, 25, 28, 29}));
+        v_dst = vec_perm(v_rd0, v_rd1, ((vector unsigned char)
+            {0, 1, 4, 5, 8, 9, 12, 13, 16, 17, 20, 21, 24, 25, 28, 29}));
         vec_vsx_st((vector unsigned char)v_dst, 0, (unsigned char *)dst_addr);

         src_addr += 32;
@@ -91,6 +93,7 @@ static void rgbaToA_c_vsx(uint8_t *_dst, const uint8_t *src, const uint8_t *unus
     for ( i = 0; i < width_adj; i += 8) {
         vector int v_rd0 = vec_vsx_ld(0, (int *)src_addr);
         vector int v_rd1 = vec_vsx_ld(0, (int *)(src_addr + 16));
+        vector int v_dst;

         v_rd0 = vec_sld(v_rd0, v_rd0, 13);
         v_rd1 = vec_sld(v_rd1, v_rd1, 13);
@@ -101,8 +104,8 @@ static void rgbaToA_c_vsx(uint8_t *_dst, const uint8_t *src, const uint8_t *unus
         v_rd0 = vec_sl(v_rd0, vec_splats((unsigned)6));
         v_rd1 = vec_sl(v_rd1, vec_splats((unsigned)6));

-        vector int v_dst = vec_perm(v_rd0, v_rd1, ((vector unsigned char)
-            {0, 1, 4, 5, 8, 9, 12, 13, 16, 17, 20, 21, 24, 25, 28, 29}));
+        v_dst = vec_perm(v_rd0, v_rd1, ((vector unsigned char)
+            {0, 1, 4, 5, 8, 9, 12, 13, 16, 17, 20, 21, 24, 25, 28, 29}));
         vec_vsx_st((vector unsigned char)v_dst, 0, (unsigned char *)dst_addr);

         src_addr += 32;
@@ -114,6 +117,175 @@ static void rgbaToA_c_vsx(uint8_t *_dst, const uint8_t *src, const uint8_t *unus
     }
 }

+static void monoblack2Y_c_vsx(uint8_t *_dst, const uint8_t *src, const uint8_t *unused1, const uint8_t *unused2,
+                              int width, uint32_t *unused)
+{
+    int16_t *dst = (int16_t *)_dst;
+    int i, j, width_adj, frag_len;
+
+    vector unsigned char v_rd;
+    vector signed short v_din, v_d, v_dst;
+    vector unsigned short v_opr;
+
+    uintptr_t src_addr = (uintptr_t)src;
+    uintptr_t dst_addr = (uintptr_t)dst;
+
+    width = (width + 7) >> 3;
+
+    // compute integral number of vector-length items and length of final fragment
+    width_adj = width >> 3;
+    width_adj = width_adj << 3;
+    frag_len = width - width_adj;
+
+    v_opr = (vector unsigned short) {7, 6, 5, 4, 3, 2, 1, 0};
+
+    for (i = 0; i < width_adj; i += 8) {
+        if (i & 0x0f) {
+            v_rd = vec_sld(v_rd, v_rd, 8);
+        } else {
+            v_rd = vec_vsx_ld(0, (unsigned char *)src_addr);
+            src_addr += 16;
+        }
+
+        v_din = vec_unpackh((vector signed char)v_rd);
+        v_din = vec_and(v_din, vec_splats((short)0x00ff));
+
+        for (j = 0; j < 8; j++) {
+            switch(j) {
+            case 0:
+                v_d = vec_splat(v_din, 0);
+                break;
+            case 1:
+                v_d = vec_splat(v_din, 1);
+                break;
+            case 2:
+                v_d = vec_splat(v_din, 2);
+                break;
+            case 3:
+                v_d = vec_splat(v_din, 3);
+                break;
+
Re: [FFmpeg-devel] [PATCH] PPC64: Add versions of functions in libswscale/input.c optimized for POWER8 VSX SIMD.
On Wed, 2016-07-06 at 09:07 +0200, Hendrik Leppkes wrote:
> On Wed, Jul 6, 2016 at 4:37 AM, Dan Parrot wrote:
> > Finish providing SIMD versions for POWER8 VSX of functions in
> > libswscale/input.c That should allow trac ticket #5570 to be closed.
> > The speedups obtained for the functions are:
> >
> > abgrToA_c           1.19
> > bgr24ToUV_c         1.23
> > bgr24ToUV_half_c    1.37
> > bgr24ToY_c_vsx      1.43
> > nv12ToUV_c          1.05
> > nv21ToUV_c          1.06
> > planar_rgb_to_uv    1.25
> > planar_rgb_to_y     1.26
> > rgb24ToUV_c         1.11
> > rgb24ToUV_half_c    1.10
> > rgb24ToY_c          0.92
> > rgbaToA_c           0.88
> > uyvyToUV_c          1.05
> > uyvyToY_c           1.15
> > yuy2ToUV_c          1.07
> > yuy2ToY_c           1.17
> > yvy2ToUV_c          1.05
>
> SIMD implementations that in the best case improve the speed by 43%
> (and in some cases are *slower*) seem barely worth it. One would expect
> a proper SIMD implementation to offer 100% or higher increases, at
> least that's the general expectation on x86 with SSE/AVX.

It sounds like you have either forgotten or never learned a very basic
principle of computer architecture. I recommend the text by Patterson and
Hennessy. The principle is Amdahl's Law.

Before you start throwing numbers around, make sure you understand what was
being parallelized.

> So the question here is - is that VSX being bad, or the intrinsics
> being bad? How would the speedup be in proper hand-written ASM?
> If hand-written ASM can give us the usual 100-200% improvements we would
> expect from SIMD, then this is what should generally be favored.

I am not going to write assembly just so you get a nice fuzzy feeling. If
that's a deal-breaker, so be it.

> Also, one further thought:
> From the commit message, it sounds like you might only be doing this
> for the bounty in #5570, do you plan to maintain these optimizations
> in the future?

Unless you are a mind reader, STFU about my motivation in writing code.

One other thing: why didn't this come up when the earlier patch was
submitted and applied?