from:"Dan Parrot"

[FFmpeg-devel] Which additional files need to be modified when adding PPC-specific version of libswscale/input.c

2016-05-22 Thread Dan Parrot

I am working on a patch to improve ffmpeg performance on PPC by using
vector SIMD facilities for those processor versions that have the
capability. 

There are 50 functions in libswscale/input.c that I am modifying. Adding
#ifdefs in all the functions probably isn't the way to go.

Is there a preferred method to implement such a change?

Thanks.
Dan Parrot.


___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel

[FFmpeg-devel] Is this the expected behavior in libswscale/input.c

2016-06-09 Thread Dan Parrot

Line 72 of libswscale/input.c is:
dstU[i] = (ru*r + gu*g + bu*b + (0x10001<<(RGB2YUV_SHIFT-1))) >>
RGB2YUV_SHIFT;

The definition of macro RGB2YUV_SHIFT in libswscale/swscale_internal.h
is on line 417:
#define RGB2YUV_SHIFT 15

By examining the result of executing line 72 in input.c it appears that
the radix used for macro RGB2YUV_SHIFT is hexadecimal. So that 0x10001
is shifted left by 20 in subexpression 0x10001<<(RGB2YUV_SHIFT-1) with
the result being 0x10.

I was expecting a left-shift of 14, given the macro definition.

Should the macro be interpreted as decimal 15 or hexadecimal 15?

Thanks.
Dan.



Re: [FFmpeg-devel] Is this the expected behavior in libswscale/input.c

2016-06-09 Thread Dan Parrot

On Thu, 2016-06-09 at 17:01 -0400, Ronald S. Bultje wrote:
> Hi,
> 
> On Thu, Jun 9, 2016 at 4:02 PM, Dan Parrot  wrote:
> 
> > Line 72 of libswscale/input.c is:
> > dstU[i] = (ru*r + gu*g + bu*b + (0x10001<<(RGB2YUV_SHIFT-1))) >>
> > RGB2YUV_SHIFT;
> >
> > The definition of macro RGB2YUV_SHIFT in libswscale/swscale_internal.h
> > is on line 417:
> > #define RGB2YUV_SHIFT 15
> >
> > By examining the result of executing line 72 in input.c it appears that
> > the radix used for macro RGB2YUV_SHIFT is hexadecimal. So that 0x10001
> > is shifted left by 20 in subexpression 0x10001<<(RGB2YUV_SHIFT-1) with
> > the result being 0x10.
> >
> > I was expecting a left-shift of 14, given the macro definition.
> >
> > Should the macro be interpreted as decimal 15 or hexadecimal 15?
> 
> 
> It's decimal 15, it should shift by 14. Check disassembly, though... It's
> almost impossible that your compiler would read this as hex and still be
> able to produce runnable applications. Maybe submit a bug report to your
> compiler's bug tracker?
> 
> Ronald

There is no issue with the software. I have been examining the code in
gdb and what confused me was having changed the default gdb radix with
"set radix 16".

Thanks Ronald.
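
A short sketch of the arithmetic behind the confusion, assuming that with
"set radix 16" gdb reads a typed 15 as 0x15 (decimal 21). The function names
are made up for illustration; only RGB2YUV_SHIFT and the constant 0x10001
come from the code discussed above.

```c
#include <stdint.h>

#define RGB2YUV_SHIFT 15   /* decimal, as in swscale_internal.h */

/* Rounding term as the C compiler evaluates it: shift by 15 - 1 = 14. */
uint64_t rounding_term_decimal(void)
{
    return (uint64_t)0x10001 << (RGB2YUV_SHIFT - 1);
}

/* Same expression if "15" is misread as hexadecimal: 0x15 - 1 = 20.
 * This is only what a debugger session with a changed input radix would
 * compute by hand; the compiled code never does this. */
uint64_t rounding_term_misread_as_hex(void)
{
    return (uint64_t)0x10001 << (0x15 - 1);
}
```

The compiled program always uses the first value; the second only appears
when the expression is re-typed at a debugger prompt whose input radix is 16.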



[FFmpeg-devel] [PATCH] PPC64: IBM POWER8 SIMD Implementation

2016-06-14 Thread Dan Parrot


From e38eb7af05be27d8f36058373557d86e5a481db8 Mon Sep 17 00:00:00 2001
From: Dan Parrot 
Date: Tue, 14 Jun 2016 23:19:21 +
Subject: [PATCH] PPC64: IBM POWER8 SIMD Implementation

This is the first commit addressing Trac ticket #5570. Functions defined in
libswscale/input.c have corresponding definitions in libswscale/ppc/input_vsx.h
The corresponding function names in the latter contain the suffix "_vsx".
---
 libswscale/input.c |  38 ++-
 libswscale/ppc/input_vsx.h | 831 +
 2 files changed, 853 insertions(+), 16 deletions(-)
 create mode 100644 libswscale/ppc/input_vsx.h

diff --git a/libswscale/input.c b/libswscale/input.c
index eed0f49..de4347e 100644
--- a/libswscale/input.c
+++ b/libswscale/input.c
@@ -40,6 +40,13 @@
 #define r ((origin == AV_PIX_FMT_BGR48BE || origin == AV_PIX_FMT_BGR48LE || origin == AV_PIX_FMT_BGRA64BE || origin == AV_PIX_FMT_BGRA64LE) ? b_r : r_b)
 #define b ((origin == AV_PIX_FMT_BGR48BE || origin == AV_PIX_FMT_BGR48LE || origin == AV_PIX_FMT_BGRA64BE || origin == AV_PIX_FMT_BGRA64LE) ? r_b : b_r)
 
+#ifdef HAVE_VSX
+#include "ppc/input_vsx.h"
+#define RENAME_SIMD(fname) fname ## _vsx
+#else
+#define RENAME_SIMD(fname) fname
+#endif
+
 static av_always_inline void
 rgb64ToY_c_template(uint16_t *dst, const uint16_t *src, int width,
 enum AVPixelFormat origin, int32_t *rgb2yuv)
@@ -99,7 +106,7 @@ static void pattern ## 64 ## BE_LE ## ToY_c(uint8_t *_dst, const uint8_t *_src,
 { \
 const uint16_t *src = (const uint16_t *) _src; \
 uint16_t *dst = (uint16_t *) _dst; \
-rgb64ToY_c_template(dst, src, width, origin, rgb2yuv); \
+RENAME_SIMD(rgb64ToY_c_template)(dst, src, width, origin, rgb2yuv); \
 } \
  \
 static void pattern ## 64 ## BE_LE ## ToUV_c(uint8_t *_dstU, uint8_t *_dstV, \
@@ -109,7 +116,7 @@ static void pattern ## 64 ## BE_LE ## ToUV_c(uint8_t *_dstU, uint8_t *_dstV, \
 const uint16_t *src1 = (const uint16_t *) _src1, \
*src2 = (const uint16_t *) _src2; \
 uint16_t *dstU = (uint16_t *) _dstU, *dstV = (uint16_t *) _dstV; \
-rgb64ToUV_c_template(dstU, dstV, src1, src2, width, origin, rgb2yuv); \
+RENAME_SIMD(rgb64ToUV_c_template)(dstU, dstV, src1, src2, width, origin, rgb2yuv); \
 } \
  \
 static void pattern ## 64 ## BE_LE ## ToUV_half_c(uint8_t *_dstU, uint8_t *_dstV, \
@@ -119,7 +126,7 @@ static void pattern ## 64 ## BE_LE ## ToUV_half_c(uint8_t *_dstU, uint8_t *_dstV
 const uint16_t *src1 = (const uint16_t *) _src1, \
*src2 = (const uint16_t *) _src2; \
 uint16_t *dstU = (uint16_t *) _dstU, *dstV = (uint16_t *) _dstV; \
-rgb64ToUV_half_c_template(dstU, dstV, src1, src2, width, origin, rgb2yuv); \
+RENAME_SIMD(rgb64ToUV_half_c_template)(dstU, dstV, src1, src2, width, origin, rgb2yuv); \
 }
 
 rgb64funcs(rgb, LE, AV_PIX_FMT_RGBA64LE)
@@ -203,7 +210,7 @@ static void pattern ## 48 ## BE_LE ## ToY_c(uint8_t *_dst,  \
 {   \
 const uint16_t *src = (const uint16_t *)_src;   \
 uint16_t *dst   = (uint16_t *)_dst; \
-rgb48ToY_c_template(dst, src, width, origin, rgb2yuv);  \
+RENAME_SIMD(rgb48ToY_c_template)(dst, src, width, origin, rgb2yuv);  \
 }   \
 \
 static void pattern ## 48 ## BE_LE ## ToUV_c(uint8_t *_dstU,\
@@ -218,7 +225,7 @@ static void pattern ## 48 ## BE_LE ## ToUV_c(uint8_t *_dstU,\
*src2 = (const uint16_t *)_src2; \
 uint16_t *dstU = (uint16_t *)_dstU, \
  *dstV = (uint16_t *)_dstV; \
-rgb48ToUV_c_template(dstU, dstV, src1, src2, width, origin, rgb2yuv);\
+RENAME_SIMD(rgb48ToUV_c_template)(dstU, dstV, src1, src2, width, origin, rgb2yuv);\
 }   \
 \
 static void pattern ## 48 ## BE_LE ## ToUV_half_c(uint8_t *_dstU,   \
@@ -233,7 +240,7 @@ static void pattern ## 48 ## BE_LE ## ToUV_half_c(uint8_t *_dstU,   \
*src2 = (const uint16_t *)_src2; \
 uint16_t *dstU = (uint16_t *)_dstU, \
  *dstV = (uint16_t *)_dstV; \
-rgb48ToUV_half_c_template(dstU, dstV, src1, src2, width, origin, rgb2yuv);   \
+RENAME_SIMD(rgb48ToUV_half_c_template)(dstU, dstV, src1, src2, width, origin, rgb2yuv);   \
 }
 
 rgb48funcs(rgb, LE, AV_PIX_FMT_RGB48LE)
@@ -273,7 +280,6 @@ static av_always_inline void rgb16_32ToY_c_template(int16_t *dst,
 dst[i] = (ry * r + g

Re: [FFmpeg-devel] [PATCH] PPC64: IBM POWER8 SIMD Implementation

2016-06-14 Thread Dan Parrot

On Tue, 2016-06-14 at 18:56 -0500, Dan Parrot wrote:
> [...]

Please disregard this attempted patch. I made the wrong choice of sending it
from an email client rather than with git send-email.

Apologies.

Dan.


[FFmpeg-devel] [PATCH] PPC64: Add IBM POWER8 SIMD Implementation

2016-06-14 Thread Dan Parrot

 int16_t *dstV,
  const uint8_t *src,
@@ -351,17 +357,17 @@ static av_always_inline void rgb16_32ToUV_half_c_template(int16_t *dstU,
 static void name ## ToY_c(uint8_t *dst, const uint8_t *src, const uint8_t *unused1, const uint8_t *unused2,\
   int width, uint32_t *tab) \
 {   \
-rgb16_32ToY_c_template((int16_t*)dst, src, width, fmt, shr, shg, shb, shp,    \
-   maskr, maskg, maskb, rsh, gsh, bsh, S, tab); \
+RENAME_SIMD(rgb16_32ToY_c_template)((int16_t*)dst, src, width, fmt, shr, shg, shb, shp,\
+maskr, maskg, maskb, rsh, gsh, bsh, S, tab); \
 }   \
 \
 static void name ## ToUV_c(uint8_t *dstU, uint8_t *dstV,\
const uint8_t *unused0, const uint8_t *src, const uint8_t *dummy,\
int width, uint32_t *tab)\
 {   \
-rgb16_32ToUV_c_template((int16_t*)dstU, (int16_t*)dstV, src, width, fmt,    \
-shr, shg, shb, shp, \
-maskr, maskg, maskb, rsh, gsh, bsh, S, tab);\
+RENAME_SIMD(rgb16_32ToUV_c_template)((int16_t*)dstU, (int16_t*)dstV, src, width, fmt,\
+ shr, shg, shb, shp,     \
+ maskr, maskg, maskb, rsh, gsh, bsh, S, tab);\
 }   \
 \
 static void name ## ToUV_half_c(uint8_t *dstU, uint8_t *dstV,   \
@@ -369,10 +375,10 @@ static void name ## ToUV_half_c(uint8_t *dstU, uint8_t *dstV,   \
 const uint8_t *dummy,   \
 int width, uint32_t *tab)   \
 {   \
-rgb16_32ToUV_half_c_template((int16_t*)dstU, (int16_t*)dstV, src, width, fmt,   \
- shr, shg, shb, shp,\
- maskr, maskg, maskb,   \
- rsh, gsh, bsh, S, tab);\
+RENAME_SIMD(rgb16_32ToUV_half_c_template)((int16_t*)dstU, (int16_t*)dstV, src, width, fmt,   \
+  shr, shg, shb, shp,    \
+  maskr, maskg, maskb,    \
+  rsh, gsh, bsh, S, tab);    \
 }
 
 rgb16_32_wrapper(AV_PIX_FMT_BGR32,bgr32,  16, 0,  0, 0, 0xFF, 0xFF00,   0x00FF,  8, 0,  8, RGB2YUV_SHIFT + 8)
@@ -978,7 +984,6 @@ av_cold void ff_sws_init_input_funcs(SwsContext *c)
 case AV_PIX_FMT_GBRP9LE:
 c->readChrPlanar = planar_rgb9le_to_uv;
 break;
-case AV_PIX_FMT_GBRAP10LE:
 case AV_PIX_FMT_GBRP10LE:
 c->readChrPlanar = planar_rgb10le_to_uv;
 break;
@@ -996,7 +1001,6 @@ av_cold void ff_sws_init_input_funcs(SwsContext *c)
 case AV_PIX_FMT_GBRP9BE:
 c->readChrPlanar = planar_rgb9be_to_uv;
 break;
-case AV_PIX_FMT_GBRAP10BE:
 case AV_PIX_FMT_GBRP10BE:
 c->readChrPlanar = planar_rgb10be_to_uv;
 break;
@@ -1260,8 +1264,6 @@ av_cold void ff_sws_init_input_funcs(SwsContext *c)
 case AV_PIX_FMT_GBRP9LE:
 c->readLumPlanar = planar_rgb9le_to_y;
 break;
-case AV_PIX_FMT_GBRAP10LE:
-c->readAlpPlanar = planar_rgb10le_to_a;
 case AV_PIX_FMT_GBRP10LE:
 c->readLumPlanar = planar_rgb10le_to_y;
 break;
@@ -1281,8 +1283,6 @@ av_cold void ff_sws_init_input_funcs(SwsContext *c)
 case AV_PIX_FMT_GBRP9BE:
 c->readLumPlanar = planar_rgb9be_to_y;
 break;
-case AV_PIX_FMT_GBRAP10BE:
-c->readAlpPlanar = planar_rgb10be_to_a;
 case AV_PIX_FMT_GBRP10BE:
 c->readLumPlanar = planar_rgb10be_to_y;
 break;
diff --git a/libswscale/ppc/input_vsx.h b/libswscale/ppc/input_vsx.h
new file mode 100644
index 000..09fe8c1
--- /dev/null
+++ b/libswscale/ppc/input_vsx.h
@@ -0,0 +1,831 @@
+/*
+ * Copyright (C) 2016 Dan Parrot 
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option)

Re: [FFmpeg-devel] [PATCH] PPC64: Add IBM POWER8 SIMD Implementation

2016-06-15 Thread Dan Parrot

On Wed, 2016-06-15 at 10:15 +0200, Michael Niedermayer wrote:
> On Wed, Jun 15, 2016 at 04:25:11AM +0000, Dan Parrot wrote:
> > This is the first commit addressing Trac ticket #5570. Functions defined in
> > libswscale/input.c have corresponding definitions in libswscale/ppc/input_vsx.h
> > The corresponding function names in the latter contain the suffix "_vsx".
> > ---
> >  libswscale/input.c |  44 +--
> >  libswscale/ppc/input_vsx.h | 831 +
> >  2 files changed, 853 insertions(+), 22 deletions(-)
> >  create mode 100644 libswscale/ppc/input_vsx.h
> 
> breaks build on x86
>  ./configure && make -j12
> In file included from libswscale/input.c:44:0:
> libswscale/ppc/input_vsx.h: In function ‘rgb64ToY_c_template_vsx’:
> libswscale/ppc/input_vsx.h:34:5: error: ‘vector’ undeclared (first use in this function)
> libswscale/ppc/input_vsx.h:34:5: note: each undeclared identifier is reported only once for each function it appears in
> libswscale/ppc/input_vsx.h:34:12: error: expected ‘;’ before ‘int’
> libswscale/ppc/input_vsx.h:35:12: error: expected ‘;’ before ‘int’
> libswscale/ppc/input_vsx.h:36:12: error: expected ‘;’ before ‘int’
> 
> [...]
> > @@ -978,7 +984,6 @@ av_cold void ff_sws_init_input_funcs(SwsContext *c)
> >  case AV_PIX_FMT_GBRP9LE:
> >  c->readChrPlanar = planar_rgb9le_to_uv;
> >  break;
> > -case AV_PIX_FMT_GBRAP10LE:
> >  case AV_PIX_FMT_GBRP10LE:
> >  c->readChrPlanar = planar_rgb10le_to_uv;
> >  break;
> > @@ -996,7 +1001,6 @@ av_cold void ff_sws_init_input_funcs(SwsContext *c)
> >  case AV_PIX_FMT_GBRP9BE:
> >  c->readChrPlanar = planar_rgb9be_to_uv;
> >  break;
> > -case AV_PIX_FMT_GBRAP10BE:
> >  case AV_PIX_FMT_GBRP10BE:
> >  c->readChrPlanar = planar_rgb10be_to_uv;
> >  break;
> > @@ -1260,8 +1264,6 @@ av_cold void ff_sws_init_input_funcs(SwsContext *c)
> >  case AV_PIX_FMT_GBRP9LE:
> >  c->readLumPlanar = planar_rgb9le_to_y;
> >  break;
> > -case AV_PIX_FMT_GBRAP10LE:
> > -c->readAlpPlanar = planar_rgb10le_to_a;
> >  case AV_PIX_FMT_GBRP10LE:
> >  c->readLumPlanar = planar_rgb10le_to_y;
> >  break;
> > @@ -1281,8 +1283,6 @@ av_cold void ff_sws_init_input_funcs(SwsContext *c)
> >  case AV_PIX_FMT_GBRP9BE:
> >  c->readLumPlanar = planar_rgb9be_to_y;
> >  break;
> > -case AV_PIX_FMT_GBRAP10BE:
> > -c->readAlpPlanar = planar_rgb10be_to_a;
> >  case AV_PIX_FMT_GBRP10BE:
> >  c->readLumPlanar = planar_rgb10be_to_y;
> >  break;
> 
> why do you remove these ?
> thats not ppc related
> 
> [...]

I did not intend to touch anything non-PPC. I didn't review the git diff
carefully enough. I'll resend the patch with modifications only for PPC.
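
The x86 failure quoted above is what one would expect if config.h defines
HAVE_VSX as 0 on that platform instead of leaving it undefined; that is an
assumption here, not something shown in the thread. Under that assumption an
#ifdef guard still pulls in ppc/input_vsx.h, while an #if guard would not.
A minimal sketch with made-up GUARD_* names:

```c
/* Hypothetical stand-in for a generated config.h on a non-PPC build:
 * the macro is defined, but to 0. */
#define HAVE_VSX 0

/* #ifdef only asks "is the name defined?", so this branch is taken. */
#ifdef HAVE_VSX
#define GUARD_IFDEF_TAKEN 1
#else
#define GUARD_IFDEF_TAKEN 0
#endif

/* #if evaluates the value, so this branch is skipped when HAVE_VSX is 0. */
#if HAVE_VSX
#define GUARD_IF_TAKEN 1
#else
#define GUARD_IF_TAKEN 0
#endif

int guard_ifdef_taken(void) { return GUARD_IFDEF_TAKEN; }
int guard_if_taken(void)    { return GUARD_IF_TAKEN; }
```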



Re: [FFmpeg-devel] [PATCH] PPC64: Add IBM POWER8 SIMD Implementation

2016-06-15 Thread Dan Parrot

On Wed, 2016-06-15 at 11:19 +, Carl Eugen Hoyos wrote:
> Dan Parrot  mail.com> writes:
> 
> [...]
> 
> I know this isn't completely related but do you have time 
> to look at ticket #5508?
> https://trac.ffmpeg.org/ticket/5508
> No active developer has hardware and knowledge to look into 
> this issue;-(
> 
> Carl Eugen
> 

I hope to have some time this weekend. We'll see how much progress I can
make on the ticket then.



Re: [FFmpeg-devel] [PATCH] PPC64: Add IBM POWER8 SIMD Implementation

2016-06-15 Thread Dan Parrot

On Wed, 2016-06-15 at 16:51 +0200, Hendrik Leppkes wrote:
> On Wed, Jun 15, 2016 at 6:25 AM, Dan Parrot  wrote:
> > This is the first commit addressing Trac ticket #5570. Functions defined in
> > libswscale/input.c have corresponding definitions in libswscale/ppc/input_vsx.h
> > The corresponding function names in the latter contain the suffix "_vsx".
> > ---
> >  libswscale/input.c |  44 +--
> >  libswscale/ppc/input_vsx.h | 831 +
> >  2 files changed, 853 insertions(+), 22 deletions(-)
> >  create mode 100644 libswscale/ppc/input_vsx.h
> >
> > diff --git a/libswscale/input.c b/libswscale/input.c
> > index 14ab5ab..de4347e 100644
> > --- a/libswscale/input.c
> > +++ b/libswscale/input.c
> > @@ -40,6 +40,13 @@
> >  #define r ((origin == AV_PIX_FMT_BGR48BE || origin == AV_PIX_FMT_BGR48LE || origin == AV_PIX_FMT_BGRA64BE || origin == AV_PIX_FMT_BGRA64LE) ? b_r : r_b)
> >  #define b ((origin == AV_PIX_FMT_BGR48BE || origin == AV_PIX_FMT_BGR48LE || origin == AV_PIX_FMT_BGRA64BE || origin == AV_PIX_FMT_BGRA64LE) ? r_b : b_r)
> >
> > +#ifdef HAVE_VSX
> > +#include "ppc/input_vsx.h"
> > +#define RENAME_SIMD(fname) fname ## _vsx
> > +#else
> > +#define RENAME_SIMD(fname) fname
> > +#endif
> > +
> >  static av_always_inline void
> >  rgb64ToY_c_template(uint16_t *dst, const uint16_t *src, int width,
> >  enum AVPixelFormat origin, int32_t *rgb2yuv)
> > @@ -99,7 +106,7 @@ static void pattern ## 64 ## BE_LE ## ToY_c(uint8_t *_dst, const uint8_t *_src,
> >  { \
> >  const uint16_t *src = (const uint16_t *) _src; \
> >  uint16_t *dst = (uint16_t *) _dst; \
> > -rgb64ToY_c_template(dst, src, width, origin, rgb2yuv); \
> > +RENAME_SIMD(rgb64ToY_c_template)(dst, src, width, origin, rgb2yuv); \
> >  } \
> 
> This is not how we integrate SIMD optimizations. These are the C
> functions, they are not meant to perform the SIMD.
> What you should do is provide SIMD functions and then provide a
> SIMD-specific init function that overwrites the function pointers with
> your SIMD functions.
> ie. just how it is done on x86. But do not touch the C functions by
> overriding them right in the code with SIMD variants, making the C
> variants inaccessible.

1. The #ifdef HAVE_VSX at the top of the file should actually be #if
HAVE_VSX. The intent is for C variants of functions to be replaced by
SIMD versions. But only for PowerPC machines that have Vector-Scalar
hardware. All other machines will retain exactly the same function names
they currently possess. Would that change eliminate this particular
objection?

2. To achieve the same effect using the style done for x86 requires more
code. Just considering the first function in input.c, one must provide:
i. A SIMD implementation of rgb64ToY_c_template
ii. Definitions for results of the 4 macro expansions of rgb64funcs.
These are the functions that actually instantiate rgb64ToY_c_template
iii. Replication of all the code in ff_sws_init_input_funcs that assigns
the results of item ii. above to function pointers.

Why is the x86 approach preferred over the seemingly simpler
preprocessor renaming?
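
For comparison, a rough sketch of the function-pointer approach Hendrik
describes. All names below (DemoContext, read_lum_c, read_lum_simd,
demo_init_c, demo_init_arch, cpu_has_simd) are hypothetical and much simpler
than the real SwsContext and init code; the "SIMD" function is a plain C
placeholder.

```c
#include <stdint.h>

/* Hypothetical, much-reduced stand-in for SwsContext: one function pointer. */
typedef void (*read_lum_fn)(uint8_t *dst, const uint8_t *src, int width);

typedef struct DemoContext {
    read_lum_fn read_lum;
} DemoContext;

/* Portable C version; always compiled and always reachable. */
void read_lum_c(uint8_t *dst, const uint8_t *src, int width)
{
    for (int i = 0; i < width; i++)
        dst[i] = src[i];
}

/* Placeholder for an architecture-specific version with identical output. */
void read_lum_simd(uint8_t *dst, const uint8_t *src, int width)
{
    for (int i = 0; i < width; i++)
        dst[i] = src[i];
}

/* The generic init installs the C functions first. */
void demo_init_c(DemoContext *c)
{
    c->read_lum = read_lum_c;
}

/* An arch-specific init then overwrites pointers, but only when the CPU
 * feature is reported at run time (cpu_has_simd is a hypothetical flag). */
void demo_init_arch(DemoContext *c, int cpu_has_simd)
{
    if (cpu_has_simd)
        c->read_lum = read_lum_simd;
}
```

With this layout the C function stays callable, for example as a reference
when testing the SIMD version, and the choice is made at run time rather than
at preprocessing time.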

> 
> > [...]

[FFmpeg-devel] [PATCH] PPC64: Add IBM POWER8 SIMD Implementation

2016-06-19 Thread Dan Parrot

First commit addressing Trac ticket #5570. Functions defined in libswscale/input.c
have corresponding SIMD definitions in libswscale/ppc/input_vsx.c
---
 libswscale/ppc/Makefile   |1 +
 libswscale/ppc/input_vsx.c| 1070 +
 libswscale/swscale.c  |3 +
 libswscale/swscale_internal.h |1 +
 4 files changed, 1075 insertions(+)
 create mode 100644 libswscale/ppc/input_vsx.c

diff --git a/libswscale/ppc/Makefile b/libswscale/ppc/Makefile
index d1b596e..2482893 100644
--- a/libswscale/ppc/Makefile
+++ b/libswscale/ppc/Makefile
@@ -1,3 +1,4 @@
 OBJS += ppc/swscale_altivec.o   \
+ppc/input_vsx.o \
 ppc/yuv2rgb_altivec.o   \
 ppc/yuv2yuv_altivec.o   \
diff --git a/libswscale/ppc/input_vsx.c b/libswscale/ppc/input_vsx.c
new file mode 100644
index 000..adb0e38
--- /dev/null
+++ b/libswscale/ppc/input_vsx.c
@@ -0,0 +1,1070 @@
+/*
+ * Copyright (C) 2016 Dan Parrot 
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include 
+#include 
+#include 
+#include 
+
+#include "libavutil/avutil.h"
+#include "libavutil/bswap.h"
+#include "libavutil/cpu.h"
+#include "libavutil/intreadwrite.h"
+#include "libavutil/mathematics.h"
+#include "libavutil/pixdesc.h"
+#include "libavutil/avassert.h"
+#include "config.h"
+#include "libswscale/rgb2rgb.h"
+#include "libswscale/swscale.h"
+#include "libswscale/swscale_internal.h"
+
+#define input_pixel(pos) (isBE(origin) ? AV_RB16(pos) : AV_RL16(pos))
+
+#define r ((origin == AV_PIX_FMT_BGR48BE || origin == AV_PIX_FMT_BGR48LE || origin == AV_PIX_FMT_BGRA64BE || origin == AV_PIX_FMT_BGRA64LE) ? b_r : r_b)
+#define b ((origin == AV_PIX_FMT_BGR48BE || origin == AV_PIX_FMT_BGR48LE || origin == AV_PIX_FMT_BGRA64BE || origin == AV_PIX_FMT_BGRA64LE) ? r_b : b_r)
+
+#if HAVE_VSX
+
+// This is a SIMD version for IBM POWER8 of function rgb64ToY_c_template
+// in file libswscale/input.c
+static av_always_inline void
+rgb64ToY_c_template_vsx(uint16_t *dst, const uint16_t *src, int width,
+enum AVPixelFormat origin, int32_t *rgb2yuv)
+{
+int32_t ry = rgb2yuv[RY_IDX], gy = rgb2yuv[GY_IDX], by = rgb2yuv[BY_IDX];
+int i, j;
+int num_vec, frag;
+
+num_vec = width / 8;
+frag= width % 8;
+
+vector int v_ry = vec_splats((int)ry);
+vector int v_gy = vec_splats((int)gy);
+vector int v_by = vec_splats((int)by);
+
+int s_opr2;
+s_opr2 = (int)(0x2001 << (RGB2YUV_SHIFT-1));
+
+vector int v_opr1 = vec_splats((int)RGB2YUV_SHIFT);
+vector int v_opr2 = vec_splats((int)s_opr2);
+
+vector int v_r, v_g, v_b, v_tmp;
+vector short v_tmpi, v_dst;
+
+for (i = 0; i < num_vec; i++) {
+for (j = 7; j >= 0  ; j--) {
+int r_b = input_pixel(&src[(i*8+j)*4+0]);
+int g   = input_pixel(&src[(i*8+j)*4+1]);
+int b_r = input_pixel(&src[(i*8+j)*4+2]);
+
+v_r[j % 4] = r;
+v_g[j % 4] = g;
+v_b[j % 4] = b;
+
+if (!(j % 4)) {
+v_tmp = v_ry * v_r;
+v_tmp = v_tmp + v_gy * v_g;
+v_tmp = v_tmp + v_by * v_b;
+v_tmp = v_tmp + v_opr2;
+v_tmp = vec_sr(v_tmp, (vector unsigned int)v_opr1);
+
+v_tmpi  = (vector short)v_tmp;
+v_dst[(j / 4) * 4 + 3]  = v_tmpi[6];
+v_dst[(j / 4) * 4 + 2]  = v_tmpi[4];
+v_dst[(j / 4) * 4 + 1]  = v_tmpi[2];
+v_dst[(j / 4) * 4 + 0]  = v_tmpi[0];
+}
+}
+vec_vsx_st(v_dst, 0, (short *)&dst[i*8]);
+}
+
+// computation for any less than vector-length items at tail end
+if( frag ) {
+for (i = 0; i < frag; i++) {
+unsigned int r_b = input_pixel(&src[(num_vec*8+i)*4+0]);
+unsigned int   g = input_pixel(&src[(num_vec*8+i)*4+1]);
+unsigned int b_r = input_pixel(&src[(num_vec*8+i)*4+2]);
+
+
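
The overall shape of the function above (whole blocks of 8 pixels, then a
tail of width % 8 pixels) can be sketched in portable C. This is not the VSX
code: it assumes native-endian input, skips the r/b swap, and uses a 64-bit
intermediate to keep the sketch free of overflow questions. The function
names and the coef array are invented for illustration.

```c
#include <stdint.h>

#define RGB2YUV_SHIFT 15
#define BLOCK 8   /* pixels per outer iteration, as in the patch */

/* Luma for one pixel, following the formula used in the patch. */
static inline uint16_t luma(const int32_t coef[3], int r, int g, int b)
{
    int64_t acc = (int64_t)coef[0] * r + (int64_t)coef[1] * g +
                  (int64_t)coef[2] * b + (0x2001 << (RGB2YUV_SHIFT - 1));
    return (uint16_t)(acc >> RGB2YUV_SHIFT);
}

/* Straightforward reference: one pixel (4 x uint16_t) per iteration. */
void rgba64_to_y_ref(uint16_t *dst, const uint16_t *src, int width,
                     const int32_t coef[3])
{
    for (int i = 0; i < width; i++)
        dst[i] = luma(coef, src[i * 4 + 0], src[i * 4 + 1], src[i * 4 + 2]);
}

/* Same result, split into full blocks plus a tail, mirroring num_vec/frag. */
void rgba64_to_y_blocked(uint16_t *dst, const uint16_t *src, int width,
                         const int32_t coef[3])
{
    int num_vec = width / BLOCK;
    int frag    = width % BLOCK;

    for (int i = 0; i < num_vec; i++) {
        /* A real SIMD version would handle these BLOCK pixels together. */
        for (int j = 0; j < BLOCK; j++) {
            int p = i * BLOCK + j;
            dst[p] = luma(coef, src[p * 4 + 0], src[p * 4 + 1], src[p * 4 + 2]);
        }
    }

    /* Tail: fewer than BLOCK pixels remain. */
    for (int i = 0; i < frag; i++) {
        int p = num_vec * BLOCK + i;
        dst[p] = luma(coef, src[p * 4 + 0], src[p * 4 + 1], src[p * 4 + 2]);
    }
}
```

Comparing the blocked version against the reference for widths that are not
multiples of 8 is a cheap way to check that the tail handling is right.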

Re: [FFmpeg-devel] [PATCH] PPC64: Add IBM POWER8 SIMD Implementation

2016-06-20 Thread Dan Parrot

Could a PPC maintainer verify that this patch integrates cleanly into
the design? I would like to proceed with the remaining changes to close
out ticket #5570 but since the first patch was rejected, I am unsure on
whether I'll have to rewrite the code.

Thanks.
Dan.

On Sun, 2016-06-19 at 21:57 +0000, Dan Parrot wrote:
> [...]

Re: [FFmpeg-devel] [PATCH] PPC64: Add IBM POWER8 SIMD Implementation

2016-06-20 Thread Dan Parrot

On Tue, 2016-06-21 at 00:45 +0200, Michael Niedermayer wrote:
> On Sun, Jun 19, 2016 at 09:57:42PM +0000, Dan Parrot wrote:
> > [...]

Re: [FFmpeg-devel] [PATCH] PPC64: Add IBM POWER8 SIMD Implementation

2016-06-20 Thread Dan Parrot

On Tue, 2016-06-21 at 01:06 +0200, Michael Niedermayer wrote:
> On Mon, Jun 20, 2016 at 05:55:47PM -0500, Dan Parrot wrote:
> > On Tue, 2016-06-21 at 00:45 +0200, Michael Niedermayer wrote:
> > > On Sun, Jun 19, 2016 at 09:57:42PM +0000, Dan Parrot wrote:
> > > > [...]
> > > > +#include "libavutil/bswap.h"
> > > > +#include "libavutil/cpu.h"
> > > > +#include "libavutil/intreadwrite.h"
> > > > +#include "libavutil/mathematics.h"
> > > > +#include "libavutil/pixdesc.h"
> > > > +#include "libavutil/avassert.h"
> > > > +#include "config.h"
> > > > +#include "libswscale/rgb2rgb.h"
> > > > +#include "libswscale/swscale.h"
> > > > +#include "libswscale/swscale_internal.h"
> > > > +
> > > > +#define input_pixel(pos) (isBE(origin) ? AV_RB16(pos) : AV_RL16(pos))
> > > > +
> > > > +#define r ((origin == AV_PIX_FMT_BGR48BE || origin == 
> > > > AV_PIX_FMT_BGR48LE || origin == AV_PIX_FMT_BGRA64BE || origin == 
> > > > AV_PIX_FMT_BGRA64LE) ? b_r : r_b)
> > > > +#define b ((origin == AV_PIX_FMT_BGR48BE || origin == 
> > > > AV_PIX_FMT_BGR48LE || origin == AV_PIX_FMT_BGRA64BE || origin == 
> > > > AV_PIX_FMT_BGRA64LE) ? r_b : b_r)
> > > > +
> > > > +#if HAVE_VSX
> > > > +
> > > > +// This is a SIMD version for IBM POWER8 of function 
> > > > rgb64ToY_c_template
> > > > +// in file libswscale/input.c
> > > > +static av_always_inline void
> > > > +rgb64ToY_c_template_vsx(uint16_t *dst, const uint16_t *s

Re: [FFmpeg-devel] [PATCH] PPC64: Add IBM POWER8 SIMD Implementation

2016-06-20 Thread Dan Parrot

On Tue, 2016-06-21 at 02:22 +0200, Michael Niedermayer wrote:
> On Mon, Jun 20, 2016 at 06:38:18PM -0500, Dan Parrot wrote:
> > On Tue, 2016-06-21 at 01:06 +0200, Michael Niedermayer wrote:
> > > On Mon, Jun 20, 2016 at 05:55:47PM -0500, Dan Parrot wrote:
> > > > On Tue, 2016-06-21 at 00:45 +0200, Michael Niedermayer wrote:
> > > > > On Sun, Jun 19, 2016 at 09:57:42PM +, Dan Parrot wrote:
> > > > > > First commit addressing Trac ticket #5570. Functions defined in 
> > > > > > libswscale/input.c
> > > > > > have corresponding SIMD definitions in libswscale/ppc/input_vsx.c
> > > > > > ---
> > > > > >  libswscale/ppc/Makefile   |1 +
> > > > > >  libswscale/ppc/input_vsx.c| 1070 
> > > > > > +
> > > > > >  libswscale/swscale.c  |3 +
> > > > > >  libswscale/swscale_internal.h |1 +
> > > > > >  4 files changed, 1075 insertions(+)
> > > > > >  create mode 100644 libswscale/ppc/input_vsx.c
> > > > > > 
> > > > > > diff --git a/libswscale/ppc/Makefile b/libswscale/ppc/Makefile
> > > > > > index d1b596e..2482893 100644
> > > > > > --- a/libswscale/ppc/Makefile
> > > > > > +++ b/libswscale/ppc/Makefile
> > > > > > @@ -1,3 +1,4 @@
> > > > > >  OBJS += ppc/swscale_altivec.o  
> > > > > >  \
> > > > > > +ppc/input_vsx.o
> > > > > >  \
> > > > > >  ppc/yuv2rgb_altivec.o          
> > > > > >  \
> > > > > >  ppc/yuv2yuv_altivec.o  
> > > > > >  \
> > > > > > diff --git a/libswscale/ppc/input_vsx.c b/libswscale/ppc/input_vsx.c
> > > > > > new file mode 100644
> > > > > > index 000..adb0e38
> > > > > > --- /dev/null
> > > > > > +++ b/libswscale/ppc/input_vsx.c
> > > > > > @@ -0,0 +1,1070 @@
> > > > > > +/*
> > > > > > + * Copyright (C) 2016 Dan Parrot 
> > > > > > + *
> > > > > > + * This file is part of FFmpeg.
> > > > > > + *
> > > > > > + * FFmpeg is free software; you can redistribute it and/or
> > > > > > + * modify it under the terms of the GNU Lesser General Public
> > > > > > + * License as published by the Free Software Foundation; either
> > > > > > + * version 2.1 of the License, or (at your option) any later 
> > > > > > version.
> > > > > > + *
> > > > > > + * FFmpeg is distributed in the hope that it will be useful,
> > > > > > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > > > > > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the 
> > > > > > GNU
> > > > > > + * Lesser General Public License for more details.
> > > > > > + *
> > > > > > + * You should have received a copy of the GNU Lesser General Public
> > > > > > + * License along with FFmpeg; if not, write to the Free Software
> > > > > > + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 
> > > > > > 02110-1301 USA
> > > > > > + */
> > > > > > +
> > > > > > +#include 
> > > > > > +#include 
> > > > > > +#include 
> > > > > > +#include 
> > > > > > +
> > > > > > +#include "libavutil/avutil.h"
> > > > > > +#include "libavutil/bswap.h"
> > > > > > +#include "libavutil/cpu.h"
> > > > > > +#include "libavutil/intreadwrite.h"
> > > > > > +#include "libavutil/mathematics.h"
> > > > > > +#include "libavutil/pixdesc.h"
> > > > > > +#include "libavutil/avassert.h"
> > > > > > +#include "config.h"
> > > > > > +#include "libswscale/rgb2rgb.h"
> > > > > > +#include "libswscale/swscale.h"
> > > > > > +#include "libswscale/swscale_internal.h"
> > > >

Re: [FFmpeg-devel] [PATCH] PPC64: Add IBM POWER8 SIMD Implementation

2016-06-21 Thread Dan Parrot

On Tue, 2016-06-21 at 00:04 -0500, Dan Parrot wrote:
> On Tue, 2016-06-21 at 02:22 +0200, Michael Niedermayer wrote:
> > On Mon, Jun 20, 2016 at 06:38:18PM -0500, Dan Parrot wrote:
> > > On Tue, 2016-06-21 at 01:06 +0200, Michael Niedermayer wrote:
> > > > On Mon, Jun 20, 2016 at 05:55:47PM -0500, Dan Parrot wrote:
> > > > > On Tue, 2016-06-21 at 00:45 +0200, Michael Niedermayer wrote:
> > > > > > On Sun, Jun 19, 2016 at 09:57:42PM +, Dan Parrot wrote:
> > > > > > > First commit addressing Trac ticket #5570. Functions defined in 
> > > > > > > libswscale/input.c
> > > > > > > have corresponding SIMD definitions in libswscale/ppc/input_vsx.c
> > > > > > > ---
> > > > > > >  libswscale/ppc/Makefile   |1 +
> > > > > > >  libswscale/ppc/input_vsx.c| 1070 
> > > > > > > +
> > > > > > >  libswscale/swscale.c  |3 +
> > > > > > >  libswscale/swscale_internal.h |1 +
> > > > > > >  4 files changed, 1075 insertions(+)
> > > > > > >  create mode 100644 libswscale/ppc/input_vsx.c
> > > > > > > 
> > > > > > > diff --git a/libswscale/ppc/Makefile b/libswscale/ppc/Makefile
> > > > > > > index d1b596e..2482893 100644
> > > > > > > --- a/libswscale/ppc/Makefile
> > > > > > > +++ b/libswscale/ppc/Makefile
> > > > > > > @@ -1,3 +1,4 @@
> > > > > > >  OBJS += ppc/swscale_altivec.o
> > > > > > >\
> > > > > > > +ppc/input_vsx.o  
> > > > > > >\
> > > > > > >          ppc/yuv2rgb_altivec.o
> > > > > > >\
> > > > > > >  ppc/yuv2yuv_altivec.o
> > > > > > >\
> > > > > > > diff --git a/libswscale/ppc/input_vsx.c 
> > > > > > > b/libswscale/ppc/input_vsx.c
> > > > > > > new file mode 100644
> > > > > > > index 000..adb0e38
> > > > > > > --- /dev/null
> > > > > > > +++ b/libswscale/ppc/input_vsx.c
> > > > > > > @@ -0,0 +1,1070 @@
> > > > > > > +/*
> > > > > > > + * Copyright (C) 2016 Dan Parrot 
> > > > > > > + *
> > > > > > > + * This file is part of FFmpeg.
> > > > > > > + *
> > > > > > > + * FFmpeg is free software; you can redistribute it and/or
> > > > > > > + * modify it under the terms of the GNU Lesser General Public
> > > > > > > + * License as published by the Free Software Foundation; either
> > > > > > > + * version 2.1 of the License, or (at your option) any later 
> > > > > > > version.
> > > > > > > + *
> > > > > > > + * FFmpeg is distributed in the hope that it will be useful,
> > > > > > > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > > > > > > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the 
> > > > > > > GNU
> > > > > > > + * Lesser General Public License for more details.
> > > > > > > + *
> > > > > > > + * You should have received a copy of the GNU Lesser General 
> > > > > > > Public
> > > > > > > + * License along with FFmpeg; if not, write to the Free Software
> > > > > > > + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 
> > > > > > > 02110-1301 USA
> > > > > > > + */
> > > > > > > +
> > > > > > > +#include 
> > > > > > > +#include 
> > > > > > > +#include 
> > > > > > > +#include 
> > > > > > > +
> > > > > > > +#include "libavutil/avutil.h"
> > > > > > > +#include "libavutil/bswap.h"
> > > > > > > +#include "libavutil/cpu.h"
> > > > > > > +#include "libavutil/intreadwrite.h"
> > > > > >

[FFmpeg-devel] PPC64: PowerPC Maintainer information is incorrect

2016-06-22 Thread Dan Parrot

The MAINTAINERS file lists Luca Barbato for Linux/PowerPC. You can see
from his response below how he feels about that.

-------- Forwarded Message --------
> From: Luca Barbato 
> To: Dan Parrot 
> Subject: Re: [FFmpeg-devel] [PATCH] PPC64: Add IBM POWER8 SIMD
> Implementation
> Date: Tue, 21 Jun 2016 00:56:20 +0200
> 
> On 21/06/16 00:42, Dan Parrot wrote:
> > You are listed as the responsible party in ffmpeg MAINTAINERS. Could you
> > please indicate whether or not the patch is acceptable?
> > 
> 
> I have no idea why FFmpeg people keep that, I contribute to Libav[1].
> 
> You are more than welcome to send the patches and to contribute to it if
> you like =)
> 
> lu
> 
> [1]: http://libav.org
> 



Re: [FFmpeg-devel] [PATCH] PPC64: Add IBM POWER8 SIMD Implementation

2016-06-22 Thread Dan Parrot

Could I get a yes or no answer on whether the patch will be applied?


Re: [FFmpeg-devel] [PATCH] PPC64: Add IBM POWER8 SIMD Implementation

2016-06-22 Thread Dan Parrot

On Wed, 2016-06-22 at 21:02 +, Carl Eugen Hoyos wrote:
> Dan Parrot  mail.com> writes:
> 
> > Could I get a yes or no answer on whether the patch will be applied?
> 
> Please comment on my email: time make fate can be used to show 
> large performance changes (although it isn't optimal), I don't 
> think it can show the difference between a division and a shift.

I was not trying to show the difference in relative cost between a
division and a shift using "time make fate". What I was trying to show
is that ffmpeg compiled with code that uses the division versus ffmpeg
compiled with code that uses shifts resulted in running times for fate
that were essentially equal.
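To make the division-versus-shift point concrete, here is a small sketch showing that the two ways of splitting a row width into whole 8-element vectors plus a remainder give identical results for non-negative widths. The function name forms_agree is hypothetical. The sketch says nothing about relative speed; an optimizing compiler may well emit the same instructions for both forms.

```c
#include <assert.h>

/* Returns 1 if, for every width in [0, limit], the division/modulo form
 * and the shift form compute the same vector count and the same remainder.
 * The equivalence is only claimed for non-negative values. */
static int forms_agree(int limit)
{
    int w;

    assert(limit >= 0);
    for (w = 0; w <= limit; w++) {
        int num_vec_div   = w / 8;               /* division form      */
        int frag_div      = w % 8;
        int num_vec_shift = w >> 3;              /* shift form         */
        int frag_shift    = w - ((w >> 3) << 3);

        if (num_vec_div != num_vec_shift || frag_div != frag_shift)
            return 0;
    }
    return 1;
}
```

For negative operands the two forms can differ (integer division truncates toward zero, while right-shifting a negative int is implementation-defined), which is why the check is restricted to widths of zero or more.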

Now, if you want me to demonstrate the relative speeds of the division
and shift, I can do that. I admit to not seeing the point in the
exercise, but if that is what is holding up patch acceptance, I will do
it. And yes, the "time" utility is imperfect, but so is any technique
which does not turn off all hardware interrupts in order to guarantee
that the code being timed runs to completion without ever being paused.

The goal of Trac ticket #5570, as best I understand it, is to use SIMD
hardware on PPC64 to reduce execution time of the functions in
libswscale/input.c. Right? So, the comparison we should be talking about
is whether ffmpeg currently in the repository is faster or slower than
ffmpeg compiled using the patch. Timing the fate suite when run with the
two different versions of ffmpeg should be an acceptable indicator of
which is faster.

If I am mistaken in any of the above, I am always willing to learn.


Re: [FFmpeg-devel] [PATCH] PPC64: Add IBM POWER8 SIMD Implementation

2016-06-22 Thread Dan Parrot

On Wed, 2016-06-22 at 22:36 +, Carl Eugen Hoyos wrote:
> Dan Parrot  mail.com> writes:
> 
> [...]
> 
> Did you already test the TIMER macros?

No, I did not test with the TIMER macros. I don't see what that has to do
with Trac ticket #5570.

> I don't know if they work on ppc64le but we won't know 
> until you tested them.
> 
> Thank you, Carl Eugen
> 




Re: [FFmpeg-devel] [PATCH] PPC64: Add IBM POWER8 SIMD Implementation

2016-06-22 Thread Dan Parrot

On Thu, 2016-06-23 at 01:03 +0200, Michael Niedermayer wrote:
> On Tue, Jun 21, 2016 at 12:04:42AM -0500, Dan Parrot wrote:
> > On Tue, 2016-06-21 at 02:22 +0200, Michael Niedermayer wrote:
> > > On Mon, Jun 20, 2016 at 06:38:18PM -0500, Dan Parrot wrote:
> > > > On Tue, 2016-06-21 at 01:06 +0200, Michael Niedermayer wrote:
> > > > > On Mon, Jun 20, 2016 at 05:55:47PM -0500, Dan Parrot wrote:
> > > > > > On Tue, 2016-06-21 at 00:45 +0200, Michael Niedermayer wrote:
> > > > > > > On Sun, Jun 19, 2016 at 09:57:42PM +, Dan Parrot wrote:
> > > > > > > > First commit addressing Trac ticket #5570. Functions defined in 
> > > > > > > > libswscale/input.c
> > > > > > > > have corresponding SIMD definitions in 
> > > > > > > > libswscale/ppc/input_vsx.c
> > > > > > > > ---
> > > > > > > >  libswscale/ppc/Makefile   |1 +
> > > > > > > >  libswscale/ppc/input_vsx.c| 1070 
> > > > > > > > +
> > > > > > > >  libswscale/swscale.c  |3 +
> > > > > > > >  libswscale/swscale_internal.h |1 +
> > > > > > > >  4 files changed, 1075 insertions(+)
> > > > > > > >  create mode 100644 libswscale/ppc/input_vsx.c
> > > > > > > > 
> > > > > > > > diff --git a/libswscale/ppc/Makefile b/libswscale/ppc/Makefile
> > > > > > > > index d1b596e..2482893 100644
> > > > > > > > --- a/libswscale/ppc/Makefile
> > > > > > > > +++ b/libswscale/ppc/Makefile
> > > > > > > > @@ -1,3 +1,4 @@
> > > > > > > >  OBJS += ppc/swscale_altivec.o  
> > > > > > > >  \
> > > > > > > > +ppc/input_vsx.o
> > > > > > > >  \
> > > > > > > >  ppc/yuv2rgb_altivec.o  
> > > > > > > >  \
> > > > > > > >  ppc/yuv2yuv_altivec.o  
> > > > > > > >  \
> > > > > > > > diff --git a/libswscale/ppc/input_vsx.c 
> > > > > > > > b/libswscale/ppc/input_vsx.c
> > > > > > > > new file mode 100644
> > > > > > > > index 000..adb0e38
> > > > > > > > --- /dev/null
> > > > > > > > +++ b/libswscale/ppc/input_vsx.c
> > > > > > > > @@ -0,0 +1,1070 @@
> > > > > > > > +/*
> > > > > > > > + * Copyright (C) 2016 Dan Parrot 
> > > > > > > > + *
> > > > > > > > + * This file is part of FFmpeg.
> > > > > > > > + *
> > > > > > > > + * FFmpeg is free software; you can redistribute it and/or
> > > > > > > > + * modify it under the terms of the GNU Lesser General Public
> > > > > > > > + * License as published by the Free Software Foundation; either
> > > > > > > > + * version 2.1 of the License, or (at your option) any later 
> > > > > > > > version.
> > > > > > > > + *
> > > > > > > > + * FFmpeg is distributed in the hope that it will be useful,
> > > > > > > > + * but WITHOUT ANY WARRANTY; without even the implied warranty 
> > > > > > > > of
> > > > > > > > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See 
> > > > > > > > the GNU
> > > > > > > > + * Lesser General Public License for more details.
> > > > > > > > + *
> > > > > > > > + * You should have received a copy of the GNU Lesser General 
> > > > > > > > Public
> > > > > > > > + * License along with FFmpeg; if not, write to the Free 
> > > > > > > > Software
> > > > > > > > + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, 
> > > > > > > > MA 02110-1301 USA
> > > > > > > > + */
> > > > > > > > +
> > > > > > > > +#include 
> > > > >

Re: [FFmpeg-devel] [PATCH] PPC64: Add IBM POWER8 SIMD Implementation

2016-06-28 Thread Dan Parrot

On Wed, 2016-06-22 at 20:33 -0300, James Almer wrote:
> On 6/22/2016 8:15 PM, Dan Parrot wrote:
> > On Thu, 2016-06-23 at 01:03 +0200, Michael Niedermayer wrote:
> >> On Tue, Jun 21, 2016 at 12:04:42AM -0500, Dan Parrot wrote:
> >>> On Tue, 2016-06-21 at 02:22 +0200, Michael Niedermayer wrote:
> >>>> On Mon, Jun 20, 2016 at 06:38:18PM -0500, Dan Parrot wrote:
> >>>>> On Tue, 2016-06-21 at 01:06 +0200, Michael Niedermayer wrote:
> >>>>>> On Mon, Jun 20, 2016 at 05:55:47PM -0500, Dan Parrot wrote:
> >>>>>>> On Tue, 2016-06-21 at 00:45 +0200, Michael Niedermayer wrote:
> >>>>>>>> On Sun, Jun 19, 2016 at 09:57:42PM +, Dan Parrot wrote:
> >>>>>>>>> First commit addressing Trac ticket #5570. Functions defined in 
> >>>>>>>>> libswscale/input.c
> >>>>>>>>> have corresponding SIMD definitions in libswscale/ppc/input_vsx.c
> >>>>>>>>> ---
> >>>>>>>>>  libswscale/ppc/Makefile   |1 +
> >>>>>>>>>  libswscale/ppc/input_vsx.c| 1070 
> >>>>>>>>> +
> >>>>>>>>>  libswscale/swscale.c  |3 +
> >>>>>>>>>  libswscale/swscale_internal.h |1 +
> >>>>>>>>>  4 files changed, 1075 insertions(+)
> >>>>>>>>>  create mode 100644 libswscale/ppc/input_vsx.c
> >>>>>>>>>
> >>>>>>>>> diff --git a/libswscale/ppc/Makefile b/libswscale/ppc/Makefile
> >>>>>>>>> index d1b596e..2482893 100644
> >>>>>>>>> --- a/libswscale/ppc/Makefile
> >>>>>>>>> +++ b/libswscale/ppc/Makefile
> >>>>>>>>> @@ -1,3 +1,4 @@
> >>>>>>>>>  OBJS += ppc/swscale_altivec.o  
> >>>>>>>>>  \
> >>>>>>>>> +ppc/input_vsx.o
> >>>>>>>>>  \
> >>>>>>>>>  ppc/yuv2rgb_altivec.o  
> >>>>>>>>>  \
> >>>>>>>>>  ppc/yuv2yuv_altivec.o  
> >>>>>>>>>  \
> >>>>>>>>> diff --git a/libswscale/ppc/input_vsx.c b/libswscale/ppc/input_vsx.c
> >>>>>>>>> new file mode 100644
> >>>>>>>>> index 000..adb0e38
> >>>>>>>>> --- /dev/null
> >>>>>>>>> +++ b/libswscale/ppc/input_vsx.c
> >>>>>>>>> @@ -0,0 +1,1070 @@
> >>>>>>>>> +/*
> >>>>>>>>> + * Copyright (C) 2016 Dan Parrot 
> >>>>>>>>> + *
> >>>>>>>>> + * This file is part of FFmpeg.
> >>>>>>>>> + *
> >>>>>>>>> + * FFmpeg is free software; you can redistribute it and/or
> >>>>>>>>> + * modify it under the terms of the GNU Lesser General Public
> >>>>>>>>> + * License as published by the Free Software Foundation; either
> >>>>>>>>> + * version 2.1 of the License, or (at your option) any later 
> >>>>>>>>> version.
> >>>>>>>>> + *
> >>>>>>>>> + * FFmpeg is distributed in the hope that it will be useful,
> >>>>>>>>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> >>>>>>>>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the 
> >>>>>>>>> GNU
> >>>>>>>>> + * Lesser General Public License for more details.
> >>>>>>>>> + *
> >>>>>>>>> + * You should have received a copy of the GNU Lesser General Public
> >>>>>>>>> + * License along with FFmpeg; if not, write to the Free Software
> >>>>>>>>> + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 
> >>>>>>>>> 02110-1301 USA
> >>>>>>>>> + */
> >>>>>>>>> +
> >>>>>>>

[FFmpeg-devel] [PATCH] PPC64: Add versions of functions in libswscale/input.c optimized for POWER8 VSX SIMD.

2016-06-29 Thread Dan Parrot

This patch addresses Trac ticket #5570. The optimized functions are in file
libswscale/ppc/input_vsx.c. Each optimized function name is a concatenation of
the corresponding name in libswscale/input.c with suffix _vsx.
---
 libswscale/ppc/Makefile   |   1 +
 libswscale/ppc/input_vsx.c| 437 ++
 libswscale/swscale.c  |   3 +
 libswscale/swscale_internal.h |   1 +
 4 files changed, 442 insertions(+)
 create mode 100644 libswscale/ppc/input_vsx.c

diff --git a/libswscale/ppc/Makefile b/libswscale/ppc/Makefile
index d1b596e..2482893 100644
--- a/libswscale/ppc/Makefile
+++ b/libswscale/ppc/Makefile
@@ -1,3 +1,4 @@
 OBJS += ppc/swscale_altivec.o   \
+ppc/input_vsx.o \
 ppc/yuv2rgb_altivec.o   \
 ppc/yuv2yuv_altivec.o   \
diff --git a/libswscale/ppc/input_vsx.c b/libswscale/ppc/input_vsx.c
new file mode 100644
index 000..d977a32
--- /dev/null
+++ b/libswscale/ppc/input_vsx.c
@@ -0,0 +1,437 @@
+/*
+ * Copyright (C) 2016 Dan Parrot 
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include 
+#include 
+#include 
+#include 
+
+#include "libavutil/avutil.h"
+#include "libavutil/bswap.h"
+#include "libavutil/cpu.h"
+#include "libavutil/intreadwrite.h"
+#include "libavutil/mathematics.h"
+#include "libavutil/pixdesc.h"
+#include "libavutil/avassert.h"
+#include "config.h"
+#include "libswscale/rgb2rgb.h"
+#include "libswscale/swscale.h"
+#include "libswscale/swscale_internal.h"
+
+#if HAVE_VSX
+
+static void abgrToA_c_vsx(uint8_t *_dst, const uint8_t *src, const uint8_t *unused1, const uint8_t *unused2,
+  int width, uint32_t *unused)
+{
+int16_t *dst = (int16_t *)_dst;
+int i, width_adj, frag_len;
+
+uintptr_t src_addr = (uintptr_t)src;
+uintptr_t dst_addr = (uintptr_t)dst;
+
+// compute integral number of vector-length items and length of final fragment
+width_adj = width >> 3;
+width_adj = width_adj << 3;
+frag_len = width - width_adj;
+
+for ( i = 0; i < width_adj; i += 8) {
+vector int v_rd0 = vec_vsx_ld(0, (int *)src_addr);
+vector int v_rd1 = vec_vsx_ld(0, (int *)(src_addr + 16));
+
+v_rd0 = vec_and(v_rd0, vec_splats(0x0ff));
+v_rd1 = vec_and(v_rd1, vec_splats(0x0ff));
+
+v_rd0 = vec_sl(v_rd0, vec_splats((unsigned)6));
+v_rd1 = vec_sl(v_rd1, vec_splats((unsigned)6));
+
+vector int v_dst = vec_perm(v_rd0, v_rd1, ((vector unsigned char)
+   {0, 1, 4, 5, 8, 9, 12, 13, 16, 17, 20, 21, 24, 25, 28, 29}));
+vec_vsx_st((vector unsigned char)v_dst, 0, (unsigned char *)dst_addr);
+
+src_addr += 32;
+dst_addr += 16;
+}
+
+for (i=width_adj; i< width_adj + frag_len; i++) {
+dst[i]= src[4*i]<<6;
+}
+}
+
+static void rgbaToA_c_vsx(uint8_t *_dst, const uint8_t *src, const uint8_t *unused1, const uint8_t *unused2,
+  int width, uint32_t *unused)
+{
+int16_t *dst = (int16_t *)_dst;
+int i, width_adj, frag_len;
+
+uintptr_t src_addr = (uintptr_t)src;
+uintptr_t dst_addr = (uintptr_t)dst;
+
+// compute integral number of vector-length items and length of final fragment
+width_adj = width >> 3;
+width_adj = width_adj << 3;
+frag_len = width - width_adj;
+
+for ( i = 0; i < width_adj; i += 8) {
+vector int v_rd0 = vec_vsx_ld(0, (int *)src_addr);
+vector int v_rd1 = vec_vsx_ld(0, (int *)(src_addr + 16));
+
+v_rd0 = vec_sld(v_rd0, v_rd0, 13);
+v_rd1 = vec_sld(v_rd1, v_rd1, 13);
+
+v_rd0 = vec_and(v_rd0, vec_splats(0x0ff));
+v_rd1 = vec_and(v_rd1, vec_splats(0x0ff));
+
+v_rd0 = vec_sl(v_rd0, vec_splats((unsigned)6));
+v_rd1 = vec_sl(v_rd1, vec_splats((unsigned)6));
+
+vector int v_dst = vec_perm(v_rd0, v_rd1, ((vector unsigned char)
+
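The structure shared by abgrToA_c_vsx and rgbaToA_c_vsx in the patch above (a main loop over whole groups of 8 pixels, then a scalar loop over the leftover fragment) can be sketched in portable C as follows. This is only an illustration of the loop split, with the hypothetical name abgr_to_a_blocked; the inner k loop stands in for the single vector load, mask, shift and permute sequence used in the patch, and no AltiVec/VSX intrinsics are used here.

```c
#include <assert.h>
#include <stdint.h>

/* Portable stand-in for abgrToA: the alpha byte is byte 0 of each 4-byte
 * pixel and is scaled by a left shift of 6, as in the scalar tail loop
 * of the patch. */
static void abgr_to_a_blocked(int16_t *dst, const uint8_t *src, int width)
{
    int i, k;
    int width_adj;

    assert(width >= 0);
    width_adj = (width >> 3) << 3;   /* largest multiple of 8 <= width */

    /* Main body: whole groups of 8 pixels. In the patch each group is
     * handled by one vector sequence instead of this inner loop. */
    for (i = 0; i < width_adj; i += 8)
        for (k = 0; k < 8; k++)
            dst[i + k] = (int16_t)(src[4 * (i + k)] << 6);

    /* Tail: the final fragment of fewer than 8 pixels, done one by one. */
    for (i = width_adj; i < width; i++)
        dst[i] = (int16_t)(src[4 * i] << 6);
}
```

Keeping the tail scalar means the vector loop never reads or writes past the end of a row whose width is not a multiple of 8.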

Re: [FFmpeg-devel] [PATCH] PPC64: Add versions of functions in libswscale/input.c optimized for POWER8 VSX SIMD.

2016-06-29 Thread Dan Parrot

Here are execution times of SIMD and non-SIMD functions. The times were
obtained using SystemTap probes at functions' entry and return points.
The dataset used was fate-filter-pixfmts-scale.

SIMD versions have suffix _vsx. All times are in nanoseconds.

function          calls    min    avg    max    total
yuy2ToY_c_vsx       864   1880   2014  29844  1740366
yuy2ToY_c           864   2326   2451  15950  2118226

yvy2ToUV_c_vsx      288   1891   1989  13644   573038
yvy2ToUV_c          288   2089   2131   2462   613813

rgbaToA_c_vsx      1152   1975   2123  31356  2446276
rgbaToA_c          1152   2368   2448  12496  2820401

uyvyToUV_c_vsx      288   1901   1932   2122   556697
uyvyToUV_c          288   2088   2129   2370   613202

uyvyToY_c_vsx       576   1877   1956  15821  1127222
uyvyToY_c           576   2325   2408  15332  1387168

nv12ToUV_c_vsx      144   1869   2006  15480   288867
nv12ToUV_c          144   2101   2273  19774   327432

abgrToA_c_vsx      1152   1949   2060  15496  2373206
abgrToA_c          1152   2374   2471  52452  2847044

yuy2ToUV_c_vsx      288   1873   1972  16608   568154
yuy2ToUV_c          288   2087   2123   2252   611621

nv21ToUV_c_vsx      144   1879   2019  14290   290860
nv21ToUV_c          144   2098   2233  14750   321692
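One way to read the figures above is as a relative saving per call. The helper below is a minimal sketch with a hypothetical name, percent_saved; it only restates the reported averages as a percentage and makes no attempt to correct for measurement overhead, which the reported times may include.

```c
#include <assert.h>

/* Percentage by which the _vsx average time is lower than the average
 * time of the plain C function. Both arguments are in nanoseconds. */
static double percent_saved(double c_avg_ns, double vsx_avg_ns)
{
    assert(c_avg_ns > 0.0);
    return 100.0 * (c_avg_ns - vsx_avg_ns) / c_avg_ns;
}
```

For example, percent_saved(2451.0, 2014.0) for yuy2ToY gives a saving of roughly 17.8 percent per call, and percent_saved(2129.0, 1932.0) for uyvyToUV gives roughly 9.3 percent.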



[FFmpeg-devel] [PATCH] PPC64: Add versions of functions in libswscale/input.c optimized for POWER8 VSX SIMD.

2016-07-03 Thread Dan Parrot

Finish providing SIMD versions for POWER8 VSX of functions in libswscale/input.c.
That should allow Trac ticket #5570 to be closed.
---
 libswscale/ppc/input_vsx.c | 1018 +++-
 1 file changed, 1014 insertions(+), 4 deletions(-)

diff --git a/libswscale/ppc/input_vsx.c b/libswscale/ppc/input_vsx.c
index d977a32..2c6f0ce 100644
--- a/libswscale/ppc/input_vsx.c
+++ b/libswscale/ppc/input_vsx.c
@@ -54,6 +54,7 @@ static void abgrToA_c_vsx(uint8_t *_dst, const uint8_t *src, const uint8_t *unus
 for ( i = 0; i < width_adj; i += 8) {
 vector int v_rd0 = vec_vsx_ld(0, (int *)src_addr);
 vector int v_rd1 = vec_vsx_ld(0, (int *)(src_addr + 16));
+vector int v_dst;
 
 v_rd0 = vec_and(v_rd0, vec_splats(0x0ff));
 v_rd1 = vec_and(v_rd1, vec_splats(0x0ff));
@@ -61,8 +62,8 @@ static void abgrToA_c_vsx(uint8_t *_dst, const uint8_t *src, const uint8_t *unus
 v_rd0 = vec_sl(v_rd0, vec_splats((unsigned)6));
 v_rd1 = vec_sl(v_rd1, vec_splats((unsigned)6));
 
-vector int v_dst = vec_perm(v_rd0, v_rd1, ((vector unsigned char)
-   {0, 1, 4, 5, 8, 9, 12, 13, 16, 17, 20, 21, 24, 25, 28, 29}));
+v_dst = vec_perm(v_rd0, v_rd1, ((vector unsigned char)
+{0, 1, 4, 5, 8, 9, 12, 13, 16, 17, 20, 21, 24, 25, 28, 29}));
 vec_vsx_st((vector unsigned char)v_dst, 0, (unsigned char *)dst_addr);
 
 src_addr += 32;
@@ -91,6 +92,7 @@ static void rgbaToA_c_vsx(uint8_t *_dst, const uint8_t *src, const uint8_t *unus
 for ( i = 0; i < width_adj; i += 8) {
 vector int v_rd0 = vec_vsx_ld(0, (int *)src_addr);
 vector int v_rd1 = vec_vsx_ld(0, (int *)(src_addr + 16));
+vector int v_dst;
 
 v_rd0 = vec_sld(v_rd0, v_rd0, 13);
 v_rd1 = vec_sld(v_rd1, v_rd1, 13);
@@ -101,8 +103,8 @@ static void rgbaToA_c_vsx(uint8_t *_dst, const uint8_t *src, const uint8_t *unus
 v_rd0 = vec_sl(v_rd0, vec_splats((unsigned)6));
 v_rd1 = vec_sl(v_rd1, vec_splats((unsigned)6));
 
-vector int v_dst = vec_perm(v_rd0, v_rd1, ((vector unsigned char)
-   {0, 1, 4, 5, 8, 9, 12, 13, 16, 17, 20, 21, 24, 25, 28, 29}));
+v_dst = vec_perm(v_rd0, v_rd1, ((vector unsigned char)
+{0, 1, 4, 5, 8, 9, 12, 13, 16, 17, 20, 21, 24, 25, 28, 29}));
 vec_vsx_st((vector unsigned char)v_dst, 0, (unsigned char *)dst_addr);
 
 src_addr += 32;
@@ -114,6 +116,175 @@ static void rgbaToA_c_vsx(uint8_t *_dst, const uint8_t *src, const uint8_t *unus
 }
 }
 
+static void monoblack2Y_c_vsx(uint8_t *_dst, const uint8_t *src, const uint8_t *unused1, const uint8_t *unused2,
+  int width, uint32_t *unused)
+{
+int16_t *dst = (int16_t *)_dst;
+int i, j, width_adj, frag_len;
+
+vector unsigned char v_rd;
+vector signed short v_din, v_d, v_dst;
+vector unsigned short   v_opr;
+
+uintptr_t src_addr = (uintptr_t)src;
+uintptr_t dst_addr = (uintptr_t)dst;
+
+width = (width + 7) >> 3;
+
+// compute integral number of vector-length items and length of final fragment
+width_adj = width >> 3;
+width_adj = width_adj << 3;
+frag_len = width - width_adj;
+
+v_opr = (vector unsigned short) {7, 6, 5, 4, 3, 2, 1, 0};
+
+for (i = 0; i < width_adj; i += 8) {
+if (i & 0x0f) {
+v_rd = vec_sld(v_rd, v_rd, 8);
+} else {
+v_rd = vec_vsx_ld(0, (unsigned char *)src_addr);
+src_addr += 16;
+}
+
+v_din = vec_unpackh((vector signed char)v_rd);
+v_din = vec_and(v_din, vec_splats((short)0x00ff));
+
+for (j = 0; j < 8; j++) {
+switch(j) {
+case 0:
+v_d = vec_splat(v_din, 0);
+break;
+case 1:
+v_d = vec_splat(v_din, 1);
+break;
+case 2:
+v_d = vec_splat(v_din, 2);
+break;
+case 3:
+v_d = vec_splat(v_din, 3);
+break;
+case 4:
+v_d = vec_splat(v_din, 4);
+break;
+case 5:
+v_d = vec_splat(v_din, 5);
+break;
+case 6:
+v_d = vec_splat(v_din, 6);
+break;
+case 7:
+v_d = vec_splat(v_din, 7);
+break;
+}
+
+v_dst = vec_sr(v_d, v_opr);
+v_dst = vec_and(v_dst, vec_splats((short)1));
+v_dst = v_dst * vec_splats((short)16383);
+
+vec_vsx_st(v_dst, 0, (short *)dst_addr);
+dst_addr += 16;
+}
+}
+
+for (i = width_adj; i < width_adj + frag_len; i++) {
+int d = src[i];
+for (j = 0; j < 8; j++)
+dst[8*i+j]= ((d>>(7-j))&1

Re: [FFmpeg-devel] [PATCH] PPC64: Add versions of functions in libswscale/input.c optimized for POWER8 VSX SIMD.

2016-07-04 Thread Dan Parrot

On Mon, 2016-07-04 at 06:22 +, Carl Eugen Hoyos wrote:
> Dan Parrot  mail.com> writes:
> 
> > Finish providing SIMD versions for POWER8 VSX of functions 
> > in libswscale/input.c
> > That should allow trac ticket #5570 to be closed.
> 
> Please add some numbers:
> Either for single functions or for a single ffmpeg command.
> (for rgb/bgr, mono is irrelevant)
> 
> Carl Eugen

The data below show the running times, for each of the functions,
obtained using SystemTap. The dataset used was the entire FATE
regression suite. Only the first  calls are used in obtaining the
data (for functions called more often). SIMD functions have suffix
"_vsx". The unit of time used is nanosecond.

---
name: abgrToA_c_vsx. 
no. of calls: 1408. min: 2772 ns. avg: 3106 ns. max: 44282 ns. total:
4373993 ns. 

name: abgrToA_c. 
no. of calls: 1408. min: 3088 ns. avg: 3385 ns. max: 24698 ns. total: 4766911 
ns.
---
name: bgr24ToUV_c_vsx. 
no. of calls: 288. min: 5213 ns. avg: 5452 ns. max: 26635 ns. total:
1570338 ns. 

name: bgr24ToUV_c. 
no. of calls: 288. min: 5351 ns. avg: 5636 ns. max: 27284 ns. total: 1623277 ns.
---
name: bgr24ToUV_half_c_vsx. 
no. of calls: . min: 4792 ns. avg: 4941 ns. max: 34340 ns. total:
49411622 ns. 

name: bgr24ToUV_half_c. 
no. of calls: . min: 4795 ns. avg: 6012 ns. max: 66135 ns. total: 60122454 
ns.
---
name: bgr24ToY_c_vsx. 
no. of calls: . min: 4475 ns. avg: 4654 ns. max: 28739 ns. total:
46539077 ns. 

name: bgr24ToY_c. 
no. of calls: . min: 4551 ns. avg: 5974 ns. max: 218357 ns. total: 59741865 
ns.
---
name: monoblack2Y_c_vsx. 
no. of calls: 288. min: 2902 ns. avg: 3102 ns. max: 25454 ns. total:
893490 ns. 

name: monoblack2Y_c. 
no. of calls: 288. min: 3011 ns. avg: 3203 ns. max: 26008 ns. total: 922515 ns.
---
name: monowhite2Y_c_vsx. 
no. of calls: . min: 2813 ns. avg: 3025 ns. max: 81510 ns. total:
30248113 ns. 

name: monowhite2Y_c. 
no. of calls: . min: 2692 ns. avg: 2891 ns. max: 43653 ns. total: 28911676 
ns.
---
name: nv12ToUV_c_vsx. 
no. of calls: 144. min: 2709 ns. avg: 2960 ns. max: 26249 ns. total:
426364 ns. 

name: nv12ToUV_c. 
no. of calls: 144. min: 2930 ns. avg: 3169 ns. max: 24483 ns. total: 456353 ns.
---
name: nv21ToUV_c_vsx. 
no. of calls: 144. min: 2707 ns. avg: 3001 ns. max: 26050 ns. total:
432150 ns. 

name: nv21ToUV_c. 
no. of calls: 144. min: 2887 ns. avg: 3141 ns. max: 24704 ns. total: 452426 ns.
---
name: planar_rgb_to_a_vsx. 
no. of calls: 288. min: 2977 ns. avg: 3223 ns. max: 24993 ns. total:
928305 ns. 

name: planar_rgb_to_a. 
no. of calls: 288. min: 3306 ns. avg: 3538 ns. max: 24350 ns. total: 1019154 ns.
---
name: planar_rgb_to_uv_vsx. 
no. of calls: 576. min: 5092 ns. avg: 5295 ns. max: 27170 ns. total:
3050431 ns. 

name: planar_rgb_to_uv. 
no. of calls: 576. min: 5605 ns. avg: 5864 ns. max: 26177 ns. total: 3377983 ns.
---
name: planar_rgb_to_y_vsx. 
no. of calls: 576. min: 4459 ns. avg: 4666 ns. max: 27760 ns. total:
2688039 ns. 

name: planar_rgb_to_y. 
no. of calls: 576. min: 4877 ns. avg: 5149 ns. max: 27879 ns. total: 2965982 ns.
---
name: rgb24ToUV_c_vsx. 
no. of calls: 688. min: 4090 ns. avg: 4791 ns. max: 25602 ns. total:
3296223 ns. 

name: rgb24ToUV_c. 
no. of calls: 688. min: 4077 ns. avg: 4891 ns. max: 26629 ns. total: 3365385 ns.
---
name: rgb24ToUV_half_c_vsx. 
no. of calls: . min: 4062 ns. avg: 5074 ns.

Re: [FFmpeg-devel] [PATCH] PPC64: Add versions of functions in libswscale/input.c optimized for POWER8 VSX SIMD.

2016-07-04 Thread Dan Parrot

On Mon, 2016-07-04 at 09:20 +, Carl Eugen Hoyos wrote:
> Dan Parrot  mail.com> writes:
> 
> > The dataset used was the entire FATE regression suite.
> 
> I don't think this is a particularly useful testcase:
> It takes very long but mostly tests other things.
> 
> Did you test if using ffmpeg -benchmark -f rawvideo -i /dev/zero... 
> showed different results?
> I believe this should be both easier and faster to test.
Sorry, I don't understand what that command line just above is trying to
achieve. Could you elaborate?

> > name: rgb24ToY_c_vsx. 
> > no. of calls: . min: 3832 ns. avg: 4709 ns. max: 37550 ns. 
> > total: 47093533 ns. 
> > 
> > name: rgb24ToY_c. 
> > no. of calls: . min: 3809 ns. avg: 4707 ns. max: 29041 ns. 
> > total: 47072923 ns.
> 
> Without any data, I would have thought that this is the most 
> important function (and "no. of calls" seems to confirm this).
> 
> Why is this not faster?
Surprisingly, gcc is producing some badly suboptimal assembly. I need to
follow up with IBM's Linux Technology Center. The major issue is that
multiplication of vector quantities in C is generating as many
multiplications in assembly as would scalar multiplication in a loop. No
way that should be occurring.

> Can you confirm with START_TIMER / STOP_TIMER that there is no 
> gain?
SystemTap probes provide identical functionality by measuring deltas
between function entry and function return.
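
For readers who have not used SystemTap, the measurement amounts to the following. This is only a minimal C sketch of the entry/return delta idea, not the SystemTap script used for the numbers in this thread. The names `probe_stats`, `now_ns`, `probe_record`, `probe_avg` and `probe_call` are hypothetical, and the C11 `timespec_get` call stands in for the probe timestamps.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical per-function statistics, mirroring the fields reported
 * in this thread: number of calls, min, max and total time. */
typedef struct {
    uint64_t calls, min_ns, max_ns, total_ns;
} probe_stats;

/* Current time in nanoseconds (C11 timespec_get, wall clock). */
static uint64_t now_ns(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Fold one entry-to-return delta into the running statistics. */
static void probe_record(probe_stats *s, uint64_t delta_ns)
{
    if (s->calls == 0 || delta_ns < s->min_ns)
        s->min_ns = delta_ns;
    if (delta_ns > s->max_ns)
        s->max_ns = delta_ns;
    s->total_ns += delta_ns;
    s->calls++;
}

/* Average time per call, or 0 if nothing was recorded. */
static uint64_t probe_avg(const probe_stats *s)
{
    return s->calls ? s->total_ns / s->calls : 0;
}

/* Time a single call of fn(arg) and record the delta. */
static void probe_call(probe_stats *s, void (*fn)(void *), void *arg)
{
    uint64_t t0;

    assert(s != NULL && fn != NULL);
    t0 = now_ns();
    fn(arg);
    probe_record(s, now_ns() - t0);
}
```

Wrapping each invocation in `probe_call(&stats, fn, arg)` accumulates the same calls/min/avg/max/total fields that appear in the per-function reports earlier in the thread.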



Re: [FFmpeg-devel] [PATCH] PPC64: Add versions of functions in libswscale/input.c optimized for POWER8 VSX SIMD.

2016-07-04 Thread Dan Parrot

On Mon, 2016-07-04 at 16:30 +, Carl Eugen Hoyos wrote:
> Dan Parrot  mail.com> writes:
> 
> > > Did you test if using ffmpeg -benchmark -f rawvideo -i /dev/zero... 
> > > showed different results?
> > > I believe this should be both easier and faster to test.
> >
> > Sorry, I don't understand what that command line just above 
> > is trying to achieve. Could you elaborate?
> 
> Instead of running the whole fate suite that takes long and 
> does not test libswscale for most commands, just test an 
> ffmpeg command line that only tests libswscale:
> $ ffmpeg -benchmark -f rawvideo -pix_fmt rgb24 
> -i /dev/zero -pix_fmt yuv420p -f null -vframes 1 -
> vs
> 
> $ ffmpeg -cpuflags 0 -benchmark -f rawvideo -pix_fmt rgb24 
> -i /dev/zero -pix_fmt yuv420p -f null -vframes 1 -
> 
Ok. Thanks for the explanation. I will run those commands and post the
reported results.

> [...]
> 
> > Surprisingly, gcc is producing some badly suboptimal assembly.
> 
> Just to make sure I don't misunderstand:
> Does this mean intrinsics are suboptimal to write assembly 
> code?
Here's what I mean: All variables below are of type "vector int"

1. v0 = v2 * v3
2. v0 = v4 * v5 + v6 * v7 + v8 * v9

The first statement produces 1 multiply, 1 multiply-sum and 1 addition
instruction in assembly.

The second produces 6 multiply, 6 multiply-sum, and 10 addition
instructions in assembly! I expected three times the counts from (1),
that is 3 multiply, 3 multiply-sum and 3 addition instructions, plus 2
additions to combine the three products.

> 
> > > Can you confirm with START_TIMER / STOP_TIMER that there is no 
> > > gain?
> >
> > SystemTap probes provide identical functionality by measuring 
> > deltas between function entry and function return.
> 
> Sorry, I don't understand:
> Did you test with both methods to verify that they provide 
> the same results?
> 
> Note that if it turns out that START_TIMER / STOP_TIMER 
> cannot be used on ppc64 (le) this would be important 
> information for us.
> 
I'll insert these macros and report the results if the code compiles
and runs.



Re: [FFmpeg-devel] [PATCH] PPC64: Add versions of functions in libswscale/input.c optimized for POWER8 VSX SIMD.

2016-07-04 Thread Dan Parrot


> > Just to make sure I don't misunderstand:
> > Does this mean intrinsics are suboptimal to write assembly 
> > code?
> Here's what I mean: All variables below are of type "vector int"
> 
> 1. v0 = v2 * v3
> 2. v0 = v4 * v5 + v6 * v7 + v8 * v9
> 
> The first statement produces 1 multiply, 1 multiply-sum and 1 addition
> instruction in assembly.
> 
> The second produces 6 multiply, 6 multiply-sum, and 10 addition
> instructions in assembly! I expected 3, 3, 3 of each respective
> operations from (1) plus 2 additions.

The operation counts given above were obtained using gcc 5.3.1 on
Fedora 22. I just created a simple test with those same statements and
compiled it using gcc 6.1.1 on Fedora 24. The assembly operation counts
are what I had initially expected, and are more reasonable.
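
A simple test of that kind could look like the sketch below. It is an assumption, not the actual test file: it uses GCC's generic `vector_size` extension instead of `altivec.h` so that it also compiles on non-POWER machines, and the type name `v4si` and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Four 32-bit lanes, standing in for the "vector int" type in the text. */
typedef int32_t v4si __attribute__((vector_size(16)));

/* Statement 1: a single element-wise vector multiply. */
v4si mul1(v4si v2, v4si v3)
{
    return v2 * v3;
}

/* Statement 2: three element-wise multiplies combined by two additions. */
v4si mul3_sum(v4si v4, v4si v5, v4si v6, v4si v7, v4si v8, v4si v9)
{
    return v4 * v5 + v6 * v7 + v8 * v9;
}

/* Small self-check so the file can be run as well as disassembled. */
void check_products(void)
{
    v4si a = {1, 2, 3, 4};
    v4si b = {5, 6, 7, 8};
    v4si r = mul3_sum(a, b, a, b, a, b);

    assert(r[0] == 15 && r[1] == 36 && r[2] == 63 && r[3] == 96);
}
```

Compiling with `gcc -O2 -S` (adding `-mcpu=power8` on a POWER8 machine) and counting the multiply and add instructions emitted for `mul1` and `mul3_sum` is one way to compare compiler versions.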

So, I'm going to move my ffmpeg development onto the Fedora 24 cloud
image and see if the SIMD performance there is better than it was on Fedora
22. The reason I'm moving to Fedora 24 instead of trying to upgrade gcc
on Fedora 22 is that I've learned to prefer standard pre-installed
images to the wrecks I've managed to create doing my own sysadmin on the
POWER8 cloud.



Re: [FFmpeg-devel] [PATCH] PPC64: Add versions of functions in libswscale/input.c optimized for POWER8 VSX SIMD.

2016-07-04 Thread Dan Parrot

On Mon, 2016-07-04 at 20:55 +0200, Hendrik Leppkes wrote:
> On Mon, Jul 4, 2016 at 5:20 PM, Dan Parrot  wrote:
> >> Why is this not faster?
> > Surprisingly, gcc is producing some badly suboptimal assembly. I need to
> > follow up with IBM's Linux Technology Center. The major issue is that
> > multiplication of vector quantities in C is generating as many
> > multiplications in assembly as would scalar multiplication in a loop. No
> > way that should be occurring.
> >
> 
> This is the reason why we generally don't allow intrinsic
> optimizations and instead ask people to write full assembly instead.
> It behaves more consistently everywhere.

Is this then a requirement to abandon the use of intrinsics for PPC64
SIMD and instead re-implement in assembly?





Re: [FFmpeg-devel] [PATCH] PPC64: Add versions of functions in libswscale/input.c optimized for POWER8 VSX SIMD.

2016-07-04 Thread Dan Parrot

On Mon, 2016-07-04 at 16:30 +, Carl Eugen Hoyos wrote:
> Dan Parrot  mail.com> writes:
> 
> > > Did you test if using ffmpeg -benchmark -f rawvideo -i /dev/zero... 
> > > showed different results?
> > > I believe this should be both easier and faster to test.
> >
> > Sorry, I don't understand what that command line just above 
> > is trying to achieve. Could you elaborate?
> 
> Instead of running the whole fate suite that takes long and 
> does not test libswscale for most commands, just test an 
> ffmpeg command line that only tests libswscale:
> $ ffmpeg -benchmark -f rawvideo -pix_fmt rgb24 
> -i /dev/zero -pix_fmt yuv420p -f null -vframes 1 -
$ ./ffmpeg -benchmark -f rawvideo -pix_fmt rgb24 -s hd1080 -i /dev/zero
-pix_fmt yuv420p -f null -vframes 1000 -

frame= 1000 fps= 16 q=-0.0 Lsize=N/A time=00:00:40.00 bitrate=N/A
speed=0.632x
video:477kB audio:0kB subtitle:0kB other streams:0kB global headers:0kB
muxing overhead: unknown
bench: utime=62.794s
bench: maxrss=21184kB


> vs
> 
> $ ffmpeg -cpuflags 0 -benchmark -f rawvideo -pix_fmt rgb24 
> -i /dev/zero -pix_fmt yuv420p -f null -vframes 1 -

$ ./ffmpeg -cpuflags 0 -benchmark -f rawvideo -pix_fmt rgb24 -s hd1080
-i /dev/zero -pix_fmt yuv420p -f null -vframes 1000 -

frame= 1000 fps= 12 q=-0.0 Lsize=N/A time=00:00:40.00 bitrate=N/A
speed=0.479x
video:477kB audio:0kB subtitle:0kB other streams:0kB global headers:0kB
muxing overhead: unknown
bench: utime=82.918s
bench: maxrss=21120kB

> [...]
> 
> > Surprisingly, gcc is producing some badly suboptimal assembly.
> 
> Just to make sure I don't misunderstand:
> Does this mean intrinsics are suboptimal to write assembly 
> code?
So, the latest version of GCC does produce more efficient assembly.

To recap: GCC 5.3.1 produces assembly that does not take full advantage
of PPC64 POWER8 SIMD instructions. GCC 6.1.1 is much better and produces
shorter sequences that do use SIMD assembly instructions.

> > > Can you confirm with START_TIMER / STOP_TIMER that there is no 
> > > gain?
> >
> > SystemTap probes provide identical functionality by measuring 
> > deltas between function entry and function return.
> 
> Sorry, I don't understand:
> Did you test with both methods to verify that they provide 
> the same results?

> Note that if it turns out that START_TIMER / STOP_TIMER 
> cannot be used on ppc64 (le) this would be important 
> information for us.
These start/stop macros are the last issue I have outstanding. I hope to
be done in a few hours.



Re: [FFmpeg-devel] [PATCH] PPC64: Add versions of functions in libswscale/input.c optimized for POWER8 VSX SIMD.

2016-07-04 Thread Dan Parrot

> > > > Can you confirm with START_TIMER / STOP_TIMER that there is no 
> > > > gain?
> > >
> > > SystemTap probes provide identical functionality by measuring 
> > > deltas between function entry and function return.
> > 
> > Sorry, I don't understand:
> > Did you test with both methods to verify that they provide 
> > the same results?
> > 
> > Note that if it turns out that START_TIMER / STOP_TIMER 
> > cannot be used on ppc64 (le) this would be important 
> > information for us.
> > 
> I'll insert these macros and inform of the results if the code compiles
> and runs.

These results for START_TIMER/STOP_TIMER are with ffmpeg compiled using
GCC 6.1.1

==
Existing non-SIMD version:

./ffmpeg -report -cpuflags 0 -benchmark -f rawvideo -pix_fmt rgb24 -s
hd1080 -i /dev/zero -pix_fmt yuv420p -f null -vframes 1000 -

  33770 UNITS in rgb24ToY_c,       1 runs,      0 skips
  33430 UNITS in rgb24ToY_c,       2 runs,      0 skips
  33292 UNITS in rgb24ToY_c,       4 runs,      0 skips
  33128 UNITS in rgb24ToY_c,       8 runs,      0 skips
  32848 UNITS in rgb24ToY_c,      16 runs,      0 skips
  32347 UNITS in rgb24ToY_c,      32 runs,      0 skips
  31831 UNITS in rgb24ToY_c,      64 runs,      0 skips
  31594 UNITS in rgb24ToY_c,     128 runs,      0 skips
  31513 UNITS in rgb24ToY_c,     256 runs,      0 skips
  31628 UNITS in rgb24ToY_c,     512 runs,      0 skips
  31466 UNITS in rgb24ToY_c,    1024 runs,      0 skips
  31390 UNITS in rgb24ToY_c,    2048 runs,      0 skips
  31411 UNITS in rgb24ToY_c,    4096 runs,      0 skips
  31411 UNITS in rgb24ToY_c,    8192 runs,      0 skips
  31399 UNITS in rgb24ToY_c,   16384 runs,      0 skips
  31416 UNITS in rgb24ToY_c,   32763 runs,      5 skips
  31413 UNITS in rgb24ToY_c,   65530 runs,      6 skips
  31421 UNITS in rgb24ToY_c,  131064 runs,      8 skips
  31430 UNITS in rgb24ToY_c,  262131 runs,     13 skips
  31422 UNITS in rgb24ToY_c,  524264 runs,     24 skips
  31424 UNITS in rgb24ToY_c, 1048532 runs,     44 skips
frame= 1000 fps= 11 q=-0.0 Lsize=N/A time=00:00:40.00 bitrate=N/A
speed=0.449x
video:477kB audio:0kB subtitle:0kB other streams:0kB global headers:0kB
muxing overhead: unknown
bench: utime=88.212s
bench: maxrss=21120kB
--
./ffmpeg -report -benchmark -f rawvideo -pix_fmt rgb24 -s hd1080
-i /dev/zero -pix_fmt yuv420p -f null -vframes 1000 -

  35440 UNITS in rgb24ToY_c,       1 runs,      0 skips
  34290 UNITS in rgb24ToY_c,       2 runs,      0 skips
  33670 UNITS in rgb24ToY_c,       4 runs,      0 skips
  33387 UNITS in rgb24ToY_c,       8 runs,      0 skips
  32786 UNITS in rgb24ToY_c,      16 runs,      0 skips
  32317 UNITS in rgb24ToY_c,      32 runs,      0 skips
  32008 UNITS in rgb24ToY_c,      64 runs,      0 skips
  31944 UNITS in rgb24ToY_c,     128 runs,      0 skips
  32049 UNITS in rgb24ToY_c,     256 runs,      0 skips
  31913 UNITS in rgb24ToY_c,     512 runs,      0 skips
  31822 UNITS in rgb24ToY_c,    1024 runs,      0 skips
  31805 UNITS in rgb24ToY_c,    2048 runs,      0 skips
  31841 UNITS in rgb24ToY_c,    4096 runs,      0 skips
  31825 UNITS in rgb24ToY_c,    8192 runs,      0 skips
  31803 UNITS in rgb24ToY_c,   16383 runs,      1 skips
  31822 UNITS in rgb24ToY_c,   32766 runs,      2 skips
  31816 UNITS in rgb24ToY_c,   65532 runs,      4 skips
  31811 UNITS in rgb24ToY_c,  131064 runs,      8 skips
  31810 UNITS in rgb24ToY_c,  262133 runs,     11 skips
  31811 UNITS in rgb24ToY_c,  524266 runs,     22 skips
  31822 UNITS in rgb24ToY_c, 1048527 runs,     49 skips
frame= 1000 fps= 15 q=-0.0 Lsize=N/A time=00:00:40.00 bitrate=N/A
speed=0.581x
video:477kB audio:0kB subtitle:0kB other streams:0kB global headers:0kB
muxing overhead: unknown
bench: utime=68.211s
bench: maxrss=21120kB


SIMD version in patch submitted earlier in this message thread:

./ffmpeg -report -cpuflags 0 -benchmark -f rawvideo -pix_fmt rgb24 -s
hd1080 -i /dev/zero -pix_fmt yuv420p -f null -vframes 1000 - && ./ffmpeg
-report -benchmark -f rawvideo -pix_fmt rgb24 -s hd1080 -i /dev/zero
-pix_fmt yuv420p -f null -vframes 1000

  23950 UNITS in rgb24ToY_c_vsx,   1 runs,  0 skips
  23175 UNITS in rgb24ToY_c_vsx,   2 runs,  0 skips
  22752 UNITS in rgb24ToY_c_vsx,   4 runs,  0 skips
  22401 UNITS in rgb24ToY_c_vsx,   8 runs,  0 skips
  22106 UNITS in rgb24ToY_c_vsx,  16 runs,  0 skips
  21585 UNITS in rgb24ToY_c_vsx,  32 runs,  0 skips
  21126 UNITS in rgb24ToY_c_vsx,  64 runs,  0 skips
  2

Re: [FFmpeg-devel] [PATCH] PPC64: Add versions of functions in libswscale/input.c optimized for POWER8 VSX SIMD.

2016-07-04 Thread Dan Parrot

On Mon, 2016-07-04 at 09:20 +, Carl Eugen Hoyos wrote:
> Dan Parrot  mail.com> writes:
> 
> > The dataset used was the entire FATE regression suite.
> 
> I don't think this is a particularly useful testcase:
> It takes very long but mostly tests other things.
> 
> Did you test if using ffmpeg -benchmark -f rawvideo -i /dev/zero... 
> showed different results?
> I believe this should be both easier and faster to test.
> 
> > name: rgb24ToY_c_vsx. 
> > no. of calls: . min: 3832 ns. avg: 4709 ns. max: 37550 ns. 
> > total: 47093533 ns. 
> > 
> > name: rgb24ToY_c. 
> > no. of calls: . min: 3809 ns. avg: 4707 ns. max: 29041 ns. 
> > total: 47072923 ns.
> 
> Without any data, I would have thought that this is the most 
> important function (and "no. of calls" seems to confirm this).
> 
> Why is this not faster?
> Can you confirm with START_TIMER / STOP_TIMER that there is no 
> gain?
> 
> Thank you, Carl Eugen




Re: [FFmpeg-devel] [PATCH] PPC64: Add versions of functions in libswscale/input.c optimized for POWER8 VSX SIMD.

2016-07-04 Thread Dan Parrot

On Mon, 2016-07-04 at 23:31 -0500, Dan Parrot wrote:
> On Mon, 2016-07-04 at 09:20 +, Carl Eugen Hoyos wrote:
> > Dan Parrot  mail.com> writes:
> > 
> > > The dataset used was the entire FATE regression suite.
> > 
> > I don't think this is a particularly useful testcase:
> > It takes very long but mostly tests other things.
> > 
> > Did you test if using ffmpeg -benchmark -f rawvideo -i /dev/zero... 
> > showed different results?
> > I believe this should be both easier and faster to test.
> > 
> > > name: rgb24ToY_c_vsx. 
> > > no. of calls: . min: 3832 ns. avg: 4709 ns. max: 37550 ns. 
> > > total: 47093533 ns. 
> > > 
> > > name: rgb24ToY_c. 
> > > no. of calls: . min: 3809 ns. avg: 4707 ns. max: 29041 ns. 
> > > total: 47072923 ns.
> > 
> > Without any data, I would have thought that this is the most 
> > important function (and "no. of calls" seems to confirm this).
> > 
> > Why is this not faster?

I believe I have answered, in earlier posts, all the questions you
raised. Finally, just to satisfy my curiosity, I used SystemTap to probe
during a run of the entire FATE regression. Here are the same two
functions, this time with GCC 6.1.1 instead of 5.3.1 (they are
representative of all other functions).

name: rgb24ToY_c_vsx. 
no. of calls: . min: 3053 ns. avg: 3298 ns. max: 69359 ns. total:
32983050 ns.

name: rgb24ToY_c. 
no. of calls: . min: 3040 ns. avg: 4056 ns. max: 79159 ns. total:
40561568 ns.

Non-trivial improvement is seen for the SIMD code. So: would you accept
and apply the patch?
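
The size of the improvement can be expressed as the ratio of the two totals quoted above. A small sketch (the helper name `speedup` is hypothetical):

```c
#include <assert.h>

/* Speedup of the SIMD version: scalar time divided by SIMD time.
 * Values above 1.0 mean the SIMD version is faster. */
static double speedup(double scalar_ns, double simd_ns)
{
    assert(simd_ns > 0.0);
    return scalar_ns / simd_ns;
}
```

With the totals above, `speedup(40561568, 32983050)` comes to roughly 1.23, so the SIMD version takes about 19% less time in this run.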


Re: [FFmpeg-devel] [PATCH] PPC64: Add versions of functions in libswscale/input.c optimized for POWER8 VSX SIMD.

2016-07-05 Thread Dan Parrot

On Mon, 2016-07-04 at 06:22 +, Carl Eugen Hoyos wrote:
> Dan Parrot  mail.com> writes:
> 
> > Finish providing SIMD versions for POWER8 VSX of functions 
> > in libswscale/input.c
> > That should allow trac ticket #5570 to be closed.
> 
> Please add some numbers:
> Either for single functions or for a single ffmpeg command.
> (for rgb/bgr, mono is irrelevant)
> 
> Carl Eugen

All questions you posed have now been answered in the thread. If there
are no new issues, could you apply the patch and close trac ticket
#5570?



Re: [FFmpeg-devel] [PATCH] PPC64: Add versions of functions in libswscale/input.c optimized for POWER8 VSX SIMD.

2016-07-05 Thread Dan Parrot

On Tue, 2016-07-05 at 15:45 +, Carl Eugen Hoyos wrote:
> Dan Parrot  mail.com> writes:
> 
> > These results for START_TIMER/STOP_TIMER are with ffmpeg 
> > compiled using GCC 6.1.1
> 
> I believe your results indicate that -cpuflags 0 has no 
> effect on vsx.
> On x86, this would be a blocker.
> 
> While I would prefer if you had tested with vanilla 
> gcc instead of a version patched by a distributor, no 
> more comments from me.

Just to be clear, I am not making any claims about any special patching
in GCC 6.1.1. All I am saying about GCC is that version 5.3.1 produces
suboptimal SIMD sequences; 6.1.1 gives much better SIMD sequences. Both
versions are preinstalled on Fedora images in the IBM cloud. I have no
way of knowing how those versions of GCC behave on different machines.


[FFmpeg-devel] [PATCH] PPC64: Add versions of functions in libswscale/input.c optimized for POWER8 VSX SIMD.

2016-07-05 Thread Dan Parrot

Finish providing SIMD versions for POWER8 VSX of functions in
libswscale/input.c. That should allow trac ticket #5570 to be closed.
The speedups obtained for the functions are:

abgrToA_c           1.19
bgr24ToUV_c         1.23
bgr24ToUV_half_c    1.37
bgr24ToY_c_vsx      1.43
nv12ToUV_c          1.05
nv21ToUV_c          1.06
planar_rgb_to_uv    1.25
planar_rgb_to_y     1.26
rgb24ToUV_c         1.11
rgb24ToUV_half_c    1.10
rgb24ToY_c          0.92
rgbaToA_c           0.88
uyvyToUV_c          1.05
uyvyToY_c           1.15
yuy2ToUV_c          1.07
yuy2ToY_c           1.17
yvy2ToUV_c          1.05
---
 libswscale/ppc/input_vsx.c | 1021 +++-
 1 file changed, 1017 insertions(+), 4 deletions(-)

diff --git a/libswscale/ppc/input_vsx.c b/libswscale/ppc/input_vsx.c
index d977a32..35edd5e 100644
--- a/libswscale/ppc/input_vsx.c
+++ b/libswscale/ppc/input_vsx.c
@@ -30,6 +30,7 @@
 #include "libavutil/mathematics.h"
 #include "libavutil/pixdesc.h"
 #include "libavutil/avassert.h"
+#include "libavutil/timer.h"
 #include "config.h"
 #include "libswscale/rgb2rgb.h"
 #include "libswscale/swscale.h"
@@ -54,6 +55,7 @@ static void abgrToA_c_vsx(uint8_t *_dst, const uint8_t *src, const uint8_t *unus
 for ( i = 0; i < width_adj; i += 8) {
 vector int v_rd0 = vec_vsx_ld(0, (int *)src_addr);
 vector int v_rd1 = vec_vsx_ld(0, (int *)(src_addr + 16));
+vector int v_dst;
 
 v_rd0 = vec_and(v_rd0, vec_splats(0x0ff));
 v_rd1 = vec_and(v_rd1, vec_splats(0x0ff));
@@ -61,8 +63,8 @@ static void abgrToA_c_vsx(uint8_t *_dst, const uint8_t *src, const uint8_t *unus
 v_rd0 = vec_sl(v_rd0, vec_splats((unsigned)6));
 v_rd1 = vec_sl(v_rd1, vec_splats((unsigned)6));
 
-vector int v_dst = vec_perm(v_rd0, v_rd1, ((vector unsigned char)
-   {0, 1, 4, 5, 8, 9, 12, 13, 16, 17, 20, 21, 24, 25, 28, 29}));
+v_dst = vec_perm(v_rd0, v_rd1, ((vector unsigned char)
+{0, 1, 4, 5, 8, 9, 12, 13, 16, 17, 20, 21, 24, 25, 28, 29}));
 vec_vsx_st((vector unsigned char)v_dst, 0, (unsigned char *)dst_addr);
 
 src_addr += 32;
@@ -91,6 +93,7 @@ static void rgbaToA_c_vsx(uint8_t *_dst, const uint8_t *src, const uint8_t *unus
 for ( i = 0; i < width_adj; i += 8) {
 vector int v_rd0 = vec_vsx_ld(0, (int *)src_addr);
 vector int v_rd1 = vec_vsx_ld(0, (int *)(src_addr + 16));
+vector int v_dst;
 
 v_rd0 = vec_sld(v_rd0, v_rd0, 13);
 v_rd1 = vec_sld(v_rd1, v_rd1, 13);
@@ -101,8 +104,8 @@ static void rgbaToA_c_vsx(uint8_t *_dst, const uint8_t *src, const uint8_t *unus
 v_rd0 = vec_sl(v_rd0, vec_splats((unsigned)6));
 v_rd1 = vec_sl(v_rd1, vec_splats((unsigned)6));
 
-vector int v_dst = vec_perm(v_rd0, v_rd1, ((vector unsigned char)
-   {0, 1, 4, 5, 8, 9, 12, 13, 16, 17, 20, 21, 24, 25, 28, 29}));
+v_dst = vec_perm(v_rd0, v_rd1, ((vector unsigned char)
+{0, 1, 4, 5, 8, 9, 12, 13, 16, 17, 20, 21, 24, 25, 28, 29}));
 vec_vsx_st((vector unsigned char)v_dst, 0, (unsigned char *)dst_addr);
 
 src_addr += 32;
@@ -114,6 +117,175 @@ static void rgbaToA_c_vsx(uint8_t *_dst, const uint8_t *src, const uint8_t *unus
 }
 }
 
+static void monoblack2Y_c_vsx(uint8_t *_dst, const uint8_t *src, const uint8_t *unused1, const uint8_t *unused2,
+  int width, uint32_t *unused)
+{
+int16_t *dst = (int16_t *)_dst;
+int i, j, width_adj, frag_len;
+
+vector unsigned char v_rd;
+vector signed short v_din, v_d, v_dst;
+vector unsigned short   v_opr;
+
+uintptr_t src_addr = (uintptr_t)src;
+uintptr_t dst_addr = (uintptr_t)dst;
+
+width = (width + 7) >> 3;
+
+// compute integral number of vector-length items and length of final fragment
+width_adj = width >> 3;
+width_adj = width_adj << 3;
+frag_len = width - width_adj;
+
+v_opr = (vector unsigned short) {7, 6, 5, 4, 3, 2, 1, 0};
+
+for (i = 0; i < width_adj; i += 8) {
+if (i & 0x0f) {
+v_rd = vec_sld(v_rd, v_rd, 8);
+} else {
+v_rd = vec_vsx_ld(0, (unsigned char *)src_addr);
+src_addr += 16;
+}
+
+v_din = vec_unpackh((vector signed char)v_rd);
+v_din = vec_and(v_din, vec_splats((short)0x00ff));
+
+for (j = 0; j < 8; j++) {
+switch(j) {
+case 0:
+v_d = vec_splat(v_din, 0);
+break;
+case 1:
+v_d = vec_splat(v_din, 1);
+break;
+case 2:
+v_d = vec_splat(v_din, 2);
+break;
+case 3:
+v_d = vec_splat(v_din, 3);
+break;
+

Re: [FFmpeg-devel] [PATCH] PPC64: Add versions of functions in libswscale/input.c optimized for POWER8 VSX SIMD.

2016-07-06 Thread Dan Parrot

On Wed, 2016-07-06 at 09:07 +0200, Hendrik Leppkes wrote:
> On Wed, Jul 6, 2016 at 4:37 AM, Dan Parrot  wrote:
> > Finish providing SIMD versions for POWER8 VSX of functions in
> > libswscale/input.c. That should allow trac ticket #5570 to be closed.
> > The speedups obtained for the functions are:
> >
> > abgrToA_c           1.19
> > bgr24ToUV_c         1.23
> > bgr24ToUV_half_c    1.37
> > bgr24ToY_c_vsx      1.43
> > nv12ToUV_c          1.05
> > nv21ToUV_c          1.06
> > planar_rgb_to_uv    1.25
> > planar_rgb_to_y     1.26
> > rgb24ToUV_c         1.11
> > rgb24ToUV_half_c    1.10
> > rgb24ToY_c          0.92
> > rgbaToA_c           0.88
> > uyvyToUV_c          1.05
> > uyvyToY_c           1.15
> > yuy2ToUV_c          1.07
> > yuy2ToY_c           1.17
> > yvy2ToUV_c          1.05
> 
> SIMD implementations that in the best case improve the speed by 43%
> (and in some cases is *slower*) seem barely worth it. One would expect
> a proper SIMD implementation to offer 100% or higher increases, at
> least thats the general expectation on x86 with SSE/AVX.
It sounds like you have either forgotten or never learned a very basic
principle of computer architecture. I recommend the text by Patterson
and Hennessy. The principle is Amdahl's Law. Before you start throwing
numbers around, make sure you understand what was being parallelized.
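
For reference, Amdahl's Law says that if a fraction p of the run time is sped up by a factor s, the overall speedup is 1 / ((1 - p) + p / s). A minimal sketch, with `amdahl_speedup` as a hypothetical helper name and illustrative numbers that are not measurements from this patch:

```c
#include <assert.h>

/* Amdahl's Law: overall speedup when a fraction p (0..1) of the run
 * time is accelerated by a factor s and the remainder is unchanged. */
static double amdahl_speedup(double p, double s)
{
    assert(p >= 0.0 && p <= 1.0 && s > 0.0);
    return 1.0 / ((1.0 - p) + p / s);
}
```

For example, `amdahl_speedup(0.5, 4.0)` gives 1.6: even a 4x faster vectorized loop yields only 1.6x overall when half of the function's time is spent outside that loop.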

> So the question here is - is thats VSX being bad, or the intrinsics
> being bad? How would the speedup be in proper hand-written ASM?
> If hand-written ASM can give us the usual 100-200% improvements we would
> expect from SIMD, then this is what should generally be favored.
I am not going to write assembly just so you get a nice fuzzy feeling. If that's
a deal-breaker, so be it.

> Also, one further thought:
> From the commit message, it sounds like you might only be doing this
> for the bounty in #5570, do you plan to maintain these optimizations
> in the future?

Unless you are a mind reader, STFU about my motivation in writing code.

One other thing: why didn't this come up when the earlier patch was
submitted and applied?

