[PATCH] Fixing improper conversion from sin() to sinf() in optimization mode.

2013-08-20 Thread Cong Hou
When a sin() (or cos(), log(), etc.) call takes a value of float type
and its double result is converted back to float, GCC converts the
call into the float version (sinf()) when optimizing. This avoids two
type conversions, and the float version is usually faster. However,
the two versions can produce different results (sinf() computes in
single precision throughout, while rounding the double result of
sin() to float may yield a differently rounded value), so the
transformation is unsafe.

For example, the following code prints two different values at -O0
(correct), but the same value at any optimization level above -O0
(incorrect).


#include <stdio.h>
#include <math.h>

int main()
{
  float v = 1;
  printf("%.20f\n", (float)sin(v));
  printf("%.20f\n", sinf(v));
}


With this patch, we perform the conversion only when the flag
-funsafe-math-optimizations is set. The patch is shown below.


thanks,

Cong




Index: gcc/testsuite/gcc.c-torture/execute/20030125-1.c
===
--- gcc/testsuite/gcc.c-torture/execute/20030125-1.c (revision 201891)
+++ gcc/testsuite/gcc.c-torture/execute/20030125-1.c (working copy)
@@ -44,11 +44,11 @@ __attribute__ ((noinline))
 double
 sin(double a)
 {
- abort ();
+ return a;
 }
 __attribute__ ((noinline))
 float
 sinf(float a)
 {
- return a;
+ abort ();
 }
Index: gcc/convert.c
===
--- gcc/convert.c (revision 201891)
+++ gcc/convert.c (working copy)
@@ -99,7 +99,7 @@ convert_to_real (tree type, tree expr)
   /* Disable until we figure out how to decide whether the functions are
  present in runtime.  */
   /* Convert (float)sqrt((double)x) where x is float into sqrtf(x) */
-  if (optimize
+  if (optimize && flag_unsafe_math_optimizations
   && (TYPE_MODE (type) == TYPE_MODE (double_type_node)
   || TYPE_MODE (type) == TYPE_MODE (float_type_node)))
 {


Re: [PATCH] Fixing improper conversion from sin() to sinf() in optimization mode.

2013-09-03 Thread Cong Hou
I have fixed my test code, replacing the aliasing violations with
unions. The test results now show that logb() is safe for the
conversion.

The conclusion is: logb() and fabs() are always safe for the
conversion, and sqrt() is unsafe only for narrowing sqrtl() on double
values to sqrt(double). The other math functions are not safe for the
conversion.

The new test code I used is shown below:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

typedef union
{
int i;
float f;
} T32;

typedef union
{
long long int i;
double f;
} T64;

#define N 1000

#define test_math_func(func) \
for (i = 0; i < N; ++i) \
{ \
  int d = rand(), e = rand(); \
  if (d == 0) continue; \
  T32 v, r1, r2; \
  v.f = (float)e / d; \
  r1.f = func(v.f), r2.f = func##f(v.f); \
  if (r1.f != r2.f) \
  { \
printf("%s double -> float (%X) %X %X\n", #func, v.i, r1.i, r2.i); \
break; \
  } \
} \
for (i = 0; i < N; ++i) \
{ \
  int d = rand(), e = rand(); \
  if (d == 0) continue; \
  T32 v, r1, r2; \
  v.f = (float)e / d; \
  r1.f = func##l(v.f), r2.f = func##f(v.f); \
  if (r1.f != r2.f) \
  { \
printf("%s long double -> float (%X) %X %X\n", #func, v.i, r1.i, r2.i); \
break; \
  } \
} \
for (i = 0; i < N; ++i) \
{ \
  int d = rand(), e = rand(); \
  if (d == 0) continue; \
  T64 v, r1, r2; \
  v.f = (double)e / d; \
  r1.f = func##l(v.f), r2.f = func(v.f); \
  if (r1.f != r2.f) \
  { \
printf("%s long double -> double (%016llX) %016llX %016llX\n",
#func, v.i, r1.i, r2.i); \
break; \
  } \
}

int main()
{
  int i;
  test_math_func(sin);
  test_math_func(cos);
  test_math_func(sinh);
  test_math_func(cosh);
  test_math_func(asin);
  test_math_func(acos);
  test_math_func(asinh);
  test_math_func(acosh);
  test_math_func(tan);
  test_math_func(tanh);
  test_math_func(atan);
  test_math_func(atanh);
  test_math_func(log);
  test_math_func(log10);
  test_math_func(log1p);
  test_math_func(log2);
  test_math_func(logb);
  test_math_func(cbrt);
  test_math_func(erf);
  test_math_func(erfc);
  test_math_func(exp);
  test_math_func(exp2);
  test_math_func(expm1);
  test_math_func(sqrt);
  test_math_func(fabs);
}


I have modified the patch according to this new conclusion. The patch
is pasted below.

thanks,
Cong


===
--- gcc/convert.c (revision 201891)
+++ gcc/convert.c (working copy)
@@ -135,16 +135,24 @@ convert_to_real (tree type, tree expr)
   CASE_MATHFN (COS)
   CASE_MATHFN (ERF)
   CASE_MATHFN (ERFC)
-  CASE_MATHFN (FABS)
   CASE_MATHFN (LOG)
   CASE_MATHFN (LOG10)
   CASE_MATHFN (LOG2)
   CASE_MATHFN (LOG1P)
-  CASE_MATHFN (LOGB)
   CASE_MATHFN (SIN)
-  CASE_MATHFN (SQRT)
   CASE_MATHFN (TAN)
   CASE_MATHFN (TANH)
+/* The above functions are not safe to do this conversion. */
+if (!flag_unsafe_math_optimizations)
+  break;
+  CASE_MATHFN (SQRT)
+/* sqrtl(double) cannot be safely converted to sqrt(double). */
+if (fcode == BUILT_IN_SQRTL &&
+(TYPE_MODE (type) == TYPE_MODE (double_type_node)) &&
+!flag_unsafe_math_optimizations)
+  break;
+  CASE_MATHFN (FABS)
+  CASE_MATHFN (LOGB)
 #undef CASE_MATHFN
 {
   tree arg0 = strip_float_extensions (CALL_EXPR_ARG (expr, 0));
Index: gcc/testsuite/gcc.c-torture/execute/20030125-1.c
===
--- gcc/testsuite/gcc.c-torture/execute/20030125-1.c (revision 201891)
+++ gcc/testsuite/gcc.c-torture/execute/20030125-1.c (working copy)
@@ -44,11 +44,11 @@ __attribute__ ((noinline))
 double
 sin(double a)
 {
- abort ();
+ return a;
 }
 __attribute__ ((noinline))
 float
 sinf(float a)
 {
- return a;
+ abort ();
 }

On Sat, Aug 31, 2013 at 9:24 AM, Joseph S. Myers wrote:
> On Sat, 31 Aug 2013, Cong Hou wrote:
>
>> > I don't see why it would be unsafe for logb - can you give an example
>> > (exact float input value as hex float, and the values you believe logb
>> > should return for float and double).
>> >
>>
>> Please try the following code (you will get different results
>> depending on whether optimization is enabled):
>>
>> #include <stdio.h>
>> #include <math.h>
>>
>> int main()
>> {
>>   int i = 0x3edc67d5;
>>   float f = *((float*)&i);
>>   float r1 = logb(f);
>>   float r2 = logbf(f);
>>   printf("%x %x\n", *((int*)&r1), *((int*)&r2));
>> }
>
> (a) Please stop sending HTML email, so your messages reach the mailing
> list, and resend your messages so far to the list.  The mailing list needs
> to see the whole of both sides of the discussion of any patch being
> proposed for GCC.
>
> (b) I referred to the values *you believe logb should return*.
> Optimization is not meant to preserve library bugs; the comparison should
> be on 

Re: [PATCH] Fixing improper conversion from sin() to sinf() in optimization mode.

2013-09-03 Thread Cong Hou
Could you please tell me how to check the precision of long double in
GCC on different platforms?

Thank you!


Cong

On Tue, Sep 3, 2013 at 2:43 PM, Joseph S. Myers  wrote:
> On Tue, 3 Sep 2013, Xinliang David Li wrote:
>
>> From Joseph:
>>
>> "The
>> conversion is not safe for sqrt if the two types are double and long
>> double and long double is x86 extended, for example."
>>
>> This is not reflected in the patch.
>
> No, the problem is that it tries to reflect it but hardcodes the specific
> example I gave, rather than following the logic I explained regarding the
> precisions of the types involved, which depend on the target.  And since I
> only gave a simplified analysis, for two types when this function deals
> with cases involving three types, the patch submission needs to include
> its own analysis for the full generality of three types to justify the
> logic used (as inequalities involving the three precisions).  (I suspect
> it reduces to the case of two types so you don't need to go into the
> details of reasoning about floating point to produce the more general
> analysis.  But in any case, it's for the patch submitter to give the full
> explanation.)
>
> --
> Joseph S. Myers
> jos...@codesourcery.com


Re: [PATCH] Fixing improper conversion from sin() to sinf() in optimization mode.

2013-09-04 Thread Cong Hou
I have made a new patch according to your comments. I found several
references saying that a precision of 2p+2 is sufficient for the sqrt
conversion (one here:
http://www.cs.berkeley.edu/~fateman/generic/algorithms.pdf). The new
patch is pasted below.

Thank you for all the suggestions, Joseph!


Cong


Index: gcc/testsuite/gcc.c-torture/execute/20030125-1.c
===
--- gcc/testsuite/gcc.c-torture/execute/20030125-1.c (revision 201891)
+++ gcc/testsuite/gcc.c-torture/execute/20030125-1.c (working copy)
@@ -44,11 +44,11 @@ __attribute__ ((noinline))
 double
 sin(double a)
 {
- abort ();
+ return a;
 }
 __attribute__ ((noinline))
 float
 sinf(float a)
 {
- return a;
+ abort ();
 }
Index: gcc/convert.c
===
--- gcc/convert.c (revision 201891)
+++ gcc/convert.c (working copy)
@@ -135,16 +135,34 @@ convert_to_real (tree type, tree expr)
   CASE_MATHFN (COS)
   CASE_MATHFN (ERF)
   CASE_MATHFN (ERFC)
-  CASE_MATHFN (FABS)
   CASE_MATHFN (LOG)
   CASE_MATHFN (LOG10)
   CASE_MATHFN (LOG2)
   CASE_MATHFN (LOG1P)
-  CASE_MATHFN (LOGB)
   CASE_MATHFN (SIN)
-  CASE_MATHFN (SQRT)
   CASE_MATHFN (TAN)
   CASE_MATHFN (TANH)
+  CASE_MATHFN (SQRT)
+
+/* The above functions (except sqrt) are not safe to do this conversion. */
+if (!flag_unsafe_math_optimizations)
+{
+  /* sqrtl?(T1) can be safely converted into sqrtf?(T2) only if
+ p1 >= p2*2+2, where p1 and p2 are the precisions of T1 and T2. */
+  if ((fcode == BUILT_IN_SQRT || fcode == BUILT_IN_SQRTL))
+  {
+int p1 = REAL_MODE_FORMAT (TYPE_MODE (type))->p;
+int p2 = (fcode == BUILT_IN_SQRTL) ?
+REAL_MODE_FORMAT (TYPE_MODE (long_double_type_node))->p :
+REAL_MODE_FORMAT (TYPE_MODE (double_type_node))->p;
+if (p2 < p1 * 2 + 2)
+  break;
+  }
+  else
+break;
+}
+  CASE_MATHFN (FABS)
+  CASE_MATHFN (LOGB)
 #undef CASE_MATHFN
 {
   tree arg0 = strip_float_extensions (CALL_EXPR_ARG (expr, 0));

On Tue, Sep 3, 2013 at 3:38 PM, Joseph S. Myers  wrote:
> On Tue, 3 Sep 2013, Cong Hou wrote:
>
>> Could you please tell me how to check the precision of long double in
>> GCC on different platforms?
>
> REAL_MODE_FORMAT (TYPE_MODE (long_double_type_node))->p
>
> (but you should be referring to the relevant types - "type", the type
> being converted to, "itype", the type of the function being called in the
> source code, "TREE_TYPE (arg0)", the type of the argument after extensions
> have been removed, and "newtype", computed from those - so you should have
> expressions like the above with two or more of those four types, but not
> with long_double_type_node directly).
>
> The patch submission will need to include a proper analysis to justify to
> the reader why the particular inequality with particular types from those
> four is correct in all cases where the relevant code may be executed.
>
> --
> Joseph S. Myers
> jos...@codesourcery.com


Re: [PATCH] Fixing improper conversion from sin() to sinf() in optimization mode.

2013-09-04 Thread Cong Hou
Updated patch according to your comment (tabs are not pasted here).

Cong


Index: gcc/convert.c
===
--- gcc/convert.c (revision 201891)
+++ gcc/convert.c (working copy)
@@ -135,16 +135,40 @@ convert_to_real (tree type, tree expr)
   CASE_MATHFN (COS)
   CASE_MATHFN (ERF)
   CASE_MATHFN (ERFC)
-  CASE_MATHFN (FABS)
   CASE_MATHFN (LOG)
   CASE_MATHFN (LOG10)
   CASE_MATHFN (LOG2)
   CASE_MATHFN (LOG1P)
-  CASE_MATHFN (LOGB)
   CASE_MATHFN (SIN)
-  CASE_MATHFN (SQRT)
   CASE_MATHFN (TAN)
   CASE_MATHFN (TANH)
+  CASE_MATHFN (SQRT)
+
+/* The above functions (except sqrt) are not safe to do this conversion. */
+if (!flag_unsafe_math_optimizations)
+  {
+ /* sqrtl?(T1) can be safely converted into sqrtf?(T2) only if
+   p1 >= p2*2+2, where p1 and p2 are the precisions of T1 and T2.
+   For example, on x86 the conversion from (float) sqrt ((double) f)
+   to sqrtf (f) is safe when f has type float, since float has 24
+   bits of precision, double has 53 bits, and 53 >= 24*2+2.
+   However, the conversion from (double) sqrtl ((long double) d) to
+   sqrt (d) is unsafe when d has type double, because x86 long double
+   has 64 bits of precision and 64 < 53*2+2.  */
+ if ((fcode == BUILT_IN_SQRT || fcode == BUILT_IN_SQRTL))
+  {
+int p1 = REAL_MODE_FORMAT (TYPE_MODE (type))->p;
+int p2 = (fcode == BUILT_IN_SQRTL) ?
+ REAL_MODE_FORMAT (TYPE_MODE (long_double_type_node))->p :
+ REAL_MODE_FORMAT (TYPE_MODE (double_type_node))->p;
+if (p2 < p1 * 2 + 2)
+  break;
+  }
+ else
+  break;
+  }
+  CASE_MATHFN (FABS)
+  CASE_MATHFN (LOGB)
 #undef CASE_MATHFN
 {
   tree arg0 = strip_float_extensions (CALL_EXPR_ARG (expr, 0));
Index: gcc/testsuite/gcc.c-torture/execute/20030125-1.c
===
--- gcc/testsuite/gcc.c-torture/execute/20030125-1.c (revision 201891)
+++ gcc/testsuite/gcc.c-torture/execute/20030125-1.c (working copy)
@@ -44,11 +44,11 @@ __attribute__ ((noinline))
 double
 sin(double a)
 {
- abort ();
+ return a;
 }
 __attribute__ ((noinline))
 float
 sinf(float a)
 {
- return a;
+ abort ();
 }

On Wed, Sep 4, 2013 at 2:21 PM, Xinliang David Li  wrote:
> On Wed, Sep 4, 2013 at 1:59 PM, Joseph S. Myers  
> wrote:
>> On Wed, 4 Sep 2013, Cong Hou wrote:
>>
>>> I have made a new patch according to your comments. I found several
>>> references saying that the precision 2p+2 is OK for the sqrt
>>> conversion (one here:
>>> http://www.cs.berkeley.edu/~fateman/generic/algorithms.pdf). The new
>>> patch is pasted as below.
>>
>> This patch submission still fails to pay attention to various of my
>> comments.
>>
>
> If you can provide inlined comments in the patch, that will be more
> useful and productive.
>
> thanks,
>
> David
>
>
>> --
>> Joseph S. Myers
>> jos...@codesourcery.com


Re: [PATCH] Fixing improper conversion from sin() to sinf() in optimization mode.

2013-09-06 Thread Cong Hou
First, thank you for your detailed comments again! I deeply
apologize for not explaining my patch properly and for not responding
to your previous comment; I did not thoroughly understand the problem
before submitting the patch.

Previously I only considered the following three conversions for sqrt():


1: (float) sqrt ((double) float_val)  ->  sqrtf (float_val)
2: (float) sqrtl ((long double) float_val)  ->  sqrtf (float_val)
3: (double) sqrtl ((long double) double_val)  ->  sqrt (double_val)


We have four types here:

TYPE is the type to which the result of the function call is converted.
ITYPE is the type of the math call expression.
TREE_TYPE(arg0) is the type of the function argument (before type conversion).
NEWTYPE is whichever of TYPE and TREE_TYPE(arg0) has the higher precision.
It will be the type of the new math call expression after conversion.

For all three cases above, TYPE is always the same as NEWTYPE. That is
why I only considered TYPE during the precision comparison. ITYPE can
only be double_type_node or long_double_type_node depending on the
type of the math function. That is why I explicitly used those two
types instead of ITYPE (no correctness issue). But you are right,
ITYPE is more elegant and better here.

After further analysis, I found I missed two more cases. Note that we
have the following conditions according to the code in convert.c:

TYPE_PRECISION(NEWTYPE) >= TYPE_PRECISION(TYPE)
TYPE_PRECISION(NEWTYPE) >= TYPE_PRECISION(TREE_TYPE(arg0))
TYPE_PRECISION (NEWTYPE) < TYPE_PRECISION (ITYPE)

the last condition comes from the fact that we only consider
converting a math function call into another one with narrower type.
Therefore we have

TYPE_PRECISION(TYPE) < TYPE_PRECISION (ITYPE)
TYPE_PRECISION(TREE_TYPE(arg0)) < TYPE_PRECISION (ITYPE)

So for sqrt(), TYPE and TREE_TYPE(arg0) can only be float, and for
sqrtl(), TYPE and TREE_TYPE(arg0) can be either float or double with
four possible combinations. Therefore we have two more conversions to
consider besides the three ones I mentioned above:


4: (float) sqrtl ((long double) double_val)  ->  (float) sqrt (double_val)
5: (double) sqrtl ((long double) float_val)  ->  sqrt ((double) float_val)


For the first of these (conversion 4), TYPE (float) is different from
NEWTYPE (double), and my previous patch doesn't handle this case. The
correct approach is to compare the precisions of ITYPE and NEWTYPE.

To sum up, we are converting the expression

(TYPE) sqrtITYPE ((ITYPE) expr)

to

(TYPE) sqrtNEWTYPE ((NEWTYPE) expr)

and we require

PRECISION (ITYPE) >= PRECISION (NEWTYPE) * 2 + 2

to make it a safe conversion.
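As a sanity check of this rule, here is a small standalone sketch (my
own illustration, not GCC code; it assumes IEEE single/double and x86
extended long double, whose REAL_MODE_FORMAT precisions are 24, 53,
and 64):

#include <stdio.h>

/* Nonzero if computing sqrt in a type of precision p_itype and
   narrowing to a type of precision p_newtype is exact by the rule
   PRECISION (ITYPE) >= PRECISION (NEWTYPE) * 2 + 2.  */
static int narrows_safely (int p_itype, int p_newtype)
{
  return p_itype >= 2 * p_newtype + 2;
}

int main (void)
{
  printf ("sqrt  -> sqrtf: %s\n", narrows_safely (53, 24) ? "safe" : "unsafe");
  printf ("sqrtl -> sqrtf: %s\n", narrows_safely (64, 24) ? "safe" : "unsafe");
  printf ("sqrtl -> sqrt:  %s\n", narrows_safely (64, 53) ? "safe" : "unsafe");
  return 0;
}

It reports conversions 1 and 2 above as safe and conversion 3 as
unsafe, matching the analysis.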


The new patch is pasted below.

I appreciate your detailed comments and analysis; next time I submit
a patch I will be more careful about the reviewer's comments.


Thank you!

Cong



Index: gcc/convert.c
===
--- gcc/convert.c (revision 201891)
+++ gcc/convert.c (working copy)
@@ -135,16 +135,19 @@ convert_to_real (tree type, tree expr)
   CASE_MATHFN (COS)
   CASE_MATHFN (ERF)
   CASE_MATHFN (ERFC)
-  CASE_MATHFN (FABS)
   CASE_MATHFN (LOG)
   CASE_MATHFN (LOG10)
   CASE_MATHFN (LOG2)
   CASE_MATHFN (LOG1P)
-  CASE_MATHFN (LOGB)
   CASE_MATHFN (SIN)
-  CASE_MATHFN (SQRT)
   CASE_MATHFN (TAN)
   CASE_MATHFN (TANH)
+/* The above functions are not safe to do this conversion. */
+if (!flag_unsafe_math_optimizations)
+  break;
+  CASE_MATHFN (SQRT)
+  CASE_MATHFN (FABS)
+  CASE_MATHFN (LOGB)
 #undef CASE_MATHFN
 {
   tree arg0 = strip_float_extensions (CALL_EXPR_ARG (expr, 0));
@@ -155,6 +158,27 @@ convert_to_real (tree type, tree expr)
   if (TYPE_PRECISION (TREE_TYPE (arg0)) > TYPE_PRECISION (type))
  newtype = TREE_TYPE (arg0);

+  /* We consider converting
+
+ (T1) sqrtT2 ((T2) exprT3)
+ to
+ (T1) sqrtT4 ((T4) exprT3)
+
+ where T1 is TYPE, T2 is ITYPE, T3 is TREE_TYPE (ARG0), and T4
+ is NEWTYPE. All these types are floating-point types. T4 (NEWTYPE)
+ should be narrower than T2 (ITYPE). This conversion is safe only
+ if P1 >= P2*2+2, where P1 and P2 are the precisions of T2 and T4.
+ See the following URL for a reference:
+ http://stackoverflow.com/questions/9235456/determining-floating-point-square-root
+ */
+  if (fcode == BUILT_IN_SQRT || fcode == BUILT_IN_SQRTL)
+ {
+  int p1 = REAL_MODE_FORMAT (TYPE_MODE (itype))->p;
+  int p2 = REAL_MODE_FORMAT (TYPE_MODE (newtype))->p;
+  if (p1 < p2 * 2 + 2 && !flag_unsafe_math_optimizations)
+break;
+ }
+
   /* Be careful about integer to fp conversions.
  These may overflow still.  */
   if (FLOAT_TYPE_P (TREE_TYPE (arg0))
Index: gcc/testsuite/gcc.c-torture/execute/20030125-1.c
===
--- gcc/testsuite/gcc.c-torture/execute/20030125-1.c (revision 201891)
+++ gcc/testsuite/gcc.c-torture/execute/20030125-1.c (working copy)
@@ -44,11 +44,11 @@ __attribute__ ((noinline))
 double
 sin(double a)
 {
- abort ();
+ return a;
 }
 __attribute__ ((noinline))
 float
 sinf(float a)
 {
- return a;
+ abort ();
 }

Re: [PATCH] Fixing improper conversion from sin() to sinf() in optimization mode.

2013-09-09 Thread Cong Hou
On Mon, Sep 9, 2013 at 6:26 PM, Xinliang David Li  wrote:
> On Fri, Sep 6, 2013 at 3:24 PM, Cong Hou  wrote:
>> First, thank you for your detailed comments again! Then I deeply
>> apologize for not explaining my patch properly and responding to your
>> previous comment. I didn't understand thoroughly the problem before
>> submitting the patch.
>>
>> Previously I only considered the following three conversions for sqrt():
>>
>>
>> 1: (float) sqrt ((double) float_val)  ->  sqrtf (float_val)
>> 2: (float) sqrtl ((long double) float_val)  ->  sqrtf (float_val)
>> 3: (double) sqrtl ((long double) double_val)  ->  sqrt (double_val)
>>
>>
>> We have four types here:
>>
>> TYPE is the type to which the result of the function call is converted.
>> ITYPE is the type of the math call expression.
>> TREE_TYPE(arg0) is the type of the function argument (before type 
>> conversion).
>> NEWTYPE is chosen from TYPE and TREE_TYPE(arg0) with higher precision.
>> It will be the type of the new math call expression after conversion.
>>
>> For all three cases above, TYPE is always the same as NEWTYPE. That is
>> why I only considered TYPE during the precision comparison. ITYPE can
>> only be double_type_node or long_double_type_node depending on the
>> type of the math function. That is why I explicitly used those two
>> types instead of ITYPE (no correctness issue). But you are right,
>> ITYPE is more elegant and better here.
>>
>> After further analysis, I found I missed two more cases. Note that we
>> have the following conditions according to the code in convert.c:
>>
>> TYPE_PRECISION(NEWTYPE) >= TYPE_PRECISION(TYPE)
>> TYPE_PRECISION(NEWTYPE) >= TYPE_PRECISION(TREE_TYPE(arg0))
>> TYPE_PRECISION (NEWTYPE) < TYPE_PRECISION (ITYPE)
>>
>> the last condition comes from the fact that we only consider
>> converting a math function call into another one with narrower type.
>> Therefore we have
>>
>> TYPE_PRECISION(TYPE) < TYPE_PRECISION (ITYPE)
>> TYPE_PRECISION(TREE_TYPE(arg0)) < TYPE_PRECISION (ITYPE)
>>
>> So for sqrt(), TYPE and TREE_TYPE(arg0) can only be float, and for
>> sqrtl(), TYPE and TREE_TYPE(arg0) can be either float or double with
>> four possible combinations. Therefore we have two more conversions to
>> consider besides the three ones I mentioned above:
>>
>>
>> 4: (float) sqrtl ((long double) double_val)  ->  (float) sqrt (double_val)
>> 5: (double) sqrtl ((long double) float_val)  ->  sqrt ((double) float_val)
>>
>>
>> For the first conversion here, TYPE (float) is different from NEWTYPE
>> (double), and my previous patch doesn't handle this case. The correct
>> way is to compare precisions of ITYPE and NEWTYPE now.
>>
>> To sum up, we are converting the expression
>>
>> (TYPE) sqrtITYPE ((ITYPE) expr)
>>
>> to
>>
>> (TYPE) sqrtNEWTYPE ((NEWTYPE) expr)
>>
>> and we require
>>
>> PRECISION (ITYPE) >= PRECISION (NEWTYPE) * 2 + 2
>>
>> to make it a safe conversion.
>>
>>
>> The new patch is pasted below.
>>
>> I appreciate your detailed comments and analysis, and next time when I
>> submit a patch I will be more careful about the reviewer's comments.
>>
>>
>> Thank you!
>>
>> Cong
>>
>>
>>
>> Index: gcc/convert.c
>> ===
>> --- gcc/convert.c (revision 201891)
>> +++ gcc/convert.c (working copy)
>> @@ -135,16 +135,19 @@ convert_to_real (tree type, tree expr)
>>CASE_MATHFN (COS)
>>CASE_MATHFN (ERF)
>>CASE_MATHFN (ERFC)
>> -  CASE_MATHFN (FABS)
>>CASE_MATHFN (LOG)
>>CASE_MATHFN (LOG10)
>>CASE_MATHFN (LOG2)
>>CASE_MATHFN (LOG1P)
>> -  CASE_MATHFN (LOGB)
>>CASE_MATHFN (SIN)
>> -  CASE_MATHFN (SQRT)
>>CASE_MATHFN (TAN)
>>CASE_MATHFN (TANH)
>> +/* The above functions are not safe to do this conversion. */
>> +if (!flag_unsafe_math_optimizations)
>> +  break;
>> +  CASE_MATHFN (SQRT)
>> +  CASE_MATHFN (FABS)
>> +  CASE_MATHFN (LOGB)
>>  #undef CASE_MATHFN
>>  {
>>tree arg0 = strip_float_extensions (CALL_EXPR_ARG (expr, 0));
>> @@ -155,6 +158,27 @@ convert_to_real (tree type, tree expr)
>>if (TYPE_PRECISION (TREE_TYPE (arg0)) > TYPE_PRECISION (type))
>>   newtype = TREE_TYPE (arg0);
>>
>> +  /* We consider to convert
>> +
>> + (T1) 

[PATCH] [vectorizer] Fixing a bug in tree-vect-patterns.c in GCC vectorizer.

2013-09-11 Thread Cong Hou
Hi

There is a bug in the function vect_recog_dot_prod_pattern() in
tree-vect-patterns.c. This function checks whether a loop computes a
dot-product pattern. Specifically, according to the comment on this
function:

/*
 Try to find the following pattern:

 type x_t, y_t;
 TYPE1 prod;
 TYPE2 sum = init;
   loop:
 sum_0 = phi <init, sum_1>
 S1  x_t = ...
 S2  y_t = ...
 S3  x_T = (TYPE1) x_t;
 S4  y_T = (TYPE1) y_t;
 S5  prod = x_T * y_T;
 [S6  prod = (TYPE2) prod;  #optional]
 S7  sum_1 = prod + sum_0;

   where 'TYPE1' is exactly double the size of type 'type', and
'TYPE2' is the same size of 'TYPE1' or bigger. This is a special case
of a reduction computation.
*/

This function should check that x_t and y_t have the same type
('type'), whose size is half that of TYPE1. The corresponding code is
shown below:

  oprnd0 = gimple_assign_rhs1 (stmt);
  oprnd1 = gimple_assign_rhs2 (stmt);
  if (!types_compatible_p (TREE_TYPE (oprnd0), prod_type) ||
!types_compatible_p (TREE_TYPE (oprnd1), prod_type))
return NULL;
  if (!type_conversion_p (oprnd0, stmt, true, &half_type0,
&def_stmt, &promotion) || !promotion)
return NULL;
  oprnd00 = gimple_assign_rhs1 (def_stmt);

/*==V  see here! */
  if (!type_conversion_p (oprnd0, stmt, true, &half_type1,
&def_stmt, &promotion) || !promotion)
return NULL;
  oprnd01 = gimple_assign_rhs1 (def_stmt);
  if (!types_compatible_p (half_type0, half_type1))
return NULL;
  if (TYPE_PRECISION (prod_type) != TYPE_PRECISION (half_type0) * 2)
return NULL;

Here the function passes oprnd0 (x_T) a second time, so it re-checks
the type of x_t instead of checking the type of y_t; half_type1 then
trivially matches half_type0 even when y_t has a different type. The
fix is simple: just replace oprnd0 with oprnd1.

The failed test case for this bug is shown below:

int foo(short *a, int *b, int n) {
  int sum = 0;
  for (int i = 0; i < n; ++i)
sum += a[i] * b[i];
  return sum;
}
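For contrast, here is an illustrative sketch (names are mine) of a
loop that does match the documented pattern, since both multiplicands
are widened from the same narrow type:

/* (short, short) -> int: a genuine dot-product reduction.  */
int dot (short *a, short *b, int n)
{
  int sum = 0;
  for (int i = 0; i < n; ++i)
    sum += a[i] * b[i];
  return sum;
}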


thanks,
Cong


Index: gcc/tree-vect-patterns.c
===
--- gcc/tree-vect-patterns.c (revision 200988)
+++ gcc/tree-vect-patterns.c (working copy)
@@ -397,7 +397,7 @@ vect_recog_dot_prod_pattern (vec
   || !promotion)
 return NULL;
   oprnd00 = gimple_assign_rhs1 (def_stmt);
-  if (!type_conversion_p (oprnd0, stmt, true, &half_type1, &def_stmt,
+  if (!type_conversion_p (oprnd1, stmt, true, &half_type1, &def_stmt,
 &promotion)
   || !promotion)
 return NULL;


Re: [PATCH] [vectorizer] Fixing a bug in tree-vect-patterns.c in GCC vectorizer.

2013-09-13 Thread Cong Hou
A new test case has been added to testsuite/gcc.dg/vect; it fails
without this patch and passes with it. Bootstrap also passes, and no
additional test failures are introduced.

The new test case computes a dot product over two arrays of short and
int types. The loop is still vectorized (using punpcklwd on the short
array), but it should not be recognized as a dot-product pattern.


thanks,
Cong





Index: gcc/tree-vect-patterns.c
===
--- gcc/tree-vect-patterns.c (revision 202572)
+++ gcc/tree-vect-patterns.c (working copy)
@@ -397,7 +397,7 @@ vect_recog_dot_prod_pattern (vec
   || !promotion)
 return NULL;
   oprnd00 = gimple_assign_rhs1 (def_stmt);
-  if (!type_conversion_p (oprnd0, stmt, true, &half_type1, &def_stmt,
+  if (!type_conversion_p (oprnd1, stmt, true, &half_type1, &def_stmt,
 &promotion)
   || !promotion)
 return NULL;
Index: gcc/ChangeLog
===
--- gcc/ChangeLog (revision 202572)
+++ gcc/ChangeLog (working copy)
@@ -1,3 +1,9 @@
+2013-09-13  Cong Hou  
+
+ * tree-vect-patterns.c (vect_recog_dot_prod_pattern): Fix a bug
+ in checking the dot-product pattern. The type of the rhs operand
+ of the multiply is now checked correctly.
+
 2013-09-13  Jan Hubicka  

  PR middle-end/58094
Index: gcc/testsuite/gcc.dg/vect/vect-reduc-dot-s16c.c
===
--- gcc/testsuite/gcc.dg/vect/vect-reduc-dot-s16c.c (revision 0)
+++ gcc/testsuite/gcc.dg/vect/vect-reduc-dot-s16c.c (revision 0)
@@ -0,0 +1,73 @@
+/* { dg-require-effective-target vect_int } */
+
+#include <stdarg.h>
+#include "tree-vect.h"
+
+#define N 64
+#define DOT 43680
+
+signed short X[N] __attribute__ ((__aligned__(__BIGGEST_ALIGNMENT__)));
+signed int   Y[N] __attribute__ ((__aligned__(__BIGGEST_ALIGNMENT__)));
+
+/* (short, int)->int->int dot product.
+   Not detected as a dot-product pattern.  */
+
+__attribute__ ((noinline)) int
+foo (int len)
+{
+  int i;
+  int result = 0;
+
+  for (i = 0; i < len; i++)
+{
+  result += (X[i] * Y[i]);
+}
+  return result;
+}
+
+
+/* (int, short)->int->int dot product.
+   Not detected as a dot-product pattern.  */
+
+__attribute__ ((noinline)) int
+bar (int len)
+{
+  int i;
+  int result = 0;
+
+  for (i = 0; i < len; i++)
+{
+  result += (Y[i] * X[i]);
+}
+  return result;
+}
+
+int
+main (void)
+{
+  int i;
+  int dot;
+
+  check_vect ();
+
+  for (i = 0; i < N; i++)
+{
+  X[i] = i;
+  Y[i] = N - i;
+  __asm__ volatile ("");
+}
+
+  dot = foo (N);
+  if (dot != DOT)
+abort ();
+
+  dot = bar (N);
+  if (dot != DOT)
+abort ();
+
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 2 "vect" { target vect_unpack } } } */
+/* { dg-final { cleanup-tree-dump "vect" } } */
+
Index: gcc/testsuite/ChangeLog
===
--- gcc/testsuite/ChangeLog (revision 202572)
+++ gcc/testsuite/ChangeLog (working copy)
@@ -1,3 +1,9 @@
+2013-09-13  Cong Hou  
+
+ * gcc.dg/vect/vect-reduc-dot-s16c.c: Add a test case with dot product
+ on two arrays with short and int types. This should not be recognized
+ as a dot product pattern.
+
 2013-09-13  Kai Tietz  

  gcc.target/i386/pr57848.c: New file.




On Wed, Sep 11, 2013 at 6:55 PM, Xinliang David Li  wrote:
> Can you add a test case to the regression suite?
>
> When the type of arguments are unsigned short/unsigned int, GCC does
> not vectorize the loop anymore -- this is worth a separate bug to
> track. punpcklwd instruction can be used to do zero extension of the
> short type.
>
> David
>
> On Wed, Sep 11, 2013 at 6:16 PM, Cong Hou  wrote:
>> Hi
>>
>> There is a bug in the function vect_recog_dot_prod_pattern() in
>> tree-vect-patterns.c. This function checks if a loop is of dot
>> production pattern. Specifically, according to the comment of this
>> function:
>>
>> /*
>>  Try to find the following pattern:
>>
>>  type x_t, y_t;
>>  TYPE1 prod;
>>  TYPE2 sum = init;
>>loop:
>>  sum_0 = phi <init, sum_1>
>>  S1  x_t = ...
>>  S2  y_t = ...
>>  S3  x_T = (TYPE1) x_t;
>>  S4  y_T = (TYPE1) y_t;
>>  S5  prod = x_T * y_T;
>>  [S6  prod = (TYPE2) prod;  #optional]
>>  S7  sum_1 = prod + sum_0;
>>
>>where 'TYPE1' is exactly double the size of type 'type', and
>> 'TYPE2' is the same size of 'TYPE1' or bigger. This is a special case
>> of a reduction computation.
>> */
>>
>> This function should check if x_t and y_t have the same type (ty

Re: [PATCH] Fixing improper conversion from sin() to sinf() in optimization mode.

2013-09-20 Thread Cong Hou
Any comments or further suggestions on this patch?


thanks,
Cong

On Mon, Sep 9, 2013 at 7:28 PM, Cong Hou  wrote:
> On Mon, Sep 9, 2013 at 6:26 PM, Xinliang David Li  wrote:
>> On Fri, Sep 6, 2013 at 3:24 PM, Cong Hou  wrote:
>>> First, thank you for your detailed comments again! Then I deeply
>>> apologize for not explaining my patch properly and responding to your
>>> previous comment. I didn't understand thoroughly the problem before
>>> submitting the patch.
>>>
>>> Previously I only considered the following three conversions for sqrt():
>>>
>>>
>>> 1: (float) sqrt ((double) float_val)  ->  sqrtf (float_val)
>>> 2: (float) sqrtl ((long double) float_val)  ->  sqrtf (float_val)
>>> 3: (double) sqrtl ((long double) double_val)  ->  sqrt (double_val)
>>>
>>>
>>> We have four types here:
>>>
>>> TYPE is the type to which the result of the function call is converted.
>>> ITYPE is the type of the math call expression.
>>> TREE_TYPE(arg0) is the type of the function argument (before type 
>>> conversion).
>>> NEWTYPE is chosen from TYPE and TREE_TYPE(arg0) with higher precision.
>>> It will be the type of the new math call expression after conversion.
>>>
>>> For all three cases above, TYPE is always the same as NEWTYPE. That is
>>> why I only considered TYPE during the precision comparison. ITYPE can
>>> only be double_type_node or long_double_type_node depending on the
>>> type of the math function. That is why I explicitly used those two
>>> types instead of ITYPE (no correctness issue). But you are right,
>>> ITYPE is more elegant and better here.
>>>
>>> After further analysis, I found I missed two more cases. Note that we
>>> have the following conditions according to the code in convert.c:
>>>
>>> TYPE_PRECISION(NEWTYPE) >= TYPE_PRECISION(TYPE)
>>> TYPE_PRECISION(NEWTYPE) >= TYPE_PRECISION(TREE_TYPE(arg0))
>>> TYPE_PRECISION (NEWTYPE) < TYPE_PRECISION (ITYPE)
>>>
>>> the last condition comes from the fact that we only consider
>>> converting a math function call into another one with narrower type.
>>> Therefore we have
>>>
>>> TYPE_PRECISION(TYPE) < TYPE_PRECISION (ITYPE)
>>> TYPE_PRECISION(TREE_TYPE(arg0)) < TYPE_PRECISION (ITYPE)
>>>
>>> So for sqrt(), TYPE and TREE_TYPE(arg0) can only be float, and for
>>> sqrtl(), TYPE and TREE_TYPE(arg0) can be either float or double with
>>> four possible combinations. Therefore we have two more conversions to
>>> consider besides the three ones I mentioned above:
>>>
>>>
>>> 4: (float) sqrtl ((long double) double_val)  ->  (float) sqrt (double_val)
>>> 5: (double) sqrtl ((long double) float_val)  ->  sqrt ((double) float_val)
>>>
>>>
>>> For the first conversion here, TYPE (float) is different from NEWTYPE
>>> (double), and my previous patch doesn't handle this case. The correct
>>> way is to compare precisions of ITYPE and NEWTYPE now.
>>>
>>> To sum up, we are converting the expression
>>>
>>> (TYPE) sqrtITYPE ((ITYPE) expr)
>>>
>>> to
>>>
>>> (TYPE) sqrtNEWTYPE ((NEWTYPE) expr)
>>>
>>> and we require
>>>
>>> PRECISION (ITYPE) >= PRECISION (NEWTYPE) * 2 + 2
>>>
>>> to make it a safe conversion.
>>>
>>>
>>> The new patch is pasted below.
>>>
>>> I appreciate your detailed comments and analysis, and next time when I
>>> submit a patch I will be more careful about the reviewer's comments.
>>>
>>>
>>> Thank you!
>>>
>>> Cong
>>>
>>>
>>>
>>> Index: gcc/convert.c
>>> ===
>>> --- gcc/convert.c (revision 201891)
>>> +++ gcc/convert.c (working copy)
>>> @@ -135,16 +135,19 @@ convert_to_real (tree type, tree expr)
>>>CASE_MATHFN (COS)
>>>CASE_MATHFN (ERF)
>>>CASE_MATHFN (ERFC)
>>> -  CASE_MATHFN (FABS)
>>>CASE_MATHFN (LOG)
>>>CASE_MATHFN (LOG10)
>>>CASE_MATHFN (LOG2)
>>>CASE_MATHFN (LOG1P)
>>> -  CASE_MATHFN (LOGB)
>>>CASE_MATHFN (SIN)
>>> -  CASE_MATHFN (SQRT)
>>>CASE_MATHFN (TAN)
>>>CASE_MATHFN (TANH)
>>> +/* The above functions are not safe to do this conversion. */
>>> +if (!fla

[PATCH] Bug fix: *var and MEM[(const int *)var] (var has int* type) are not treated as the same data ref.

2013-09-23 Thread Cong Hou
(I have also created this issue in bug reports:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58513)


First look at the code below:


int op(const int* a, const int* b)
{ return *a+*b; }

void foo(int*a, int b)
{
  int i;
  for (i = 0; i < 10; ++i)
a[i] = op(a+i, &b);
}


GCC will generate the following GIMPLE for this loop after inlining op():


  :
  # i_15 = PHI 
  # ivtmp_23 = PHI 
  _4 = (long unsigned int) i_15;
  _5 = _4 * 4;
  _7 = a_6(D) + _5;
  _10 = MEM[(const int *)_7];
  _11 = _10 + b_12(D);
  *_7 = _11;
  i_9 = i_15 + 1;
  ivtmp_22 = ivtmp_23 - 1;
  if (ivtmp_22 != 0)
goto ;
  else
goto ;



Here each element of array a is loaded via MEM[(const int *)_7] and
stored via *_7, which are the only two data refs in the loop body. The
GCC vectorizer needs to check for possible aliasing between data refs
with potential data dependence. These two data refs are actually the
same, but GCC cannot recognize this fact. As a result, the generated
alias-check predicate will always return false at runtime (GCC 4.9
can eliminate the generated branch at the end of the vectorization
pass).

The reason GCC treats MEM[(const int *)_7] and *_7 as two different
data refs is a possible defect in operand_equal_p(), which is used to
compare two data refs: the current implementation uses == to compare
the types of the second argument of the MEM_REF operator, which is too
strict. Using types_compatible_p() instead fixes the issue above. I
have also included a test case for this bug fix. Bootstrapping and
"make check" both pass.


thanks,
Cong



Index: gcc/testsuite/gcc.dg/alias-14.c
===
--- gcc/testsuite/gcc.dg/alias-14.c (revision 0)
+++ gcc/testsuite/gcc.dg/alias-14.c (revision 0)
@@ -0,0 +1,24 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize" } */
+
+int op (const int* x, const int* y)
+{
+  return *x + *y;
+}
+
+/* After inlining op() the type of the data ref is converted from int* into
+   const int&, resulting in two data refs MEM[(const int *)DR] and *DR for read
+   and write, where DR represents the address of a[i] here.  They are still
+   the same data ref and no alias exists in the loop.  The vectorizer should
+   successfully vectorize this loop.  */
+
+void foo(int* a, int b)
+{
+  int i;
+  for (i = 0; i < 10; ++i)
+a[i] = op(a + i, &b);
+}
+
+
+/* { dg-final { scan-assembler-times "paddd" 1 { target x86_64-*-* } } } */
+
Index: gcc/fold-const.c
===
--- gcc/fold-const.c (revision 202662)
+++ gcc/fold-const.c (working copy)
@@ -2693,8 +2693,9 @@ operand_equal_p (const_tree arg0, const_
&& operand_equal_p (TYPE_SIZE (TREE_TYPE (arg0)),
TYPE_SIZE (TREE_TYPE (arg1)), flags)))
   && types_compatible_p (TREE_TYPE (arg0), TREE_TYPE (arg1))
-  && (TYPE_MAIN_VARIANT (TREE_TYPE (TREE_OPERAND (arg0, 1)))
-  == TYPE_MAIN_VARIANT (TREE_TYPE (TREE_OPERAND (arg1, 1
+  && types_compatible_p (
+   TYPE_MAIN_VARIANT (TREE_TYPE (TREE_OPERAND (arg0, 1))),
+   TYPE_MAIN_VARIANT (TREE_TYPE (TREE_OPERAND (arg1, 1
   && OP_SAME (0) && OP_SAME (1));

  case ARRAY_REF:


Re: [PATCH] Bug fix: *var and MEM[(const int *)var] (var has int* type) are not treated as the same data ref.

2013-09-24 Thread Cong Hou
Nice fix! I noticed that this patch has already been committed to trunk.

Thank you very much, Richard!



Cong

On Tue, Sep 24, 2013 at 1:49 AM, Richard Biener  wrote:
> On Tue, 24 Sep 2013, Richard Biener wrote:
>
>> On Tue, 24 Sep 2013, Jakub Jelinek wrote:
>>
>> > Hi!
>> >
>> > On Mon, Sep 23, 2013 at 05:26:13PM -0700, Cong Hou wrote:
>> >
>> > Missing ChangeLog entry.
>> >
>> > > --- gcc/testsuite/gcc.dg/alias-14.c (revision 0)
>> > > +++ gcc/testsuite/gcc.dg/alias-14.c (revision 0)
>> >
>> > Vectorizer tests should go into gcc.dg/vect/ instead, or, if they are
>> > for a single target (but there is no reason why this should be a single
>> > target), into gcc.target//.
>> >
>> > > --- gcc/fold-const.c (revision 202662)
>> > > +++ gcc/fold-const.c (working copy)
>> > > @@ -2693,8 +2693,9 @@ operand_equal_p (const_tree arg0, const_
>> > > && operand_equal_p (TYPE_SIZE (TREE_TYPE (arg0)),
>> > > TYPE_SIZE (TREE_TYPE (arg1)), flags)))
>> > >&& types_compatible_p (TREE_TYPE (arg0), TREE_TYPE (arg1))
>> > > -  && (TYPE_MAIN_VARIANT (TREE_TYPE (TREE_OPERAND (arg0, 1)))
>> > > -  == TYPE_MAIN_VARIANT (TREE_TYPE (TREE_OPERAND (arg1, 1
>> > > +  && types_compatible_p (
>> > > +   TYPE_MAIN_VARIANT (TREE_TYPE (TREE_OPERAND (arg0, 1))),
>> > > +   TYPE_MAIN_VARIANT (TREE_TYPE (TREE_OPERAND (arg1, 1
>> > >&& OP_SAME (0) && OP_SAME (1));
>> >
>> > This looks wrong.  types_compatible_p will happily return true say
>> > for unsigned long and unsigned long long types on x86_64, because
>> > they are both integral types with the same precision, but the second
>> > argument of MEM_REF contains aliasing information, where the distinction
>> > between the two is important.
>> > So, while == comparison of main variant is too strict, types_compatible_p
>> > is too weak, so I guess you need to write a new predicate that will either
>> > handle the == and a few special cases that are safe to be handled, or
>> > look for what exactly we use the type of the second MEM_REF argument
>> > and check those properties.  We certainly need that
>> > get_deref_alias_set_1 and get_deref_alias_set return the same values
>> > for both the types, but whether that is the only information we are using,
>> > not sure, CCing Richard.
>>
>> Using TYPE_MAIN_VARIANT is exactly correct - this is the best we
>> can do that will work with all frontends.  TYPE_MAIN_VARIANT
>> guarantees that the alias-sets stay the same:
>>
>>   /* If the innermost reference is a MEM_REF that has a
>>  conversion embedded treat it like a VIEW_CONVERT_EXPR above,
>>  using the memory access type for determining the alias-set.  */
>>  if (TREE_CODE (inner) == MEM_REF
>>  && TYPE_MAIN_VARIANT (TREE_TYPE (inner))
>> != TYPE_MAIN_VARIANT
>>(TREE_TYPE (TREE_TYPE (TREE_OPERAND (inner, 1)
>>return get_deref_alias_set (TREE_OPERAND (inner, 1));
>>
>> so we cannot change the compatibility checks without touching the
>> alias-set deriving code.  For the testcase in question we have
>> MEM[(const int &)_7] vs. MEM[(int *)_7] and unfortunately pointer
>> and reference types are not variant types.
>>
>> We also cannot easily resort to pointed-to types as our all-beloved
>> ref-all qualification is on the pointer type rather than on the
>> pointed-to type.
>>
>> But yes, we could implement a more complicated predicate
>>
>> bool
>> alias_ptr_types_compatible_p (const_tree t1, const_tree t2)
>> {
>>   t1 = TYPE_MAIN_VARIANT (t1);
>>   t2 = TYPE_MAIN_VARIANT (t2);
>>   if (t1 == t2)
>> return true;
>>
>>   if (TYPE_REF_CAN_ALIAS_ALL (t1)
>>   || TYPE_REF_CAN_ALIAS_ALL (t2))
>> return false;
>>
>>   return (TYPE_MAIN_VARIANT (TREE_TYPE (t1))
>>   == TYPE_MAIN_VARIANT (TREE_TYPE (t2)));
>> }
>>
>> Note that the fold-const code in question is
>>
>>   return ((TYPE_SIZE (TREE_TYPE (arg0)) == TYPE_SIZE (TREE_TYPE
>> (arg1))
>>|| (TYPE_SIZE (TREE_TYPE (arg0))
>>&& TYPE_SIZE (TREE_TYPE (arg1))
>>&& operand_equal_p (TYPE_SIZE (TREE_TYPE (arg0)),
>>TYPE_SIZE (TREE_TYPE (arg1)),

[PATCH] Relax the requirement of reduction pattern in GCC vectorizer.

2013-09-27 Thread Cong Hou
The current GCC vectorizer requires the following pattern as a simple
reduction computation:

   loop_header:
 a1 = phi < a0, a2 >
 a3 = ...
 a2 = operation (a3, a1)

But a3 can also be defined outside of the loop. For example, the
following loop can benefit from vectorization but the GCC vectorizer
fails to vectorize it:


int foo(int v)
{
  int s = 1;
  ++v;
  for (int i = 0; i < 10; ++i)
s *= v;
  return s;
}


This patch relaxes the original requirement by also considering the
following pattern:


   a3 = ...
   loop_header:
 a1 = phi < a0, a2 >
 a2 = operation (a3, a1)


A test case is also added. The patch is tested on x86-64.


thanks,
Cong



diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 39c786e..45c1667 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,9 @@
+2013-09-27  Cong Hou  
+
+ * tree-vect-loop.c: Relax the requirement of the reduction
+ pattern so that one operand of the reduction operation can
+ come from outside of the loop.
+
 2013-09-25  Tom Tromey  

  * Makefile.in (PARTITION_H, LTO_SYMTAB_H, COMMON_TARGET_DEF_H)
diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
index 09644d2..90496a2 100644
--- a/gcc/testsuite/ChangeLog
+++ b/gcc/testsuite/ChangeLog
@@ -1,3 +1,7 @@
+2013-09-27  Cong Hou  
+
+ * gcc.dg/vect/vect-reduc-pattern-3.c: New test.
+
 2013-09-25  Marek Polacek  

  PR sanitizer/58413
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 2871ba1..3c51c3b 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -2091,6 +2091,13 @@ vect_is_slp_reduction (loop_vec_info loop_info, gimple phi, gimple first_stmt)
  a3 = ...
  a2 = operation (a3, a1)

+   or
+
+   a3 = ...
+   loop_header:
+ a1 = phi < a0, a2 >
+ a2 = operation (a3, a1)
+
such that:
1. operation is commutative and associative and it is safe to
   change the order of the computation (if CHECK_REDUCTION is true)
@@ -2451,6 +2458,7 @@ vect_is_simple_reduction_1 (loop_vec_info loop_info, gimple phi,
   if (def2 && def2 == phi
   && (code == COND_EXPR
   || !def1 || gimple_nop_p (def1)
+  || !flow_bb_inside_loop_p (loop, gimple_bb (def1))
   || (def1 && flow_bb_inside_loop_p (loop, gimple_bb (def1))
   && (is_gimple_assign (def1)
   || is_gimple_call (def1)
@@ -2469,6 +2477,7 @@ vect_is_simple_reduction_1 (loop_vec_info loop_info, gimple phi,
   if (def1 && def1 == phi
   && (code == COND_EXPR
   || !def2 || gimple_nop_p (def2)
+  || !flow_bb_inside_loop_p (loop, gimple_bb (def2))
   || (def2 && flow_bb_inside_loop_p (loop, gimple_bb (def2))
   && (is_gimple_assign (def2)
   || is_gimple_call (def2)
diff --git gcc/testsuite/gcc.dg/vect/vect-reduc-pattern-3.c gcc/testsuite/gcc.dg/vect/vect-reduc-pattern-3.c
new file mode 100644
index 000..06a9416
--- /dev/null
+++ gcc/testsuite/gcc.dg/vect/vect-reduc-pattern-3.c
@@ -0,0 +1,41 @@
+/* { dg-require-effective-target vect_int } */
+
+#include <stdarg.h>
+#include "tree-vect.h"
+
+#define N 10
+#define RES 1024
+
+/* A reduction pattern in which there is no data ref in
+   the loop and one operand is defined outside of the loop.  */
+
+__attribute__ ((noinline)) int
+foo (int v)
+{
+  int i;
+  int result = 1;
+
+  ++v;
+  for (i = 0; i < N; i++)
+result *= v;
+
+  return result;
+}
+
+int
+main (void)
+{
+  int res;
+
+  check_vect ();
+
+  res = foo (1);
+  if (res != RES)
+abort ();
+
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
+/* { dg-final { cleanup-tree-dump "vect" } } */
+


[PATCH] Improving uniform_vector_p() function.

2013-10-01 Thread Cong Hou
The current uniform_vector_p() function only returns non-NULL when the
vector is directly a uniform vector. For example, for the following
gimple code:

vect_cst_.15_91 = {_9, _9, _9, _9, _9, _9, _9, _9};


The current implementation can only detect that {_9, _9, _9, _9, _9,
_9, _9, _9} is a uniform vector, but fails to recognize
vect_cst_.15_91 is also one. This simple patch searches through
assignment chains to find more uniform vectors.


thanks,
Cong



diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 45c1667..b42f8a9 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,9 @@
+2013-10-01  Cong Hou  
+
+   * tree.c: Improve the function uniform_vector_p() so that a
+   vector assigned with a uniform vector is also treated as a
+   uniform vector.
+
diff --git a/gcc/tree.c b/gcc/tree.c
index 1c881e4..1d6d894 100644
--- a/gcc/tree.c
+++ b/gcc/tree.c
@@ -10297,6 +10297,17 @@ uniform_vector_p (const_tree vec)
   return first;
 }

+  if (TREE_CODE (vec) == SSA_NAME)
+{
+  gimple def = SSA_NAME_DEF_STMT (vec);
+  if (gimple_code (def) == GIMPLE_ASSIGN)
+{
+  tree rhs = gimple_op (def, 1);
+  if (VECTOR_TYPE_P (TREE_TYPE (rhs)))
+return uniform_vector_p (rhs);
+}
+}
+
   return NULL_TREE;
 }


Re: [PATCH] Improving uniform_vector_p() function.

2013-10-01 Thread Cong Hou
Actually I will introduce optimizations in the next patch. Currently
the function uniform_vector_p () is rarely used in GCC, but there are
certainly some optimization opportunities with the help of this
function.

For example, when we widen a vector of eight identical elements of
short type to two vectors of int type, GCC emits the following code:

  vect_cst_.15_91 = {_9, _9, _9, _9, _9, _9, _9, _9};
  vect__10.16_92 = [vec_unpack_lo_expr] vect_cst_.15_91;
  vect__10.16_93 = [vec_unpack_hi_expr] vect_cst_.15_91;

When vect_cst_.15_91 is a uniform vector, we know vect__10.16_92 and
vect__10.16_93 are identical, so we can remove the second
[vec_unpack_hi_expr] operation:

  vect_cst_.15_91 = {_9, _9, _9, _9, _9, _9, _9, _9};
  vect__10.16_92 = [vec_unpack_lo_expr] vect_cst_.15_91;
  vect__10.16_93 = vect__10.16_92;


thanks,
Cong


On Tue, Oct 1, 2013 at 2:37 PM, Xinliang David Li  wrote:
> On Tue, Oct 1, 2013 at 10:31 AM, Cong Hou  wrote:
>> The current uniform_vector_p() function only returns non-NULL when the
>> vector is directly a uniform vector. For example, for the following
>> gimple code:
>>
>> vect_cst_.15_91 = {_9, _9, _9, _9, _9, _9, _9, _9};
>>
>>
>> The current implementation can only detect that {_9, _9, _9, _9, _9,
>> _9, _9, _9} is a uniform vector, but fails to recognize
>> vect_cst_.15_91 is also one. This simple patch searches through
>> assignment chains to find more uniform vectors.
>>
>>
>> thanks,
>> Cong
>>
>>
>>
>> diff --git a/gcc/ChangeLog b/gcc/ChangeLog
>> index 45c1667..b42f8a9 100644
>> --- a/gcc/ChangeLog
>> +++ b/gcc/ChangeLog
>> @@ -1,3 +1,9 @@
>> +2013-10-01  Cong Hou  
>> +
>> +   * tree.c: Improve the function uniform_vector_p() so that a
>> +   vector assigned with a uniform vector is also treated as a
>> +   uniform vector.
>> +
>> diff --git a/gcc/tree.c b/gcc/tree.c
>> index 1c881e4..1d6d894 100644
>> --- a/gcc/tree.c
>> +++ b/gcc/tree.c
>> @@ -10297,6 +10297,17 @@ uniform_vector_p (const_tree vec)
>>return first;
>>  }
>>
>> +  if (TREE_CODE (vec) == SSA_NAME)
>> +{
>> +  gimple def = SSA_NAME_DEF_STMT (vec);
>> +  if (gimple_code (def) == GIMPLE_ASSIGN)
>
>
> do  this:
>
>  if (is_gimple_assign (def) && gimple_assign_copy_p (def))
>
>> +{
>> +  tree rhs = gimple_op (def, 1);
>> +  if (VECTOR_TYPE_P (TREE_TYPE (rhs)))
>> +return uniform_vector_p (rhs);
>> +}
>> +}
>> +
>>return NULL_TREE;
>>  }
>
> Do you have a test case showing what missed optimization this fix can enable ?
>
> David


[PATCH] Reducing number of alias checks in vectorization.

2013-10-01 Thread Cong Hou
)->dest);
+  gsi_move_after (&si, &si_dst);
+ }
+  continue;
+}
+  else if (!dr)
+  {
+bool hoist = true;
+for (size_t i = 0; i < gimple_num_ops (stmt); i++)
+{
+  tree op = gimple_op (stmt, i);
+  if (TREE_CODE (op) == INTEGER_CST
+  || TREE_CODE (op) == REAL_CST)
+continue;
+  if (TREE_CODE (op) == SSA_NAME)
+  {
+gimple def = SSA_NAME_DEF_STMT (op);
+if (def == stmt
+|| gimple_nop_p (def)
+|| !flow_bb_inside_loop_p (loop, gimple_bb (def)))
+  continue;
+  }
+  hoist = false;
+  break;
+}
+
+if (hoist)
+{
+  basic_block preheader = loop_preheader_edge (loop)->src;
+  gimple_stmt_iterator si_dst = gsi_last_bb (preheader);
+  gsi_move_after (&si, &si_dst);
+  continue;
+}
+  }
+  gsi_next (&si);
+ }
+}
+
   /* End loop-exit-fixes after versioning.  */

   if (cond_expr_stmt_list)
Index: gcc/ChangeLog
===
--- gcc/ChangeLog (revision 202663)
+++ gcc/ChangeLog (working copy)
@@ -1,3 +1,8 @@
+2013-10-01  Cong Hou  
+
+ * tree-vect-loop-manip.c (vect_create_cond_for_alias_checks): Combine
+ alias checks if it is possible to amortize the runtime overhead.
+


Re: [PATCH] Reducing number of alias checks in vectorization.

2013-10-02 Thread Cong Hou
On Tue, Oct 1, 2013 at 11:35 PM, Jakub Jelinek  wrote:
> On Tue, Oct 01, 2013 at 07:12:54PM -0700, Cong Hou wrote:
>> --- gcc/tree-vect-loop-manip.c (revision 202662)
>> +++ gcc/tree-vect-loop-manip.c (working copy)
>
> Your mailer ate all the tabs, so the formatting of the whole patch
> can't be checked.
>


I'll pay attention to this problem in my later patch submission.


>> @@ -19,6 +19,10 @@ You should have received a copy of the G
>>  along with GCC; see the file COPYING3.  If not see
>>  <http://www.gnu.org/licenses/>.  */
>>
>> +#include <vector>
>> +#include <utility>
>> +#include <algorithm>
>
> Why?  GCC has it's vec.h vectors, why don't you use those?
> There is even qsort method for you in there.  And for pairs, you can
> easily just use structs with two members as structure elements in the
> vector.
>


GCC has now been restructured using C++, and the STL is one of the
most important parts of C++. I am new to the GCC community and more
familiar with the STL (and I think allowing the STL in GCC could
attract more new developers). I agree that using GCC's vec maintains
a uniform style, but the STL is just so powerful and easy to use...

I just did a search in the GCC source tree and found <vector> is not
used yet. I will change std::vector to GCC's vec for now (and qsort
likewise), but I still wonder whether GCC will accept the STL one day.


>> +struct dr_addr_with_seg_len
>> +{
>> +  dr_addr_with_seg_len (data_reference* d, tree addr, tree off, tree len)
>> +: dr (d), basic_addr (addr), offset (off), seg_len (len) {}
>> +
>> +  data_reference* dr;
>
> Space should be before *, not after it.
>
>> +  if (TREE_CODE (p11.offset) != INTEGER_CST
>> +  || TREE_CODE (p21.offset) != INTEGER_CST)
>> +return p11.offset < p21.offset;
>
> If offset isn't INTEGER_CST, you are comparing the pointer values?
> That is never a good idea, then compilation will depend on how say address
> space randomization randomizes virtual address space.  GCC needs to have
> reproduceable compilations.


In this scenario comparing pointers is safe. The sort is used to put
together any two pairs of data refs that can be merged. For example,
if we have (a, b), (a, c), (a, b+1), then after sorting them we should
have either (a, b), (a, b+1), (a, c) or (a, c), (a, b), (a, b+1). We
don't care about the relative order of "non-mergeable" dr pairs here.
So although the sorting result may vary, the final result we get
should not change.


>
>> +  if (int_cst_value (p11.offset) != int_cst_value (p21.offset))
>> +return int_cst_value (p11.offset) < int_cst_value (p21.offset);
>
> This is going to ICE whenever the offsets wouldn't fit into a
> HOST_WIDE_INT.
>
> I'd say you just shouldn't put into the vector entries where offset isn't
> host_integerp, those would never be merged with other checks, or something
> similar.

Do you mean I should use widest_int_cst_value()? Then I will replace
all uses of int_cst_value() here with it. I have also changed the type
of the "diff" variable to HOST_WIDEST_INT.



Thank you very much for your comments!

Cong



>
> Jakub


Re: [PATCH] Relax the requirement of reduction pattern in GCC vectorizer.

2013-10-02 Thread Cong Hou
Ping..  Any comment on this patch?


thanks,
Cong


On Sat, Sep 28, 2013 at 9:34 AM, Xinliang David Li  wrote:
> You can also add a test case of this form:
>
> int foo( int t, int n, int *dst)
> {
>int j = 0;
>int s = 1;
>t++;
>for (j = 0; j < n; j++)
>  {
>  dst[j] = t;
>  s *= t;
>  }
>
>return s;
> }
>
> where without the fix the loop vectorization is missed.
>
> David
>
> On Fri, Sep 27, 2013 at 6:28 PM, Cong Hou  wrote:
>> The current GCC vectorizer requires the following pattern as a simple
>> reduction computation:
>>
>>loop_header:
>>  a1 = phi < a0, a2 >
>>  a3 = ...
>>  a2 = operation (a3, a1)
>>
>> But a3 can also be defined outside of the loop. For example, the
>> following loop can benefit from vectorization but the GCC vectorizer
>> fails to vectorize it:
>>
>>
>> int foo(int v)
>> {
>>   int s = 1;
>>   ++v;
>>   for (int i = 0; i < 10; ++i)
>> s *= v;
>>   return s;
>> }
>>
>>
>> This patch relaxes the original requirement by also considering the
>> following pattern:
>>
>>
>>a3 = ...
>>loop_header:
>>  a1 = phi < a0, a2 >
>>  a2 = operation (a3, a1)
>>
>>
>> A test case is also added. The patch is tested on x86-64.
>>
>>
>> thanks,
>> Cong
>>
>> 
>>
>> diff --git a/gcc/ChangeLog b/gcc/ChangeLog
>> index 39c786e..45c1667 100644
>> --- a/gcc/ChangeLog
>> +++ b/gcc/ChangeLog
>> @@ -1,3 +1,9 @@
>> +2013-09-27  Cong Hou  
>> +
>> + * tree-vect-loop.c: Relax the requirement of the reduction
>> + pattern so that one operand of the reduction operation can
>> + come from outside of the loop.
>> +
>>  2013-09-25  Tom Tromey  
>>
>>   * Makefile.in (PARTITION_H, LTO_SYMTAB_H, COMMON_TARGET_DEF_H)
>> diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
>> index 09644d2..90496a2 100644
>> --- a/gcc/testsuite/ChangeLog
>> +++ b/gcc/testsuite/ChangeLog
>> @@ -1,3 +1,7 @@
>> +2013-09-27  Cong Hou  
>> +
>> + * gcc.dg/vect/vect-reduc-pattern-3.c: New test.
>> +
>>  2013-09-25  Marek Polacek  
>>
>>   PR sanitizer/58413
>> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
>> index 2871ba1..3c51c3b 100644
>> --- a/gcc/tree-vect-loop.c
>> +++ b/gcc/tree-vect-loop.c
>> @@ -2091,6 +2091,13 @@ vect_is_slp_reduction (loop_vec_info loop_info,
>> gimple phi, gimple first_stmt)
>>   a3 = ...
>>   a2 = operation (a3, a1)
>>
>> +   or
>> +
>> +   a3 = ...
>> +   loop_header:
>> + a1 = phi < a0, a2 >
>> + a2 = operation (a3, a1)
>> +
>> such that:
>> 1. operation is commutative and associative and it is safe to
>>change the order of the computation (if CHECK_REDUCTION is true)
>> @@ -2451,6 +2458,7 @@ vect_is_simple_reduction_1 (loop_vec_info
>> loop_info, gimple phi,
>>if (def2 && def2 == phi
>>&& (code == COND_EXPR
>>|| !def1 || gimple_nop_p (def1)
>> +  || !flow_bb_inside_loop_p (loop, gimple_bb (def1))
>>|| (def1 && flow_bb_inside_loop_p (loop, gimple_bb (def1))
>>&& (is_gimple_assign (def1)
>>|| is_gimple_call (def1)
>> @@ -2469,6 +2477,7 @@ vect_is_simple_reduction_1 (loop_vec_info
>> loop_info, gimple phi,
>>if (def1 && def1 == phi
>>&& (code == COND_EXPR
>>|| !def2 || gimple_nop_p (def2)
>> +  || !flow_bb_inside_loop_p (loop, gimple_bb (def2))
>>|| (def2 && flow_bb_inside_loop_p (loop, gimple_bb (def2))
>>&& (is_gimple_assign (def2)
>>|| is_gimple_call (def2)
>> diff --git gcc/testsuite/gcc.dg/vect/vect-reduc-pattern-3.c
>> gcc/testsuite/gcc.dg/vect/vect-reduc-pattern-3.c
>> new file mode 100644
>> index 000..06a9416
>> --- /dev/null
>> +++ gcc/testsuite/gcc.dg/vect/vect-reduc-pattern-3.c
>> @@ -0,0 +1,41 @@
>> +/* { dg-require-effective-target vect_int } */
>> +
>> +#include 
>> +#include "tree-vect.h"
>> +
>> +#define N 10
>> +#define RES 1024
>> +
>> +/* A reduction pattern in which there is no data ref in
>> +   the loop and one operand is defined outside of the loop.  */
>> +
>> +__attribute__ ((noinline)) int
>> +foo (int v)
>> +{
>> +  int i;
>> +  int result = 1;
>> +
>> +  ++v;
>> +  for (i = 0; i < N; i++)
>> +result *= v;
>> +
>> +  return result;
>> +}
>> +
>> +int
>> +main (void)
>> +{
>> +  int res;
>> +
>> +  check_vect ();
>> +
>> +  res = foo (1);
>> +  if (res != RES)
>> +abort ();
>> +
>> +  return 0;
>> +}
>> +
>> +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
>> +/* { dg-final { cleanup-tree-dump "vect" } } */
>> +


Re: [PATCH] Reducing number of alias checks in vectorization.

2013-10-02 Thread Cong Hou
On Wed, Oct 2, 2013 at 4:24 AM, Richard Biener  wrote:
> On Tue, 1 Oct 2013, Cong Hou wrote:
>
>> When alias exists between data refs in a loop, to vectorize it GCC
>> does loop versioning and adds runtime alias checks. Basically for each
>> pair of data refs with possible data dependence, there will be two
>> comparisons generated to make sure there is no aliasing between them
>> in each iteration of the vectorized loop. If there are many such data
>> refs pairs, the number of comparisons can be very large, which is a
>> big overhead.
>>
>> However, in some cases it is possible to reduce the number of those
>> comparisons. For example, for the following loop, we can detect that
>> b[0] and b[1] are two consecutive member accesses so that we can
>> combine the alias check between a[0:100]&b[0] and a[0:100]&b[1] into
>> checking a[0:100]&b[0:2]:
>>
>> void foo(int*a, int* b)
>> {
>>for (int i = 0; i < 100; ++i)
>> a[i] = b[0] + b[1];
>> }
>>
>> Actually, the requirement of consecutive memory accesses is too
>> strict. For the following loop, we can still combine the alias checks
>> between a[0:100]&b[0] and a[0:100]&b[100]:
>>
>> void foo(int*a, int* b)
>> {
>>for (int i = 0; i < 100; ++i)
>> a[i] = b[0] + b[100];
>> }
>>
>> This is because if b[0] is not in a[0:100] and b[100] is not in
>> a[0:100] then a[0:100] cannot be between b[0] and b[100]. We only need
>> to check a[0:100] and b[0:101] don't overlap.
>>
>> More generally, consider two pairs of data refs (a, b1) and (a, b2).
>> Suppose addr_b1 and addr_b2 are basic addresses of data ref b1 and b2;
>> offset_b1 and offset_b2 (offset_b1 < offset_b2) are offsets of b1 and
>> b2, and segment_length_a, segment_length_b1, and segment_length_b2 are
>> segment length of a, b1, and b2. Then we can combine the two
>> comparisons into one if the following condition is satisfied:
>>
>> offset_b2 - offset_b1 - segment_length_b1 < segment_length_a
>>
>>
>> This patch detects those combination opportunities to reduce the
>> number of alias checks. It is tested on an x86-64 machine.
>
> Apart from the other comments you got (to which I agree) the patch
> seems to do two things, namely also:
>
> +  /* Extract load and store statements on pointers with zero-stride
> + accesses.  */
> +  if (LOOP_REQUIRES_VERSIONING_FOR_ALIAS (loop_vinfo))
> +{
>
> which I'd rather see in a separate patch (and done also when
> the loop doesn't require versioning for alias).
>


My mistake. I am working on those two patches at the same time and
pasted that one here by accident. I will send a separate patch for
the "hoist" topic.


> Also combining the alias checks in vect_create_cond_for_alias_checks
> is nice but doesn't properly fix the use of the
> vect-max-version-for-alias-checks param which currently inhibits
> vectorization of the HIMENO benchmark by default (and make us look bad
> compared to LLVM).
>
> So I believe this merging should be done incrementally when
> we collect the DDRs we need to test in vect_mark_for_runtime_alias_test.
>


I agree that the vect-max-version-for-alias-checks param should count
the number of checks after the merge. However, the struct
data_dependence_relation cannot record the new information produced
by the merge, namely the new segment lengths used in the comparisons.
These lengths are calculated right in the
vect_create_cond_for_alias_checks() function. Since
vect-max-version-for-alias-checks is used during the analysis phase,
shall we move all of this (getting the segment length for each data
ref and merging the alias checks) from the transformation phase to
the analysis phase? If we cannot store the result properly
(data_dependence_relation is not enough), shall we do it twice, in
both phases?
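
To make the merging condition itself concrete, here is a minimal
stand-alone C sketch (hypothetical plain structs, not the vectorizer's
data_reference; all lengths in bytes). It reproduces the b[0]/b[100]
example from the first mail: the gap between the two b accesses is
396 bytes, smaller than a's 400-byte segment, so one check against the
union segment suffices:

#include <stdio.h>

struct seg { long offset, len; };  /* the range [offset, offset + len) */

static int
can_merge (struct seg a, struct seg b1, struct seg b2)
{
  /* Assumes b1.offset <= b2.offset.  If a cannot fit in the gap
     between b1 and b2, checking a against their union is enough.  */
  return b2.offset - b1.offset - b1.len < a.len;
}

int
main (void)
{
  struct seg a  = { 0, 400 };  /* a[0:100], 4-byte ints */
  struct seg b1 = { 0, 4 };    /* b[0] */
  struct seg b2 = { 400, 4 };  /* b[100] */

  if (can_merge (a, b1, b2))
    printf ("merged: check a against b segment [%ld, %ld)\n",
            b1.offset, b2.offset + b2.len);
  return 0;
}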

I also noticed a possible bug in the function vect_same_range_drs()
called by vect_prune_runtime_alias_test_list(). For the following code
I get two pairs of data refs after
vect_prune_runtime_alias_test_list(), but in
vect_create_cond_for_alias_checks(), after detecting grouped accesses,
I get two identical pairs of data refs. The consequence is that two
identical alias checks are produced.


void yuv2yuyv_ref (int *d, int *src, int n)
{
  char *dest = (char *)d;
  int i;

  for(i=0;i<n/2;i++){
    dest[i*4 + 0] = (src[i*2 + 0])>>16;
    dest[i*4 + 1] = (src[i*2 + 1])>>8;
    dest[i*4 + 2] = (src[i*2 + 0])>>16;
    dest[i*4 + 3] = (src[i*2 + 0])>>0;
  }
}


I think the solution to this problem is changing

GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt_i))
== GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt_j))

into

STMT_VINFO_DATA_REF (vinfo_for_stmt (GROUP_FIRST_ELEMENT
(v

Re: [PATCH] Reducing number of alias checks in vectorization.

2013-10-02 Thread Cong Hou
On Wed, Oct 2, 2013 at 2:18 PM, Xinliang David Li  wrote:
> On Wed, Oct 2, 2013 at 4:24 AM, Richard Biener  wrote:
>> On Tue, 1 Oct 2013, Cong Hou wrote:
>>
>>> When alias exists between data refs in a loop, to vectorize it GCC
>>> does loop versioning and adds runtime alias checks. Basically for each
>>> pair of data refs with possible data dependence, there will be two
>>> comparisons generated to make sure there is no aliasing between them
>>> in each iteration of the vectorized loop. If there are many such data
>>> refs pairs, the number of comparisons can be very large, which is a
>>> big overhead.
>>>
>>> However, in some cases it is possible to reduce the number of those
>>> comparisons. For example, for the following loop, we can detect that
>>> b[0] and b[1] are two consecutive member accesses so that we can
>>> combine the alias check between a[0:100]&b[0] and a[0:100]&b[1] into
>>> checking a[0:100]&b[0:2]:
>>>
>>> void foo(int*a, int* b)
>>> {
>>>for (int i = 0; i < 100; ++i)
>>> a[i] = b[0] + b[1];
>>> }
>>>
>>> Actually, the requirement of consecutive memory accesses is too
>>> strict. For the following loop, we can still combine the alias checks
>>> between a[0:100]&b[0] and a[0:100]&b[100]:
>>>
>>> void foo(int*a, int* b)
>>> {
>>>for (int i = 0; i < 100; ++i)
>>> a[i] = b[0] + b[100];
>>> }
>>>
>>> This is because if b[0] is not in a[0:100] and b[100] is not in
>>> a[0:100] then a[0:100] cannot be between b[0] and b[100]. We only need
>>> to check a[0:100] and b[0:101] don't overlap.
>>>
>>> More generally, consider two pairs of data refs (a, b1) and (a, b2).
>>> Suppose addr_b1 and addr_b2 are basic addresses of data ref b1 and b2;
>>> offset_b1 and offset_b2 (offset_b1 < offset_b2) are offsets of b1 and
>>> b2, and segment_length_a, segment_length_b1, and segment_length_b2 are
>>> segment length of a, b1, and b2. Then we can combine the two
>>> comparisons into one if the following condition is satisfied:
>>>
>>> offset_b2 - offset_b1 - segment_length_b1 < segment_length_a
>>>
>>>
>>> This patch detects those combination opportunities to reduce the
>>> number of alias checks. It is tested on an x86-64 machine.
>>
>> Apart from the other comments you got (to which I agree) the patch
>> seems to do two things, namely also:
>>
>> +  /* Extract load and store statements on pointers with zero-stride
>> + accesses.  */
>> +  if (LOOP_REQUIRES_VERSIONING_FOR_ALIAS (loop_vinfo))
>> +{
>>
>> which I'd rather see in a separate patch (and done also when
>> the loop doesn't require versioning for alias).
>
> yes.
>
>>
>> Also combining the alias checks in vect_create_cond_for_alias_checks
>> is nice but doesn't properly fix the use of the
>> vect-max-version-for-alias-checks param
>
> Yes. The handling of this should be moved to
> 'vect_prune_runtime_alias_test_list' to avoid premature decisions.
>
>
>
>>which currently inhibits
>> vectorization of the HIMENO benchmark by default (and make us look bad
>> compared to LLVM).
>
> Here is a small reproducer:
>
> struct  A {
>   int *base;
>   int offset;
>   int offset2;
>   int offset3;
>   int offset4;
>   int offset5;
>   int offset6;
>   int offset7;
>   int offset8;
> };
>
> void foo (struct A * ar1, struct A* ar2)
> {
>   int i;
>   for (i = 0; i < 1; i++)
> {
>ar1->base[i]  = 2*ar2->base[i] + ar2->offset + ar2->offset2
> + ar2->offset3 + ar2->offset4 + ar2->offset5 + ar2->offset6; /* +
> ar2->offset7 + ar2->offset8;*/
> }
> }
>
> GCC trunk won't vectorize it at O2 due to the limit.
>
>
> There is another problem we should be tracking: GCC no longer
> vectorizes the loop (even with a large
> --param=vect-max-version-for-alias-checks=40) when -fno-strict-aliasing
> is specified.  However, with an additional runtime alias check, the
> loop should be vectorizable.


The problem can be reproduced by the following loop:


void foo (int* a, int** b)
{
  int i;
  for (i = 0; i < 1000; ++i)
a[i] = (*b)[i];
}


When -fno-strict-aliasing is specified, the basic address of (*b)[i],
which is *b, could be modified by the store to a[i] if the two alias.
This forbids GCC from making the basic address of (*b)[i] a loop
invariant, and hence it could not do 

Re: [PATCH] Reducing number of alias checks in vectorization.

2013-10-02 Thread Cong Hou
Forgot to mention that the alias check merger can reduce the number of
checks from 7 to 2 for this example:

struct  A {
  int *base;
  int offset;
  int offset2;
  int offset3;
  int offset4;
  int offset5;
  int offset6;
  int offset7;
  int offset8;
};

void foo (struct A * ar1, struct A* ar2)
{
  int i;
  for (i = 0; i < 1; i++)
{
   ar1->base[i]  = 2*ar2->base[i] + ar2->offset + ar2->offset2
+ ar2->offset3 + ar2->offset4 + ar2->offset5 + ar2->offset6; /* +
ar2->offset7 + ar2->offset8;*/
}
}


thanks,
Cong


On Wed, Oct 2, 2013 at 2:18 PM, Xinliang David Li  wrote:
> On Wed, Oct 2, 2013 at 4:24 AM, Richard Biener  wrote:
>> On Tue, 1 Oct 2013, Cong Hou wrote:
>>
>>> When alias exists between data refs in a loop, to vectorize it GCC
>>> does loop versioning and adds runtime alias checks. Basically for each
>>> pair of data refs with possible data dependence, there will be two
>>> comparisons generated to make sure there is no aliasing between them
>>> in each iteration of the vectorized loop. If there are many such data
>>> refs pairs, the number of comparisons can be very large, which is a
>>> big overhead.
>>>
>>> However, in some cases it is possible to reduce the number of those
>>> comparisons. For example, for the following loop, we can detect that
>>> b[0] and b[1] are two consecutive member accesses so that we can
>>> combine the alias check between a[0:100]&b[0] and a[0:100]&b[1] into
>>> checking a[0:100]&b[0:2]:
>>>
>>> void foo(int*a, int* b)
>>> {
>>>for (int i = 0; i < 100; ++i)
>>> a[i] = b[0] + b[1];
>>> }
>>>
>>> Actually, the requirement of consecutive memory accesses is too
>>> strict. For the following loop, we can still combine the alias checks
>>> between a[0:100]&b[0] and a[0:100]&b[100]:
>>>
>>> void foo(int*a, int* b)
>>> {
>>>for (int i = 0; i < 100; ++i)
>>> a[i] = b[0] + b[100];
>>> }
>>>
>>> This is because if b[0] is not in a[0:100] and b[100] is not in
>>> a[0:100] then a[0:100] cannot be between b[0] and b[100]. We only need
>>> to check a[0:100] and b[0:101] don't overlap.
>>>
>>> More generally, consider two pairs of data refs (a, b1) and (a, b2).
>>> Suppose addr_b1 and addr_b2 are basic addresses of data ref b1 and b2;
>>> offset_b1 and offset_b2 (offset_b1 < offset_b2) are offsets of b1 and
>>> b2, and segment_length_a, segment_length_b1, and segment_length_b2 are
>>> segment length of a, b1, and b2. Then we can combine the two
>>> comparisons into one if the following condition is satisfied:
>>>
>>> offset_b2 - offset_b1 - segment_length_b1 < segment_length_a
>>>
>>>
>>> This patch detects those combination opportunities to reduce the
>>> number of alias checks. It is tested on an x86-64 machine.
>>
>> Apart from the other comments you got (to which I agree) the patch
>> seems to do two things, namely also:
>>
>> +  /* Extract load and store statements on pointers with zero-stride
>> + accesses.  */
>> +  if (LOOP_REQUIRES_VERSIONING_FOR_ALIAS (loop_vinfo))
>> +{
>>
>> which I'd rather see in a separate patch (and done also when
>> the loop doesn't require versioning for alias).
>
> yes.
>
>>
>> Also combining the alias checks in vect_create_cond_for_alias_checks
>> is nice but doesn't properly fix the use of the
>> vect-max-version-for-alias-checks param
>
> Yes. The handling of this should be moved to
> 'vect_prune_runtime_alias_test_list' to avoid premature decisions.
>
>
>
>>which currently inhibits
>> vectorization of the HIMENO benchmark by default (and make us look bad
>> compared to LLVM).
>
> Here is a small reproducer:
>
> struct  A {
>   int *base;
>   int offset;
>   int offset2;
>   int offset3;
>   int offset4;
>   int offset5;
>   int offset6;
>   int offset7;
>   int offset8;
> };
>
> void foo (struct A * ar1, struct A* ar2)
> {
>   int i;
>   for (i = 0; i < 1; i++)
> {
>ar1->base[i]  = 2*ar2->base[i] + ar2->offset + ar2->offset2
> + ar2->offset3 + ar2->offset4 + ar2->offset5 + ar2->offset6; /* +
> ar2->offset7 + ar2->offset8;*/
> }
> }
>
> GCC trunk won't vectorize it at O2 due to the limit.
>
>
> There is another problem we should be tracking: GCC no longer
> vectorize

Re: [PATCH] Reducing number of alias checks in vectorization.

2013-10-02 Thread Cong Hou
On Wed, Oct 2, 2013 at 2:47 PM, Xinliang David Li  wrote:
> I think you need to augment (using a wrapper class) the DDR to capture
> more information about aliased memory pairs. It should be flexible
> enough to handle the following cases (you don't have to handle all
> cases in your first patch, but keep those in mind).


In order to carry the information in this augmented structure from the
analysis phase to the transformation phase, should we add one more
member to loop_vec_info? Note that currently almost all
vectorization-related information is contained in that struct.


>
> 1) All accesses in the same group have constant offsets:
>
> b[i], b[i+1], b[i+2] etc

This is the easy case.

>
> 2) Accesses in the same group may have offset which is specified by a
> unsigned value:
>
>unsigned N = ...
>
>b[i], b[i+N]

If the value of N or its upper bound (see the next case) is unknown at
compile time, we cannot merge the alias checks for a & b[i] and a &
b[i+N]. This is because the segment of a may lie between b[i] and
b[i+N].
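
For example, in the following sketch (my illustration, not code from
the patch), the two checks must stay separate:

void
bar (int *a, int *b, unsigned N)
{
  int i;
  /* a[0:100] may lie entirely in the gap between b[i] and b[i+N],
     so the checks a&b[i:...] and a&b[i+N:...] cannot be combined.  */
  for (i = 0; i < 100; ++i)
    a[i] = b[i] + b[i + N];
}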

>
> 3) Accesses have offset with value range > 0:
>
>for (j = 0; j < 1; j++)
>for (i = 0; i < ...; i++)
>  {
>    b[i] 
>    b[i + j ]    // j > 0
>   }
>

If we know j is greater than 0 and has a constant upper bound, we can
use this information when merging the alias checks. For an induction
variable j, the upper bound can be queried easily. But what if j is
not an induction variable:

unsigned j = ...;
if (j < 1000)
{
 for (i = 0; i < ...; i++)
  {
 b[i] 
 b[i + j ] 
  }
}

In the current GCC implementation, how can we get the upper bound of j
here? Should we search the control-dependent predicates of the loop to
see whether we are lucky enough to find an upper bound for j?


>
> 4) base addresses are assigned from the same buffer:
>
> b1  = &buffer[0];
> b2 = &buffer[1];
> b3 = &buffer[2];
>
> for (...)
>   {
>  ..b1[i]..
>  ..b2[i]..
>  ..
>}

This case helped me find a bug in my patch. Here the basic address of
b1 is an addr_expr &buffer instead of buffer. I should no longer
compare the pointer values of two basic addresses but should use
operand_equal_p() instead. Jakub is right, then: I should not sort the
ddr pairs by comparing pointer values. I once wrote a comparison
function and will consider using it for the sorting.
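
As a stand-alone illustration of that point (hypothetical record type;
the strings stand in for the tree operands that operand_equal_p() or a
compare_tree-style walk would inspect), a structural sort key gives the
same order on every run, while raw pointer values do not:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct ref { const char *base; long offset; };

static int
ref_cmp (const void *p, const void *q)
{
  const struct ref *x = (const struct ref *) p;
  const struct ref *y = (const struct ref *) q;
  int c = strcmp (x->base, y->base);  /* structural key, not the address */
  if (c != 0)
    return c;
  return (x->offset > y->offset) - (x->offset < y->offset);
}

int
main (void)
{
  struct ref refs[] = { { "&buffer", 8 }, { "&buffer", 0 }, { "a_6(D)", 4 } };
  qsort (refs, sizeof refs / sizeof refs[0], sizeof refs[0], ref_cmp);
  for (int i = 0; i < 3; i++)
    printf ("%s + %ld\n", refs[i].base, refs[i].offset);
  return 0;
}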


>
> 5) More elaborate case:
>
>for (i = 0; i< 3; i++)
>   base[i] = &buffer[i*N];
>
>  b1 = base[0];
>  b2 = base[1];
>  ...
> for ()
> {
>.. b1[i]..
> ..
> }

After loop unrolling this case becomes the same as the last one.


thanks,
Cong


>
> David
>
>
> On Wed, Oct 2, 2013 at 2:34 PM, Cong Hou  wrote:
>> On Wed, Oct 2, 2013 at 4:24 AM, Richard Biener  wrote:
>>> On Tue, 1 Oct 2013, Cong Hou wrote:
>>>
>>>> When alias exists between data refs in a loop, to vectorize it GCC
>>>> does loop versioning and adds runtime alias checks. Basically for each
>>>> pair of data refs with possible data dependence, there will be two
>>>> comparisons generated to make sure there is no aliasing between them
>>>> in each iteration of the vectorized loop. If there are many such data
>>>> refs pairs, the number of comparisons can be very large, which is a
>>>> big overhead.
>>>>
>>>> However, in some cases it is possible to reduce the number of those
>>>> comparisons. For example, for the following loop, we can detect that
>>>> b[0] and b[1] are two consecutive member accesses so that we can
>>>> combine the alias check between a[0:100]&b[0] and a[0:100]&b[1] into
>>>> checking a[0:100]&b[0:2]:
>>>>
>>>> void foo(int*a, int* b)
>>>> {
>>>>for (int i = 0; i < 100; ++i)
>>>> a[i] = b[0] + b[1];
>>>> }
>>>>
>>>> Actually, the requirement of consecutive memory accesses is too
>>>> strict. For the following loop, we can still combine the alias checks
>>>> between a[0:100]&b[0] and a[0:100]&b[100]:
>>>>
>>>> void foo(int*a, int* b)
>>>> {
>>>>for (int i = 0; i < 100; ++i)
>>>> a[i] = b[0] + b[100];
>>>> }
>>>>
>>>> This is because if b[0] is not in a[0:100] and b[100] is not in
>>>> a[0:100] then a[0:100] cannot be between b[0] and b[100]. We only need
>>>> to check a[0:100] and b[0:101] don't overlap.
>>>>
>>&

Re: [PATCH] Reducing number of alias checks in vectorization.

2013-10-03 Thread Cong Hou
I noticed that there is a "struct dataref_aux" defined in
tree-vectorizer.h which is specific to the vectorizer pass and is
stored in (void*)aux in "struct data_reference". Can we add one more
field "segment_length" to dataref_aux so that we can pass this
information for merging alias checks? Then we can avoid modifying or
creating other structures.


thanks,
Cong


On Wed, Oct 2, 2013 at 2:34 PM, Cong Hou  wrote:
> On Wed, Oct 2, 2013 at 4:24 AM, Richard Biener  wrote:
>> On Tue, 1 Oct 2013, Cong Hou wrote:
>>
>>> When alias exists between data refs in a loop, to vectorize it GCC
>>> does loop versioning and adds runtime alias checks. Basically for each
>>> pair of data refs with possible data dependence, there will be two
>>> comparisons generated to make sure there is no aliasing between them
>>> in each iteration of the vectorized loop. If there are many such data
>>> refs pairs, the number of comparisons can be very large, which is a
>>> big overhead.
>>>
>>> However, in some cases it is possible to reduce the number of those
>>> comparisons. For example, for the following loop, we can detect that
>>> b[0] and b[1] are two consecutive member accesses so that we can
>>> combine the alias check between a[0:100]&b[0] and a[0:100]&b[1] into
>>> checking a[0:100]&b[0:2]:
>>>
>>> void foo(int*a, int* b)
>>> {
>>>for (int i = 0; i < 100; ++i)
>>> a[i] = b[0] + b[1];
>>> }
>>>
>>> Actually, the requirement of consecutive memory accesses is too
>>> strict. For the following loop, we can still combine the alias checks
>>> between a[0:100]&b[0] and a[0:100]&b[100]:
>>>
>>> void foo(int*a, int* b)
>>> {
>>>for (int i = 0; i < 100; ++i)
>>> a[i] = b[0] + b[100];
>>> }
>>>
>>> This is because if b[0] is not in a[0:100] and b[100] is not in
>>> a[0:100] then a[0:100] cannot be between b[0] and b[100]. We only need
>>> to check a[0:100] and b[0:101] don't overlap.
>>>
>>> More generally, consider two pairs of data refs (a, b1) and (a, b2).
>>> Suppose addr_b1 and addr_b2 are basic addresses of data ref b1 and b2;
>>> offset_b1 and offset_b2 (offset_b1 < offset_b2) are offsets of b1 and
>>> b2, and segment_length_a, segment_length_b1, and segment_length_b2 are
>>> segment length of a, b1, and b2. Then we can combine the two
>>> comparisons into one if the following condition is satisfied:
>>>
>>> offset_b2 - offset_b1 - segment_length_b1 < segment_length_a
>>>
>>>
>>> This patch detects those combination opportunities to reduce the
>>> number of alias checks. It is tested on an x86-64 machine.
>>
>> Apart from the other comments you got (to which I agree) the patch
>> seems to do two things, namely also:
>>
>> +  /* Extract load and store statements on pointers with zero-stride
>> + accesses.  */
>> +  if (LOOP_REQUIRES_VERSIONING_FOR_ALIAS (loop_vinfo))
>> +{
>>
>> which I'd rather see in a separate patch (and done also when
>> the loop doesn't require versioning for alias).
>>
>
>
> My mistake. I am working on those two patches at the same time and
> pasted that one here by accident. I will send a separate patch for
> the "hoist" topic.
>
>
>> Also combining the alias checks in vect_create_cond_for_alias_checks
>> is nice but doesn't properly fix the use of the
>> vect-max-version-for-alias-checks param which currently inhibits
>> vectorization of the HIMENO benchmark by default (and make us look bad
>> compared to LLVM).
>>
>> So I believe this merging should be done incrementally when
>> we collect the DDRs we need to test in vect_mark_for_runtime_alias_test.
>>
>
>
> I agree that the vect-max-version-for-alias-checks param should count
> the number of checks after the merge. However, the struct
> data_dependence_relation cannot record the new information produced
> by the merge, namely the new segment lengths used in the comparisons.
> These lengths are calculated right in the
> vect_create_cond_for_alias_checks() function. Since
> vect-max-version-for-alias-checks is used during the analysis phase,
> shall we move all of this (getting the segment length for each data
> ref and merging the alias checks) from the transformation phase to
> the analysis phase? If we cannot store the result properly
> (data_dependence_relation is not enough), shall we do it twice, in
> both phases?
>
> I also noticed a possibl

Re: [PATCH] Reducing number of alias checks in vectorization.

2013-10-03 Thread Cong Hou
On Thu, Oct 3, 2013 at 2:06 PM, Joseph S. Myers  wrote:
> On Tue, 1 Oct 2013, Cong Hou wrote:
>
>> +#include <vector>
>> +#include <utility>
>> +#include <algorithm>
>> +
>>  #include "config.h"
>
> Whatever the other issues about including these headers at all, any system
> header (C or C++) must always be included *after* config.h, as config.h
> may define feature test macros that are only properly effective if defined
> before any system headers are included, and these macros (affecting such
> things as the size of off_t) need to be consistent throughout GCC.
>

OK. I actually ran into some conflicts when I put those three C++
headers after all the other includes.

Thank you for the comments.


Cong


> --
> Joseph S. Myers
> jos...@codesourcery.com


Re: [PATCH] Reducing number of alias checks in vectorization.

2013-10-03 Thread Cong Hou
Forget about the "aux" idea: the segment length for one data ref
can differ between dr pairs.

In my patch I created a struct as shown below:

struct dr_addr_with_seg_len
{
  data_reference *dr;
  tree basic_addr;
  tree offset;
  tree seg_len;
};


Note that basic_addr and offset can always be obtained from dr, but we
need to store two segment lengths for each dr pair. It is improper to
add a field to data_dependence_relation, as it is defined outside of
the vectorizer. We could change the type of may_alias_ddrs in
loop_vec_info to a new one combining data_dependence_relation and the
segment lengths, but then we would have to add a new type to
tree-vectorizer.h that is used in only two places - still too much.

One possible solution is to create a local struct as shown above plus
a new function that returns the merged alias check information. This
function would be called twice: once during the analysis phase and
once during the transformation phase. Then we don't have to store the
merged alias check information between the two phases. The additional
time cost is minimal, as there will not be many data-dependent dr
pairs in a loop.

Any comment?


thanks,
Cong


On Thu, Oct 3, 2013 at 10:57 AM, Cong Hou  wrote:
> I noticed that there is a "struct dataref_aux" defined in
> tree-vectorizer.h which is specific to the vectorizer pass and is
> stored in (void*)aux in "struct data_reference". Can we add one more
> field "segment_length" to dataref_aux so that we can pass this
> information for merging alias checks? Then we can avoid modifying or
> creating other structures.
>
>
> thanks,
> Cong
>
>
> On Wed, Oct 2, 2013 at 2:34 PM, Cong Hou  wrote:
>> On Wed, Oct 2, 2013 at 4:24 AM, Richard Biener  wrote:
>>> On Tue, 1 Oct 2013, Cong Hou wrote:
>>>
>>>> When alias exists between data refs in a loop, to vectorize it GCC
>>>> does loop versioning and adds runtime alias checks. Basically for each
>>>> pair of data refs with possible data dependence, there will be two
>>>> comparisons generated to make sure there is no aliasing between them
>>>> in each iteration of the vectorized loop. If there are many such data
>>>> refs pairs, the number of comparisons can be very large, which is a
>>>> big overhead.
>>>>
>>>> However, in some cases it is possible to reduce the number of those
>>>> comparisons. For example, for the following loop, we can detect that
>>>> b[0] and b[1] are two consecutive member accesses so that we can
>>>> combine the alias check between a[0:100]&b[0] and a[0:100]&b[1] into
>>>> checking a[0:100]&b[0:2]:
>>>>
>>>> void foo(int*a, int* b)
>>>> {
>>>>for (int i = 0; i < 100; ++i)
>>>> a[i] = b[0] + b[1];
>>>> }
>>>>
>>>> Actually, the requirement of consecutive memory accesses is too
>>>> strict. For the following loop, we can still combine the alias checks
>>>> between a[0:100]&b[0] and a[0:100]&b[100]:
>>>>
>>>> void foo(int*a, int* b)
>>>> {
>>>>for (int i = 0; i < 100; ++i)
>>>> a[i] = b[0] + b[100];
>>>> }
>>>>
>>>> This is because if b[0] is not in a[0:100] and b[100] is not in
>>>> a[0:100] then a[0:100] cannot be between b[0] and b[100]. We only need
>>>> to check a[0:100] and b[0:101] don't overlap.
>>>>
>>>> More generally, consider two pairs of data refs (a, b1) and (a, b2).
>>>> Suppose addr_b1 and addr_b2 are basic addresses of data ref b1 and b2;
>>>> offset_b1 and offset_b2 (offset_b1 < offset_b2) are offsets of b1 and
>>>> b2, and segment_length_a, segment_length_b1, and segment_length_b2 are
>>>> segment length of a, b1, and b2. Then we can combine the two
>>>> comparisons into one if the following condition is satisfied:
>>>>
>>>> offset_b2 - offset_b1 - segment_length_b1 < segment_length_a
>>>>
>>>>
>>>> This patch detects those combination opportunities to reduce the
>>>> number of alias checks. It is tested on an x86-64 machine.
>>>
>>> Apart from the other comments you got (to which I agree) the patch
>>> seems to do two things, namely also:
>>>
>>> +  /* Extract load and store statements on pointers with zero-stride
>>> + accesses.  */
>>> +  if (LOOP_REQUIRES_VERSIONING_FOR_ALIAS (loop_vinfo))
>>> +{
>>>
>>> which I'd rather see in a separate patch (and done a

[PATCH] Hoist loop invariant statements containing data refs with zero-step during loop-versioning in vectorization.

2013-10-03 Thread Cong Hou
During loop versioning in vectorization, the alias check guarantees
that any load of a data reference with zero-step is a loop invariant,
which can be hoisted outside of the loop. After hoisting the load
statement, there may exist more loop invariant statements. This patch
tries to find all those statements and hoists them before the loop.

An example is shown below:


for (i = 0; i < N; ++i)
  a[i] = *b + 1;


After loop versioning the loop to be vectorized is guarded by

if (b + 1 < a || a + N < b)

which means there is no aliasing between *b and a[i]. The GIMPLE code
of the loop body is:

  <bb 5>:
  # i_18 = PHI <0(4), i_29(6)>
  # ivtmp_22 = PHI <1(4), ivtmp_30(6)>
  _23 = (long unsigned int) i_18;
  _24 = _23 * 4;
  _25 = a_6(D) + _24;
  _26 = *b_8(D);   => loop invariant
  _27 = _26 + 1;   => loop invariant
  *_25 = _27;
  i_29 = i_18 + 1;
  ivtmp_30 = ivtmp_22 - 1;
  if (ivtmp_30 != 0)
    goto <bb 6>;
  else
    goto <bb 7>;


After hoisting loop invariant statements:


  _26 = *b_8(D);
  _27 = _26 + 1;

  <bb 5>:
  # i_18 = PHI <0(4), i_29(6)>
  # ivtmp_22 = PHI <1(4), ivtmp_30(6)>
  _23 = (long unsigned int) i_18;
  _24 = _23 * 4;
  _25 = a_6(D) + _24;
  *_25 = _27;
  i_29 = i_18 + 1;
  ivtmp_30 = ivtmp_22 - 1;
  if (ivtmp_30 != 0)
    goto <bb 6>;
  else
    goto <bb 7>;


This patch is related to the bug report
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58508
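
To see why searching beyond the load itself matters, here is a sketch
of mine (not the testcase from the patch) in which a whole chain of
statements becomes invariant once the zero-step load is hoisted:

void
baz (int *a, int *b, int n)
{
  int i;
  for (i = 0; i < n; ++i)
    {
      int t0 = *b;      /* zero-step load: hoisted first */
      int t1 = t0 * 2;  /* becomes invariant once t0 is hoisted */
      int t2 = t1 + 3;  /* likewise */
      a[i] = t2;
    }
}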


thanks,
Cong
diff --git gcc/testsuite/gcc.dg/vect/pr58508.c 
gcc/testsuite/gcc.dg/vect/pr58508.c
new file mode 100644
index 000..cb22b50
--- /dev/null
+++ gcc/testsuite/gcc.dg/vect/pr58508.c
@@ -0,0 +1,20 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -fdump-tree-vect-details" } */
+
+
+/* The GCC vectorizer generates loop versioning for the following loop
+   since there may exist aliasing between A and B.  The predicate checks
+   if A may alias with B across all iterations.  Then for the loop in
+   the true body, we can assert that *B is a loop invariant so that
+   we can hoist the load of *B before the loop body.  */
+
+void foo (int* a, int* b)
+{
+  int i;
+  for (i = 0; i < 10; ++i)
+a[i] = *b + 1;
+}
+
+
+/* { dg-final { scan-tree-dump-times "hoist" 2 "vect" } } */
+/* { dg-final { cleanup-tree-dump "vect" } } */


Re: [PATCH] Fixing improper conversion from sin() to sinf() in optimization mode.

2013-10-03 Thread Cong Hou
Ping...


thanks,
Cong


On Fri, Sep 20, 2013 at 9:49 AM, Cong Hou  wrote:
> Any comment or more suggestions on this patch?
>
>
> thanks,
> Cong
>
> On Mon, Sep 9, 2013 at 7:28 PM, Cong Hou  wrote:
>> On Mon, Sep 9, 2013 at 6:26 PM, Xinliang David Li  wrote:
>>> On Fri, Sep 6, 2013 at 3:24 PM, Cong Hou  wrote:
>>>> First, thank you for your detailed comments again! Then I deeply
>>>> apologize for not explaining my patch properly and responding to your
>>>> previous comment. I didn't understand thoroughly the problem before
>>>> submitting the patch.
>>>>
>>>> Previously I only considered the following three conversions for sqrt():
>>>>
>>>>
>>>> 1: (float) sqrt ((double) float_val)  ->  sqrtf (float_val)
>>>> 2: (float) sqrtl ((long double) float_val)  ->  sqrtf (float_val)
>>>> 3: (double) sqrtl ((long double) double_val)  ->  sqrt (double_val)
>>>>
>>>>
>>>> We have four types here:
>>>>
>>>> TYPE is the type to which the result of the function call is converted.
>>>> ITYPE is the type of the math call expression.
>>>> TREE_TYPE(arg0) is the type of the function argument (before type 
>>>> conversion).
>>>> NEWTYPE is chosen from TYPE and TREE_TYPE(arg0) with higher precision.
>>>> It will be the type of the new math call expression after conversion.
>>>>
>>>> For all three cases above, TYPE is always the same as NEWTYPE. That is
>>>> why I only considered TYPE during the precision comparison. ITYPE can
>>>> only be double_type_node or long_double_type_node depending on the
>>>> type of the math function. That is why I explicitly used those two
>>>> types instead of ITYPE (no correctness issue). But you are right,
>>>> ITYPE is more elegant and better here.
>>>>
>>>> After further analysis, I found I missed two more cases. Note that we
>>>> have the following conditions according to the code in convert.c:
>>>>
>>>> TYPE_PRECISION(NEWTYPE) >= TYPE_PRECISION(TYPE)
>>>> TYPE_PRECISION(NEWTYPE) >= TYPE_PRECISION(TREE_TYPE(arg0))
>>>> TYPE_PRECISION (NEWTYPE) < TYPE_PRECISION (ITYPE)
>>>>
>>>> the last condition comes from the fact that we only consider
>>>> converting a math function call into another one with narrower type.
>>>> Therefore we have
>>>>
>>>> TYPE_PRECISION(TYPE) < TYPE_PRECISION (ITYPE)
>>>> TYPE_PRECISION(TREE_TYPE(arg0)) < TYPE_PRECISION (ITYPE)
>>>>
>>>> So for sqrt(), TYPE and TREE_TYPE(arg0) can only be float, and for
>>>> sqrtl(), TYPE and TREE_TYPE(arg0) can be either float or double with
>>>> four possible combinations. Therefore we have two more conversions to
>>>> consider besides the three ones I mentioned above:
>>>>
>>>>
>>>> 4: (float) sqrtl ((long double) double_val)  ->  (float) sqrt (double_val)
>>>> 5: (double) sqrtl ((long double) float_val)  ->  sqrt ((double) float_val)
>>>>
>>>>
>>>> For the first conversion here, TYPE (float) is different from NEWTYPE
>>>> (double), and my previous patch doesn't handle this case.The correct
>>>> way is to compare precisions of ITYPE and NEWTYPE now.
>>>>
>>>> To sum up, we are converting the expression
>>>>
>>>> (TYPE) sqrtITYPE ((ITYPE) expr)
>>>>
>>>> to
>>>>
>>>> (TYPE) sqrtNEWTYPE ((NEWTYPE) expr)
>>>>
>>>> and we require
>>>>
>>>> PRECISION (ITYPE) >= PRECISION (NEWTYPE) * 2 + 2
>>>>
>>>> to make it a safe conversion.
>>>>
>>>>
>>>> The new patch is pasted below.
>>>>
>>>> I appreciate your detailed comments and analysis, and next time when I
>>>> submit a patch I will be more carefully about the reviewer's comment.
>>>>
>>>>
>>>> Thank you!
>>>>
>>>> Cong
>>>>
>>>>
>>>>
>>>> Index: gcc/convert.c
>>>> ===
>>>> --- gcc/convert.c (revision 201891)
>>>> +++ gcc/convert.c (working copy)
>>>> @@ -135,16 +135,19 @@ convert_to_real (tree type, tree expr)
>>>>CASE_MATHFN (COS)
>>>>CASE_MATHFN (ERF)
>

Re: [PATCH] Fixing improper conversion from sin() to sinf() in optimization mode.

2013-10-07 Thread Cong Hou
You are right. I am not an expert on numerical analysis, but I tested
your case and it proves that conversion number 4 is not safe.

Now we have four conversions that are safe once the precision
requirement is satisfied. I added the condition if (type != newtype)
to reject the unsafe one: in that case one more conversion is
introduced, which leads to the unsafe result. If you think this
condition does not make sense, please let me know.

The new patch is shown below (the attached file has tabs).

Thank you very much!



thanks,
Cong



Index: gcc/convert.c
===
--- gcc/convert.c (revision 203250)
+++ gcc/convert.c (working copy)
@@ -135,16 +135,19 @@ convert_to_real (tree type, tree expr)
   CASE_MATHFN (COS)
   CASE_MATHFN (ERF)
   CASE_MATHFN (ERFC)
-  CASE_MATHFN (FABS)
   CASE_MATHFN (LOG)
   CASE_MATHFN (LOG10)
   CASE_MATHFN (LOG2)
   CASE_MATHFN (LOG1P)
-  CASE_MATHFN (LOGB)
   CASE_MATHFN (SIN)
-  CASE_MATHFN (SQRT)
   CASE_MATHFN (TAN)
   CASE_MATHFN (TANH)
+/* The above functions are not safe for this conversion.  */
+if (!flag_unsafe_math_optimizations)
+  break;
+  CASE_MATHFN (SQRT)
+  CASE_MATHFN (FABS)
+  CASE_MATHFN (LOGB)
 #undef CASE_MATHFN
 {
   tree arg0 = strip_float_extensions (CALL_EXPR_ARG (expr, 0));
@@ -155,13 +158,43 @@ convert_to_real (tree type, tree expr)
   if (TYPE_PRECISION (TREE_TYPE (arg0)) > TYPE_PRECISION (type))
  newtype = TREE_TYPE (arg0);

+  /* We consider converting
+
+ (T1) sqrtT2 ((T2) exprT3)
+ to
+ (T1) sqrtT4 ((T4) exprT3),
+
+ where T1 is TYPE, T2 is ITYPE, T3 is TREE_TYPE (ARG0),
+ and T4 is NEWTYPE. All of them are floating point types.
+ T4 (NEWTYPE) should be narrower than T2 (ITYPE). This conversion
+ is safe only if P1 >= P2*2+2, where P1 and P2 are precisions of
+ T2 and T4. See the following URL for a reference:
+ 
http://stackoverflow.com/questions/9235456/determining-floating-point-square-root
+ */
+  if ((fcode == BUILT_IN_SQRT || fcode == BUILT_IN_SQRTL)
+  && !flag_unsafe_math_optimizations)
+ {
+  /* The following conversion is unsafe even if the precision condition
+ below is satisfied:
+
+ (float) sqrtl ((long double) double_val) -> (float) sqrt (double_val)
+*/
+  if (type != newtype)
+break;
+
+  int p1 = REAL_MODE_FORMAT (TYPE_MODE (itype))->p;
+  int p2 = REAL_MODE_FORMAT (TYPE_MODE (newtype))->p;
+  if (p1 < p2 * 2 + 2)
+break;
+ }
+
   /* Be careful about integer to fp conversions.
  These may overflow still.  */
   if (FLOAT_TYPE_P (TREE_TYPE (arg0))
   && TYPE_PRECISION (newtype) < TYPE_PRECISION (itype)
   && (TYPE_MODE (newtype) == TYPE_MODE (double_type_node)
   || TYPE_MODE (newtype) == TYPE_MODE (float_type_node)))
-{
+ {
   tree fn = mathfn_built_in (newtype, fcode);

   if (fn)
Index: gcc/ChangeLog
===
--- gcc/ChangeLog (revision 203250)
+++ gcc/ChangeLog (working copy)
@@ -1,3 +1,9 @@
+2013-10-07  Cong Hou  
+
+ * convert.c (convert_to_real): Forbid unsafe math function
+ conversions including sin/cos/log etc. Add precision check
+ for sqrt.
+
 2013-10-07  Bill Schmidt  

  * config/rs6000/rs6000.c (altivec_expand_vec_perm_const_le): New.
Index: gcc/testsuite/ChangeLog
===
--- gcc/testsuite/ChangeLog (revision 203250)
+++ gcc/testsuite/ChangeLog (working copy)
@@ -1,3 +1,7 @@
+2013-10-07  Cong Hou  
+
+ * gcc.c-torture/execute/20030125-1.c: Update.
+
 2013-10-07  Bill Schmidt  

  * gcc.target/powerpc/pr43154.c: Skip for ppc64 little endian.
Index: gcc/testsuite/gcc.c-torture/execute/20030125-1.c
===
--- gcc/testsuite/gcc.c-torture/execute/20030125-1.c (revision 203250)
+++ gcc/testsuite/gcc.c-torture/execute/20030125-1.c (working copy)
@@ -44,11 +44,11 @@ __attribute__ ((noinline))
 double
 sin(double a)
 {
- abort ();
+ return a;
 }
 __attribute__ ((noinline))
 float
 sinf(float a)
 {
- return a;
+ abort ();
 }
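
As a quick sanity check of the p1 >= p2 * 2 + 2 test in the patch,
here is a stand-alone probe (my illustration, not part of the patch)
using the mantissa widths from <float.h>. On x86, float, double and
x87 long double have p = 24, 53 and 64, so sqrt -> sqrtf passes
(53 >= 50) while sqrtl -> sqrt fails (64 < 108):

#include <stdio.h>
#include <float.h>

int
main (void)
{
  printf ("sqrt  -> sqrtf: %s\n",
          DBL_MANT_DIG >= 2 * FLT_MANT_DIG + 2 ? "safe" : "unsafe");
  printf ("sqrtl -> sqrt : %s\n",
          LDBL_MANT_DIG >= 2 * DBL_MANT_DIG + 2 ? "safe" : "unsafe");
  return 0;
}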




On Thu, Oct 3, 2013 at 5:06 PM, Joseph S. Myers  wrote:
> On Fri, 6 Sep 2013, Cong Hou wrote:
>
>> 4: (float) sqrtl ((long double) double_val)  ->  (float) sqrt (double_val)
>
> I don't believe this case is in fact safe even if precision (long double)
> >= precision (double) * 2 + 2 (when your patch would allow it).
>
> The result that precision (double) * 2 + 2 is sufficient for the result of
> rounding the long double value to double to be the same as the result of
> rounding once from infinite precision to double would I think also mean
> the same when rounding of the infinite-precision result to float happens
> once - that is, if instead of (float) sqrt (double_val) you have fsqrt
> (double_val) (fsqrt being the proposed function in draft TS 18661-1 for
> computing a square root

Fwd: [PATCH] Reducing number of alias checks in vectorization.

2013-10-14 Thread Cong Hou
Sorry for forgetting to use plain-text mode. Resending it.


-- Forwarded message --
From: Cong Hou 
Date: Mon, Oct 14, 2013 at 3:29 PM
Subject: Re: [PATCH] Reducing number of alias checks in vectorization.
To: Richard Biener , GCC Patches 
Cc: Jakub Jelinek 


I have made a new patch for this issue according to your comments.

There are several modifications to my previous patch:


1. Remove the use of STL features such as vector and sort. Use GCC's
vec and qsort instead.

2. Comparisons between tree nodes are not based on their addresses any
more. Use compare_tree() function instead.

3. The function vect_create_cond_for_alias_checks() now returns the
number of alias checks. If its second parameter cond_expr is NULL,
then this function only calculates the number of alias checks after
the merging and won't generate comparison expressions (see the sketch
after this list).

4. The function vect_prune_runtime_alias_test_list() now uses
vect_create_cond_for_alias_checks() to get the number of alias checks.
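
The NULL-cond_expr convention in items 3 and 4 can be sketched
stand-alone as follows (hypothetical types and names, not the actual
GCC interfaces): the same routine is called once in the analysis phase
just for the count, and once in the transformation phase to emit the
comparisons, so nothing has to be stored in between.

#include <stdio.h>

struct seg { long lo, hi; };  /* one segment as a half-open range */

static unsigned
create_alias_checks (const struct seg *segs, unsigned n, int emit)
{
  unsigned count = 0;
  for (unsigned i = 0; i + 1 < n; i += 2)
    {
      count++;  /* one comparison per merged segment pair */
      if (emit)
        printf ("check [%ld,%ld) vs. [%ld,%ld)\n",
                segs[i].lo, segs[i].hi, segs[i + 1].lo, segs[i + 1].hi);
    }
  return count;
}

int
main (void)
{
  struct seg segs[4] = { { 0, 400 }, { 400, 408 }, { 0, 400 }, { 800, 804 } };
  unsigned n = create_alias_checks (segs, 4, 0);  /* analysis: count only */
  create_alias_checks (segs, 4, 1);               /* transform: emit */
  printf ("%u checks\n", n);
  return 0;
}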


The patch is attached as a text file.

Please give me your comment on this patch. Thank you!


Cong


On Thu, Oct 3, 2013 at 2:35 PM, Cong Hou  wrote:
>
> Forget about the "aux" idea: the segment length for one data ref
> can differ between dr pairs.
>
> In my patch I created a struct as shown below:
>
> struct dr_addr_with_seg_len
> {
>   data_reference *dr;
>   tree basic_addr;
>   tree offset;
>   tree seg_len;
> };
>
>
> Note that basic_addr and offset can always be obtained from dr, but we
> need to store two segment lengths for each dr pair. It is improper to
> add a field to data_dependence_relation, as it is defined outside of
> the vectorizer. We could change the type of may_alias_ddrs in
> loop_vec_info to a new one combining data_dependence_relation and the
> segment lengths, but then we would have to add a new type to
> tree-vectorizer.h that is used in only two places - still too much.
>
> One possible solution is to create a local struct as shown above plus
> a new function that returns the merged alias check information. This
> function would be called twice: once during the analysis phase and
> once during the transformation phase. Then we don't have to store the
> merged alias check information between the two phases. The additional
> time cost is minimal, as there will not be many data-dependent dr
> pairs in a loop.
>
> Any comment?
>
>
> thanks,
> Cong
>
>
> On Thu, Oct 3, 2013 at 10:57 AM, Cong Hou  wrote:
> > I noticed that there is a "struct dataref_aux" defined in
> > tree-vectorizer.h which is specific to the vectorizer pass and is
> > stored in (void*)aux in "struct data_reference". Can we add one more
> > field "segment_length" to dataref_aux so that we can pass this
> > information for merging alias checks? Then we can avoid modifying or
> > creating other structures.
> >
> >
> > thanks,
> > Cong
> >
> >
> > On Wed, Oct 2, 2013 at 2:34 PM, Cong Hou  wrote:
> >> On Wed, Oct 2, 2013 at 4:24 AM, Richard Biener  wrote:
> >>> On Tue, 1 Oct 2013, Cong Hou wrote:
> >>>
> >>>> When alias exists between data refs in a loop, to vectorize it GCC
> >>>> does loop versioning and adds runtime alias checks. Basically for each
> >>>> pair of data refs with possible data dependence, there will be two
> >>>> comparisons generated to make sure there is no aliasing between them
> >>>> in each iteration of the vectorized loop. If there are many such data
> >>>> refs pairs, the number of comparisons can be very large, which is a
> >>>> big overhead.
> >>>>
> >>>> However, in some cases it is possible to reduce the number of those
> >>>> comparisons. For example, for the following loop, we can detect that
> >>>> b[0] and b[1] are two consecutive member accesses so that we can
> >>>> combine the alias check between a[0:100]&b[0] and a[0:100]&b[1] into
> >>>> checking a[0:100]&b[0:2]:
> >>>>
> >>>> void foo(int*a, int* b)
> >>>> {
> >>>>for (int i = 0; i < 100; ++i)
> >>>> a[i] = b[0] + b[1];
> >>>> }
> >>>>
> >>>> Actually, the requirement of consecutive memory accesses is too
> >>>> strict. For the following loop, we can still combine the alias checks
> >>>> between a[0:100]&b[0] and a[0:100]&b[100]:
> >>>>
> >>>> void foo(int*a, int* b)
> >>>> {
> >>>>for (int i = 0; i < 100; ++i)
> >&

Re: [PATCH] Hoist loop invariant statements containing data refs with zero-step during loop-versioning in vectorization.

2013-10-14 Thread Cong Hou
Any comment on this patch?


thanks,
Cong


On Thu, Oct 3, 2013 at 3:59 PM, Cong Hou  wrote:
> During loop versioning in vectorization, the alias check guarantees
> that any load of a data reference with zero-step is a loop invariant,
> which can be hoisted outside of the loop. After hoisting the load
> statement, there may exist more loop invariant statements. This patch
> tries to find all those statements and hoists them before the loop.
>
> An example is shown below:
>
>
> for (i = 0; i < N; ++i)
>   a[i] = *b + 1;
>
>
> After loop versioning the loop to be vectorized is guarded by
>
> if (b + 1 < a || a + N < b)
>
> which means there is no aliasing between *b and a[i]. The GIMPLE code
> of the loop body is:
>
>   <bb 5>:
>   # i_18 = PHI <0(4), i_29(6)>
>   # ivtmp_22 = PHI <1(4), ivtmp_30(6)>
>   _23 = (long unsigned int) i_18;
>   _24 = _23 * 4;
>   _25 = a_6(D) + _24;
>   _26 = *b_8(D);   => loop invariant
>   _27 = _26 + 1;   => loop invariant
>   *_25 = _27;
>   i_29 = i_18 + 1;
>   ivtmp_30 = ivtmp_22 - 1;
>   if (ivtmp_30 != 0)
> goto <bb 6>;
>   else
> goto <bb 7>;
>
>
> After hoisting loop invariant statements:
>
>
>   _26 = *b_8(D);
>   _27 = _26 + 1;
>
>   <bb 5>:
>   # i_18 = PHI <0(4), i_29(6)>
>   # ivtmp_22 = PHI <1(4), ivtmp_30(6)>
>   _23 = (long unsigned int) i_18;
>   _24 = _23 * 4;
>   _25 = a_6(D) + _24;
>   *_25 = _27;
>   i_29 = i_18 + 1;
>   ivtmp_30 = ivtmp_22 - 1;
>   if (ivtmp_30 != 0)
> goto ;
>   else
> goto ;
>
>
> This patch is related to the bug report
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58508
>
>
> thanks,
> Cong


Re: [PATCH] Relax the requirement of reduction pattern in GCC vectorizer.

2013-10-14 Thread Cong Hou
Ping...


thanks,
Cong


On Wed, Oct 2, 2013 at 11:18 AM, Cong Hou  wrote:
> Ping..  Any comment on this patch?
>
>
> thanks,
> Cong
>
>
> On Sat, Sep 28, 2013 at 9:34 AM, Xinliang David Li  wrote:
>> You can also add a test case of this form:
>>
>> int foo( int t, int n, int *dst)
>> {
>>int j = 0;
>>int s = 1;
>>t++;
>>for (j = 0; j < n; j++)
>>  {
>>  dst[j] = t;
>>  s *= t;
>>  }
>>
>>return s;
>> }
>>
>> where without the fix the loop vectorization is missed.
>>
>> David
>>
>> On Fri, Sep 27, 2013 at 6:28 PM, Cong Hou  wrote:
>>> The current GCC vectorizer requires the following pattern as a simple
>>> reduction computation:
>>>
>>>loop_header:
>>>  a1 = phi < a0, a2 >
>>>  a3 = ...
>>>  a2 = operation (a3, a1)
>>>
>>> But a3 can also be defined outside of the loop. For example, the
>>> following loop can benefit from vectorization but the GCC vectorizer
>>> fails to vectorize it:
>>>
>>>
>>> int foo(int v)
>>> {
>>>   int s = 1;
>>>   ++v;
>>>   for (int i = 0; i < 10; ++i)
>>> s *= v;
>>>   return s;
>>> }
>>>
>>>
>>> This patch relaxes the original requirement by also considering the
>>> following pattern:
>>>
>>>
>>>a3 = ...
>>>loop_header:
>>>  a1 = phi < a0, a2 >
>>>  a2 = operation (a3, a1)
>>>
>>>
>>> A test case is also added. The patch is tested on x86-64.
>>>
>>>
>>> thanks,
>>> Cong
>>>
>>> 
>>>
>>> diff --git a/gcc/ChangeLog b/gcc/ChangeLog
>>> index 39c786e..45c1667 100644
>>> --- a/gcc/ChangeLog
>>> +++ b/gcc/ChangeLog
>>> @@ -1,3 +1,9 @@
>>> +2013-09-27  Cong Hou  
>>> +
>>> + * tree-vect-loop.c: Relax the requirement of the reduction
>>> + pattern so that one operand of the reduction operation can
>>> + come from outside of the loop.
>>> +
>>>  2013-09-25  Tom Tromey  
>>>
>>>   * Makefile.in (PARTITION_H, LTO_SYMTAB_H, COMMON_TARGET_DEF_H)
>>> diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
>>> index 09644d2..90496a2 100644
>>> --- a/gcc/testsuite/ChangeLog
>>> +++ b/gcc/testsuite/ChangeLog
>>> @@ -1,3 +1,7 @@
>>> +2013-09-27  Cong Hou  
>>> +
>>> + * gcc.dg/vect/vect-reduc-pattern-3.c: New test.
>>> +
>>>  2013-09-25  Marek Polacek  
>>>
>>>   PR sanitizer/58413
>>> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
>>> index 2871ba1..3c51c3b 100644
>>> --- a/gcc/tree-vect-loop.c
>>> +++ b/gcc/tree-vect-loop.c
>>> @@ -2091,6 +2091,13 @@ vect_is_slp_reduction (loop_vec_info loop_info,
>>> gimple phi, gimple first_stmt)
>>>   a3 = ...
>>>   a2 = operation (a3, a1)
>>>
>>> +   or
>>> +
>>> +   a3 = ...
>>> +   loop_header:
>>> + a1 = phi < a0, a2 >
>>> + a2 = operation (a3, a1)
>>> +
>>> such that:
>>> 1. operation is commutative and associative and it is safe to
>>>change the order of the computation (if CHECK_REDUCTION is true)
>>> @@ -2451,6 +2458,7 @@ vect_is_simple_reduction_1 (loop_vec_info
>>> loop_info, gimple phi,
>>>if (def2 && def2 == phi
>>>&& (code == COND_EXPR
>>>|| !def1 || gimple_nop_p (def1)
>>> +  || !flow_bb_inside_loop_p (loop, gimple_bb (def1))
>>>|| (def1 && flow_bb_inside_loop_p (loop, gimple_bb (def1))
>>>&& (is_gimple_assign (def1)
>>>|| is_gimple_call (def1)
>>> @@ -2469,6 +2477,7 @@ vect_is_simple_reduction_1 (loop_vec_info
>>> loop_info, gimple phi,
>>>if (def1 && def1 == phi
>>>&& (code == COND_EXPR
>>>|| !def2 || gimple_nop_p (def2)
>>> +  || !flow_bb_inside_loop_p (loop, gimple_bb (def2))
>>>|| (def2 && flow_bb_inside_loop_p (loop, gimple_bb (def2))
>>>&& (is_gimple_assign (def2)
>>>|| is_gimple_call (def2)
>>> diff --git gcc/testsuite/gcc.dg/vect/vect-reduc-pattern-3.c
>>> gcc/testsuite/gcc.dg/vect/vect-reduc-pattern-3.c
>>> new file mode 100644
>>> index 000..06a9416
>>> --- /dev/null
>>> +++ gcc/testsuite/gcc.dg/vect/vect-reduc-pattern-3.c
>>> @@ -0,0 +1,41 @@
>>> +/* { dg-require-effective-target vect_int } */
>>> +
>>> +#include <stdarg.h>
>>> +#include "tree-vect.h"
>>> +
>>> +#define N 10
>>> +#define RES 1024
>>> +
>>> +/* A reduction pattern in which there is no data ref in
>>> +   the loop and one operand is defined outside of the loop.  */
>>> +
>>> +__attribute__ ((noinline)) int
>>> +foo (int v)
>>> +{
>>> +  int i;
>>> +  int result = 1;
>>> +
>>> +  ++v;
>>> +  for (i = 0; i < N; i++)
>>> +result *= v;
>>> +
>>> +  return result;
>>> +}
>>> +
>>> +int
>>> +main (void)
>>> +{
>>> +  int res;
>>> +
>>> +  check_vect ();
>>> +
>>> +  res = foo (1);
>>> +  if (res != RES)
>>> +abort ();
>>> +
>>> +  return 0;
>>> +}
>>> +
>>> +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
>>> +/* { dg-final { cleanup-tree-dump "vect" } } */
>>> +


Re: [PATCH] Relax the requirement of reduction pattern in GCC vectorizer.

2013-10-15 Thread Cong Hou
I have corrected the ChangeLog format, and committed this patch.

Thank you!


Cong


On Tue, Oct 15, 2013 at 6:38 AM, Richard Biener
 wrote:
> On Sat, Sep 28, 2013 at 3:28 AM, Cong Hou  wrote:
>> The current GCC vectorizer requires the following pattern as a simple
>> reduction computation:
>>
>>loop_header:
>>  a1 = phi < a0, a2 >
>>  a3 = ...
>>  a2 = operation (a3, a1)
>>
>> But a3 can also be defined outside of the loop. For example, the
>> following loop can benefit from vectorization but the GCC vectorizer
>> fails to vectorize it:
>>
>>
>> int foo(int v)
>> {
>>   int s = 1;
>>   ++v;
>>   for (int i = 0; i < 10; ++i)
>> s *= v;
>>   return s;
>> }
>>
>>
>> This patch relaxes the original requirement by also considering the
>> following pattern:
>>
>>
>>a3 = ...
>>loop_header:
>>  a1 = phi < a0, a2 >
>>  a2 = operation (a3, a1)
>>
>>
>> A test case is also added. The patch is tested on x86-64.
>>
>>
>> thanks,
>> Cong
>>
>> 
>>
>> diff --git a/gcc/ChangeLog b/gcc/ChangeLog
>> index 39c786e..45c1667 100644
>> --- a/gcc/ChangeLog
>> +++ b/gcc/ChangeLog
>> @@ -1,3 +1,9 @@
>> +2013-09-27  Cong Hou  
>> +
>> + * tree-vect-loop.c: Relax the requirement of the reduction
>
> ChangeLog format is
>
> * tree-vect-loop.c (vect_is_simple_reduction_1): Relax the
> requirement of the reduction.
>
> Ok with that change.
>
> Thanks,
> Richard.
>
>> + pattern so that one operand of the reduction operation can
>> + come from outside of the loop.
>> +
>>  2013-09-25  Tom Tromey  
>>
>>   * Makefile.in (PARTITION_H, LTO_SYMTAB_H, COMMON_TARGET_DEF_H)
>> diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
>> index 09644d2..90496a2 100644
>> --- a/gcc/testsuite/ChangeLog
>> +++ b/gcc/testsuite/ChangeLog
>> @@ -1,3 +1,7 @@
>> +2013-09-27  Cong Hou  
>> +
>> + * gcc.dg/vect/vect-reduc-pattern-3.c: New test.
>> +
>>  2013-09-25  Marek Polacek  
>>
>>   PR sanitizer/58413
>> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
>> index 2871ba1..3c51c3b 100644
>> --- a/gcc/tree-vect-loop.c
>> +++ b/gcc/tree-vect-loop.c
>> @@ -2091,6 +2091,13 @@ vect_is_slp_reduction (loop_vec_info loop_info,
>> gimple phi, gimple first_stmt)
>>   a3 = ...
>>   a2 = operation (a3, a1)
>>
>> +   or
>> +
>> +   a3 = ...
>> +   loop_header:
>> + a1 = phi < a0, a2 >
>> + a2 = operation (a3, a1)
>> +
>> such that:
>> 1. operation is commutative and associative and it is safe to
>>change the order of the computation (if CHECK_REDUCTION is true)
>> @@ -2451,6 +2458,7 @@ vect_is_simple_reduction_1 (loop_vec_info
>> loop_info, gimple phi,
>>if (def2 && def2 == phi
>>&& (code == COND_EXPR
>>|| !def1 || gimple_nop_p (def1)
>> +  || !flow_bb_inside_loop_p (loop, gimple_bb (def1))
>>|| (def1 && flow_bb_inside_loop_p (loop, gimple_bb (def1))
>>&& (is_gimple_assign (def1)
>>|| is_gimple_call (def1)
>> @@ -2469,6 +2477,7 @@ vect_is_simple_reduction_1 (loop_vec_info
>> loop_info, gimple phi,
>>if (def1 && def1 == phi
>>&& (code == COND_EXPR
>>|| !def2 || gimple_nop_p (def2)
>> +  || !flow_bb_inside_loop_p (loop, gimple_bb (def2))
>>|| (def2 && flow_bb_inside_loop_p (loop, gimple_bb (def2))
>>&& (is_gimple_assign (def2)
>>|| is_gimple_call (def2)
>> diff --git gcc/testsuite/gcc.dg/vect/vect-reduc-pattern-3.c
>> gcc/testsuite/gcc.dg/vect/vect-reduc-pattern-3.c
>> new file mode 100644
>> index 000..06a9416
>> --- /dev/null
>> +++ gcc/testsuite/gcc.dg/vect/vect-reduc-pattern-3.c
>> @@ -0,0 +1,41 @@
>> +/* { dg-require-effective-target vect_int } */
>> +
>> +#include <stdarg.h>
>> +#include "tree-vect.h"
>> +
>> +#define N 10
>> +#define RES 1024
>> +
>> +/* A reduction pattern in which there is no data ref in
>> +   the loop and one operand is defined outside of the loop.  */
>> +
>> +__attribute__ ((noinline)) int
>> +foo (int v)
>> +{
>> +  int i;
>> +  int result = 1;
>> +
>> +  ++v;
>> +  for (i = 0; i < N; i++)
>> +result *= v;
>> +
>> +  return result;
>> +}
>> +
>> +int
>> +main (void)
>> +{
>> +  int res;
>> +
>> +  check_vect ();
>> +
>> +  res = foo (1);
>> +  if (res != RES)
>> +abort ();
>> +
>> +  return 0;
>> +}
>> +
>> +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
>> +/* { dg-final { cleanup-tree-dump "vect" } } */
>> +


Re: [PATCH] Hoist loop invariant statements containing data refs with zero-step during loop-versioning in vectorization.

2013-10-15 Thread Cong Hou
Thank you for the reminder, Jeff! I just noticed Richard's comment and
have modified the patch accordingly.

The new patch is attached.


thanks,
Cong


On Tue, Oct 15, 2013 at 12:33 PM, Jeff Law  wrote:
> On 10/14/13 17:31, Cong Hou wrote:
>>
>> Any comment on this patch?
>
> Richi replied in the BZ you opened.
>
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58508
>
> Essentially he said emit the load on the edge rather than in the block
> itself.
> jeff
>
diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 8a38316..2637309 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,8 @@
+2013-10-15  Cong Hou  
+
+   * tree-vect-loop-manip.c (vect_loop_versioning): Hoist loop invariant
+   statement that contains data refs with zero-step.
+
 2013-10-14  David Malcolm  
 
* dumpfile.h (gcc::dump_manager): New class, to hold state
diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
index 075d071..9d0f4a5 100644
--- a/gcc/testsuite/ChangeLog
+++ b/gcc/testsuite/ChangeLog
@@ -1,3 +1,7 @@
+2013-10-15  Cong Hou  
+
+   * gcc.dg/vect/pr58508.c: New test.
+
 2013-10-14  Tobias Burnus  
 
PR fortran/58658
diff --git a/gcc/testsuite/gcc.dg/vect/pr58508.c 
b/gcc/testsuite/gcc.dg/vect/pr58508.c
new file mode 100644
index 000..cb22b50
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr58508.c
@@ -0,0 +1,20 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -fdump-tree-vect-details" } */
+
+
+/* The GCC vectorizer generates loop versioning for the following loop
+   since there may exist aliasing between A and B.  The predicate checks
+   if A may alias with B across all iterations.  Then for the loop in
+   the true body, we can assert that *B is a loop invariant so that
+   we can hoist the load of *B before the loop body.  */
+
+void foo (int* a, int* b)
+{
+  int i;
+  for (i = 0; i < 10; ++i)
+a[i] = *b + 1;
+}
+
+
+/* { dg-final { scan-tree-dump-times "hoist" 2 "vect" } } */
+/* { dg-final { cleanup-tree-dump "vect" } } */
diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
index 574446a..f4fdec2 100644
--- a/gcc/tree-vect-loop-manip.c
+++ b/gcc/tree-vect-loop-manip.c
@@ -2477,6 +2477,92 @@ vect_loop_versioning (loop_vec_info loop_vinfo,
   adjust_phi_and_debug_stmts (orig_phi, e, PHI_RESULT (new_phi));
 }
 
+
+  /* Extract load and store statements on pointers with zero-stride
+ accesses.  */
+  if (LOOP_REQUIRES_VERSIONING_FOR_ALIAS (loop_vinfo))
+{
+  /* In the loop body, we iterate each statement to check if it is a load
+or store.  Then we check the DR_STEP of the data reference.  If
+DR_STEP is zero, then we will hoist the load statement to the loop
+preheader, and move the store statement to the loop exit.  */
+
+  for (gimple_stmt_iterator si = gsi_start_bb (loop->header);
+  !gsi_end_p (si);)
+   {
+ gimple stmt = gsi_stmt (si);
+ stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
+ struct data_reference *dr = STMT_VINFO_DATA_REF (stmt_info);
+
+ if (dr && integer_zerop (DR_STEP (dr)))
+   {
+ if (DR_IS_READ (dr))
+   {
+ if (dump_enabled_p ())
+   {
+ dump_printf_loc
+ (MSG_NOTE, vect_location,
+  "hoist the statement to outside of the loop ");
+ dump_gimple_stmt (MSG_NOTE, TDF_SLIM, stmt, 0);
+ dump_printf (MSG_NOTE, "\n");
+   }
+
+ gsi_remove (&si, false);
+ gsi_insert_on_edge_immediate (loop_preheader_edge (loop), 
stmt);
+   }
+ /* TODO: We also consider vectorizing loops containing zero-step
+data refs as writes.  For example:
+
+int a[N], *s;
+for (i = 0; i < N; i++)
+  *s += a[i];
+
+In this case the write to *s can be also moved after the
+loop.  */
+
+ continue;
+   }
+ else if (!dr)
+ {
+   bool hoist = true;
+   for (size_t i = 0; i < gimple_num_ops (stmt); i++)
+ {
+   tree op = gimple_op (stmt, i);
+   if (TREE_CODE (op) == INTEGER_CST
+   || TREE_CODE (op) == REAL_CST)
+ continue;
+   if (TREE_CODE (op) == SSA_NAME)
+ {
+   gimple def = SSA_NAME_DEF_STMT (op);
+   if (def == stmt
+   || gimple_nop_p (def)
+   || !flow_bb_inside_loop_p (loop, gimple_bb (def)))
+ continue;
+ }
+   hoist = false;
+   break;
+ }
+
+   if (hoist)
+ {
+  
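
For reference, a C-level sketch of what loop versioning plus this
hoisting does to the foo loop above (my illustration; the variable
names and the exact form of the alias predicate are not GCC's actual
output):

void foo_versioned (int *a, int *b)
{
  int i;
  /* Alias check inserted by loop versioning (illustrative form).  */
  if (b + 1 <= a || a + 10 <= b)
    {
      int t = *b + 1;             /* invariant load hoisted out     */
      for (i = 0; i < 10; ++i)    /* this version gets vectorized   */
        a[i] = t;
    }
  else
    {
      for (i = 0; i < 10; ++i)    /* scalar fallback; a may alias b */
        a[i] = *b + 1;
    }
}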

Re: [PATCH] Hoist loop invariant statements containing data refs with zero-step during loop-versioning in vectorization.

2013-10-16 Thread Cong Hou
On Wed, Oct 16, 2013 at 2:02 AM, Richard Biener  wrote:
> On Tue, 15 Oct 2013, Cong Hou wrote:
>
>> Thank you for your reminder, Jeff! I just noticed Richard's comment. I
>> have modified the patch according to that.
>>
>> The new patch is attached.
>
> (posting patches inline is easier for review, now you have to deal
> with no quoting markers ;))
>
> Comments inline.
>
> diff --git a/gcc/ChangeLog b/gcc/ChangeLog
> index 8a38316..2637309 100644
> --- a/gcc/ChangeLog
> +++ b/gcc/ChangeLog
> @@ -1,3 +1,8 @@
> +2013-10-15  Cong Hou  
> +
> +   * tree-vect-loop-manip.c (vect_loop_versioning): Hoist loop invariant
> +   statement that contains data refs with zero-step.
> +
>  2013-10-14  David Malcolm  
>
> * dumpfile.h (gcc::dump_manager): New class, to hold state
> diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
> index 075d071..9d0f4a5 100644
> --- a/gcc/testsuite/ChangeLog
> +++ b/gcc/testsuite/ChangeLog
> @@ -1,3 +1,7 @@
> +2013-10-15  Cong Hou  
> +
> +   * gcc.dg/vect/pr58508.c: New test.
> +
>  2013-10-14  Tobias Burnus  
>
> PR fortran/58658
> diff --git a/gcc/testsuite/gcc.dg/vect/pr58508.c b/gcc/testsuite/gcc.dg/vect/pr58508.c
> new file mode 100644
> index 000..cb22b50
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/pr58508.c
> @@ -0,0 +1,20 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -ftree-vectorize -fdump-tree-vect-details" } */
> +
> +
> +/* The GCC vectorizer generates loop versioning for the following loop
> +   since there may exist aliasing between A and B.  The predicate checks
> +   if A may alias with B across all iterations.  Then for the loop in
> +   the true body, we can assert that *B is a loop invariant so that
> +   we can hoist the load of *B before the loop body.  */
> +
> +void foo (int* a, int* b)
> +{
> +  int i;
> +  for (i = 0; i < 10; ++i)
> +a[i] = *b + 1;
> +}
> +
> +
> +/* { dg-final { scan-tree-dump-times "hoist" 2 "vect" } } */
> +/* { dg-final { cleanup-tree-dump "vect" } } */
> diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
> index 574446a..f4fdec2 100644
> --- a/gcc/tree-vect-loop-manip.c
> +++ b/gcc/tree-vect-loop-manip.c
> @@ -2477,6 +2477,92 @@ vect_loop_versioning (loop_vec_info loop_vinfo,
>adjust_phi_and_debug_stmts (orig_phi, e, PHI_RESULT (new_phi));
>  }
>
>
> Note that applying this kind of transform at this point invalidates
> some of the earlier analysis the vectorizer performed (namely the
> def-kind which now effectively gets vect_external_def from
> vect_internal_def).  In this case it doesn't seem to cause any
> issues (we re-compute the def-kind every time we need it (how wasteful)).
>
> +  /* Extract load and store statements on pointers with zero-stride
> + accesses.  */
> +  if (LOOP_REQUIRES_VERSIONING_FOR_ALIAS (loop_vinfo))
> +{
> +  /* In the loop body, we iterate each statement to check if it is a load
> +or store.  Then we check the DR_STEP of the data reference.  If
> +DR_STEP is zero, then we will hoist the load statement to the loop
> +preheader, and move the store statement to the loop exit.  */
>
> We don't move the store yet.  Micha has a patch pending that enables
> vectorization of zero-step stores.
>
> +  for (gimple_stmt_iterator si = gsi_start_bb (loop->header);
> +  !gsi_end_p (si);)
>
> While technically ok now (vectorized loops contain a single basic block)
> please use LOOP_VINFO_BBS () to get at the vector of basic-blocks
> and iterate over them like other code does.


Have done it.


>
> +   {
> + gimple stmt = gsi_stmt (si);
> + stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
> + struct data_reference *dr = STMT_VINFO_DATA_REF (stmt_info);
> +
> + if (dr && integer_zerop (DR_STEP (dr)))
> +   {
> + if (DR_IS_READ (dr))
> +   {
> + if (dump_enabled_p ())
> +   {
> + dump_printf_loc
> + (MSG_NOTE, vect_location,
> +  "hoist the statement to outside of the loop ");
>
> "hoisting out of the vectorized loop: "
>
> + dump_gimple_stmt (MSG_NOTE, TDF_SLIM, stmt, 0);
> + dump_printf (MSG_NOTE, "\n");
> +   }
> +
> + gsi_remove (&si, false);
> + gsi_insert_on_edge_immediate (loop_preheader_edge (loop), stmt);
>
> Note that this will result in a b

Re: [PATCH] Hoist loop invariant statements containing data refs with zero-step during loop-versioning in vectorization.

2013-10-17 Thread Cong Hou
I tested this case with -fno-tree-loop-im and -fno-tree-pre, and it
seems that GCC could hoist j+1 outside of the i loop:

t3.c:5:5: note: hoisting out of the vectorized loop: _10 = (sizetype) j_25;
t3.c:5:5: note: hoisting out of the vectorized loop: _11 = _10 + 1;
t3.c:5:5: note: hoisting out of the vectorized loop: _12 = _11 * 4;
t3.c:5:5: note: hoisting out of the vectorized loop: _14 = b_13(D) + _12;
t3.c:5:5: note: hoisting out of the vectorized loop: _15 = *_14;
t3.c:5:5: note: hoisting out of the vectorized loop: _16 = _15 + 1;
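
(For reference: t3.c is of the same shape as the nested-loop test2
added below; roughly the following, where only the b[j+1] + 1 shape is
implied by the dump and the bounds are my guess:)

void t3 (int *a, int *b)
{
  int i, j;
  for (j = 0; j < 10; ++j)
    for (i = 0; i < 10; ++i)
      a[i] = b[j+1] + 1;   /* address and load of b[j+1] are i-invariant */
}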


But your suggestion is still nice, as it removes a branch and makes
the code more concise. I have updated the patch and also included the
nested loop example into the test case.

Thank you!


Cong



diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 8a38316..2637309 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,8 @@
+2013-10-15  Cong Hou  
+
+ * tree-vect-loop-manip.c (vect_loop_versioning): Hoist loop invariant
+ statement that contains data refs with zero-step.
+
 2013-10-14  David Malcolm  

  * dumpfile.h (gcc::dump_manager): New class, to hold state
diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
index 075d071..9d0f4a5 100644
--- a/gcc/testsuite/ChangeLog
+++ b/gcc/testsuite/ChangeLog
@@ -1,3 +1,7 @@
+2013-10-15  Cong Hou  
+
+ * gcc.dg/vect/pr58508.c: New test.
+
 2013-10-14  Tobias Burnus  

  PR fortran/58658
diff --git a/gcc/testsuite/gcc.dg/vect/pr58508.c b/gcc/testsuite/gcc.dg/vect/pr58508.c
new file mode 100644
index 000..6484a65
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr58508.c
@@ -0,0 +1,70 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -fdump-tree-vect-details" } */
+
+
+/* The GCC vectorizer generates loop versioning for the following loop
+   since there may exist aliasing between A and B.  The predicate checks
+   if A may alias with B across all iterations.  Then for the loop in
+   the true body, we can assert that *B is a loop invariant so that
+   we can hoist the load of *B before the loop body.  */
+
+void test1 (int* a, int* b)
+{
+  int i;
+  for (i = 0; i < 10; ++i)
+a[i] = *b + 1;
+}
+
+/* A test case with nested loops.  The load of b[j+1] in the inner
+   loop should be hoisted.  */
+
+void test2 (int* a, int* b)
+{
+  int i, j;
+  for (j = 0; j < 10; ++j)
+for (i = 0; i < 10; ++i)
+  a[i] = b[j+1] + 1;
+}
+
+/* A test case with ifcvt transformation.  */
+
+void test3 (int* a, int* b)
+{
+  int i, t;
+  for (i = 0; i < 1; ++i)
+{
+  if (*b > 0)
+ t = *b * 2;
+  else
+ t = *b / 2;
+  a[i] = t;
+}
+}
+
+/* A test case in which the store in the loop can be moved outside
+   in the versioned loop with alias checks.  Note this loop won't
+   be vectorized.  */
+
+void test4 (int* a, int* b)
+{
+  int i;
+  for (i = 0; i < 10; ++i)
+*a += b[i];
+}
+
+/* A test case in which the load and store in the loop to b
+   can be moved outside in the versioned loop with alias checks.
+   Note this loop won't be vectorized.  */
+
+void test5 (int* a, int* b)
+{
+  int i;
+  for (i = 0; i < 10; ++i)
+{
+  *b += a[i];
+  a[i] = *b;
+}
+}
+
+/* { dg-final { scan-tree-dump-times "hoist" 8 "vect" } } */
+/* { dg-final { cleanup-tree-dump "vect" } } */
diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
index 574446a..1cc563c 100644
--- a/gcc/tree-vect-loop-manip.c
+++ b/gcc/tree-vect-loop-manip.c
@@ -2477,6 +2477,73 @@ vect_loop_versioning (loop_vec_info loop_vinfo,
   adjust_phi_and_debug_stmts (orig_phi, e, PHI_RESULT (new_phi));
 }

+
+  /* Extract load statements on memrefs with zero-stride accesses.  */
+
+  if (LOOP_REQUIRES_VERSIONING_FOR_ALIAS (loop_vinfo))
+{
+  /* In the loop body, we iterate each statement to check if it is a load.
+ Then we check the DR_STEP of the data reference.  If DR_STEP is zero,
+ then we will hoist the load statement to the loop preheader.  */
+
+  basic_block *bbs = LOOP_VINFO_BBS (loop_vinfo);
+  int nbbs = loop->num_nodes;
+
+  for (int i = 0; i < nbbs; ++i)
+ {
+  for (gimple_stmt_iterator si = gsi_start_bb (bbs[i]);
+   !gsi_end_p (si);)
+{
+  gimple stmt = gsi_stmt (si);
+  stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
+  struct data_reference *dr = STMT_VINFO_DATA_REF (stmt_info);
+
+  if (is_gimple_assign (stmt)
+  && (!dr
+  || (DR_IS_READ (dr) && integer_zerop (DR_STEP (dr)
+ {
+  bool hoist = true;
+  ssa_op_iter iter;
+  tree var;
+
+  /* We hoist a statement if all SSA uses in it are defined
+ outside of the loop.  */
+  FOR_EACH_SSA_TREE_OPERAND (var, stmt, iter, SSA_OP_USE)
+{
+  gimple def = SSA_NAME_DEF_STMT (var);
+  if (!gimple_nop_p (def)
+  && flow_bb_inside_loop_p (loop, gimple_bb (def)))
+ {
+  hoist = false;
+  break;
+ }
+}
+
+  if (hoist)
+{
+  if (dr)

Re: [PATCH] Fixing improper conversion from sin() to sinf() in optimization mode.

2013-10-17 Thread Cong Hou
Ping?


thanks,
Cong


On Mon, Oct 7, 2013 at 10:15 AM, Cong Hou  wrote:
> You are right. I am not an expert on numerical analysis, but I tested
> your case, and it shows that conversion number 4 is not safe.
>
> Now we have four conversions which are safe once the precision
> requirement is satisfied. I added a condition if (type != newtype) to
> rule out the unsafe one, where one more conversion would be added
> and lead to the problem. If you think this condition does not
> make sense please let me know.
>
> The new patch is shown below (the attached file has tabs).
>
> Thank you very much!
>
>
>
> thanks,
> Cong
>
>
>
> Index: gcc/convert.c
> ===
> --- gcc/convert.c (revision 203250)
> +++ gcc/convert.c (working copy)
> @@ -135,16 +135,19 @@ convert_to_real (tree type, tree expr)
>CASE_MATHFN (COS)
>CASE_MATHFN (ERF)
>CASE_MATHFN (ERFC)
> -  CASE_MATHFN (FABS)
>CASE_MATHFN (LOG)
>CASE_MATHFN (LOG10)
>CASE_MATHFN (LOG2)
>CASE_MATHFN (LOG1P)
> -  CASE_MATHFN (LOGB)
>CASE_MATHFN (SIN)
> -  CASE_MATHFN (SQRT)
>CASE_MATHFN (TAN)
>CASE_MATHFN (TANH)
> +/* The above functions are not safe to do this conversion.  */
> +if (!flag_unsafe_math_optimizations)
> +  break;
> +  CASE_MATHFN (SQRT)
> +  CASE_MATHFN (FABS)
> +  CASE_MATHFN (LOGB)
>  #undef CASE_MATHFN
>  {
>tree arg0 = strip_float_extensions (CALL_EXPR_ARG (expr, 0));
> @@ -155,13 +158,43 @@ convert_to_real (tree type, tree expr)
>if (TYPE_PRECISION (TREE_TYPE (arg0)) > TYPE_PRECISION (type))
>   newtype = TREE_TYPE (arg0);
>
> +  /* We consider converting
> +
> + (T1) sqrtT2 ((T2) exprT3)
> + to
> + (T1) sqrtT4 ((T4) exprT3)
> +
> +  , where T1 is TYPE, T2 is ITYPE, T3 is TREE_TYPE (ARG0),
> + and T4 is NEWTYPE. All those types are of floating point types.
> + T4 (NEWTYPE) should be narrower than T2 (ITYPE). This conversion
> + is safe only if P1 >= P2*2+2, where P1 and P2 are precisions of
> + T2 and T4. See the following URL for a reference:
> + 
> http://stackoverflow.com/questions/9235456/determining-floating-point-square-root
> + */
> +  if ((fcode == BUILT_IN_SQRT || fcode == BUILT_IN_SQRTL)
> +  && !flag_unsafe_math_optimizations)
> + {
> +  /* The following conversion is unsafe even if the precision condition
> + below is satisfied:
> +
> + (float) sqrtl ((long double) double_val) -> (float) sqrt (double_val)
> +*/
> +  if (type != newtype)
> +break;
> +
> +  int p1 = REAL_MODE_FORMAT (TYPE_MODE (itype))->p;
> +  int p2 = REAL_MODE_FORMAT (TYPE_MODE (newtype))->p;
> +  if (p1 < p2 * 2 + 2)
> +break;
> + }
> +
>/* Be careful about integer to fp conversions.
>   These may overflow still.  */
>if (FLOAT_TYPE_P (TREE_TYPE (arg0))
>&& TYPE_PRECISION (newtype) < TYPE_PRECISION (itype)
>&& (TYPE_MODE (newtype) == TYPE_MODE (double_type_node)
>|| TYPE_MODE (newtype) == TYPE_MODE (float_type_node)))
> -{
> + {
>tree fn = mathfn_built_in (newtype, fcode);
>
>if (fn)
> Index: gcc/ChangeLog
> ===
> --- gcc/ChangeLog (revision 203250)
> +++ gcc/ChangeLog (working copy)
> @@ -1,3 +1,9 @@
> +2013-10-07  Cong Hou  
> +
> + * convert.c (convert_to_real): Forbid unsafe math function
> + conversions including sin/cos/log etc. Add precision check
> + for sqrt.
> +
>  2013-10-07  Bill Schmidt  
>
>   * config/rs6000/rs6000.c (altivec_expand_vec_perm_const_le): New.
> Index: gcc/testsuite/ChangeLog
> ===
> --- gcc/testsuite/ChangeLog (revision 203250)
> +++ gcc/testsuite/ChangeLog (working copy)
> @@ -1,3 +1,7 @@
> +2013-10-07  Cong Hou  
> +
> + * gcc.c-torture/execute/20030125-1.c: Update.
> +
>  2013-10-07  Bill Schmidt  
>
>   * gcc.target/powerpc/pr43154.c: Skip for ppc64 little endian.
> Index: gcc/testsuite/gcc.c-torture/execute/20030125-1.c
> ===
> --- gcc/testsuite/gcc.c-torture/execute/20030125-1.c (revision 203250)
> +++ gcc/testsuite/gcc.c-torture/execute/20030125-1.c (working copy)
> @@ -44,11 +44,11 @@ __attribute__ ((noinline))
>  double
>  sin(double a)
>  {
> - abort ();
> + return a;
>  }
>  __attribute__ ((noinline))
>  float
>  sinf(float a)
>  {
> - return a;
> + abort ();
>  }
>
>
>
>
> On Thu, Oct 3, 2013 at 5:06 PM, Joseph S. Myers  
> wrote:
>> On Fri, 6 
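
For concreteness, a quick numeric check of the P1 >= P2*2+2 rule from
the patch above (my arithmetic; the mantissa widths come from float.h,
and the 64-bit mantissa of the x86 80-bit long double is assumed):

#include <float.h>
#include <stdio.h>

int main (void)
{
  /* (float) sqrt ((double) f) -> sqrtf (f): 53 >= 2*24 + 2 = 50.  */
  printf ("double->float sqrt: %s\n",
          DBL_MANT_DIG >= 2 * FLT_MANT_DIG + 2 ? "safe" : "unsafe");

  /* (double) sqrtl ((long double) d) -> sqrt (d): 64 < 2*53 + 2 = 108,
     which is why the sqrtl -> sqrt narrowing is rejected.  */
  printf ("long double->double sqrt: %s\n",
          LDBL_MANT_DIG >= 2 * DBL_MANT_DIG + 2 ? "safe" : "unsafe");
  return 0;
}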

Re: [PATCH] Hoist loop invariant statements containing data refs with zero-step during loop-versioning in vectorization.

2013-10-21 Thread Cong Hou
Jeff, thank you for installing this patch. Actually I already have the
write privileges. I just came back from a trip.

Thank you again!



thanks,
Cong


On Fri, Oct 18, 2013 at 10:22 PM, Jeff Law  wrote:
> On 10/18/13 03:56, Richard Biener wrote:
>>
>> On Thu, 17 Oct 2013, Cong Hou wrote:
>>
>>> I tested this case with -fno-tree-loop-im and -fno-tree-pre, and it
>>> seems that GCC could hoist j+1 outside of the i loop:
>>>
>>> t3.c:5:5: note: hoisting out of the vectorized loop: _10 = (sizetype)
>>> j_25;
>>> t3.c:5:5: note: hoisting out of the vectorized loop: _11 = _10 + 1;
>>> t3.c:5:5: note: hoisting out of the vectorized loop: _12 = _11 * 4;
>>> t3.c:5:5: note: hoisting out of the vectorized loop: _14 = b_13(D) + _12;
>>> t3.c:5:5: note: hoisting out of the vectorized loop: _15 = *_14;
>>> t3.c:5:5: note: hoisting out of the vectorized loop: _16 = _15 + 1;
>>>
>>>
>>> But your suggestion is still nice, as it removes a branch and makes
>>> the code more concise. I have updated the patch and also included the
>>> nested loop example into the test case.
>>
>>
>> Ok if it passes bootstrap & regtest.
>
> Bootstrapped & regression tested on x86_64-unknown-linux-gnu.  Installed on
> Cong's behalf.
>
> Cong -- if you plan on contributing regularly to GCC, please start the
> process for write privileges.  This form should have everything you need:
>
> https://sourceware.org/cgi-bin/pdw/ps_form.cgi
>
> Jeff


Re: [PATCH] Hoist loop invariant statements containing data refs with zero-step during loop-versioning in vectorization.

2013-10-21 Thread Cong Hou
OK. Have done that. And this is also a "patch", right? ;)


thanks,
Cong



diff --git a/MAINTAINERS b/MAINTAINERS
index 15b6cc7..a6954da 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
Fergus Henderson   f...@cs.mu.oz.au
 Stuart Henderson   shend...@gcc.gnu.org
 Matthew Hiller hil...@redhat.com
 Manfred Hollstein  m...@suse.com
+Cong Hou   co...@google.com
 Falk Hueffner  f...@debian.org
 Andrew John Hughes gnu_and...@member.fsf.org
 Andy Hutchinsonhutchinsona...@aim.com





On Mon, Oct 21, 2013 at 9:46 AM, Jeff Law  wrote:
> On 10/21/13 10:45, Cong Hou wrote:
>>
>> Jeff, thank you for installing this patch. Actually I already have the
>> write privileges. I just came back from a trip.
>
> Ah.  I didn't see you in the MAINTAINERS file.  Can you update that file
> please.
>
> Thanks,
> jeff
>


[PATCH] Vectorizing abs(char/short/int) on x86.

2013-10-22 Thread Cong Hou
This patch aims at PR58762.

Currently GCC cannot vectorize the abs() operation for integers on x86
with only SSE2 support. For the int type, the reason is that the expand
for abs() is not defined for vector types. This patch defines such an
expand so that abs(int) will be vectorized with only SSE2.

For abs(char/short), type conversions are needed, as the current abs()
function/operation does not accept arguments of char/short type.
Therefore, when we want to get the absolute value of a char_val using
abs (char_val), it will be converted into abs ((int) char_val). It
can then be vectorized, but the generated code is not efficient, as
lots of packings and unpackings are involved. But if we convert
(char) abs ((int) char_val) to abs (char_val), the vectorizer will be
able to generate better code. Same for short. (A short example of this
implicit widening follows.)
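
For instance (my illustration, shaped like the pr58762.c test added
later in this thread, not part of the patch), the front end implicitly
widens the operand:

#include <stdlib.h>

void f (char *a, const char *b, int n)
{
  int i;
  for (i = 0; i < n; ++i)
    a[i] = abs (b[i]);   /* parsed as a[i] = (char) abs ((int) b[i]); */
}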

This conversion also enables vectorizing abs(char/short) operation
with PABSB and PABSW instructions in SSE3.

With only SSE2 support, I developed three methods to expand
abs(char/short/int) separately (a scalar sketch of each follows the
list):

1. For 32 bit int value x, we can get abs (x) from (((signed) x >>
(W-1)) ^ x) - ((signed) x >> (W-1)). This is better than max (x, -x),
which needs bit masking.

2. For 16 bit int value x, we can get abs (x) from max (x, -x), as
SSE2 provides PMAXSW instruction.

3. For 8 bit int value x, we can get abs (x) from min ((unsigned char)
x, (unsigned char) (-x)), as SSE2 provides PMINUB instruction.
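
A scalar sketch of the three identities (my illustration, not the
patch's code; the vector expander applies the same operations
lane-wise with PSRAD/PXOR/PSUBD, PMAXSW, and PMINUB respectively):

#include <stdint.h>

/* Method 1: shift-xor-sub.  */
int32_t abs32 (int32_t x)
{
  int32_t m = x >> 31;          /* arithmetic shift: 0 or -1 */
  return (x ^ m) - m;
}

/* Method 2: max (x, -x).  */
int16_t abs16 (int16_t x)
{
  int16_t n = (int16_t) -x;
  return x > n ? x : n;
}

/* Method 3: unsigned min (x, -x).  */
int8_t abs8 (int8_t x)
{
  uint8_t u = (uint8_t) x;
  uint8_t n = (uint8_t) -u;     /* wrapping negation */
  /* For x = -128 both u and n are 128, so the result wraps back to
     -128; see the overflow discussion later in this thread.  */
  return (int8_t) (u < n ? u : n);
}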


The patch is pasted below. Please point out any problem in my patch
and analysis.


thanks,
Cong




diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 8a38316..e0f33ee 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,13 @@
+2013-10-22  Cong Hou  
+
+ PR target/58762
+ * convert.c (convert_to_integer): Convert (char) abs ((int) char_val)
+ into abs (char_val).  Also convert (short) abs ((int) short_val)
+ into abs (short_val).
+ * config/i386/i386-protos.h (ix86_expand_sse2_absvxsi2): New function.
+ * config/i386/i386.c (ix86_expand_sse2_absvxsi2): New function.
+ * config/i386/sse.md: Add SSE2 support to abs (char/int/short).
+
 2013-10-14  David Malcolm  

  * dumpfile.h (gcc::dump_manager): New class, to hold state
diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h
index 3ab2f3a..e85f663 100644
--- a/gcc/config/i386/i386-protos.h
+++ b/gcc/config/i386/i386-protos.h
@@ -238,6 +238,7 @@ extern void ix86_expand_mul_widen_evenodd (rtx,
rtx, rtx, bool, bool);
 extern void ix86_expand_mul_widen_hilo (rtx, rtx, rtx, bool, bool);
 extern void ix86_expand_sse2_mulv4si3 (rtx, rtx, rtx);
 extern void ix86_expand_sse2_mulvxdi3 (rtx, rtx, rtx);
+extern void ix86_expand_sse2_absvxsi2 (rtx, rtx);

 /* In i386-c.c  */
 extern void ix86_target_macros (void);
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 02cbbbd..8050e02 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -41696,6 +41696,53 @@ ix86_expand_sse2_mulvxdi3 (rtx op0, rtx op1, rtx op2)
gen_rtx_MULT (mode, op1, op2));
 }

+void
+ix86_expand_sse2_absvxsi2 (rtx op0, rtx op1)
+{
+  enum machine_mode mode = GET_MODE (op0);
+  rtx tmp0, tmp1;
+
+  switch (mode)
+{
+  /* For 32-bit signed integer X, the best way to calculate the absolute
+ value of X is (((signed) X >> (W-1)) ^ X) - ((signed) X >> (W-1)).  */
+  case V4SImode:
+ tmp0 = expand_simple_binop (mode, ASHIFTRT, op1,
+GEN_INT (GET_MODE_BITSIZE
+ (GET_MODE_INNER (mode)) - 1),
+NULL, 0, OPTAB_DIRECT);
+ if (tmp0)
+  tmp1 = expand_simple_binop (mode, XOR, op1, tmp0,
+  NULL, 0, OPTAB_DIRECT);
+ if (tmp0 && tmp1)
+  expand_simple_binop (mode, MINUS, tmp1, tmp0,
+   op0, 0, OPTAB_DIRECT);
+ break;
+
+  /* For 16-bit signed integer X, the best way to calculate the absolute
+ value of X is max (X, -X), as SSE2 provides the PMAXSW insn.  */
+  case V8HImode:
+ tmp0 = expand_unop (mode, neg_optab, op1, NULL_RTX, 0);
+ if (tmp0)
+  expand_simple_binop (mode, SMAX, op1, tmp0, op0, 0,
+   OPTAB_DIRECT);
+ break;
+
+  /* For 8-bit signed integer X, the best way to calculate the absolute
+ value of X is min ((unsigned char) X, (unsigned char) (-X)),
+ as SSE2 provides the PMINUB insn.  */
+  case V16QImode:
+ tmp0 = expand_unop (mode, neg_optab, op1, NULL_RTX, 0);
+ if (tmp0)
+  expand_simple_binop (V16QImode, UMIN, op1, tmp0, op0, 0,
+   OPTAB_DIRECT);
+ break;
+
+  default:
+ break;
+}
+}
+
 /* Expand an insert into a vector register through pinsr insn.
Return true if successful.  */

diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index c3f6c94..bd90f2d 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -8721,7 +8721,7 @@
(set (attr "prefix_rex") (symbol_ref "x86_extended_reg_mentioned_p (insn)"))
(set_attr "mode" "DI")])

-(define_insn "abs<mode>2"
+(define_insn "*abs<mode>2"
   [(set (match_operand:VI124_AVX2_48_AVX512F 0 "register_operand" "=v")
 

Re: [PATCH] Vectorizing abs(char/short/int) on x86.

2013-10-23 Thread Cong Hou
On Tue, Oct 22, 2013 at 8:11 PM,   wrote:
>
>
> Sent from my iPad
>
>> On Oct 22, 2013, at 7:23 PM, Cong Hou  wrote:
>>
>> This patch aims at PR58762.
>>
>> Currently GCC cannot vectorize the abs() operation for integers on x86
>> with only SSE2 support. For the int type, the reason is that the expand
>> for abs() is not defined for vector types. This patch defines such an
>> expand so that abs(int) will be vectorized with only SSE2.
>>
>> For abs(char/short), type conversions are needed as the current abs()
>> function/operation does not accept arguments of char/short type.
>> Therefore when we want to get the absolute value of a char_val using
>> abs (char_val), it will be converted into abs ((int) char_val). It
>> then can be vectorized, but the generated code is not efficient as
>> lots of packings and unpackings are involved. But if we convert
>> (char) abs ((int) char_val) to abs (char_val), the vectorizer will be
>> able to generate better code. Same for short.
>>
>> This conversion also enables vectorizing abs(char/short) operation
>> with PABSB and PABSW instructions in SSE3.
>>
>> With only SSE2 support, I developed three methods to expand
>> abs(char/short/int) separately:
>>
>> 1. For 32 bit int value x, we can get abs (x) from (((signed) x >>
>> (W-1)) ^ x) - ((signed) x >> (W-1)). This is better than max (x, -x),
>> which needs bit masking.
>>
>> 2. For 16 bit int value x, we can get abs (x) from max (x, -x), as
>> SSE2 provides PMAXSW instruction.
>>
>> 3. For 8 bit int value x, we can get abs (x) from min ((unsigned char)
>> x, (unsigned char) (-x)), as SSE2 provides PMINUB instruction.
>>
>>
>> The patch is pasted below. Please point out any problem in my patch
>> and analysis.
>>
>>
>> thanks,
>> Cong
>>
>>
>>
>>
>> diff --git a/gcc/ChangeLog b/gcc/ChangeLog
>> index 8a38316..e0f33ee 100644
>> --- a/gcc/ChangeLog
>> +++ b/gcc/ChangeLog
>> @@ -1,3 +1,13 @@
>> +2013-10-22  Cong Hou  
>> +
>> + PR target/58762
>> + * convert.c (convert_to_integer): Convert (char) abs ((int) char_val)
>> + into abs (char_val).  Also convert (short) abs ((int) short_val)
>> + into abs (short_val).
>
> I don't like this optimization in convert.  I think it should be submitted 
> separately and should be done in tree-ssa-forwprop.


Yes. This patch can be split into two: one for vectorization and one
for abs conversion.

The reason why I put abs conversion to convert.c is because fabs
conversion is also done there.


>
> Also I think you should have a generic (non x86) test case for the above 
> optimization.


For vectorization I need to do it on x86 since the define_expand is
only for it. But for abs conversion, yes, I should make a generic test
case.


Thank you for your comments!


Cong


>
> Thanks,
> Andrew
>
>
>> + * config/i386/i386-protos.h (ix86_expand_sse2_absvxsi2): New function.
>> + * config/i386/i386.c (ix86_expand_sse2_absvxsi2): New function.
>> + * config/i386/sse.md: Add SSE2 support to abs (char/int/short).
>
>
>
>> +
>> 2013-10-14  David Malcolm  
>>
>>  * dumpfile.h (gcc::dump_manager): New class, to hold state
>> diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h
>> index 3ab2f3a..e85f663 100644
>> --- a/gcc/config/i386/i386-protos.h
>> +++ b/gcc/config/i386/i386-protos.h
>> @@ -238,6 +238,7 @@ extern void ix86_expand_mul_widen_evenodd (rtx,
>> rtx, rtx, bool, bool);
>> extern void ix86_expand_mul_widen_hilo (rtx, rtx, rtx, bool, bool);
>> extern void ix86_expand_sse2_mulv4si3 (rtx, rtx, rtx);
>> extern void ix86_expand_sse2_mulvxdi3 (rtx, rtx, rtx);
>> +extern void ix86_expand_sse2_absvxsi2 (rtx, rtx);
>>
>> /* In i386-c.c  */
>> extern void ix86_target_macros (void);
>> diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
>> index 02cbbbd..8050e02 100644
>> --- a/gcc/config/i386/i386.c
>> +++ b/gcc/config/i386/i386.c
>> @@ -41696,6 +41696,53 @@ ix86_expand_sse2_mulvxdi3 (rtx op0, rtx op1, rtx op2)
>>gen_rtx_MULT (mode, op1, op2));
>> }
>>
>> +void
>> +ix86_expand_sse2_absvxsi2 (rtx op0, rtx op1)
>> +{
>> +  enum machine_mode mode = GET_MODE (op0);
>> +  rtx tmp0, tmp1;
>> +
>> +  switch (mode)
>> +{
>> +  /* For 32-bit signed integer X, the best way to calculate the absolute
>> + value of X is (((signed) X >> (W-1)) ^ X) - ((signed) X >> (W-1)).  */
>> +  case V4SImode:
>> + tmp0 = exp

Re: [PATCH] Vectorizing abs(char/short/int) on x86.

2013-10-23 Thread Cong Hou
On Wed, Oct 23, 2013 at 12:20 AM, Uros Bizjak  wrote:
> Hello!
>
>> Currently GCC cannot vectorize the abs() operation for integers on x86
>> with only SSE2 support. For the int type, the reason is that the expand
>> for abs() is not defined for vector types. This patch defines such an
>> expand so that abs(int) will be vectorized with only SSE2.
>
> +(define_expand "abs<mode>2"
> +  [(set (match_operand:VI124_AVX2_48_AVX512F 0 "register_operand")
> + (abs:VI124_AVX2_48_AVX512F
> +  (match_operand:VI124_AVX2_48_AVX512F 1 "register_operand")))]
> +  "TARGET_SSE2"
> +{
> +  if (TARGET_SSE2 && !TARGET_SSSE3)
> +ix86_expand_sse2_absvxsi2 (operands[0], operands[1]);
> +  else if (TARGET_SSSE3)
> +emit_insn (gen_rtx_SET (VOIDmode, operands[0],
> +gen_rtx_ABS (<MODE>mode, operands[1])));
> +  DONE;
> +})
>
> This should be written as:
>
> (define_expand "abs<mode>2"
>   [(set (match_operand:VI124_AVX2_48_AVX512F 0 "register_operand")
>(abs:VI124_AVX2_48_AVX512F
>  (match_operand:VI124_AVX2_48_AVX512F 1 "nonimmediate_operand")))]
>   "TARGET_SSE2"
> {
>   if (!TARGET_SSSE3)
> {
>   ix86_expand_sse2_absvxsi2 (operands[0], operands[1]);
>   DONE;
> }
> })

OK.

>
> Please note that operands[1] can be a memory operand, so your expander
> should either handle it (this is preferred) or load the operand to the
> register at the beginning of the expansion.


OK. I think I don't have to make any change to
ix86_expand_sse2_absvxsi2(), as operands[1] is always read-only.
Right?


>
> +void
> +ix86_expand_sse2_absvxsi2 (rtx op0, rtx op1)
>
> This function name implies SImode operands ... please just name it
> ix86_expand_sse2_abs.


Yes, my bad. At first I only considered V4SI but later forgot to
rename the function.


Thank you very much!


Cong

>
> Uros.


Re: [PATCH] Vectorizing abs(char/short/int) on x86.

2013-10-23 Thread Cong Hou
On Wed, Oct 23, 2013 at 8:52 AM, Joseph S. Myers
 wrote:
> On Tue, 22 Oct 2013, Cong Hou wrote:
>
>> For abs(char/short), type conversions are needed as the current abs()
>> function/operation does not accept arguments of char/short type.
>> Therefore when we want to get the absolute value of a char_val using
>> abs (char_val), it will be converted into abs ((int) char_val). It
>> then can be vectorized, but the generated code is not efficient as
>> lots of packings and unpackings are involved. But if we convert
>> (char) abs ((int) char_val) to abs (char_val), the vectorizer will be
>> able to generate better code. Same for short.
>
> ABS_EXPR has undefined overflow behavior.  Thus, abs ((int) -128) is
> defined (and we also define the subsequent conversion of +128 to signed
> char, which ISO C makes implementation-defined not undefined), and
> converting to an ABS_EXPR on char would wrongly make it undefined.  For
> such a transformation to be valid (in the absence of VRP saying that -128
> isn't a possible value) you'd need a GIMPLE representation for
> a wrapping ABS_EXPR, as distinct from the existing undefined-overflow
> ABS_EXPR.
> You don't have the option, as there is for some arithmetic operations,
> of converting to a corresponding operation on unsigned types.
>

Yes, you are right. The method I use can guarantee wrapping on
overflow (either shift-xor-sub or max(x, -x)). Can I just add the
condition if (flag_wrapv) before the conversion I made to prevent the
undefined behavior on overflow?

Thank you!

Cong


> --
> Joseph S. Myers
> jos...@codesourcery.com


Re: [PATCH] Vectorizing abs(char/short/int) on x86.

2013-10-23 Thread Cong Hou
I think I did not make it clear. If GCC defines that converting 128
to a char value yields the wrapping result -128, then the conversion
from (char) abs ((int) char_val) to abs (char_val) is safe if we can
guarantee abs (char(-128)) = -128 as well. The subsequent methods used
to compute abs() should then also guarantee wrapping on overflow.
Shift-xor-sub is OK, but max(x, -x) is OK only if the result of
negating -128 is also -128 (wrapping). I think that is exactly the
behavior of the SSE2 operation PSUBB ([0,...,0], [x,...,x]), as PSUBB
operates on both signed and unsigned operands. (A worked example for
-128 follows.)
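
As a concrete check of the 8-bit case (my arithmetic, two's
complement): for x = -128 (0x80), the wrapping negation 0x00 - 0x80 is
again 0x80, so max (x, -x) = max (-128, -128) = -128; shift-xor-sub
gives (0x80 ^ 0xFF) - 0xFF = 0x80 modulo 256, i.e. -128 as well. In
code:

#include <stdio.h>
#include <stdint.h>

int main (void)
{
  int8_t x = -128;
  /* Wrapping negation: 0 - 0x80 stays 0x80.  The final cast back to
     int8_t is implementation-defined in ISO C; GCC defines it as
     modulo reduction.  */
  int8_t n = (int8_t) (uint8_t) (0 - (uint8_t) x);
  int8_t r = x > n ? x : n;     /* max (x, -x) */
  printf ("%d\n", r);           /* prints -128 */
  return 0;
}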


thanks,
Cong


On Wed, Oct 23, 2013 at 9:40 PM, Cong Hou  wrote:
> On Wed, Oct 23, 2013 at 8:52 AM, Joseph S. Myers
>  wrote:
>> On Tue, 22 Oct 2013, Cong Hou wrote:
>>
>>> For abs(char/short), type conversions are needed as the current abs()
>>> function/operation does not accept arguments of char/short type.
>>> Therefore when we want to get the absolute value of a char_val using
>>> abs (char_val), it will be converted into abs ((int) char_val). It
>>> then can be vectorized, but the generated code is not efficient as
>>> lots of packings and unpackings are involved. But if we convert
>>> (char) abs ((int) char_val) to abs (char_val), the vectorizer will be
>>> able to generate better code. Same for short.
>>
>> ABS_EXPR has undefined overflow behavior.  Thus, abs ((int) -128) is
>> defined (and we also define the subsequent conversion of +128 to signed
>> char, which ISO C makes implementation-defined not undefined), and
>> converting to an ABS_EXPR on char would wrongly make it undefined.  For
>> such a transformation to be valid (in the absence of VRP saying that -128
>> isn't a possible value) you'd need a GIMPLE representation for
>> a wrapping ABS_EXPR, as distinct from the existing undefined-overflow
>> ABS_EXPR.
>> You don't have the option, as there is for some arithmetic operations,
>> of converting to a corresponding operation on unsigned types.
>>
>
> Yes, you are right. The method I use can guarantee wrapping on
> overflow (either shift-xor-sub or max(x, -x)). Can I just add the
> condition if (flag_wrapv) before the conversion I made to prevent the
> undefined behavior on overflow?
>
> Thank you!
>
> Cong
>
>
>> --
>> Joseph S. Myers
>> jos...@codesourcery.com


Re: [PATCH] Fixing improper conversion from sin() to sinf() in optimization mode.

2013-10-24 Thread Cong Hou
I have updated the patch according to your suggestion, and have
committed it, as bootstrapping and make check both passed.

Thank you for your patient help on this patch! I learned a lot from it.


thanks,
Cong


On Wed, Oct 23, 2013 at 1:13 PM, Joseph S. Myers
 wrote:
> On Mon, 7 Oct 2013, Cong Hou wrote:
>
>> +  if (type != newtype)
>> +break;
>
> That comparison would wrongly treat as different cases where the types
> differ only in one being a typedef, having qualifiers, etc. - or if in
> future GCC implemented proposed TS 18661-3, cases where they differ in
> e.g. one being float and the other _Float32 (defined as distinct types
> that are not compatible although they have the same representation and
> alignment).  I think the right test here, bearing in mind the _Float32
> case where types may not be compatible, is TYPE_MODE (type) != TYPE_MODE
> (newtype) - if the types have the same mode, they have the same set of
> values and so are not different in any way that matters for this
> optimization.  OK with that change.
>
> --
> Joseph S. Myers
> jos...@codesourcery.com


Re: [PATCH] Vectorizing abs(char/short/int) on x86.

2013-10-24 Thread Cong Hou
On Wed, Oct 23, 2013 at 11:18 PM, Jakub Jelinek  wrote:
> On Wed, Oct 23, 2013 at 09:40:21PM -0700, Cong Hou wrote:
>> On Wed, Oct 23, 2013 at 8:52 AM, Joseph S. Myers
>>  wrote:
>> > On Tue, 22 Oct 2013, Cong Hou wrote:
>> >
>> >> For abs(char/short), type conversions are needed as the current abs()
>> >> function/operation does not accept arguments of char/short type.
>> >> Therefore when we want to get the absolute value of a char_val using
>> >> abs (char_val), it will be converted into abs ((int) char_val). It
>> >> then can be vectorized, but the generated code is not efficient as
>> >> lots of packings and unpackings are involved. But if we convert
>> >> (char) abs ((int) char_val) to abs (char_val), the vectorizer will be
>> >> able to generate better code. Same for short.
>> >
>> > ABS_EXPR has undefined overflow behavior.  Thus, abs ((int) -128) is
>> > defined (and we also define the subsequent conversion of +128 to signed
>> > char, which ISO C makes implementation-defined not undefined), and
>> > converting to an ABS_EXPR on char would wrongly make it undefined.  For
>> > such a transformation to be valid (in the absence of VRP saying that -128
>> > isn't a possible value) you'd need a GIMPLE representation for
>> > a wrapping ABS_EXPR, as distinct from the existing undefined-overflow
>> > ABS_EXPR.
>> > You don't have the option, as there is for some arithmetic operations,
>> > of converting to a corresponding operation on unsigned types.
>> >
>>
>> Yes, you are right. The method I use can guarantee wrapping on
>> overflow (either shift-xor-sub or max(x, -x)). Can I just add the
>> condition if (flag_wrapv) before the conversion I made to prevent the
>> undefined behavior on overflow?
>
> What HW insns you expand to is one thing, but if some GCC pass assumes that
> ABS_EXPR always returns non-negative value (many do, look e.g. at
> tree_unary_nonnegative_warnv_p, extract_range_from_unary_expr_1,
> simplify_const_relational_operation, etc., you'd need to grep for all
> ABS_EXPR/ABS occurrences) and optimizes code based on that fact, you get
> wrong code because (char) abs((char) -128) is well defined.
> If we change ABS_EXPR/ABS definition that it is well defined on the most
> negative value of the type (resp. mode), then we lose all those
> optimizations, if we do that only for the char/short types, it would be
> quite weird, though we could keep the benefits, but at the RTL level we'd
> need to treat that way all the modes equal to short's mode and smaller (so,
> for sizeof(short) == sizeof(int) target even int's mode).


I checked those functions and they all consider the possibility of
overflow. For example, tree_unary_nonnegative_warnv_p only returns
true for ABS_EXPR on integers if overflow is undefined. If the
consequence of overflow is wrapping, I think converting (char)
abs((int)-128) to abs(-128) (-128 has char type) is safe. Can we do it
by checking flag_wrapv?

I could also first remove the abs conversion content from this patch
and keep only the abs() expansion for i386. I will submit
it later.


>
> The other possibility is not to create the ABS_EXPRs of char/short anywhere,
> solve the vectorization issues either through tree-vect-patterns.c or
> as part of the vectorization type demotion/promotions, see the recent
> discussions for that, you'd represent the short/char abs for the vectorized
> loop say using the shift-xor-sub or builtin etc. and if you want to do the
> same thing for scalar code, you'd just have combiner try to match some
> sequence.


Yes, I could do it through tree-vect-patterns.c, if the abs conversion
is prohibited. Currently the only reason I need the abs conversion is
for vectorization.

Vectorization type demotion/promotions is interesting, but I am afraid
we will face the same problem there.


Thank you for your comment!


Cong


>
> Jakub


Re: [PATCH] Vectorizing abs(char/short/int) on x86.

2013-10-28 Thread Cong Hou
As there are some issues with the abs() type conversions, I removed
the related content from the patch and kept only the SSE2 support for
abs(int).

For the define_expand I added below, the else body is there to
avoid fall-through transformations of the ABS operation in optabs.c.
Otherwise ABS will be converted to other operations even though we
have the corresponding instructions from SSSE3.


(define_expand "abs<mode>2"
  [(set (match_operand:VI124_AVX2_48_AVX512F 0 "register_operand")
(abs:VI124_AVX2_48_AVX512F
 (match_operand:VI124_AVX2_48_AVX512F 1 "nonimmediate_operand")))]
  "TARGET_SSE2"
{
  if (!TARGET_SSSE3)
ix86_expand_sse2_abs (operands[0], force_reg (<MODE>mode, operands[1]));
  else
emit_insn (gen_rtx_SET (VOIDmode, operands[0],
   gen_rtx_ABS (<MODE>mode, operands[1])));
  DONE;
})


The patch is attached here. Please give me your comments.


thanks,
Cong



diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 8a38316..84c7ab5 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,10 @@
+2013-10-22  Cong Hou  
+
+ PR target/58762
+ * config/i386/i386-protos.h (ix86_expand_sse2_abs): New function.
+ * config/i386/i386.c (ix86_expand_sse2_abs): New function.
+ * config/i386/sse.md: Add SSE2 support to abs (8/16/32-bit-int).
+
 2013-10-14  David Malcolm  

  * dumpfile.h (gcc::dump_manager): New class, to hold state
diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h
index 3ab2f3a..ca31224 100644
--- a/gcc/config/i386/i386-protos.h
+++ b/gcc/config/i386/i386-protos.h
@@ -238,6 +238,7 @@ extern void ix86_expand_mul_widen_evenodd (rtx,
rtx, rtx, bool, bool);
 extern void ix86_expand_mul_widen_hilo (rtx, rtx, rtx, bool, bool);
 extern void ix86_expand_sse2_mulv4si3 (rtx, rtx, rtx);
 extern void ix86_expand_sse2_mulvxdi3 (rtx, rtx, rtx);
+extern void ix86_expand_sse2_abs (rtx, rtx);

 /* In i386-c.c  */
 extern void ix86_target_macros (void);
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 02cbbbd..71905fc 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -41696,6 +41696,53 @@ ix86_expand_sse2_mulvxdi3 (rtx op0, rtx op1, rtx op2)
gen_rtx_MULT (mode, op1, op2));
 }

+void
+ix86_expand_sse2_abs (rtx op0, rtx op1)
+{
+  enum machine_mode mode = GET_MODE (op0);
+  rtx tmp0, tmp1;
+
+  switch (mode)
+{
+  /* For 32-bit signed integer X, the best way to calculate the absolute
+ value of X is (((signed) X >> (W-1)) ^ X) - ((signed) X >> (W-1)).  */
+  case V4SImode:
+ tmp0 = expand_simple_binop (mode, ASHIFTRT, op1,
+GEN_INT (GET_MODE_BITSIZE
+ (GET_MODE_INNER (mode)) - 1),
+NULL, 0, OPTAB_DIRECT);
+ if (tmp0)
+  tmp1 = expand_simple_binop (mode, XOR, op1, tmp0,
+  NULL, 0, OPTAB_DIRECT);
+ if (tmp0 && tmp1)
+  expand_simple_binop (mode, MINUS, tmp1, tmp0,
+   op0, 0, OPTAB_DIRECT);
+ break;
+
+  /* For 16-bit signed integer X, the best way to calculate the absolute
+ value of X is max (X, -X), as SSE2 provides the PMAXSW insn.  */
+  case V8HImode:
+ tmp0 = expand_unop (mode, neg_optab, op1, NULL_RTX, 0);
+ if (tmp0)
+  expand_simple_binop (mode, SMAX, op1, tmp0, op0, 0,
+   OPTAB_DIRECT);
+ break;
+
+  /* For 8-bit signed integer X, the best way to calculate the absolute
+ value of X is min ((unsigned char) X, (unsigned char) (-X)),
+ as SSE2 provides the PMINUB insn.  */
+  case V16QImode:
+ tmp0 = expand_unop (mode, neg_optab, op1, NULL_RTX, 0);
+ if (tmp0)
+  expand_simple_binop (V16QImode, UMIN, op1, tmp0, op0, 0,
+   OPTAB_DIRECT);
+ break;
+
+  default:
+ break;
+}
+}
+
 /* Expand an insert into a vector register through pinsr insn.
Return true if successful.  */

diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index c3f6c94..b85ded4 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -8721,7 +8721,7 @@
(set (attr "prefix_rex") (symbol_ref "x86_extended_reg_mentioned_p (insn)"))
(set_attr "mode" "DI")])

-(define_insn "abs<mode>2"
+(define_insn "*abs<mode>2"
   [(set (match_operand:VI124_AVX2_48_AVX512F 0 "register_operand" "=v")
  (abs:VI124_AVX2_48_AVX512F
   (match_operand:VI124_AVX2_48_AVX512F 1 "nonimmediate_operand" "vm")))]
@@ -8733,6 +8733,20 @@
(set_attr "prefix" "maybe_vex")
(set_attr "mode" "")])

+(define_expand "abs<mode>2"
+  [(set (match_operand:VI124_AVX2_48_AVX512F 0 "register_operand")
+ (abs:VI124_AVX2_48_AVX512F
+  (match_operand:VI124_AVX2_48_AVX512F 1 "nonimmediate_operand")))]
+  "TARGET_SSE2"
+{
+  if (!TARGET_SSSE3)
+ix86_expand_sse2_abs (operands[0], force_reg (<MODE>mode, operands[1]));
+  else
+emit_insn (gen_rtx_SET (VOIDmode, operands[0],
+gen_rtx_ABS (<MODE>mode, operands[1])));
+  DONE;
+})
+
 (define_insn "abs2"
   [(set (match_operand:MMXMODEI 0 "register_operand" "

Re: [PATCH] Vectorizing abs(char/short/int) on x86.

2013-10-29 Thread Cong Hou
On Tue, Oct 29, 2013 at 1:38 AM, Uros Bizjak  wrote:
> Hello!
>
>> For the define_expand I added below, the else body is there to
>> avoid fall-through transformations of the ABS operation in optabs.c.
>> Otherwise ABS will be converted to other operations even though we
>> have the corresponding instructions from SSSE3.
>
> No, it wont be.
>
> Fallthrough will generate the pattern that will be matched by the insn
> pattern above, just like you are doing by hand below.


I think this case is special for abs(). In optabs.c there is a
function expand_abs(), which calls expand_abs_nojump(). That function
first tries the expand function defined for the target, and if that
fails it tries max(v, -v) and then the shift-xor-sub method. If I
don't generate any instruction for SSSE3, the fall-through will be
max(v, -v). I have tested this on my machine.
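
In pseudocode, the fallback chain described above looks roughly like
this (a paraphrase of expand_abs_nojump from memory; all helper names
are invented for illustration, this is not the literal optabs.c
source):

typedef struct rtx_def *rtx;
typedef int machine_mode;

/* Hypothetical stand-ins for the three strategies.  */
extern rtx try_target_abs_pattern (machine_mode, rtx); /* abs<mode>2 expander */
extern rtx try_smax_of_neg (machine_mode, rtx);        /* max (v, -v)         */
extern rtx emit_shift_xor_sub (machine_mode, rtx);     /* ((v>>W-1)^v) - ...  */

rtx expand_abs_nojump_sketch (machine_mode mode, rtx op)
{
  rtx r = try_target_abs_pattern (mode, op);
  if (!r)
    r = try_smax_of_neg (mode, op);
  if (!r)
    r = emit_shift_xor_sub (mode, op);
  return r;
}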


>
>> (define_expand "abs<mode>2"
>>   [(set (match_operand:VI124_AVX2_48_AVX512F 0 "register_operand")
>> (abs:VI124_AVX2_48_AVX512F
>>  (match_operand:VI124_AVX2_48_AVX512F 1 "nonimmediate_operand")))]
>>   "TARGET_SSE2"
>> {
>>   if (!TARGET_SSSE3)
>> ix86_expand_sse2_abs (operands[0], force_reg (<MODE>mode, operands[1]));
>
> Do you really need force_reg here? You are using generic expanders in
> ix86_expand_sse2_abs that can handle non-registers operands just as
> well.

You are right. I have removed force_reg.


>
>>   else
>> emit_insn (gen_rtx_SET (VOIDmode, operands[0],
>>gen_rtx_ABS (<MODE>mode, operands[1])));
>>   DONE;
>> })
>
> Please note that your mailer mangles indents. Please indent your code 
> correctly.

Right. I have also attached a text file in which all the tabs are preserved.


The updated patch is pasted below (and also in the attached file).
Thank you very much for your comment!


Cong




diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 8a38316..84c7ab5 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,10 @@
+2013-10-22  Cong Hou  
+
+ PR target/58762
+ * config/i386/i386-protos.h (ix86_expand_sse2_abs): New function.
+ * config/i386/i386.c (ix86_expand_sse2_abs): New function.
+ * config/i386/sse.md: Add SSE2 support to abs (8/16/32-bit-int).
+
 2013-10-14  David Malcolm  

  * dumpfile.h (gcc::dump_manager): New class, to hold state
diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h
index 3ab2f3a..ca31224 100644
--- a/gcc/config/i386/i386-protos.h
+++ b/gcc/config/i386/i386-protos.h
@@ -238,6 +238,7 @@ extern void ix86_expand_mul_widen_evenodd (rtx,
rtx, rtx, bool, bool);
 extern void ix86_expand_mul_widen_hilo (rtx, rtx, rtx, bool, bool);
 extern void ix86_expand_sse2_mulv4si3 (rtx, rtx, rtx);
 extern void ix86_expand_sse2_mulvxdi3 (rtx, rtx, rtx);
+extern void ix86_expand_sse2_abs (rtx, rtx);

 /* In i386-c.c  */
 extern void ix86_target_macros (void);
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 02cbbbd..71905fc 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -41696,6 +41696,53 @@ ix86_expand_sse2_mulvxdi3 (rtx op0, rtx op1, rtx op2)
gen_rtx_MULT (mode, op1, op2));
 }

+void
+ix86_expand_sse2_abs (rtx op0, rtx op1)
+{
+  enum machine_mode mode = GET_MODE (op0);
+  rtx tmp0, tmp1;
+
+  switch (mode)
+{
+  /* For 32-bit signed integer X, the best way to calculate the absolute
+ value of X is (((signed) X >> (W-1)) ^ X) - ((signed) X >> (W-1)).  */
+  case V4SImode:
+ tmp0 = expand_simple_binop (mode, ASHIFTRT, op1,
+GEN_INT (GET_MODE_BITSIZE
+ (GET_MODE_INNER (mode)) - 1),
+NULL, 0, OPTAB_DIRECT);
+ if (tmp0)
+  tmp1 = expand_simple_binop (mode, XOR, op1, tmp0,
+  NULL, 0, OPTAB_DIRECT);
+ if (tmp0 && tmp1)
+  expand_simple_binop (mode, MINUS, tmp1, tmp0,
+   op0, 0, OPTAB_DIRECT);
+ break;
+
+  /* For 16-bit signed integer X, the best way to calculate the absolute
+ value of X is max (X, -X), as SSE2 provides the PMAXSW insn.  */
+  case V8HImode:
+ tmp0 = expand_unop (mode, neg_optab, op1, NULL_RTX, 0);
+ if (tmp0)
+  expand_simple_binop (mode, SMAX, op1, tmp0, op0, 0,
+   OPTAB_DIRECT);
+ break;
+
+  /* For 8-bit signed integer X, the best way to calculate the absolute
+ value of X is min ((unsigned char) X, (unsigned char) (-X)),
+ as SSE2 provides the PMINUB insn.  */
+  case V16QImode:
+ tmp0 = expand_unop (mode, neg_optab, op1, NULL_RTX, 0);
+ if (tmp0)
+  expand_simple_binop (V16QImode, UMIN, op1, tmp0, op0, 0,
+   OPTAB_DIRECT);
+ break;
+
+  default:
+ break;
+}
+}
+
 /* Expand an insert into a vector register through pinsr insn.
Return true if successful.  */

diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index c3f6c94..0d9cefe 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -8721,7 +8721,7 @@
(set (attr "prefix_rex") (

Re: [PATCH] Vectorizing abs(char/short/int) on x86.

2013-10-29 Thread Cong Hou
On Tue, Oct 29, 2013 at 10:34 AM, Uros Bizjak  wrote:
> On Tue, Oct 29, 2013 at 6:18 PM, Cong Hou  wrote:
>
>>>> For the define_expand I added below, the else body is there to
>>>> avoid fall-through transformations of the ABS operation in optabs.c.
>>>> Otherwise ABS will be converted to other operations even though we
>>>> have the corresponding instructions from SSSE3.
>>>
>>> No, it wont be.
>>>
>>> Fallthrough will generate the pattern that will be matched by the insn
>>> pattern above, just like you are doing by hand below.
>>
>>
>> I think this case is special for abs(). In optabs.c there is a
>> function expand_abs(), which calls expand_abs_nojump(). That function
>> first tries the expand function defined for the target, and if that
>> fails it tries max(v, -v) and then the shift-xor-sub method. If I
>> don't generate any instruction for SSSE3, the fall-through will be
>> max(v, -v). I have tested this on my machine.
>
> Huh, strange.
>
> Then you can rename the previous pattern to abs<mode>2_1 and call it from
> the new expander instead of expanding it manually. Please also add a
> small comment, describing the situation to prevent future
> "optimizations" in this place.

Could you tell me how to do that? Is the renamed pattern abs<mode>2_1
also a "define_expand"? How do I call this expander?

Thank you!


Cong



>
> Thanks,
> Uros.


[PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.

2013-10-29 Thread Cong Hou
Hi

SAD (Sum of Absolute Differences) is a common and important algorithm
in image processing and other areas. SSE2 even introduced a new
instruction PSADBW for it. A SAD loop can be greatly accelerated by
this instruction after being vectorized. This patch introduces a new
operation, SAD_EXPR, and a SAD pattern recognizer in the vectorizer.

The pattern of SAD is shown below:

 unsigned type x_t, y_t;
 signed TYPE1 diff, abs_diff;
 TYPE2 sum = init;
   loop:
 sum_0 = phi <init, sum_1>
 S1  x_t = ...
 S2  y_t = ...
 S3  x_T = (TYPE1) x_t;
 S4  y_T = (TYPE1) y_t;
 S5  diff = x_T - y_T;
 S6  abs_diff = ABS_EXPR <diff>;
 [S7  abs_diff = (TYPE2) abs_diff;  #optional]
 S8  sum_1 = abs_diff + sum_0;

   where 'TYPE1' is at least double the size of type 'type', and 'TYPE2' is the
   same size as 'TYPE1' or bigger. This is a special case of a reduction
   computation.

For SSE2, type is char, and TYPE1 and TYPE2 are int.
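
A typical source loop matching this pattern (my example, not taken
from the patch) is the classic SAD kernel:

int sad (const unsigned char *x, const unsigned char *y, int n)
{
  int i, sum = 0;
  for (i = 0; i < n; ++i)
    {
      int diff = (int) x[i] - (int) y[i];   /* S3, S4, S5 */
      sum += diff >= 0 ? diff : -diff;      /* S6, S8     */
    }
  return sum;
}

With SSE2, PSADBW computes the sum of absolute differences of eight
byte pairs into each 64-bit half of the destination register, which is
why the sadv16qi expander below needs only a widening move and an add
on top of it.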


In order to express this new operation, a new expression SAD_EXPR is
introduced in tree.def, and the corresponding entry in optabs is
added. The patch also adds the "define_expand" for the SSE2 and AVX2
platforms for i386.

The patch is pasted below and also attached as a text file (in which
you can see tabs). Bootstrap and make check passed on x86. Please
give me your comments.



thanks,
Cong



diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 8a38316..d528307 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,23 @@
+2013-10-29  Cong Hou  
+
+ * tree-vect-patterns.c (vect_recog_sad_pattern): New function for SAD
+ pattern recognition.
+ (type_conversion_p): PROMOTION is true if it's a type promotion
+ conversion, and false otherwise.  Return true if the given expression
+ is a type conversion one.
+ * tree-vectorizer.h: Adjust the number of patterns.
+ * tree.def: Add SAD_EXPR.
+ * optabs.def: Add sad_optab.
+ * cfgexpand.c (expand_debug_expr): Add SAD_EXPR case.
+ * expr.c (expand_expr_real_2): Likewise.
+ * gimple-pretty-print.c (dump_ternary_rhs): Likewise.
+ * gimple.c (get_gimple_rhs_num_ops): Likewise.
+ * optabs.c (optab_for_tree_code): Likewise.
+ * tree-cfg.c (estimate_operator_cost): Likewise.
+ * tree-ssa-operands.c (get_expr_operands): Likewise.
+ * tree-vect-loop.c (get_initial_def_for_reduction): Likewise.
+ * config/i386/sse.md: Add SSE2 and AVX2 expand for SAD.
+
 2013-10-14  David Malcolm  

  * dumpfile.h (gcc::dump_manager): New class, to hold state
diff --git a/gcc/cfgexpand.c b/gcc/cfgexpand.c
index 7ed29f5..9ec761a 100644
--- a/gcc/cfgexpand.c
+++ b/gcc/cfgexpand.c
@@ -2730,6 +2730,7 @@ expand_debug_expr (tree exp)
  {
  case COND_EXPR:
  case DOT_PROD_EXPR:
+ case SAD_EXPR:
  case WIDEN_MULT_PLUS_EXPR:
  case WIDEN_MULT_MINUS_EXPR:
  case FMA_EXPR:
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index c3f6c94..ca1ab70 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -6052,6 +6052,40 @@
   DONE;
 })

+(define_expand "sadv16qi"
+  [(match_operand:V4SI 0 "register_operand")
+   (match_operand:V16QI 1 "register_operand")
+   (match_operand:V16QI 2 "register_operand")
+   (match_operand:V4SI 3 "register_operand")]
+  "TARGET_SSE2"
+{
+  rtx t1 = gen_reg_rtx (V2DImode);
+  rtx t2 = gen_reg_rtx (V4SImode);
+  emit_insn (gen_sse2_psadbw (t1, operands[1], operands[2]));
+  convert_move (t2, t1, 0);
+  emit_insn (gen_rtx_SET (VOIDmode, operands[0],
+  gen_rtx_PLUS (V4SImode,
+ operands[3], t2)));
+  DONE;
+})
+
+(define_expand "sadv32qi"
+  [(match_operand:V8SI 0 "register_operand")
+   (match_operand:V32QI 1 "register_operand")
+   (match_operand:V32QI 2 "register_operand")
+   (match_operand:V8SI 3 "register_operand")]
+  "TARGET_AVX2"
+{
+  rtx t1 = gen_reg_rtx (V4DImode);
+  rtx t2 = gen_reg_rtx (V8SImode);
+  emit_insn (gen_avx2_psadbw (t1, operands[1], operands[2]));
+  convert_move (t2, t1, 0);
+  emit_insn (gen_rtx_SET (VOIDmode, operands[0],
+  gen_rtx_PLUS (V8SImode,
+ operands[3], t2)));
+  DONE;
+})
+
 (define_insn "ashr3"
   [(set (match_operand:VI24_AVX2 0 "register_operand" "=x,x")
  (ashiftrt:VI24_AVX2
diff --git a/gcc/expr.c b/gcc/expr.c
index 4975a64..1db8a49 100644
--- a/gcc/expr.c
+++ b/gcc/expr.c
@@ -9026,6 +9026,20 @@ expand_expr_real_2 (sepops ops, rtx target,
enum machine_mode tmode,
  return target;
   }

+  case SAD_EXPR:
+  {
+ tree oprnd0 = treeop0;
+ tree oprnd1 = treeop1;
+ tree oprnd2 = treeop2;
+ rtx op2;
+
+ expand_operands (oprnd0, oprnd1, NULL_RTX, &op0, &op1, EXPAND_NORMAL);
+ op2 = expand_normal (oprnd2);
+ target = expand_widen_pattern_expr (ops, op0, op1, op2,
+target, unsignedp);
+ return target;
+  }
+
 case REALIGN_LOAD_EXPR:
   {
 tree oprnd0 = treeop0;
diff --git a/gcc/gimple-pretty-print.c b/gcc/gimple-pretty-print.c
index f

Re: [PATCH] Vectorizing abs(char/short/int) on x86.

2013-10-30 Thread Cong Hou
I found my problem: I put DONE outside of the if, not inside. You are
right. I have updated my patch.

I appreciate your comments and testing!


thanks,
Cong



diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 8a38316..84c7ab5 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,10 @@
+2013-10-22  Cong Hou  
+
+ PR target/58762
+ * config/i386/i386-protos.h (ix86_expand_sse2_abs): New function.
+ * config/i386/i386.c (ix86_expand_sse2_abs): New function.
+ * config/i386/sse.md: Add SSE2 support to abs (8/16/32-bit-int).
+
 2013-10-14  David Malcolm  

  * dumpfile.h (gcc::dump_manager): New class, to hold state
diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h
index 3ab2f3a..ca31224 100644
--- a/gcc/config/i386/i386-protos.h
+++ b/gcc/config/i386/i386-protos.h
@@ -238,6 +238,7 @@ extern void ix86_expand_mul_widen_evenodd (rtx,
rtx, rtx, bool, bool);
 extern void ix86_expand_mul_widen_hilo (rtx, rtx, rtx, bool, bool);
 extern void ix86_expand_sse2_mulv4si3 (rtx, rtx, rtx);
 extern void ix86_expand_sse2_mulvxdi3 (rtx, rtx, rtx);
+extern void ix86_expand_sse2_abs (rtx, rtx);

 /* In i386-c.c  */
 extern void ix86_target_macros (void);
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 02cbbbd..71905fc 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -41696,6 +41696,53 @@ ix86_expand_sse2_mulvxdi3 (rtx op0, rtx op1, rtx op2)
gen_rtx_MULT (mode, op1, op2));
 }

+void
+ix86_expand_sse2_abs (rtx op0, rtx op1)
+{
+  enum machine_mode mode = GET_MODE (op0);
+  rtx tmp0, tmp1;
+
+  switch (mode)
+{
+  /* For 32-bit signed integer X, the best way to calculate the absolute
+ value of X is (((signed) X >> (W-1)) ^ X) - ((signed) X >> (W-1)).  */
+  case V4SImode:
+ tmp0 = expand_simple_binop (mode, ASHIFTRT, op1,
+GEN_INT (GET_MODE_BITSIZE
+ (GET_MODE_INNER (mode)) - 1),
+NULL, 0, OPTAB_DIRECT);
+ if (tmp0)
+  tmp1 = expand_simple_binop (mode, XOR, op1, tmp0,
+  NULL, 0, OPTAB_DIRECT);
+ if (tmp0 && tmp1)
+  expand_simple_binop (mode, MINUS, tmp1, tmp0,
+   op0, 0, OPTAB_DIRECT);
+ break;
+
+  /* For 16-bit signed integer X, the best way to calculate the absolute
+ value of X is max (X, -X), as SSE2 provides the PMAXSW insn.  */
+  case V8HImode:
+ tmp0 = expand_unop (mode, neg_optab, op1, NULL_RTX, 0);
+ if (tmp0)
+  expand_simple_binop (mode, SMAX, op1, tmp0, op0, 0,
+   OPTAB_DIRECT);
+ break;
+
+  /* For 8-bit signed integer X, the best way to calculate the absolute
+ value of X is min ((unsigned char) X, (unsigned char) (-X)),
+ as SSE2 provides the PMINUB insn.  */
+  case V16QImode:
+ tmp0 = expand_unop (mode, neg_optab, op1, NULL_RTX, 0);
+ if (tmp0)
+  expand_simple_binop (V16QImode, UMIN, op1, tmp0, op0, 0,
+   OPTAB_DIRECT);
+ break;
+
+  default:
+ break;
+}
+}
+
 /* Expand an insert into a vector register through pinsr insn.
Return true if successful.  */

diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index c3f6c94..46e1df4 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -8721,7 +8721,7 @@
(set (attr "prefix_rex") (symbol_ref "x86_extended_reg_mentioned_p (insn)"))
(set_attr "mode" "DI")])

-(define_insn "abs<mode>2"
+(define_insn "*abs<mode>2"
   [(set (match_operand:VI124_AVX2_48_AVX512F 0 "register_operand" "=v")
  (abs:VI124_AVX2_48_AVX512F
   (match_operand:VI124_AVX2_48_AVX512F 1 "nonimmediate_operand" "vm")))]
@@ -8733,6 +8733,19 @@
(set_attr "prefix" "maybe_vex")
(set_attr "mode" "")])

+(define_expand "abs<mode>2"
+  [(set (match_operand:VI124_AVX2_48_AVX512F 0 "register_operand")
+ (abs:VI124_AVX2_48_AVX512F
+  (match_operand:VI124_AVX2_48_AVX512F 1 "nonimmediate_operand")))]
+  "TARGET_SSE2"
+{
+  if (!TARGET_SSSE3)
+{
+  ix86_expand_sse2_abs (operands[0], operands[1]);
+  DONE;
+}
+})
+
 (define_insn "abs2"
   [(set (match_operand:MMXMODEI 0 "register_operand" "=y")
  (abs:MMXMODEI
diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
index 075d071..cf5b942 100644
--- a/gcc/testsuite/ChangeLog
+++ b/gcc/testsuite/ChangeLog
@@ -1,3 +1,8 @@
+2013-10-22  Cong Hou  
+
+ PR target/58762
+ * gcc.dg/vect/pr58762.c: New test.
+
 2013-10-14  Tobias Burnus  

  PR fortran/58658
diff --git a/gcc/testsuite/gcc.dg/vect/pr58762.c
b/gcc/testsuite/gcc.dg/vect/pr58762.c
new file mode 100644
index 000..6468d0a
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr58762.c
@@ -0,0 +1,28 @@
+/* { dg-require-effective-target vect_int } */
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize" } */
+
+void test1 (char* a, char* b)
+{
+  int i;
+  for (i = 0; i < 1; ++i)
+a[i] = abs (b[i]);
+}
+
+void test2 (short* a, short* b)
+{
+  int i;
+  for (i = 0; i &l

Re: [PATCH] Vectorizing abs(char/short/int) on x86.

2013-10-30 Thread Cong Hou
Forget to attach the patch file.



thanks,
Cong


On Wed, Oct 30, 2013 at 10:01 AM, Cong Hou  wrote:
> I found my problem: I put DONE outside of the if, not inside. You
> are right. I have updated my patch.
>
> I appreciate your comment and test on it!
>
>
> thanks,
> Cong
>
>
>
> diff --git a/gcc/ChangeLog b/gcc/ChangeLog
> index 8a38316..84c7ab5 100644
> --- a/gcc/ChangeLog
> +++ b/gcc/ChangeLog
> @@ -1,3 +1,10 @@
> +2013-10-22  Cong Hou  
> +
> + PR target/58762
> + * config/i386/i386-protos.h (ix86_expand_sse2_abs): New function.
> + * config/i386/i386.c (ix86_expand_sse2_abs): New function.
> + * config/i386/sse.md: Add SSE2 support to abs (8/16/32-bit-int).
> +
>  2013-10-14  David Malcolm  
>
>   * dumpfile.h (gcc::dump_manager): New class, to hold state
> diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h
> index 3ab2f3a..ca31224 100644
> --- a/gcc/config/i386/i386-protos.h
> +++ b/gcc/config/i386/i386-protos.h
> @@ -238,6 +238,7 @@ extern void ix86_expand_mul_widen_evenodd (rtx,
> rtx, rtx, bool, bool);
>  extern void ix86_expand_mul_widen_hilo (rtx, rtx, rtx, bool, bool);
>  extern void ix86_expand_sse2_mulv4si3 (rtx, rtx, rtx);
>  extern void ix86_expand_sse2_mulvxdi3 (rtx, rtx, rtx);
> +extern void ix86_expand_sse2_abs (rtx, rtx);
>
>  /* In i386-c.c  */
>  extern void ix86_target_macros (void);
> diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
> index 02cbbbd..71905fc 100644
> --- a/gcc/config/i386/i386.c
> +++ b/gcc/config/i386/i386.c
> @@ -41696,6 +41696,53 @@ ix86_expand_sse2_mulvxdi3 (rtx op0, rtx op1, rtx op2)
> gen_rtx_MULT (mode, op1, op2));
>  }
>
> +void
> +ix86_expand_sse2_abs (rtx op0, rtx op1)
> +{
> +  enum machine_mode mode = GET_MODE (op0);
> +  rtx tmp0, tmp1;
> +
> +  switch (mode)
> +{
> +  /* For 32-bit signed integer X, the best way to calculate the absolute
> + value of X is (((signed) X >> (W-1)) ^ X) - ((signed) X >> (W-1)).  */
> +  case V4SImode:
> + tmp0 = expand_simple_binop (mode, ASHIFTRT, op1,
> +GEN_INT (GET_MODE_BITSIZE
> + (GET_MODE_INNER (mode)) - 1),
> +NULL, 0, OPTAB_DIRECT);
> + if (tmp0)
> +  tmp1 = expand_simple_binop (mode, XOR, op1, tmp0,
> +  NULL, 0, OPTAB_DIRECT);
> + if (tmp0 && tmp1)
> +  expand_simple_binop (mode, MINUS, tmp1, tmp0,
> +   op0, 0, OPTAB_DIRECT);
> + break;
> +
> +  /* For 16-bit signed integer X, the best way to calculate the absolute
> + value of X is max (X, -X), as SSE2 provides the PMAXSW insn.  */
> +  case V8HImode:
> + tmp0 = expand_unop (mode, neg_optab, op1, NULL_RTX, 0);
> + if (tmp0)
> +  expand_simple_binop (mode, SMAX, op1, tmp0, op0, 0,
> +   OPTAB_DIRECT);
> + break;
> +
> +  /* For 8-bit signed integer X, the best way to calculate the absolute
> + value of X is min ((unsigned char) X, (unsigned char) (-X)),
> + as SSE2 provides the PMINUB insn.  */
> +  case V16QImode:
> + tmp0 = expand_unop (mode, neg_optab, op1, NULL_RTX, 0);
> + if (tmp0)
> +  expand_simple_binop (V16QImode, UMIN, op1, tmp0, op0, 0,
> +   OPTAB_DIRECT);
> + break;
> +
> +  default:
> + break;
> +}
> +}
> +
>  /* Expand an insert into a vector register through pinsr insn.
> Return true if successful.  */
>
> diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> index c3f6c94..46e1df4 100644
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -8721,7 +8721,7 @@
> (set (attr "prefix_rex") (symbol_ref "x86_extended_reg_mentioned_p 
> (insn)"))
> (set_attr "mode" "DI")])
>
> -(define_insn "abs<mode>2"
> +(define_insn "*abs<mode>2"
>[(set (match_operand:VI124_AVX2_48_AVX512F 0 "register_operand" "=v")
>   (abs:VI124_AVX2_48_AVX512F
>(match_operand:VI124_AVX2_48_AVX512F 1 "nonimmediate_operand" "vm")))]
> @@ -8733,6 +8733,19 @@
> (set_attr "prefix" "maybe_vex")
> (set_attr "mode" "")])
>
> +(define_expand "abs<mode>2"
> +  [(set (match_operand:VI124_AVX2_48_AVX512F 0 "register_operand")
> + (abs:VI124_AVX2_48_AVX512F
> +  (match_operand:VI124_AVX2_48_AVX512F 1 "nonimmediate_operand")))]
> +  "TARGET_SSE2"
> +{
> +  if (!TARGET_SSSE3)
> +{
> +  ix86_expand_sse2_abs (operands[0], operands[1]);
> +  DONE;
> +}
> +})
> +
>  (define_insn "abs<mode>2"
>[(set (match_operand:MMXMODEI 0 "register_operand" "=y")
>   (abs:MMXMODEI
> diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
>

Re: [PATCH] Vectorizing abs(char/short/int) on x86.

2013-10-30 Thread Cong Hou
On Wed, Oct 30, 2013 at 10:22 AM, Uros Bizjak  wrote:
> On Wed, Oct 30, 2013 at 6:01 PM, Cong Hou  wrote:
>> I found my problem: I put DONE outside of if not inside. You are
>> right. I have updated my patch.
>
> OK, great that we put things in order ;)
>
> Does this patch need some extra middle-end functionality? I was not
> able to vectorize char and short part of your patch.


In the original patch, I converted abs() on short and char values to
their own types by removing type casts. That is, originally char_val1
= abs(char_val2) is converted to char_val1 = (char) abs((int)
char_val2) in the frontend, and I wanted to convert it back to
char_val1 = abs(char_val2). But after several discussions, it seems
this conversion has some problems such as overflow concerns, and I
therefore removed that part.

Now you should still be able to vectorize abs(char) and abs(short),
but with packing and unpacking. Later I will consider writing a
pattern recognizer for abs(char) and abs(short), and then the expands
for abs(char)/abs(short) in this patch will be used during
vectorization.
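
For reference, a minimal sketch (mine, not part of the patch) of the
form the vectorizer actually sees after the frontend promotes the
abs() argument to int:

#include <stdlib.h>

void
foo (char *a, const char *b, int n)
{
  int i;
  for (i = 0; i < n; ++i)
    /* What the frontend emits for a[i] = abs (b[i]);  */
    a[i] = (char) abs ((int) b[i]);
}

Vectorizing this today needs unpacking the chars to ints and packing
the results back, which is what the planned pattern recognizer would
avoid.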


>
> Regarding the testcase - please put it to gcc.target/i386/ directory.
> There is nothing generic in the test, as confirmed by target-dependent
> scan test. You will find plenty of examples in the mentioned
> directory. I'd suggest to split the testcase in three files, and to
> simplify it to something like the testcase with global variables I
> used earlier.


I have done it. The test case is split into three for s8/s16/s32 in
gcc.target/i386.


Thank you!

Cong



diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 8a38316..84c7ab5 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,10 @@
+2013-10-22  Cong Hou  
+
+ PR target/58762
+ * config/i386/i386-protos.h (ix86_expand_sse2_abs): New function.
+ * config/i386/i386.c (ix86_expand_sse2_abs): New function.
+ * config/i386/sse.md: Add SSE2 support to abs (8/16/32-bit-int).
+
 2013-10-14  David Malcolm  

  * dumpfile.h (gcc::dump_manager): New class, to hold state
diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h
index 3ab2f3a..ca31224 100644
--- a/gcc/config/i386/i386-protos.h
+++ b/gcc/config/i386/i386-protos.h
@@ -238,6 +238,7 @@ extern void ix86_expand_mul_widen_evenodd (rtx,
rtx, rtx, bool, bool);
 extern void ix86_expand_mul_widen_hilo (rtx, rtx, rtx, bool, bool);
 extern void ix86_expand_sse2_mulv4si3 (rtx, rtx, rtx);
 extern void ix86_expand_sse2_mulvxdi3 (rtx, rtx, rtx);
+extern void ix86_expand_sse2_abs (rtx, rtx);

 /* In i386-c.c  */
 extern void ix86_target_macros (void);
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 02cbbbd..71905fc 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -41696,6 +41696,53 @@ ix86_expand_sse2_mulvxdi3 (rtx op0, rtx op1, rtx op2)
gen_rtx_MULT (mode, op1, op2));
 }

+void
+ix86_expand_sse2_abs (rtx op0, rtx op1)
+{
+  enum machine_mode mode = GET_MODE (op0);
+  rtx tmp0, tmp1;
+
+  switch (mode)
+{
+  /* For 32-bit signed integer X, the best way to calculate the absolute
+ value of X is (((signed) X >> (W-1)) ^ X) - ((signed) X >> (W-1)).  */
+  case V4SImode:
+ tmp0 = expand_simple_binop (mode, ASHIFTRT, op1,
+GEN_INT (GET_MODE_BITSIZE
+ (GET_MODE_INNER (mode)) - 1),
+NULL, 0, OPTAB_DIRECT);
+ if (tmp0)
+  tmp1 = expand_simple_binop (mode, XOR, op1, tmp0,
+  NULL, 0, OPTAB_DIRECT);
+ if (tmp0 && tmp1)
+  expand_simple_binop (mode, MINUS, tmp1, tmp0,
+   op0, 0, OPTAB_DIRECT);
+ break;
+
+  /* For 16-bit signed integer X, the best way to calculate the absolute
+ value of X is max (X, -X), as SSE2 provides the PMAXSW insn.  */
+  case V8HImode:
+ tmp0 = expand_unop (mode, neg_optab, op1, NULL_RTX, 0);
+ if (tmp0)
+  expand_simple_binop (mode, SMAX, op1, tmp0, op0, 0,
+   OPTAB_DIRECT);
+ break;
+
+  /* For 8-bit signed integer X, the best way to calculate the absolute
+ value of X is min ((unsigned char) X, (unsigned char) (-X)),
+ as SSE2 provides the PMINUB insn.  */
+  case V16QImode:
+ tmp0 = expand_unop (mode, neg_optab, op1, NULL_RTX, 0);
+ if (tmp0)
+  expand_simple_binop (V16QImode, UMIN, op1, tmp0, op0, 0,
+   OPTAB_DIRECT);
+ break;
+
+  default:
+ break;
+}
+}
+
 /* Expand an insert into a vector register through pinsr insn.
Return true if successful.  */

diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index c3f6c94..46e1df4 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -8721,7 +8721,7 @@
(set (attr "prefix_rex") (symbol_ref "x86_extended_reg_mentioned_p (insn)"))
(set_attr "mode" "DI")])

-(define_insn "abs<mode>2"
+(define_insn "*abs<mode>2"
   [(set (match_operand:VI124_AVX2_48_AVX512F 0 "register_operand" "=v")
  (abs:VI124_AVX2_48_AVX512F
   (match_operand:VI124_AVX2_48_AVX5

Re: [PATCH] Vectorizing abs(char/short/int) on x86.

2013-10-30 Thread Cong Hou
Also, as the current expands for abs() on 8/16-bit integers are not
used at all, should I comment them out temporarily for now? Later I
can uncomment them once I have finished the pattern recognizer.



thanks,
Cong


On Wed, Oct 30, 2013 at 10:22 AM, Uros Bizjak  wrote:
> On Wed, Oct 30, 2013 at 6:01 PM, Cong Hou  wrote:
>> I found my problem: I put DONE outside of the if, not inside. You
>> are right. I have updated my patch.
>
> OK, great that we put things in order ;)
>
> Does this patch need some extra middle-end functionality? I was not
> able to vectorize char and short part of your patch.
>
> Regarding the testcase - please put it to gcc.target/i386/ directory.
> There is nothing generic in the test, as confirmed by target-dependent
> scan test. You will find plenty of examples in the mentioned
> directory. I'd suggest to split the testcase in three files, and to
> simplify it to something like the testcase with global variables I
> used earlier.
>
> Modulo testcase, the patch is OK otherwise, but middle-end parts
> should be committed first.
>
> Thanks,
> Uros.


Re: [PATCH] Vectorizing abs(char/short/int) on x86.

2013-10-30 Thread Cong Hou
I have run check_GNU_style.sh on my patch.

The patch is submitted. Thank you for your comments and help on this patch!



thanks,
Cong


On Wed, Oct 30, 2013 at 11:13 AM, Uros Bizjak  wrote:
> On Wed, Oct 30, 2013 at 7:01 PM, Cong Hou  wrote:
>
>>>> I found my problem: I put DONE outside of the if, not inside. You
>>>> are right. I have updated my patch.
>>>
>>> OK, great that we put things in order ;)
>>>
>>> Does this patch need some extra middle-end functionality? I was not
>>> able to vectorize char and short part of your patch.
>>
>>
>> In the original patch, I converted abs() on short and char values to
>> their own types by removing type casts. That is, originally char_val1
>> = abs(char_val2) will be converted to char_val1 = (char) abs((int)
>> char_val2) in the frontend, and I would like to convert it back to
>> char_val1 = abs(char_val2). But after several discussions, it seems
>> this conversion has some problems such as overflow concerns, and I
>> thereby removed that part.
>>
>> Now you should still be able to vectorize abs(char) and abs(short), but
>> with packing and unpacking. Later I will consider writing a pattern
>> recognizer for abs(char) and abs(short), and then the expands for
>> abs(char)/abs(short) in this patch will be used during vectorization.
>
> OK, this seems reasonable. We already have "unused" SSSE3 8/16 bit abs
> pattern, so I think we can commit SSE2 expanders, even if they will be
> unused for now. The proposed recognizer will benefit SSE2 as well as
> existing SSSE3 patterns.
>
>>> Regarding the testcase - please put it to gcc.target/i386/ directory.
>>> There is nothing generic in the test, as confirmed by target-dependent
>>> scan test. You will find plenty of examples in the mentioned
>>> directory. I'd suggest to split the testcase in three files, and to
>>> simplify it to something like the testcase with global variables I
>>> used earlier.
>>
>>
>> I have done it. The test case is split into three for s8/s16/s32 in
>> gcc.target/i386.
>
> OK.
>
> The patch is OK for mainline, but please check formatting and
> whitespace before the patch is committed.
>
> Thanks,
> Uros.


Re: [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.

2013-10-30 Thread Cong Hou
On Wed, Oct 30, 2013 at 4:27 AM, Richard Biener  wrote:
> On Tue, 29 Oct 2013, Cong Hou wrote:
>
>> Hi
>>
>> SAD (Sum of Absolute Differences) is a common and important algorithm
>> in image processing and other areas. SSE2 even introduced a new
>> instruction PSADBW for it. A SAD loop can be greatly accelerated by
>> this instruction after being vectorized. This patch introduced a new
>> operation SAD_EXPR and a SAD pattern recognizer in vectorizer.
>>
>> The pattern of SAD is shown below:
>>
>>  unsigned type x_t, y_t;
>>  signed TYPE1 diff, abs_diff;
>>  TYPE2 sum = init;
>>loop:
>>  sum_0 = phi <init, sum_1>
>>  S1  x_t = ...
>>  S2  y_t = ...
>>  S3  x_T = (TYPE1) x_t;
>>  S4  y_T = (TYPE1) y_t;
>>  S5  diff = x_T - y_T;
>>  S6  abs_diff = ABS_EXPR <diff>;
>>  [S7  abs_diff = (TYPE2) abs_diff;  #optional]
>>  S8  sum_1 = abs_diff + sum_0;
>>
>>where 'TYPE1' is at least double the size of type 'type', and 'TYPE2' is 
>> the
>>same size of 'TYPE1' or bigger. This is a special case of a reduction
>>computation.
>>
>> For SSE2, type is char, and TYPE1 and TYPE2 are int.
>>
>>
>> In order to express this new operation, a new expression SAD_EXPR is
>> introduced in tree.def, and the corresponding entry in optabs is
>> added. The patch also added the "define_expand" for SSE2 and AVX2
>> platforms for i386.
>>
>> The patch is pasted below and also attached as a text file (in which
>> you can see tabs). Bootstrap and make check got passed on x86. Please
>> give me your comments.
>
> Apart from the testcase comment made earlier
>
> +++ b/gcc/tree-cfg.c
> @@ -3797,6 +3797,7 @@ verify_gimple_assign_ternary (gimple stmt)
>return false;
>
>  case DOT_PROD_EXPR:
> +case SAD_EXPR:
>  case REALIGN_LOAD_EXPR:
>/* FIXME.  */
>return false;
>
> please add proper verification of the operand types.

OK.

>
> +/* Widening sad (sum of absolute differences).
> +   The first two arguments are of type t1 which should be unsigned
> integer.
> +   The third argument and the result are of type t2, such that t2 is at
> least
> +   twice the size of t1. SAD_EXPR(arg1,arg2,arg3) is equivalent to:
> +   tmp1 = WIDEN_MINUS_EXPR (arg1, arg2);
> +   tmp2 = ABS_EXPR (tmp1);
> +   arg3 = PLUS_EXPR (tmp2, arg3);   */
> +DEFTREECODE (SAD_EXPR, "sad_expr", tcc_expression, 3)
>
> WIDEN_MINUS_EXPR doesn't exist so you have to explain on its
> operation (it returns a signed wide difference?).  Why should
> the first two arguments be unsigned?  I cannot see a good reason
> to require that (other than that maybe the x86 target only has
> support for widened unsigned difference?).  So if you want to
> make that restriction maybe change the name to SADU_EXPR
> (sum of absolute differences of unsigned)?
>
> I suppose you tried introducing WIDEN_MINUS_EXPR instead and
> letting combine do it's work, avoiding the very special optab?

I may have used the wrong representation here. I think the behavior
of "WIDEN_MINUS_EXPR" in SAD is different from the general one. SAD
usually works on unsigned integers (see
http://en.wikipedia.org/wiki/Sum_of_absolute_differences), and before
the difference between two unsigned integers is taken, they are
promoted to wider signed integers. And the result of (int)(char)(1) -
(int)(char)(-1) is different from (int)(unsigned char)(1) -
(int)(unsigned char)(-1). So we cannot implement SAD using
WIDEN_MINUS_EXPR.

Also, the SSE2 instruction PSADBW requires the operands to be
unsigned 8-bit integers.
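
As a quick illustration of the signedness point (my own example, not
from the patch):

#include <stdio.h>

int
main (void)
{
  signed char sa = 1, sb = -1;
  unsigned char ua = 1, ub = (unsigned char) -1;  /* 255 */

  /* Signed promotion: 1 - (-1) = 2.  */
  printf ("%d\n", (int) sa - (int) sb);
  /* Unsigned promotion: 1 - 255 = -254.  */
  printf ("%d\n", (int) ua - (int) ub);
  return 0;
}

The two absolute differences (2 vs. 254) disagree, which is why the
operands must be unsigned.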

I will remove the improper description as you pointed out.



thanks,
Cong


>
> Thanks,
> Richard.
>
>>
>>
>> thanks,
>> Cong
>>
>>
>>
>> diff --git a/gcc/ChangeLog b/gcc/ChangeLog
>> index 8a38316..d528307 100644
>> --- a/gcc/ChangeLog
>> +++ b/gcc/ChangeLog
>> @@ -1,3 +1,23 @@
>> +2013-10-29  Cong Hou  
>> +
>> + * tree-vect-patterns.c (vect_recog_sad_pattern): New function for SAD
>> + pattern recognition.
>> + (type_conversion_p): PROMOTION is true if it's a type promotion
>> + conversion, and false otherwise.  Return true if the given expression
>> + is a type conversion one.
>> + * tree-vectorizer.h: Adjust the number of patterns.
>> + * tree.def: Add SAD_EXPR.
>> + * optabs.def: Add sad_optab.
>> + * cfgexpand.c (expand_debug_expr): Add SAD_EXPR case.
>> + * expr.c (expand_expr_real_2): Likewise.
>> + * gimple-pretty-print.c (dump_

Re: [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.

2013-10-30 Thread Cong Hou
On Tue, Oct 29, 2013 at 4:49 PM, Ramana Radhakrishnan
 wrote:
> Cong,
>
> Please don't do the following.
>
>>+++ b/gcc/testsuite/gcc.dg/vect/
> vect-reduc-sad.c
> @@ -0,0 +1,54 @@
> +/* { dg-require-effective-target sse2 { target { i?86-*-* x86_64-*-* } } } */
>
> you are adding a test to gcc.dg/vect - It's a common directory
> containing tests that need to run on multiple architectures and such
> tests should be keyed by the feature they enable which can be turned
> on for ports that have such an instruction.
>
> The correct way of doing this is to key this on the feature something
> like dg-require-effective-target vect_sad_char . And define the
> equivalent routine in testsuite/lib/target-supports.exp and enable it
> for sse2 for the x86 port. If in doubt look at
> check_effective_target_vect_int and a whole family of such functions
> in testsuite/lib/target-supports.exp
>
> This makes life easy for other port maintainers who want to turn on
> this support. And for bonus points please update the testcase writing
> wiki page with this information if it isn't already there.
>

OK, I will likely move the test case to gcc.target/i386, as currently
only SSE2 provides a SAD instruction. But your suggestion also helps!
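
For reference, the kind of loop the recognizer targets looks roughly
like this (my own sketch, not the final committed testcase):

#include <stdlib.h>

#define N 64

unsigned char x[N], y[N];

int
sad (void)
{
  int i, sum = 0;
  for (i = 0; i < N; ++i)
    /* x[i] and y[i] are promoted to int before the subtraction.  */
    sum += abs (x[i] - y[i]);
  return sum;
}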


> You are also missing documentation updates for SAD_EXPR, md.texi for
> the new standard pattern name. Shouldn't it be called sad<mode>4
> really ?
>


I will add the documentation for the new operation SAD_EXPR.

I named the sad optab entry by just following udot_prod, as those two
operations are quite similar:

 OPTAB_D (udot_prod_optab, "udot_prod$I$a")


thanks,
Cong


>
> regards
> Ramana
>
>
>
>
>
> On Tue, Oct 29, 2013 at 10:23 PM, Cong Hou  wrote:
>> Hi
>>
>> SAD (Sum of Absolute Differences) is a common and important algorithm
>> in image processing and other areas. SSE2 even introduced a new
>> instruction PSADBW for it. A SAD loop can be greatly accelerated by
>> this instruction after being vectorized. This patch introduced a new
>> operation SAD_EXPR and a SAD pattern recognizer in vectorizer.
>>
>> The pattern of SAD is shown below:
>>
>>  unsigned type x_t, y_t;
>>  signed TYPE1 diff, abs_diff;
>>  TYPE2 sum = init;
>>loop:
>>  sum_0 = phi <init, sum_1>
>>  S1  x_t = ...
>>  S2  y_t = ...
>>  S3  x_T = (TYPE1) x_t;
>>  S4  y_T = (TYPE1) y_t;
>>  S5  diff = x_T - y_T;
>>  S6  abs_diff = ABS_EXPR <diff>;
>>  [S7  abs_diff = (TYPE2) abs_diff;  #optional]
>>  S8  sum_1 = abs_diff + sum_0;
>>
>>where 'TYPE1' is at least double the size of type 'type', and 'TYPE2' is 
>> the
>>same size of 'TYPE1' or bigger. This is a special case of a reduction
>>computation.
>>
>> For SSE2, type is char, and TYPE1 and TYPE2 are int.
>>
>>
>> In order to express this new operation, a new expression SAD_EXPR is
>> introduced in tree.def, and the corresponding entry in optabs is
>> added. The patch also added the "define_expand" for SSE2 and AVX2
>> platforms for i386.
>>
>> The patch is pasted below and also attached as a text file (in which
>> you can see tabs). Bootstrap and make check got passed on x86. Please
>> give me your comments.
>>
>>
>>
>> thanks,
>> Cong
>>
>>
>>
>> diff --git a/gcc/ChangeLog b/gcc/ChangeLog
>> index 8a38316..d528307 100644
>> --- a/gcc/ChangeLog
>> +++ b/gcc/ChangeLog
>> @@ -1,3 +1,23 @@
>> +2013-10-29  Cong Hou  
>> +
>> + * tree-vect-patterns.c (vect_recog_sad_pattern): New function for SAD
>> + pattern recognition.
>> + (type_conversion_p): PROMOTION is true if it's a type promotion
>> + conversion, and false otherwise.  Return true if the given expression
>> + is a type conversion one.
>> + * tree-vectorizer.h: Adjust the number of patterns.
>> + * tree.def: Add SAD_EXPR.
>> + * optabs.def: Add sad_optab.
>> + * cfgexpand.c (expand_debug_expr): Add SAD_EXPR case.
>> + * expr.c (expand_expr_real_2): Likewise.
>> + * gimple-pretty-print.c (dump_ternary_rhs): Likewise.
>> + * gimple.c (get_gimple_rhs_num_ops): Likewise.
>> + * optabs.c (optab_for_tree_code): Likewise.
>> + * tree-cfg.c (estimate_operator_cost): Likewise.
>> + * tree-ssa-operands.c (get_expr_operands): Likewise.
>> + * tree-vect-loop.c (get_initial_def_for_reduction): Likewise.
>> + * config/i386/sse.md: Add SSE2 and AVX2 expand for SAD.
>> +
>>  2013-10-14  David Malcolm  
>>
>>   * dumpfile.h (gcc::dump_manager

Re: [PATCH] Vectorizing abs(char/short/int) on x86.

2013-10-31 Thread Cong Hou
This update makes it safer. You also showed me how to write better
expand code. Thank you for the improvement!



thanks,
Cong


On Thu, Oct 31, 2013 at 11:43 AM, Uros Bizjak  wrote:
> On Wed, Oct 30, 2013 at 9:02 PM, Cong Hou  wrote:
>> I have run check_GNU_style.sh on my patch.
>>
>> The patch is submitted. Thank you for your comments and help on this patch!
>
> I have committed a couple of fixes/improvements to your expander in
> i386.c. There is no need to check for the result of
> expand_simple_binop. Also, there is no guarantee that
> expand_simple_binop will expand to the target. It can return different
> RTX. Also, unhandled modes are now marked with gcc_unreachable.
>
> 2013-10-31  Uros Bizjak  
>
> * config/i386/i386.c (ix86_expand_sse2_abs): Rename function arguments.
> Use gcc_unreachable for unhandled modes.  Do not check results of
> expand_simple_binop.  If not expanded to target, move the result.
>
> Tested on x86_64-pc-linux-gnu and committed.
>
> Uros.


[PATCH] Handling == or != comparisons that may affect range test optimization.

2013-10-31 Thread Cong Hou
(This patch is for the bug 58728:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58728)

As in the bug report, consider the following loop:

int foo(unsigned int n)
{
  if (n != 0)
  if (n != 1)
  if (n != 2)
  if (n != 3)
  if (n != 4)
return ++n;
  return n;
}

The range test optimization should be able to merge all those five
conditions into one in the reassoc pass, but it fails to do so. The
reason is that the phi arg of n is replaced by the constant it is
compared to in case of == or != comparisons (in the vrp pass). GCC
checks that there is no side effect on n between any two neighboring
conditions by examining whether they define the same phi arg in the
join node. But as the phi arg is replaced by a constant, the check
fails.
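
For reference, the merged form one would expect from the range test
optimization is a single range check, roughly like this (a sketch of
mine; the actual GIMPLE will differ):

int foo_merged (unsigned int n)
{
  /* n != 0 && n != 1 && n != 2 && n != 3 && n != 4
     is equivalent to n > 4 for unsigned n.  */
  if (n > 4)
    return n + 1;
  return n;
}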

This patch deals with this situation by taking == and != comparisons
into account; it is pasted below (a text file is also attached with
proper tabs). Bootstrap and make check both pass.

Any comment?


thanks,
Cong




diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 8a38316..9247222 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,11 @@
+2013-10-31  Cong Hou  
+
+ PR tree-optimization/58728
+ * tree-ssa-reassoc.c (suitable_cond_bb): Consider the situation
+ that ==/!= comparisons between a variable and a constant may lead
+ to the phi arg of the variable being substituted by the constant
+ in prior passes, during range test optimization.
+
 2013-10-14  David Malcolm  

  * dumpfile.h (gcc::dump_manager): New class, to hold state
diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
index 075d071..44a5e70 100644
--- a/gcc/testsuite/ChangeLog
+++ b/gcc/testsuite/ChangeLog
@@ -1,3 +1,8 @@
+2013-10-31  Cong Hou  
+
+ PR tree-optimization/58728
+ * gcc.dg/tree-ssa/pr58728.c: New test.
+
 2013-10-14  Tobias Burnus  

  PR fortran/58658
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr58728.c
b/gcc/testsuite/gcc.dg/tree-ssa/pr58728.c
new file mode 100644
index 000..312aebc
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr58728.c
@@ -0,0 +1,25 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-reassoc1-details" } */
+
+int foo (unsigned int n)
+{
+  if (n != 0)
+if (n != 1)
+  return ++n;
+  return n;
+}
+
+int bar (unsigned int n)
+{
+  if (n == 0)
+;
+  else if (n == 1)
+;
+  else
+return ++n;
+  return n;
+}
+
+
+/* { dg-final { scan-tree-dump-times "Optimizing range tests" 2
"reassoc1" } } */
+/* { dg-final { cleanup-tree-dump "reassoc1" } } */
diff --git a/gcc/tree-ssa-reassoc.c b/gcc/tree-ssa-reassoc.c
index 6859518..bccf99f 100644
--- a/gcc/tree-ssa-reassoc.c
+++ b/gcc/tree-ssa-reassoc.c
@@ -2426,11 +2426,70 @@ suitable_cond_bb (basic_block bb, basic_block
test_bb, basic_block *other_bb,
   for (gsi = gsi_start_phis (e->dest); !gsi_end_p (gsi); gsi_next (&gsi))
 {
   gimple phi = gsi_stmt (gsi);
+  tree phi_arg = gimple_phi_arg_def (phi, e->dest_idx);
+  tree phi_arg2 = gimple_phi_arg_def (phi, e2->dest_idx);
+
   /* If both BB and TEST_BB end with GIMPLE_COND, all PHI arguments
  corresponding to BB and TEST_BB predecessor must be the same.  */
-  if (!operand_equal_p (gimple_phi_arg_def (phi, e->dest_idx),
-gimple_phi_arg_def (phi, e2->dest_idx), 0))
- {
+  if (!operand_equal_p (phi_arg, phi_arg2, 0))
+  {
+ /* If the condition in BB or TEST_BB is an NE or EQ comparison like
+   if (n != N) or if (n == N), it is possible that the corresponding
+   def of n in the phi function is replaced by N.  We should still allow
+   range test optimization in this case.  */
+
+ tree lhs = NULL, rhs = NULL,
+ lhs2 = NULL, rhs2 = NULL;
+ bool is_eq_expr = is_cond && (gimple_cond_code (stmt) == NE_EXPR
+ || gimple_cond_code (stmt) == EQ_EXPR)
+  && TREE_CODE (phi_arg) == INTEGER_CST;
+
+ if (is_eq_expr)
+  {
+lhs = gimple_cond_lhs (stmt);
+rhs = gimple_cond_rhs (stmt);
+
+if (operand_equal_p (lhs, phi_arg, 0))
+  {
+ tree t = lhs;
+ lhs = rhs;
+ rhs = t;
+  }
+if (operand_equal_p (rhs, phi_arg, 0)
+ && operand_equal_p (lhs, phi_arg2, 0))
+  continue;
+  }
+
+ gimple stmt2 = last_stmt (test_bb);
+ bool is_eq_expr2 = gimple_code (stmt2) == GIMPLE_COND
+ && (gimple_cond_code (stmt2) == NE_EXPR
+ || gimple_cond_code (stmt2) == EQ_EXPR)
+ && TREE_CODE (phi_arg2) == INTEGER_CST;
+
+ if (is_eq_expr2)
+  {
+lhs2 = gimple_cond_lhs (stmt2);
+rhs2 = gimple_cond_rhs (stmt2);
+
+if (operand_equal_p (lhs2, phi_arg2, 0))
+  {
+ tree t = lhs2;
+ lhs2 = rhs2;
+ rhs2 = t;
+  }
+if (operand_equal_p (rhs2, phi_arg2, 0)
+ && operand_equal_p (lhs2, phi_arg, 0))
+  continue;
+  }
+
+ if (is_eq_expr && is_eq_expr2)
+  {
+if (operand_equal_p (rhs, phi_arg, 0)
+ && operand_equal_p (rhs2, phi_arg2, 0)
+ && operand_equal_p (lhs, lhs2, 0))
+  continue;
+  }
+
   /* Otherwise, if one of the blocks doesn't end wit

[PATCH] Small fix: add { dg-require-effective-target vect_int } to testsuite/gcc.dg/vect/pr58508.c

2013-11-01 Thread Cong Hou
It seems that on some platforms the loops in
testsuite/gcc.dg/vect/pr58508.c cannot be vectorized. This small patch
adds { dg-require-effective-target vect_int } to make sure the test
only runs on targets where all of its loops can be vectorized.


thanks,
Cong


diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
index 9d0f4a5..3d9916d 100644
--- a/gcc/testsuite/ChangeLog
+++ b/gcc/testsuite/ChangeLog
@@ -1,3 +1,7 @@
+2013-10-29  Cong Hou  
+
+   * gcc.dg/vect/pr58508.c: Update.
+
 2013-10-15  Cong Hou  

* gcc.dg/vect/pr58508.c: New test.
diff --git a/gcc/testsuite/gcc.dg/vect/pr58508.c
b/gcc/testsuite/gcc.dg/vect/pr58508.c
index 6484a65..fff7a04 100644
--- a/gcc/testsuite/gcc.dg/vect/pr58508.c
+++ b/gcc/testsuite/gcc.dg/vect/pr58508.c
@@ -1,3 +1,4 @@
+/* { dg-require-effective-target vect_int } */
 /* { dg-do compile } */
 /* { dg-options "-O2 -ftree-vectorize -fdump-tree-vect-details" } */


Re: [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.

2013-11-04 Thread Cong Hou
On Mon, Nov 4, 2013 at 2:06 AM, James Greenhalgh
 wrote:
> On Fri, Nov 01, 2013 at 04:48:53PM +0000, Cong Hou wrote:
>> diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
>> index 2a5a2e1..8f5d39a 100644
>> --- a/gcc/doc/md.texi
>> +++ b/gcc/doc/md.texi
>> @@ -4705,6 +4705,16 @@ wider mode, is computed and added to operand 3.
>> Operand 3 is of a mode equal or
>>  wider than the mode of the product. The result is placed in operand 0, which
>>  is of the same mode as operand 3.
>>
>> +@cindex @code{ssad@var{m}} instruction pattern
>> +@item @samp{ssad@var{m}}
>> +@cindex @code{usad@var{m}} instruction pattern
>> +@item @samp{usad@var{m}}
>> +Compute the sum of absolute differences of two signed/unsigned elements.
>> +Operand 1 and operand 2 are of the same mode. Their absolute difference, 
>> which
>> +is of a wider mode, is computed and added to operand 3. Operand 3 is of a 
>> mode
>> +equal or wider than the mode of the absolute difference. The result is 
>> placed
>> +in operand 0, which is of the same mode as operand 3.
>> +
>>  @cindex @code{ssum_widen@var{m3}} instruction pattern
>>  @item @samp{ssum_widen@var{m3}}
>>  @cindex @code{usum_widen@var{m3}} instruction pattern
>> diff --git a/gcc/expr.c b/gcc/expr.c
>> index 4975a64..1db8a49 100644
>
> I'm not sure I follow, and if I do - I don't think it matches what
> you have implemented for i386.
>
> From your text description I would guess the series of operations to be:
>
>   v1 = widen (operands[1])
>   v2 = widen (operands[2])
>   v3 = abs (v1 - v2)
>   operands[0] = v3 + operands[3]
>
> But if I understand the behaviour of PSADBW correctly, what you have
> actually implemented is:
>
>   v1 = widen (operands[1])
>   v2 = widen (operands[2])
>   v3 = abs (v1 - v2)
>   v4 = reduce_plus (v3)
>   operands[0] = v4 + operands[3]
>
> To my mind, synthesizing the reduce_plus step will be wasteful for targets
> who do not get this for free with their Absolute Difference step. Imagine a
> simple loop where we have synthesized the reduce_plus, we compute partial
> sums each loop iteration, though we would be better to leave the reduce_plus
> step until after the loop. "REDUC_PLUS_EXPR" would be the appropriate
> Tree code for this.

What do you mean when you use "synthesizing" here? For each pattern,
the only synthesized operation is the one being returned from the
pattern recognizer. In this case, it is USAD_EXPR. Recognizing the
reduction sum is necessary, as we need the corresponding prolog and
epilog for reductions, and that is already done before pattern
recognition. Note
that reduction is not a pattern but is a type of vector definition. A
vectorization pattern can still be a reduction operation as long as
STMT_VINFO_RELATED_STMT of this pattern is a reduction operation. You
can check the other two reduction patterns: widen_sum_pattern and
dot_prod_pattern for reference.
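
For anyone unfamiliar with the instruction, here is a scalar model of
what one SSE2 PSADBW does on 16 bytes (my own sketch, not code from
the patch):

#include <stdint.h>
#include <stdlib.h>

void
psadbw_model (const uint8_t a[16], const uint8_t b[16], uint64_t out[2])
{
  int half, i;
  for (half = 0; half < 2; ++half)
    {
      uint64_t sum = 0;
      /* Each 8-byte half yields one small sum in a 64-bit lane.  */
      for (i = 0; i < 8; ++i)
        sum += abs ((int) a[half * 8 + i] - (int) b[half * 8 + i]);
      out[half] = sum;
    }
}

The per-lane partial sums are then combined by the usual reduction
epilog after the loop.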

Thank you for your comment!


Cong

>
> I would prefer to see this Tree code not imply the reduce_plus.
>
> Thanks,
> James
>


Re: [PATCH] Fixing PR59006 and PR58921 by delaying loop invariant hoisting in vectorizer.

2013-11-27 Thread Cong Hou
On Wed, Nov 27, 2013 at 1:53 AM, Richard Biener  wrote:
> On Fri, 22 Nov 2013, Cong Hou wrote:
>
>> Hi
>>
>> Currently in GCC vectorization, some loop invariant may be detected
>> after aliasing checks, which can be hoisted outside of the loop. The
>> current method in GCC may break the information built during the
>> analysis phase, causing some crash (see PR59006 and PR58921).
>>
>> This patch improves the loop invariant hoisting by delaying it until
>> all statements are vectorized, thereby keeping all built information.
>> But those loop invariant statements won't be vectorized, and if a
>> variable is defined by one of those loop invariant, it is treated as
>> an external definition.
>>
>> Bootstrapped and testes on an x86-64 machine.
>
> Hmm.  I'm still thinking that we should handle this during the regular
> transform step.
>
> Like with the following incomplete patch.  Missing is adjusting
> the rest of the vectorizable_* functions to handle the case where all defs
> are dt_external or constant by setting their own STMT_VINFO_DEF_TYPE to
> dt_external.  From the gcc.dg/vect/pr58508.c we get only 4 hoists
> instead of 8 because of this (I think).
>
> Also gcc.dg/vect/pr52298.c ICEs for yet unanalyzed reason.
>
> I can take over the bug if you like.
>
> Thanks,
> Richard.
>
> Index: gcc/tree-vect-data-refs.c
> ===
> *** gcc/tree-vect-data-refs.c   (revision 205435)
> --- gcc/tree-vect-data-refs.c   (working copy)
> *** again:
> *** 3668,3673 
> --- 3668,3682 
> }
>   STMT_VINFO_STRIDE_LOAD_P (stmt_info) = true;
> }
> +   else if (loop_vinfo
> +  && integer_zerop (DR_STEP (dr)))
> +   {
> + /* All loads from a non-varying address will be disambiguated
> +by data-ref analysis or via a runtime alias check and thus
> +they will become invariant.  Force them to be vectorized
> +as external.  */
> + STMT_VINFO_DEF_TYPE (stmt_info) = vect_external_def;
> +   }
>   }
>
> /* If we stopped analysis at the first dataref we could not analyze


I agree that marking the statement that loads a data-ref with zero
step as vect_external_def early at this point is a good idea. It
avoids the two loop analyses seeing inconsistent def info, which could
happen if we did this later. Note that with this change the following
loop from PR59006 will not be vectorized:


int a[8], b;

void fn1(void) {
  int c;
  for (; b; b++) {
int d = a[b];
c = a[0] ? d : 0;
a[b] = c;
  }
}

This is because the load of a[0] is now treated as an external def,
in which case no vectype can be found for the condition of the
conditional expression, while vectorizable_condition requires that
comp_vectype be set properly. We can treat it as a missed
optimization.



> Index: gcc/tree-vect-loop-manip.c
> ===
> *** gcc/tree-vect-loop-manip.c  (revision 205435)
> --- gcc/tree-vect-loop-manip.c  (working copy)
> *** vect_loop_versioning (loop_vec_info loop
> *** 2269,2275 
>
> /* Extract load statements on memrefs with zero-stride accesses.  */
>
> !   if (LOOP_REQUIRES_VERSIONING_FOR_ALIAS (loop_vinfo))
>   {
> /* In the loop body, we iterate each statement to check if it is a 
> load.
>  Then we check the DR_STEP of the data reference.  If DR_STEP is zero,
> --- 2269,2275 
>
> /* Extract load statements on memrefs with zero-stride accesses.  */
>
> !   if (0 && LOOP_REQUIRES_VERSIONING_FOR_ALIAS (loop_vinfo))
>   {
> /* In the loop body, we iterate each statement to check if it is a 
> load.
>  Then we check the DR_STEP of the data reference.  If DR_STEP is zero,
> Index: gcc/tree-vect-loop.c
> ===
> *** gcc/tree-vect-loop.c(revision 205435)
> --- gcc/tree-vect-loop.c(working copy)
> *** vect_transform_loop (loop_vec_info loop_
> *** 5995,6000 
> --- 5995,6020 
> }
> }
>
> + /* If the stmt is loop invariant simply move it.  */
> + if (STMT_VINFO_DEF_TYPE (stmt_info) == vect_external_def)
> +   {
> + if (dump_enabled_p ())
> +   {
> + dump_printf_loc (MSG_NOTE, vect_location,
> +  "hoisting out of the vectorized loop: ");
> + dump_gimple_stmt (MSG_NOTE, TDF_SLIM, stmt, 0);
> + dump_printf (MSG_NOTE, "

Re: [PATCH] Support addsub/subadd as non-isomorphic operations for SLP vectorizer.

2013-12-02 Thread Cong Hou
Any comment on this patch?


thanks,
Cong


On Fri, Nov 22, 2013 at 11:40 AM, Cong Hou  wrote:
> On Fri, Nov 22, 2013 at 3:57 AM, Marc Glisse  wrote:
>> On Thu, 21 Nov 2013, Cong Hou wrote:
>>
>>> On Thu, Nov 21, 2013 at 4:39 PM, Marc Glisse  wrote:
>>>>
>>>> On Thu, 21 Nov 2013, Cong Hou wrote:
>>>>
>>>>> While I added the new define_insn_and_split for vec_merge, a bug is
>>>>> exposed: in config/i386/sse.md, [ define_expand "xop_vmfrcz2" ]
>>>>> only takes one input, but the corresponding builtin functions have two
>>>>> inputs, which are shown in i386.c:
>>>>>
>>>>>  { OPTION_MASK_ISA_XOP, CODE_FOR_xop_vmfrczv4sf2,
>>>>> "__builtin_ia32_vfrczss", IX86_BUILTIN_VFRCZSS, UNKNOWN,
>>>>> (int)MULTI_ARG_2_SF },
>>>>>  { OPTION_MASK_ISA_XOP, CODE_FOR_xop_vmfrczv2df2,
>>>>> "__builtin_ia32_vfrczsd", IX86_BUILTIN_VFRCZSD, UNKNOWN,
>>>>> (int)MULTI_ARG_2_DF },
>>>>>
>>>>> In consequence, the ix86_expand_multi_arg_builtin() function tries to
>>>>> check two args but based on the define_expand of xop_vmfrcz2,
>>>>> the content of insn_data[CODE_FOR_xop_vmfrczv4sf2].operand[2] may be
>>>>> incorrect (because it only needs one input).
>>>>>
>>>>> The patch below fixed this issue.
>>>>>
>>>>> Bootstrapped and tested on ax x86-64 machine. Note that this patch
>>>>> should be applied before the one I sent earlier (sorry for sending
>>>>> them in wrong order).
>>>>
>>>>
>>>>
>>>> This is PR 56788. Your patch seems strange to me and I don't think it
>>>> fixes the real issue, but I'll let more knowledgeable people answer.
>>>
>>>
>>>
>>> Thank you for pointing out the bug report. This patch is not intended
>>> to fix PR56788.
>>
>>
>> IMHO, if PR56788 was fixed, you wouldn't have this issue, and if PR56788
>> doesn't get fixed, I'll post a patch to remove _mm_frcz_sd and the
>> associated builtin, which would solve your issue as well.
>
>
> I agree. Then I will wait until your patch is merged to the trunk;
> otherwise my patch cannot pass the tests.
>
>
>>
>>
>>> For your function:
>>>
>>> #include 
>>> __m128d f(__m128d x, __m128d y){
>>>  return _mm_frcz_sd(x,y);
>>> }
>>>
>>> Note that the second parameter is ignored intentionally, but the
>>> prototype of this function contains two parameters. My fix is
>>> explicitly telling GCC that the optab xop_vmfrczv4sf3 should have
>>> three operands instead of two, to let it have the correct information
>>> in insn_data[CODE_FOR_xop_vmfrczv4sf3].operand[2] which is used to
>>> match the type of the second parameter in the builtin function in
>>> ix86_expand_multi_arg_builtin().
>>
>>
>> I disagree that this is intentional, it is a bug. AFAIK there is no AMD
>> documentation that could be used as a reference for what _mm_frcz_sd is
>> supposed to do. The only existing documentations are by Microsoft (which
>> does *not* ignore the second argument) and by LLVM (which has a single
>> argument). Whatever we chose for _mm_frcz_sd, the builtin should take a
>> single argument, and if necessary we'll use 2 builtins to implement
>> _mm_frcz_sd.
>>
>
>
> I have also only found the one by Microsoft. If the second argument is
> ignored, we could just remove it, as long as there is no "standard"
> that requires two arguments. Hopefully it won't break current projects
> using _mm_frcz_sd.
>
> Thank you for your comments!
>
>
> Cong
>
>
>> --
>> Marc Glisse


Re: [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.

2013-12-02 Thread Cong Hou
Hi Richard

Could you please take a look at this patch and see if it is ready for
the trunk? The patch is pasted as a text file here again.

Thank you very much!


Cong


On Mon, Nov 11, 2013 at 11:25 AM, Cong Hou  wrote:
> Hi James
>
> Sorry for the late reply.
>
>
> On Fri, Nov 8, 2013 at 2:55 AM, James Greenhalgh
>  wrote:
>>> On Tue, Nov 5, 2013 at 9:58 AM, Cong Hou  wrote:
>>> > Thank you for your detailed explanation.
>>> >
>>> > Once GCC detects a reduction operation, it will automatically
>>> > accumulate all elements in the vector after the loop. In the loop the
>>> > reduction variable is always a vector whose elements are reductions of
>>> > corresponding values from other vectors. Therefore in your case the
>>> > only instruction you need to generate is:
>>> >
>>> > VABAL   ops[3], ops[1], ops[2]
>>> >
>>> > It is OK if you accumulate the elements into one in the vector inside
>>> > of the loop (if one instruction can do this), but you have to make
>>> > sure other elements in the vector should remain zero so that the final
>>> > result is correct.
>>> >
>>> > If you are confused about the documentation, check the one for
>>> > udot_prod (just above usad in md.texi), as it has very similar
>>> > behavior as usad. Actually I copied the text from there and did some
>>> > changes. As those two instruction patterns are both for vectorization,
>>> > their behavior should not be difficult to explain.
>>> >
>>> > If you have more questions or think that the documentation is still
>>> > improper please let me know.
>>
>> Hi Cong,
>>
>> Thanks for your reply.
>>
>> I've looked at Dorit's original patch adding WIDEN_SUM_EXPR and
>> DOT_PROD_EXPR and I see that the same ambiguity exists for
>> DOT_PROD_EXPR. Can you please add a note in your tree.def
>> that SAD_EXPR, like DOT_PROD_EXPR can be expanded as either:
>>
>>   tmp = WIDEN_MINUS_EXPR (arg1, arg2)
>>   tmp2 = ABS_EXPR (tmp)
>>   arg3 = PLUS_EXPR (tmp2, arg3)
>>
>> or:
>>
>>   tmp = WIDEN_MINUS_EXPR (arg1, arg2)
>>   tmp2 = ABS_EXPR (tmp)
>>   arg3 = WIDEN_SUM_EXPR (tmp2, arg3)
>>
>> Where WIDEN_MINUS_EXPR is a signed MINUS_EXPR, returning a
>> a value of the same (widened) type as arg3.
>>
>
>
> I have added it, although we currently don't have WIDEN_MINUS_EXPR (I
> mentioned it in tree.def).
>
>
>> Also, while looking for the history of DOT_PROD_EXPR I spotted this
>> patch:
>>
>>   [autovect] [patch] detect mult-hi and sad patterns
>>   http://gcc.gnu.org/ml/gcc-patches/2005-10/msg01394.html
>>
>> I wonder what the reason was for that patch to be dropped?
>>
>
> It has been 8 years. I have no idea why that patch was never
> accepted; there is not even a reply in that thread. But I believe
> recognizing the SAD pattern is very important. ARM also provides
> instructions for it.
>
>
> Thank you for your comment again!
>
>
> thanks,
> Cong
>
>
>
>> Thanks,
>> James
>>
diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 6bdaa31..37ff6c4 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,4 +1,24 @@
-2013-11-01  Trevor Saunders  
+2013-10-29  Cong Hou  
+
+   * tree-vect-patterns.c (vect_recog_sad_pattern): New function for SAD
+   pattern recognition.
+   (type_conversion_p): PROMOTION is true if it's a type promotion
+   conversion, and false otherwise.  Return true if the given expression
+   is a type conversion one.
+   * tree-vectorizer.h: Adjust the number of patterns.
+   * tree.def: Add SAD_EXPR.
+   * optabs.def: Add sad_optab.
+   * cfgexpand.c (expand_debug_expr): Add SAD_EXPR case.
+   * expr.c (expand_expr_real_2): Likewise.
+   * gimple-pretty-print.c (dump_ternary_rhs): Likewise.
+   * gimple.c (get_gimple_rhs_num_ops): Likewise.
+   * optabs.c (optab_for_tree_code): Likewise.
+   * tree-cfg.c (estimate_operator_cost): Likewise.
+   * tree-ssa-operands.c (get_expr_operands): Likewise.
+   * tree-vect-loop.c (get_initial_def_for_reduction): Likewise.
+   * config/i386/sse.md: Add SSE2 and AVX2 expand for SAD.
+   * doc/generic.texi: Add document for SAD_EXPR.
+   * doc/md.texi: Add document for ssad and usad.
 
* function.c (reorder_blocks): Convert block_stack to a stack_vec.
* gimplify.c (gimplify_compound_lval): Likewise.
diff --git a/gcc/cfgexpand.c b/gcc/cfgexpand.c
index fb05ce7..1f824fb 100644
--- a/gcc/cfgexpand.c
+++ b/gcc/cfgex

[PATCH] Enhancing the widen-mult pattern in vectorization.

2013-12-03 Thread Cong Hou
Hi

The current widen-mult pattern only considers two operands with the
same size. However, operands with different sizes can also benefit
from this pattern. The following loop shows such an example:


char a[N];
short b[N];
int c[N];

for (int i = 0; i < N; ++i)
  c[i] = a[i] * b[i];


In this case, we can convert a[i] into short type then perform
widen-mult on b[i] and the converted value:


for (int i = 0; i < N; ++i) {
  short t = a[i];
  c[i] = t w* b[i];
}


This patch adds such support. In addition, the following loop fails to
be recognized as a widen-mult pattern because the widening operation
from char to int is not directly supported by the target:


char a[N], b[N];
int c[N];

for (int i = 0; i < N; ++i)
  c[i] = a[i] * b[i];


In this case, we can still perform widen-mult on a[i] and b[i], and
get a result of short type, then convert it to int:


char a[N], b[N];
int c[N];

for (int i = 0; i < N; ++i) {
  short t = a[i] w* b[i];
  c[i] = (int) t;
}


Currently GCC does not allow multi-step conversions for binary
widening operations. This patch removes that restriction for the
widen-mult pattern and uses VEC_UNPACK_LO_EXPR/VEC_UNPACK_HI_EXPR to
arrange the data after the widen-mult is performed. This can save
several unpacking instructions (for this example, the number of
packings/unpackings is reduced from 12 to 8; for SSE2, the inefficient
multiplication between two V4SI vectors can also be avoided).
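
As a quick sanity check of the intermediate conversion (my own test
program, not part of the patch): the product of two 8-bit values
always fits in a 16-bit type of matching signedness, so converting the
widen-mult result afterwards loses nothing:

#include <assert.h>

int
main (void)
{
  int a, b;
  for (a = 0; a < 256; ++a)
    for (b = 0; b < 256; ++b)
      {
        unsigned char ua = (unsigned char) a, ub = (unsigned char) b;
        unsigned short t = (unsigned short) (ua * ub);  /* widen-mult */
        assert ((unsigned int) t == (unsigned int) (ua * ub));
      }
  return 0;
}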

Bootstrapped and tested on an x86_64 machine.



thanks,
Cong



diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index f298c0b..44ed204 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,12 @@
+2013-12-02  Cong Hou  
+
+ * tree-vect-patterns.c (vect_recog_widen_mult_pattern): Enhance
+ the widen-mult pattern by handling two operands with different
+ sizes.
+ * tree-vect-stmts.c (vectorizable_conversion): Allow multi-steps
+ conversions after widening mult operation.
+ (supportable_widening_operation): Likewise.
+
 2013-11-22  Jakub Jelinek  

  PR sanitizer/59061
diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
index 12d2c90..611ae1c 100644
--- a/gcc/testsuite/ChangeLog
+++ b/gcc/testsuite/ChangeLog
@@ -1,3 +1,8 @@
+2013-12-02  Cong Hou  
+
+ * gcc.dg/vect/vect-widen-mult-u8-s16-s32.c: New test.
+ * gcc.dg/vect/vect-widen-mult-u8-u32.c: New test.
+
 2013-11-22  Jakub Jelinek  

  * c-c++-common/asan/no-redundant-instrumentation-7.c: Fix
diff --git a/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u8-s16-s32.c
b/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u8-s16-s32.c
new file mode 100644
index 000..9f9081b
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u8-s16-s32.c
@@ -0,0 +1,48 @@
+/* { dg-require-effective-target vect_int } */
+
+#include 
+#include "tree-vect.h"
+
+#define N 64
+
+unsigned char X[N] __attribute__ ((__aligned__(__BIGGEST_ALIGNMENT__)));
+short Y[N] __attribute__ ((__aligned__(__BIGGEST_ALIGNMENT__)));
+int result[N];
+
+/* unsigned char * short -> int widening-mult.  */
+__attribute__ ((noinline)) int
+foo1(int len) {
+  int i;
+
+  for (i=0; i
+#include "tree-vect.h"
+
+#define N 64
+
+unsigned char X[N] __attribute__ ((__aligned__(__BIGGEST_ALIGNMENT__)));
+unsigned char Y[N] __attribute__ ((__aligned__(__BIGGEST_ALIGNMENT__)));
+unsigned int result[N];
+
+/* unsigned char-> unsigned int widening-mult.  */
+__attribute__ ((noinline)) int
+foo1(int len) {
+  int i;
+
+  for (i=0; i *stmts,
 return NULL;
 }

+  /* If the two arguments have different sizes, convert the one with
+ the smaller type into the larger type.  */
+  if (TYPE_PRECISION (half_type0) != TYPE_PRECISION (half_type1))
+{
+  tree* oprnd = NULL;
+  gimple def_stmt = NULL;
+
+  if (TYPE_PRECISION (half_type0) < TYPE_PRECISION (half_type1))
+ {
+  def_stmt = def_stmt0;
+  half_type0 = half_type1;
+  oprnd = &oprnd0;
+ }
+  else
+ {
+  def_stmt = def_stmt1;
+  half_type1 = half_type0;
+  oprnd = &oprnd1;
+ }
+
+  if (STMT_VINFO_RELATED_STMT (vinfo_for_stmt (def_stmt)))
+ {
+  gimple new_stmt = STMT_VINFO_RELATED_STMT (vinfo_for_stmt (def_stmt));
+  /* Check if the already created pattern stmt is what we need.  */
+  if (!is_gimple_assign (new_stmt)
+  || gimple_assign_rhs_code (new_stmt) != NOP_EXPR
+  || TREE_TYPE (gimple_assign_lhs (new_stmt)) != half_type0)
+return NULL;
+
+  stmts->safe_push (def_stmt);
+  *oprnd = gimple_assign_lhs (new_stmt);
+ }
+  else
+ {
+  tree old_oprnd = gimple_assign_rhs1 (def_stmt);
+  tree new_oprnd = make_ssa_name (half_type0, NULL);
+  gimple new_stmt = gimple_build_assign_with_ops (NOP_EXPR, new_oprnd,
+  old_oprnd, NULL_TREE);
+  STMT_VINFO_RELATED_STMT (vinfo_for_stmt (def_stmt)) = new_stmt;
+  stmts->safe_push (def_stmt);
+  *oprnd = new_oprnd;
+ }
+}
+
   /* Handle unsigned case.  Look for
  S6  u_prod_T = (unsigned TYPE) prod_T;
  Use unsigned TYPE as the type for WIDEN_MULT_EXPR.  */
diff --git a/gcc/tree-vect-stmts.c b/gc

Re: [PATCH] Hoist loop invariant statements containing data refs with zero-step during loop-versioning in vectorization.

2013-12-05 Thread Cong Hou
Hi Richard

You mentioned that Micha has a patch pending that enables
vectorization of zero-step stores. What is the status of that patch? I
could not find it by searching for "Micha".
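
For context, the transform in question looks roughly like this after
loop versioning for alias (my own sketch of the generated code, not
the actual GIMPLE):

void
foo_versioned (int *a, int *b)
{
  int i;
  if (b + 1 <= a || a + 10 <= b)    /* runtime alias check */
    {
      int t = *b;                   /* hoisted invariant load */
      for (i = 0; i < 10; ++i)
        a[i] = t + 1;               /* vectorized loop */
    }
  else
    for (i = 0; i < 10; ++i)
      a[i] = *b + 1;                /* scalar fallback */
}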

Thank you!


Cong


On Wed, Oct 16, 2013 at 2:02 AM, Richard Biener  wrote:
> On Tue, 15 Oct 2013, Cong Hou wrote:
>
>> Thank you for your reminder, Jeff! I just noticed Richard's comment. I
>> have modified the patch according to that.
>>
>> The new patch is attached.
>
> (posting patches inline is easier for review, now you have to deal
> with no quoting markers ;))
>
> Comments inline.
>
> diff --git a/gcc/ChangeLog b/gcc/ChangeLog
> index 8a38316..2637309 100644
> --- a/gcc/ChangeLog
> +++ b/gcc/ChangeLog
> @@ -1,3 +1,8 @@
> +2013-10-15  Cong Hou  
> +
> +   * tree-vect-loop-manip.c (vect_loop_versioning): Hoist loop invariant
> +   statement that contains data refs with zero-step.
> +
>  2013-10-14  David Malcolm  
>
> * dumpfile.h (gcc::dump_manager): New class, to hold state
> diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
> index 075d071..9d0f4a5 100644
> --- a/gcc/testsuite/ChangeLog
> +++ b/gcc/testsuite/ChangeLog
> @@ -1,3 +1,7 @@
> +2013-10-15  Cong Hou  
> +
> +   * gcc.dg/vect/pr58508.c: New test.
> +
>  2013-10-14  Tobias Burnus  
>
> PR fortran/58658
> diff --git a/gcc/testsuite/gcc.dg/vect/pr58508.c 
> b/gcc/testsuite/gcc.dg/vect/pr58508.c
> new file mode 100644
> index 000..cb22b50
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/pr58508.c
> @@ -0,0 +1,20 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -ftree-vectorize -fdump-tree-vect-details" } */
> +
> +
> +/* The GCC vectorizer generates loop versioning for the following loop
> +   since there may exist aliasing between A and B.  The predicate checks
> +   if A may alias with B across all iterations.  Then for the loop in
> +   the true body, we can assert that *B is a loop invariant so that
> +   we can hoist the load of *B before the loop body.  */
> +
> +void foo (int* a, int* b)
> +{
> +  int i;
> +  for (i = 0; i < 10; ++i)
> +a[i] = *b + 1;
> +}
> +
> +
> +/* { dg-final { scan-tree-dump-times "hoist" 2 "vect" } } */
> +/* { dg-final { cleanup-tree-dump "vect" } } */
> diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
> index 574446a..f4fdec2 100644
> --- a/gcc/tree-vect-loop-manip.c
> +++ b/gcc/tree-vect-loop-manip.c
> @@ -2477,6 +2477,92 @@ vect_loop_versioning (loop_vec_info loop_vinfo,
>adjust_phi_and_debug_stmts (orig_phi, e, PHI_RESULT (new_phi));
>  }
>
>
> Note that applying this kind of transform at this point invalidates
> some of the earlier analysis the vectorizer performed (namely the
> def-kind which now effectively gets vect_external_def from
> vect_internal_def).  In this case it doesn't seem to cause any
> issues (we re-compute the def-kind everytime we need it (how wasteful)).
>
> +  /* Extract load and store statements on pointers with zero-stride
> + accesses.  */
> +  if (LOOP_REQUIRES_VERSIONING_FOR_ALIAS (loop_vinfo))
> +{
> +  /* In the loop body, we iterate each statement to check if it is a load
> +or store.  Then we check the DR_STEP of the data reference.  If
> +DR_STEP is zero, then we will hoist the load statement to the loop
> +preheader, and move the store statement to the loop exit.  */
>
> We don't move the store yet.  Micha has a patch pending that enables
> vectorization of zero-step stores.
>
> +  for (gimple_stmt_iterator si = gsi_start_bb (loop->header);
> +  !gsi_end_p (si);)
>
> While technically ok now (vectorized loops contain a single basic block)
> please use LOOP_VINFO_BBS () to get at the vector of basic-blcoks
> and iterate over them like other code does.
>
> +   {
> + gimple stmt = gsi_stmt (si);
> + stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
> + struct data_reference *dr = STMT_VINFO_DATA_REF (stmt_info);
> +
> + if (dr && integer_zerop (DR_STEP (dr)))
> +   {
> + if (DR_IS_READ (dr))
> +   {
> + if (dump_enabled_p ())
> +   {
> + dump_printf_loc
> + (MSG_NOTE, vect_location,
> +  "hoist the statement to outside of the loop ");
>
> "hoisting out of the vectorized loop: "
>
> + dump_gimple_stmt (MSG_NOTE, TDF_SLIM, stmt, 0);
> + dump_printf (MSG_NOTE, "\n");
> +  

Re: [PATCH] Enhancing the widen-mult pattern in vectorization.

2013-12-06 Thread Cong Hou
After further reviewing this patch, I found I don't have to change the
code in tree-vect-stmts.c to allow further type conversion after
widen-mult operation. Instead, I detect the following pattern in
vect_recog_widen_mult_pattern():

T1 a, b;
ai = (T2) a;
bi = (T2) b;
c = ai * bi;

where T2 is more than double the size of T1 (e.g. T1 is char and T2 is int).

In this case I just create a new type T3 whose size is double of the
size of T1, then get an intermediate result of type T3 from
widen-mult. Then I add a new statement to STMT_VINFO_PATTERN_DEF_SEQ
converting the result into type T2.

This strategy makes the patch cleaner.
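
Concretely, for T1 = unsigned char and T2 = int, the recognized form
is (a sketch in the same notation as above, with T3 = unsigned short):

unsigned char a, b;
unsigned short t;  /* T3: double the size of T1 */
int c;             /* T2 */

t = a w* b;        /* widen-mult producing the intermediate T3 result */
c = (int) t;       /* conversion stmt added to STMT_VINFO_PATTERN_DEF_SEQ */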

Bootstrapped and tested on an x86-64 machine.


thanks,
Cong


diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index f298c0b..12990b2 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,10 @@
+2013-12-02  Cong Hou  
+
+ * tree-vect-patterns.c (vect_recog_widen_mult_pattern): Enhance
+ the widen-mult pattern by handling two operands with different
+ sizes, and operands whose size is smaller than half of the result
+ type.
+
 2013-11-22  Jakub Jelinek  

  PR sanitizer/59061
diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
index 12d2c90..611ae1c 100644
--- a/gcc/testsuite/ChangeLog
+++ b/gcc/testsuite/ChangeLog
@@ -1,3 +1,8 @@
+2013-12-02  Cong Hou  
+
+ * gcc.dg/vect/vect-widen-mult-u8-s16-s32.c: New test.
+ * gcc.dg/vect/vect-widen-mult-u8-u32.c: New test.
+
 2013-11-22  Jakub Jelinek  

  * c-c++-common/asan/no-redundant-instrumentation-7.c: Fix
diff --git a/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u8-s16-s32.c
b/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u8-s16-s32.c
new file mode 100644
index 000..9f9081b
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u8-s16-s32.c
@@ -0,0 +1,48 @@
+/* { dg-require-effective-target vect_int } */
+
+#include 
+#include "tree-vect.h"
+
+#define N 64
+
+unsigned char X[N] __attribute__ ((__aligned__(__BIGGEST_ALIGNMENT__)));
+short Y[N] __attribute__ ((__aligned__(__BIGGEST_ALIGNMENT__)));
+int result[N];
+
+/* unsigned char * short -> int widening-mult.  */
+__attribute__ ((noinline)) int
+foo1(int len) {
+  int i;
+
+  for (i=0; i
+#include "tree-vect.h"
+
+#define N 64
+
+unsigned char X[N] __attribute__ ((__aligned__(__BIGGEST_ALIGNMENT__)));
+unsigned char Y[N] __attribute__ ((__aligned__(__BIGGEST_ALIGNMENT__)));
+unsigned int result[N];
+
+/* unsigned char-> unsigned int widening-mult.  */
+__attribute__ ((noinline)) int
+foo1(int len) {
+  int i;
+
+  for (i=0; i
+   If the result of WIDEN_MULT needs to be converted to a larger type, the
+   returned stmt will be this type conversion stmt.
 */

 static gimple
@@ -606,8 +610,8 @@ vect_recog_widen_mult_pattern (vec<gimple> *stmts,
   gimple def_stmt0, def_stmt1;
   tree oprnd0, oprnd1;
   tree type, half_type0, half_type1;
-  gimple pattern_stmt;
-  tree vectype, vectype_out = NULL_TREE;
+  gimple new_stmt = NULL, pattern_stmt = NULL;
+  tree vectype, vecitype;
   tree var;
   enum tree_code dummy_code;
   int dummy_int;
@@ -661,6 +665,33 @@ vect_recog_widen_mult_pattern (vec<gimple> *stmts,
 return NULL;
 }

+  /* If the two arguments have different sizes, convert the one with
+ the smaller type into the larger type.  */
+  if (TYPE_PRECISION (half_type0) != TYPE_PRECISION (half_type1))
+{
+  tree* oprnd = NULL;
+  gimple def_stmt = NULL;
+
+  if (TYPE_PRECISION (half_type0) < TYPE_PRECISION (half_type1))
+ {
+  def_stmt = def_stmt0;
+  half_type0 = half_type1;
+  oprnd = &oprnd0;
+ }
+  else
+ {
+  def_stmt = def_stmt1;
+  half_type1 = half_type0;
+  oprnd = &oprnd1;
+ }
+
+tree old_oprnd = gimple_assign_rhs1 (def_stmt);
+tree new_oprnd = make_ssa_name (half_type0, NULL);
+new_stmt = gimple_build_assign_with_ops (NOP_EXPR, new_oprnd,
+ old_oprnd, NULL_TREE);
+*oprnd = new_oprnd;
+}
+
   /* Handle unsigned case.  Look for
  S6  u_prod_T = (unsigned TYPE) prod_T;
  Use unsigned TYPE as the type for WIDEN_MULT_EXPR.  */
@@ -692,6 +723,15 @@ vect_recog_widen_mult_pattern (vec<gimple> *stmts,
   if (!types_compatible_p (half_type0, half_type1))
 return NULL;

+  /* If TYPE is more than twice larger than HALF_TYPE, we use WIDEN_MULT
+ to get an intermediate result of type ITYPE.  In this case we need
+ to build a statement to convert this intermediate result to type TYPE.  */
+  tree itype = type;
+  if (TYPE_PRECISION (type) > TYPE_PRECISION (half_type0) * 2)
+itype = build_nonstandard_integer_type
+  (GET_MODE_BITSIZE (TYPE_MODE (half_type0)) * 2,
+   TYPE_UNSIGNED (type));
+
   /* Pattern detected.  */
   if (dump_enabled_p ())
 dump_printf_loc (MSG_NOTE, vect_location,
@@ -699,23 +739,56 @@ vect_recog_widen_mult_pattern (vec<gimple> *stmts,

   /* Check target support  */
   vectype = get_vectype_for_scalar_type (half_type0);
-  vectype_out = ge

Re: [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.

2013-12-17 Thread Cong Hou
Ping?


thanks,
Cong


On Mon, Dec 2, 2013 at 5:06 PM, Cong Hou  wrote:
> Hi Richard
>
> Could you please take a look at this patch and see if it is ready for
> the trunk? The patch is pasted as a text file here again.
>
> Thank you very much!
>
>
> Cong
>
>
> On Mon, Nov 11, 2013 at 11:25 AM, Cong Hou  wrote:
>> Hi James
>>
>> Sorry for the late reply.
>>
>>
>> On Fri, Nov 8, 2013 at 2:55 AM, James Greenhalgh
>>  wrote:
>>>> On Tue, Nov 5, 2013 at 9:58 AM, Cong Hou  wrote:
>>>> > Thank you for your detailed explanation.
>>>> >
>>>> > Once GCC detects a reduction operation, it will automatically
>>>> > accumulate all elements in the vector after the loop. In the loop the
>>>> > reduction variable is always a vector whose elements are reductions of
>>>> > corresponding values from other vectors. Therefore in your case the
>>>> > only instruction you need to generate is:
>>>> >
>>>> > VABAL   ops[3], ops[1], ops[2]
>>>> >
>>>> > It is OK if you accumulate the elements into one in the vector inside
>>>> > of the loop (if one instruction can do this), but you have to make
>>>> > sure other elements in the vector should remain zero so that the final
>>>> > result is correct.
>>>> >
>>>> > If you are confused about the documentation, check the one for
>>>> > udot_prod (just above usad in md.texi), as it has very similar
>>>> > behavior as usad. Actually I copied the text from there and did some
>>>> > changes. As those two instruction patterns are both for vectorization,
>>>> > their behavior should not be difficult to explain.
>>>> >
>>>> > If you have more questions or think that the documentation is still
>>>> > improper please let me know.
>>>
>>> Hi Cong,
>>>
>>> Thanks for your reply.
>>>
>>> I've looked at Dorit's original patch adding WIDEN_SUM_EXPR and
>>> DOT_PROD_EXPR and I see that the same ambiguity exists for
>>> DOT_PROD_EXPR. Can you please add a note in your tree.def
>>> that SAD_EXPR, like DOT_PROD_EXPR can be expanded as either:
>>>
>>>   tmp = WIDEN_MINUS_EXPR (arg1, arg2)
>>>   tmp2 = ABS_EXPR (tmp)
>>>   arg3 = PLUS_EXPR (tmp2, arg3)
>>>
>>> or:
>>>
>>>   tmp = WIDEN_MINUS_EXPR (arg1, arg2)
>>>   tmp2 = ABS_EXPR (tmp)
>>>   arg3 = WIDEN_SUM_EXPR (tmp2, arg3)
>>>
>>> Where WIDEN_MINUS_EXPR is a signed MINUS_EXPR, returning a
>>> a value of the same (widened) type as arg3.
>>>
>>
>>
>> I have added it, although we currently don't have WIDEN_MINUS_EXPR (I
>> mentioned it in tree.def).
>>
>>
>>> Also, while looking for the history of DOT_PROD_EXPR I spotted this
>>> patch:
>>>
>>>   [autovect] [patch] detect mult-hi and sad patterns
>>>   http://gcc.gnu.org/ml/gcc-patches/2005-10/msg01394.html
>>>
>>> I wonder what the reason was for that patch to be dropped?
>>>
>>
>> It has been 8 years.. I have no idea why this patch is not accepted
>> finally. There is even no reply in that thread. But I believe the SAD
>> pattern is very important to be recognized. ARM also provides
>> instructions for it.
>>
>>
>> Thank you for your comment again!
>>
>>
>> thanks,
>> Cong
>>
>>
>>
>>> Thanks,
>>> James
>>>


Re: [PATCH] Support addsub/subadd as non-isomorphic operations for SLP vectorizer.

2013-12-17 Thread Cong Hou
Ping?


thanks,
Cong


On Mon, Dec 2, 2013 at 5:02 PM, Cong Hou  wrote:
> Any comment on this patch?
>
>
> thanks,
> Cong
>
>
> On Fri, Nov 22, 2013 at 11:40 AM, Cong Hou  wrote:
>> On Fri, Nov 22, 2013 at 3:57 AM, Marc Glisse  wrote:
>>> On Thu, 21 Nov 2013, Cong Hou wrote:
>>>
>>>> On Thu, Nov 21, 2013 at 4:39 PM, Marc Glisse  wrote:
>>>>>
>>>>> On Thu, 21 Nov 2013, Cong Hou wrote:
>>>>>
>>>>>> While I added the new define_insn_and_split for vec_merge, a bug is
>>>>>> exposed: in config/i386/sse.md, [ define_expand "xop_vmfrcz<mode>2" ]
>>>>>> only takes one input, but the corresponding builtin functions have two
>>>>>> inputs, which are shown in i386.c:
>>>>>>
>>>>>>  { OPTION_MASK_ISA_XOP, CODE_FOR_xop_vmfrczv4sf2,
>>>>>> "__builtin_ia32_vfrczss", IX86_BUILTIN_VFRCZSS, UNKNOWN,
>>>>>> (int)MULTI_ARG_2_SF },
>>>>>>  { OPTION_MASK_ISA_XOP, CODE_FOR_xop_vmfrczv2df2,
>>>>>> "__builtin_ia32_vfrczsd", IX86_BUILTIN_VFRCZSD, UNKNOWN,
>>>>>> (int)MULTI_ARG_2_DF },
>>>>>>
>>>>>> In consequence, the ix86_expand_multi_arg_builtin() function tries to
>>>>>> check two args but based on the define_expand of xop_vmfrcz<mode>2,
>>>>>> the content of insn_data[CODE_FOR_xop_vmfrczv4sf2].operand[2] may be
>>>>>> incorrect (because it only needs one input).
>>>>>>
>>>>>> The patch below fixed this issue.
>>>>>>
>>>>>> Bootstrapped and tested on ax x86-64 machine. Note that this patch
>>>>>> should be applied before the one I sent earlier (sorry for sending
>>>>>> them in wrong order).
>>>>>
>>>>>
>>>>>
>>>>> This is PR 56788. Your patch seems strange to me and I don't think it
>>>>> fixes the real issue, but I'll let more knowledgeable people answer.
>>>>
>>>>
>>>>
>>>> Thank you for pointing out the bug report. This patch is not intended
>>>> to fix PR56788.
>>>
>>>
>>> IMHO, if PR56788 was fixed, you wouldn't have this issue, and if PR56788
>>> doesn't get fixed, I'll post a patch to remove _mm_frcz_sd and the
>>> associated builtin, which would solve your issue as well.
>>
>>
>> I agree. Then I will wait until your patch is merged to the trunk,
>> otherwise my patch could not pass the test.
>>
>>
>>>
>>>
>>>> For your function:
>>>>
>>>> #include <x86intrin.h>
>>>> __m128d f(__m128d x, __m128d y){
>>>>  return _mm_frcz_sd(x,y);
>>>> }
>>>>
>>>> Note that the second parameter is ignored intentionally, but the
>>>> prototype of this function contains two parameters. My fix is
>>>> explicitly telling GCC that the optab xop_vmfrczv4sf3 should have
>>>> three operands instead of two, to let it have the correct information
>>>> in insn_data[CODE_FOR_xop_vmfrczv4sf3].operand[2] which is used to
>>>> match the type of the second parameter in the builtin function in
>>>> ix86_expand_multi_arg_builtin().
>>>
>>>
>>> I disagree that this is intentional, it is a bug. AFAIK there is no AMD
>>> documentation that could be used as a reference for what _mm_frcz_sd is
>>> supposed to do. The only existing documentations are by Microsoft (which
>>> does *not* ignore the second argument) and by LLVM (which has a single
>>> argument). Whatever we chose for _mm_frcz_sd, the builtin should take a
>>> single argument, and if necessary we'll use 2 builtins to implement
>>> _mm_frcz_sd.
>>>
>>
>>
>> I also only found the one by Microsoft.. If the second argument is
>> ignored, we could just remove it, as long as there is no "standard"
>> that requires two arguments. Hopefully it won't break current projects
>> using _mm_frcz_sd.
>>
>> Thank you for your comments!
>>
>>
>> Cong
>>
>>
>>> --
>>> Marc Glisse


Re: [PATCH] Fixing PR59006 and PR58921 by delaying loop invariant hoisting in vectorizer.

2014-01-13 Thread Cong Hou
I noticed that LIM could not hoist vector invariants, and that is why
my first implementation tried to hoist them all.

In addition, there are two disadvantages to the hoist-invariant-load
plus LIM method:

First, for some instructions the scalar version is faster than the
vector version, and in this case hoisting scalar instructions before
vectorization is better. Such instructions include data
packing/unpacking, integer multiplication with SSE2, etc.

Second, it may use more SIMD registers.

The following code shows a simple example:

char *a, *b, *c;
for (int i = 0; i < N; ++i)
  a[i] = b[0] * c[0] + a[i];

Vectorizing b[0]*c[0] is worse than loading the result of b[0]*c[0]
into a vector.
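
To make the trade-off concrete, here is a rough sketch of the two
orderings for the loop above (4-wide vectors and illustrative
pseudo-code, not actual GIMPLE):

  /* Hoist the scalar computation before vectorization:  */
  t = b[0] * c[0];           /* one scalar multiply                 */
  vt = {t, t, t, t};         /* one broadcast in the preheader      */
  for (...) a[i:i+3] += vt;  /* loop body uses a single live vector */

  /* Vectorize first and rely on LIM to hoist:  */
  vb = {b[0], b[0], b[0], b[0]};
  vc = {c[0], c[0], c[0], c[0]};
  vt = vb * vc;              /* vector multiply, two extra vectors  */
  for (...) a[i:i+3] += vt;

The first form needs one live vector register and a scalar multiply;
the second needs three vectors and a vector multiply, which can be
slower on SSE2.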


thanks,
Cong


On Mon, Jan 13, 2014 at 5:37 AM, Richard Biener  wrote:
> On Wed, 27 Nov 2013, Jakub Jelinek wrote:
>
>> On Wed, Nov 27, 2013 at 10:53:56AM +0100, Richard Biener wrote:
>> > Hmm.  I'm still thinking that we should handle this during the regular
>> > transform step.
>>
>> I wonder if it can't be done instead just in vectorizable_load,
>> if LOOP_REQUIRES_VERSIONING_FOR_ALIAS (loop_vinfo) and the load is
>> invariant, just emit the (broadcasted) load not inside of the loop, but on
>> the loop preheader edge.
>
> So this implements this suggestion, XFAILing the no longer handled cases.
> For example we get
>
>   _94 = *b_8(D);
>   vect_cst_.18_95 = {_94, _94, _94, _94};
>   _99 = prolog_loop_adjusted_niters.9_132 * 4;
>   vectp_a.22_98 = a_6(D) + _99;
>   ivtmp.43_77 = (unsigned long) vectp_a.22_98;
>
>   :
>   # ivtmp.41_67 = PHI 
>   # ivtmp.43_71 = PHI 
>   vect__10.19_97 = vect_cst_.18_95 + { 1, 1, 1, 1 };
>   _76 = (void *) ivtmp.43_71;
>   MEM[base: _76, offset: 0B] = vect__10.19_97;
>
> ...
>
> instead of having hoisted *b_8 + 1 as scalar computation.  Not sure
> why LIM doesn't hoist the vector variant later.
>
> vect__10.19_97 = vect_cst_.18_95 + vect_cst_.20_96;
>   invariant up to level 1, cost 1.
>
> ah, the cost thing.  Should be "improved" to see that hoisting
> reduces the number of live SSA names in the loop.
>
> Eventually lower_vector_ssa could optimize vector to scalar
> code again ... (ick).
>
> Bootstrap / regtest running on x86_64.
>
> Comments?
>
> Thanks,
> Richard.
>
> 2014-01-13  Richard Biener  
>
> PR tree-optimization/58921
> PR tree-optimization/59006
> * tree-vect-loop-manip.c (vect_loop_versioning): Remove code
> hoisting invariant stmts.
> * tree-vect-stmts.c (vectorizable_load): Insert the splat of
> invariant loads on the preheader edge if possible.
>
> * gcc.dg/torture/pr58921.c: New testcase.
> * gcc.dg/torture/pr59006.c: Likewise.
> * gcc.dg/vect/pr58508.c: XFAIL no longer handled cases.
>
> Index: gcc/tree-vect-loop-manip.c
> ===
> *** gcc/tree-vect-loop-manip.c  (revision 206576)
> --- gcc/tree-vect-loop-manip.c  (working copy)
> *** vect_loop_versioning (loop_vec_info loop
> *** 2435,2507 
> }
>   }
>
> -
> -   /* Extract load statements on memrefs with zero-stride accesses.  */
> -
> -   if (LOOP_REQUIRES_VERSIONING_FOR_ALIAS (loop_vinfo))
> - {
> -   /* In the loop body, we iterate each statement to check if it is a 
> load.
> -Then we check the DR_STEP of the data reference.  If DR_STEP is zero,
> -then we will hoist the load statement to the loop preheader.  */
> -
> -   basic_block *bbs = LOOP_VINFO_BBS (loop_vinfo);
> -   int nbbs = loop->num_nodes;
> -
> -   for (int i = 0; i < nbbs; ++i)
> -   {
> - for (gimple_stmt_iterator si = gsi_start_bb (bbs[i]);
> -  !gsi_end_p (si);)
> -   {
> - gimple stmt = gsi_stmt (si);
> - stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
> - struct data_reference *dr = STMT_VINFO_DATA_REF (stmt_info);
> -
> - if (is_gimple_assign (stmt)
> - && (!dr
> - || (DR_IS_READ (dr) && integer_zerop (DR_STEP (dr)
> -   {
> - bool hoist = true;
> - ssa_op_iter iter;
> - tree var;
> -
> - /* We hoist a statement if all SSA uses in it are defined
> -outside of the loop.  */
> - FOR_EACH_SSA_TREE_OPERAND (var, stmt, iter, SSA_OP_USE)
> -   {
> - gimple def = SSA_NAME_DEF_STMT (var);
> - if (!gimple_nop_p (def)
> - && flow_bb_inside_loop_p (loop, gimple_bb (def)))
> -   {
> - hoist = false;
> - break;
> -   }
> -   }
> -
> - if (hoist)
> -   {
> - if (dr)
> -   gimple_set_vuse (stmt, NULL);
> -
> - gsi_remove (&si, false);
> - gsi_i

Re: [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.

2013-11-05 Thread Cong Hou
Thank you for your detailed explanation.

Once GCC detects a reduction operation, it will automatically
accumulate all elements in the vector after the loop. In the loop the
reduction variable is always a vector whose elements are reductions of
corresponding values from other vectors. Therefore in your case the
only instruction you need to generate is:

VABAL   ops[3], ops[1], ops[2]

It is OK if you accumulate the elements into one lane of the vector
inside the loop (if one instruction can do this), but you have to make
sure the other elements in the vector remain zero so that the final
result is correct.
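
As a concrete sketch (4-lane accumulator assumed; pseudo-code, not
actual GIMPLE or ARM syntax):

  acc = {0, 0, 0, 0}                        // vector accumulator
  loop:
    VABAL  acc, v1, v2                      // per-lane partial SADs
  sum = acc[0] + acc[1] + acc[2] + acc[3]   // epilog generated by GCC

If an instruction instead accumulated everything into one lane inside
the loop, the other lanes would have to stay zero, or the generated
epilog sum would be wrong.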

If you are confused about the documentation, check the one for
udot_prod (just above usad in md.texi), as it has very similar
behavior to usad. Actually I copied the text from there and made some
changes. As those two instruction patterns are both for vectorization,
their behavior should not be difficult to explain.

If you have more questions or think that the documentation is still
improper please let me know.

Thank you very much!


Cong


On Tue, Nov 5, 2013 at 1:53 AM, James Greenhalgh
 wrote:
> On Mon, Nov 04, 2013 at 06:30:55PM +0000, Cong Hou wrote:
>> On Mon, Nov 4, 2013 at 2:06 AM, James Greenhalgh
>>  wrote:
>> > On Fri, Nov 01, 2013 at 04:48:53PM +0000, Cong Hou wrote:
>> >> diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
>> >> index 2a5a2e1..8f5d39a 100644
>> >> --- a/gcc/doc/md.texi
>> >> +++ b/gcc/doc/md.texi
>> >> @@ -4705,6 +4705,16 @@ wider mode, is computed and added to operand 3.
>> >> Operand 3 is of a mode equal or
>> >>  wider than the mode of the product. The result is placed in operand 0, 
>> >> which
>> >>  is of the same mode as operand 3.
>> >>
>> >> +@cindex @code{ssad@var{m}} instruction pattern
>> >> +@item @samp{ssad@var{m}}
>> >> +@cindex @code{usad@var{m}} instruction pattern
>> >> +@item @samp{usad@var{m}}
>> >> +Compute the sum of absolute differences of two signed/unsigned elements.
>> >> +Operand 1 and operand 2 are of the same mode. Their absolute difference, 
>> >> which
>> >> +is of a wider mode, is computed and added to operand 3. Operand 3 is of 
>> >> a mode
>> >> +equal or wider than the mode of the absolute difference. The result is 
>> >> placed
>> >> +in operand 0, which is of the same mode as operand 3.
>> >> +
>> >>  @cindex @code{ssum_widen@var{m3}} instruction pattern
>> >>  @item @samp{ssum_widen@var{m3}}
>> >>  @cindex @code{usum_widen@var{m3}} instruction pattern
>> >> diff --git a/gcc/expr.c b/gcc/expr.c
>> >> index 4975a64..1db8a49 100644
>> >
>> > I'm not sure I follow, and if I do - I don't think it matches what
>> > you have implemented for i386.
>> >
>> > From your text description I would guess the series of operations to be:
>> >
>> >   v1 = widen (operands[1])
>> >   v2 = widen (operands[2])
>> >   v3 = abs (v1 - v2)
>> >   operands[0] = v3 + operands[3]
>> >
>> > But if I understand the behaviour of PSADBW correctly, what you have
>> > actually implemented is:
>> >
>> >   v1 = widen (operands[1])
>> >   v2 = widen (operands[2])
>> >   v3 = abs (v1 - v2)
>> >   v4 = reduce_plus (v3)
>> >   operands[0] = v4 + operands[3]
>> >
>> > To my mind, synthesizing the reduce_plus step will be wasteful for targets
>> > who do not get this for free with their Absolute Difference step. Imagine a
>> > simple loop where we have synthesized the reduce_plus, we compute partial
>> > sums each loop iteration, though we would be better to leave the 
>> > reduce_plus
>> > step until after the loop. "REDUC_PLUS_EXPR" would be the appropriate
>> > Tree code for this.
>>
>> What do you mean when you use "synthesizing" here? For each pattern,
>> the only synthesized operation is the one being returned from the
>> pattern recognizer. In this case, it is USAD_EXPR. The recognition of
>> reduce sum is necessary as we need corresponding prolog and epilog for
>> reductions, which is already done before pattern recognition. Note
>> that reduction is not a pattern but is a type of vector definition. A
>> vectorization pattern can still be a reduction operation as long as
>> STMT_VINFO_RELATED_STMT of this pattern is a reduction operation. You
>> can check the other two reduction patterns: widen_sum_pattern and
>> dot_prod_pattern for reference.
>
> My apologies for not

Re: [PATCH] Handling == or != comparisons that may affect range test optimization.

2013-11-05 Thread Cong Hou
It seems there have been some changes in GCC. But if you change the
type of n to signed int, the issue appears again:


int foo(int n)
{
   if (n != 0)
   if (n != 1)
   if (n != 2)
   if (n != 3)
   if (n != 4)
 return ++n;
   return n;
}

Also, ifcombine suffers from the same issue here.


thanks,
Cong


On Tue, Nov 5, 2013 at 12:53 PM, Jakub Jelinek  wrote:
> On Tue, Nov 05, 2013 at 01:23:00PM -0700, Jeff Law wrote:
>> On 10/31/13 18:03, Cong Hou wrote:
>> >(This patch is for the bug 58728:
>> >http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58728)
>> >
>> >As in the bug report, consider the following loop:
>> >
>> >int foo(unsigned int n)
>> >{
>> >   if (n != 0)
>> >   if (n != 1)
>> >   if (n != 2)
>> >   if (n != 3)
>> >   if (n != 4)
>> > return ++n;
>> >   return n;
>> >}
>> >
>> >The range test optimization should be able to merge all those five
>> >conditions into one in reassoc pass, but it fails to do so. The reason
>> >is that the phi arg of n is replaced by the constant it compares to in
>> >case of == or != comparisons (in vrp pass). GCC checks there is no
>> >side effect on n between any two neighboring conditions by examining
>> >if they defined the same phi arg in the join node. But as the phi arg
>> >is replaced by a constant, the check fails.
>
> I can't reproduce this, at least not on x86_64-linux with -O2,
> the ifcombine pass already merges those.
>
> Jakub


Re: [PATCH] Handling == or != comparisons that may affect range test optimization.

2013-11-05 Thread Cong Hou
On Tue, Nov 5, 2013 at 12:23 PM, Jeff Law  wrote:
> On 10/31/13 18:03, Cong Hou wrote:
>>
>> (This patch is for the bug 58728:
>> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58728)
>>
>> As in the bug report, consider the following loop:
>>
>> int foo(unsigned int n)
>> {
>>if (n != 0)
>>if (n != 1)
>>if (n != 2)
>>if (n != 3)
>>if (n != 4)
>>  return ++n;
>>return n;
>> }
>>
>> The range test optimization should be able to merge all those five
>> conditions into one in reassoc pass, but it fails to do so. The reason
>> is that the phi arg of n is replaced by the constant it compares to in
>> case of == or != comparisons (in vrp pass). GCC checks there is no
>> side effect on n between any two neighboring conditions by examining
>> if they defined the same phi arg in the join node. But as the phi arg
>> is replaced by a constant, the check fails.
>>
>> This patch deals with this situation by considering the existence of
>> == or != comparisons, which is attached below (a text file is also
>> attached with proper tabs). Bootstrap and make check both get passed.
>>
>> Any comment?
>
>
> +   bool is_eq_expr = is_cond && (gimple_cond_code (stmt) == NE_EXPR
> +   || gimple_cond_code (stmt) ==
> EQ_EXPR)
> + && TREE_CODE (phi_arg) == INTEGER_CST;
> +
> +   if (is_eq_expr)
> + {
> +   lhs = gimple_cond_lhs (stmt);
> +   rhs = gimple_cond_rhs (stmt);
> +
> +   if (operand_equal_p (lhs, phi_arg, 0))
> + {
> +   tree t = lhs;
> +   lhs = rhs;
> +   rhs = t;
> + }
> +   if (operand_equal_p (rhs, phi_arg, 0)
> +   && operand_equal_p (lhs, phi_arg2, 0))
> + continue;
> + }
> +
> +   gimple stmt2 = last_stmt (test_bb);
> +   bool is_eq_expr2 = gimple_code (stmt2) == GIMPLE_COND
> +&& (gimple_cond_code (stmt2) == NE_EXPR
> +|| gimple_cond_code (stmt2) == EQ_EXPR)
> +&& TREE_CODE (phi_arg2) == INTEGER_CST;
> +
> +   if (is_eq_expr2)
> + {
> +   lhs2 = gimple_cond_lhs (stmt2);
> +   rhs2 = gimple_cond_rhs (stmt2);
> +
> +   if (operand_equal_p (lhs2, phi_arg2, 0))
> + {
> +   tree t = lhs2;
> +   lhs2 = rhs2;
> +   rhs2 = t;
> + }
> +   if (operand_equal_p (rhs2, phi_arg2, 0)
> +   && operand_equal_p (lhs2, phi_arg, 0))
> + continue;
> + }
>
> Can you factor those two hunks of nearly identical code into a single
> function and call it twice?  I'm also curious if you really need the code to
> swap lhs/rhs.  When can the LHS of a cond be an integer constant?  Don't we
> canonicalize it as <var> <op> <constant>?


I was not aware that the comparison between a variable and a constant
will always be canonicalized as <var> <op> <constant>. Then I
will remove the swap, and as the code is much smaller, I think it may
not be necessary to create a function for them.


>
> I'd probably write the ChangeLog as:
>
> * tree-ssa-reassoc.c (suitable_cond_bb): Handle constant PHI
> operands resulting from propagation of edge equivalences.
>
>

OK, much better than mine ;)


> I'm also curious -- did this code show up in a particular benchmark, if so,
> which one?

I didn't find this problem in any benchmark; it came from another
concern about loop upper bound estimation. Look at the following code:

int foo(unsigned int n, int r)
{
  int i;
  if (n > 0)
if (n < 4)
{
  do {
 --n;
 r *= 2;
  } while (n > 0);
}
  return r+n;
}


In order to get the upper bound of the loop in this function, GCC
traverses conditions n<4 and n>0 separately and tries to get any
useful information. But as those two conditions cannot be combined
into one due to this issue (note that n>0 will be transformed into
n!=0), when GCC sees n<4, it will consider the possibility that n may
be equal to 0, in which case the upper bound is UINT_MAX. If those two
conditions can be combined into one, which is n-1<=2, then we can get
the correct upper bound of the loop.
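
For reference, the combined test works because of unsigned wraparound:
for unsigned n,

  n > 0 && n < 4   is equivalent to   n - 1 <= 2

since n - 1 wraps to UINT_MAX when n == 0. The single test therefore
rules out exactly the n == 0 case that forces the UINT_MAX upper bound
estimate.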


thanks,
Cong

>
> jeff


Re: [PATCH] Small fix: add { dg-require-effective-target vect_int } to testsuite/gcc.dg/vect/pr58508.c

2013-11-07 Thread Cong Hou
Ping. OK for the trunk?




thanks,
Cong


On Fri, Nov 1, 2013 at 10:47 AM, Cong Hou  wrote:
> It seems that on some platforms the loops in
> testsuite/gcc.dg/vect/pr58508.c may not be vectorizable. This
> small patch added { dg-require-effective-target vect_int } to make
> sure all loops can be vectorized.
>
>
> thanks,
> Cong
>
>
> diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
> index 9d0f4a5..3d9916d 100644
> --- a/gcc/testsuite/ChangeLog
> +++ b/gcc/testsuite/ChangeLog
> @@ -1,3 +1,7 @@
> +2013-10-29  Cong Hou  
> +
> +   * gcc.dg/vect/pr58508.c: Update.
> +
>  2013-10-15  Cong Hou  
>
> * gcc.dg/vect/pr58508.c: New test.
> diff --git a/gcc/testsuite/gcc.dg/vect/pr58508.c
> b/gcc/testsuite/gcc.dg/vect/pr58508.c
> index 6484a65..fff7a04 100644
> --- a/gcc/testsuite/gcc.dg/vect/pr58508.c
> +++ b/gcc/testsuite/gcc.dg/vect/pr58508.c
> @@ -1,3 +1,4 @@
> +/* { dg-require-effective-target vect_int } */
>  /* { dg-do compile } */
>  /* { dg-options "-O2 -ftree-vectorize -fdump-tree-vect-details" } */


Re: [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.

2013-11-07 Thread Cong Hou
Now is this patch OK for the trunk? Thank you!



thanks,
Cong


On Tue, Nov 5, 2013 at 9:58 AM, Cong Hou  wrote:
> Thank you for your detailed explanation.
>
> Once GCC detects a reduction operation, it will automatically
> accumulate all elements in the vector after the loop. In the loop the
> reduction variable is always a vector whose elements are reductions of
> corresponding values from other vectors. Therefore in your case the
> only instruction you need to generate is:
>
> VABAL   ops[3], ops[1], ops[2]
>
> It is OK if you accumulate the elements into one in the vector inside
> of the loop (if one instruction can do this), but you have to make
> sure other elements in the vector should remain zero so that the final
> result is correct.
>
> If you are confused about the documentation, check the one for
> udot_prod (just above usad in md.texi), as it has very similar
> behavior as usad. Actually I copied the text from there and did some
> changes. As those two instruction patterns are both for vectorization,
> their behavior should not be difficult to explain.
>
> If you have more questions or think that the documentation is still
> improper please let me know.
>
> Thank you very much!
>
>
> Cong
>
>
> On Tue, Nov 5, 2013 at 1:53 AM, James Greenhalgh
>  wrote:
>>> On Mon, Nov 04, 2013 at 06:30:55PM +0000, Cong Hou wrote:
>>> On Mon, Nov 4, 2013 at 2:06 AM, James Greenhalgh
>>>  wrote:
>>> > On Fri, Nov 01, 2013 at 04:48:53PM +0000, Cong Hou wrote:
>>> >> diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
>>> >> index 2a5a2e1..8f5d39a 100644
>>> >> --- a/gcc/doc/md.texi
>>> >> +++ b/gcc/doc/md.texi
>>> >> @@ -4705,6 +4705,16 @@ wider mode, is computed and added to operand 3.
>>> >> Operand 3 is of a mode equal or
>>> >>  wider than the mode of the product. The result is placed in operand 0, 
>>> >> which
>>> >>  is of the same mode as operand 3.
>>> >>
>>> >> +@cindex @code{ssad@var{m}} instruction pattern
>>> >> +@item @samp{ssad@var{m}}
>>> >> +@cindex @code{usad@var{m}} instruction pattern
>>> >> +@item @samp{usad@var{m}}
>>> >> +Compute the sum of absolute differences of two signed/unsigned elements.
>>> >> +Operand 1 and operand 2 are of the same mode. Their absolute 
>>> >> difference, which
>>> >> +is of a wider mode, is computed and added to operand 3. Operand 3 is of 
>>> >> a mode
>>> >> +equal or wider than the mode of the absolute difference. The result is 
>>> >> placed
>>> >> +in operand 0, which is of the same mode as operand 3.
>>> >> +
>>> >>  @cindex @code{ssum_widen@var{m3}} instruction pattern
>>> >>  @item @samp{ssum_widen@var{m3}}
>>> >>  @cindex @code{usum_widen@var{m3}} instruction pattern
>>> >> diff --git a/gcc/expr.c b/gcc/expr.c
>>> >> index 4975a64..1db8a49 100644
>>> >
>>> > I'm not sure I follow, and if I do - I don't think it matches what
>>> > you have implemented for i386.
>>> >
>>> > From your text description I would guess the series of operations to be:
>>> >
>>> >   v1 = widen (operands[1])
>>> >   v2 = widen (operands[2])
>>> >   v3 = abs (v1 - v2)
>>> >   operands[0] = v3 + operands[3]
>>> >
>>> > But if I understand the behaviour of PSADBW correctly, what you have
>>> > actually implemented is:
>>> >
>>> >   v1 = widen (operands[1])
>>> >   v2 = widen (operands[2])
>>> >   v3 = abs (v1 - v2)
>>> >   v4 = reduce_plus (v3)
>>> >   operands[0] = v4 + operands[3]
>>> >
>>> > To my mind, synthesizing the reduce_plus step will be wasteful for targets
>>> > who do not get this for free with their Absolute Difference step. Imagine 
>>> > a
>>> > simple loop where we have synthesized the reduce_plus, we compute partial
>>> > sums each loop iteration, though we would be better to leave the 
>>> > reduce_plus
>>> > step until after the loop. "REDUC_PLUS_EXPR" would be the appropriate
>>> > Tree code for this.
>>>
>>> What do you mean when you use "synthesizing" here? For each pattern,
>>> the only synthesized operation is the one being returned from the
>>> pattern recognizer. In this case, it is USAD_EXPR. The recognition of

[PATCH] Bug fix for PR59050

2013-11-08 Thread Cong Hou
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59050

This is my bad. I forgot to check the test result for gfortran. With
this patch the bug should be fixed (tested on x86-64).


thanks,
Cong


diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 90b01f2..e62c672 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,8 @@
+2013-11-08  Cong Hou  
+
+   PR tree-optimization/59050
+   * tree-vect-data-refs.c (comp_dr_addr_with_seg_len_pair): Bug fix.
+
 2013-11-07  Cong Hou  

* tree-vect-loop-manip.c (vect_create_cond_for_alias_checks):
diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c
index b2a31b1..b7eb926 100644
--- a/gcc/tree-vect-data-refs.c
+++ b/gcc/tree-vect-data-refs.c
@@ -2669,9 +2669,9 @@ comp_dr_addr_with_seg_len_pair (const void *p1_,
const void *p2_)
   if (comp_res != 0)
return comp_res;
 }
-  if (tree_int_cst_compare (p11.offset, p21.offset) < 0)
+  else if (tree_int_cst_compare (p11.offset, p21.offset) < 0)
 return -1;
-  if (tree_int_cst_compare (p11.offset, p21.offset) > 0)
+  else if (tree_int_cst_compare (p11.offset, p21.offset) > 0)
 return 1;
   if (TREE_CODE (p12.offset) != INTEGER_CST
   || TREE_CODE (p22.offset) != INTEGER_CST)
@@ -2680,9 +2680,9 @@ comp_dr_addr_with_seg_len_pair (const void *p1_,
const void *p2_)
   if (comp_res != 0)
return comp_res;
 }
-  if (tree_int_cst_compare (p12.offset, p22.offset) < 0)
+  else if (tree_int_cst_compare (p12.offset, p22.offset) < 0)
 return -1;
-  if (tree_int_cst_compare (p12.offset, p22.offset) > 0)
+  else if (tree_int_cst_compare (p12.offset, p22.offset) > 0)
 return 1;

   return 0;


Re: [PATCH] Reducing number of alias checks in vectorization.

2013-11-08 Thread Cong Hou
Thank you for the report. I have submitted a bug-fix patch that is
waiting for review.



thanks,
Cong


On Fri, Nov 8, 2013 at 5:26 AM, Dominique Dhumieres  wrote:
> According to http://gcc.gnu.org/ml/gcc-regression/2013-11/msg00197.html
> revision 204538 is breaking several tests. On x86_64-apple-darwin* the
> failures I have looked at are of the kind
>
> /opt/gcc/work/gcc/testsuite/gfortran.dg/typebound_operator_9.f03: In function 
> 'nabla2_cart2d':
> /opt/gcc/work/gcc/testsuite/gfortran.dg/typebound_operator_9.f03:272:0: 
> internal compiler error: tree check: expected integer_cst, have plus_expr in 
> tree_int_cst_lt, at tree.c:7083
>function nabla2_cart2d (obj)
>
> TIA
>
> Dominique


Re: [PATCH] Bug fix for PR59050

2013-11-08 Thread Cong Hou
Yes, I think so. The bug is that the arguments of
tree_int_cst_compare() may not be constant integers. This patch should
take care of it.
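
To spell out the control flow after the fix (condensed from the
patch):

  if (TREE_CODE (p11.offset) != INTEGER_CST
      || TREE_CODE (p21.offset) != INTEGER_CST)
    {
      /* Non-constant offsets: structural comparison only.  */
      comp_res = compare_tree (p11.offset, p21.offset);
      if (comp_res != 0)
        return comp_res;
    }
  else if (tree_int_cst_compare (p11.offset, p21.offset) < 0)
    return -1;
  else if (tree_int_cst_compare (p11.offset, p21.offset) > 0)
    return 1;

Without the else, equal non-constant offsets fell through into
tree_int_cst_compare, which ICEs on anything that is not an
INTEGER_CST.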



thanks,
Cong


On Fri, Nov 8, 2013 at 12:06 PM, H.J. Lu  wrote:
> On Fri, Nov 8, 2013 at 10:34 AM, Cong Hou  wrote:
>> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59050
>>
>> This is my bad. I forgot to check the test result for gfortran. With
>> this patch the bug should be fixed (tested on x86-64).
>>
>>
>> thanks,
>> Cong
>>
>>
>> diff --git a/gcc/ChangeLog b/gcc/ChangeLog
>> index 90b01f2..e62c672 100644
>> --- a/gcc/ChangeLog
>> +++ b/gcc/ChangeLog
>> @@ -1,3 +1,8 @@
>> +2013-11-08  Cong Hou  
>> +
>> +   PR tree-optimization/59050
>> +   * tree-vect-data-refs.c (comp_dr_addr_with_seg_len_pair): Bug fix.
>> +
>
> Many SPEC CPU 2000 tests failed with
>
> costab.c: In function 'HandleCoinc2':
> costab.c:1565:17: internal compiler error: tree check: expected
> integer_cst, have plus_expr in tree_int_cst_lt, at tree.c:7083
>  voidHandleCoinc2 ( cos1, cos2, hdfactor )
>  ^
> 0xb6e084 tree_check_failed(tree_node const*, char const*, int, char const*, 
> ...)
> ../../src-trunk/gcc/tree.c:9477
> 0xb6ffe4 tree_check
> ../../src-trunk/gcc/tree.h:2914
> 0xb6ffe4 tree_int_cst_lt(tree_node const*, tree_node const*)
> ../../src-trunk/gcc/tree.c:7083
> 0xb70020 tree_int_cst_compare(tree_node const*, tree_node const*)
> ../../src-trunk/gcc/tree.c:7093
> 0xe53f1c comp_dr_addr_with_seg_len_pair
> ../../src-trunk/gcc/tree-vect-data-refs.c:2672
> 0xe5cbb5 vec vl_embed>::qsort(int (*)(void const*, void const*))
> ../../src-trunk/gcc/vec.h:941
> 0xe5cbb5 vec::qsort(int
> (*)(void const*, void const*))
> ../../src-trunk/gcc/vec.h:1620
> 0xe5cbb5 vect_prune_runtime_alias_test_list(_loop_vec_info*)
> ../../src-trunk/gcc/tree-vect-data-refs.c:2845
> 0xb39382 vect_analyze_loop_2
> ../../src-trunk/gcc/tree-vect-loop.c:1716
> 0xb39382 vect_analyze_loop(loop*)
> ../../src-trunk/gcc/tree-vect-loop.c:1807
> 0xb4f78f vectorize_loops()
> ../../src-trunk/gcc/tree-vectorizer.c:360
> Please submit a full bug report,
> with preprocessed source if appropriate.
> Please include the complete backtrace with any bug report.
> See <http://gcc.gnu.org/bugs.html> for instructions.
> specmake[3]: *** [costab.o] Error 1
> specmake[3]: *** Waiting for unfinished jobs
>
> Will this patch fix them?
>
>
> --
> H.J.


Re: [PATCH] Bug fix for PR59050

2013-11-11 Thread Cong Hou
Hi Jeff

I have committed the fix. Please update your repo.

Thank you!


Cong



On Mon, Nov 11, 2013 at 10:32 AM, Jeff Law  wrote:
> On 11/11/13 02:32, Richard Biener wrote:
>>
>> On Fri, 8 Nov 2013, Cong Hou wrote:
>>
>>> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59050
>>>
>>> This is my bad. I forgot to check the test result for gfortran. With
>>> this patch the bug should be fixed (tested on x86-64).
>>
>>
>> Ok.
>>
>> Btw, requirements are to bootstrap and test with all default
>> languages enabled (that is, without any --enable-languages or
>> --enable-languages=all).  That
>> gets you c,c++,objc,java,fortran,lto and misses obj-c++ ada and go.
>> I am personally using --enable-languages=all,ada,obj-c++.
>
> FWIW, I bootstrapped with Cong's patch to keep my own test results clean.
> So it's already been through those tests.
>
> If Cong doesn't get to it soon, I'll check it in myself.
>
> jeff
>


Re: [PATCH] Bug fix for PR59050

2013-11-11 Thread Cong Hou
Thank you for your advice! I will follow this instruction in the future.


thanks,
Cong


On Mon, Nov 11, 2013 at 1:32 AM, Richard Biener  wrote:
> On Fri, 8 Nov 2013, Cong Hou wrote:
>
>> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59050
>>
>> This is my bad. I forgot to check the test result for gfortran. With
>> this patch the bug should be fixed (tested on x86-64).
>
> Ok.
>
> Btw, requirements are to bootstrap and test with all default
> languages enabled (that is, without any --enable-languages or
> --enable-languages=all).  That
> gets you c,c++,objc,java,fortran,lto and misses obj-c++ ada and go.
> I am personally using --enable-languages=all,ada,obj-c++.
>
> Thanks,
> Richard.
>
>> thanks,
>> Cong
>>
>>
>> diff --git a/gcc/ChangeLog b/gcc/ChangeLog
>> index 90b01f2..e62c672 100644
>> --- a/gcc/ChangeLog
>> +++ b/gcc/ChangeLog
>> @@ -1,3 +1,8 @@
>> +2013-11-08  Cong Hou  
>> +
>> +   PR tree-optimization/59050
>> +   * tree-vect-data-refs.c (comp_dr_addr_with_seg_len_pair): Bug fix.
>> +
>>  2013-11-07  Cong Hou  
>>
>> * tree-vect-loop-manip.c (vect_create_cond_for_alias_checks):
>> diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c
>> index b2a31b1..b7eb926 100644
>> --- a/gcc/tree-vect-data-refs.c
>> +++ b/gcc/tree-vect-data-refs.c
>> @@ -2669,9 +2669,9 @@ comp_dr_addr_with_seg_len_pair (const void *p1_,
>> const void *p2_)
>>if (comp_res != 0)
>> return comp_res;
>>  }
>> -  if (tree_int_cst_compare (p11.offset, p21.offset) < 0)
>> +  else if (tree_int_cst_compare (p11.offset, p21.offset) < 0)
>>  return -1;
>> -  if (tree_int_cst_compare (p11.offset, p21.offset) > 0)
>> +  else if (tree_int_cst_compare (p11.offset, p21.offset) > 0)
>>  return 1;
>>if (TREE_CODE (p12.offset) != INTEGER_CST
>>|| TREE_CODE (p22.offset) != INTEGER_CST)
>> @@ -2680,9 +2680,9 @@ comp_dr_addr_with_seg_len_pair (const void *p1_,
>> const void *p2_)
>>if (comp_res != 0)
>> return comp_res;
>>  }
>> -  if (tree_int_cst_compare (p12.offset, p22.offset) < 0)
>> +  else if (tree_int_cst_compare (p12.offset, p22.offset) < 0)
>>  return -1;
>> -  if (tree_int_cst_compare (p12.offset, p22.offset) > 0)
>> +  else if (tree_int_cst_compare (p12.offset, p22.offset) > 0)
>>  return 1;
>>
>>return 0;
>>
>>
>
> --
> Richard Biener 
> SUSE / SUSE Labs
> SUSE LINUX Products GmbH - Nuernberg - AG Nuernberg - HRB 16746
> GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer


Re: [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.

2013-11-11 Thread Cong Hou
Hi James

Sorry for the late reply.


On Fri, Nov 8, 2013 at 2:55 AM, James Greenhalgh
 wrote:
>> On Tue, Nov 5, 2013 at 9:58 AM, Cong Hou  wrote:
>> > Thank you for your detailed explanation.
>> >
>> > Once GCC detects a reduction operation, it will automatically
>> > accumulate all elements in the vector after the loop. In the loop the
>> > reduction variable is always a vector whose elements are reductions of
>> > corresponding values from other vectors. Therefore in your case the
>> > only instruction you need to generate is:
>> >
>> > VABAL   ops[3], ops[1], ops[2]
>> >
>> > It is OK if you accumulate the elements into one in the vector inside
>> > of the loop (if one instruction can do this), but you have to make
>> > sure other elements in the vector should remain zero so that the final
>> > result is correct.
>> >
>> > If you are confused about the documentation, check the one for
>> > udot_prod (just above usad in md.texi), as it has very similar
>> > behavior as usad. Actually I copied the text from there and did some
>> > changes. As those two instruction patterns are both for vectorization,
>> > their behavior should not be difficult to explain.
>> >
>> > If you have more questions or think that the documentation is still
>> > improper please let me know.
>
> Hi Cong,
>
> Thanks for your reply.
>
> I've looked at Dorit's original patch adding WIDEN_SUM_EXPR and
> DOT_PROD_EXPR and I see that the same ambiguity exists for
> DOT_PROD_EXPR. Can you please add a note in your tree.def
> that SAD_EXPR, like DOT_PROD_EXPR can be expanded as either:
>
>   tmp = WIDEN_MINUS_EXPR (arg1, arg2)
>   tmp2 = ABS_EXPR (tmp)
>   arg3 = PLUS_EXPR (tmp2, arg3)
>
> or:
>
>   tmp = WIDEN_MINUS_EXPR (arg1, arg2)
>   tmp2 = ABS_EXPR (tmp)
>   arg3 = WIDEN_SUM_EXPR (tmp2, arg3)
>
> Where WIDEN_MINUS_EXPR is a signed MINUS_EXPR, returning a
> a value of the same (widened) type as arg3.
>


I have added it, although we currently don't have WIDEN_MINUS_EXPR (I
mentioned it in tree.def).


> Also, while looking for the history of DOT_PROD_EXPR I spotted this
> patch:
>
>   [autovect] [patch] detect mult-hi and sad patterns
>   http://gcc.gnu.org/ml/gcc-patches/2005-10/msg01394.html
>
> I wonder what the reason was for that patch to be dropped?
>

It has been 8 years... I have no idea why that patch was never
accepted; there is not even a reply in that thread. But I believe
recognizing the SAD pattern is very important. ARM also provides
instructions for it.


Thank you for your comment again!


thanks,
Cong



> Thanks,
> James
>
diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 6bdaa31..37ff6c4 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,4 +1,24 @@
-2013-11-01  Trevor Saunders  
+2013-10-29  Cong Hou  
+
+   * tree-vect-patterns.c (vect_recog_sad_pattern): New function for SAD
+   pattern recognition.
+   (type_conversion_p): PROMOTION is true if it's a type promotion
+   conversion, and false otherwise.  Return true if the given expression
+   is a type conversion one.
+   * tree-vectorizer.h: Adjust the number of patterns.
+   * tree.def: Add SAD_EXPR.
+   * optabs.def: Add sad_optab.
+   * cfgexpand.c (expand_debug_expr): Add SAD_EXPR case.
+   * expr.c (expand_expr_real_2): Likewise.
+   * gimple-pretty-print.c (dump_ternary_rhs): Likewise.
+   * gimple.c (get_gimple_rhs_num_ops): Likewise.
+   * optabs.c (optab_for_tree_code): Likewise.
+   * tree-cfg.c (estimate_operator_cost): Likewise.
+   * tree-ssa-operands.c (get_expr_operands): Likewise.
+   * tree-vect-loop.c (get_initial_def_for_reduction): Likewise.
+   * config/i386/sse.md: Add SSE2 and AVX2 expand for SAD.
+   * doc/generic.texi: Add document for SAD_EXPR.
+   * doc/md.texi: Add document for ssad and usad.
 
* function.c (reorder_blocks): Convert block_stack to a stack_vec.
* gimplify.c (gimplify_compound_lval): Likewise.
diff --git a/gcc/cfgexpand.c b/gcc/cfgexpand.c
index fb05ce7..1f824fb 100644
--- a/gcc/cfgexpand.c
+++ b/gcc/cfgexpand.c
@@ -2740,6 +2740,7 @@ expand_debug_expr (tree exp)
{
case COND_EXPR:
case DOT_PROD_EXPR:
+   case SAD_EXPR:
case WIDEN_MULT_PLUS_EXPR:
case WIDEN_MULT_MINUS_EXPR:
case FMA_EXPR:
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index 9094a1c..af73817 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -7278,6 +7278,36 @@
   DONE;
 })
 
+(define_expand "usadv16qi"
+  [(match_operand:V4SI 0 "register_operand")
+   (match_operand:V16QI 1 "register_operand")
+   (match_

Re: [PATCH] Small fix: add { dg-require-effective-target vect_int } to testsuite/gcc.dg/vect/pr58508.c

2013-11-12 Thread Cong Hou
Hi Jakub

Thank you for pointing it out. The updated patch is pasted below. I
will pay attention to it in the future.


thanks,
Cong




diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
index 3d9916d..32a6ff7 100644
--- a/gcc/testsuite/ChangeLog
+++ b/gcc/testsuite/ChangeLog
@@ -1,3 +1,7 @@
+2013-11-12  Cong Hou  
+
+   * gcc.dg/vect/pr58508.c: Remove dg-options as vect_int is indicated.
+
 2013-10-29  Cong Hou  

* gcc.dg/vect/pr58508.c: Update.
diff --git a/gcc/testsuite/gcc.dg/vect/pr58508.c
b/gcc/testsuite/gcc.dg/vect/pr58508.c
index fff7a04..c4921bb 100644
--- a/gcc/testsuite/gcc.dg/vect/pr58508.c
+++ b/gcc/testsuite/gcc.dg/vect/pr58508.c
@@ -1,6 +1,5 @@
 /* { dg-require-effective-target vect_int } */
 /* { dg-do compile } */
-/* { dg-options "-O2 -ftree-vectorize -fdump-tree-vect-details" } */


 /* The GCC vectorizer generates loop versioning for the following loop





On Tue, Nov 12, 2013 at 6:05 AM, Jakub Jelinek  wrote:
> On Thu, Nov 07, 2013 at 06:24:55PM -0800, Cong Hou wrote:
>> Ping. OK for the trunk?
>> On Fri, Nov 1, 2013 at 10:47 AM, Cong Hou  wrote:
>> > It seems that on some platforms the loops in
>> > testsuite/gcc.dg/vect/pr58508.c may not be vectorizable. This
>> > small patch added { dg-require-effective-target vect_int } to make
>> > sure all loops can be vectorized.
>> > diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
>> > index 9d0f4a5..3d9916d 100644
>> > --- a/gcc/testsuite/ChangeLog
>> > +++ b/gcc/testsuite/ChangeLog
>> > @@ -1,3 +1,7 @@
>> > +2013-10-29  Cong Hou  
>> > +
>> > +   * gcc.dg/vect/pr58508.c: Update.
>> > +
>> >  2013-10-15  Cong Hou  
>> >
>> > * gcc.dg/vect/pr58508.c: New test.
>> > diff --git a/gcc/testsuite/gcc.dg/vect/pr58508.c
>> > b/gcc/testsuite/gcc.dg/vect/pr58508.c
>> > index 6484a65..fff7a04 100644
>> > --- a/gcc/testsuite/gcc.dg/vect/pr58508.c
>> > +++ b/gcc/testsuite/gcc.dg/vect/pr58508.c
>> > @@ -1,3 +1,4 @@
>> > +/* { dg-require-effective-target vect_int } */
>> >  /* { dg-do compile } */
>> >  /* { dg-options "-O2 -ftree-vectorize -fdump-tree-vect-details" } */
>
> This isn't the only bug in the testcase.  Another one is using
> dg-options in gcc.dg/vect/, you should just leave that out,
> the default options already include those options, but explicit dg-options
> mean that other required options like -msse2 on i?86 aren't added.
>
> Jakub


Re: [PATCH] Small fix: add { dg-require-effective-target vect_int } to testsuite/gcc.dg/vect/pr58508.c

2013-11-12 Thread Cong Hou
Got it!


thanks,
Cong


On Tue, Nov 12, 2013 at 10:05 AM, Jakub Jelinek  wrote:
> On Tue, Nov 12, 2013 at 10:04:15AM -0800, Cong Hou wrote:
>> Thank you for pointing it out. The updated patch is pasted below. I
>> will pay attention to it in the future.
>
> Ok, thanks.
> Note, you can use dg-additional-options if needed in g*.dg/vect/, just not
> dg-options.
>
>> --- a/gcc/testsuite/ChangeLog
>> +++ b/gcc/testsuite/ChangeLog
>> @@ -1,3 +1,7 @@
>> +2013-11-12  Cong Hou  
>> +
>> +   * gcc.dg/vect/pr58508.c: Remove dg-options as vect_int is indicated.
>> +
>>  2013-10-29  Cong Hou  
>>
>> * gcc.dg/vect/pr58508.c: Update.
>> diff --git a/gcc/testsuite/gcc.dg/vect/pr58508.c
>> b/gcc/testsuite/gcc.dg/vect/pr58508.c
>> index fff7a04..c4921bb 100644
>> --- a/gcc/testsuite/gcc.dg/vect/pr58508.c
>> +++ b/gcc/testsuite/gcc.dg/vect/pr58508.c
>> @@ -1,6 +1,5 @@
>>  /* { dg-require-effective-target vect_int } */
>>  /* { dg-do compile } */
>> -/* { dg-options "-O2 -ftree-vectorize -fdump-tree-vect-details" } */
>>
>>
>>  /* The GCC vectorizer generates loop versioning for the following loop
>
> Jakub


[PATCH] [Vectorization] Fixing a bug in alias checks merger.

2013-11-12 Thread Cong Hou
The current alias check merger does not consider the DR_STEP of
data-refs when sorting data-refs. For the following loop:

for (i = 0; i < N; ++i)
  a[i] = b[0] + b[i] + b[1];

The data refs b[0] and b[i] have the same DR_INIT and DR_OFFSET, so
after sorting the three DR pairs, the following order is a possible
result:

 (a[i], b[0]), (a[i], b[i]), (a[i], b[1])

This prevents the alias checks for (a[i], b[0]) and (a[i], b[1]) from
being merged.

This patch added the comparison between DR_STEP of two data refs
during the sort.
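
With DR_STEP taken into account, the invariant refs b[0] and b[1]
(both with zero step) sort next to each other, e.g.:

  (a[i], b[0]), (a[i], b[1]), (a[i], b[i])

so the single-scan merger can now combine the checks against b[0] and
b[1] into one.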

The test case is also updated. The previous one used explicit
dg-options, which overrides the default options added for vect_int
targets. The test case also assumed a vector can hold at least 4
integers of int type, which may not be true on some targets.

The patch is pasted below. Bootstrapped and tested on an x86-64 machine.



thanks,
Cong



diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 2c0554b..5faa5ca 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,14 @@
+2013-11-12  Cong Hou  
+
+ * tree-vectorizer.h (struct dr_with_seg_len): Remove the base
+ address field as it can be obtained from dr.  Rename the struct.
+ * tree-vect-data-refs.c (comp_dr_with_seg_len_pair): Consider
+ steps of data references during sort.
+ (vect_prune_runtime_alias_test_list): Adjust with the change to
+ struct dr_with_seg_len.
+ * tree-vect-loop-manip.c (vect_create_cond_for_alias_checks):
+ Adjust with the change to struct dr_with_seg_len.
+
 2013-11-12  Jeff Law  

  * tree-ssa-threadedge.c (thread_around_empty_blocks): New
diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
index 09c7f20..8075409 100644
--- a/gcc/testsuite/ChangeLog
+++ b/gcc/testsuite/ChangeLog
@@ -1,3 +1,7 @@
+2013-11-12  Cong Hou  
+
+ * gcc.dg/vect/vect-alias-check.c: Update.
+
 2013-11-12  Balaji V. Iyer  

  * gcc.dg/cilk-plus/cilk-plus.exp: Added a check for LTO before running
diff --git a/gcc/testsuite/gcc.dg/vect/vect-alias-check.c
b/gcc/testsuite/gcc.dg/vect/vect-alias-check.c
index 64a4e0c..c1bffed 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-alias-check.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-alias-check.c
@@ -1,17 +1,17 @@
 /* { dg-require-effective-target vect_int } */
 /* { dg-do compile } */
-/* { dg-options "-O2 -ftree-vectorize
--param=vect-max-version-for-alias-checks=2 -fdump-tree-vect-details"
} */
+/* { dg-additional-options "--param=vect-max-version-for-alias-checks=2" } */

-/* A test case showing three potential alias checks between
-   a[i] and b[i], b[i+7], b[i+14]. With alias checks merging
-   enabled, those tree checks can be merged into one, and the
-   loop will be vectorized with vect-max-version-for-alias-checks=2.  */
+/* A test case showing four potential alias checks between a[i] and b[0], b[1],
+   b[i+1] and b[i+2].  With alias check merging enabled, those four checks
+   can be merged into two, and the loop will be vectorized with
+   vect-max-version-for-alias-checks=2.  */

 void foo (int *a, int *b)
 {
   int i;
   for (i = 0; i < 1000; ++i)
-a[i] = b[i] + b[i+7] + b[i+14];
+a[i] = b[0] + b[1] + b[i+1] + b[i+2];
 }

 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c
index c479775..7f0920d 100644
--- a/gcc/tree-vect-data-refs.c
+++ b/gcc/tree-vect-data-refs.c
@@ -2620,7 +2620,7 @@ vect_analyze_data_ref_accesses (loop_vec_info
loop_vinfo, bb_vec_info bb_vinfo)
 }


-/* Operator == between two dr_addr_with_seg_len objects.
+/* Operator == between two dr_with_seg_len objects.

This equality operator is used to make sure two data refs
are the same one so that we will consider to combine the
@@ -2628,62 +2628,51 @@ vect_analyze_data_ref_accesses (loop_vec_info
loop_vinfo, bb_vec_info bb_vinfo)
refs.  */

 static bool
-operator == (const dr_addr_with_seg_len& d1,
- const dr_addr_with_seg_len& d2)
+operator == (const dr_with_seg_len& d1,
+ const dr_with_seg_len& d2)
 {
-  return operand_equal_p (d1.basic_addr, d2.basic_addr, 0)
- && compare_tree (d1.offset, d2.offset) == 0
- && compare_tree (d1.seg_len, d2.seg_len) == 0;
+  return operand_equal_p (DR_BASE_ADDRESS (d1.dr),
+  DR_BASE_ADDRESS (d2.dr), 0)
+   && compare_tree (d1.offset, d2.offset) == 0
+   && compare_tree (d1.seg_len, d2.seg_len) == 0;
 }

-/* Function comp_dr_addr_with_seg_len_pair.
+/* Function comp_dr_with_seg_len_pair.

-   Comparison function for sorting objects of dr_addr_with_seg_len_pair_t
+   Comparison function for sorting objects of dr_with_seg_len_pair_t
so that we can combine aliasing checks in one scan.  */

 static int
-comp_dr_addr_with_seg_len_pair (const void *p1_, const void *p2_)
+comp_dr_with_seg_len_pair (const void *p1_, const void *p2_)
 {
-  const dr_addr_with_seg_len_pair_t* p1 =
-(const dr_addr_with_seg_len_pair_t *) p1_;
-  const dr_addr_with_seg_len_pair_t* p2 =
-(const dr_addr

[PATCH] Do not set flag_complex_method to 2 for C++ by default.

2013-11-13 Thread Cong Hou
This patch is for PR58963.

In the patch http://gcc.gnu.org/ml/gcc-patches/2005-02/msg00560.html,
the builtin function is used to perform complex multiplication and
division. This is to comply with the C99 standard, but I am wondering if
C++ also needs this.

There is no complex keyword in C++, and nothing in the C++ standard
about the behavior of operations on complex types. The <complex>
header file is implemented entirely in plain source code, including
complex multiplication and division. GCC should not do too much for
them by emitting builtin calls by default (although we can set
-fcx-limited-range to prevent GCC from doing this), which has a big
impact on performance (there may be missed vectorization
opportunities).
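
(For background: with -fcx-limited-range GCC open-codes the textbook
formula

  (a + b*i) * (c + d*i) = (a*c - b*d) + (a*d + b*c)*i

whereas the default method calls libgcc helpers such as __mulsc3 that
additionally recover infinities when the formula produces NaN
components, as C99 Annex G requires.)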

In this patch flag_complex_method will not be set to 2 for C++.
Bootstrapped and tested on an x86-64 machine.


thanks,
Cong


Index: gcc/c-family/c-opts.c
===
--- gcc/c-family/c-opts.c (revision 204712)
+++ gcc/c-family/c-opts.c (working copy)
@@ -198,8 +198,10 @@ c_common_init_options_struct (struct gcc_options *opts)
   opts->x_warn_write_strings = c_dialect_cxx ();
   opts->x_flag_warn_unused_result = true;

-  /* By default, C99-like requirements for complex multiply and divide.  */
-  opts->x_flag_complex_method = 2;
+  /* By default, C99-like requirements for complex multiply and divide.
+ But for C++ this should not be required.  */
+  if (c_language != clk_cxx && c_language != clk_objcxx)
+opts->x_flag_complex_method = 2;
 }

 /* Common initialization before calling option handlers.  */
Index: gcc/c-family/ChangeLog
===
--- gcc/c-family/ChangeLog (revision 204712)
+++ gcc/c-family/ChangeLog (working copy)
@@ -1,3 +1,8 @@
+2013-11-13  Cong Hou  
+
+ * c-opts.c (c_common_init_options_struct): Don't let C++ comply with
+ C99-like requirements for complex multiply and divide.
+
 2013-11-12  Joseph Myers  

  * c-common.c (c_common_reswords): Add _Thread_local.


Re: [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.

2013-11-13 Thread Cong Hou
Ping?


thanks,
Cong


On Mon, Nov 11, 2013 at 11:25 AM, Cong Hou  wrote:
> Hi James
>
> Sorry for the late reply.
>
>
> On Fri, Nov 8, 2013 at 2:55 AM, James Greenhalgh
>  wrote:
>>> On Tue, Nov 5, 2013 at 9:58 AM, Cong Hou  wrote:
>>> > Thank you for your detailed explanation.
>>> >
>>> > Once GCC detects a reduction operation, it will automatically
>>> > accumulate all elements in the vector after the loop. In the loop the
>>> > reduction variable is always a vector whose elements are reductions of
>>> > corresponding values from other vectors. Therefore in your case the
>>> > only instruction you need to generate is:
>>> >
>>> > VABAL   ops[3], ops[1], ops[2]
>>> >
>>> > It is OK if you accumulate the elements into one in the vector inside
>>> > of the loop (if one instruction can do this), but you have to make
>>> > sure other elements in the vector should remain zero so that the final
>>> > result is correct.
>>> >
>>> > If you are confused about the documentation, check the one for
>>> > udot_prod (just above usad in md.texi), as it has very similar
>>> > behavior as usad. Actually I copied the text from there and did some
>>> > changes. As those two instruction patterns are both for vectorization,
>>> > their behavior should not be difficult to explain.
>>> >
>>> > If you have more questions or think that the documentation is still
>>> > improper please let me know.
>>
>> Hi Cong,
>>
>> Thanks for your reply.
>>
>> I've looked at Dorit's original patch adding WIDEN_SUM_EXPR and
>> DOT_PROD_EXPR and I see that the same ambiguity exists for
>> DOT_PROD_EXPR. Can you please add a note in your tree.def
>> that SAD_EXPR, like DOT_PROD_EXPR can be expanded as either:
>>
>>   tmp = WIDEN_MINUS_EXPR (arg1, arg2)
>>   tmp2 = ABS_EXPR (tmp)
>>   arg3 = PLUS_EXPR (tmp2, arg3)
>>
>> or:
>>
>>   tmp = WIDEN_MINUS_EXPR (arg1, arg2)
>>   tmp2 = ABS_EXPR (tmp)
>>   arg3 = WIDEN_SUM_EXPR (tmp2, arg3)
>>
>> Where WIDEN_MINUS_EXPR is a signed MINUS_EXPR, returning a
>> a value of the same (widened) type as arg3.
>>
>
>
> I have added it, although we currently don't have WIDEN_MINUS_EXPR (I
> mentioned it in tree.def).
>
>
>> Also, while looking for the history of DOT_PROD_EXPR I spotted this
>> patch:
>>
>>   [autovect] [patch] detect mult-hi and sad patterns
>>   http://gcc.gnu.org/ml/gcc-patches/2005-10/msg01394.html
>>
>> I wonder what the reason was for that patch to be dropped?
>>
>
> It has been 8 years.. I have no idea why this patch is not accepted
> finally. There is even no reply in that thread. But I believe the SAD
> pattern is very important to be recognized. ARM also provides
> instructions for it.
>
>
> Thank you for your comment again!
>
>
> thanks,
> Cong
>
>
>
>> Thanks,
>> James
>>


Re: [PATCH] Do not set flag_complex_method to 2 for C++ by default.

2013-11-14 Thread Cong Hou
See the following code:


#include <complex>
using std::complex;

template
complex<_Tp>&
mult_assign (complex<_Tp>& __y, const complex<_Up>& __z)
{
  _Up& _M_real = __y.real();
  _Up& _M_imag = __y.imag();
  const _Tp __r = _M_real * __z.real() - _M_imag * __z.imag();
  _M_imag = _M_real * __z.imag() + _M_imag * __z.real();
  _M_real = __r;
  return __y;
}

void foo (complex& c1, complex& c2)
{ c1 *= c2; }

void bar (complex& c1, complex& c2)
{ mult_assign(c1, c2); }


The function mult_assign is written almost by copying the
implementation of operator *= from <complex>. They have exactly the
same behavior from the point of view of the source code. However, the
compiled results of foo() and bar() are different: foo() uses the
builtin function for multiplication but bar() does not. Should the
final behavior change just because of a name change? This is not how a
compiler should work.
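
Roughly, the difference in generated code (on x86-64 with -O2,
assuming the default flag_complex_method == 2) is:

  foo: call __mulsc3    /* C99-style libgcc helper              */
  bar: open-coded (a*c - b*d) + (a*d + b*c)i, inlined

so two functions with identical source-level semantics compile very
differently.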


thanks,
Cong


On Thu, Nov 14, 2013 at 10:17 AM, Andrew Pinski  wrote:
> On Thu, Nov 14, 2013 at 8:25 AM, Xinliang David Li  wrote:
>> Can we revisit the decision for this? Here are the reasons:
>>
>> 1) It seems that the motivation to make C++ consistent with c99 is to
>> avoid confusing users who build the C source with both C and C++
>> compilers. Why should C++'s default behavior be tuned for this niche
>> case?
>
> It is not a niche case.  It is confusing for people who write C++ code
> to rewrite their code to C99 and find that C is much slower because of
> correctness?  I think they have this backwards here.  C++ should be
> consistent with C here.
>
>> 2) It is very confusing for users who see huge performance difference
>> between compiler generated code for Complex multiplication vs manually
>> expanded code
>
> I don't see why this is an issue if they understand how complex
> multiplication works for correctness.  I am sorry but correctness over
> speed is a good argument of why this should stay this way.
>
>> 3) The default setting can also block potential vectorization
>> opportunities for complex operations
>
> Yes so again this is about a correctness issue over a speed issue.
>
>> 4) GCC is about the only compiler which has this default -- very few
>> user knows about GCC's strict default, and will think GCC performs
>> poorly.
>
>
> Correctness over speed is better.  I am sorry GCC is the only one
> which gets it correct here.  If people don't like there is a flag to
> disable it.
>
> Thanks,
> Andrew Pinski
>
>>
>> thanks,
>>
>> David
>>
>>
>> On Wed, Nov 13, 2013 at 9:07 PM, Andrew Pinski  wrote:
>>> On Wed, Nov 13, 2013 at 5:26 PM, Cong Hou  wrote:
>>>> This patch is for PR58963.
>>>>
>>>> In the patch http://gcc.gnu.org/ml/gcc-patches/2005-02/msg00560.html,
>>>> the builtin function is used to perform complex multiplication and
>>>> division. This is to comply with C99 standard, but I am wondering if
>>>> C++ also needs this.
>>>>
>>>> There is no complex keyword in C++, and no content in C++ standard
>>>> about the behavior of operations on complex types. The <complex>
>>>> header file is all written in source code, including complex
>>>> multiplication and division. GCC should not do too much for them by
>>>> using builtin calls by default (although we can set -fcx-limited-range
>>>> to prevent GCC doing this), which has a big impact on performance
>>>> (there may exist vectorization opportunities).
>>>>
>>>> In this patch flag_complex_method will not be set to 2 for C++.
>>>> Bootstraped and tested on an x86-64 machine.
>>>
>>> I think you need to look into this issue deeper as the original patch
>>> only enabled it for C99:
>>> http://gcc.gnu.org/ml/gcc-patches/2005-02/msg01483.html .
>>>
>>> Just a little deeper will find
>>> http://gcc.gnu.org/ml/gcc/2007-07/msg00124.html which says yes C++
>>> needs this.
>>>
>>> Thanks,
>>> Andrew Pinski
>>>
>>>>
>>>>
>>>> thanks,
>>>> Cong
>>>>
>>>>
>>>> Index: gcc/c-family/c-opts.c
>>>> ===
>>>> --- gcc/c-family/c-opts.c (revision 204712)
>>>> +++ gcc/c-family/c-opts.c (working copy)
>>>> @@ -198,8 +198,10 @@ c_common_init_options_struct (struct gcc_options *opts)
>>>>opts->x_warn_write_strings = c_dialect_cxx ();
>>>>opts->x_flag_warn_unused_result = true;
>>>>
>>>> - 

[PATCH] Support addsub/subadd as non-isomorphic operations for SLP vectorizer.

2013-11-14 Thread Cong Hou
Hi

This patch adds support for two non-isomorphic operations, addsub
and subadd, to the SLP vectorizer. More non-isomorphic operations can
be added later, but the limitation is that operations on even/odd
elements should still be isomorphic. Once such an operation is
detected, the opcode to be used in the vectorized code is stored
and later used during statement transformation. Two new GIMPLE
operations, VEC_ADDSUB_EXPR and VEC_SUBADD_EXPR, are defined, along
with new optabs for them. They are also documented.

Target support for SSE/SSE2/SSE3/AVX is added for these two new
operations on floating-point types. SSE3/AVX provide the ADDSUBPD and
ADDSUBPS instructions. For SSE/SSE2, the two operations are emulated
using two instructions (selectively negate, then add), as sketched
below.
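
As an illustration of that two-instruction emulation, here is a sketch
in intrinsics (the helper name is mine and the code is illustrative
only; the patch itself emits the equivalent RTL in
ix86_sse_expand_fp_addsub_operator):

#include <emmintrin.h>
#include <climits>

/* Emulated subadd on SSE2: minus on even lanes, plus on odd lanes,
   done as one XOR (flip the sign bit of b's even lanes) and one ADD.  */
static __m128
subadd_ps_sse2 (__m128 a, __m128 b)
{
  /* _mm_setr_epi32 lists element 0 first; INT_MIN is the sign-bit mask.  */
  const __m128 sign_even
    = _mm_castsi128_ps (_mm_setr_epi32 (INT_MIN, 0, INT_MIN, 0));
  return _mm_add_ps (a, _mm_xor_ps (b, sign_even));
}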

With this patch the following function will be SLP vectorized:


float a[4], b[4], c[4];  // double also OK.

void subadd ()
{
  c[0] = a[0] - b[0];
  c[1] = a[1] + b[1];
  c[2] = a[2] - b[2];
  c[3] = a[3] + b[3];
}

void addsub ()
{
  c[0] = a[0] + b[0];
  c[1] = a[1] - b[1];
  c[2] = a[2] + b[2];
  c[3] = a[3] - b[3];
}


Bootstrapped and tested on an x86-64 machine.


thanks,
Cong





diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 2c0554b..656d5fb 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,31 @@
+2013-11-14  Cong Hou  
+
+ * tree-vect-slp.c (vect_create_new_slp_node): Initialize
+ SLP_TREE_OP_CODE.
+ (slp_supported_non_isomorphic_op): New function.  Check if the
+ non-isomorphic operation is supported or not.
+ (vect_build_slp_tree_1): Consider non-isomorphic operations.
+ (vect_build_slp_tree): Change argument.
+ * tree-vect-stmts.c (vectorizable_operation): Consider the opcode
+ for non-isomorphic operations.
+ * optabs.def (vec_addsub_optab, vec_subadd_optab): New optabs.
+ * tree.def (VEC_ADDSUB_EXPR, VEC_SUBADD_EXPR): New operations.
+ * expr.c (expand_expr_real_2): Add support to VEC_ADDSUB_EXPR and
+ VEC_SUBADD_EXPR.
+ * gimple-pretty-print.c (dump_binary_rhs): Likewise.
+ * optabs.c (optab_for_tree_code): Likewise.
+ * tree-cfg.c (verify_gimple_assign_binary): Likewise.
+ * tree-vectorizer.h (struct _slp_tree): New data member.
+ * config/i386/i386-protos.h (ix86_sse_expand_fp_addsub_operator):
+ New function.  Expand addsub/subadd operations for SSE2.
+ * config/i386/i386.c (ix86_sse_expand_fp_addsub_operator): Likewise.
+ * config/i386/sse.md (UNSPEC_SUBADD, UNSPEC_ADDSUB): New RTL operation.
+ (vec_subadd_v4sf3, vec_subadd_v2df3, vec_subadd_<mode>3,
+ vec_addsub_v4sf3, vec_addsub_v2df3, vec_addsub_<mode>3):
+ Expand addsub/subadd operations for SSE/SSE2/SSE3/AVX.
+ * doc/generic.texi (VEC_ADDSUB_EXPR, VEC_SUBADD_EXPR): New doc.
+ * doc/md.texi (vec_addsub_@var{m}3, vec_subadd_@var{m}3): New doc.
+
 2013-11-12  Jeff Law  

  * tree-ssa-threadedge.c (thread_around_empty_blocks): New
diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h
index fdf9d58..b02b757 100644
--- a/gcc/config/i386/i386-protos.h
+++ b/gcc/config/i386/i386-protos.h
@@ -117,6 +117,7 @@ extern rtx ix86_expand_adjust_ufix_to_sfix_si (rtx, rtx *);
 extern enum ix86_fpcmp_strategy ix86_fp_comparison_strategy (enum rtx_code);
 extern void ix86_expand_fp_absneg_operator (enum rtx_code, enum machine_mode,
 rtx[]);
+extern void ix86_sse_expand_fp_addsub_operator (bool, enum machine_mode,
+    rtx[]);
 extern void ix86_expand_copysign (rtx []);
 extern void ix86_split_copysign_const (rtx []);
 extern void ix86_split_copysign_var (rtx []);
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 5287b49..76f38f5 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -18702,6 +18702,51 @@ ix86_expand_fp_absneg_operator (enum rtx_code code, enum machine_mode mode,
 emit_insn (set);
 }

+/* Generate code for addsub or subadd on fp vectors for sse/sse2.  The flag
+   SUBADD indicates if we are generating code for subadd or addsub.  */
+
+void
+ix86_sse_expand_fp_addsub_operator (bool subadd, enum machine_mode mode,
+rtx operands[])
+{
+  rtx mask;
+  rtx neg_mask32 = GEN_INT (0x80000000);
+  rtx neg_mask64 = GEN_INT ((HOST_WIDE_INT)1 << 63);
+
+  switch (mode)
+{
+case V4SFmode:
+  if (subadd)
+ mask = gen_rtx_CONST_VECTOR (V4SImode, gen_rtvec (4,
+ neg_mask32, const0_rtx, neg_mask32, const0_rtx));
+  else
+ mask = gen_rtx_CONST_VECTOR (V4SImode, gen_rtvec (4,
+ const0_rtx, neg_mask32, const0_rtx, neg_mask32));
+  break;
+
+case V2DFmode:
+  if (subadd)
+ mask = gen_rtx_CONST_VECTOR (V2DImode, gen_rtvec (2,
+ neg_mask64, const0_rtx));
+  else
+ mask = gen_rtx_CONST_VECTOR (V2DImode, gen_rtvec (2,
+ const0_rtx, neg_mask64));
+  break;
+
+default:
+  gcc_unreachable ();
+}
+
+  rtx tmp = gen_reg_rtx (mode);
+  convert_move (tmp, mask, false);
+
+  rtx tmp2 = gen_reg_rtx (mode);
+  tmp2 = expand_simple_binop (mode, XOR, tmp, operands[2],
+  tmp2, 0, OPTAB_DIRECT);
+  expand_simple_binop (mode, PLUS, operands[1], tmp2,
+   operands[0], 0, OPTAB_DIRECT);
+}

Re: [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.

2013-11-15 Thread Cong Hou
Any more comments?



thanks,
Cong


On Wed, Nov 13, 2013 at 6:06 PM, Cong Hou  wrote:
> Ping?
>
>
> thanks,
> Cong
>
>
> On Mon, Nov 11, 2013 at 11:25 AM, Cong Hou  wrote:
>> Hi James
>>
>> Sorry for the late reply.
>>
>>
>> On Fri, Nov 8, 2013 at 2:55 AM, James Greenhalgh
>>  wrote:
>>>> On Tue, Nov 5, 2013 at 9:58 AM, Cong Hou  wrote:
>>>> > Thank you for your detailed explanation.
>>>> >
>>>> > Once GCC detects a reduction operation, it will automatically
>>>> > accumulate all elements in the vector after the loop. In the loop the
>>>> > reduction variable is always a vector whose elements are reductions of
>>>> > corresponding values from other vectors. Therefore in your case the
>>>> > only instruction you need to generate is:
>>>> >
>>>> > VABAL   ops[3], ops[1], ops[2]
>>>> >
>>>> > It is OK if you accumulate the elements into one in the vector inside
>>>> > of the loop (if one instruction can do this), but you have to make
>>>> > sure other elements in the vector should remain zero so that the final
>>>> > result is correct.
>>>> >
>>>> > If you are confused about the documentation, check the one for
>>>> > udot_prod (just above usad in md.texi), as it has very similar
>>>> > behavior as usad. Actually I copied the text from there and did some
>>>> > changes. As those two instruction patterns are both for vectorization,
>>>> > their behavior should not be difficult to explain.
>>>> >
>>>> > If you have more questions or think that the documentation is still
>>>> > improper please let me know.
>>>
>>> Hi Cong,
>>>
>>> Thanks for your reply.
>>>
>>> I've looked at Dorit's original patch adding WIDEN_SUM_EXPR and
>>> DOT_PROD_EXPR and I see that the same ambiguity exists for
>>> DOT_PROD_EXPR. Can you please add a note in your tree.def
>>> that SAD_EXPR, like DOT_PROD_EXPR can be expanded as either:
>>>
>>>   tmp = WIDEN_MINUS_EXPR (arg1, arg2)
>>>   tmp2 = ABS_EXPR (tmp)
>>>   arg3 = PLUS_EXPR (tmp2, arg3)
>>>
>>> or:
>>>
>>>   tmp = WIDEN_MINUS_EXPR (arg1, arg2)
>>>   tmp2 = ABS_EXPR (tmp)
>>>   arg3 = WIDEN_SUM_EXPR (tmp2, arg3)
>>>
>>> Where WIDEN_MINUS_EXPR is a signed MINUS_EXPR, returning a
>>> a value of the same (widened) type as arg3.
>>>
>>
>>
>> I have added it, although we currently don't have WIDEN_MINUS_EXPR (I
>> mentioned it in tree.def).
>>
>>
>>> Also, while looking for the history of DOT_PROD_EXPR I spotted this
>>> patch:
>>>
>>>   [autovect] [patch] detect mult-hi and sad patterns
>>>   http://gcc.gnu.org/ml/gcc-patches/2005-10/msg01394.html
>>>
>>> I wonder what the reason was for that patch to be dropped?
>>>
>>
>> It has been 8 years... I have no idea why this patch was never
>> accepted; there was not even a reply in that thread. But I believe the
>> SAD pattern is very important to recognize. ARM also provides
>> instructions for it.
>>
>>
>> Thank you for your comment again!
>>
>>
>> thanks,
>> Cong
>>
>>
>>
>>> Thanks,
>>> James
>>>


Re: [PATCH] Support addsub/subadd as non-isomorphic operations for SLP vectorizer.

2013-11-18 Thread Cong Hou
I tried your method and it works well for doubles. But for floats,
there is an issue. For the following gimple code:

   c1 = a - b;
   c2 = a + b;
   c = VEC_PERM <c1, c2, {0, 5, 2, 7}>

It needs two instructions to implement the VEC_PERM operation on
SSE2-4, one of which is shufps, represented by the following pattern
in RTL:


(define_insn "sse_shufps_"
  [(set (match_operand:VI4F_128 0 "register_operand" "=x,x")
(vec_select:VI4F_128
 (vec_concat:
   (match_operand:VI4F_128 1 "register_operand" "0,x")
   (match_operand:VI4F_128 2 "nonimmediate_operand" "xm,xm"))
 (parallel [(match_operand 3 "const_0_to_3_operand")
(match_operand 4 "const_0_to_3_operand")
(match_operand 5 "const_4_to_7_operand")
(match_operand 6 "const_4_to_7_operand")])))]
...)

Note that it contains two RTL operations. Together with the minus, the
plus, and one more shuffle instruction, we have at least five
instructions for the addsub pattern. I think the combine pass only
considers combining at most four instructions, right? So unless we
compress those five instructions into four or fewer, we cannot use
this method for float values.
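
For reference, here is a sketch in intrinsics of one such two-shuffle
sequence (my illustration only, assuming the {0, 5, 2, 7} selection
above, i.e. even lanes from c1 and odd lanes from c2):

#include <xmmintrin.h>

/* Realize c = VEC_PERM <c1, c2, {0, 5, 2, 7}> with two shufps
   instructions, since there is no single-instruction blend before SSE4.  */
static __m128
perm_even_odd (__m128 c1, __m128 c2)
{
  /* t = { c1[0], c1[2], c2[1], c2[3] }  */
  __m128 t = _mm_shuffle_ps (c1, c2, _MM_SHUFFLE (3, 1, 2, 0));
  /* Reorder to { c1[0], c2[1], c1[2], c2[3] }.  */
  return _mm_shuffle_ps (t, t, _MM_SHUFFLE (3, 1, 2, 0));
}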

What do you think?




thanks,
Cong


On Fri, Nov 15, 2013 at 12:53 AM, Richard Biener  wrote:
> On Thu, 14 Nov 2013, Cong Hou wrote:
>
>> Hi
>>
>> This patch adds the support to two non-isomorphic operations addsub
>> and subadd for SLP vectorizer. More non-isomorphic operations can be
>> added later, but the limitation is that operations on even/odd
>> elements should still be isomorphic. Once such an operation is
>> detected, the code of the operation used in vectorized code is stored
>> and later will be used during statement transformation. Two new GIMPLE
>> operations VEC_ADDSUB_EXPR and VEC_SUBADD_EXPR are defined. And also
>> new optabs for them. They are also documented.
>>
>> The target supports for SSE/SSE2/SSE3/AVX are added for those two new
>> operations on floating points. SSE3/AVX provides ADDSUBPD and ADDSUBPS
>> instructions. For SSE/SSE2, those two operations are emulated using
>> two instructions (selectively negate then add).
>>
>> With this patch the following function will be SLP vectorized:
>>
>>
>> float a[4], b[4], c[4];  // double also OK.
>>
>> void subadd ()
>> {
>>   c[0] = a[0] - b[0];
>>   c[1] = a[1] + b[1];
>>   c[2] = a[2] - b[2];
>>   c[3] = a[3] + b[3];
>> }
>>
>> void addsub ()
>> {
>>   c[0] = a[0] + b[0];
>>   c[1] = a[1] - b[1];
>>   c[2] = a[2] + b[2];
>>   c[3] = a[3] - b[3];
>> }
>>
>>
>> Bootstrapped and tested on an x86-64 machine.
>
> I managed to do this without adding new tree codes or optabs by
> vectorizing the above as
>
>c1 = a + b;
>c2 = a - b;
>c = VEC_PERM <c1, c2, {4, 1, 6, 3}>
>
> which then matches sse3_addsubv4sf3 if you fix that pattern to
> not use vec_merge (or fix PR56766).  Doing it this way also
> means that the code is vectorizable if you don't have a HW
> instruction for that but can do the VEC_PERM efficiently.
>
> So, I'd like to avoid new tree codes and optabs whenever possible
> and here I've already proved (with a patch) that it is possible.
> Didn't have time to clean it up, and it likely doesn't apply anymore
> (and PR56766 blocks it but it even has a patch).
>
> Btw, this was PR56902 where I attached my patch.
>
> Richard.
>
>>
>> thanks,
>> Cong
>>
>>
>>
>>
>>
>> diff --git a/gcc/ChangeLog b/gcc/ChangeLog
>> index 2c0554b..656d5fb 100644
>> --- a/gcc/ChangeLog
>> +++ b/gcc/ChangeLog
>> @@ -1,3 +1,31 @@
>> +2013-11-14  Cong Hou  
>> +
>> + * tree-vect-slp.c (vect_create_new_slp_node): Initialize
>> + SLP_TREE_OP_CODE.
>> + (slp_supported_non_isomorphic_op): New function.  Check if the
>> + non-isomorphic operation is supported or not.
>> + (vect_build_slp_tree_1): Consider non-isomorphic operations.
>> + (vect_build_slp_tree): Change argument.
>> + * tree-vect-stmts.c (vectorizable_operation): Consider the opcode
>> + for non-isomorphic operations.
>> + * optabs.def (vec_addsub_optab, vec_subadd_optab): New optabs.
>> + * tree.def (VEC_ADDSUB_EXPR, VEC_SUBADD_EXPR): New operations.
>> + * expr.c (expand_expr_real_2): Add support to VEC_ADDSUB_EXPR and
>> + VEC_SUBADD_EXPR.
>> + * gimple-pretty-print.c (dump_binary_rhs): Likewise.
>> + * optabs.c (optab_for_tree_code): Likewise.
>> + * tree-cfg.c (verify_gimple_assign_binary): Likewise.
>> + * tree-vectorizer.h (struct _slp_tree): New data member

Re: [PATCH] Support addsub/subadd as non-isomorphic operations for SLP vectorizer.

2013-11-18 Thread Cong Hou
On Fri, Nov 15, 2013 at 1:20 AM, Uros Bizjak  wrote:
> Hello!
>
>> This patch adds the support to two non-isomorphic operations addsub
>> and subadd for SLP vectorizer. More non-isomorphic operations can be
>> added later, but the limitation is that operations on even/odd
>> elements should still be isomorphic. Once such an operation is
>> detected, the code of the operation used in vectorized code is stored
>> and later will be used during statement transformation. Two new GIMPLE
>> opeartions VEC_ADDSUB_EXPR and VEC_SUBADD_EXPR are defined. And also
>> new optabs for them. They are also documented.
>>
>> The target supports for SSE/SSE2/SSE3/AVX are added for those two new
>> operations on floating points. SSE3/AVX provides ADDSUBPD and ADDSUBPS
>> instructions. For SSE/SSE2, those two operations are emulated using
>> two instructions (selectively negate then add).
>
>;; SSE3
>UNSPEC_LDDQU
> +  UNSPEC_SUBADD
> +  UNSPEC_ADDSUB
>
> No! Please avoid unspecs.


OK, got it.


>
> +(define_expand "vec_subadd_v4sf3"
> +  [(set (match_operand:V4SF 0 "register_operand")
> + (unspec:V4SF
> +  [(match_operand:V4SF 1 "register_operand")
> +   (match_operand:V4SF 2 "nonimmediate_operand")] UNSPEC_SUBADD))]
> +  "TARGET_SSE"
> +{
> +  if (TARGET_SSE3)
> +emit_insn (gen_sse3_addsubv4sf3 (operands[0], operands[1], operands[2]));
> +  else
> +ix86_sse_expand_fp_addsub_operator (true, V4SFmode, operands);
> +  DONE;
> +})
>
> Make the expander pattern look like correspondig sse3 insn and:
> ...
> {
>   if (!TARGET_SSE3)
> {
>   ix86_sse_expand_fp_...();
>   DONE;
> }
> }
>

You mean I should write two expanders for SSE and SSE3 respectively?

Thank you for your comment!



Cong



> Uros.


Re: [PATCH] Support addsub/subadd as non-isomorphic operations for SLP vectorizer.

2013-11-18 Thread Cong Hou
On Fri, Nov 15, 2013 at 10:18 AM, Richard Earnshaw  wrote:
> On 15/11/13 02:06, Cong Hou wrote:
>> Hi
>>
>> This patch adds the support to two non-isomorphic operations addsub
>> and subadd for SLP vectorizer. More non-isomorphic operations can be
>> added later, but the limitation is that operations on even/odd
>> elements should still be isomorphic. Once such an operation is
>> detected, the code of the operation used in vectorized code is stored
>> and later will be used during statement transformation. Two new GIMPLE
>> operations VEC_ADDSUB_EXPR and VEC_SUBADD_EXPR are defined. And also
>> new optabs for them. They are also documented.
>>
>
> Not withstanding what Richi has already said on this subject, you
> certainly don't need both VEC_ADDSUB_EXPR and VEC_SUBADD_EXPR.  The
> latter can always be formed by vec-negating the second operand and
> passing it to VEC_ADDSUB_EXPR.
>

Right. But I also considered targets without support for addsub
instructions. There we can still selectively negate the odd/even
elements using masks and then use PLUS_EXPR (at most 2 instructions).
If I implement VEC_SUBADD_EXPR by negating the second operand and then
using VEC_ADDSUB_EXPR, I end up with one more instruction, as the
sketch below shows.
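
A sketch of that count on SSE2 (intrinsics for illustration only; the
helper name is mine):

#include <emmintrin.h>
#include <climits>

/* subadd formed as "negate b, then addsub": the extra whole-vector
   negation costs one more XOR than negating only the even lanes of b
   directly, so we get three instructions instead of two.  */
static __m128
subadd_via_addsub (__m128 a, __m128 b)
{
  const __m128 all = _mm_castsi128_ps (_mm_set1_epi32 (INT_MIN));
  const __m128 odd = _mm_castsi128_ps (_mm_setr_epi32 (0, INT_MIN, 0, INT_MIN));
  __m128 nb = _mm_xor_ps (b, all);              /* 1: negate b */
  return _mm_add_ps (a, _mm_xor_ps (nb, odd));  /* 2, 3: addsub emulation */
}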


thanks,
Cong



> R.
>
>


Re: [PATCH] Support addsub/subadd as non-isomorphic operations for SLP vectorizer.

2013-11-18 Thread Cong Hou
On Mon, Nov 18, 2013 at 12:27 PM, Uros Bizjak  wrote:
> On Mon, Nov 18, 2013 at 9:15 PM, Cong Hou  wrote:
>
>>>> This patch adds the support to two non-isomorphic operations addsub
>>>> and subadd for SLP vectorizer. More non-isomorphic operations can be
>>>> added later, but the limitation is that operations on even/odd
>>>> elements should still be isomorphic. Once such an operation is
>>>> detected, the code of the operation used in vectorized code is stored
>>>> and later will be used during statement transformation. Two new GIMPLE
>>>> operations VEC_ADDSUB_EXPR and VEC_SUBADD_EXPR are defined. And also
>>>> new optabs for them. They are also documented.
>>>>
>>>> The target supports for SSE/SSE2/SSE3/AVX are added for those two new
>>>> operations on floating points. SSE3/AVX provides ADDSUBPD and ADDSUBPS
>>>> instructions. For SSE/SSE2, those two operations are emulated using
>>>> two instructions (selectively negate then add).
>>>
>>> +(define_expand "vec_subadd_v4sf3"
>>> +  [(set (match_operand:V4SF 0 "register_operand")
>>> + (unspec:V4SF
>>> +  [(match_operand:V4SF 1 "register_operand")
>>> +   (match_operand:V4SF 2 "nonimmediate_operand")] UNSPEC_SUBADD))]
>>> +  "TARGET_SSE"
>>> +{
>>> +  if (TARGET_SSE3)
>>> +emit_insn (gen_sse3_addsubv4sf3 (operands[0], operands[1], 
>>> operands[2]));
>>> +  else
>>> +ix86_sse_expand_fp_addsub_operator (true, V4SFmode, operands);
>>> +  DONE;
>>> +})
>>>
>>> Make the expander pattern look like correspondig sse3 insn and:
>>> ...
>>> {
>>>   if (!TARGET_SSE3)
>>> {
>>>   ix86_sse_expand_fp_...();
>>>   DONE;
>>> }
>>> }
>>>
>>
>> You mean I should write two expanders for SSE and SSE3 respectively?
>
> No, please use the same approach as you did for the abs<mode>2 expander.
> For !TARGET_SSE3, call the helper function (ix86_sse_expand...),
> otherwise expand through pattern. Also, it looks to me that you should
> partially expand in the pattern before calling helper function, mainly
> to avoid a bunch of "if (...)" at the beginning of the helper
> function.
>


I know what you mean. Then I have to change the pattern being detected
for sse3_addsubv4sf3, so that it can handle ADDSUB_EXPR for SSE3.

Currently I am considering using Richard's method, which creates no new
tree codes or optabs and is based on pattern matching. I will handle
SSE2 and SSE3 separately with define_expand and define_insn. The
current problem is that the pattern may contain more than four
instructions, which cannot be processed by the combine pass.

I am considering how to reduce the number of instructions in the
pattern to four.

Thank you very much!


Cong



> Uros.


Re: [PATCH] Support addsub/subadd as non-isomorphic operations for SLP vectorizer.

2013-11-19 Thread Cong Hou
On Tue, Nov 19, 2013 at 1:45 AM, Richard Biener  wrote:
>
> On Mon, 18 Nov 2013, Cong Hou wrote:
>
> > I tried your method and it works well for doubles. But for float,
> > there is an issue. For the following gimple code:
> >
> >c1 = a - b;
> >c2 = a + b;
> >c = VEC_PERM <c1, c2, {0, 5, 2, 7}>
> >
> > It needs two instructions to implement the VEC_PERM operation in
> > SSE2-4, one of which should be using shufps which is represented by
> > the following pattern in rtl:
> >
> >
> > (define_insn "sse_shufps_"
> >   [(set (match_operand:VI4F_128 0 "register_operand" "=x,x")
> > (vec_select:VI4F_128
> >  (vec_concat:
> >(match_operand:VI4F_128 1 "register_operand" "0,x")
> >(match_operand:VI4F_128 2 "nonimmediate_operand" "xm,xm"))
> >  (parallel [(match_operand 3 "const_0_to_3_operand")
> > (match_operand 4 "const_0_to_3_operand")
> > (match_operand 5 "const_4_to_7_operand")
> > (match_operand 6 "const_4_to_7_operand")])))]
> > ...)
> >
> > Note that it contains two rtl instructions.
>
> It's a single instruction as far as combine is concerned (RTL
> instructions have arbitrary complexity).


Even if it is one instruction, we will end up with four RTL statements,
which still cannot be combined, as there are restrictions on combining
four instructions (they must be loads of constants or binary operations
involving a constant). Note that vec_select instead of vec_merge is
used here because currently vec_merge is emitted only if SSE4 is
enabled and thus blend instructions can be used (if you look at
ix86_expand_vec_perm_const_1() in i386.c, you can find that vec_merge
is generated in expand_vec_perm_1() only with SSE4). Without SSE4
support, in most cases a vec_merge statement cannot be translated into
one SSE instruction.

>
>
> > Together with minus, plus,
> > and one more shuffling instruction, we have at least five instructions
> > for addsub pattern. I think during the combine pass, only four
> > instructions are considered to be combined, right? So unless we
> > compress those five instructions into four or less, we could not use
> > this method for float values.
>
> At the moment addsubv4sf looks like
>
> (define_insn "sse3_addsubv4sf3"
>   [(set (match_operand:V4SF 0 "register_operand" "=x,x")
> (vec_merge:V4SF
>   (plus:V4SF
> (match_operand:V4SF 1 "register_operand" "0,x")
> (match_operand:V4SF 2 "nonimmediate_operand" "xm,xm"))
>   (minus:V4SF (match_dup 1) (match_dup 2))
>   (const_int 10)))]
>
> to match this it's best to have the VEC_SHUFFLE retained as
> vec_merge and thus support arbitrary(?) vec_merge for the aid
> of combining until reload(?) after which we can split it.
>


You mean VEC_PERM (which is what your patch generates in gimple)? Note
that, as I mentioned above, without SSE4 it is difficult to translate
VEC_PERM into vec_merge. Even if we can do it, we still need to define
a split to convert one vec_merge into two or more other statements
later. ADDSUB instructions are provided by SSE3, and I think we should
not rely on SSE4 to perform this transformation, right?

To sum up, if we use vec_select instead of vec_merge, we may have four
RTL statements for float types, in which case they cannot be combined.
If we use vec_merge, we need to define the split for it without SSE4
support, and we also need to change the behavior of
ix86_expand_vec_perm_const_1().


> > What do you think?
>
> Besides of addsub are there other instructions that can be expressed
> similarly?  Thus, how far should the combiner pattern go?
>


I think your method is quite flexible. Besides blending add/sub, we
could blend other combinations of two operations, and even one
operation and a no-op. For example, consider vectorizing the complex
conjugate operation:

for (int i = 0; i < N; i+=2) {
  a[i] = b[i];
  a[i+1] = -b[i+1];
}

This loop is best vectorized by hybrid SLP. The second statement has a
unary minus operation but there is no operation in the first one. We
can improve our SLP grouping algorithm to let GCC SLP vectorize it.
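
As a sketch, the vectorized body of that conjugate loop reduces to one
selective negation per vector (intrinsics for illustration only; the
helper name is mine):

#include <emmintrin.h>
#include <climits>

/* Negate only the odd lanes: lane pairs become { b[i], -b[i+1] }.  */
static __m128
conj_pairs_ps (__m128 b)
{
  const __m128 sign_odd
    = _mm_castsi128_ps (_mm_setr_epi32 (0, INT_MIN, 0, INT_MIN));
  return _mm_xor_ps (b, sign_odd);
}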


thanks,
Cong



> Richard.
>
> >
> >
> >
> > thanks,
> > Cong
> >
> >
> > On Fri, Nov 15, 2013 at 12:53 AM, Richard Biener  wrote:
> > > On Thu, 14 Nov 2013, Cong Hou wrote:
> > >
> > >> Hi
> > >>
> > >> This patch adds the support to two non-isomorphic operations addsub
> > >> and subadd for SLP vectorizer. More non-isomorphic operations can be
> > >> add

Re: [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.

2013-11-20 Thread Cong Hou
Ping...


thanks,
Cong


On Fri, Nov 15, 2013 at 9:52 AM, Cong Hou  wrote:
> Any more comments?
>
>
>
> thanks,
> Cong
>
>
> On Wed, Nov 13, 2013 at 6:06 PM, Cong Hou  wrote:
>> Ping?
>>
>>
>> thanks,
>> Cong
>>
>>
>> On Mon, Nov 11, 2013 at 11:25 AM, Cong Hou  wrote:
>>> Hi James
>>>
>>> Sorry for the late reply.
>>>
>>>
>>> On Fri, Nov 8, 2013 at 2:55 AM, James Greenhalgh
>>>  wrote:
>>>>> On Tue, Nov 5, 2013 at 9:58 AM, Cong Hou  wrote:
>>>>> > Thank you for your detailed explanation.
>>>>> >
>>>>> > Once GCC detects a reduction operation, it will automatically
>>>>> > accumulate all elements in the vector after the loop. In the loop the
>>>>> > reduction variable is always a vector whose elements are reductions of
>>>>> > corresponding values from other vectors. Therefore in your case the
>>>>> > only instruction you need to generate is:
>>>>> >
>>>>> > VABAL   ops[3], ops[1], ops[2]
>>>>> >
>>>>> > It is OK if you accumulate the elements into one in the vector inside
>>>>> > of the loop (if one instruction can do this), but you have to make
>>>>> > sure other elements in the vector should remain zero so that the final
>>>>> > result is correct.
>>>>> >
>>>>> > If you are confused about the documentation, check the one for
>>>>> > udot_prod (just above usad in md.texi), as it has very similar
>>>>> > behavior as usad. Actually I copied the text from there and did some
>>>>> > changes. As those two instruction patterns are both for vectorization,
>>>>> > their behavior should not be difficult to explain.
>>>>> >
>>>>> > If you have more questions or think that the documentation is still
>>>>> > improper please let me know.
>>>>
>>>> Hi Cong,
>>>>
>>>> Thanks for your reply.
>>>>
>>>> I've looked at Dorit's original patch adding WIDEN_SUM_EXPR and
>>>> DOT_PROD_EXPR and I see that the same ambiguity exists for
>>>> DOT_PROD_EXPR. Can you please add a note in your tree.def
>>>> that SAD_EXPR, like DOT_PROD_EXPR can be expanded as either:
>>>>
>>>>   tmp = WIDEN_MINUS_EXPR (arg1, arg2)
>>>>   tmp2 = ABS_EXPR (tmp)
>>>>   arg3 = PLUS_EXPR (tmp2, arg3)
>>>>
>>>> or:
>>>>
>>>>   tmp = WIDEN_MINUS_EXPR (arg1, arg2)
>>>>   tmp2 = ABS_EXPR (tmp)
>>>>   arg3 = WIDEN_SUM_EXPR (tmp2, arg3)
>>>>
>>>> Where WIDEN_MINUS_EXPR is a signed MINUS_EXPR, returning a
>>>> a value of the same (widened) type as arg3.
>>>>
>>>
>>>
>>> I have added it, although we currently don't have WIDEN_MINUS_EXPR (I
>>> mentioned it in tree.def).
>>>
>>>
>>>> Also, while looking for the history of DOT_PROD_EXPR I spotted this
>>>> patch:
>>>>
>>>>   [autovect] [patch] detect mult-hi and sad patterns
>>>>   http://gcc.gnu.org/ml/gcc-patches/2005-10/msg01394.html
>>>>
>>>> I wonder what the reason was for that patch to be dropped?
>>>>
>>>
>>> It has been 8 years.. I have no idea why this patch is not accepted
>>> finally. There is even no reply in that thread. But I believe the SAD
>>> pattern is very important to be recognized. ARM also provides
>>> instructions for it.
>>>
>>>
>>> Thank you for your comment again!
>>>
>>>
>>> thanks,
>>> Cong
>>>
>>>
>>>
>>>> Thanks,
>>>> James
>>>>


Re: [PATCH] Support addsub/subadd as non-isomorphic operations for SLP vectorizer.

2013-11-21 Thread Cong Hou
On Thu, Nov 21, 2013 at 4:39 PM, Marc Glisse  wrote:
> On Thu, 21 Nov 2013, Cong Hou wrote:
>
>> While I was adding the new define_insn_and_split for vec_merge, a bug was
>> exposed: in config/i386/sse.md, [ define_expand "xop_vmfrcz<mode>2" ]
>> only takes one input, but the corresponding builtin functions have two
>> inputs, which are shown in i386.c:
>>
>>  { OPTION_MASK_ISA_XOP, CODE_FOR_xop_vmfrczv4sf2,
>> "__builtin_ia32_vfrczss", IX86_BUILTIN_VFRCZSS, UNKNOWN,
>> (int)MULTI_ARG_2_SF },
>>  { OPTION_MASK_ISA_XOP, CODE_FOR_xop_vmfrczv2df2,
>> "__builtin_ia32_vfrczsd", IX86_BUILTIN_VFRCZSD, UNKNOWN,
>> (int)MULTI_ARG_2_DF },
>>
>> In consequence, the ix86_expand_multi_arg_builtin() function tries to
>> check two args but based on the define_expand of xop_vmfrcz2,
>> the content of insn_data[CODE_FOR_xop_vmfrczv4sf2].operand[2] may be
>> incorrect (because it only needs one input).
>>
>> The patch below fixed this issue.
>>
>> Bootstrapped and tested on an x86-64 machine. Note that this patch
>> should be applied before the one I sent earlier (sorry for sending
>> them in wrong order).
>
>
> This is PR 56788. Your patch seems strange to me and I don't think it
> fixes the real issue, but I'll let more knowledgeable people answer.


Thank you for pointing out the bug report. This patch is not intended
to fix PR56788. For your function:

#include <x86intrin.h>
__m128d f(__m128d x, __m128d y){
  return _mm_frcz_sd(x,y);
}

Note that the second parameter is intentionally ignored, but the
prototype of this function contains two parameters. My fix explicitly
tells GCC that the optab xop_vmfrczv4sf3 should have three operands
instead of two, so that insn_data[CODE_FOR_xop_vmfrczv4sf3].operand[2]
holds the correct information; it is used to match the type of the
second parameter of the builtin function in
ix86_expand_multi_arg_builtin().


thanks,
Cong


>
> --
> Marc Glisse


Re: [PATCH] Support addsub/subadd as non-isomorphic operations for SLP vectorizer.

2013-11-22 Thread Cong Hou
On Fri, Nov 22, 2013 at 1:32 AM, Uros Bizjak  wrote:
> Hello!
>
>> In consequence, the ix86_expand_multi_arg_builtin() function tries to
>> check two args but based on the define_expand of xop_vmfrcz2,
>> the content of insn_data[CODE_FOR_xop_vmfrczv4sf2].operand[2] may be
>> incorrect (because it only needs one input).
>
>  ;; scalar insns
> -(define_expand "xop_vmfrcz<mode>2"
> +(define_expand "xop_vmfrcz<mode>3"
>[(set (match_operand:VF_128 0 "register_operand")
> (vec_merge:VF_128
>   (unspec:VF_128
>[(match_operand:VF_128 1 "nonimmediate_operand")]
>UNSPEC_FRCZ)
> - (match_dup 3)
> + (match_operand:VF_128 2 "register_operand")
>   (const_int 1)))]
>"TARGET_XOP"
>  {
> -  operands[3] = CONST0_RTX (mode);
> +  operands[2] = CONST0_RTX (mode);
>  })
>
> No, just use (match_dup 2) in the RTX in addition to operands[2]
> change. Do not rename patterns.


If I use match_dup 2, GCC still thinks this optab has one input
argument instead of two, so that won't fix the current issue.

Marc suggested we should remove the second argument. This also works.

Thank you!


Cong


>
> Uros.


Re: [PATCH] Support addsub/subadd as non-isomorphic operations for SLP vectorizer.

2013-11-22 Thread Cong Hou
On Fri, Nov 22, 2013 at 3:57 AM, Marc Glisse  wrote:
> On Thu, 21 Nov 2013, Cong Hou wrote:
>
>> On Thu, Nov 21, 2013 at 4:39 PM, Marc Glisse  wrote:
>>>
>>> On Thu, 21 Nov 2013, Cong Hou wrote:
>>>
>>>> While I was adding the new define_insn_and_split for vec_merge, a bug was
>>>> exposed: in config/i386/sse.md, [ define_expand "xop_vmfrcz<mode>2" ]
>>>> only takes one input, but the corresponding builtin functions have two
>>>> inputs, which are shown in i386.c:
>>>>
>>>>  { OPTION_MASK_ISA_XOP, CODE_FOR_xop_vmfrczv4sf2,
>>>> "__builtin_ia32_vfrczss", IX86_BUILTIN_VFRCZSS, UNKNOWN,
>>>> (int)MULTI_ARG_2_SF },
>>>>  { OPTION_MASK_ISA_XOP, CODE_FOR_xop_vmfrczv2df2,
>>>> "__builtin_ia32_vfrczsd", IX86_BUILTIN_VFRCZSD, UNKNOWN,
>>>> (int)MULTI_ARG_2_DF },
>>>>
>>>> In consequence, the ix86_expand_multi_arg_builtin() function tries to
>>>> check two args, but based on the define_expand of xop_vmfrcz<mode>2,
>>>> the content of insn_data[CODE_FOR_xop_vmfrczv4sf2].operand[2] may be
>>>> incorrect (because it only needs one input).
>>>>
>>>> The patch below fixed this issue.
>>>>
>>>> Bootstrapped and tested on an x86-64 machine. Note that this patch
>>>> should be applied before the one I sent earlier (sorry for sending
>>>> them in wrong order).
>>>
>>>
>>>
>>> This is PR 56788. Your patch seems strange to me and I don't think it
>>> fixes the real issue, but I'll let more knowledgeable people answer.
>>
>>
>>
>> Thank you for pointing out the bug report. This patch is not intended
>> to fix PR56788.
>
>
> IMHO, if PR56788 was fixed, you wouldn't have this issue, and if PR56788
> doesn't get fixed, I'll post a patch to remove _mm_frcz_sd and the
> associated builtin, which would solve your issue as well.


I agree. Then I will wait until your patch is merged into the trunk;
otherwise my patch cannot pass the tests.


>
>
>> For your function:
>>
>> #include <x86intrin.h>
>> __m128d f(__m128d x, __m128d y){
>>  return _mm_frcz_sd(x,y);
>> }
>>
>> Note that the second parameter is ignored intentionally, but the
>> prototype of this function contains two parameters. My fix is
>> explicitly telling GCC that the optab xop_vmfrczv4sf3 should have
>> three operands instead of two, to let it have the correct information
>> in insn_data[CODE_FOR_xop_vmfrczv4sf3].operand[2] which is used to
>> match the type of the second parameter in the builtin function in
>> ix86_expand_multi_arg_builtin().
>
>
> I disagree that this is intentional, it is a bug. AFAIK there is no AMD
> documentation that could be used as a reference for what _mm_frcz_sd is
> supposed to do. The only existing documentations are by Microsoft (which
> does *not* ignore the second argument) and by LLVM (which has a single
> argument). Whatever we chose for _mm_frcz_sd, the builtin should take a
> single argument, and if necessary we'll use 2 builtins to implement
> _mm_frcz_sd.
>


I also only found the one by Microsoft. If the second argument is
ignored, we could just remove it, as long as there is no "standard"
that requires two arguments. Hopefully that won't break existing
projects using _mm_frcz_sd.

Thank you for your comments!


Cong


> --
> Marc Glisse


[PATCH] Fixing PR59006 and PR58921 by delaying loop invariant hoisting in vectorizer.

2013-11-22 Thread Cong Hou
Hi

Currently in GCC vectorization, some loop invariants may be detected
after the alias checks and can then be hoisted outside of the loop. The
current method in GCC may invalidate information built during the
analysis phase, causing crashes (see PR59006 and PR58921).

This patch improves the loop invariant hoisting by delaying it until
all statements are vectorized, thereby keeping all of the built
information intact. Those loop invariant statements themselves are not
vectorized, and a variable defined by one of them is treated as an
external definition.

Bootstrapped and tested on an x86-64 machine.


thanks,
Cong



diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 2c0554b..0614bab 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,18 @@
+2013-11-22  Cong Hou  
+
+ PR tree-optimization/58921
+ PR tree-optimization/59006
+ * tree-vectorizer.h (struct _stmt_vec_info): New data member
+ loop_invariant.
+ * tree-vect-loop-manip.c (vect_loop_versioning): Delay hoisting loop
+ invariants until all statements are vectorized.
+ * tree-vect-loop.c (vect_hoist_loop_invariants): New function.
+ (vect_transform_loop): Hoist loop invariants after all statements
+ are vectorized.  Do not vectorize loop invariants stmts.
+ * tree-vect-stmts.c (vect_get_vec_def_for_operand): Treat a loop
+ invariant as an external definition.
+ (new_stmt_vec_info): Initialize new data member.
+
 2013-11-12  Jeff Law  

  * tree-ssa-threadedge.c (thread_around_empty_blocks): New
diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
index 09c7f20..447625b 100644
--- a/gcc/testsuite/ChangeLog
+++ b/gcc/testsuite/ChangeLog
@@ -1,3 +1,10 @@
+2013-11-22  Cong Hou  
+
+ PR tree-optimization/58921
+ PR tree-optimization/59006
+ * gcc.dg/vect/pr58921.c: New test.
+ * gcc.dg/vect/pr59006.c: New test.
+
 2013-11-12  Balaji V. Iyer  

  * gcc.dg/cilk-plus/cilk-plus.exp: Added a check for LTO before running
diff --git a/gcc/testsuite/gcc.dg/vect/pr58921.c b/gcc/testsuite/gcc.dg/vect/pr58921.c
new file mode 100644
index 000..ee3694a
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr58921.c
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target vect_int } */
+
+int a[7];
+int b;
+
+void
+fn1 ()
+{
+  for (; b; b++)
+a[b] = ((a[b] <= 0) == (a[0] != 0));
+}
+
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
+/* { dg-final { cleanup-tree-dump "vect" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/pr59006.c b/gcc/testsuite/gcc.dg/vect/pr59006.c
new file mode 100644
index 000..95d90a9
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr59006.c
@@ -0,0 +1,24 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target vect_int } */
+
+int a[8], b;
+
+void fn1 (void)
+{
+  int c;
+  for (; b; b++)
+{
+  int d = a[b];
+  c = a[0] ? d : 0;
+  a[b] = c;
+}
+}
+
+void fn2 ()
+{
+  for (; b <= 0; b++)
+a[b] = a[0] || b;
+}
+
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 2 "vect" } } */
+/* { dg-final { cleanup-tree-dump "vect" } } */
diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
index 15227856..3adc73d 100644
--- a/gcc/tree-vect-loop-manip.c
+++ b/gcc/tree-vect-loop-manip.c
@@ -2448,8 +2448,12 @@ vect_loop_versioning (loop_vec_info loop_vinfo,
   FOR_EACH_SSA_TREE_OPERAND (var, stmt, iter, SSA_OP_USE)
 {
   gimple def = SSA_NAME_DEF_STMT (var);
+  stmt_vec_info def_stmt_info;
+
   if (!gimple_nop_p (def)
-  && flow_bb_inside_loop_p (loop, gimple_bb (def)))
+  && flow_bb_inside_loop_p (loop, gimple_bb (def))
+  && !((def_stmt_info = vinfo_for_stmt (def))
+ && STMT_VINFO_LOOP_INVARIANT_P (def_stmt_info)))
  {
   hoist = false;
   break;
@@ -2458,21 +2462,8 @@ vect_loop_versioning (loop_vec_info loop_vinfo,

   if (hoist)
 {
-  if (dr)
- gimple_set_vuse (stmt, NULL);
-
-  gsi_remove (&si, false);
-  gsi_insert_on_edge_immediate (loop_preheader_edge (loop),
-stmt);
-
-  if (dump_enabled_p ())
- {
-  dump_printf_loc
-  (MSG_NOTE, vect_location,
-   "hoisting out of the vectorized loop: ");
-  dump_gimple_stmt (MSG_NOTE, TDF_SLIM, stmt, 0);
-  dump_printf (MSG_NOTE, "\n");
- }
+  STMT_VINFO_LOOP_INVARIANT_P (stmt_info) = true;
+  gsi_next (&si);
   continue;
 }
  }
@@ -2481,6 +2472,7 @@ vect_loop_versioning (loop_vec_info loop_vinfo,
  }
 }

+
   /* End loop-exit-fixes after versioning.  */

   if (cond_expr_stmt_list)
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 292e771..148f9f1 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -5572,6 +5572,49 @@ vect_loop_kill_debug_uses (struct loop *loop, gimple stmt)
 }
 }

+/* Find all loop invariants detected after alias checks, and hoist them
+   out of the loop into its preheader.  */
+
+static void
+vect_hoist_loop_invariants (loop_vec_info loop_vinfo)
+{
+  struct

Re: [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.

2014-06-23 Thread Cong Hou
It has been 8 months since this patch was posted, and I have addressed
all of the comments on it.

The SAD pattern is very useful for multimedia algorithms such as those
in ffmpeg, and this patch will greatly improve the performance of such
algorithms. Could you please have a look again and check whether it is
OK for the trunk? If necessary I can re-post this patch in a new
thread.
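
For reference, this is the kind of kernel the pattern covers (my own
illustration of a typical SAD inner loop; the widening subtract,
absolute value, and accumulation match the SAD_EXPR form discussed in
this thread):

int
sad (const unsigned char *a, const unsigned char *b, int n)
{
  int sum = 0;
  for (int i = 0; i < n; i++)
    {
      int diff = a[i] - b[i];           /* widening minus */
      sum += diff >= 0 ? diff : -diff;  /* abs, then accumulate */
    }
  return sum;
}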

Thank you!


Cong


On Tue, Dec 17, 2013 at 10:04 AM, Cong Hou  wrote:
>
> Ping?
>
>
> thanks,
> Cong
>
>
> On Mon, Dec 2, 2013 at 5:06 PM, Cong Hou  wrote:
> > Hi Richard
> >
> > Could you please take a look at this patch and see if it is ready for
> > the trunk? The patch is pasted as a text file here again.
> >
> > Thank you very much!
> >
> >
> > Cong
> >
> >
> > On Mon, Nov 11, 2013 at 11:25 AM, Cong Hou  wrote:
> >> Hi James
> >>
> >> Sorry for the late reply.
> >>
> >>
> >> On Fri, Nov 8, 2013 at 2:55 AM, James Greenhalgh
> >>  wrote:
> >>>> On Tue, Nov 5, 2013 at 9:58 AM, Cong Hou  wrote:
> >>>> > Thank you for your detailed explanation.
> >>>> >
> >>>> > Once GCC detects a reduction operation, it will automatically
> >>>> > accumulate all elements in the vector after the loop. In the loop the
> >>>> > reduction variable is always a vector whose elements are reductions of
> >>>> > corresponding values from other vectors. Therefore in your case the
> >>>> > only instruction you need to generate is:
> >>>> >
> >>>> > VABAL   ops[3], ops[1], ops[2]
> >>>> >
> >>>> > It is OK if you accumulate the elements into one in the vector inside
> >>>> > of the loop (if one instruction can do this), but you have to make
> >>>> > sure other elements in the vector should remain zero so that the final
> >>>> > result is correct.
> >>>> >
> >>>> > If you are confused about the documentation, check the one for
> >>>> > udot_prod (just above usad in md.texi), as it has very similar
> >>>> > behavior as usad. Actually I copied the text from there and did some
> >>>> > changes. As those two instruction patterns are both for vectorization,
> >>>> > their behavior should not be difficult to explain.
> >>>> >
> >>>> > If you have more questions or think that the documentation is still
> >>>> > improper please let me know.
> >>>
> >>> Hi Cong,
> >>>
> >>> Thanks for your reply.
> >>>
> >>> I've looked at Dorit's original patch adding WIDEN_SUM_EXPR and
> >>> DOT_PROD_EXPR and I see that the same ambiguity exists for
> >>> DOT_PROD_EXPR. Can you please add a note in your tree.def
> >>> that SAD_EXPR, like DOT_PROD_EXPR can be expanded as either:
> >>>
> >>>   tmp = WIDEN_MINUS_EXPR (arg1, arg2)
> >>>   tmp2 = ABS_EXPR (tmp)
> >>>   arg3 = PLUS_EXPR (tmp2, arg3)
> >>>
> >>> or:
> >>>
> >>>   tmp = WIDEN_MINUS_EXPR (arg1, arg2)
> >>>   tmp2 = ABS_EXPR (tmp)
> >>>   arg3 = WIDEN_SUM_EXPR (tmp2, arg3)
> >>>
> >>> Where WIDEN_MINUS_EXPR is a signed MINUS_EXPR, returning a
> >>> a value of the same (widened) type as arg3.
> >>>
> >>
> >>
> >> I have added it, although we currently don't have WIDEN_MINUS_EXPR (I
> >> mentioned it in tree.def).
> >>
> >>
> >>> Also, while looking for the history of DOT_PROD_EXPR I spotted this
> >>> patch:
> >>>
> >>>   [autovect] [patch] detect mult-hi and sad patterns
> >>>   http://gcc.gnu.org/ml/gcc-patches/2005-10/msg01394.html
> >>>
> >>> I wonder what the reason was for that patch to be dropped?
> >>>
> >>
> >> It has been 8 years.. I have no idea why this patch is not accepted
> >> finally. There is even no reply in that thread. But I believe the SAD
> >> pattern is very important to be recognized. ARM also provides
> >> instructions for it.
> >>
> >>
> >> Thank you for your comment again!
> >>
> >>
> >> thanks,
> >> Cong
> >>
> >>
> >>
> >>> Thanks,
> >>> James
> >>>


Re: [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.

2014-06-24 Thread Cong Hou
OK. Thank you very much for your review, Richard!

thanks,
Cong


On Tue, Jun 24, 2014 at 4:19 AM, Richard Biener
 wrote:
> On Tue, Dec 3, 2013 at 2:06 AM, Cong Hou  wrote:
>> Hi Richard
>>
>> Could you please take a look at this patch and see if it is ready for
>> the trunk? The patch is pasted as a text file here again.
>
> (found it)
>
> The patch is ok for trunk.  (please consider re-testing before you commit)
>
> Thanks,
> Richard.
>
>> Thank you very much!
>>
>>
>> Cong
>>
>>
>> On Mon, Nov 11, 2013 at 11:25 AM, Cong Hou  wrote:
>>> Hi James
>>>
>>> Sorry for the late reply.
>>>
>>>
>>> On Fri, Nov 8, 2013 at 2:55 AM, James Greenhalgh
>>>  wrote:
>>>>> On Tue, Nov 5, 2013 at 9:58 AM, Cong Hou  wrote:
>>>>> > Thank you for your detailed explanation.
>>>>> >
>>>>> > Once GCC detects a reduction operation, it will automatically
>>>>> > accumulate all elements in the vector after the loop. In the loop the
>>>>> > reduction variable is always a vector whose elements are reductions of
>>>>> > corresponding values from other vectors. Therefore in your case the
>>>>> > only instruction you need to generate is:
>>>>> >
>>>>> > VABAL   ops[3], ops[1], ops[2]
>>>>> >
>>>>> > It is OK if you accumulate the elements into one in the vector inside
>>>>> > of the loop (if one instruction can do this), but you have to make
>>>>> > sure other elements in the vector should remain zero so that the final
>>>>> > result is correct.
>>>>> >
>>>>> > If you are confused about the documentation, check the one for
>>>>> > udot_prod (just above usad in md.texi), as it has very similar
>>>>> > behavior as usad. Actually I copied the text from there and did some
>>>>> > changes. As those two instruction patterns are both for vectorization,
>>>>> > their behavior should not be difficult to explain.
>>>>> >
>>>>> > If you have more questions or think that the documentation is still
>>>>> > improper please let me know.
>>>>
>>>> Hi Cong,
>>>>
>>>> Thanks for your reply.
>>>>
>>>> I've looked at Dorit's original patch adding WIDEN_SUM_EXPR and
>>>> DOT_PROD_EXPR and I see that the same ambiguity exists for
>>>> DOT_PROD_EXPR. Can you please add a note in your tree.def
>>>> that SAD_EXPR, like DOT_PROD_EXPR can be expanded as either:
>>>>
>>>>   tmp = WIDEN_MINUS_EXPR (arg1, arg2)
>>>>   tmp2 = ABS_EXPR (tmp)
>>>>   arg3 = PLUS_EXPR (tmp2, arg3)
>>>>
>>>> or:
>>>>
>>>>   tmp = WIDEN_MINUS_EXPR (arg1, arg2)
>>>>   tmp2 = ABS_EXPR (tmp)
>>>>   arg3 = WIDEN_SUM_EXPR (tmp2, arg3)
>>>>
>>>> Where WIDEN_MINUS_EXPR is a signed MINUS_EXPR, returning a
>>>> a value of the same (widened) type as arg3.
>>>>
>>>
>>>
>>> I have added it, although we currently don't have WIDEN_MINUS_EXPR (I
>>> mentioned it in tree.def).
>>>
>>>
>>>> Also, while looking for the history of DOT_PROD_EXPR I spotted this
>>>> patch:
>>>>
>>>>   [autovect] [patch] detect mult-hi and sad patterns
>>>>   http://gcc.gnu.org/ml/gcc-patches/2005-10/msg01394.html
>>>>
>>>> I wonder what the reason was for that patch to be dropped?
>>>>
>>>
>>> It has been 8 years.. I have no idea why this patch is not accepted
>>> finally. There is even no reply in that thread. But I believe the SAD
>>> pattern is very important to be recognized. ARM also provides
>>> instructions for it.
>>>
>>>
>>> Thank you for your comment again!
>>>
>>>
>>> thanks,
>>> Cong
>>>
>>>
>>>
>>>> Thanks,
>>>> James
>>>>

