[PATCH] Fixing improper conversion from sin() to sinf() in optimization mode.
When a sin() (cos(), log(), etc.) call on a value of float type has its returned double value converted back to float, GCC in optimization mode converts the call into the float version (sinf()). This avoids two type conversions, and the float version of the function usually takes less time. However, it can produce a different result and is therefore unsafe. For example, the following code prints two different values at -O0 (correct) but identical values at any optimization level above -O0 (incorrect):

#include <stdio.h>
#include <math.h>

int main()
{
  float v = 1;
  printf("%.20f\n", (float)sin(v));
  printf("%.20f\n", sinf(v));
}

In this patch, we do this conversion only when the flag -funsafe-math-optimizations is set. The patch is shown below.

thanks,
Cong


Index: gcc/testsuite/gcc.c-torture/execute/20030125-1.c
===================================================================
--- gcc/testsuite/gcc.c-torture/execute/20030125-1.c (revision 201891)
+++ gcc/testsuite/gcc.c-torture/execute/20030125-1.c (working copy)
@@ -44,11 +44,11 @@
 __attribute__ ((noinline)) double
 sin(double a)
 {
-  abort ();
+  return a;
 }
 __attribute__ ((noinline)) float
 sinf(float a)
 {
-  return a;
+  abort ();
 }

Index: gcc/convert.c
===================================================================
--- gcc/convert.c (revision 201891)
+++ gcc/convert.c (working copy)
@@ -99,7 +99,7 @@ convert_to_real (tree type, tree expr)
       /* Disable until we figure out how to decide whether the functions are
          present in runtime. */
       /* Convert (float)sqrt((double)x) where x is float into sqrtf(x) */
-      if (optimize
+      if (optimize && flag_unsafe_math_optimizations
          && (TYPE_MODE (type) == TYPE_MODE (double_type_node)
              || TYPE_MODE (type) == TYPE_MODE (float_type_node)))
        {
Re: [PATCH] Fixing improper conversion from sin() to sinf() in optimization mode.
I have fixed my test code and replaced those aliasing violations with unions. Now the test results show that logb() is safe for the conversion. The conclusion is: logb() and fabs() are always safe for the conversion, and sqrt() is unsafe only for the conversion from sqrtl() on a double operand to sqrt() on double. The other math functions are not safe for the conversion. The new test code I used is shown below:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

typedef union { int i; float f; } T32;
typedef union { long long int i; double f; } T64;

#define N 1000

#define test_math_func(func) \
  for (i = 0; i < N; ++i) \
    { \
      int d = rand(), e = rand(); \
      if (d == 0) continue; \
      T32 v, r1, r2; \
      v.f = (float)e / d; \
      r1.f = func(v.f), r2.f = func##f(v.f); \
      if (r1.f != r2.f) \
        { \
          printf("%s double -> float (%X) %X %X\n", #func, v.i, r1.i, r2.i); \
          break; \
        } \
    } \
  for (i = 0; i < N; ++i) \
    { \
      int d = rand(), e = rand(); \
      if (d == 0) continue; \
      T32 v, r1, r2; \
      v.f = (float)e / d; \
      r1.f = func##l(v.f), r2.f = func##f(v.f); \
      if (r1.f != r2.f) \
        { \
          printf("%s long double -> float (%X) %X %X\n", #func, v.i, r1.i, r2.i); \
          break; \
        } \
    } \
  for (i = 0; i < N; ++i) \
    { \
      int d = rand(), e = rand(); \
      if (d == 0) continue; \
      T64 v, r1, r2; \
      v.f = (double)e / d; \
      r1.f = func##l(v.f), r2.f = func(v.f); \
      if (r1.f != r2.f) \
        { \
          printf("%s long double -> double (%016llX) %016llX %016llX\n", #func, v.i, r1.i, r2.i); \
          break; \
        } \
    }

int main()
{
  int i;
  test_math_func(sin);
  test_math_func(cos);
  test_math_func(sinh);
  test_math_func(cosh);
  test_math_func(asin);
  test_math_func(acos);
  test_math_func(asinh);
  test_math_func(acosh);
  test_math_func(tan);
  test_math_func(tanh);
  test_math_func(atan);
  test_math_func(atanh);
  test_math_func(log);
  test_math_func(log10);
  test_math_func(log1p);
  test_math_func(log2);
  test_math_func(logb);
  test_math_func(cbrt);
  test_math_func(erf);
  test_math_func(erfc);
  test_math_func(exp);
  test_math_func(exp2);
  test_math_func(expm1);
  test_math_func(sqrt);
  test_math_func(fabs);
}

I have modified the patch according to this new conclusion. The patch is pasted below.

thanks,
Cong


===================================================================
--- gcc/convert.c (revision 201891)
+++ gcc/convert.c (working copy)
@@ -135,16 +135,24 @@ convert_to_real (tree type, tree expr)
      CASE_MATHFN (COS)
      CASE_MATHFN (ERF)
      CASE_MATHFN (ERFC)
-      CASE_MATHFN (FABS)
      CASE_MATHFN (LOG)
      CASE_MATHFN (LOG10)
      CASE_MATHFN (LOG2)
      CASE_MATHFN (LOG1P)
-      CASE_MATHFN (LOGB)
      CASE_MATHFN (SIN)
-      CASE_MATHFN (SQRT)
      CASE_MATHFN (TAN)
      CASE_MATHFN (TANH)
+      /* The above functions are not safe to do this conversion. */
+      if (!flag_unsafe_math_optimizations)
+        break;
+      CASE_MATHFN (SQRT)
+      /* sqrtl(double) cannot be safely converted to sqrt(double). */
+      if (fcode == BUILT_IN_SQRTL &&
+          (TYPE_MODE (type) == TYPE_MODE (double_type_node)) &&
+          !flag_unsafe_math_optimizations)
+        break;
+      CASE_MATHFN (FABS)
+      CASE_MATHFN (LOGB)
 #undef CASE_MATHFN
     {
       tree arg0 = strip_float_extensions (CALL_EXPR_ARG (expr, 0));

Index: gcc/testsuite/gcc.c-torture/execute/20030125-1.c
===================================================================
--- gcc/testsuite/gcc.c-torture/execute/20030125-1.c (revision 201891)
+++ gcc/testsuite/gcc.c-torture/execute/20030125-1.c (working copy)
@@ -44,11 +44,11 @@
 __attribute__ ((noinline)) double
 sin(double a)
 {
-  abort ();
+  return a;
 }
 __attribute__ ((noinline)) float
 sinf(float a)
 {
-  return a;
+  abort ();
 }

On Sat, Aug 31, 2013 at 9:24 AM, Joseph S.
Myers wrote:
> On Sat, 31 Aug 2013, Cong Hou wrote:
>
>> > I don't see why it would be unsafe for logb - can you give an example
>> > (exact float input value as hex float, and the values you believe logb
>> > should return for float and double).
>> >
>>
>> Please try the following code (you will get different results depending
>> on whether optimization is enabled):
>>
>> #include <stdio.h>
>> #include <math.h>
>>
>> int main()
>> {
>>   int i = 0x3edc67d5;
>>   float f = *((float*)&i);
>>   float r1 = logb(f);
>>   float r2 = logbf(f);
>>   printf("%x %x\n", *((int*)&r1), *((int*)&r2));
>> }
>
> (a) Please stop sending HTML email, so your messages reach the mailing
> list, and resend your messages so far to the list. The mailing list needs
> to see the whole of both sides of the discussion of any patch being
> proposed for GCC.
>
> (b) I referred to the values *you believe logb should return*.
> Optimization is not meant to preserve library bugs; the comparison should
> be on
Re: [PATCH] Fixing improper conversion from sin() to sinf() in optimization mode.
Could you please tell me how to check the precision of long double in GCC on different platforms?

Thank you!

Cong


On Tue, Sep 3, 2013 at 2:43 PM, Joseph S. Myers wrote:
> On Tue, 3 Sep 2013, Xinliang David Li wrote:
>
>> >From Joseph:
>>
>> "The
>> conversion is not safe for sqrt if the two types are double and long
>> double and long double is x86 extended, for example."
>>
>> This is not reflected in the patch.
>
> No, the problem is that it tries to reflect it but hardcodes the specific
> example I gave, rather than following the logic I explained regarding the
> precisions of the types involved, which depend on the target. And since I
> only gave a simplified analysis, for two types when this function deals
> with cases involving three types, the patch submission needs to include
> its own analysis for the full generality of three types to justify the
> logic used (as inequalities involving the three precisions). (I suspect
> it reduces to the case of two types so you don't need to go into the
> details of reasoning about floating point to produce the more general
> analysis. But in any case, it's for the patch submitter to give the full
> explanation.)
>
> --
> Joseph S. Myers
> jos...@codesourcery.com
Re: [PATCH] Fixing improper conversion from sin() to sinf() in optimization mode.
I have made a new patch according to your comments. I found several references saying that a precision of 2p+2 bits in the wider type is enough for the sqrt conversion to be safe (one here: http://www.cs.berkeley.edu/~fateman/generic/algorithms.pdf). The new patch is pasted below.

Thank you for all the suggestions, Joseph!

Cong


Index: gcc/testsuite/gcc.c-torture/execute/20030125-1.c
===================================================================
--- gcc/testsuite/gcc.c-torture/execute/20030125-1.c (revision 201891)
+++ gcc/testsuite/gcc.c-torture/execute/20030125-1.c (working copy)
@@ -44,11 +44,11 @@
 __attribute__ ((noinline)) double
 sin(double a)
 {
-  abort ();
+  return a;
 }
 __attribute__ ((noinline)) float
 sinf(float a)
 {
-  return a;
+  abort ();
 }

Index: gcc/convert.c
===================================================================
--- gcc/convert.c (revision 201891)
+++ gcc/convert.c (working copy)
@@ -135,16 +135,34 @@ convert_to_real (tree type, tree expr)
      CASE_MATHFN (COS)
      CASE_MATHFN (ERF)
      CASE_MATHFN (ERFC)
-      CASE_MATHFN (FABS)
      CASE_MATHFN (LOG)
      CASE_MATHFN (LOG10)
      CASE_MATHFN (LOG2)
      CASE_MATHFN (LOG1P)
-      CASE_MATHFN (LOGB)
      CASE_MATHFN (SIN)
-      CASE_MATHFN (SQRT)
      CASE_MATHFN (TAN)
      CASE_MATHFN (TANH)
+      CASE_MATHFN (SQRT)
+
+      /* The above functions (except sqrt) are not safe to do this conversion. */
+      if (!flag_unsafe_math_optimizations)
+        {
+          /* sqrtl?(T1) can be safely converted into sqrtf?(T2) only if
+             p1 >= p2*2+2, where p1 and p2 are the precisions of T1 and T2. */
+          if ((fcode == BUILT_IN_SQRT || fcode == BUILT_IN_SQRTL))
+            {
+              int p1 = REAL_MODE_FORMAT (TYPE_MODE (type))->p;
+              int p2 = (fcode == BUILT_IN_SQRTL) ?
+                  REAL_MODE_FORMAT (TYPE_MODE (long_double_type_node))->p :
+                  REAL_MODE_FORMAT (TYPE_MODE (double_type_node))->p;
+              if (p2 < p1 * 2 + 2)
+                break;
+            }
+          else
+            break;
+        }
+      CASE_MATHFN (FABS)
+      CASE_MATHFN (LOGB)
 #undef CASE_MATHFN
     {
       tree arg0 = strip_float_extensions (CALL_EXPR_ARG (expr, 0));

On Tue, Sep 3, 2013 at 3:38 PM, Joseph S. Myers wrote:
> On Tue, 3 Sep 2013, Cong Hou wrote:
>
>> Could you please tell me how to check the precision of long double in
>> GCC on different platforms?
>
> REAL_MODE_FORMAT (TYPE_MODE (long_double_type_node))->p
>
> (but you should be referring to the relevant types - "type", the type
> being converted to, "itype", the type of the function being called in the
> source code, "TREE_TYPE (arg0)", the type of the argument after extensions
> have been removed, and "newtype", computed from those - so you should have
> expressions like the above with two or more of those four types, but not
> with long_double_type_node directly).
>
> The patch submission will need to include a proper analysis to justify to
> the reader why the particular inequality with particular types from those
> four is correct in all cases where the relevant code may be executed.
>
> --
> Joseph S. Myers
> jos...@codesourcery.com
Re: [PATCH] Fixing improper conversion from sin() to sinf() in optimization mode.
Updated patch according to your comment (tabs are not pasted here).

Cong


Index: gcc/convert.c
===================================================================
--- gcc/convert.c (revision 201891)
+++ gcc/convert.c (working copy)
@@ -135,16 +135,40 @@ convert_to_real (tree type, tree expr)
      CASE_MATHFN (COS)
      CASE_MATHFN (ERF)
      CASE_MATHFN (ERFC)
-      CASE_MATHFN (FABS)
      CASE_MATHFN (LOG)
      CASE_MATHFN (LOG10)
      CASE_MATHFN (LOG2)
      CASE_MATHFN (LOG1P)
-      CASE_MATHFN (LOGB)
      CASE_MATHFN (SIN)
-      CASE_MATHFN (SQRT)
      CASE_MATHFN (TAN)
      CASE_MATHFN (TANH)
+      CASE_MATHFN (SQRT)
+
+      /* The above functions (except sqrt) are not safe to do this conversion. */
+      if (!flag_unsafe_math_optimizations)
+        {
+          /* sqrtl?(T1) can be safely converted into sqrtf?(T2) only if
+             p1 >= p2*2+2, where p1 and p2 are the precisions of T1 and T2.
+             For example, on x86 the conversion from (float)sqrt((double)f)
+             to sqrtf(f) is safe where f has type float, since float has
+             24 bits of precision and double has 53, and 53 >= 24*2+2.
+             However, the conversion from (double)sqrtl((long double)d)
+             to sqrt(d) is unsafe where d has type double, because x86
+             long double has 64 bits of precision and 64 < 53*2+2. */
+          if ((fcode == BUILT_IN_SQRT || fcode == BUILT_IN_SQRTL))
+            {
+              int p1 = REAL_MODE_FORMAT (TYPE_MODE (type))->p;
+              int p2 = (fcode == BUILT_IN_SQRTL) ?
+                  REAL_MODE_FORMAT (TYPE_MODE (long_double_type_node))->p :
+                  REAL_MODE_FORMAT (TYPE_MODE (double_type_node))->p;
+              if (p2 < p1 * 2 + 2)
+                break;
+            }
+          else
+            break;
+        }
+      CASE_MATHFN (FABS)
+      CASE_MATHFN (LOGB)
 #undef CASE_MATHFN
     {
       tree arg0 = strip_float_extensions (CALL_EXPR_ARG (expr, 0));

Index: gcc/testsuite/gcc.c-torture/execute/20030125-1.c
===================================================================
--- gcc/testsuite/gcc.c-torture/execute/20030125-1.c (revision 201891)
+++ gcc/testsuite/gcc.c-torture/execute/20030125-1.c (working copy)
@@ -44,11 +44,11 @@
 __attribute__ ((noinline)) double
 sin(double a)
 {
-  abort ();
+  return a;
 }
 __attribute__ ((noinline)) float
 sinf(float a)
 {
-  return a;
+  abort ();
 }

On Wed, Sep 4, 2013 at 2:21 PM, Xinliang David Li wrote:
> On Wed, Sep 4, 2013 at 1:59 PM, Joseph S. Myers wrote:
>> On Wed, 4 Sep 2013, Cong Hou wrote:
>>
>>> I have made a new patch according to your comments. I found several
>>> references saying that the precision 2p+2 is OK for the sqrt
>>> conversion (one here:
>>> http://www.cs.berkeley.edu/~fateman/generic/algorithms.pdf). The new
>>> patch is pasted as below.
>>
>> This patch submission still fails to pay attention to various of my
>> comments.
>>
>
> If you can provide inlined comments in the patch, that will be more
> useful and productive.
>
> thanks,
>
> David
>
>
>> --
>> Joseph S. Myers
>> jos...@codesourcery.com
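As a side note on checking these precisions from outside GCC: on IEEE targets the <float.h> mantissa-digit macros report the same significand widths (including the implicit bit) that REAL_MODE_FORMAT (...)->p returns for the corresponding modes. A trivial check program, for reference only (not part of the patch):

#include <stdio.h>
#include <float.h>

int main(void)
{
  /* Significand precision in radix-2 digits: 24 and 53 on IEEE targets;
     LDBL_MANT_DIG is 64 where long double is the x86 80-bit extended
     format.  */
  printf("float:       %d\n", FLT_MANT_DIG);
  printf("double:      %d\n", DBL_MANT_DIG);
  printf("long double: %d\n", LDBL_MANT_DIG);
  return 0;
}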
Re: [PATCH] Fixing improper conversion from sin() to sinf() in optimization mode.
First, thank you for your detailed comments again! And I deeply apologize for not explaining my patch properly and for not responding adequately to your previous comments. I didn't understand the problem thoroughly before submitting the patch.

Previously I only considered the following three conversions for sqrt():

1: (float) sqrt ((double) float_val) -> sqrtf (float_val)
2: (float) sqrtl ((long double) float_val) -> sqrtf (float_val)
3: (double) sqrtl ((long double) double_val) -> sqrt (double_val)

We have four types here:

TYPE is the type to which the result of the function call is converted.
ITYPE is the type of the math call expression.
TREE_TYPE(arg0) is the type of the function argument (before type conversion).
NEWTYPE is chosen as whichever of TYPE and TREE_TYPE(arg0) has higher precision. It will be the type of the new math call expression after conversion.

For all three cases above, TYPE is always the same as NEWTYPE. That is why I only considered TYPE during the precision comparison. ITYPE can only be double_type_node or long_double_type_node depending on the type of the math function. That is why I explicitly used those two types instead of ITYPE (no correctness issue). But you are right, ITYPE is more elegant and better here.

After further analysis, I found I missed two more cases. Note that we have the following conditions according to the code in convert.c:

TYPE_PRECISION(NEWTYPE) >= TYPE_PRECISION(TYPE)
TYPE_PRECISION(NEWTYPE) >= TYPE_PRECISION(TREE_TYPE(arg0))
TYPE_PRECISION(NEWTYPE) < TYPE_PRECISION(ITYPE)

The last condition comes from the fact that we only consider converting a math function call into another one with a narrower type. Therefore we have

TYPE_PRECISION(TYPE) < TYPE_PRECISION(ITYPE)
TYPE_PRECISION(TREE_TYPE(arg0)) < TYPE_PRECISION(ITYPE)

So for sqrt(), TYPE and TREE_TYPE(arg0) can only be float, and for sqrtl(), TYPE and TREE_TYPE(arg0) can be either float or double, with four possible combinations. Therefore we have two more conversions to consider besides the three ones I mentioned above:

4: (float) sqrtl ((long double) double_val) -> (float) sqrt (double_val)
5: (double) sqrtl ((long double) float_val) -> sqrt ((double) float_val)

For the first conversion here, TYPE (float) is different from NEWTYPE (double), and my previous patch doesn't handle this case. The correct way is to compare the precisions of ITYPE and NEWTYPE now.

To sum up, we are converting the expression

(TYPE) sqrtITYPE ((ITYPE) expr)

to

(TYPE) sqrtNEWTYPE ((NEWTYPE) expr)

and we require

PRECISION (ITYPE) >= PRECISION (NEWTYPE) * 2 + 2

to make it a safe conversion.

The new patch is pasted below.

I appreciate your detailed comments and analysis, and next time when I submit a patch I will be more careful about the reviewer's comments.

Thank you!

Cong


Index: gcc/convert.c
===================================================================
--- gcc/convert.c (revision 201891)
+++ gcc/convert.c (working copy)
@@ -135,16 +135,19 @@ convert_to_real (tree type, tree expr)
      CASE_MATHFN (COS)
      CASE_MATHFN (ERF)
      CASE_MATHFN (ERFC)
-      CASE_MATHFN (FABS)
      CASE_MATHFN (LOG)
      CASE_MATHFN (LOG10)
      CASE_MATHFN (LOG2)
      CASE_MATHFN (LOG1P)
-      CASE_MATHFN (LOGB)
      CASE_MATHFN (SIN)
-      CASE_MATHFN (SQRT)
      CASE_MATHFN (TAN)
      CASE_MATHFN (TANH)
+      /* The above functions are not safe to do this conversion. */
+      if (!flag_unsafe_math_optimizations)
+        break;
+      CASE_MATHFN (SQRT)
+      CASE_MATHFN (FABS)
+      CASE_MATHFN (LOGB)
 #undef CASE_MATHFN
     {
       tree arg0 = strip_float_extensions (CALL_EXPR_ARG (expr, 0));
@@ -155,6 +158,27 @@ convert_to_real (tree type, tree expr)
       if (TYPE_PRECISION (TREE_TYPE (arg0)) > TYPE_PRECISION (type))
        newtype = TREE_TYPE (arg0);

+      /* We consider to convert
+
+           (T1) sqrtT2 ((T2) exprT3)
+         to
+           (T1) sqrtT4 ((T4) exprT3)
+
+         where T1 is TYPE, T2 is ITYPE, T3 is TREE_TYPE (ARG0),
+         and T4 is NEWTYPE. All those types are of floating point types.
+         T4 (NEWTYPE) should be narrower than T2 (ITYPE). This conversion
+         is safe only if P1 >= P2*2+2, where P1 and P2 are precisions of
+         T2 and T4. See the following URL for a reference:
+         http://stackoverflow.com/questions/9235456/determining-floating-point-square-root
+         */
+      if (fcode == BUILT_IN_SQRT || fcode == BUILT_IN_SQRTL)
+        {
+          int p1 = REAL_MODE_FORMAT (TYPE_MODE (itype))->p;
+          int p2 = REAL_MODE_FORMAT (TYPE_MODE (newtype))->p;
+          if (p1 < p2 * 2 + 2 && !flag_unsafe_math_optimizations)
+            break;
+        }
+
       /* Be careful about integer to fp conversions.
          These may overflow still. */
       if (FLOAT_TYPE_P (TREE_TYPE (arg0))

Index: gcc/testsuite/gcc.c-torture/execute/20030125-1.c
===================================================================
--- gcc/testsuite/gcc.c-torture/execute/20030125-1.c (revision 201891)
+++ gcc/testsuite/gcc.c-torture/execute/20030125-1.c (working copy)
@@ -44,11 +44,11 @@
 __attribute__ ((noinline)) double
 sin(double a)
 {
-  abort ();
+  return a;
 }
 __attribute__ ((noinline)) float
 sinf(float a)
 {
-  return a;
+  abort ();
 }
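To make case 3 concrete: on x86, where long double is the 80-bit extended format (64-bit significand) and 64 < 53*2+2, the narrowed call can differ from the original by one ulp because of double rounding. Here is a minimal standalone sketch (my own illustration, not part of the patch) that searches for such a counterexample, reusing the union trick from the earlier test harness; compile it at -O0 so the compiler does not itself narrow the sqrtl call:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

typedef union { long long int i; double f; } T64;

int main(void)
{
  int i;
  for (i = 0; i < 10000000; ++i)
    {
      int d = rand(), e = rand();
      if (d == 0)
        continue;
      T64 v, r1, r2;
      v.f = (double)e / d;
      /* The original expression vs. the narrowed one from conversion 3.  */
      r1.f = (double) sqrtl ((long double) v.f);
      r2.f = sqrt (v.f);
      if (r1.f != r2.f)
        {
          printf ("mismatch at %016llX: %016llX vs %016llX\n",
                  v.i, r1.i, r2.i);
          return 0;
        }
    }
  printf ("no mismatch found in this sample\n");
  return 0;
}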
Re: [PATCH] Fixing improper conversion from sin() to sinf() in optimization mode.
On Mon, Sep 9, 2013 at 6:26 PM, Xinliang David Li wrote:
> On Fri, Sep 6, 2013 at 3:24 PM, Cong Hou wrote:
>> First, thank you for your detailed comments again! And I deeply
>> apologize for not explaining my patch properly and for not responding
>> adequately to your previous comments. I didn't understand the problem
>> thoroughly before submitting the patch.
>>
>> Previously I only considered the following three conversions for sqrt():
>>
>> 1: (float) sqrt ((double) float_val) -> sqrtf (float_val)
>> 2: (float) sqrtl ((long double) float_val) -> sqrtf (float_val)
>> 3: (double) sqrtl ((long double) double_val) -> sqrt (double_val)
>>
>> We have four types here:
>>
>> TYPE is the type to which the result of the function call is converted.
>> ITYPE is the type of the math call expression.
>> TREE_TYPE(arg0) is the type of the function argument (before type
>> conversion).
>> NEWTYPE is chosen as whichever of TYPE and TREE_TYPE(arg0) has higher
>> precision. It will be the type of the new math call expression after
>> conversion.
>>
>> For all three cases above, TYPE is always the same as NEWTYPE. That is
>> why I only considered TYPE during the precision comparison. ITYPE can
>> only be double_type_node or long_double_type_node depending on the
>> type of the math function. That is why I explicitly used those two
>> types instead of ITYPE (no correctness issue). But you are right,
>> ITYPE is more elegant and better here.
>>
>> After further analysis, I found I missed two more cases. Note that we
>> have the following conditions according to the code in convert.c:
>>
>> TYPE_PRECISION(NEWTYPE) >= TYPE_PRECISION(TYPE)
>> TYPE_PRECISION(NEWTYPE) >= TYPE_PRECISION(TREE_TYPE(arg0))
>> TYPE_PRECISION(NEWTYPE) < TYPE_PRECISION(ITYPE)
>>
>> The last condition comes from the fact that we only consider
>> converting a math function call into another one with a narrower type.
>> Therefore we have
>>
>> TYPE_PRECISION(TYPE) < TYPE_PRECISION(ITYPE)
>> TYPE_PRECISION(TREE_TYPE(arg0)) < TYPE_PRECISION(ITYPE)
>>
>> So for sqrt(), TYPE and TREE_TYPE(arg0) can only be float, and for
>> sqrtl(), TYPE and TREE_TYPE(arg0) can be either float or double, with
>> four possible combinations. Therefore we have two more conversions to
>> consider besides the three ones I mentioned above:
>>
>> 4: (float) sqrtl ((long double) double_val) -> (float) sqrt (double_val)
>> 5: (double) sqrtl ((long double) float_val) -> sqrt ((double) float_val)
>>
>> For the first conversion here, TYPE (float) is different from NEWTYPE
>> (double), and my previous patch doesn't handle this case. The correct
>> way is to compare the precisions of ITYPE and NEWTYPE now.
>>
>> To sum up, we are converting the expression
>>
>> (TYPE) sqrtITYPE ((ITYPE) expr)
>>
>> to
>>
>> (TYPE) sqrtNEWTYPE ((NEWTYPE) expr)
>>
>> and we require
>>
>> PRECISION (ITYPE) >= PRECISION (NEWTYPE) * 2 + 2
>>
>> to make it a safe conversion.
>>
>> The new patch is pasted below.
>>
>> I appreciate your detailed comments and analysis, and next time when I
>> submit a patch I will be more careful about the reviewer's comments.
>>
>> Thank you!
>>
>> Cong
>>
>>
>> Index: gcc/convert.c
>> ===================================================================
>> --- gcc/convert.c (revision 201891)
>> +++ gcc/convert.c (working copy)
>> @@ -135,16 +135,19 @@ convert_to_real (tree type, tree expr)
>>       CASE_MATHFN (COS)
>>       CASE_MATHFN (ERF)
>>       CASE_MATHFN (ERFC)
>> -      CASE_MATHFN (FABS)
>>       CASE_MATHFN (LOG)
>>       CASE_MATHFN (LOG10)
>>       CASE_MATHFN (LOG2)
>>       CASE_MATHFN (LOG1P)
>> -      CASE_MATHFN (LOGB)
>>       CASE_MATHFN (SIN)
>> -      CASE_MATHFN (SQRT)
>>       CASE_MATHFN (TAN)
>>       CASE_MATHFN (TANH)
>> +      /* The above functions are not safe to do this conversion. */
>> +      if (!flag_unsafe_math_optimizations)
>> +        break;
>> +      CASE_MATHFN (SQRT)
>> +      CASE_MATHFN (FABS)
>> +      CASE_MATHFN (LOGB)
>> #undef CASE_MATHFN
>> {
>>    tree arg0 = strip_float_extensions (CALL_EXPR_ARG (expr, 0));
>> @@ -155,6 +158,27 @@ convert_to_real (tree type, tree expr)
>>        if (TYPE_PRECISION (TREE_TYPE (arg0)) > TYPE_PRECISION (type))
>>         newtype = TREE_TYPE (arg0);
>>
>> +      /* We consider to convert
>> +
[PATCH] [vectorizer] Fixing a bug in tree-vect-patterns.c in GCC vectorizer.
Hi,

There is a bug in the function vect_recog_dot_prod_pattern() in tree-vect-patterns.c. This function checks whether a loop is of dot-product pattern. Specifically, according to the comment of this function:

/* Try to find the following pattern:

     type x_t, y_t;
     TYPE1 prod;
     TYPE2 sum = init;
   loop:
     sum_0 = phi <init, sum_1>
     S1 x_t = ...
     S2 y_t = ...
     S3 x_T = (TYPE1) x_t;
     S4 y_T = (TYPE1) y_t;
     S5 prod = x_T * y_T;
     [S6 prod = (TYPE2) prod;  #optional]
     S7 sum_1 = prod + sum_0;

   where 'TYPE1' is exactly double the size of type 'type', and 'TYPE2' is the
   same size of 'TYPE1' or bigger. This is a special case of a reduction
   computation. */

This function should check whether x_t and y_t have the same type (type), which has half the size of TYPE1. The corresponding code is shown below:

  oprnd0 = gimple_assign_rhs1 (stmt);
  oprnd1 = gimple_assign_rhs2 (stmt);
  if (!types_compatible_p (TREE_TYPE (oprnd0), prod_type)
      || !types_compatible_p (TREE_TYPE (oprnd1), prod_type))
    return NULL;
  if (!type_conversion_p (oprnd0, stmt, true, &half_type0, &def_stmt,
                          &promotion)
      || !promotion)
    return NULL;
  oprnd00 = gimple_assign_rhs1 (def_stmt);
  /* ==V  see here! */
  if (!type_conversion_p (oprnd0, stmt, true, &half_type1, &def_stmt,
                          &promotion)
      || !promotion)
    return NULL;
  oprnd01 = gimple_assign_rhs1 (def_stmt);
  if (!types_compatible_p (half_type0, half_type1))
    return NULL;
  if (TYPE_PRECISION (prod_type) != TYPE_PRECISION (half_type0) * 2)
    return NULL;

Here the function uses x_T (oprnd0) to check the type of y_t, which is incorrect. The fix is simple: just replace it with oprnd1. A test case that fails because of this bug is shown below:

int foo(short *a, int *b, int n)
{
  int sum = 0;
  for (int i = 0; i < n; ++i)
    sum += a[i] * b[i];
  return sum;
}

thanks,
Cong


Index: gcc/tree-vect-patterns.c
===================================================================
--- gcc/tree-vect-patterns.c (revision 200988)
+++ gcc/tree-vect-patterns.c (working copy)
@@ -397,7 +397,7 @@ vect_recog_dot_prod_pattern (vec
       || !promotion)
     return NULL;
   oprnd00 = gimple_assign_rhs1 (def_stmt);
-  if (!type_conversion_p (oprnd0, stmt, true, &half_type1, &def_stmt,
+  if (!type_conversion_p (oprnd1, stmt, true, &half_type1, &def_stmt,
                           &promotion)
       || !promotion)
     return NULL;
Re: [PATCH] [vectorizer] Fixing a bug in tree-vect-patterns.c in GCC vectorizer.
A new test case is added to testsuite/gcc.dg/vect, which fails without this patch and passes with it. Bootstrap also passes, and no additional test failures are introduced. The new test case computes a dot product over two arrays with short and int element types. The loop is still vectorized (using punpcklwd on the array of short type), but should not be recognized as a dot-product pattern.

thanks,
Cong


Index: gcc/tree-vect-patterns.c
===================================================================
--- gcc/tree-vect-patterns.c (revision 202572)
+++ gcc/tree-vect-patterns.c (working copy)
@@ -397,7 +397,7 @@ vect_recog_dot_prod_pattern (vec
       || !promotion)
     return NULL;
   oprnd00 = gimple_assign_rhs1 (def_stmt);
-  if (!type_conversion_p (oprnd0, stmt, true, &half_type1, &def_stmt,
+  if (!type_conversion_p (oprnd1, stmt, true, &half_type1, &def_stmt,
                           &promotion)
       || !promotion)
     return NULL;

Index: gcc/ChangeLog
===================================================================
--- gcc/ChangeLog (revision 202572)
+++ gcc/ChangeLog (working copy)
@@ -1,3 +1,9 @@
+2013-09-13  Cong Hou
+
+	* tree-vect-patterns.c (vect_recog_dot_prod_pattern): Fix a bug
+	when checking the dot-product pattern. The type of the rhs operand
+	of the multiply is now checked correctly.
+
 2013-09-13  Jan Hubicka

 	PR middle-end/58094

Index: gcc/testsuite/gcc.dg/vect/vect-reduc-dot-s16c.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/vect-reduc-dot-s16c.c (revision 0)
+++ gcc/testsuite/gcc.dg/vect/vect-reduc-dot-s16c.c (revision 0)
@@ -0,0 +1,73 @@
+/* { dg-require-effective-target vect_int } */
+
+#include <stdarg.h>
+#include "tree-vect.h"
+
+#define N 64
+#define DOT 43680
+
+signed short X[N] __attribute__ ((__aligned__(__BIGGEST_ALIGNMENT__)));
+signed int Y[N] __attribute__ ((__aligned__(__BIGGEST_ALIGNMENT__)));
+
+/* (short, int)->int->int dot product.
+   Not detected as a dot-product pattern. */
+
+__attribute__ ((noinline)) int
+foo (int len)
+{
+  int i;
+  int result = 0;
+
+  for (i = 0; i < len; i++)
+    {
+      result += (X[i] * Y[i]);
+    }
+  return result;
+}
+
+
+/* (int, short)->int->int dot product.
+   Not detected as a dot-product pattern. */
+
+__attribute__ ((noinline)) int
+bar (int len)
+{
+  int i;
+  int result = 0;
+
+  for (i = 0; i < len; i++)
+    {
+      result += (Y[i] * X[i]);
+    }
+  return result;
+}
+
+int
+main (void)
+{
+  int i;
+  int dot;
+
+  check_vect ();
+
+  for (i = 0; i < N; i++)
+    {
+      X[i] = i;
+      Y[i] = N - i;
+      __asm__ volatile ("");
+    }
+
+  dot = foo (N);
+  if (dot != DOT)
+    abort ();
+
+  dot = bar (N);
+  if (dot != DOT)
+    abort ();
+
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 2 "vect" { target vect_unpack } } } */
+/* { dg-final { cleanup-tree-dump "vect" } } */
+

Index: gcc/testsuite/ChangeLog
===================================================================
--- gcc/testsuite/ChangeLog (revision 202572)
+++ gcc/testsuite/ChangeLog (working copy)
@@ -1,3 +1,9 @@
+2013-09-13  Cong Hou
+
+	* gcc.dg/vect/vect-reduc-dot-s16c.c: Add a test case with dot product
+	on two arrays with short and int types. This should not be recognized
+	as a dot product pattern.
+
 2013-09-13  Kai Tietz

 	gcc.target/i386/pr57848.c: New file.

On Wed, Sep 11, 2013 at 6:55 PM, Xinliang David Li wrote:
> Can you add a test case to the regression suite?
>
> When the type of arguments are unsigned short/unsigned int, GCC does
> not vectorize the loop anymore -- this is worth a separate bug to
> track. punpcklwd instruction can be used to do zero extension of the
> short type.
>
> David
>
> On Wed, Sep 11, 2013 at 6:16 PM, Cong Hou wrote:
>> Hi
>>
>> There is a bug in the function vect_recog_dot_prod_pattern() in
>> tree-vect-patterns.c. This function checks whether a loop is of
>> dot-product pattern. Specifically, according to the comment of this
>> function:
>>
>> /*
>> Try to find the following pattern:
>>
>>      type x_t, y_t;
>>      TYPE1 prod;
>>      TYPE2 sum = init;
>>    loop:
>>      sum_0 = phi <init, sum_1>
>>      S1 x_t = ...
>>      S2 y_t = ...
>>      S3 x_T = (TYPE1) x_t;
>>      S4 y_T = (TYPE1) y_t;
>>      S5 prod = x_T * y_T;
>>      [S6 prod = (TYPE2) prod;  #optional]
>>      S7 sum_1 = prod + sum_0;
>>
>>    where 'TYPE1' is exactly double the size of type 'type', and
>> 'TYPE2' is the same size of 'TYPE1' or bigger. This is a special case
>> of a reduction computation.
>> */
>>
>> This function should check if x_t and y_t have the same type (ty
Re: [PATCH] Fixing improper conversion from sin() to sinf() in optimization mode.
Any comments or more suggestions on this patch?

thanks,
Cong


On Mon, Sep 9, 2013 at 7:28 PM, Cong Hou wrote:
> On Mon, Sep 9, 2013 at 6:26 PM, Xinliang David Li wrote:
>> On Fri, Sep 6, 2013 at 3:24 PM, Cong Hou wrote:
>>> First, thank you for your detailed comments again! And I deeply
>>> apologize for not explaining my patch properly and for not responding
>>> adequately to your previous comments. I didn't understand the problem
>>> thoroughly before submitting the patch.
>>>
>>> Previously I only considered the following three conversions for sqrt():
>>>
>>> 1: (float) sqrt ((double) float_val) -> sqrtf (float_val)
>>> 2: (float) sqrtl ((long double) float_val) -> sqrtf (float_val)
>>> 3: (double) sqrtl ((long double) double_val) -> sqrt (double_val)
>>>
>>> We have four types here:
>>>
>>> TYPE is the type to which the result of the function call is converted.
>>> ITYPE is the type of the math call expression.
>>> TREE_TYPE(arg0) is the type of the function argument (before type
>>> conversion).
>>> NEWTYPE is chosen as whichever of TYPE and TREE_TYPE(arg0) has higher
>>> precision. It will be the type of the new math call expression after
>>> conversion.
>>>
>>> For all three cases above, TYPE is always the same as NEWTYPE. That is
>>> why I only considered TYPE during the precision comparison. ITYPE can
>>> only be double_type_node or long_double_type_node depending on the
>>> type of the math function. That is why I explicitly used those two
>>> types instead of ITYPE (no correctness issue). But you are right,
>>> ITYPE is more elegant and better here.
>>>
>>> After further analysis, I found I missed two more cases. Note that we
>>> have the following conditions according to the code in convert.c:
>>>
>>> TYPE_PRECISION(NEWTYPE) >= TYPE_PRECISION(TYPE)
>>> TYPE_PRECISION(NEWTYPE) >= TYPE_PRECISION(TREE_TYPE(arg0))
>>> TYPE_PRECISION(NEWTYPE) < TYPE_PRECISION(ITYPE)
>>>
>>> The last condition comes from the fact that we only consider
>>> converting a math function call into another one with a narrower type.
>>> Therefore we have
>>>
>>> TYPE_PRECISION(TYPE) < TYPE_PRECISION(ITYPE)
>>> TYPE_PRECISION(TREE_TYPE(arg0)) < TYPE_PRECISION(ITYPE)
>>>
>>> So for sqrt(), TYPE and TREE_TYPE(arg0) can only be float, and for
>>> sqrtl(), TYPE and TREE_TYPE(arg0) can be either float or double, with
>>> four possible combinations. Therefore we have two more conversions to
>>> consider besides the three ones I mentioned above:
>>>
>>> 4: (float) sqrtl ((long double) double_val) -> (float) sqrt (double_val)
>>> 5: (double) sqrtl ((long double) float_val) -> sqrt ((double) float_val)
>>>
>>> For the first conversion here, TYPE (float) is different from NEWTYPE
>>> (double), and my previous patch doesn't handle this case. The correct
>>> way is to compare the precisions of ITYPE and NEWTYPE now.
>>>
>>> To sum up, we are converting the expression
>>>
>>> (TYPE) sqrtITYPE ((ITYPE) expr)
>>>
>>> to
>>>
>>> (TYPE) sqrtNEWTYPE ((NEWTYPE) expr)
>>>
>>> and we require
>>>
>>> PRECISION (ITYPE) >= PRECISION (NEWTYPE) * 2 + 2
>>>
>>> to make it a safe conversion.
>>>
>>> The new patch is pasted below.
>>>
>>> I appreciate your detailed comments and analysis, and next time when I
>>> submit a patch I will be more careful about the reviewer's comments.
>>>
>>> Thank you!
>>>
>>> Cong
>>>
>>>
>>> Index: gcc/convert.c
>>> ===================================================================
>>> --- gcc/convert.c (revision 201891)
>>> +++ gcc/convert.c (working copy)
>>> @@ -135,16 +135,19 @@ convert_to_real (tree type, tree expr)
>>>       CASE_MATHFN (COS)
>>>       CASE_MATHFN (ERF)
>>>       CASE_MATHFN (ERFC)
>>> -      CASE_MATHFN (FABS)
>>>       CASE_MATHFN (LOG)
>>>       CASE_MATHFN (LOG10)
>>>       CASE_MATHFN (LOG2)
>>>       CASE_MATHFN (LOG1P)
>>> -      CASE_MATHFN (LOGB)
>>>       CASE_MATHFN (SIN)
>>> -      CASE_MATHFN (SQRT)
>>>       CASE_MATHFN (TAN)
>>>       CASE_MATHFN (TANH)
>>> +      /* The above functions are not safe to do this conversion. */
>>> +      if (!fla
[PATCH] Bug fix: *var and MEM[(const int *)var] (var has int* type) are not treated as the same data ref.
(I have also created this issue in the bug tracker: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58513)

First look at the code below:

int op(const int* a, const int* b)
{
  return *a + *b;
}

void foo(int* a, int b)
{
  int i;
  for (i = 0; i < 10; ++i)
    a[i] = op(a + i, &b);
}

GCC generates the following GIMPLE for this loop after inlining op():

  :
  # i_15 = PHI
  # ivtmp_23 = PHI
  _4 = (long unsigned int) i_15;
  _5 = _4 * 4;
  _7 = a_6(D) + _5;
  _10 = MEM[(const int *)_7];
  _11 = _10 + b_12(D);
  *_7 = _11;
  i_9 = i_15 + 1;
  ivtmp_22 = ivtmp_23 - 1;
  if (ivtmp_22 != 0)
    goto ;
  else
    goto ;

Here each element of the array a is loaded by MEM[(const int *)_7] and stored through *_7, which are the only two data refs in the loop body. The GCC vectorizer needs to check for possible aliasing between data refs with potential data dependence. Here those two data refs are actually the same one, but GCC cannot recognize this fact. As a result, the generated runtime alias check will always evaluate to false (GCC 4.9 can eliminate this generated branch at the end of the vectorization pass).

The reason GCC treats MEM[(const int *)_7] and *_7 as two different data refs is a possible defect in the function operand_equal_p(), which is used to compare two data refs. The current implementation uses == to compare the types of the second argument of the MEM_REF operator, which is too strict. Using types_compatible_p() instead fixes the issue above. I have also included a test case for this bug fix. Bootstrapping and "make check" both pass.

thanks,
Cong


Index: gcc/testsuite/gcc.dg/alias-14.c
===================================================================
--- gcc/testsuite/gcc.dg/alias-14.c (revision 0)
+++ gcc/testsuite/gcc.dg/alias-14.c (revision 0)
@@ -0,0 +1,24 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize" } */
+
+int op (const int* x, const int* y)
+{
+  return *x + *y;
+}
+
+/* After inlining op() the type of the data ref is converted from int* into
+   const int&, resulting in two data refs MEM[(const int *)DR] and *DR for read
+   and write, where DR represents the address of a[i] here. They are still
+   the same data ref and no alias exists in the loop. The vectorizer should
+   successfully vectorize this loop. */
+
+void foo(int* a, int b)
+{
+  int i;
+  for (i = 0; i < 10; ++i)
+    a[i] = op(a + i, &b);
+}
+
+
+/* { dg-final { scan-assembler-times "paddd" 1 { target x86_64-*-* } } } */
+

Index: gcc/fold-const.c
===================================================================
--- gcc/fold-const.c (revision 202662)
+++ gcc/fold-const.c (working copy)
@@ -2693,8 +2693,9 @@ operand_equal_p (const_tree arg0, const_
 		    && operand_equal_p (TYPE_SIZE (TREE_TYPE (arg0)),
 					TYPE_SIZE (TREE_TYPE (arg1)), flags)))
 		&& types_compatible_p (TREE_TYPE (arg0), TREE_TYPE (arg1))
-		&& (TYPE_MAIN_VARIANT (TREE_TYPE (TREE_OPERAND (arg0, 1)))
-		    == TYPE_MAIN_VARIANT (TREE_TYPE (TREE_OPERAND (arg1, 1))))
+		&& types_compatible_p (
+		     TYPE_MAIN_VARIANT (TREE_TYPE (TREE_OPERAND (arg0, 1))),
+		     TYPE_MAIN_VARIANT (TREE_TYPE (TREE_OPERAND (arg1, 1))))
 		&& OP_SAME (0) && OP_SAME (1));

     case ARRAY_REF:
Re: [PATCH] Bug fix: *var and MEM[(const int *)var] (var has int* type) are not treated as the same data ref.
Nice fix! I noticed that this patch has already been committed to trunk. Thank you very much, Richard!

Cong


On Tue, Sep 24, 2013 at 1:49 AM, Richard Biener wrote:
> On Tue, 24 Sep 2013, Richard Biener wrote:
>
>> On Tue, 24 Sep 2013, Jakub Jelinek wrote:
>>
>> > Hi!
>> >
>> > On Mon, Sep 23, 2013 at 05:26:13PM -0700, Cong Hou wrote:
>> >
>> > Missing ChangeLog entry.
>> >
>> > > --- gcc/testsuite/gcc.dg/alias-14.c (revision 0)
>> > > +++ gcc/testsuite/gcc.dg/alias-14.c (revision 0)
>> >
>> > Vectorizer tests should go into gcc.dg/vect/ instead, or, if they are
>> > for a single target (but there is no reason why this should be a single
>> > target), into gcc.target/<target>/.
>> >
>> > > --- gcc/fold-const.c (revision 202662)
>> > > +++ gcc/fold-const.c (working copy)
>> > > @@ -2693,8 +2693,9 @@ operand_equal_p (const_tree arg0, const_
>> > >              && operand_equal_p (TYPE_SIZE (TREE_TYPE (arg0)),
>> > >                                  TYPE_SIZE (TREE_TYPE (arg1)), flags)))
>> > >          && types_compatible_p (TREE_TYPE (arg0), TREE_TYPE (arg1))
>> > > -        && (TYPE_MAIN_VARIANT (TREE_TYPE (TREE_OPERAND (arg0, 1)))
>> > > -            == TYPE_MAIN_VARIANT (TREE_TYPE (TREE_OPERAND (arg1, 1))))
>> > > +        && types_compatible_p (
>> > > +             TYPE_MAIN_VARIANT (TREE_TYPE (TREE_OPERAND (arg0, 1))),
>> > > +             TYPE_MAIN_VARIANT (TREE_TYPE (TREE_OPERAND (arg1, 1))))
>> > >          && OP_SAME (0) && OP_SAME (1));
>> >
>> > This looks wrong. types_compatible_p will happily return true say
>> > for unsigned long and unsigned long long types on x86_64, because
>> > they are both integral types with the same precision, but the second
>> > argument of MEM_REF contains aliasing information, where the distinction
>> > between the two is important.
>> > So, while == comparison of main variant is too strict, types_compatible_p
>> > is too weak, so I guess you need to write a new predicate that will either
>> > handle the == and a few special cases that are safe to be handled, or
>> > look for what exactly we use the type of the second MEM_REF argument
>> > and check those properties. We certainly need that
>> > get_deref_alias_set_1 and get_deref_alias_set return the same values
>> > for both the types, but whether that is the only information we are using,
>> > not sure, CCing Richard.
>>
>> Using TYPE_MAIN_VARIANT is exactly correct - this is the best we
>> can do that will work with all frontends. TYPE_MAIN_VARIANT
>> guarantees that the alias-sets stay the same:
>>
>>   /* If the innermost reference is a MEM_REF that has a
>>      conversion embedded treat it like a VIEW_CONVERT_EXPR above,
>>      using the memory access type for determining the alias-set. */
>>   if (TREE_CODE (inner) == MEM_REF
>>       && TYPE_MAIN_VARIANT (TREE_TYPE (inner))
>>          != TYPE_MAIN_VARIANT
>>               (TREE_TYPE (TREE_TYPE (TREE_OPERAND (inner, 1)))))
>>     return get_deref_alias_set (TREE_OPERAND (inner, 1));
>>
>> so we cannot change the compatibility checks without touching the
>> alias-set deriving code. For the testcase in question we have
>> MEM[(const int &)_7] vs. MEM[(int *)_7] and unfortunately pointer
>> and reference types are not variant types.
>>
>> We also cannot easily resort to pointed-to types as our all-beloved
>> ref-all qualification is on the pointer type rather than on the
>> pointed-to type.
>>
>> But yes, we could implement a more complicated predicate
>>
>> bool
>> alias_ptr_types_compatible_p (const_tree t1, const_tree t2)
>> {
>>   t1 = TYPE_MAIN_VARIANT (t1);
>>   t2 = TYPE_MAIN_VARIANT (t2);
>>   if (t1 == t2)
>>     return true;
>>
>>   if (TYPE_REF_CAN_ALIAS_ALL (t1)
>>       || TYPE_REF_CAN_ALIAS_ALL (t2))
>>     return false;
>>
>>   return (TYPE_MAIN_VARIANT (TREE_TYPE (t1))
>>           == TYPE_MAIN_VARIANT (TREE_TYPE (t2)));
>> }
>>
>> Note that the fold-const code in question is
>>
>>     return ((TYPE_SIZE (TREE_TYPE (arg0)) == TYPE_SIZE (TREE_TYPE (arg1))
>>              || (TYPE_SIZE (TREE_TYPE (arg0))
>>                  && TYPE_SIZE (TREE_TYPE (arg1))
>>                  && operand_equal_p (TYPE_SIZE (TREE_TYPE (arg0)),
>>                                      TYPE_SIZE (TREE_TYPE (arg1)), &g
[PATCH] Relax the requirement of reduction pattern in GCC vectorizer.
The current GCC vectorizer requires the following pattern for a simple reduction computation:

  loop_header:
    a1 = phi < a0, a2 >
    a3 = ...
    a2 = operation (a3, a1)

But a3 can also be defined outside of the loop. For example, the following loop would benefit from vectorization, but the GCC vectorizer fails to vectorize it:

int foo(int v)
{
  int s = 1;
  ++v;
  for (int i = 0; i < 10; ++i)
    s *= v;
  return s;
}

This patch relaxes the original requirement by also considering the following pattern:

  a3 = ...
  loop_header:
    a1 = phi < a0, a2 >
    a2 = operation (a3, a1)

A test case is also added. The patch is tested on x86-64.

thanks,
Cong


diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 39c786e..45c1667 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,9 @@
+2013-09-27  Cong Hou
+
+	* tree-vect-loop.c: Relax the requirement of the reduction
+	pattern so that one operand of the reduction operation can
+	come from outside of the loop.
+
 2013-09-25  Tom Tromey

 	* Makefile.in (PARTITION_H, LTO_SYMTAB_H, COMMON_TARGET_DEF_H)
diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
index 09644d2..90496a2 100644
--- a/gcc/testsuite/ChangeLog
+++ b/gcc/testsuite/ChangeLog
@@ -1,3 +1,7 @@
+2013-09-27  Cong Hou
+
+	* gcc.dg/vect/vect-reduc-pattern-3.c: New test.
+
 2013-09-25  Marek Polacek

 	PR sanitizer/58413
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 2871ba1..3c51c3b 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -2091,6 +2091,13 @@ vect_is_slp_reduction (loop_vec_info loop_info, gimple phi, gimple first_stmt)
      a3 = ...
      a2 = operation (a3, a1)

+  or
+
+  a3 = ...
+  loop_header:
+    a1 = phi < a0, a2 >
+    a2 = operation (a3, a1)
+
   such that:
   1. operation is commutative and associative and it is safe to
      change the order of the computation (if CHECK_REDUCTION is true)
@@ -2451,6 +2458,7 @@ vect_is_simple_reduction_1 (loop_vec_info loop_info, gimple phi,
   if (def2 && def2 == phi
       && (code == COND_EXPR
           || !def1 || gimple_nop_p (def1)
+          || !flow_bb_inside_loop_p (loop, gimple_bb (def1))
           || (def1 && flow_bb_inside_loop_p (loop, gimple_bb (def1))
               && (is_gimple_assign (def1)
                   || is_gimple_call (def1)
@@ -2469,6 +2477,7 @@ vect_is_simple_reduction_1 (loop_vec_info loop_info, gimple phi,
   if (def1 && def1 == phi
       && (code == COND_EXPR
           || !def2 || gimple_nop_p (def2)
+          || !flow_bb_inside_loop_p (loop, gimple_bb (def2))
           || (def2 && flow_bb_inside_loop_p (loop, gimple_bb (def2))
               && (is_gimple_assign (def2)
                   || is_gimple_call (def2)
diff --git gcc/testsuite/gcc.dg/vect/vect-reduc-pattern-3.c gcc/testsuite/gcc.dg/vect/vect-reduc-pattern-3.c
new file mode 100644
index 000..06a9416
--- /dev/null
+++ gcc/testsuite/gcc.dg/vect/vect-reduc-pattern-3.c
@@ -0,0 +1,41 @@
+/* { dg-require-effective-target vect_int } */
+
+#include <stdarg.h>
+#include "tree-vect.h"
+
+#define N 10
+#define RES 1024
+
+/* A reduction pattern in which there is no data ref in
+   the loop and one operand is defined outside of the loop. */
+
+__attribute__ ((noinline)) int
+foo (int v)
+{
+  int i;
+  int result = 1;
+
+  ++v;
+  for (i = 0; i < N; i++)
+    result *= v;
+
+  return result;
+}
+
+int
+main (void)
+{
+  int res;
+
+  check_vect ();
+
+  res = foo (1);
+  if (res != RES)
+    abort ();
+
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
+/* { dg-final { cleanup-tree-dump "vect" } } */
[PATCH] Improving uniform_vector_p() function.
The current uniform_vector_p() function only returns non-NULL when the vector is directly a uniform vector. For example, for the following GIMPLE code:

  vect_cst_.15_91 = {_9, _9, _9, _9, _9, _9, _9, _9};

the current implementation can detect that {_9, _9, _9, _9, _9, _9, _9, _9} is a uniform vector, but fails to recognize that vect_cst_.15_91 is also one. This simple patch searches through assignment chains to find more uniform vectors.

thanks,
Cong


diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 45c1667..b42f8a9 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,9 @@
+2013-10-01  Cong Hou
+
+	* tree.c: Improve the function uniform_vector_p() so that a
+	vector assigned with a uniform vector is also treated as a
+	uniform vector.
+
diff --git a/gcc/tree.c b/gcc/tree.c
index 1c881e4..1d6d894 100644
--- a/gcc/tree.c
+++ b/gcc/tree.c
@@ -10297,6 +10297,17 @@ uniform_vector_p (const_tree vec)
       return first;
     }

+  if (TREE_CODE (vec) == SSA_NAME)
+    {
+      gimple def = SSA_NAME_DEF_STMT (vec);
+      if (gimple_code (def) == GIMPLE_ASSIGN)
+        {
+          tree rhs = gimple_op (def, 1);
+          if (VECTOR_TYPE_P (TREE_TYPE (rhs)))
+            return uniform_vector_p (rhs);
+        }
+    }
+
   return NULL_TREE;
 }
Re: [PATCH] Improving uniform_vector_p() function.
Actually I will introduce optimizations in the next patch. Currently the function uniform_vector_p() is rarely used in GCC, but there are certainly some optimization opportunities it enables. For example, when we widen a vector of 8 identical elements of short type to two vectors of int type, GCC emits the following code:

  vect_cst_.15_91 = {_9, _9, _9, _9, _9, _9, _9, _9};
  vect__10.16_92 = [vec_unpack_lo_expr] vect_cst_.15_91;
  vect__10.16_93 = [vec_unpack_hi_expr] vect_cst_.15_91;

When vect_cst_.15_91 is known to be a uniform vector, we know vect__10.16_92 and vect__10.16_93 are identical, so we can remove the second [vec_unpack_hi_expr] operation:

  vect_cst_.15_91 = {_9, _9, _9, _9, _9, _9, _9, _9};
  vect__10.16_92 = [vec_unpack_lo_expr] vect_cst_.15_91;
  vect__10.16_93 = vect__10.16_92;

thanks,
Cong


On Tue, Oct 1, 2013 at 2:37 PM, Xinliang David Li wrote:
> On Tue, Oct 1, 2013 at 10:31 AM, Cong Hou wrote:
>> The current uniform_vector_p() function only returns non-NULL when the
>> vector is directly a uniform vector. For example, for the following
>> gimple code:
>>
>> vect_cst_.15_91 = {_9, _9, _9, _9, _9, _9, _9, _9};
>>
>> The current implementation can only detect that {_9, _9, _9, _9, _9,
>> _9, _9, _9} is a uniform vector, but fails to recognize
>> vect_cst_.15_91 is also one. This simple patch searches through
>> assignment chains to find more uniform vectors.
>>
>> thanks,
>> Cong
>>
>>
>> diff --git a/gcc/ChangeLog b/gcc/ChangeLog
>> index 45c1667..b42f8a9 100644
>> --- a/gcc/ChangeLog
>> +++ b/gcc/ChangeLog
>> @@ -1,3 +1,9 @@
>> +2013-10-01  Cong Hou
>> +
>> +	* tree.c: Improve the function uniform_vector_p() so that a
>> +	vector assigned with a uniform vector is also treated as a
>> +	uniform vector.
>> +
>> diff --git a/gcc/tree.c b/gcc/tree.c
>> index 1c881e4..1d6d894 100644
>> --- a/gcc/tree.c
>> +++ b/gcc/tree.c
>> @@ -10297,6 +10297,17 @@ uniform_vector_p (const_tree vec)
>>        return first;
>>      }
>>
>> +  if (TREE_CODE (vec) == SSA_NAME)
>> +    {
>> +      gimple def = SSA_NAME_DEF_STMT (vec);
>> +      if (gimple_code (def) == GIMPLE_ASSIGN)
>
> do this:
>
> if (is_gimple_assign (def) && gimple_assign_copy_p (def))
>
>> +        {
>> +          tree rhs = gimple_op (def, 1);
>> +          if (VECTOR_TYPE_P (TREE_TYPE (rhs)))
>> +            return uniform_vector_p (rhs);
>> +        }
>> +    }
>> +
>>    return NULL_TREE;
>> }
>
> Do you have a test case showing what missed optimization this fix can enable?
>
> David
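For context, a source-level loop of roughly the following shape (my own hypothetical example, not taken from the patch or the thread) is the kind of input that can make the vectorizer broadcast a loop-invariant short value into a uniform vector and then widen it with a vec_unpack_lo/hi pair, as in the GIMPLE above:

/* Widening multiply-accumulate with a loop-invariant short operand:
   the invariant s may be broadcast into a uniform vector of shorts
   and then unpacked (widened) to int lanes.  */
int
scale_sum (short *a, short s, int n)
{
  int i;
  int sum = 0;
  for (i = 0; i < n; ++i)
    sum += a[i] * s;  /* a[i] * s is a widening multiply to int.  */
  return sum;
}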
[PATCH] Reducing number of alias checks in vectorization.
)->dest);
+          gsi_move_after (&si, &si_dst);
+        }
+      continue;
+    }
+  else if (!dr)
+    {
+      bool hoist = true;
+      for (size_t i = 0; i < gimple_num_ops (stmt); i++)
+        {
+          tree op = gimple_op (stmt, i);
+          if (TREE_CODE (op) == INTEGER_CST
+              || TREE_CODE (op) == REAL_CST)
+            continue;
+          if (TREE_CODE (op) == SSA_NAME)
+            {
+              gimple def = SSA_NAME_DEF_STMT (op);
+              if (def == stmt
+                  || gimple_nop_p (def)
+                  || !flow_bb_inside_loop_p (loop, gimple_bb (def)))
+                continue;
+            }
+          hoist = false;
+          break;
+        }
+
+      if (hoist)
+        {
+          basic_block preheader = loop_preheader_edge (loop)->src;
+          gimple_stmt_iterator si_dst = gsi_last_bb (preheader);
+          gsi_move_after (&si, &si_dst);
+          continue;
+        }
+    }
+  gsi_next (&si);
+}

   /* End loop-exit-fixes after versioning. */

   if (cond_expr_stmt_list)

Index: gcc/ChangeLog
===================================================================
--- gcc/ChangeLog (revision 202663)
+++ gcc/ChangeLog (working copy)
@@ -1,3 +1,8 @@
+2013-10-01  Cong Hou
+
+	* tree-vect-loop-manip.c (vect_create_cond_for_alias_checks): Combine
+	alias checks if it is possible to amortize the runtime overhead.
+
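For context, the zero-stride hoisting that leaked into this patch (and which Richard asks to split out below) targets loops like the following, my own illustrative example built from the b[0] + b[100] case discussed later in this thread:

/* Before hoisting: b[0] and b[100] are re-loaded in every iteration
   because a[i] might alias them.  After loop versioning with the
   runtime alias check, the no-alias copy of the loop can load both
   values once, in the preheader.  */
void
foo (int *a, int *b, int n)
{
  int i;
  for (i = 0; i < n; ++i)
    a[i] = b[0] + b[100];
}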
Re: [PATCH] Reducing number of alias checks in vectorization.
On Tue, Oct 1, 2013 at 11:35 PM, Jakub Jelinek wrote:
> On Tue, Oct 01, 2013 at 07:12:54PM -0700, Cong Hou wrote:
>> --- gcc/tree-vect-loop-manip.c (revision 202662)
>> +++ gcc/tree-vect-loop-manip.c (working copy)
>
> Your mailer ate all the tabs, so the formatting of the whole patch
> can't be checked.

I'll pay attention to this problem in my later patch submissions.

>> @@ -19,6 +19,10 @@ You should have received a copy of the G
>> along with GCC; see the file COPYING3. If not see
>> <http://www.gnu.org/licenses/>. */
>>
>> +#include <vector>
>> +#include <utility>
>> +#include <algorithm>
>
> Why? GCC has its vec.h vectors, why don't you use those?
> There is even qsort method for you in there. And for pairs, you can
> easily just use structs with two members as structure elements in the
> vector.

GCC is now restructured using C++, and the STL is one of the most important parts of C++. I am new to the GCC community and more familiar with the STL (and I think allowing the STL in GCC could attract more new developers). I agree that using GCC's vec maintains a uniform style, but the STL is just so powerful and easy to use... I did a search in the GCC source tree and found <vector> is not used yet. I will change std::vector to GCC's vec for now (and likewise qsort), but I am still wondering if one day GCC will accept the STL.

>> +struct dr_addr_with_seg_len
>> +{
>> +  dr_addr_with_seg_len (data_reference* d, tree addr, tree off, tree len)
>> +    : dr (d), basic_addr (addr), offset (off), seg_len (len) {}
>> +
>> +  data_reference* dr;
>
> Space should be before *, not after it.
>
>> +  if (TREE_CODE (p11.offset) != INTEGER_CST
>> +      || TREE_CODE (p21.offset) != INTEGER_CST)
>> +    return p11.offset < p21.offset;
>
> If offset isn't INTEGER_CST, you are comparing the pointer values?
> That is never a good idea, then compilation will depend on how say address
> space randomization randomizes virtual address space. GCC needs to have
> reproduceable compilations.

In this scenario comparing pointers is safe. The sort is used to put together any two pairs of data refs which can be merged. For example, if we have (a, b), (a, c), (a, b+1), then after sorting them we should have either (a, b), (a, b+1), (a, c) or (a, c), (a, b), (a, b+1). We don't care about the relative order of "non-mergeable" dr pairs here. So although the sorting result may vary, the final result we get should not change.

>> +  if (int_cst_value (p11.offset) != int_cst_value (p21.offset))
>> +    return int_cst_value (p11.offset) < int_cst_value (p21.offset);
>
> This is going to ICE whenever the offsets wouldn't fit into a
> HOST_WIDE_INT.
>
> I'd say you just shouldn't put into the vector entries where offset isn't
> host_integerp, those would never be merged with other checks, or something
> similar.

Do you mean I should use widest_int_cst_value()? Then I will replace all uses of int_cst_value() here with it. I have also changed the type of the "diff" variable to HOST_WIDEST_INT.

Thank you very much for your comments!

Cong

> Jakub
Re: [PATCH] Relax the requirement of reduction pattern in GCC vectorizer.
Ping. Any comments on this patch?

thanks,
Cong


On Sat, Sep 28, 2013 at 9:34 AM, Xinliang David Li wrote:
> You can also add a test case of this form:
>
> int foo(int t, int n, int *dst)
> {
>    int j = 0;
>    int s = 1;
>    t++;
>    for (j = 0; j < n; j++)
>      {
>        dst[j] = t;
>        s *= t;
>      }
>
>    return s;
> }
>
> where without the fix the loop vectorization is missed.
>
> David
>
> On Fri, Sep 27, 2013 at 6:28 PM, Cong Hou wrote:
>> The current GCC vectorizer requires the following pattern for a simple
>> reduction computation:
>>
>>   loop_header:
>>     a1 = phi < a0, a2 >
>>     a3 = ...
>>     a2 = operation (a3, a1)
>>
>> But a3 can also be defined outside of the loop. For example, the
>> following loop can benefit from vectorization but the GCC vectorizer
>> fails to vectorize it:
>>
>> int foo(int v)
>> {
>>   int s = 1;
>>   ++v;
>>   for (int i = 0; i < 10; ++i)
>>     s *= v;
>>   return s;
>> }
>>
>> This patch relaxes the original requirement by also considering the
>> following pattern:
>>
>>   a3 = ...
>>   loop_header:
>>     a1 = phi < a0, a2 >
>>     a2 = operation (a3, a1)
>>
>> A test case is also added. The patch is tested on x86-64.
>>
>> thanks,
>> Cong
>>
>>
>> diff --git a/gcc/ChangeLog b/gcc/ChangeLog
>> index 39c786e..45c1667 100644
>> --- a/gcc/ChangeLog
>> +++ b/gcc/ChangeLog
>> @@ -1,3 +1,9 @@
>> +2013-09-27  Cong Hou
>> +
>> +	* tree-vect-loop.c: Relax the requirement of the reduction
>> +	pattern so that one operand of the reduction operation can
>> +	come from outside of the loop.
>> +
>>  2013-09-25  Tom Tromey
>>
>>  	* Makefile.in (PARTITION_H, LTO_SYMTAB_H, COMMON_TARGET_DEF_H)
>> diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
>> index 09644d2..90496a2 100644
>> --- a/gcc/testsuite/ChangeLog
>> +++ b/gcc/testsuite/ChangeLog
>> @@ -1,3 +1,7 @@
>> +2013-09-27  Cong Hou
>> +
>> +	* gcc.dg/vect/vect-reduc-pattern-3.c: New test.
>> +
>>  2013-09-25  Marek Polacek
>>
>>  	PR sanitizer/58413
>> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
>> index 2871ba1..3c51c3b 100644
>> --- a/gcc/tree-vect-loop.c
>> +++ b/gcc/tree-vect-loop.c
>> @@ -2091,6 +2091,13 @@ vect_is_slp_reduction (loop_vec_info loop_info, gimple phi, gimple first_stmt)
>>      a3 = ...
>>      a2 = operation (a3, a1)
>>
>> +  or
>> +
>> +  a3 = ...
>> +  loop_header:
>> +    a1 = phi < a0, a2 >
>> +    a2 = operation (a3, a1)
>> +
>>   such that:
>>   1. operation is commutative and associative and it is safe to
>>      change the order of the computation (if CHECK_REDUCTION is true)
>> @@ -2451,6 +2458,7 @@ vect_is_simple_reduction_1 (loop_vec_info loop_info, gimple phi,
>>   if (def2 && def2 == phi
>>       && (code == COND_EXPR
>>           || !def1 || gimple_nop_p (def1)
>> +          || !flow_bb_inside_loop_p (loop, gimple_bb (def1))
>>           || (def1 && flow_bb_inside_loop_p (loop, gimple_bb (def1))
>>               && (is_gimple_assign (def1)
>>                   || is_gimple_call (def1)
>> @@ -2469,6 +2477,7 @@ vect_is_simple_reduction_1 (loop_vec_info loop_info, gimple phi,
>>   if (def1 && def1 == phi
>>       && (code == COND_EXPR
>>           || !def2 || gimple_nop_p (def2)
>> +          || !flow_bb_inside_loop_p (loop, gimple_bb (def2))
>>           || (def2 && flow_bb_inside_loop_p (loop, gimple_bb (def2))
>>               && (is_gimple_assign (def2)
>>                   || is_gimple_call (def2)
>> diff --git gcc/testsuite/gcc.dg/vect/vect-reduc-pattern-3.c gcc/testsuite/gcc.dg/vect/vect-reduc-pattern-3.c
>> new file mode 100644
>> index 000..06a9416
>> --- /dev/null
>> +++ gcc/testsuite/gcc.dg/vect/vect-reduc-pattern-3.c
>> @@ -0,0 +1,41 @@
>> +/* { dg-require-effective-target vect_int } */
>> +
>> +#include <stdarg.h>
>> +#include "tree-vect.h"
>> +
>> +#define N 10
>> +#define RES 1024
>> +
>> +/* A reduction pattern in which there is no data ref in
>> +   the loop and one operand is defined outside of the loop. */
>> +
>> +__attribute__ ((noinline)) int
>> +foo (int v)
>> +{
>> +  int i;
>> +  int result = 1;
>> +
>> +  ++v;
>> +  for (i = 0; i < N; i++)
>> +    result *= v;
>> +
>> +  return result;
>> +}
>> +
>> +int
>> +main (void)
>> +{
>> +  int res;
>> +
>> +  check_vect ();
>> +
>> +  res = foo (1);
>> +  if (res != RES)
>> +    abort ();
>> +
>> +  return 0;
>> +}
>> +
>> +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
>> +/* { dg-final { cleanup-tree-dump "vect" } } */
>> +
Re: [PATCH] Reducing number of alias checks in vectorization.
On Wed, Oct 2, 2013 at 4:24 AM, Richard Biener wrote:
> On Tue, 1 Oct 2013, Cong Hou wrote:
>
>> When alias exists between data refs in a loop, to vectorize it GCC
>> does loop versioning and adds runtime alias checks. Basically, for each
>> pair of data refs with possible data dependence, two comparisons are
>> generated to make sure there is no aliasing between them in each
>> iteration of the vectorized loop. If there are many such data ref
>> pairs, the number of comparisons can be very large, which is a big
>> overhead.
>>
>> However, in some cases it is possible to reduce the number of those
>> comparisons. For example, for the following loop, we can detect that
>> b[0] and b[1] are two consecutive member accesses so that we can
>> combine the alias check between a[0:100]&b[0] and a[0:100]&b[1] into
>> checking a[0:100]&b[0:2]:
>>
>> void foo(int* a, int* b)
>> {
>>   for (int i = 0; i < 100; ++i)
>>     a[i] = b[0] + b[1];
>> }
>>
>> Actually, the requirement of consecutive memory accesses is too
>> strict. For the following loop, we can still combine the alias checks
>> between a[0:100]&b[0] and a[0:100]&b[100]:
>>
>> void foo(int* a, int* b)
>> {
>>   for (int i = 0; i < 100; ++i)
>>     a[i] = b[0] + b[100];
>> }
>>
>> This is because if b[0] is not in a[0:100] and b[100] is not in
>> a[0:100], then a[0:100] cannot be between b[0] and b[100]. We only need
>> to check that a[0:100] and b[0:101] don't overlap.
>>
>> More generally, consider two pairs of data refs (a, b1) and (a, b2).
>> Suppose addr_b1 and addr_b2 are the base addresses of data refs b1 and
>> b2; offset_b1 and offset_b2 (offset_b1 < offset_b2) are the offsets of
>> b1 and b2; and segment_length_a, segment_length_b1, and
>> segment_length_b2 are the segment lengths of a, b1, and b2. Then we can
>> combine the two comparisons into one if the following condition is
>> satisfied:
>>
>> offset_b2 - offset_b1 - segment_length_b1 < segment_length_a
>>
>> This patch detects those combination opportunities to reduce the
>> number of alias checks. It is tested on an x86-64 machine.
>
> Apart from the other comments you got (to which I agree) the patch
> seems to do two things, namely also:
>
> + /* Extract load and store statements on pointers with zero-stride
> +    accesses. */
> + if (LOOP_REQUIRES_VERSIONING_FOR_ALIAS (loop_vinfo))
> +   {
>
> which I'd rather see in a separate patch (and done also when
> the loop doesn't require versioning for alias).

My mistake: I am working on the two patches at the same time and pasted part of that one here by accident. I will send a separate patch about the hoisting topic.

> Also combining the alias checks in vect_create_cond_for_alias_checks
> is nice but doesn't properly fix the use of the
> vect-max-version-for-alias-checks param which currently inhibits
> vectorization of the HIMENO benchmark by default (and makes us look bad
> compared to LLVM).
>
> So I believe this merging should be done incrementally when
> we collect the DDRs we need to test in vect_mark_for_runtime_alias_test.

I agree that the vect-max-version-for-alias-checks param should count the number of checks after the merge. However, the struct data_dependence_relation cannot record the new information produced by the merge, namely the new segment lengths for the comparisons. These lengths are currently calculated in the vect_create_cond_for_alias_checks() function.
Since vect-max-version-for-alias-checks is used during the analysis phase, shall we move all those steps (getting the segment length for each data ref and merging alias checks) from the transformation phase to the analysis phase? If we cannot store the result properly (data_dependence_relation is not enough), shall we do it twice, once in each phase? I also noticed a possible bug in the function vect_same_range_drs() called by vect_prune_runtime_alias_test_list(). For the following code I get two pairs of data refs after vect_prune_runtime_alias_test_list(), but in vect_create_cond_for_alias_checks(), after detecting grouped accesses, I got two identical pairs of data refs. The consequence is that two identical alias checks are produced. void yuv2yuyv_ref (int *d, int *src, int n) { char *dest = (char *)d; int i; for(i=0;i<n;i++){ dest[i*4 + 0] = (src[i*2 + 1])>>16; dest[i*4 + 1] = (src[i*2 + 1])>>8; dest[i*4 + 2] = (src[i*2 + 0])>>16; dest[i*4 + 3] = (src[i*2 + 0])>>0; } } I think the solution to this problem is changing GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt_i)) == GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt_j)) into STMT_VINFO_DATA_REF (vinfo_for_stmt (GROUP_FIRST_ELEMENT (v
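[Editor's illustration.] To make the merging condition above concrete, here is a minimal C sketch of the test it implies; all names are illustrative and nothing here comes from the patch itself:

/* Sketch: can the checks (a vs b1) and (a vs b2) be merged into one
   check of a against the combined segment of b1 and b2?  Assumes
   offset_b1 < offset_b2 and that all quantities use the same units.  */
static int
can_merge_alias_checks (long offset_b1, long seg_len_b1,
                        long offset_b2, long seg_len_a)
{
  /* If a's segment cannot fit into the gap between the end of b1's
     segment and the start of b2's, then a either overlaps neither
     access or it overlaps the combined segment, so one comparison
     against the combined segment suffices.  */
  return offset_b2 - offset_b1 - seg_len_b1 < seg_len_a;
}

Under this predicate, the two checks against b[0] and b[100] in the earlier example collapse into one check against b[0:101]: 100 - 0 - 1 < 100 holds.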
Re: [PATCH] Reducing number of alias checks in vectorization.
On Wed, Oct 2, 2013 at 2:18 PM, Xinliang David Li wrote: > On Wed, Oct 2, 2013 at 4:24 AM, Richard Biener wrote: >> On Tue, 1 Oct 2013, Cong Hou wrote: >> >>> When alias exists between data refs in a loop, to vectorize it GCC >>> does loop versioning and adds runtime alias checks. Basically for each >>> pair of data refs with possible data dependence, there will be two >>> comparisons generated to make sure there is no aliasing between them >>> in each iteration of the vectorized loop. If there are many such data >>> refs pairs, the number of comparisons can be very large, which is a >>> big overhead. >>> >>> However, in some cases it is possible to reduce the number of those >>> comparisons. For example, for the following loop, we can detect that >>> b[0] and b[1] are two consecutive member accesses so that we can >>> combine the alias check between a[0:100]&b[0] and a[0:100]&b[1] into >>> checking a[0:100]&b[0:2]: >>> >>> void foo(int*a, int* b) >>> { >>>for (int i = 0; i < 100; ++i) >>> a[i] = b[0] + b[1]; >>> } >>> >>> Actually, the requirement of consecutive memory accesses is too >>> strict. For the following loop, we can still combine the alias checks >>> between a[0:100]&b[0] and a[0:100]&b[100]: >>> >>> void foo(int*a, int* b) >>> { >>>for (int i = 0; i < 100; ++i) >>> a[i] = b[0] + b[100]; >>> } >>> >>> This is because if b[0] is not in a[0:100] and b[100] is not in >>> a[0:100] then a[0:100] cannot be between b[0] and b[100]. We only need >>> to check a[0:100] and b[0:101] don't overlap. >>> >>> More generally, consider two pairs of data refs (a, b1) and (a, b2). >>> Suppose addr_b1 and addr_b2 are basic addresses of data ref b1 and b2; >>> offset_b1 and offset_b2 (offset_b1 < offset_b2) are offsets of b1 and >>> b2, and segment_length_a, segment_length_b1, and segment_length_b2 are >>> segment length of a, b1, and b2. Then we can combine the two >>> comparisons into one if the following condition is satisfied: >>> >>> offset_b2- offset_b1 - segment_length_b1 < segment_length_a >>> >>> >>> This patch detects those combination opportunities to reduce the >>> number of alias checks. It is tested on an x86-64 machine. >> >> Apart from the other comments you got (to which I agree) the patch >> seems to do two things, namely also: >> >> + /* Extract load and store statements on pointers with zero-stride >> + accesses. */ >> + if (LOOP_REQUIRES_VERSIONING_FOR_ALIAS (loop_vinfo)) >> +{ >> >> which I'd rather see in a separate patch (and done also when >> the loop doesn't require versioning for alias). > > yes. > >> >> Also combining the alias checks in vect_create_cond_for_alias_checks >> is nice but doesn't properly fix the use of the >> vect-max-version-for-alias-checks param > > Yes. The handling of this should be moved to > 'vect_prune_runtime_alias_test_list' to avoid premature decisions. > > > >>which currently inhibits >> vectorization of the HIMENO benchmark by default (and make us look bad >> compared to LLVM). > > Here is a small reproducible: > > struct A { > int *base; > int offset; > int offset2; > int offset3; > int offset4; > int offset5; > int offset6; > int offset7; > int offset8; > }; > > void foo (struct A * ar1, struct A* ar2) > { > int i; > for (i = 0; i < 1; i++) > { >ar1->base[i] = 2*ar2->base[i] + ar2->offset + ar2->offset2 > + ar2->offset3 + ar2->offset4 + ar2->offset5 + ar2->offset6; /* + > ar2->offset7 + ar2->offset8;*/ > } > } > > GCC trunk won't vectorize it at O2 due to the limit. 
> > > There is another problem we should be tracking: GCC no longer > vectorizes the loop (with a large > --param=vect-max-version-for-alias-checks=40) when -fno-strict-aliasing > is specified. However, with an additional runtime alias check, the loop > should be vectorizable. The problem can be reproduced by the following loop: void foo (int* a, int** b) { int i; for (i = 0; i < 1000; ++i) a[i] = (*b)[i]; } When -fno-strict-aliasing is specified, the base address of (*b)[i], which is *b, could be modified by a[i] if aliasing exists between them. This forbids GCC from making the base address of (*b)[i] a loop invariant, and hence it could not do
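[Editor's illustration.] A hand-written sketch of the versioned form that the extra runtime check would enable; this illustrates the idea only and is not what GCC emits. The 1000 bound and the predicate form are assumptions taken from the example above:

/* One combined check that a[0:1000] overlaps neither the pointer
   object *b nor the array it points to; in the fast path, base is
   provably invariant and the loop can be vectorized.  */
void
foo_versioned (int *a, int **b)
{
  int *base = *b;
  int no_alias = (a + 1000 <= (int *) b || (int *) (b + 1) <= a)
                 && (a + 1000 <= base || base + 1000 <= a);
  int i;
  if (no_alias)
    for (i = 0; i < 1000; ++i)   /* vectorizable: base is invariant */
      a[i] = base[i];
  else
    for (i = 0; i < 1000; ++i)   /* conservative original loop */
      a[i] = (*b)[i];
}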
Re: [PATCH] Reducing number of alias checks in vectorization.
Forgot to mention that the alias check merger can reduce the number of checks from 7 to 2 for this example: struct A { int *base; int offset; int offset2; int offset3; int offset4; int offset5; int offset6; int offset7; int offset8; }; void foo (struct A * ar1, struct A* ar2) { int i; for (i = 0; i < 1; i++) { ar1->base[i] = 2*ar2->base[i] + ar2->offset + ar2->offset2 + ar2->offset3 + ar2->offset4 + ar2->offset5 + ar2->offset6; /* + ar2->offset7 + ar2->offset8;*/ } } thanks, Cong On Wed, Oct 2, 2013 at 2:18 PM, Xinliang David Li wrote: > On Wed, Oct 2, 2013 at 4:24 AM, Richard Biener wrote: >> On Tue, 1 Oct 2013, Cong Hou wrote: >> >>> When alias exists between data refs in a loop, to vectorize it GCC >>> does loop versioning and adds runtime alias checks. Basically for each >>> pair of data refs with possible data dependence, there will be two >>> comparisons generated to make sure there is no aliasing between them >>> in each iteration of the vectorized loop. If there are many such data >>> refs pairs, the number of comparisons can be very large, which is a >>> big overhead. >>> >>> However, in some cases it is possible to reduce the number of those >>> comparisons. For example, for the following loop, we can detect that >>> b[0] and b[1] are two consecutive member accesses so that we can >>> combine the alias check between a[0:100]&b[0] and a[0:100]&b[1] into >>> checking a[0:100]&b[0:2]: >>> >>> void foo(int*a, int* b) >>> { >>> for (int i = 0; i < 100; ++i) >>> a[i] = b[0] + b[1]; >>> } >>> >>> Actually, the requirement of consecutive memory accesses is too >>> strict. For the following loop, we can still combine the alias checks >>> between a[0:100]&b[0] and a[0:100]&b[100]: >>> >>> void foo(int*a, int* b) >>> { >>> for (int i = 0; i < 100; ++i) >>> a[i] = b[0] + b[100]; >>> } >>> >>> This is because if b[0] is not in a[0:100] and b[100] is not in >>> a[0:100] then a[0:100] cannot be between b[0] and b[100]. We only need >>> to check a[0:100] and b[0:101] don't overlap. >>> >>> More generally, consider two pairs of data refs (a, b1) and (a, b2). >>> Suppose addr_b1 and addr_b2 are basic addresses of data ref b1 and b2; >>> offset_b1 and offset_b2 (offset_b1 < offset_b2) are offsets of b1 and >>> b2, and segment_length_a, segment_length_b1, and segment_length_b2 are >>> segment length of a, b1, and b2. Then we can combine the two >>> comparisons into one if the following condition is satisfied: >>> >>> offset_b2- offset_b1 - segment_length_b1 < segment_length_a >>> >>> >>> This patch detects those combination opportunities to reduce the >>> number of alias checks. It is tested on an x86-64 machine. >> >> Apart from the other comments you got (to which I agree) the patch >> seems to do two things, namely also: >> >> + /* Extract load and store statements on pointers with zero-stride >> + accesses. */ >> + if (LOOP_REQUIRES_VERSIONING_FOR_ALIAS (loop_vinfo)) >> + { >> >> which I'd rather see in a separate patch (and done also when >> the loop doesn't require versioning for alias). > > yes. > >> >> Also combining the alias checks in vect_create_cond_for_alias_checks >> is nice but doesn't properly fix the use of the >> vect-max-version-for-alias-checks param > > Yes. The handling of this should be moved to > 'vect_prune_runtime_alias_test_list' to avoid premature decisions. > > > >> which currently inhibits >> vectorization of the HIMENO benchmark by default (and make us look bad >> compared to LLVM).
> > Here is a small reproducible: > > struct A { > int *base; > int offset; > int offset2; > int offset3; > int offset4; > int offset5; > int offset6; > int offset7; > int offset8; > }; > > void foo (struct A * ar1, struct A* ar2) > { > int i; > for (i = 0; i < 1; i++) > { >ar1->base[i] = 2*ar2->base[i] + ar2->offset + ar2->offset2 > + ar2->offset3 + ar2->offset4 + ar2->offset5 + ar2->offset6; /* + > ar2->offset7 + ar2->offset8;*/ > } > } > > GCC trunk won't vectorize it at O2 due to the limit. > > > There is another problem we should be tracking: GCC no longer > vectorize
Re: [PATCH] Reducing number of alias checks in vectorization.
On Wed, Oct 2, 2013 at 2:47 PM, Xinliang David Li wrote: > I think you need to augment (using a wrapper class) the DDR to capture > more information about aliased memory pairs. It should be flexible > enough to handle the following cases (you don't have to handle all > cases in your first patch, but keep those in mind). In order to bring the information in this augmented structure from the analysis phase to the transformation phase, should we add one more member to loop_vec_info? Note that currently almost all vectorization-related information is contained in that struct. > > 1) All accesses in the same group have constant offsets: > > b[i], b[i+1], b[i+2] etc This is the easy case. > > 2) Accesses in the same group may have offset which is specified by a > unsigned value: > > unsigned N = ... > > b[i], b[i+N] If the value of N or its upper bound (see the next case) is unknown at compile time, we cannot merge the alias checks for a & b[i] and a & b[i+N]. This is because the segment of a may exist between b[i] and b[i+N]. > > 3) Accesses have offset with value range > 0: > > for (j = 0; j < 1; j++) > for (i = 0; i < ...; i++) > { > b[i] > b[i + j ] // j > 0 > } > If we know j is greater than 0 and has a constant upper bound, we can utilize this information during alias check merging. For an induction variable j, its upper bound can be queried easily. What if j is not an induction variable: unsigned j = ...; if (j < 1000) { for (i = 0; i < ...; i++) { b[i] b[i + j ] } } In the current GCC implementation, how do we get the upper bound of j here? Should we search the control-dependent predicate of the loop to see if we are lucky enough to get the upper bound of j? > > 4) base addresses are assigned from the same buffer: > > b1 = &buffer[0]; > b2 = &buffer[1]; > b3 = &buffer[2]; > > for (...) > { > ..b1[i].. > ..b2[i].. > .. > } This case helped me find a bug in my patch. Here the base address of b1 is an addr_expr &buffer instead of buffer. I should not compare the pointer values of two base addresses any more but should use operand_equal_p(). Then Jakub is right: I should not sort all ddr pairs by comparing pointer values. I once wrote a comparison function and will consider using it for sorting. > > 5) More elaborate case: > > for (i = 0; i < 3; i++) > base[i] = &buffer[i*N]; > > b1 = base[0]; > b2 = base[1]; > ... > for () > { > .. b1[i].. > .. > } After loop unrolling this case becomes the same as the last one. thanks, Cong > > David > > > On Wed, Oct 2, 2013 at 2:34 PM, Cong Hou wrote: >> On Wed, Oct 2, 2013 at 4:24 AM, Richard Biener wrote: >>> On Tue, 1 Oct 2013, Cong Hou wrote: >>> >>>> When alias exists between data refs in a loop, to vectorize it GCC >>>> does loop versioning and adds runtime alias checks. Basically for each >>>> pair of data refs with possible data dependence, there will be two >>>> comparisons generated to make sure there is no aliasing between them >>>> in each iteration of the vectorized loop. If there are many such data >>>> refs pairs, the number of comparisons can be very large, which is a >>>> big overhead. >>>> >>>> However, in some cases it is possible to reduce the number of those >>>> comparisons.
For example, for the following loop, we can detect that >>>> b[0] and b[1] are two consecutive member accesses so that we can >>>> combine the alias check between a[0:100]&b[0] and a[0:100]&b[1] into >>>> checking a[0:100]&b[0:2]: >>>> >>>> void foo(int*a, int* b) >>>> { >>>>for (int i = 0; i < 100; ++i) >>>> a[i] = b[0] + b[1]; >>>> } >>>> >>>> Actually, the requirement of consecutive memory accesses is too >>>> strict. For the following loop, we can still combine the alias checks >>>> between a[0:100]&b[0] and a[0:100]&b[100]: >>>> >>>> void foo(int*a, int* b) >>>> { >>>>for (int i = 0; i < 100; ++i) >>>> a[i] = b[0] + b[100]; >>>> } >>>> >>>> This is because if b[0] is not in a[0:100] and b[100] is not in >>>> a[0:100] then a[0:100] cannot be between b[0] and b[100]. We only need >>>> to check a[0:100] and b[0:101] don't overlap. >>>> >>&
Re: [PATCH] Reducing number of alias checks in vectorization.
I noticed that there is a "struct dataref_aux" defined in tree-vectorizer.h which is specific to the vectorizer pass and is stored in (void*)aux in "struct data_reference". Can we add one more field "segment_length" to dataref_aux so that we can pass this information for merging alias checks? Then we can avoid modifying or creating other structures. thanks, Cong On Wed, Oct 2, 2013 at 2:34 PM, Cong Hou wrote: > On Wed, Oct 2, 2013 at 4:24 AM, Richard Biener wrote: >> On Tue, 1 Oct 2013, Cong Hou wrote: >> >>> When alias exists between data refs in a loop, to vectorize it GCC >>> does loop versioning and adds runtime alias checks. Basically for each >>> pair of data refs with possible data dependence, there will be two >>> comparisons generated to make sure there is no aliasing between them >>> in each iteration of the vectorized loop. If there are many such data >>> refs pairs, the number of comparisons can be very large, which is a >>> big overhead. >>> >>> However, in some cases it is possible to reduce the number of those >>> comparisons. For example, for the following loop, we can detect that >>> b[0] and b[1] are two consecutive member accesses so that we can >>> combine the alias check between a[0:100]&b[0] and a[0:100]&b[1] into >>> checking a[0:100]&b[0:2]: >>> >>> void foo(int*a, int* b) >>> { >>> for (int i = 0; i < 100; ++i) >>> a[i] = b[0] + b[1]; >>> } >>> >>> Actually, the requirement of consecutive memory accesses is too >>> strict. For the following loop, we can still combine the alias checks >>> between a[0:100]&b[0] and a[0:100]&b[100]: >>> >>> void foo(int*a, int* b) >>> { >>> for (int i = 0; i < 100; ++i) >>> a[i] = b[0] + b[100]; >>> } >>> >>> This is because if b[0] is not in a[0:100] and b[100] is not in >>> a[0:100] then a[0:100] cannot be between b[0] and b[100]. We only need >>> to check a[0:100] and b[0:101] don't overlap. >>> >>> More generally, consider two pairs of data refs (a, b1) and (a, b2). >>> Suppose addr_b1 and addr_b2 are basic addresses of data ref b1 and b2; >>> offset_b1 and offset_b2 (offset_b1 < offset_b2) are offsets of b1 and >>> b2, and segment_length_a, segment_length_b1, and segment_length_b2 are >>> segment length of a, b1, and b2. Then we can combine the two >>> comparisons into one if the following condition is satisfied: >>> >>> offset_b2- offset_b1 - segment_length_b1 < segment_length_a >>> >>> >>> This patch detects those combination opportunities to reduce the >>> number of alias checks. It is tested on an x86-64 machine. >> >> Apart from the other comments you got (to which I agree) the patch >> seems to do two things, namely also: >> >> + /* Extract load and store statements on pointers with zero-stride >> + accesses. */ >> + if (LOOP_REQUIRES_VERSIONING_FOR_ALIAS (loop_vinfo)) >> + { >> >> which I'd rather see in a separate patch (and done also when >> the loop doesn't require versioning for alias). >> > > My mistake.. I am working on those two patches at the same time and > pasted that one also here by mistake. I will send another patch about > the "hoist" topic. > >> Also combining the alias checks in vect_create_cond_for_alias_checks >> is nice but doesn't properly fix the use of the >> vect-max-version-for-alias-checks param which currently inhibits >> vectorization of the HIMENO benchmark by default (and make us look bad >> compared to LLVM). >> >> So I believe this merging should be done incrementally when >> we collect the DDRs we need to test in vect_mark_for_runtime_alias_test.
>> > > > I agree that vect-max-version-for-alias-checks param should count the > number of checks after the merge. However, the struct > data_dependence_relation could not record the new information produced > by the merge. The new information I mentioned contains the new segment > length for comparisons. This length is calculated right in > vect_create_cond_for_alias_checks() function. Since > vect-max-version-for-alias-checks is used during analysis phase, shall > we move all those (get segment length for each data ref and merge > alias checks) from transformation to analysis phase? If we cannot > store the result properly (data_dependence_relation is not enough), > shall we do it twice in both phases? > > I also noticed a possibl
Re: [PATCH] Reducing number of alias checks in vectorization.
On Thu, Oct 3, 2013 at 2:06 PM, Joseph S. Myers wrote: > On Tue, 1 Oct 2013, Cong Hou wrote: > >> +#include >> +#include >> +#include >> + >> #include "config.h" > > Whatever the other issues about including these headers at all, any system > header (C or C++) must always be included *after* config.h, as config.h > may define feature test macros that are only properly effective if defined > before any system headers are included, and these macros (affecting such > things as the size of off_t) need to be consistent throughout GCC. > OK. Actually I did meet some conflicts when I put those three C++ headers after all other includes. Thank you for the comments. Cong > -- > Joseph S. Myers > jos...@codesourcery.com
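[Editor's illustration.] A minimal sketch of the ordering Joseph describes; config.h first, then GCC's conventional system.h and coretypes.h. The specific C++ headers the patch moved are not shown (they were lost in the archive), so the comment below only states the rule:

/* config.h can define feature-test macros (for example ones that
   decide the size of off_t); a header included before it may bake
   in inconsistent definitions, so config.h always comes first.  */
#include "config.h"
#include "system.h"
#include "coretypes.h"
/* Any additional C or C++ system headers a patch needs go here,
   never above config.h.  */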
Re: [PATCH] Reducing number of alias checks in vectorization.
Forget about this "aux" idea as the segment length for one data ref can be different in different dr pairs. In my patch I created a struct as shown below: struct dr_addr_with_seg_len { data_reference *dr; tree basic_addr; tree offset; tree seg_len; }; Note that basic_addr and offset can always be obtained from dr, but we need to store two segment lengths for each dr pair. It is improper to add a field to data_dependence_relation as it is defined outside of the vectorizer. We can change the type (a new one combining data_dependence_relation and segment length) of may_alias_ddrs in loop_vec_info to include such information, but we have to add a new type to tree-vectorizer.h which is only used in two places - still too much. One possible solution is that we create a local struct as shown above and a new function which returns the merged alias check information. This function will be called twice: once during the analysis phase and once in the transformation phase. Then we don't have to store the merged alias check information between those two phases. The additional time cost is minimal as there will not be too many data-dependent dr pairs in a loop. Any comment? thanks, Cong On Thu, Oct 3, 2013 at 10:57 AM, Cong Hou wrote: > I noticed that there is a "struct dataref_aux" defined in > tree-vectorizer.h which is specific to the vectorizer pass and is > stored in (void*)aux in "struct data_reference". Can we add one more > field "segment_length" to dataref_aux so that we can pass this > information for merging alias checks? Then we can avoid modifying or > creating other structures. > > > thanks, > Cong > > > On Wed, Oct 2, 2013 at 2:34 PM, Cong Hou wrote: >> On Wed, Oct 2, 2013 at 4:24 AM, Richard Biener wrote: >>> On Tue, 1 Oct 2013, Cong Hou wrote: >>> >>>> When alias exists between data refs in a loop, to vectorize it GCC >>>> does loop versioning and adds runtime alias checks. Basically for each >>>> pair of data refs with possible data dependence, there will be two >>>> comparisons generated to make sure there is no aliasing between them >>>> in each iteration of the vectorized loop. If there are many such data >>>> refs pairs, the number of comparisons can be very large, which is a >>>> big overhead. >>>> >>>> However, in some cases it is possible to reduce the number of those >>>> comparisons. For example, for the following loop, we can detect that >>>> b[0] and b[1] are two consecutive member accesses so that we can >>>> combine the alias check between a[0:100]&b[0] and a[0:100]&b[1] into >>>> checking a[0:100]&b[0:2]: >>>> >>>> void foo(int*a, int* b) >>>> { >>>> for (int i = 0; i < 100; ++i) >>>> a[i] = b[0] + b[1]; >>>> } >>>> >>>> Actually, the requirement of consecutive memory accesses is too >>>> strict. For the following loop, we can still combine the alias checks >>>> between a[0:100]&b[0] and a[0:100]&b[100]: >>>> >>>> void foo(int*a, int* b) >>>> { >>>> for (int i = 0; i < 100; ++i) >>>> a[i] = b[0] + b[100]; >>>> } >>>> >>>> This is because if b[0] is not in a[0:100] and b[100] is not in >>>> a[0:100] then a[0:100] cannot be between b[0] and b[100]. We only need >>>> to check a[0:100] and b[0:101] don't overlap. >>>> >>>> More generally, consider two pairs of data refs (a, b1) and (a, b2). >>>> Suppose addr_b1 and addr_b2 are basic addresses of data ref b1 and b2; >>>> offset_b1 and offset_b2 (offset_b1 < offset_b2) are offsets of b1 and >>>> b2, and segment_length_a, segment_length_b1, and segment_length_b2 are >>>> segment length of a, b1, and b2.
Then we can combine the two >>>> comparisons into one if the following condition is satisfied: >>>> >>>> offset_b2- offset_b1 - segment_length_b1 < segment_length_a >>>> >>>> >>>> This patch detects those combination opportunities to reduce the >>>> number of alias checks. It is tested on an x86-64 machine. >>> >>> Apart from the other comments you got (to which I agree) the patch >>> seems to do two things, namely also: >>> >>> + /* Extract load and store statements on pointers with zero-stride >>> + accesses. */ >>> + if (LOOP_REQUIRES_VERSIONING_FOR_ALIAS (loop_vinfo)) >>> +{ >>> >>> which I'd rather see in a separate patch (and done a
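[Editor's illustration.] A self-contained sketch of the two-phase idea proposed above: count the merged checks during analysis without building anything, then emit during transformation. All types here are stand-ins, not GCC's:

#include <stdlib.h>

/* One b-side access checked against a common data ref a.  */
struct check { long offset; long seg_len; };

static int
cmp_check (const void *p, const void *q)
{
  const struct check *x = p, *y = q;
  return (x->offset > y->offset) - (x->offset < y->offset);
}

/* Number of alias checks left after merging consecutive sorted checks
   that satisfy offset_2 - offset_1 - seg_len_1 < seg_len_a.  */
static int
count_merged_checks (struct check *c, int n, long seg_len_a)
{
  int i, count = n > 0 ? 1 : 0;
  qsort (c, n, sizeof *c, cmp_check);
  for (i = 1; i < n; i++)
    if (c[i].offset - c[i - 1].offset - c[i - 1].seg_len >= seg_len_a)
      count++;   /* gap too large: this check cannot be merged */
  return count;
}

Calling a routine like this once in vect_prune_runtime_alias_test_list() (count only, for the --param limit) and once in vect_create_cond_for_alias_checks() (to emit the comparisons) avoids storing merged results between the phases, which is what the message above proposes.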
[PATCH] Hoist loop invariant statements containing data refs with zero-step during loop-versioning in vectorization.
During loop versioning in vectorization, the alias check guarantees that any load of a data reference with zero-step is a loop invariant, which can be hoisted outside of the loop. After hoisting the load statement, there may be more loop invariant statements. This patch tries to find all those statements and hoists them before the loop. An example is shown below: for (i = 0; i < N; ++i) a[i] = *b + 1; After loop versioning the loop to be vectorized is guarded by if (b + 1 <= a || a + N <= b) which means there is no aliasing between *b and a[i]. The GIMPLE code of the loop body is: : # i_18 = PHI <0(4), i_29(6)> # ivtmp_22 = PHI <1(4), ivtmp_30(6)> _23 = (long unsigned int) i_18; _24 = _23 * 4; _25 = a_6(D) + _24; _26 = *b_8(D); => loop invariant _27 = _26 + 1; => loop invariant *_25 = _27; i_29 = i_18 + 1; ivtmp_30 = ivtmp_22 - 1; if (ivtmp_30 != 0) goto ; else goto ; After hoisting loop invariant statements: _26 = *b_8(D); _27 = _26 + 1; : # i_18 = PHI <0(4), i_29(6)> # ivtmp_22 = PHI <1(4), ivtmp_30(6)> _23 = (long unsigned int) i_18; _24 = _23 * 4; _25 = a_6(D) + _24; *_25 = _27; i_29 = i_18 + 1; ivtmp_30 = ivtmp_22 - 1; if (ivtmp_30 != 0) goto ; else goto ; This patch is related to the bug report http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58508 thanks, Cong diff --git gcc/testsuite/gcc.dg/vect/pr58508.c gcc/testsuite/gcc.dg/vect/pr58508.c new file mode 100644 index 000..cb22b50 --- /dev/null +++ gcc/testsuite/gcc.dg/vect/pr58508.c @@ -0,0 +1,20 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -ftree-vectorize -fdump-tree-vect-details" } */ + + +/* The GCC vectorizer generates loop versioning for the following loop + since there may exist aliasing between A and B. The predicate checks + if A may alias with B across all iterations. Then for the loop in + the true body, we can assert that *B is a loop invariant so that + we can hoist the load of *B before the loop body. */ +void foo (int* a, int* b) +{ + int i; + for (i = 0; i < 10; ++i) + a[i] = *b + 1; +} + + +/* { dg-final { scan-tree-dump-times "hoist" 2 "vect" } } */ +/* { dg-final { cleanup-tree-dump "vect" } } */
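[Editor's illustration.] In source terms, the transformation behaves like the following hand-written equivalent; the loop bound n and the exact predicate form are assumptions for the sketch, not compiler output:

/* Versioned form: in the no-alias branch, *b is loaded once and the
   loop becomes a vectorizable fill with an invariant value.  */
void
foo_versioned (int *a, int *b, int n)
{
  int i;
  if (b + 1 <= a || a + n <= b)  /* runtime no-alias test */
    {
      int t = *b + 1;            /* hoisted: *b is invariant here */
      for (i = 0; i < n; ++i)    /* vectorizable loop */
        a[i] = t;
    }
  else
    {
      for (i = 0; i < n; ++i)    /* conservative original loop */
        a[i] = *b + 1;
    }
}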
Re: [PATCH] Fixing improper conversion from sin() to sinf() in optimization mode.
Ping... thanks, Cong On Fri, Sep 20, 2013 at 9:49 AM, Cong Hou wrote: > Any comment or more suggestions on this patch? > > > thanks, > Cong > > On Mon, Sep 9, 2013 at 7:28 PM, Cong Hou wrote: >> On Mon, Sep 9, 2013 at 6:26 PM, Xinliang David Li wrote: >>> On Fri, Sep 6, 2013 at 3:24 PM, Cong Hou wrote: >>>> First, thank you for your detailed comments again! Then I deeply >>>> apologize for not explaining my patch properly and responding to your >>>> previous comment. I didn't understand thoroughly the problem before >>>> submitting the patch. >>>> >>>> Previously I only considered the following three conversions for sqrt(): >>>> >>>> >>>> 1: (float) sqrt ((double) float_val) -> sqrtf (float_val) >>>> 2: (float) sqrtl ((long double) float_val) -> sqrtf (float_val) >>>> 3: (double) sqrtl ((long double) double_val) -> sqrt (double_val) >>>> >>>> >>>> We have four types here: >>>> >>>> TYPE is the type to which the result of the function call is converted. >>>> ITYPE is the type of the math call expression. >>>> TREE_TYPE(arg0) is the type of the function argument (before type >>>> conversion). >>>> NEWTYPE is chosen from TYPE and TREE_TYPE(arg0) with higher precision. >>>> It will be the type of the new math call expression after conversion. >>>> >>>> For all three cases above, TYPE is always the same as NEWTYPE. That is >>>> why I only considered TYPE during the precision comparison. ITYPE can >>>> only be double_type_node or long_double_type_node depending on the >>>> type of the math function. That is why I explicitly used those two >>>> types instead of ITYPE (no correctness issue). But you are right, >>>> ITYPE is more elegant and better here. >>>> >>>> After further analysis, I found I missed two more cases. Note that we >>>> have the following conditions according to the code in convert.c: >>>> >>>> TYPE_PRECISION(NEWTYPE) >= TYPE_PRECISION(TYPE) >>>> TYPE_PRECISION(NEWTYPE) >= TYPE_PRECISION(TREE_TYPE(arg0)) >>>> TYPE_PRECISION (NEWTYPE) < TYPE_PRECISION (ITYPE) >>>> >>>> the last condition comes from the fact that we only consider >>>> converting a math function call into another one with narrower type. >>>> Therefore we have >>>> >>>> TYPE_PRECISION(TYPE) < TYPE_PRECISION (ITYPE) >>>> TYPE_PRECISION(TREE_TYPE(arg0)) < TYPE_PRECISION (ITYPE) >>>> >>>> So for sqrt(), TYPE and TREE_TYPE(arg0) can only be float, and for >>>> sqrtl(), TYPE and TREE_TYPE(arg0) can be either float or double with >>>> four possible combinations. Therefore we have two more conversions to >>>> consider besides the three ones I mentioned above: >>>> >>>> >>>> 4: (float) sqrtl ((long double) double_val) -> (float) sqrt (double_val) >>>> 5: (double) sqrtl ((long double) float_val) -> sqrt ((double) float_val) >>>> >>>> >>>> For the first conversion here, TYPE (float) is different from NEWTYPE >>>> (double), and my previous patch doesn't handle this case.The correct >>>> way is to compare precisions of ITYPE and NEWTYPE now. >>>> >>>> To sum up, we are converting the expression >>>> >>>> (TYPE) sqrtITYPE ((ITYPE) expr) >>>> >>>> to >>>> >>>> (TYPE) sqrtNEWTYPE ((NEWTYPE) expr) >>>> >>>> and we require >>>> >>>> PRECISION (ITYPE) >= PRECISION (NEWTYPE) * 2 + 2 >>>> >>>> to make it a safe conversion. >>>> >>>> >>>> The new patch is pasted below. >>>> >>>> I appreciate your detailed comments and analysis, and next time when I >>>> submit a patch I will be more carefully about the reviewer's comment. >>>> >>>> >>>> Thank you! 
>>>> >>>> Cong >>>> >>>> >>>> >>>> Index: gcc/convert.c >>>> === >>>> --- gcc/convert.c (revision 201891) >>>> +++ gcc/convert.c (working copy) >>>> @@ -135,16 +135,19 @@ convert_to_real (tree type, tree expr) >>>>CASE_MATHFN (COS) >>>>CASE_MATHFN (ERF) >
Re: [PATCH] Fixing improper conversion from sin() to sinf() in optimization mode.
You are right. I am not an expert on numerical analysis, but I tested your case and it proves that conversion number 4 is not safe. Now we have four conversions which are safe once the precision requirement is satisfied. I added a condition if (type != newtype) to remove the unsafe one, since in that case one more conversion is added, which leads to the unsafe result. If you think this condition does not make sense please let me know. The new patch is shown below (the attached file has tabs). Thank you very much! thanks, Cong Index: gcc/convert.c === --- gcc/convert.c (revision 203250) +++ gcc/convert.c (working copy) @@ -135,16 +135,19 @@ convert_to_real (tree type, tree expr) CASE_MATHFN (COS) CASE_MATHFN (ERF) CASE_MATHFN (ERFC) - CASE_MATHFN (FABS) CASE_MATHFN (LOG) CASE_MATHFN (LOG10) CASE_MATHFN (LOG2) CASE_MATHFN (LOG1P) - CASE_MATHFN (LOGB) CASE_MATHFN (SIN) - CASE_MATHFN (SQRT) CASE_MATHFN (TAN) CASE_MATHFN (TANH) + /* The above functions are not safe to do this conversion. */ + if (!flag_unsafe_math_optimizations) + break; + CASE_MATHFN (SQRT) + CASE_MATHFN (FABS) + CASE_MATHFN (LOGB) #undef CASE_MATHFN { tree arg0 = strip_float_extensions (CALL_EXPR_ARG (expr, 0)); @@ -155,13 +158,43 @@ convert_to_real (tree type, tree expr) if (TYPE_PRECISION (TREE_TYPE (arg0)) > TYPE_PRECISION (type)) newtype = TREE_TYPE (arg0); + /* We consider converting + + (T1) sqrtT2 ((T2) exprT3) + to + (T1) sqrtT4 ((T4) exprT3) + + , where T1 is TYPE, T2 is ITYPE, T3 is TREE_TYPE (ARG0), + and T4 is NEWTYPE. All those types are floating point types. + T4 (NEWTYPE) should be narrower than T2 (ITYPE). This conversion + is safe only if P1 >= P2*2+2, where P1 and P2 are precisions of + T2 and T4. See the following URL for a reference: + http://stackoverflow.com/questions/9235456/determining-floating-point-square-root + */ + if ((fcode == BUILT_IN_SQRT || fcode == BUILT_IN_SQRTL) + && !flag_unsafe_math_optimizations) + { + /* The following conversion is unsafe even when the precision condition + below is satisfied: + + (float) sqrtl ((long double) double_val) -> (float) sqrt (double_val) + */ + if (type != newtype) + break; + + int p1 = REAL_MODE_FORMAT (TYPE_MODE (itype))->p; + int p2 = REAL_MODE_FORMAT (TYPE_MODE (newtype))->p; + if (p1 < p2 * 2 + 2) + break; + } + /* Be careful about integer to fp conversions. These may overflow still. */ if (FLOAT_TYPE_P (TREE_TYPE (arg0)) && TYPE_PRECISION (newtype) < TYPE_PRECISION (itype) && (TYPE_MODE (newtype) == TYPE_MODE (double_type_node) || TYPE_MODE (newtype) == TYPE_MODE (float_type_node))) -{ + { tree fn = mathfn_built_in (newtype, fcode); if (fn) Index: gcc/ChangeLog === --- gcc/ChangeLog (revision 203250) +++ gcc/ChangeLog (working copy) @@ -1,3 +1,9 @@ +2013-10-07 Cong Hou + + * convert.c (convert_to_real): Forbid unsafe math function + conversions including sin/cos/log etc. Add precision check + for sqrt. + 2013-10-07 Bill Schmidt * config/rs6000/rs6000.c (altivec_expand_vec_perm_const_le): New. Index: gcc/testsuite/ChangeLog === --- gcc/testsuite/ChangeLog (revision 203250) +++ gcc/testsuite/ChangeLog (working copy) @@ -1,3 +1,7 @@ +2013-10-07 Cong Hou + + * gcc.c-torture/execute/20030125-1.c: Update. + 2013-10-07 Bill Schmidt * gcc.target/powerpc/pr43154.c: Skip for ppc64 little endian.
Index: gcc/testsuite/gcc.c-torture/execute/20030125-1.c === --- gcc/testsuite/gcc.c-torture/execute/20030125-1.c (revision 203250) +++ gcc/testsuite/gcc.c-torture/execute/20030125-1.c (working copy) @@ -44,11 +44,11 @@ __attribute__ ((noinline)) double sin(double a) { - abort (); + return a; } __attribute__ ((noinline)) float sinf(float a) { - return a; + abort (); } On Thu, Oct 3, 2013 at 5:06 PM, Joseph S. Myers wrote: > On Fri, 6 Sep 2013, Cong Hou wrote: > >> 4: (float) sqrtl ((long double) double_val) -> (float) sqrt (double_val) > > I don't believe this case is in fact safe even if precision (long double) >= precision (double) * 2 + 2 (when your patch would allow it). > > The result that precision (double) * 2 + 2 is sufficient for the result of > rounding the long double value to double to be the same as the result of > rounding once from infinite precision to double would I think also mean > the same when rounding of the infinite-precision result to float happens > once - that is, if instead of (float) sqrt (double_val) you have fsqrt > (double_val) (fsqrt being the proposed function in draft TS 18661-1 for > computing a square root
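[Editor's illustration.] The p1 >= 2*p2 + 2 criterion can be checked mechanically from float.h. The standalone program below is an illustration, not part of the patch; on x86 it reports that sqrt->sqrtf and sqrtl->sqrtf satisfy the precision test (53 and 64 against 2*24+2 = 50) while sqrtl->sqrt fails it (64 against 2*53+2 = 108). As Joseph's analysis above shows, the precision test alone is not sufficient for case 4, which is why the patch additionally requires type == newtype:

#include <float.h>
#include <stdio.h>

/* Evaluate the precision criterion for each narrowing the patch
   considers; the *_MANT_DIG macros give the significand precision p
   of each type on the host.  */
int
main (void)
{
  printf ("sqrt  -> sqrtf: %s\n",
          DBL_MANT_DIG >= 2 * FLT_MANT_DIG + 2 ? "safe" : "unsafe");
  printf ("sqrtl -> sqrtf: %s\n",
          LDBL_MANT_DIG >= 2 * FLT_MANT_DIG + 2 ? "safe" : "unsafe");
  printf ("sqrtl -> sqrt : %s\n",
          LDBL_MANT_DIG >= 2 * DBL_MANT_DIG + 2 ? "safe" : "unsafe");
  return 0;
}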
Fwd: [PATCH] Reducing number of alias checks in vectorization.
Sorry, I forgot to use plain-text mode; resending it. -- Forwarded message -- From: Cong Hou Date: Mon, Oct 14, 2013 at 3:29 PM Subject: Re: [PATCH] Reducing number of alias checks in vectorization. To: Richard Biener , GCC Patches Cc: Jakub Jelinek I have made a new patch for this issue according to your comments. There are several modifications to my previous patch: 1. Remove the use of STL features such as vector and sort. Use GCC's vec and qsort instead. 2. Comparisons between tree nodes are not based on their addresses any more. Use compare_tree() function instead. 3. The function vect_create_cond_for_alias_checks() now returns the number of alias checks. If its second parameter cond_expr is NULL, then this function only calculates the number of alias checks after the merging and won't generate comparison expressions. 4. The function vect_prune_runtime_alias_test_list() now uses vect_create_cond_for_alias_checks() to get the number of alias checks. The patch is attached as a text file. Please give me your comments on this patch. Thank you! Cong On Thu, Oct 3, 2013 at 2:35 PM, Cong Hou wrote: > > Forget about this "aux" idea as the segment length for one data ref > can be different in different dr pairs. > > In my patch I created a struct as shown below: > > struct dr_addr_with_seg_len > { > data_reference *dr; > tree basic_addr; > tree offset; > tree seg_len; > }; > > > Note that basic_addr and offset can always be obtained from dr, but we > need to store two segment lengths for each dr pair. It is improper to > add a field to data_dependence_relation as it is defined outside of the > vectorizer. We can change the type (a new one combining > data_dependence_relation and segment length) of may_alias_ddrs in > loop_vec_info to include such information, but we have to add a new > type to tree-vectorizer.h which is only used in two places - still too > much. > > One possible solution is that we create a local struct as shown above > and a new function which returns the merged alias check information. > This function will be called twice: once during the analysis phase and > once in the transformation phase. Then we don't have to store the merged > alias check information between those two phases. The additional time > cost is minimal as there will not be too many data-dependent dr pairs > in a loop. > > Any comment? > > > thanks, > Cong > > > On Thu, Oct 3, 2013 at 10:57 AM, Cong Hou wrote: > > I noticed that there is a "struct dataref_aux" defined in > > tree-vectorizer.h which is specific to the vectorizer pass and is > > stored in (void*)aux in "struct data_reference". Can we add one more > > field "segment_length" to dataref_aux so that we can pass this > > information for merging alias checks? Then we can avoid modifying or > > creating other structures. > > > > > > thanks, > > Cong > > > > > > On Wed, Oct 2, 2013 at 2:34 PM, Cong Hou wrote: > >> On Wed, Oct 2, 2013 at 4:24 AM, Richard Biener wrote: > >>> On Tue, 1 Oct 2013, Cong Hou wrote: > >>> > >>>> When alias exists between data refs in a loop, to vectorize it GCC > >>>> does loop versioning and adds runtime alias checks. Basically for each > >>>> pair of data refs with possible data dependence, there will be two > >>>> comparisons generated to make sure there is no aliasing between them > >>>> in each iteration of the vectorized loop. If there are many such data > >>>> refs pairs, the number of comparisons can be very large, which is a > >>>> big overhead.
> >>>> > >>>> However, in some cases it is possible to reduce the number of those > >>>> comparisons. For example, for the following loop, we can detect that > >>>> b[0] and b[1] are two consecutive member accesses so that we can > >>>> combine the alias check between a[0:100]&b[0] and a[0:100]&b[1] into > >>>> checking a[0:100]&b[0:2]: > >>>> > >>>> void foo(int*a, int* b) > >>>> { > >>>>for (int i = 0; i < 100; ++i) > >>>> a[i] = b[0] + b[1]; > >>>> } > >>>> > >>>> Actually, the requirement of consecutive memory accesses is too > >>>> strict. For the following loop, we can still combine the alias checks > >>>> between a[0:100]&b[0] and a[0:100]&b[100]: > >>>> > >>>> void foo(int*a, int* b) > >>>> { > >>>>for (int i = 0; i < 100; ++i) > >&
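[Editor's illustration.] Point 2 in the message above matters for reproducible builds: ordering tree nodes by pointer address varies from run to run, so the emitted checks could differ between otherwise identical compilations. A miniature of the idea, with strings standing in for trees (compare_tree in the patch plays the role strcmp plays here):

#include <string.h>

/* Stand-in for a DR pair keyed by its base and offset expressions.  */
struct dr_pair { const char *base; long offset; };

/* Structural comparator: depends only on the operands' contents,
   never on their addresses, so qsort produces a stable order across
   runs and hosts.  */
static int
cmp_dr_pair (const void *p, const void *q)
{
  const struct dr_pair *x = p, *y = q;
  int c = strcmp (x->base, y->base);
  if (c != 0)
    return c;
  return (x->offset > y->offset) - (x->offset < y->offset);
}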
Re: [PATCH] Hoist loop invariant statements containing data refs with zero-step during loop-versioning in vectorization.
Any comment on this patch? thanks, Cong On Thu, Oct 3, 2013 at 3:59 PM, Cong Hou wrote: > During loop versioning in vectorization, the alias check guarantees > that any load of a data reference with zero-step is a loop invariant, > which can be hoisted outside of the loop. After hoisting the load > statement, there may be more loop invariant statements. This patch > tries to find all those statements and hoists them before the loop. > > An example is shown below: > > > for (i = 0; i < N; ++i) > a[i] = *b + 1; > > > After loop versioning the loop to be vectorized is guarded by > > if (b + 1 <= a || a + N <= b) > > which means there is no aliasing between *b and a[i]. The GIMPLE code > of the loop body is: > > : > # i_18 = PHI <0(4), i_29(6)> > # ivtmp_22 = PHI <1(4), ivtmp_30(6)> > _23 = (long unsigned int) i_18; > _24 = _23 * 4; > _25 = a_6(D) + _24; > _26 = *b_8(D); => loop invariant > _27 = _26 + 1; => loop invariant > *_25 = _27; > i_29 = i_18 + 1; > ivtmp_30 = ivtmp_22 - 1; > if (ivtmp_30 != 0) > goto ; > else > goto ; > > > After hoisting loop invariant statements: > > > _26 = *b_8(D); > _27 = _26 + 1; > > : > # i_18 = PHI <0(4), i_29(6)> > # ivtmp_22 = PHI <1(4), ivtmp_30(6)> > _23 = (long unsigned int) i_18; > _24 = _23 * 4; > _25 = a_6(D) + _24; > *_25 = _27; > i_29 = i_18 + 1; > ivtmp_30 = ivtmp_22 - 1; > if (ivtmp_30 != 0) > goto ; > else > goto ; > > > This patch is related to the bug report > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58508 > > > thanks, > Cong
Re: [PATCH] Relax the requirement of reduction pattern in GCC vectorizer.
Ping... thanks, Cong On Wed, Oct 2, 2013 at 11:18 AM, Cong Hou wrote: > Ping.. Any comment on this patch? > > > thanks, > Cong > > > On Sat, Sep 28, 2013 at 9:34 AM, Xinliang David Li wrote: >> You can also add a test case of this form: >> >> int foo( int t, int n, int *dst) >> { >>int j = 0; >>int s = 1; >>t++; >>for (j = 0; j < n; j++) >> { >> dst[j] = t; >> s *= t; >> } >> >>return s; >> } >> >> where without the fix the loop vectorization is missed. >> >> David >> >> On Fri, Sep 27, 2013 at 6:28 PM, Cong Hou wrote: >>> The current GCC vectorizer requires the following pattern as a simple >>> reduction computation: >>> >>>loop_header: >>> a1 = phi < a0, a2 > >>> a3 = ... >>> a2 = operation (a3, a1) >>> >>> But a3 can also be defined outside of the loop. For example, the >>> following loop can benefit from vectorization but the GCC vectorizer >>> fails to vectorize it: >>> >>> >>> int foo(int v) >>> { >>> int s = 1; >>> ++v; >>> for (int i = 0; i < 10; ++i) >>> s *= v; >>> return s; >>> } >>> >>> >>> This patch relaxes the original requirement by also considering the >>> following pattern: >>> >>> >>>a3 = ... >>>loop_header: >>> a1 = phi < a0, a2 > >>> a2 = operation (a3, a1) >>> >>> >>> A test case is also added. The patch is tested on x86-64. >>> >>> >>> thanks, >>> Cong >>> >>> >>> >>> diff --git a/gcc/ChangeLog b/gcc/ChangeLog >>> index 39c786e..45c1667 100644 >>> --- a/gcc/ChangeLog >>> +++ b/gcc/ChangeLog >>> @@ -1,3 +1,9 @@ >>> +2013-09-27 Cong Hou >>> + >>> + * tree-vect-loop.c: Relax the requirement of the reduction >>> + pattern so that one operand of the reduction operation can >>> + come from outside of the loop. >>> + >>> 2013-09-25 Tom Tromey >>> >>> * Makefile.in (PARTITION_H, LTO_SYMTAB_H, COMMON_TARGET_DEF_H) >>> diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog >>> index 09644d2..90496a2 100644 >>> --- a/gcc/testsuite/ChangeLog >>> +++ b/gcc/testsuite/ChangeLog >>> @@ -1,3 +1,7 @@ >>> +2013-09-27 Cong Hou >>> + >>> + * gcc.dg/vect/vect-reduc-pattern-3.c: New test. >>> + >>> 2013-09-25 Marek Polacek >>> >>> PR sanitizer/58413 >>> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c >>> index 2871ba1..3c51c3b 100644 >>> --- a/gcc/tree-vect-loop.c >>> +++ b/gcc/tree-vect-loop.c >>> @@ -2091,6 +2091,13 @@ vect_is_slp_reduction (loop_vec_info loop_info, >>> gimple phi, gimple first_stmt) >>> a3 = ... >>> a2 = operation (a3, a1) >>> >>> + or >>> + >>> + a3 = ... >>> + loop_header: >>> + a1 = phi < a0, a2 > >>> + a2 = operation (a3, a1) >>> + >>> such that: >>> 1. 
operation is commutative and associative and it is safe to >>> change the order of the computation (if CHECK_REDUCTION is true) >>> @@ -2451,6 +2458,7 @@ vect_is_simple_reduction_1 (loop_vec_info >>> loop_info, gimple phi, >>> if (def2 && def2 == phi >>> && (code == COND_EXPR >>> || !def1 || gimple_nop_p (def1) >>> + || !flow_bb_inside_loop_p (loop, gimple_bb (def1)) >>> || (def1 && flow_bb_inside_loop_p (loop, gimple_bb (def1)) >>> && (is_gimple_assign (def1) >>> || is_gimple_call (def1) >>> @@ -2469,6 +2477,7 @@ vect_is_simple_reduction_1 (loop_vec_info >>> loop_info, gimple phi, >>> if (def1 && def1 == phi >>> && (code == COND_EXPR >>> || !def2 || gimple_nop_p (def2) >>> + || !flow_bb_inside_loop_p (loop, gimple_bb (def2)) >>> || (def2 && flow_bb_inside_loop_p (loop, gimple_bb (def2)) >>> && (is_gimple_assign (def2) >>> || is_gimple_call (def2) >>> diff --git gcc/testsuite/gcc.dg/vect/vect-reduc-pattern-3.c >>> gcc/testsuite/gcc.dg/vect/vect-reduc-pattern-3.c >>> new file mode 100644 >>> index 000..06a9416 >>> --- /dev/null >>> +++ gcc/testsuite/gcc.dg/vect/vect-reduc-pattern-3.c >>> @@ -0,0 +1,41 @@ >>> +/* { dg-require-effective-target vect_int } */ >>> + >>> +#include <stdarg.h> >>> +#include "tree-vect.h" >>> + >>> +#define N 10 >>> +#define RES 1024 >>> + >>> +/* A reduction pattern in which there is no data ref in >>> + the loop and one operand is defined outside of the loop. */ >>> + >>> +__attribute__ ((noinline)) int >>> +foo (int v) >>> +{ >>> + int i; >>> + int result = 1; >>> + >>> + ++v; >>> + for (i = 0; i < N; i++) >>> + result *= v; >>> + >>> + return result; >>> +} >>> + >>> +int >>> +main (void) >>> +{ >>> + int res; >>> + >>> + check_vect (); >>> + >>> + res = foo (1); >>> + if (res != RES) >>> + abort (); >>> + >>> + return 0; >>> +} >>> + >>> +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */ >>> +/* { dg-final { cleanup-tree-dump "vect" } } */ >>> +
Re: [PATCH] Relax the requirement of reduction pattern in GCC vectorizer.
I have corrected the ChangeLog format, and committed this patch. Thank you! Cong On Tue, Oct 15, 2013 at 6:38 AM, Richard Biener wrote: > On Sat, Sep 28, 2013 at 3:28 AM, Cong Hou wrote: >> The current GCC vectorizer requires the following pattern as a simple >> reduction computation: >> >>loop_header: >> a1 = phi < a0, a2 > >> a3 = ... >> a2 = operation (a3, a1) >> >> But a3 can also be defined outside of the loop. For example, the >> following loop can benefit from vectorization but the GCC vectorizer >> fails to vectorize it: >> >> >> int foo(int v) >> { >> int s = 1; >> ++v; >> for (int i = 0; i < 10; ++i) >> s *= v; >> return s; >> } >> >> >> This patch relaxes the original requirement by also considering the >> following pattern: >> >> >>a3 = ... >>loop_header: >> a1 = phi < a0, a2 > >> a2 = operation (a3, a1) >> >> >> A test case is also added. The patch is tested on x86-64. >> >> >> thanks, >> Cong >> >> >> >> diff --git a/gcc/ChangeLog b/gcc/ChangeLog >> index 39c786e..45c1667 100644 >> --- a/gcc/ChangeLog >> +++ b/gcc/ChangeLog >> @@ -1,3 +1,9 @@ >> +2013-09-27 Cong Hou >> + >> + * tree-vect-loop.c: Relax the requirement of the reduction > > ChangeLog format is > > * tree-vect-loop.c (vect_is_simple_reduction_1): Relax the > requirement of the reduction. > > Ok with that change. > > Thanks, > Richard. > >> + pattern so that one operand of the reduction operation can >> + come from outside of the loop. >> + >> 2013-09-25 Tom Tromey >> >> * Makefile.in (PARTITION_H, LTO_SYMTAB_H, COMMON_TARGET_DEF_H) >> diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog >> index 09644d2..90496a2 100644 >> --- a/gcc/testsuite/ChangeLog >> +++ b/gcc/testsuite/ChangeLog >> @@ -1,3 +1,7 @@ >> +2013-09-27 Cong Hou >> + >> + * gcc.dg/vect/vect-reduc-pattern-3.c: New test. >> + >> 2013-09-25 Marek Polacek >> >> PR sanitizer/58413 >> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c >> index 2871ba1..3c51c3b 100644 >> --- a/gcc/tree-vect-loop.c >> +++ b/gcc/tree-vect-loop.c >> @@ -2091,6 +2091,13 @@ vect_is_slp_reduction (loop_vec_info loop_info, >> gimple phi, gimple first_stmt) >> a3 = ... >> a2 = operation (a3, a1) >> >> + or >> + >> + a3 = ... >> + loop_header: >> + a1 = phi < a0, a2 > >> + a2 = operation (a3, a1) >> + >> such that: >> 1. 
operation is commutative and associative and it is safe to >> change the order of the computation (if CHECK_REDUCTION is true) >> @@ -2451,6 +2458,7 @@ vect_is_simple_reduction_1 (loop_vec_info >> loop_info, gimple phi, >> if (def2 && def2 == phi >> && (code == COND_EXPR >> || !def1 || gimple_nop_p (def1) >> + || !flow_bb_inside_loop_p (loop, gimple_bb (def1)) >> || (def1 && flow_bb_inside_loop_p (loop, gimple_bb (def1)) >> && (is_gimple_assign (def1) >> || is_gimple_call (def1) >> @@ -2469,6 +2477,7 @@ vect_is_simple_reduction_1 (loop_vec_info >> loop_info, gimple phi, >> if (def1 && def1 == phi >> && (code == COND_EXPR >> || !def2 || gimple_nop_p (def2) >> + || !flow_bb_inside_loop_p (loop, gimple_bb (def2)) >> || (def2 && flow_bb_inside_loop_p (loop, gimple_bb (def2)) >> && (is_gimple_assign (def2) >> || is_gimple_call (def2) >> diff --git gcc/testsuite/gcc.dg/vect/vect-reduc-pattern-3.c >> gcc/testsuite/gcc.dg/vect/vect-reduc-pattern-3.c >> new file mode 100644 >> index 000..06a9416 >> --- /dev/null >> +++ gcc/testsuite/gcc.dg/vect/vect-reduc-pattern-3.c >> @@ -0,0 +1,41 @@ >> +/* { dg-require-effective-target vect_int } */ >> + >> +#include <stdarg.h> >> +#include "tree-vect.h" >> + >> +#define N 10 >> +#define RES 1024 >> + >> +/* A reduction pattern in which there is no data ref in >> + the loop and one operand is defined outside of the loop. */ >> + >> +__attribute__ ((noinline)) int >> +foo (int v) >> +{ >> + int i; >> + int result = 1; >> + >> + ++v; >> + for (i = 0; i < N; i++) >> + result *= v; >> + >> + return result; >> +} >> + >> +int >> +main (void) >> +{ >> + int res; >> + >> + check_vect (); >> + >> + res = foo (1); >> + if (res != RES) >> + abort (); >> + >> + return 0; >> +} >> + >> +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */ >> +/* { dg-final { cleanup-tree-dump "vect" } } */ >> +
Re: [PATCH] Hoist loop invariant statements containing data refs with zero-step during loop-versioning in vectorization.
Thank you for your reminder, Jeff! I just noticed Richard's comment. I have modified the patch according to that. The new patch is attached. thanks, Cong On Tue, Oct 15, 2013 at 12:33 PM, Jeff Law wrote: > On 10/14/13 17:31, Cong Hou wrote: >> >> Any comment on this patch? > > Richi replied in the BZ you opened. > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58508 > > Essentially he said emit the load on the edge rather than in the block > itself. > jeff > diff --git a/gcc/ChangeLog b/gcc/ChangeLog index 8a38316..2637309 100644 --- a/gcc/ChangeLog +++ b/gcc/ChangeLog @@ -1,3 +1,8 @@ +2013-10-15 Cong Hou + + * tree-vect-loop-manip.c (vect_loop_versioning): Hoist loop invariant + statement that contains data refs with zero-step. + 2013-10-14 David Malcolm * dumpfile.h (gcc::dump_manager): New class, to hold state diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog index 075d071..9d0f4a5 100644 --- a/gcc/testsuite/ChangeLog +++ b/gcc/testsuite/ChangeLog @@ -1,3 +1,7 @@ +2013-10-15 Cong Hou + + * gcc.dg/vect/pr58508.c: New test. + 2013-10-14 Tobias Burnus PR fortran/58658 diff --git a/gcc/testsuite/gcc.dg/vect/pr58508.c b/gcc/testsuite/gcc.dg/vect/pr58508.c new file mode 100644 index 000..cb22b50 --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/pr58508.c @@ -0,0 +1,20 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -ftree-vectorize -fdump-tree-vect-details" } */ + + +/* The GCC vectorizer generates loop versioning for the following loop + since there may exist aliasing between A and B. The predicate checks + if A may alias with B across all iterations. Then for the loop in + the true body, we can assert that *B is a loop invariant so that + we can hoist the load of *B before the loop body. */ + +void foo (int* a, int* b) +{ + int i; + for (i = 0; i < 10; ++i) +a[i] = *b + 1; +} + + +/* { dg-final { scan-tree-dump-times "hoist" 2 "vect" } } */ +/* { dg-final { cleanup-tree-dump "vect" } } */ diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c index 574446a..f4fdec2 100644 --- a/gcc/tree-vect-loop-manip.c +++ b/gcc/tree-vect-loop-manip.c @@ -2477,6 +2477,92 @@ vect_loop_versioning (loop_vec_info loop_vinfo, adjust_phi_and_debug_stmts (orig_phi, e, PHI_RESULT (new_phi)); } + + /* Extract load and store statements on pointers with zero-stride + accesses. */ + if (LOOP_REQUIRES_VERSIONING_FOR_ALIAS (loop_vinfo)) +{ + /* In the loop body, we iterate each statement to check if it is a load +or store. Then we check the DR_STEP of the data reference. If +DR_STEP is zero, then we will hoist the load statement to the loop +preheader, and move the store statement to the loop exit. */ + + for (gimple_stmt_iterator si = gsi_start_bb (loop->header); + !gsi_end_p (si);) + { + gimple stmt = gsi_stmt (si); + stmt_vec_info stmt_info = vinfo_for_stmt (stmt); + struct data_reference *dr = STMT_VINFO_DATA_REF (stmt_info); + + if (dr && integer_zerop (DR_STEP (dr))) + { + if (DR_IS_READ (dr)) + { + if (dump_enabled_p ()) + { + dump_printf_loc + (MSG_NOTE, vect_location, + "hoist the statement to outside of the loop "); + dump_gimple_stmt (MSG_NOTE, TDF_SLIM, stmt, 0); + dump_printf (MSG_NOTE, "\n"); + } + + gsi_remove (&si, false); + gsi_insert_on_edge_immediate (loop_preheader_edge (loop), stmt); + } + /* TODO: We also consider vectorizing loops containing zero-step +data refs as writes. For example: + +int a[N], *s; +for (i = 0; i < N; i++) + *s += a[i]; + +In this case the write to *s can be also moved after the +loop. 
*/ + + continue; + } + else if (!dr) + { + bool hoist = true; + for (size_t i = 0; i < gimple_num_ops (stmt); i++) + { + tree op = gimple_op (stmt, i); + if (TREE_CODE (op) == INTEGER_CST + || TREE_CODE (op) == REAL_CST) + continue; + if (TREE_CODE (op) == SSA_NAME) + { + gimple def = SSA_NAME_DEF_STMT (op); + if (def == stmt + || gimple_nop_p (def) + || !flow_bb_inside_loop_p (loop, gimple_bb (def))) + continue; + } + hoist = false; + break; + } + + if (hoist) + { +
Re: [PATCH] Hoist loop invariant statements containing data refs with zero-step during loop-versioning in vectorization.
On Wed, Oct 16, 2013 at 2:02 AM, Richard Biener wrote: > On Tue, 15 Oct 2013, Cong Hou wrote: > >> Thank you for your reminder, Jeff! I just noticed Richard's comment. I >> have modified the patch according to that. >> >> The new patch is attached. > > (posting patches inline is easier for review, now you have to deal > with no quoting markers ;)) > > Comments inline. > > diff --git a/gcc/ChangeLog b/gcc/ChangeLog > index 8a38316..2637309 100644 > --- a/gcc/ChangeLog > +++ b/gcc/ChangeLog > @@ -1,3 +1,8 @@ > +2013-10-15 Cong Hou > + > + * tree-vect-loop-manip.c (vect_loop_versioning): Hoist loop invariant > + statement that contains data refs with zero-step. > + > 2013-10-14 David Malcolm > > * dumpfile.h (gcc::dump_manager): New class, to hold state > diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog > index 075d071..9d0f4a5 100644 > --- a/gcc/testsuite/ChangeLog > +++ b/gcc/testsuite/ChangeLog > @@ -1,3 +1,7 @@ > +2013-10-15 Cong Hou > + > + * gcc.dg/vect/pr58508.c: New test. > + > 2013-10-14 Tobias Burnus > > PR fortran/58658 > diff --git a/gcc/testsuite/gcc.dg/vect/pr58508.c > b/gcc/testsuite/gcc.dg/vect/pr58508.c > new file mode 100644 > index 000..cb22b50 > --- /dev/null > +++ b/gcc/testsuite/gcc.dg/vect/pr58508.c > @@ -0,0 +1,20 @@ > +/* { dg-do compile } */ > +/* { dg-options "-O2 -ftree-vectorize -fdump-tree-vect-details" } */ > + > + > +/* The GCC vectorizer generates loop versioning for the following loop > + since there may exist aliasing between A and B. The predicate checks > + if A may alias with B across all iterations. Then for the loop in > + the true body, we can assert that *B is a loop invariant so that > + we can hoist the load of *B before the loop body. */ > + > +void foo (int* a, int* b) > +{ > + int i; > + for (i = 0; i < 10; ++i) > +a[i] = *b + 1; > +} > + > + > +/* { dg-final { scan-tree-dump-times "hoist" 2 "vect" } } */ > +/* { dg-final { cleanup-tree-dump "vect" } } */ > diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c > index 574446a..f4fdec2 100644 > --- a/gcc/tree-vect-loop-manip.c > +++ b/gcc/tree-vect-loop-manip.c > @@ -2477,6 +2477,92 @@ vect_loop_versioning (loop_vec_info loop_vinfo, >adjust_phi_and_debug_stmts (orig_phi, e, PHI_RESULT (new_phi)); > } > > > Note that applying this kind of transform at this point invalidates > some of the earlier analysis the vectorizer performed (namely the > def-kind which now effectively gets vect_external_def from > vect_internal_def). In this case it doesn't seem to cause any > issues (we re-compute the def-kind everytime we need it (how wasteful)). > > + /* Extract load and store statements on pointers with zero-stride > + accesses. */ > + if (LOOP_REQUIRES_VERSIONING_FOR_ALIAS (loop_vinfo)) > +{ > + /* In the loop body, we iterate each statement to check if it is a load > +or store. Then we check the DR_STEP of the data reference. If > +DR_STEP is zero, then we will hoist the load statement to the loop > +preheader, and move the store statement to the loop exit. */ > > We don't move the store yet. Micha has a patch pending that enables > vectorization of zero-step stores. > > + for (gimple_stmt_iterator si = gsi_start_bb (loop->header); > + !gsi_end_p (si);) > > While technically ok now (vectorized loops contain a single basic block) > please use LOOP_VINFO_BBS () to get at the vector of basic-blcoks > and iterate over them like other code does. Have done it. 
> > + { > + gimple stmt = gsi_stmt (si); > + stmt_vec_info stmt_info = vinfo_for_stmt (stmt); > + struct data_reference *dr = STMT_VINFO_DATA_REF (stmt_info); > + > + if (dr && integer_zerop (DR_STEP (dr))) > + { > + if (DR_IS_READ (dr)) > + { > + if (dump_enabled_p ()) > + { > + dump_printf_loc > + (MSG_NOTE, vect_location, > + "hoist the statement to outside of the loop "); > > "hoisting out of the vectorized loop: " > > + dump_gimple_stmt (MSG_NOTE, TDF_SLIM, stmt, 0); > + dump_printf (MSG_NOTE, "\n"); > + } > + > + gsi_remove (&si, false); > + gsi_insert_on_edge_immediate (loop_preheader_edge (loop), > stmt); > > Note that this will result in a b
Re: [PATCH] Hoist loop invariant statements containing data refs with zero-step during loop-versioning in vectorization.
I tested this case with -fno-tree-loop-im and -fno-tree-pre, and it seems that GCC could hoist j+1 outside of the i loop: t3.c:5:5: note: hoisting out of the vectorized loop: _10 = (sizetype) j_25; t3.c:5:5: note: hoisting out of the vectorized loop: _11 = _10 + 1; t3.c:5:5: note: hoisting out of the vectorized loop: _12 = _11 * 4; t3.c:5:5: note: hoisting out of the vectorized loop: _14 = b_13(D) + _12; t3.c:5:5: note: hoisting out of the vectorized loop: _15 = *_14; t3.c:5:5: note: hoisting out of the vectorized loop: _16 = _15 + 1; But your suggestion is still nice as it can remove a branch and make the code more brief. I have updated the patch and also included the nested loop example into the test case. Thank you! Cong diff --git a/gcc/ChangeLog b/gcc/ChangeLog index 8a38316..2637309 100644 --- a/gcc/ChangeLog +++ b/gcc/ChangeLog @@ -1,3 +1,8 @@ +2013-10-15 Cong Hou + + * tree-vect-loop-manip.c (vect_loop_versioning): Hoist loop invariant + statement that contains data refs with zero-step. + 2013-10-14 David Malcolm * dumpfile.h (gcc::dump_manager): New class, to hold state diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog index 075d071..9d0f4a5 100644 --- a/gcc/testsuite/ChangeLog +++ b/gcc/testsuite/ChangeLog @@ -1,3 +1,7 @@ +2013-10-15 Cong Hou + + * gcc.dg/vect/pr58508.c: New test. + 2013-10-14 Tobias Burnus PR fortran/58658 diff --git a/gcc/testsuite/gcc.dg/vect/pr58508.c b/gcc/testsuite/gcc.dg/vect/pr58508.c new file mode 100644 index 000..6484a65 --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/pr58508.c @@ -0,0 +1,70 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -ftree-vectorize -fdump-tree-vect-details" } */ + + +/* The GCC vectorizer generates loop versioning for the following loop + since there may exist aliasing between A and B. The predicate checks + if A may alias with B across all iterations. Then for the loop in + the true body, we can assert that *B is a loop invariant so that + we can hoist the load of *B before the loop body. */ + +void test1 (int* a, int* b) +{ + int i; + for (i = 0; i < 10; ++i) +a[i] = *b + 1; +} + +/* A test case with nested loops. The load of b[j+1] in the inner + loop should be hoisted. */ + +void test2 (int* a, int* b) +{ + int i, j; + for (j = 0; j < 10; ++j) +for (i = 0; i < 10; ++i) + a[i] = b[j+1] + 1; +} + +/* A test case with ifcvt transformation. */ + +void test3 (int* a, int* b) +{ + int i, t; + for (i = 0; i < 1; ++i) +{ + if (*b > 0) + t = *b * 2; + else + t = *b / 2; + a[i] = t; +} +} + +/* A test case in which the store in the loop can be moved outside + in the versioned loop with alias checks. Note this loop won't + be vectorized. */ + +void test4 (int* a, int* b) +{ + int i; + for (i = 0; i < 10; ++i) +*a += b[i]; +} + +/* A test case in which the load and store in the loop to b + can be moved outside in the versioned loop with alias checks. + Note this loop won't be vectorized. */ + +void test5 (int* a, int* b) +{ + int i; + for (i = 0; i < 10; ++i) +{ + *b += a[i]; + a[i] = *b; +} +} + +/* { dg-final { scan-tree-dump-times "hoist" 8 "vect" } } */ +/* { dg-final { cleanup-tree-dump "vect" } } */ diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c index 574446a..1cc563c 100644 --- a/gcc/tree-vect-loop-manip.c +++ b/gcc/tree-vect-loop-manip.c @@ -2477,6 +2477,73 @@ vect_loop_versioning (loop_vec_info loop_vinfo, adjust_phi_and_debug_stmts (orig_phi, e, PHI_RESULT (new_phi)); } + + /* Extract load statements on memrefs with zero-stride accesses. 
*/ + + if (LOOP_REQUIRES_VERSIONING_FOR_ALIAS (loop_vinfo)) +{ + /* In the loop body, we iterate each statement to check if it is a load. + Then we check the DR_STEP of the data reference. If DR_STEP is zero, + then we will hoist the load statement to the loop preheader. */ + + basic_block *bbs = LOOP_VINFO_BBS (loop_vinfo); + int nbbs = loop->num_nodes; + + for (int i = 0; i < nbbs; ++i) + { + for (gimple_stmt_iterator si = gsi_start_bb (bbs[i]); + !gsi_end_p (si);) +{ + gimple stmt = gsi_stmt (si); + stmt_vec_info stmt_info = vinfo_for_stmt (stmt); + struct data_reference *dr = STMT_VINFO_DATA_REF (stmt_info); + + if (is_gimple_assign (stmt) + && (!dr + || (DR_IS_READ (dr) && integer_zerop (DR_STEP (dr) + { + bool hoist = true; + ssa_op_iter iter; + tree var; + + /* We hoist a statement if all SSA uses in it are defined + outside of the loop. */ + FOR_EACH_SSA_TREE_OPERAND (var, stmt, iter, SSA_OP_USE) +{ + gimple def = SSA_NAME_DEF_STMT (var); + if (!gimple_nop_p (def) + && flow_bb_inside_loop_p (loop, gimple_bb (def))) + { + hoist = false; + break; + } +} + + if (hoist) +{ + if (dr)
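To make the transformation concrete, here is a source-level sketch (illustration only, not part of the patch) of what test1 conceptually becomes after alias-based loop versioning plus the new hoisting. The function name and the exact form of the runtime predicate are made up for this sketch; the real check GCC emits may differ:

/* Illustrative sketch: test1 after loop versioning for alias, with the
   zero-step load of *b (and the add) hoisted out of the vectorizable
   copy.  The pointer comparison is shown loosely; the real runtime
   check compares addresses of the accessed ranges. */
void test1_versioned (int *a, int *b)
{
  int i;
  if (b + 1 <= a || a + 10 <= b)      /* *b cannot overlap a[0..9] */
    {
      int hoisted = *b + 1;           /* hoisted invariant load */
      for (i = 0; i < 10; ++i)        /* this copy gets vectorized */
        a[i] = hoisted;
    }
  else
    {
      for (i = 0; i < 10; ++i)        /* scalar fallback copy */
        a[i] = *b + 1;
    }
}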
Re: [PATCH] Fixing improper conversion from sin() to sinf() in optimization mode.
Ping? thanks, Cong On Mon, Oct 7, 2013 at 10:15 AM, Cong Hou wrote: > You are right. I am not an expert on numerical analysis, but I tested > your case and it proves the number 4 conversion is not safe. > > Now we have four conversions which are safe once the precision > requirement is satisfied. I added a condition if (type != newtype) to > remove the unsafe one, as in this case once more conversion is added > which leads to the unsafe issue. If you think this condition does not > make sense please let me know. > > The new patch is shown below (the attached file has tabs). > > Thank you very much! > > > > thanks, > Cong > > > > Index: gcc/convert.c > === > --- gcc/convert.c (revision 203250) > +++ gcc/convert.c (working copy) > @@ -135,16 +135,19 @@ convert_to_real (tree type, tree expr) >CASE_MATHFN (COS) >CASE_MATHFN (ERF) >CASE_MATHFN (ERFC) > - CASE_MATHFN (FABS) >CASE_MATHFN (LOG) >CASE_MATHFN (LOG10) >CASE_MATHFN (LOG2) >CASE_MATHFN (LOG1P) > - CASE_MATHFN (LOGB) >CASE_MATHFN (SIN) > - CASE_MATHFN (SQRT) >CASE_MATHFN (TAN) >CASE_MATHFN (TANH) > +/* The above functions are not safe to do this conversion. */ > +if (!flag_unsafe_math_optimizations) > + break; > + CASE_MATHFN (SQRT) > + CASE_MATHFN (FABS) > + CASE_MATHFN (LOGB) > #undef CASE_MATHFN > { >tree arg0 = strip_float_extensions (CALL_EXPR_ARG (expr, 0)); > @@ -155,13 +158,43 @@ convert_to_real (tree type, tree expr) >if (TYPE_PRECISION (TREE_TYPE (arg0)) > TYPE_PRECISION (type)) > newtype = TREE_TYPE (arg0); > > + /* We consider to convert > + > + (T1) sqrtT2 ((T2) exprT3) > + to > + (T1) sqrtT4 ((T4) exprT3) > + > + , where T1 is TYPE, T2 is ITYPE, T3 is TREE_TYPE (ARG0), > + and T4 is NEWTYPE. All those types are of floating point types. > + T4 (NEWTYPE) should be narrower than T2 (ITYPE). This conversion > + is safe only if P1 >= P2*2+2, where P1 and P2 are precisions of > + T2 and T4. See the following URL for a reference: > + > http://stackoverflow.com/questions/9235456/determining-floating-point-square-root > + */ > + if ((fcode == BUILT_IN_SQRT || fcode == BUILT_IN_SQRTL) > + && !flag_unsafe_math_optimizations) > + { > + /* The following conversion is unsafe even the precision condition > + below is satisfied: > + > + (float) sqrtl ((long double) double_val) -> (float) sqrt (double_val) > +*/ > + if (type != newtype) > +break; > + > + int p1 = REAL_MODE_FORMAT (TYPE_MODE (itype))->p; > + int p2 = REAL_MODE_FORMAT (TYPE_MODE (newtype))->p; > + if (p1 < p2 * 2 + 2) > +break; > + } > + >/* Be careful about integer to fp conversions. > These may overflow still. */ >if (FLOAT_TYPE_P (TREE_TYPE (arg0)) >&& TYPE_PRECISION (newtype) < TYPE_PRECISION (itype) >&& (TYPE_MODE (newtype) == TYPE_MODE (double_type_node) >|| TYPE_MODE (newtype) == TYPE_MODE (float_type_node))) > -{ > + { >tree fn = mathfn_built_in (newtype, fcode); > >if (fn) > Index: gcc/ChangeLog > === > --- gcc/ChangeLog (revision 203250) > +++ gcc/ChangeLog (working copy) > @@ -1,3 +1,9 @@ > +2013-10-07 Cong Hou > + > + * convert.c (convert_to_real): Forbid unsafe math function > + conversions including sin/cos/log etc. Add precision check > + for sqrt. > + > 2013-10-07 Bill Schmidt > > * config/rs6000/rs6000.c (altivec_expand_vec_perm_const_le): New. > Index: gcc/testsuite/ChangeLog > === > --- gcc/testsuite/ChangeLog (revision 203250) > +++ gcc/testsuite/ChangeLog (working copy) > @@ -1,3 +1,7 @@ > +2013-10-07 Cong Hou > + > + * gcc.c-torture/execute/20030125-1.c: Update. 
> + > 2013-10-07 Bill Schmidt > > * gcc.target/powerpc/pr43154.c: Skip for ppc64 little endian. > Index: gcc/testsuite/gcc.c-torture/execute/20030125-1.c > === > --- gcc/testsuite/gcc.c-torture/execute/20030125-1.c (revision 203250) > +++ gcc/testsuite/gcc.c-torture/execute/20030125-1.c (working copy) > @@ -44,11 +44,11 @@ __attribute__ ((noinline)) > double > sin(double a) > { > - abort (); > + return a; > } > __attribute__ ((noinline)) > float > sinf(float a) > { > - return a; > + abort (); > } > > > > > On Thu, Oct 3, 2013 at 5:06 PM, Joseph S. Myers > wrote: >> On Fri, 6
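For the float/double case the precision condition above holds: double carries p1 = 53 significand bits, float carries p2 = 24, and 53 >= 2*24 + 2 = 50. A quick self-contained spot check of that claim (my own illustration, not part of the patch or its testsuite):

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

/* Spot check: with p(double) = 53 >= 2*p(float) + 2 = 50, computing
   sqrt in double and then rounding to float should equal sqrtf
   directly.  A sampling test, not a proof. */
int main (void)
{
  int i;
  for (i = 0; i < 1000000; ++i)
    {
      float x = (float) rand () / ((float) rand () + 1.0f);
      float r1 = (float) sqrt ((double) x);
      float r2 = sqrtf (x);
      if (r1 != r2)
        {
          printf ("mismatch for %a: %a vs %a\n", x, r1, r2);
          return 1;
        }
    }
  printf ("no mismatch in 1000000 samples\n");
  return 0;
}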
Re: [PATCH] Hoist loop invariant statements containing data refs with zero-step during loop-versioning in vectorization.
Jeff, thank you for installing this patch. Actually I already have the write privileges. I just came back from a trip. Thank you again! thanks, Cong On Fri, Oct 18, 2013 at 10:22 PM, Jeff Law wrote: > On 10/18/13 03:56, Richard Biener wrote: >> >> On Thu, 17 Oct 2013, Cong Hou wrote: >> >>> I tested this case with -fno-tree-loop-im and -fno-tree-pre, and it >>> seems that GCC could hoist j+1 outside of the i loop: >>> >>> t3.c:5:5: note: hoisting out of the vectorized loop: _10 = (sizetype) >>> j_25; >>> t3.c:5:5: note: hoisting out of the vectorized loop: _11 = _10 + 1; >>> t3.c:5:5: note: hoisting out of the vectorized loop: _12 = _11 * 4; >>> t3.c:5:5: note: hoisting out of the vectorized loop: _14 = b_13(D) + _12; >>> t3.c:5:5: note: hoisting out of the vectorized loop: _15 = *_14; >>> t3.c:5:5: note: hoisting out of the vectorized loop: _16 = _15 + 1; >>> >>> >>> But your suggestion is still nice as it can remove a branch and make >>> the code more brief. I have updated the patch and also included the >>> nested loop example into the test case. >> >> >> Ok if it passes bootstrap & regtest. > > Bootstrapped & regression tested on x86_64-unknown-linux-gnu. Installed on > Cong's behalf. > > Cong -- if you plan on contributing regularly to GCC, please start the > process for write privileges. This form should have everything you need: > > https://sourceware.org/cgi-bin/pdw/ps_form.cgi > > Jeff
Re: [PATCH] Hoist loop invariant statements containing data refs with zero-step during loop-versioning in vectorization.
OK. Have done that. And this is also a "patch", right? ;) thanks, Cong diff --git a/MAINTAINERS b/MAINTAINERS index 15b6cc7..a6954da 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -406,6 +406,7 @@ Fergus Hendersonf...@cs.mu.oz.au Stuart Henderson shend...@gcc.gnu.org Matthew Hiller hil...@redhat.com Manfred Hollstein m...@suse.com +Cong Hou co...@google.com Falk Hueffner f...@debian.org Andrew John Hughes gnu_and...@member.fsf.org Andy Hutchinsonhutchinsona...@aim.com On Mon, Oct 21, 2013 at 9:46 AM, Jeff Law wrote: > On 10/21/13 10:45, Cong Hou wrote: >> >> Jeff, thank you for installing this patch. Actually I already have the >> write privileges. I just came back from a trip. > > Ah. I didn't see you in the MAINTAINERS file. Can you update that file > please. > > Thanks, > jeff >
[PATCH] Vectorizing abs(char/short/int) on x86.
This patch aims at PR58762.

Currently GCC cannot vectorize the abs() operation for integers on x86 with only SSE2 support. For the int type, the reason is that the expand on abs() is not defined for vector types. This patch defines such an expand so that abs(int) will be vectorized with only SSE2.

For abs(char/short), type conversions are needed as the current abs() function/operation does not accept arguments of char/short type. Therefore when we want to get the absolute value of a char_val using abs (char_val), it will be converted into abs ((int) char_val). It can then be vectorized, but the generated code is not efficient as lots of packings and unpackings are involved. But if we convert (char) abs ((int) char_val) to abs (char_val), the vectorizer will be able to generate better code. Same for short.

This conversion also enables vectorizing the abs(char/short) operation with the PABSB and PABSW instructions in SSSE3.

With only SSE2 support, I developed three methods to expand abs(char/short/int) separately:

1. For a 32-bit int value x, we can get abs (x) from (((signed) x >> (W-1)) ^ x) - ((signed) x >> (W-1)). This is better than max (x, -x), which needs bit masking.

2. For a 16-bit int value x, we can get abs (x) from max (x, -x), as SSE2 provides the PMAXSW instruction.

3. For an 8-bit int value x, we can get abs (x) from min ((unsigned char) x, (unsigned char) (-x)), as SSE2 provides the PMINUB instruction.

The patch is pasted below. Please point out any problems in my patch and analysis.

thanks, Cong

diff --git a/gcc/ChangeLog b/gcc/ChangeLog index 8a38316..e0f33ee 100644 --- a/gcc/ChangeLog +++ b/gcc/ChangeLog @@ -1,3 +1,13 @@ +2013-10-22 Cong Hou + + PR target/58762 + * convert.c (convert_to_integer): Convert (char) abs ((int) char_val) + into abs (char_val). Also convert (short) abs ((int) short_val) + into abs (short_val). + * config/i386/i386-protos.h (ix86_expand_sse2_absvxsi2): New function. + * config/i386/i386.c (ix86_expand_sse2_absvxsi2): New function. + * config/i386/sse.md: Add SSE2 support to abs (char/int/short). + 2013-10-14 David Malcolm * dumpfile.h (gcc::dump_manager): New class, to hold state diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h index 3ab2f3a..e85f663 100644 --- a/gcc/config/i386/i386-protos.h +++ b/gcc/config/i386/i386-protos.h @@ -238,6 +238,7 @@ extern void ix86_expand_mul_widen_evenodd (rtx, rtx, rtx, bool, bool); extern void ix86_expand_mul_widen_hilo (rtx, rtx, rtx, bool, bool); extern void ix86_expand_sse2_mulv4si3 (rtx, rtx, rtx); extern void ix86_expand_sse2_mulvxdi3 (rtx, rtx, rtx); +extern void ix86_expand_sse2_absvxsi2 (rtx, rtx); /* In i386-c.c */ extern void ix86_target_macros (void); diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c index 02cbbbd..8050e02 100644 --- a/gcc/config/i386/i386.c +++ b/gcc/config/i386/i386.c @@ -41696,6 +41696,53 @@ ix86_expand_sse2_mulvxdi3 (rtx op0, rtx op1, rtx op2) gen_rtx_MULT (mode, op1, op2)); } +void +ix86_expand_sse2_absvxsi2 (rtx op0, rtx op1) +{ + enum machine_mode mode = GET_MODE (op0); + rtx tmp0, tmp1; + + switch (mode) +{ + /* For 32-bit signed integer X, the best way to calculate the absolute + value of X is (((signed) X >> (W-1)) ^ X) - ((signed) X >> (W-1)).
*/ + case V4SImode: + tmp0 = expand_simple_binop (mode, ASHIFTRT, op1, +GEN_INT (GET_MODE_BITSIZE + (GET_MODE_INNER (mode)) - 1), +NULL, 0, OPTAB_DIRECT); + if (tmp0) + tmp1 = expand_simple_binop (mode, XOR, op1, tmp0, + NULL, 0, OPTAB_DIRECT); + if (tmp0 && tmp1) + expand_simple_binop (mode, MINUS, tmp1, tmp0, + op0, 0, OPTAB_DIRECT); + break; + + /* For 16-bit signed integer X, the best way to calculate the absolute + value of X is max (X, -X), as SSE2 provides the PMAXSW insn. */ + case V8HImode: + tmp0 = expand_unop (mode, neg_optab, op1, NULL_RTX, 0); + if (tmp0) + expand_simple_binop (mode, SMAX, op1, tmp0, op0, 0, + OPTAB_DIRECT); + break; + + /* For 8-bit signed integer X, the best way to calculate the absolute + value of X is min ((unsigned char) X, (unsigned char) (-X)), + as SSE2 provides the PMINUB insn. */ + case V16QImode: + tmp0 = expand_unop (mode, neg_optab, op1, NULL_RTX, 0); + if (tmp0) + expand_simple_binop (V16QImode, UMIN, op1, tmp0, op0, 0, + OPTAB_DIRECT); + break; + + default: + break; +} +} + /* Expand an insert into a vector register through pinsr insn. Return true if successful. */ diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md index c3f6c94..bd90f2d 100644 --- a/gcc/config/i386/sse.md +++ b/gcc/config/i386/sse.md @@ -8721,7 +8721,7 @@ (set (attr "prefix_rex") (symbol_ref "x86_extended_reg_mentioned_p (insn)")) (set_attr "mode" "DI")]) -(define_insn "abs2" +(define_insn "*abs2" [(set (match_operand:VI124_AVX2_48_AVX512F 0 "register_operand" "=v")
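For reference, scalar C sketches of the three identities used above (my own illustration; the patch itself emits only the vector forms). The 32-bit variant assumes an arithmetic right shift for signed values, which GCC provides on x86, and it computes the subtraction in unsigned arithmetic so that INT_MIN wraps instead of invoking signed overflow:

#include <stdio.h>

static int abs_s32 (int x)      /* shift, xor, subtract */
{
  int mask = x >> 31;           /* 0 for x >= 0, -1 for x < 0 */
  return (int) (((unsigned) x ^ (unsigned) mask) - (unsigned) mask);
}

static short abs_s16 (short x)  /* max (x, -x), cf. PMAXSW */
{
  short neg = (short) -x;       /* wraps for -32768 with GCC */
  return x > neg ? x : neg;
}

static unsigned char abs_s8 (signed char x)  /* unsigned min, cf. PMINUB */
{
  unsigned char a = (unsigned char) x;
  unsigned char b = (unsigned char) -x;      /* well defined: mod 256 */
  return a < b ? a : b;
}

int main (void)
{
  printf ("%d %d %d\n", abs_s32 (-5), abs_s16 (-7), abs_s8 (-9));  /* 5 7 9 */
  return 0;
}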
Re: [PATCH] Vectorizing abs(char/short/int) on x86.
On Tue, Oct 22, 2013 at 8:11 PM, wrote: > > > Sent from my iPad > >> On Oct 22, 2013, at 7:23 PM, Cong Hou wrote: >> >> This patch aims at PR58762. >> >> Currently GCC could not vectorize abs() operation for integers on x86 >> with only SSE2 support. For int type, the reason is that the expand on >> abs() is not defined for vector type. This patch defines such an >> expand so that abs(int) will be vectorized with only SSE2. >> >> For abs(char/short), type conversions are needed as the current abs() >> function/operation does not accept argument of char/short type. >> Therefore when we want to get the absolute value of a char_val using >> abs (char_val), it will be converted into abs ((int) char_val). It >> then can be vectorized, but the generated code is not efficient as >> lots of packings and unpackings are envolved. But if we convert >> (char) abs ((int) char_val) to abs (char_val), the vectorizer will be >> able to generate better code. Same for short. >> >> This conversion also enables vectorizing abs(char/short) operation >> with PABSB and PABSW instructions in SSE3. >> >> With only SSE2 support, I developed three methods to expand >> abs(char/short/int) seperately: >> >> 1. For 32 bit int value x, we can get abs (x) from (((signed) x >> >> (W-1)) ^ x) - ((signed) x >> (W-1)). This is better than max (x, -x), >> which needs bit masking. >> >> 2. For 16 bit int value x, we can get abs (x) from max (x, -x), as >> SSE2 provides PMAXSW instruction. >> >> 3. For 8 bit int value x, we can get abs (x) from min ((unsigned char) >> x, (unsigned char) (-x)), as SSE2 provides PMINUB instruction. >> >> >> The patch is pasted below. Please point out any problem in my patch >> and analysis. >> >> >> thanks, >> Cong >> >> >> >> >> diff --git a/gcc/ChangeLog b/gcc/ChangeLog >> index 8a38316..e0f33ee 100644 >> --- a/gcc/ChangeLog >> +++ b/gcc/ChangeLog >> @@ -1,3 +1,13 @@ >> +2013-10-22 Cong Hou >> + >> + PR target/58762 >> + * convert.c (convert_to_integer): Convert (char) abs ((int) char_val) >> + into abs (char_val). Also convert (short) abs ((int) short_val) >> + into abs (short_val). > > I don't like this optimization in convert. I think it should be submitted > separately and should be done in tree-ssa-forwprop. Yes. This patch can be split into two: one for vectorization and one for abs conversion. The reason why I put abs conversion to convert.c is because fabs conversion is also done there. > > Also I think you should have a generic (non x86) test case for the above > optimization. For vectorization I need to do it on x86 since the define_expand is only for it. But for abs conversion, yes, I should make a generic test case. Thank you for your comments! Cong > > Thanks, > Andrew > > >> + * config/i386/i386-protos.h (ix86_expand_sse2_absvxsi2): New function. >> + * config/i386/i386.c (ix86_expand_sse2_absvxsi2): New function. >> + * config/i386/sse.md: Add SSE2 support to abs (char/int/short). 
> > > >> + >> 2013-10-14 David Malcolm >> >> * dumpfile.h (gcc::dump_manager): New class, to hold state >> diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h >> index 3ab2f3a..e85f663 100644 >> --- a/gcc/config/i386/i386-protos.h >> +++ b/gcc/config/i386/i386-protos.h >> @@ -238,6 +238,7 @@ extern void ix86_expand_mul_widen_evenodd (rtx, >> rtx, rtx, bool, bool); >> extern void ix86_expand_mul_widen_hilo (rtx, rtx, rtx, bool, bool); >> extern void ix86_expand_sse2_mulv4si3 (rtx, rtx, rtx); >> extern void ix86_expand_sse2_mulvxdi3 (rtx, rtx, rtx); >> +extern void ix86_expand_sse2_absvxsi2 (rtx, rtx); >> >> /* In i386-c.c */ >> extern void ix86_target_macros (void); >> diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c >> index 02cbbbd..8050e02 100644 >> --- a/gcc/config/i386/i386.c >> +++ b/gcc/config/i386/i386.c >> @@ -41696,6 +41696,53 @@ ix86_expand_sse2_mulvxdi3 (rtx op0, rtx op1, rtx >> op2) >>gen_rtx_MULT (mode, op1, op2)); >> } >> >> +void >> +ix86_expand_sse2_absvxsi2 (rtx op0, rtx op1) >> +{ >> + enum machine_mode mode = GET_MODE (op0); >> + rtx tmp0, tmp1; >> + >> + switch (mode) >> +{ >> + /* For 32-bit signed integer X, the best way to calculate the absolute >> + value of X is (((signed) X >> (W-1)) ^ X) - ((signed) X >> (W-1)). */ >> + case V4SImode: >> + tmp0 = exp
Re: [PATCH] Vectorizing abs(char/short/int) on x86.
On Wed, Oct 23, 2013 at 12:20 AM, Uros Bizjak wrote: > Hello! > >> Currently GCC could not vectorize abs() operation for integers on x86 >> with only SSE2 support. For int type, the reason is that the expand on >> abs() is not defined for vector type. This patch defines such an >> expand so that abs(int) will be vectorized with only SSE2. > > +(define_expand "abs2" > + [(set (match_operand:VI124_AVX2_48_AVX512F 0 "register_operand") > + (abs:VI124_AVX2_48_AVX512F > + (match_operand:VI124_AVX2_48_AVX512F 1 "register_operand")))] > + "TARGET_SSE2" > +{ > + if (TARGET_SSE2 && !TARGET_SSSE3) > +ix86_expand_sse2_absvxsi2 (operands[0], operands[1]); > + else if (TARGET_SSSE3) > +emit_insn (gen_rtx_SET (VOIDmode, operands[0], > +gen_rtx_ABS (mode, operands[1]))); > + DONE; > +}) > > This should be written as: > > (define_expand "abs2" > [(set (match_operand:VI124_AVX2_48_AVX512F 0 "register_operand") >(abs:VI124_AVX2_48_AVX512F > (match_operand:VI124_AVX2_48_AVX512F 1 "nonimmediate_operand")))] > "TARGET_SSE2" > { > if (!TARGET_SSSE3) > { > ix86_expand_sse2_absvxsi2 (operands[0], operands[1]); > DONE; > } > }) OK. > > Please note that operands[1] can be a memory operand, so your expander > should either handle it (this is preferred) or load the operand to the > register at the beginning of the expansion. OK. I think I don't have to make any change to ix86_expand_sse2_absvxsi2(), as operands[1] is always read-only. Right? > > +void > +ix86_expand_sse2_absvxsi2 (rtx op0, rtx op1) > > This function name implies SImode operands ... please just name it > ix86_expand_sse2_abs. Yes, my bad. At first I only considered V4SI but later forgot to rename the function. Thank you very much! Cong > > Uros.
Re: [PATCH] Vectorizing abs(char/short/int) on x86.
On Wed, Oct 23, 2013 at 8:52 AM, Joseph S. Myers wrote: > On Tue, 22 Oct 2013, Cong Hou wrote: > >> For abs(char/short), type conversions are needed as the current abs() >> function/operation does not accept arguments of char/short type. >> Therefore when we want to get the absolute value of a char_val using >> abs (char_val), it will be converted into abs ((int) char_val). It >> then can be vectorized, but the generated code is not efficient as >> lots of packings and unpackings are involved. But if we convert >> (char) abs ((int) char_val) to abs (char_val), the vectorizer will be >> able to generate better code. Same for short. > > ABS_EXPR has undefined overflow behavior. Thus, abs ((int) -128) is > defined (and we also define the subsequent conversion of +128 to signed > char, which ISO C makes implementation-defined not undefined), and > converting to an ABS_EXPR on char would wrongly make it undefined. For > such a transformation to be valid (in the absence of VRP saying that -128 > isn't a possible value) you'd need a GIMPLE representation for > a wrapping ABS_EXPR, as distinct from the present undefined-overflow > ABS_EXPR. > You don't have the option there is for some arithmetic operations of > converting to a corresponding operation on unsigned types. > Yes, you are right. The method I use can guarantee wrapping on overflow (either shift-xor-sub or max(x, -x)). Can I just add the condition if (flag_wrapv) before the conversion I made to prevent the undefined behavior on overflow? Thank you! Cong > -- > Joseph S. Myers > jos...@codesourcery.com
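The corner case can be seen in plain C (my own demonstration, not from the thread's test cases): with the widening form, the abs itself never overflows, and only the final narrowing wraps:

#include <stdio.h>
#include <stdlib.h>

int main (void)
{
  signed char c = -128;
  int wide = abs ((int) c);                 /* 128: well defined in int */
  signed char narrow = (signed char) wide;  /* implementation-defined;
                                               wraps to -128 with GCC */
  printf ("%d %d\n", wide, narrow);         /* prints "128 -128" */
  return 0;
}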
Re: [PATCH] Vectorizing abs(char/short/int) on x86.
I think I did not make it clear. If GCC defines that converting 128 to a char value makes it the wrapping result -128, then the conversion from (char) abs ((int) char_val) to abs (char_val) is safe if we can guarantee abs (char(-128)) = -128 also. Then the subsequent methods used to get abs() should also guarantee wrapping on overflow. Shift-xor-sub is OK, but max(x, -x) is OK only if the result of the negation operation on -128 is also -128 (wrapping). I think that is exactly the behavior of the SSE2 operation PSUBB ([0,...,0], [x,...,x]), as PSUBB can operate on both signed and unsigned operands. thanks, Cong On Wed, Oct 23, 2013 at 9:40 PM, Cong Hou wrote: > On Wed, Oct 23, 2013 at 8:52 AM, Joseph S. Myers > wrote: >> On Tue, 22 Oct 2013, Cong Hou wrote: >> >>> For abs(char/short), type conversions are needed as the current abs() >>> function/operation does not accept arguments of char/short type. >>> Therefore when we want to get the absolute value of a char_val using >>> abs (char_val), it will be converted into abs ((int) char_val). It >>> then can be vectorized, but the generated code is not efficient as >>> lots of packings and unpackings are involved. But if we convert >>> (char) abs ((int) char_val) to abs (char_val), the vectorizer will be >>> able to generate better code. Same for short. >> >> ABS_EXPR has undefined overflow behavior. Thus, abs ((int) -128) is >> defined (and we also define the subsequent conversion of +128 to signed >> char, which ISO C makes implementation-defined not undefined), and >> converting to an ABS_EXPR on char would wrongly make it undefined. For >> such a transformation to be valid (in the absence of VRP saying that -128 >> isn't a possible value) you'd need a GIMPLE representation for >> a wrapping ABS_EXPR, as distinct from the present undefined-overflow >> ABS_EXPR. >> You don't have the option there is for some arithmetic operations of >> converting to a corresponding operation on unsigned types. >> > > Yes, you are right. The method I use can guarantee wrapping on > overflow (either shift-xor-sub or max(x, -x)). Can I just add the > condition if (flag_wrapv) before the conversion I made to prevent the > undefined behavior on overflow? > > Thank you! > > Cong > > >> -- >> Joseph S. Myers >> jos...@codesourcery.com
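A small model of that argument in C, using unsigned arithmetic so no signed overflow occurs (illustration only): with byte-wise wrapping negation, as PSUBB from zero gives, max (x, -x) yields -128 for x = -128, which is exactly the wrapped result the conversion would require:

#include <stdio.h>

int main (void)
{
  unsigned char x = 0x80;                        /* -128 as a byte */
  unsigned char neg = (unsigned char) (0u - x);  /* PSUBB-style 0 - x: 0x80 */
  signed char sx = (signed char) x;
  signed char sneg = (signed char) neg;
  signed char m = sx > sneg ? sx : sneg;         /* max (x, -x) */
  printf ("%d\n", m);                            /* prints -128 */
  return 0;
}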
Re: [PATCH] Fixing improper conversion from sin() to sinf() in optimization mode.
I have updated the patch according to your suggestion, and have committed the patch as bootstrap and make check both passed. Thank you for your patient help on this patch! I learned a lot from it. thanks, Cong On Wed, Oct 23, 2013 at 1:13 PM, Joseph S. Myers wrote: > On Mon, 7 Oct 2013, Cong Hou wrote: > >> + if (type != newtype) >> +break; > > That comparison would wrongly treat as different cases where the types > differ only in one being a typedef, having qualifiers, etc. - or if in > future GCC implemented proposed TS 18661-3, cases where they differ in > e.g. one being float and the other _Float32 (defined as distinct types > that are not compatible although they have the same representation and > alignment). I think the right test here, bearing in mind the _Float32 > case where types may not be compatible, is TYPE_MODE (type) != TYPE_MODE > (newtype) - if the types have the same mode, they have the same set of > values and so are not different in any way that matters for this > optimization. OK with that change. > > -- > Joseph S. Myers > jos...@codesourcery.com
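A source-level illustration of the typedef point (a hypothetical example of mine, not from the patch or its tests, and whether a typedef yields a distinct tree node is a GCC-internal detail): 'real32' below is a different type from 'float' at the tree level, but it has the same TYPE_MODE, so the narrowing to sqrtf remains valid and should not be rejected:

#include <math.h>

typedef float real32;  /* distinct tree variant, same machine mode as float */

real32 narrow_sqrt (float x)
{
  /* type is 'real32', newtype would be 'float': different trees,
     equal TYPE_MODE, so (float) sqrt -> sqrtf still applies. */
  return (real32) sqrt ((double) x);
}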
Re: [PATCH] Vectorizing abs(char/short/int) on x86.
On Wed, Oct 23, 2013 at 11:18 PM, Jakub Jelinek wrote: > On Wed, Oct 23, 2013 at 09:40:21PM -0700, Cong Hou wrote: >> On Wed, Oct 23, 2013 at 8:52 AM, Joseph S. Myers >> wrote: >> > On Tue, 22 Oct 2013, Cong Hou wrote: >> > >> >> For abs(char/short), type conversions are needed as the current abs() >> >> function/operation does not accept argument of char/short type. >> >> Therefore when we want to get the absolute value of a char_val using >> >> abs (char_val), it will be converted into abs ((int) char_val). It >> >> then can be vectorized, but the generated code is not efficient as >> >> lots of packings and unpackings are envolved. But if we convert >> >> (char) abs ((int) char_val) to abs (char_val), the vectorizer will be >> >> able to generate better code. Same for short. >> > >> > ABS_EXPR has undefined overflow behavior. Thus, abs ((int) -128) is >> > defined (and we also define the subsequent conversion of +128 to signed >> > char, which ISO C makes implementation-defined not undefined), and >> > converting to an ABS_EXPR on char would wrongly make it undefined. For >> > such a transformation to be valid (in the absence of VRP saying that -128 >> > isn't a possible value) you'd need a GIMPLE representation for >> > ABS_EXPR, as distinct from ABS_EXPR. >> > You don't have the option there is for some arithmetic operations of >> > converting to a corresponding operation on unsigned types. >> > >> >> Yes, you are right. The method I use can guarantee wrapping on >> overflow (either shift-xor-sub or max(x, -x)). Can I just add the >> condition if (flag_wrapv) before the conversion I made to prevent the >> undefined behavior on overflow? > > What HW insns you expand to is one thing, but if some GCC pass assumes that > ABS_EXPR always returns non-negative value (many do, look e.g. at > tree_unary_nonnegative_warnv_p, extract_range_from_unary_expr_1, > simplify_const_relational_operation, etc., you'd need to grep for all > ABS_EXPR/ABS occurrences) and optimizes code based on that fact, you get > wrong code because (char) abs((char) -128) is well defined. > If we change ABS_EXPR/ABS definition that it is well defined on the most > negative value of the typ (resp. mode), then we loose all those > optimizations, if we do that only for the char/short types, it would be > quite weird, though we could keep the benefits, but at the RTL level we'd > need to treat that way all the modes equal to short's mode and smaller (so, > for sizeof(short) == sizeof(int) target even int's mode). I checked those functions and they all consider the possibility of overflow. For example, tree_unary_nonnegative_warnv_p only returns true for ABS_EXPR on integers if overflow is undefined. If the consequence of overflow is wrapping, I think converting (char) abs((int)-128) to abs(-128) (-128 has char type) is safe. Can we do it by checking flag_wrapv? I could also first remove the abs conversion content from this patch but only keep the content of expanding abs() for i386. I will submit it later. > > The other possibility is not to create the ABS_EXPRs of char/short anywhere, > solve the vectorization issues either through tree-vect-patterns.c or > as part of the vectorization type demotion/promotions, see the recent > discussions for that, you'd represent the short/char abs for the vectorized > loop say using the shift-xor-sub or builtin etc. and if you want to do the > same thing for scalar code, you'd just have combiner try to match some > sequence. 
Yes, I could do it through tree-vect-patterns.c, if the abs conversion is prohibited. Currently the only reason I need the abs conversion is for vectorization. Vectorization type demotions/promotions are interesting, but I am afraid we will face the same problem there. Thank you for your comment! Cong > > Jakub
Re: [PATCH] Vectorizing abs(char/short/int) on x86.
As there are some issues with abs() type conversions, I removed the related content from the patch and kept only the SSE2 support for abs(int). For the define_expand I added below, the else body is there to avoid fall-through transformations to ABS operation in optabs.c. Otherwise ABS will be converted to other operations even though we have corresponding instructions from SSSE3. (define_expand "abs2" [(set (match_operand:VI124_AVX2_48_AVX512F 0 "register_operand") (abs:VI124_AVX2_48_AVX512F (match_operand:VI124_AVX2_48_AVX512F 1 "nonimmediate_operand")))] "TARGET_SSE2" { if (!TARGET_SSSE3) ix86_expand_sse2_abs (operands[0], force_reg (mode, operands[1])); else emit_insn (gen_rtx_SET (VOIDmode, operands[0], gen_rtx_ABS (mode, operands[1]))); DONE; }) The patch is attached here. Please give me your comments. thanks, Cong diff --git a/gcc/ChangeLog b/gcc/ChangeLog index 8a38316..84c7ab5 100644 --- a/gcc/ChangeLog +++ b/gcc/ChangeLog @@ -1,3 +1,10 @@ +2013-10-22 Cong Hou + + PR target/58762 + * config/i386/i386-protos.h (ix86_expand_sse2_abs): New function. + * config/i386/i386.c (ix86_expand_sse2_abs): New function. + * config/i386/sse.md: Add SSE2 support to abs (8/16/32-bit-int). + 2013-10-14 David Malcolm * dumpfile.h (gcc::dump_manager): New class, to hold state diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h index 3ab2f3a..ca31224 100644 --- a/gcc/config/i386/i386-protos.h +++ b/gcc/config/i386/i386-protos.h @@ -238,6 +238,7 @@ extern void ix86_expand_mul_widen_evenodd (rtx, rtx, rtx, bool, bool); extern void ix86_expand_mul_widen_hilo (rtx, rtx, rtx, bool, bool); extern void ix86_expand_sse2_mulv4si3 (rtx, rtx, rtx); extern void ix86_expand_sse2_mulvxdi3 (rtx, rtx, rtx); +extern void ix86_expand_sse2_abs (rtx, rtx); /* In i386-c.c */ extern void ix86_target_macros (void); diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c index 02cbbbd..71905fc 100644 --- a/gcc/config/i386/i386.c +++ b/gcc/config/i386/i386.c @@ -41696,6 +41696,53 @@ ix86_expand_sse2_mulvxdi3 (rtx op0, rtx op1, rtx op2) gen_rtx_MULT (mode, op1, op2)); } +void +ix86_expand_sse2_abs (rtx op0, rtx op1) +{ + enum machine_mode mode = GET_MODE (op0); + rtx tmp0, tmp1; + + switch (mode) +{ + /* For 32-bit signed integer X, the best way to calculate the absolute + value of X is (((signed) X >> (W-1)) ^ X) - ((signed) X >> (W-1)). */ + case V4SImode: + tmp0 = expand_simple_binop (mode, ASHIFTRT, op1, +GEN_INT (GET_MODE_BITSIZE + (GET_MODE_INNER (mode)) - 1), +NULL, 0, OPTAB_DIRECT); + if (tmp0) + tmp1 = expand_simple_binop (mode, XOR, op1, tmp0, + NULL, 0, OPTAB_DIRECT); + if (tmp0 && tmp1) + expand_simple_binop (mode, MINUS, tmp1, tmp0, + op0, 0, OPTAB_DIRECT); + break; + + /* For 16-bit signed integer X, the best way to calculate the absolute + value of X is max (X, -X), as SSE2 provides the PMAXSW insn. */ + case V8HImode: + tmp0 = expand_unop (mode, neg_optab, op1, NULL_RTX, 0); + if (tmp0) + expand_simple_binop (mode, SMAX, op1, tmp0, op0, 0, + OPTAB_DIRECT); + break; + + /* For 8-bit signed integer X, the best way to calculate the absolute + value of X is min ((unsigned char) X, (unsigned char) (-X)), + as SSE2 provides the PMINUB insn. */ + case V16QImode: + tmp0 = expand_unop (mode, neg_optab, op1, NULL_RTX, 0); + if (tmp0) + expand_simple_binop (V16QImode, UMIN, op1, tmp0, op0, 0, + OPTAB_DIRECT); + break; + + default: + break; +} +} + /* Expand an insert into a vector register through pinsr insn. Return true if successful.
*/ diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md index c3f6c94..b85ded4 100644 --- a/gcc/config/i386/sse.md +++ b/gcc/config/i386/sse.md @@ -8721,7 +8721,7 @@ (set (attr "prefix_rex") (symbol_ref "x86_extended_reg_mentioned_p (insn)")) (set_attr "mode" "DI")]) -(define_insn "abs2" +(define_insn "*abs2" [(set (match_operand:VI124_AVX2_48_AVX512F 0 "register_operand" "=v") (abs:VI124_AVX2_48_AVX512F (match_operand:VI124_AVX2_48_AVX512F 1 "nonimmediate_operand" "vm")))] @@ -8733,6 +8733,20 @@ (set_attr "prefix" "maybe_vex") (set_attr "mode" "")]) +(define_expand "abs2" + [(set (match_operand:VI124_AVX2_48_AVX512F 0 "register_operand") + (abs:VI124_AVX2_48_AVX512F + (match_operand:VI124_AVX2_48_AVX512F 1 "nonimmediate_operand")))] + "TARGET_SSE2" +{ + if (!TARGET_SSSE3) +ix86_expand_sse2_abs (operands[0], force_reg (mode, operands[1])); + else +emit_insn (gen_rtx_SET (VOIDmode, operands[0], +gen_rtx_ABS (mode, operands[1]))); + DONE; +}) + (define_insn "abs2" [(set (match_operand:MMXMODEI 0 "register_operand" "
Re: [PATCH] Vectorizing abs(char/short/int) on x86.
On Tue, Oct 29, 2013 at 1:38 AM, Uros Bizjak wrote: > Hello! > >> For the define_expand I added as below, the else body is there to >> avoid fall-through transformations to ABS operation in optabs.c. >> Otherwise ABS will be converted to other operations even that we have >> corresponding instructions from SSSE3. > > No, it wont be. > > Fallthrough will generate the pattern that will be matched by the insn > pattern above, just like you are doing by hand below. I think the case is special for abs(). In optabs.c, there is a function expand_abs() in which the function expand_abs_nojump() is called. This function first tries the expand function defined for the target and if it fails it will try max(v, -v) then shift-xor-sub method. If I don't generate any instruction for SSSE3, the fall-through will be max(v, -v). I have tested it on my machine. > >> (define_expand "abs2" >> [(set (match_operand:VI124_AVX2_48_AVX512F 0 "register_operand") >> (abs:VI124_AVX2_48_AVX512F >> (match_operand:VI124_AVX2_48_AVX512F 1 "nonimmediate_operand")))] >> "TARGET_SSE2" >> { >> if (!TARGET_SSSE3) >> ix86_expand_sse2_abs (operands[0], force_reg (mode, operands[1])); > > Do you really need force_reg here? You are using generic expanders in > ix86_expand_sse2_abs that can handle non-registers operands just as > well. You are right. I have removed force_reg. > >> else >> emit_insn (gen_rtx_SET (VOIDmode, operands[0], >>gen_rtx_ABS (mode, operands[1]))); >> DONE; >> }) > > Please note that your mailer mangles indents. Please indent your code > correctly. Right.. I also attached a text file in which all tabs are there. The updated patch is pasted below (and also in the attached file). Thank you very much for your comment! Cong diff --git a/gcc/ChangeLog b/gcc/ChangeLog index 8a38316..84c7ab5 100644 --- a/gcc/ChangeLog +++ b/gcc/ChangeLog @@ -1,3 +1,10 @@ +2013-10-22 Cong Hou + + PR target/58762 + * config/i386/i386-protos.h (ix86_expand_sse2_abs): New function. + * config/i386/i386.c (ix86_expand_sse2_abs): New function. + * config/i386/sse.md: Add SSE2 support to abs (8/16/32-bit-int). + 2013-10-14 David Malcolm * dumpfile.h (gcc::dump_manager): New class, to hold state diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h index 3ab2f3a..ca31224 100644 --- a/gcc/config/i386/i386-protos.h +++ b/gcc/config/i386/i386-protos.h @@ -238,6 +238,7 @@ extern void ix86_expand_mul_widen_evenodd (rtx, rtx, rtx, bool, bool); extern void ix86_expand_mul_widen_hilo (rtx, rtx, rtx, bool, bool); extern void ix86_expand_sse2_mulv4si3 (rtx, rtx, rtx); extern void ix86_expand_sse2_mulvxdi3 (rtx, rtx, rtx); +extern void ix86_expand_sse2_abs (rtx, rtx); /* In i386-c.c */ extern void ix86_target_macros (void); diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c index 02cbbbd..71905fc 100644 --- a/gcc/config/i386/i386.c +++ b/gcc/config/i386/i386.c @@ -41696,6 +41696,53 @@ ix86_expand_sse2_mulvxdi3 (rtx op0, rtx op1, rtx op2) gen_rtx_MULT (mode, op1, op2)); } +void +ix86_expand_sse2_abs (rtx op0, rtx op1) +{ + enum machine_mode mode = GET_MODE (op0); + rtx tmp0, tmp1; + + switch (mode) +{ + /* For 32-bit signed integer X, the best way to calculate the absolute + value of X is (((signed) X >> (W-1)) ^ X) - ((signed) X >> (W-1)). 
*/ + case V4SImode: + tmp0 = expand_simple_binop (mode, ASHIFTRT, op1, +GEN_INT (GET_MODE_BITSIZE + (GET_MODE_INNER (mode)) - 1), +NULL, 0, OPTAB_DIRECT); + if (tmp0) + tmp1 = expand_simple_binop (mode, XOR, op1, tmp0, + NULL, 0, OPTAB_DIRECT); + if (tmp0 && tmp1) + expand_simple_binop (mode, MINUS, tmp1, tmp0, + op0, 0, OPTAB_DIRECT); + break; + + /* For 16-bit signed integer X, the best way to calculate the absolute + value of X is max (X, -X), as SSE2 provides the PMAXSW insn. */ + case V8HImode: + tmp0 = expand_unop (mode, neg_optab, op1, NULL_RTX, 0); + if (tmp0) + expand_simple_binop (mode, SMAX, op1, tmp0, op0, 0, + OPTAB_DIRECT); + break; + + /* For 8-bit signed integer X, the best way to calculate the absolute + value of X is min ((unsigned char) X, (unsigned char) (-X)), + as SSE2 provides the PMINUB insn. */ + case V16QImode: + tmp0 = expand_unop (mode, neg_optab, op1, NULL_RTX, 0); + if (tmp0) + expand_simple_binop (V16QImode, UMIN, op1, tmp0, op0, 0, + OPTAB_DIRECT); + break; + + default: + break; +} +} + /* Expand an insert into a vector register through pinsr insn. Return true if successful. */ diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md index c3f6c94..0d9cefe 100644 --- a/gcc/config/i386/sse.md +++ b/gcc/config/i386/sse.md @@ -8721,7 +8721,7 @@ (set (attr "prefix_rex") (
Re: [PATCH] Vectorizing abs(char/short/int) on x86.
On Tue, Oct 29, 2013 at 10:34 AM, Uros Bizjak wrote: > On Tue, Oct 29, 2013 at 6:18 PM, Cong Hou wrote: > >>>> For the define_expand I added below, the else body is there to >>>> avoid fall-through transformations to ABS operation in optabs.c. >>>> Otherwise ABS will be converted to other operations even though we have >>>> corresponding instructions from SSSE3. >>> >>> No, it wont be. >>> >>> Fallthrough will generate the pattern that will be matched by the insn >>> pattern above, just like you are doing by hand below. >> >> >> I think the case is special for abs(). In optabs.c, there is a >> function expand_abs() in which the function expand_abs_nojump() is >> called. This function first tries the expand function defined for the >> target and if it fails it will try max(v, -v) then shift-xor-sub >> method. If I don't generate any instruction for SSSE3, the >> fall-through will be max(v, -v). I have tested it on my machine. > > Huh, strange. > > Then you can rename previous pattern to abs2_1 and call it from > the new expander instead of expanding it manually. Please also add a > small comment, describing the situation to prevent future > "optimizations" in this place. Could you tell me how to do that? Is the renamed pattern abs2_1 also a "define_expand"? How do I call this expander? Thank you! Cong > > Thanks, > Uros.
[PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.
Hi,

SAD (Sum of Absolute Differences) is a common and important algorithm in image processing and other areas. SSE2 even introduced a dedicated instruction PSADBW for it. A SAD loop can be greatly accelerated by this instruction after being vectorized. This patch introduces a new operation SAD_EXPR and a SAD pattern recognizer in the vectorizer.

The pattern of SAD is shown below:

   unsigned type x_t, y_t;
   signed TYPE1 diff, abs_diff;
   TYPE2 sum = init;
 loop:
   sum_0 = phi <init, sum_1>
   S1  x_t = ...
   S2  y_t = ...
   S3  x_T = (TYPE1) x_t;
   S4  y_T = (TYPE1) y_t;
   S5  diff = x_T - y_T;
   S6  abs_diff = ABS_EXPR <diff>;
   [S7  abs_diff = (TYPE2) abs_diff;  #optional]
   S8  sum_1 = abs_diff + sum_0;

where 'TYPE1' is at least double the size of type 'type', and 'TYPE2' is the same size as 'TYPE1' or bigger. This is a special case of a reduction computation.

For SSE2, type is char, and TYPE1 and TYPE2 are int.

In order to express this new operation, a new expression SAD_EXPR is introduced in tree.def, and the corresponding entry in optabs is added. The patch also adds the "define_expand" for SSE2 and AVX2 platforms for i386.

The patch is pasted below and also attached as a text file (in which you can see tabs). Bootstrap and make check passed on x86. Please give me your comments.

thanks, Cong

diff --git a/gcc/ChangeLog b/gcc/ChangeLog index 8a38316..d528307 100644 --- a/gcc/ChangeLog +++ b/gcc/ChangeLog @@ -1,3 +1,23 @@ +2013-10-29 Cong Hou + + * tree-vect-patterns.c (vect_recog_sad_pattern): New function for SAD + pattern recognition. + (type_conversion_p): PROMOTION is true if it's a type promotion + conversion, and false otherwise. Return true if the given expression + is a type conversion one. + * tree-vectorizer.h: Adjust the number of patterns. + * tree.def: Add SAD_EXPR. + * optabs.def: Add sad_optab. + * cfgexpand.c (expand_debug_expr): Add SAD_EXPR case. + * expr.c (expand_expr_real_2): Likewise. + * gimple-pretty-print.c (dump_ternary_rhs): Likewise. + * gimple.c (get_gimple_rhs_num_ops): Likewise. + * optabs.c (optab_for_tree_code): Likewise. + * tree-cfg.c (estimate_operator_cost): Likewise. + * tree-ssa-operands.c (get_expr_operands): Likewise. + * tree-vect-loop.c (get_initial_def_for_reduction): Likewise. + * config/i386/sse.md: Add SSE2 and AVX2 expand for SAD.
+ 2013-10-14 David Malcolm * dumpfile.h (gcc::dump_manager): New class, to hold state diff --git a/gcc/cfgexpand.c b/gcc/cfgexpand.c index 7ed29f5..9ec761a 100644 --- a/gcc/cfgexpand.c +++ b/gcc/cfgexpand.c @@ -2730,6 +2730,7 @@ expand_debug_expr (tree exp) { case COND_EXPR: case DOT_PROD_EXPR: + case SAD_EXPR: case WIDEN_MULT_PLUS_EXPR: case WIDEN_MULT_MINUS_EXPR: case FMA_EXPR: diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md index c3f6c94..ca1ab70 100644 --- a/gcc/config/i386/sse.md +++ b/gcc/config/i386/sse.md @@ -6052,6 +6052,40 @@ DONE; }) +(define_expand "sadv16qi" + [(match_operand:V4SI 0 "register_operand") + (match_operand:V16QI 1 "register_operand") + (match_operand:V16QI 2 "register_operand") + (match_operand:V4SI 3 "register_operand")] + "TARGET_SSE2" +{ + rtx t1 = gen_reg_rtx (V2DImode); + rtx t2 = gen_reg_rtx (V4SImode); + emit_insn (gen_sse2_psadbw (t1, operands[1], operands[2])); + convert_move (t2, t1, 0); + emit_insn (gen_rtx_SET (VOIDmode, operands[0], + gen_rtx_PLUS (V4SImode, + operands[3], t2))); + DONE; +}) + +(define_expand "sadv32qi" + [(match_operand:V8SI 0 "register_operand") + (match_operand:V32QI 1 "register_operand") + (match_operand:V32QI 2 "register_operand") + (match_operand:V8SI 3 "register_operand")] + "TARGET_AVX2" +{ + rtx t1 = gen_reg_rtx (V4DImode); + rtx t2 = gen_reg_rtx (V8SImode); + emit_insn (gen_avx2_psadbw (t1, operands[1], operands[2])); + convert_move (t2, t1, 0); + emit_insn (gen_rtx_SET (VOIDmode, operands[0], + gen_rtx_PLUS (V8SImode, + operands[3], t2))); + DONE; +}) + (define_insn "ashr3" [(set (match_operand:VI24_AVX2 0 "register_operand" "=x,x") (ashiftrt:VI24_AVX2 diff --git a/gcc/expr.c b/gcc/expr.c index 4975a64..1db8a49 100644 --- a/gcc/expr.c +++ b/gcc/expr.c @@ -9026,6 +9026,20 @@ expand_expr_real_2 (sepops ops, rtx target, enum machine_mode tmode, return target; } + case SAD_EXPR: + { + tree oprnd0 = treeop0; + tree oprnd1 = treeop1; + tree oprnd2 = treeop2; + rtx op2; + + expand_operands (oprnd0, oprnd1, NULL_RTX, &op0, &op1, EXPAND_NORMAL); + op2 = expand_normal (oprnd2); + target = expand_widen_pattern_expr (ops, op0, op1, op2, +target, unsignedp); + return target; + } + case REALIGN_LOAD_EXPR: { tree oprnd0 = treeop0; diff --git a/gcc/gimple-pretty-print.c b/gcc/gimple-pretty-print.c index f
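A typical source loop matching the pattern above, with type = unsigned char and TYPE1 = TYPE2 = int (an illustrative kernel of my own, not taken from the patch's tests); this is the shape the new recognizer rewrites to SAD_EXPR so that SSE2 can use PSADBW:

#include <stdlib.h>

int sad (unsigned char *x, unsigned char *y, int n)
{
  int i, sum = 0;                          /* TYPE2 sum = init       */
  for (i = 0; i < n; ++i)
    {
      int diff = (int) x[i] - (int) y[i];  /* S3-S5: widen, subtract */
      sum += abs (diff);                   /* S6, S8: abs, reduce    */
    }
  return sum;
}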
Re: [PATCH] Vectorizing abs(char/short/int) on x86.
I found my problem: I put DONE outside of if not inside. You are right. I have updated my patch. I appreciate your comment and test on it! thanks, Cong diff --git a/gcc/ChangeLog b/gcc/ChangeLog index 8a38316..84c7ab5 100644 --- a/gcc/ChangeLog +++ b/gcc/ChangeLog @@ -1,3 +1,10 @@ +2013-10-22 Cong Hou + + PR target/58762 + * config/i386/i386-protos.h (ix86_expand_sse2_abs): New function. + * config/i386/i386.c (ix86_expand_sse2_abs): New function. + * config/i386/sse.md: Add SSE2 support to abs (8/16/32-bit-int). + 2013-10-14 David Malcolm * dumpfile.h (gcc::dump_manager): New class, to hold state diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h index 3ab2f3a..ca31224 100644 --- a/gcc/config/i386/i386-protos.h +++ b/gcc/config/i386/i386-protos.h @@ -238,6 +238,7 @@ extern void ix86_expand_mul_widen_evenodd (rtx, rtx, rtx, bool, bool); extern void ix86_expand_mul_widen_hilo (rtx, rtx, rtx, bool, bool); extern void ix86_expand_sse2_mulv4si3 (rtx, rtx, rtx); extern void ix86_expand_sse2_mulvxdi3 (rtx, rtx, rtx); +extern void ix86_expand_sse2_abs (rtx, rtx); /* In i386-c.c */ extern void ix86_target_macros (void); diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c index 02cbbbd..71905fc 100644 --- a/gcc/config/i386/i386.c +++ b/gcc/config/i386/i386.c @@ -41696,6 +41696,53 @@ ix86_expand_sse2_mulvxdi3 (rtx op0, rtx op1, rtx op2) gen_rtx_MULT (mode, op1, op2)); } +void +ix86_expand_sse2_abs (rtx op0, rtx op1) +{ + enum machine_mode mode = GET_MODE (op0); + rtx tmp0, tmp1; + + switch (mode) +{ + /* For 32-bit signed integer X, the best way to calculate the absolute + value of X is (((signed) X >> (W-1)) ^ X) - ((signed) X >> (W-1)). */ + case V4SImode: + tmp0 = expand_simple_binop (mode, ASHIFTRT, op1, +GEN_INT (GET_MODE_BITSIZE + (GET_MODE_INNER (mode)) - 1), +NULL, 0, OPTAB_DIRECT); + if (tmp0) + tmp1 = expand_simple_binop (mode, XOR, op1, tmp0, + NULL, 0, OPTAB_DIRECT); + if (tmp0 && tmp1) + expand_simple_binop (mode, MINUS, tmp1, tmp0, + op0, 0, OPTAB_DIRECT); + break; + + /* For 16-bit signed integer X, the best way to calculate the absolute + value of X is max (X, -X), as SSE2 provides the PMAXSW insn. */ + case V8HImode: + tmp0 = expand_unop (mode, neg_optab, op1, NULL_RTX, 0); + if (tmp0) + expand_simple_binop (mode, SMAX, op1, tmp0, op0, 0, + OPTAB_DIRECT); + break; + + /* For 8-bit signed integer X, the best way to calculate the absolute + value of X is min ((unsigned char) X, (unsigned char) (-X)), + as SSE2 provides the PMINUB insn. */ + case V16QImode: + tmp0 = expand_unop (mode, neg_optab, op1, NULL_RTX, 0); + if (tmp0) + expand_simple_binop (V16QImode, UMIN, op1, tmp0, op0, 0, + OPTAB_DIRECT); + break; + + default: + break; +} +} + /* Expand an insert into a vector register through pinsr insn. Return true if successful. 
*/ diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md index c3f6c94..46e1df4 100644 --- a/gcc/config/i386/sse.md +++ b/gcc/config/i386/sse.md @@ -8721,7 +8721,7 @@ (set (attr "prefix_rex") (symbol_ref "x86_extended_reg_mentioned_p (insn)")) (set_attr "mode" "DI")]) -(define_insn "abs2" +(define_insn "*abs2" [(set (match_operand:VI124_AVX2_48_AVX512F 0 "register_operand" "=v") (abs:VI124_AVX2_48_AVX512F (match_operand:VI124_AVX2_48_AVX512F 1 "nonimmediate_operand" "vm")))] @@ -8733,6 +8733,19 @@ (set_attr "prefix" "maybe_vex") (set_attr "mode" "")]) +(define_expand "abs2" + [(set (match_operand:VI124_AVX2_48_AVX512F 0 "register_operand") + (abs:VI124_AVX2_48_AVX512F + (match_operand:VI124_AVX2_48_AVX512F 1 "nonimmediate_operand")))] + "TARGET_SSE2" +{ + if (!TARGET_SSSE3) +{ + ix86_expand_sse2_abs (operands[0], operands[1]); + DONE; +} +}) + (define_insn "abs2" [(set (match_operand:MMXMODEI 0 "register_operand" "=y") (abs:MMXMODEI diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog index 075d071..cf5b942 100644 --- a/gcc/testsuite/ChangeLog +++ b/gcc/testsuite/ChangeLog @@ -1,3 +1,8 @@ +2013-10-22 Cong Hou + + PR target/58762 + * gcc.dg/vect/pr58762.c: New test. + 2013-10-14 Tobias Burnus PR fortran/58658 diff --git a/gcc/testsuite/gcc.dg/vect/pr58762.c b/gcc/testsuite/gcc.dg/vect/pr58762.c new file mode 100644 index 000..6468d0a --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/pr58762.c @@ -0,0 +1,28 @@ +/* { dg-require-effective-target vect_int } */ +/* { dg-do compile } */ +/* { dg-options "-O2 -ftree-vectorize" } */ + +void test1 (char* a, char* b) +{ + int i; + for (i = 0; i < 1; ++i) +a[i] = abs (b[i]); +} + +void test2 (short* a, short* b) +{ + int i; + for (i = 0; i &l
Re: [PATCH] Vectorizing abs(char/short/int) on x86.
Forget to attach the patch file. thanks, Cong On Wed, Oct 30, 2013 at 10:01 AM, Cong Hou wrote: > I found my problem: I put DONE outside of if not inside. You are > right. I have updated my patch. > > I appreciate your comment and test on it! > > > thanks, > Cong > > > > diff --git a/gcc/ChangeLog b/gcc/ChangeLog > index 8a38316..84c7ab5 100644 > --- a/gcc/ChangeLog > +++ b/gcc/ChangeLog > @@ -1,3 +1,10 @@ > +2013-10-22 Cong Hou > + > + PR target/58762 > + * config/i386/i386-protos.h (ix86_expand_sse2_abs): New function. > + * config/i386/i386.c (ix86_expand_sse2_abs): New function. > + * config/i386/sse.md: Add SSE2 support to abs (8/16/32-bit-int). > + > 2013-10-14 David Malcolm > > * dumpfile.h (gcc::dump_manager): New class, to hold state > diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h > index 3ab2f3a..ca31224 100644 > --- a/gcc/config/i386/i386-protos.h > +++ b/gcc/config/i386/i386-protos.h > @@ -238,6 +238,7 @@ extern void ix86_expand_mul_widen_evenodd (rtx, > rtx, rtx, bool, bool); > extern void ix86_expand_mul_widen_hilo (rtx, rtx, rtx, bool, bool); > extern void ix86_expand_sse2_mulv4si3 (rtx, rtx, rtx); > extern void ix86_expand_sse2_mulvxdi3 (rtx, rtx, rtx); > +extern void ix86_expand_sse2_abs (rtx, rtx); > > /* In i386-c.c */ > extern void ix86_target_macros (void); > diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c > index 02cbbbd..71905fc 100644 > --- a/gcc/config/i386/i386.c > +++ b/gcc/config/i386/i386.c > @@ -41696,6 +41696,53 @@ ix86_expand_sse2_mulvxdi3 (rtx op0, rtx op1, rtx op2) > gen_rtx_MULT (mode, op1, op2)); > } > > +void > +ix86_expand_sse2_abs (rtx op0, rtx op1) > +{ > + enum machine_mode mode = GET_MODE (op0); > + rtx tmp0, tmp1; > + > + switch (mode) > +{ > + /* For 32-bit signed integer X, the best way to calculate the absolute > + value of X is (((signed) X >> (W-1)) ^ X) - ((signed) X >> (W-1)). */ > + case V4SImode: > + tmp0 = expand_simple_binop (mode, ASHIFTRT, op1, > +GEN_INT (GET_MODE_BITSIZE > + (GET_MODE_INNER (mode)) - 1), > +NULL, 0, OPTAB_DIRECT); > + if (tmp0) > + tmp1 = expand_simple_binop (mode, XOR, op1, tmp0, > + NULL, 0, OPTAB_DIRECT); > + if (tmp0 && tmp1) > + expand_simple_binop (mode, MINUS, tmp1, tmp0, > + op0, 0, OPTAB_DIRECT); > + break; > + > + /* For 16-bit signed integer X, the best way to calculate the absolute > + value of X is max (X, -X), as SSE2 provides the PMAXSW insn. */ > + case V8HImode: > + tmp0 = expand_unop (mode, neg_optab, op1, NULL_RTX, 0); > + if (tmp0) > + expand_simple_binop (mode, SMAX, op1, tmp0, op0, 0, > + OPTAB_DIRECT); > + break; > + > + /* For 8-bit signed integer X, the best way to calculate the absolute > + value of X is min ((unsigned char) X, (unsigned char) (-X)), > + as SSE2 provides the PMINUB insn. */ > + case V16QImode: > + tmp0 = expand_unop (mode, neg_optab, op1, NULL_RTX, 0); > + if (tmp0) > + expand_simple_binop (V16QImode, UMIN, op1, tmp0, op0, 0, > + OPTAB_DIRECT); > + break; > + > + default: > + break; > +} > +} > + > /* Expand an insert into a vector register through pinsr insn. > Return true if successful. 
*/ > > diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md > index c3f6c94..46e1df4 100644 > --- a/gcc/config/i386/sse.md > +++ b/gcc/config/i386/sse.md > @@ -8721,7 +8721,7 @@ > (set (attr "prefix_rex") (symbol_ref "x86_extended_reg_mentioned_p > (insn)")) > (set_attr "mode" "DI")]) > > -(define_insn "abs2" > +(define_insn "*abs2" >[(set (match_operand:VI124_AVX2_48_AVX512F 0 "register_operand" "=v") > (abs:VI124_AVX2_48_AVX512F >(match_operand:VI124_AVX2_48_AVX512F 1 "nonimmediate_operand" "vm")))] > @@ -8733,6 +8733,19 @@ > (set_attr "prefix" "maybe_vex") > (set_attr "mode" "")]) > > +(define_expand "abs2" > + [(set (match_operand:VI124_AVX2_48_AVX512F 0 "register_operand") > + (abs:VI124_AVX2_48_AVX512F > + (match_operand:VI124_AVX2_48_AVX512F 1 "nonimmediate_operand")))] > + "TARGET_SSE2" > +{ > + if (!TARGET_SSSE3) > +{ > + ix86_expand_sse2_abs (operands[0], operands[1]); > + DONE; > +} > +}) > + > (define_insn "abs2" >[(set (match_operand:MMXMODEI 0 "register_operand" "=y") > (abs:MMXMODEI > diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog >
Re: [PATCH] Vectorizing abs(char/short/int) on x86.
On Wed, Oct 30, 2013 at 10:22 AM, Uros Bizjak wrote: > On Wed, Oct 30, 2013 at 6:01 PM, Cong Hou wrote: >> I found my problem: I put DONE outside of if not inside. You are >> right. I have updated my patch. > > OK, great that we put things in order ;) > > Does this patch need some extra middle-end functionality? I was not > able to vectorize char and short part of your patch. In the original patch, I converted abs() on short and char values to their own types by removing type casts. That is, originally char_val1 = abs(char_val2) will be converted to char_val1 = (char) abs((int) char_val2) in the frontend, and I would like to convert it back to char_val1 = abs(char_val2). But after several discussions, it seems this conversion has some problems such as overflow converns, and I thereby removed that part. Now you should still be able to vectorize abs(char) and abs(short) but with packing and unpacking. Later I will consider to write pattern recognizer for abs(char) and abs(short) and then the expand on abs(char)/abs(short) in this patch will be used during vectorization. > > Regarding the testcase - please put it to gcc.target/i386/ directory. > There is nothing generic in the test, as confirmed by target-dependent > scan test. You will find plenty of examples in the mentioned > directory. I'd suggest to split the testcase in three files, and to > simplify it to something like the testcase with global variables I > used earlier. I have done it. The test case is split into three for s8/s16/s32 in gcc.target/i386. Thank you! Cong diff --git a/gcc/ChangeLog b/gcc/ChangeLog index 8a38316..84c7ab5 100644 --- a/gcc/ChangeLog +++ b/gcc/ChangeLog @@ -1,3 +1,10 @@ +2013-10-22 Cong Hou + + PR target/58762 + * config/i386/i386-protos.h (ix86_expand_sse2_abs): New function. + * config/i386/i386.c (ix86_expand_sse2_abs): New function. + * config/i386/sse.md: Add SSE2 support to abs (8/16/32-bit-int). + 2013-10-14 David Malcolm * dumpfile.h (gcc::dump_manager): New class, to hold state diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h index 3ab2f3a..ca31224 100644 --- a/gcc/config/i386/i386-protos.h +++ b/gcc/config/i386/i386-protos.h @@ -238,6 +238,7 @@ extern void ix86_expand_mul_widen_evenodd (rtx, rtx, rtx, bool, bool); extern void ix86_expand_mul_widen_hilo (rtx, rtx, rtx, bool, bool); extern void ix86_expand_sse2_mulv4si3 (rtx, rtx, rtx); extern void ix86_expand_sse2_mulvxdi3 (rtx, rtx, rtx); +extern void ix86_expand_sse2_abs (rtx, rtx); /* In i386-c.c */ extern void ix86_target_macros (void); diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c index 02cbbbd..71905fc 100644 --- a/gcc/config/i386/i386.c +++ b/gcc/config/i386/i386.c @@ -41696,6 +41696,53 @@ ix86_expand_sse2_mulvxdi3 (rtx op0, rtx op1, rtx op2) gen_rtx_MULT (mode, op1, op2)); } +void +ix86_expand_sse2_abs (rtx op0, rtx op1) +{ + enum machine_mode mode = GET_MODE (op0); + rtx tmp0, tmp1; + + switch (mode) +{ + /* For 32-bit signed integer X, the best way to calculate the absolute + value of X is (((signed) X >> (W-1)) ^ X) - ((signed) X >> (W-1)). 
*/ + case V4SImode: + tmp0 = expand_simple_binop (mode, ASHIFTRT, op1, +GEN_INT (GET_MODE_BITSIZE + (GET_MODE_INNER (mode)) - 1), +NULL, 0, OPTAB_DIRECT); + if (tmp0) + tmp1 = expand_simple_binop (mode, XOR, op1, tmp0, + NULL, 0, OPTAB_DIRECT); + if (tmp0 && tmp1) + expand_simple_binop (mode, MINUS, tmp1, tmp0, + op0, 0, OPTAB_DIRECT); + break; + + /* For 16-bit signed integer X, the best way to calculate the absolute + value of X is max (X, -X), as SSE2 provides the PMAXSW insn. */ + case V8HImode: + tmp0 = expand_unop (mode, neg_optab, op1, NULL_RTX, 0); + if (tmp0) + expand_simple_binop (mode, SMAX, op1, tmp0, op0, 0, + OPTAB_DIRECT); + break; + + /* For 8-bit signed integer X, the best way to calculate the absolute + value of X is min ((unsigned char) X, (unsigned char) (-X)), + as SSE2 provides the PMINUB insn. */ + case V16QImode: + tmp0 = expand_unop (mode, neg_optab, op1, NULL_RTX, 0); + if (tmp0) + expand_simple_binop (V16QImode, UMIN, op1, tmp0, op0, 0, + OPTAB_DIRECT); + break; + + default: + break; +} +} + /* Expand an insert into a vector register through pinsr insn. Return true if successful. */ diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md index c3f6c94..46e1df4 100644 --- a/gcc/config/i386/sse.md +++ b/gcc/config/i386/sse.md @@ -8721,7 +8721,7 @@ (set (attr "prefix_rex") (symbol_ref "x86_extended_reg_mentioned_p (insn)")) (set_attr "mode" "DI")]) -(define_insn "abs2" +(define_insn "*abs2" [(set (match_operand:VI124_AVX2_48_AVX512F 0 "register_operand" "=v") (abs:VI124_AVX2_48_AVX512F (match_operand:VI124_AVX2_48_AVX5
Re: [PATCH] Vectorizing abs(char/short/int) on x86.
Also, as the current expand for abs() on 8/16bit integer is not used at all, should I comment them temporarily now? Later I can uncomment them once I finished the pattern recognizer. thanks, Cong On Wed, Oct 30, 2013 at 10:22 AM, Uros Bizjak wrote: > On Wed, Oct 30, 2013 at 6:01 PM, Cong Hou wrote: >> I found my problem: I put DONE outside of if not inside. You are >> right. I have updated my patch. > > OK, great that we put things in order ;) > > Does this patch need some extra middle-end functionality? I was not > able to vectorize char and short part of your patch. > > Regarding the testcase - please put it to gcc.target/i386/ directory. > There is nothing generic in the test, as confirmed by target-dependent > scan test. You will find plenty of examples in the mentioned > directory. I'd suggest to split the testcase in three files, and to > simplify it to something like the testcase with global variables I > used earlier. > > Modulo testcase, the patch is OK otherwise, but middle-end parts > should be committed first. > > Thanks, > Uros.
Re: [PATCH] Vectorizing abs(char/short/int) on x86.
I have run check_GNU_style.sh on my patch. The patch is submitted. Thank you for your comments and help on this patch! thanks, Cong On Wed, Oct 30, 2013 at 11:13 AM, Uros Bizjak wrote: > On Wed, Oct 30, 2013 at 7:01 PM, Cong Hou wrote: > >>>> I found my problem: I put DONE outside of if not inside. You are >>>> right. I have updated my patch. >>> >>> OK, great that we put things in order ;) >>> >>> Does this patch need some extra middle-end functionality? I was not >>> able to vectorize char and short part of your patch. >> >> >> In the original patch, I converted abs() on short and char values to >> their own types by removing type casts. That is, originally char_val1 >> = abs(char_val2) will be converted to char_val1 = (char) abs((int) >> char_val2) in the frontend, and I would like to convert it back to >> char_val1 = abs(char_val2). But after several discussions, it seems >> this conversion has some problems such as overflow converns, and I >> thereby removed that part. >> >> Now you should still be able to vectorize abs(char) and abs(short) but >> with packing and unpacking. Later I will consider to write pattern >> recognizer for abs(char) and abs(short) and then the expand on >> abs(char)/abs(short) in this patch will be used during vectorization. > > OK, this seems reasonable. We already have "unused" SSSE3 8/16 bit abs > pattern, so I think we can commit SSE2 expanders, even if they will be > unused for now. The proposed recognizer will benefit SSE2 as well as > existing SSSE3 patterns. > >>> Regarding the testcase - please put it to gcc.target/i386/ directory. >>> There is nothing generic in the test, as confirmed by target-dependent >>> scan test. You will find plenty of examples in the mentioned >>> directory. I'd suggest to split the testcase in three files, and to >>> simplify it to something like the testcase with global variables I >>> used earlier. >> >> >> I have done it. The test case is split into three for s8/s16/s32 in >> gcc.target/i386. > > OK. > > The patch is OK for mainline, but please check formatting and > whitespace before the patch is committed. > > Thanks, > Uros.
Re: [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.
On Wed, Oct 30, 2013 at 4:27 AM, Richard Biener wrote: > On Tue, 29 Oct 2013, Cong Hou wrote: > >> Hi >> >> SAD (Sum of Absolute Differences) is a common and important algorithm >> in image processing and other areas. SSE2 even introduced a new >> instruction PSADBW for it. A SAD loop can be greatly accelerated by >> this instruction after being vectorized. This patch introduced a new >> operation SAD_EXPR and a SAD pattern recognizer in vectorizer. >> >> The pattern of SAD is shown below: >> >> unsigned type x_t, y_t; >> signed TYPE1 diff, abs_diff; >> TYPE2 sum = init; >>loop: >> sum_0 = phi >> S1 x_t = ... >> S2 y_t = ... >> S3 x_T = (TYPE1) x_t; >> S4 y_T = (TYPE1) y_t; >> S5 diff = x_T - y_T; >> S6 abs_diff = ABS_EXPR ; >> [S7 abs_diff = (TYPE2) abs_diff; #optional] >> S8 sum_1 = abs_diff + sum_0; >> >>where 'TYPE1' is at least double the size of type 'type', and 'TYPE2' is >> the >>same size of 'TYPE1' or bigger. This is a special case of a reduction >>computation. >> >> For SSE2, type is char, and TYPE1 and TYPE2 are int. >> >> >> In order to express this new operation, a new expression SAD_EXPR is >> introduced in tree.def, and the corresponding entry in optabs is >> added. The patch also added the "define_expand" for SSE2 and AVX2 >> platforms for i386. >> >> The patch is pasted below and also attached as a text file (in which >> you can see tabs). Bootstrap and make check got passed on x86. Please >> give me your comments. > > Apart from the testcase comment made earlier > > +++ b/gcc/tree-cfg.c > @@ -3797,6 +3797,7 @@ verify_gimple_assign_ternary (gimple stmt) >return false; > > case DOT_PROD_EXPR: > +case SAD_EXPR: > case REALIGN_LOAD_EXPR: >/* FIXME. */ >return false; > > please add proper verification of the operand types. OK. > > +/* Widening sad (sum of absolute differences). > + The first two arguments are of type t1 which should be unsigned > integer. > + The third argument and the result are of type t2, such that t2 is at > least > + twice the size of t1. SAD_EXPR(arg1,arg2,arg3) is equivalent to: > + tmp1 = WIDEN_MINUS_EXPR (arg1, arg2); > + tmp2 = ABS_EXPR (tmp1); > + arg3 = PLUS_EXPR (tmp2, arg3); */ > +DEFTREECODE (SAD_EXPR, "sad_expr", tcc_expression, 3) > > WIDEN_MINUS_EXPR doesn't exist so you have to explain on its > operation (it returns a signed wide difference?). Why should > the first two arguments be unsigned? I cannot see a good reason > to require that (other than that maybe the x86 target only has > support for widened unsigned difference?). So if you want to > make that restriction maybe change the name to SADU_EXPR > (sum of absolute differences of unsigned)? > > I suppose you tried introducing WIDEN_MINUS_EXPR instead and > letting combine do it's work, avoiding the very special optab? I may use the wrong representation here. I think the behavior of "WIDEN_MINUS_EXPR" in SAD is different from the general one. SAD usually works on unsigned integers (see http://en.wikipedia.org/wiki/Sum_of_absolute_differences), and before getting the difference between two unsigned integers, they are promoted to bigger signed integers. And the result of (int)(char)(1) - (int)(char)(-1) is different from (int)(unsigned char)(1) - (int)(unsigned char)(-1). So we cannot implement SAD using WIDEN_MINUS_EXPR. Also, the SSE2 instruction PSADBW also requires the operands to be unsigned 8-bit integers. I will remove the improper description as you pointed out. thanks, Cong > > Thanks, > Richard. 
> >> >> >> thanks, >> Cong >> >> >> >> diff --git a/gcc/ChangeLog b/gcc/ChangeLog >> index 8a38316..d528307 100644 >> --- a/gcc/ChangeLog >> +++ b/gcc/ChangeLog >> @@ -1,3 +1,23 @@ >> +2013-10-29 Cong Hou >> + >> + * tree-vect-patterns.c (vect_recog_sad_pattern): New function for SAD >> + pattern recognition. >> + (type_conversion_p): PROMOTION is true if it's a type promotion >> + conversion, and false otherwise. Return true if the given expression >> + is a type conversion one. >> + * tree-vectorizer.h: Adjust the number of patterns. >> + * tree.def: Add SAD_EXPR. >> + * optabs.def: Add sad_optab. >> + * cfgexpand.c (expand_debug_expr): Add SAD_EXPR case. >> + * expr.c (expand_expr_real_2): Likewise. >> + * gimple-pretty-print.c (dump_
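For readers skimming the thread, here is a minimal C loop with the S1-S8 shape given above (array names and the trip count are illustrative, not taken from the patch):

unsigned char x[64], y[64];

int
sad (void)
{
  int sum = 0;                            /* TYPE2 sum = init           */
  int i;
  for (i = 0; i < 64; ++i)
    {
      int diff = (int) x[i] - (int) y[i]; /* S3-S5: widen and subtract  */
      sum += diff < 0 ? -diff : diff;     /* S6 and S8: abs, accumulate */
    }
  return sum;
}

Here 'type' is unsigned char and both TYPE1 and TYPE2 are int, which is exactly the SSE2/PSADBW case mentioned above.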
Re: [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.
On Tue, Oct 29, 2013 at 4:49 PM, Ramana Radhakrishnan wrote: > Cong, > > Please don't do the following. > >>+++ b/gcc/testsuite/gcc.dg/vect/ > vect-reduc-sad.c > @@ -0,0 +1,54 @@ > +/* { dg-require-effective-target sse2 { target { i?86-*-* x86_64-*-* } } } */ > > you are adding a test to gcc.dg/vect - It's a common directory > containing tests that need to run on multiple architectures and such > tests should be keyed by the feature they enable which can be turned > on for ports that have such an instruction. > > The correct way of doing this is to key this on the feature something > like dg-require-effective-target vect_sad_char . And define the > equivalent routine in testsuite/lib/target-supports.exp and enable it > for sse2 for the x86 port. If in doubt look at > check_effective_target_vect_int and a whole family of such functions > in testsuite/lib/target-supports.exp > > This makes life easy for other port maintainers who want to turn on > this support. And for bonus points please update the testcase writing > wiki page with this information if it isn't already there. > OK, I will likely move the test case to gcc.target/i386 as currently only SSE2 provides SAD instruction. But your suggestion also helps! > You are also missing documentation updates for SAD_EXPR, md.texi for > the new standard pattern name. Shouldn't it be called sad4 > really ? > I will add the documentation for the new operation SAD_EXPR. I use sad by just following udot_prod as those two operations are quite similar: OPTAB_D (udot_prod_optab, "udot_prod$I$a") thanks, Cong > > regards > Ramana > > > > > > On Tue, Oct 29, 2013 at 10:23 PM, Cong Hou wrote: >> Hi >> >> SAD (Sum of Absolute Differences) is a common and important algorithm >> in image processing and other areas. SSE2 even introduced a new >> instruction PSADBW for it. A SAD loop can be greatly accelerated by >> this instruction after being vectorized. This patch introduced a new >> operation SAD_EXPR and a SAD pattern recognizer in vectorizer. >> >> The pattern of SAD is shown below: >> >> unsigned type x_t, y_t; >> signed TYPE1 diff, abs_diff; >> TYPE2 sum = init; >>loop: >> sum_0 = phi >> S1 x_t = ... >> S2 y_t = ... >> S3 x_T = (TYPE1) x_t; >> S4 y_T = (TYPE1) y_t; >> S5 diff = x_T - y_T; >> S6 abs_diff = ABS_EXPR ; >> [S7 abs_diff = (TYPE2) abs_diff; #optional] >> S8 sum_1 = abs_diff + sum_0; >> >>where 'TYPE1' is at least double the size of type 'type', and 'TYPE2' is >> the >>same size of 'TYPE1' or bigger. This is a special case of a reduction >>computation. >> >> For SSE2, type is char, and TYPE1 and TYPE2 are int. >> >> >> In order to express this new operation, a new expression SAD_EXPR is >> introduced in tree.def, and the corresponding entry in optabs is >> added. The patch also added the "define_expand" for SSE2 and AVX2 >> platforms for i386. >> >> The patch is pasted below and also attached as a text file (in which >> you can see tabs). Bootstrap and make check got passed on x86. Please >> give me your comments. >> >> >> >> thanks, >> Cong >> >> >> >> diff --git a/gcc/ChangeLog b/gcc/ChangeLog >> index 8a38316..d528307 100644 >> --- a/gcc/ChangeLog >> +++ b/gcc/ChangeLog >> @@ -1,3 +1,23 @@ >> +2013-10-29 Cong Hou >> + >> + * tree-vect-patterns.c (vect_recog_sad_pattern): New function for SAD >> + pattern recognition. >> + (type_conversion_p): PROMOTION is true if it's a type promotion >> + conversion, and false otherwise. Return true if the given expression >> + is a type conversion one. 
>> + * tree-vectorizer.h: Adjust the number of patterns. >> + * tree.def: Add SAD_EXPR. >> + * optabs.def: Add sad_optab. >> + * cfgexpand.c (expand_debug_expr): Add SAD_EXPR case. >> + * expr.c (expand_expr_real_2): Likewise. >> + * gimple-pretty-print.c (dump_ternary_rhs): Likewise. >> + * gimple.c (get_gimple_rhs_num_ops): Likewise. >> + * optabs.c (optab_for_tree_code): Likewise. >> + * tree-cfg.c (estimate_operator_cost): Likewise. >> + * tree-ssa-operands.c (get_expr_operands): Likewise. >> + * tree-vect-loop.c (get_initial_def_for_reduction): Likewise. >> + * config/i386/sse.md: Add SSE2 and AVX2 expand for SAD. >> + >> 2013-10-14 David Malcolm >> >> * dumpfile.h (gcc::dump_manager
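To make the suggestion concrete, here is a sketch of what the keyed testcase could look like; vect_sad_char is the effective-target name proposed above, the corresponding check_effective_target_vect_sad_char routine would still need to be added to testsuite/lib/target-supports.exp, and the dump string scanned for is an assumption rather than the patch's actual output:

/* { dg-require-effective-target vect_sad_char } */
/* { dg-additional-options "-fdump-tree-vect-details" } */

unsigned char a[256], b[256];

int
foo (void)
{
  int i, sum = 0;
  for (i = 0; i < 256; ++i)
    sum += __builtin_abs (a[i] - b[i]);
  return sum;
}

/* { dg-final { scan-tree-dump "vect_recog_sad_pattern: detected" "vect" } } */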
Re: [PATCH] Vectorizing abs(char/short/int) on x86.
This update makes it safer. You showed me how to write better expand code. Thank you for the improvement! thanks, Cong On Thu, Oct 31, 2013 at 11:43 AM, Uros Bizjak wrote: > On Wed, Oct 30, 2013 at 9:02 PM, Cong Hou wrote: >> I have run check_GNU_style.sh on my patch. >> >> The patch is submitted. Thank you for your comments and help on this patch! > > I have committed a couple of fixes/improvements to your expander in > i386.c. There is no need to check for the result of > expand_simple_binop. Also, there is no guarantee that > expand_simple_binop will expand to the target. It can return different > RTX. Also, unhandled modes are now marked with gcc_unreachable. > > 2013-10-31 Uros Bizjak > > * config/i386/i386.c (ix86_expand_sse2_abs): Rename function arguments. > Use gcc_unreachable for unhandled modes. Do not check results of > expand_simple_binop. If not expanded to target, move the result. > > Tested on x86_64-pc-linux-gnu and committed. > > Uros.
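For reference, the idiom described above, as a fragment rather than the committed code (variable names are illustrative): expand_simple_binop may legitimately return an rtx other than the requested target, so the result is moved when needed instead of being treated as a failure:

  rtx tmp = expand_simple_binop (mode, SMAX, op1, neg, target,
                                 0, OPTAB_DIRECT);
  if (tmp != target)
    emit_move_insn (target, tmp);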
[PATCH] Handling == or != comparisons that may affect range test optimization.
(This patch is for the bug 58728: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58728) As in the bug report, consider the following loop: int foo(unsigned int n) { if (n != 0) if (n != 1) if (n != 2) if (n != 3) if (n != 4) return ++n; return n; } The range test optimization should be able to merge all those five conditions into one in the reassoc pass, but it fails to do so. The reason is that the phi arg of n is replaced by the constant it compares to in case of == or != comparisons (in the vrp pass). GCC checks there is no side effect on n between any two neighboring conditions by examining whether they define the same phi arg in the join node. But as the phi arg is replaced by a constant, the check fails. This patch deals with this situation by considering the existence of == or != comparisons; it is attached below (a text file with proper tabs is also attached). Bootstrap and make check both pass. Any comment? thanks, Cong diff --git a/gcc/ChangeLog b/gcc/ChangeLog index 8a38316..9247222 100644 --- a/gcc/ChangeLog +++ b/gcc/ChangeLog @@ -1,3 +1,11 @@ +2013-10-31 Cong Hou + + PR tree-optimization/58728 + * tree-ssa-reassoc.c (suitable_cond_bb): Consider the situation + that ==/!= comparisons between a variable and a constant may lead + to the later phi arg of the variable being substituted by the + constant in prior passes, during range test optimization. + 2013-10-14 David Malcolm * dumpfile.h (gcc::dump_manager): New class, to hold state diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog index 075d071..44a5e70 100644 --- a/gcc/testsuite/ChangeLog +++ b/gcc/testsuite/ChangeLog @@ -1,3 +1,8 @@ +2013-10-31 Cong Hou + + PR tree-optimization/58728 + * gcc.dg/tree-ssa/pr58728.c: New test. + 2013-10-14 Tobias Burnus PR fortran/58658 diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr58728.c b/gcc/testsuite/gcc.dg/tree-ssa/pr58728.c new file mode 100644 index 000..312aebc --- /dev/null +++ b/gcc/testsuite/gcc.dg/tree-ssa/pr58728.c @@ -0,0 +1,25 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -fdump-tree-reassoc1-details" } */ + +int foo (unsigned int n) +{ + if (n != 0) +if (n != 1) + return ++n; + return n; +} + +int bar (unsigned int n) +{ + if (n == 0) +; + else if (n == 1) +; + else +return ++n; + return n; +} + + +/* { dg-final { scan-tree-dump-times "Optimizing range tests" 2 "reassoc1" } } */ +/* { dg-final { cleanup-tree-dump "reassoc1" } } */ diff --git a/gcc/tree-ssa-reassoc.c b/gcc/tree-ssa-reassoc.c index 6859518..bccf99f 100644 --- a/gcc/tree-ssa-reassoc.c +++ b/gcc/tree-ssa-reassoc.c @@ -2426,11 +2426,70 @@ suitable_cond_bb (basic_block bb, basic_block test_bb, basic_block *other_bb, for (gsi = gsi_start_phis (e->dest); !gsi_end_p (gsi); gsi_next (&gsi)) { gimple phi = gsi_stmt (gsi); + tree phi_arg = gimple_phi_arg_def (phi, e->dest_idx); + tree phi_arg2 = gimple_phi_arg_def (phi, e2->dest_idx); + /* If both BB and TEST_BB end with GIMPLE_COND, all PHI arguments corresponding to BB and TEST_BB predecessor must be the same. */ - if (!operand_equal_p (gimple_phi_arg_def (phi, e->dest_idx), -gimple_phi_arg_def (phi, e2->dest_idx), 0)) - { + if (!operand_equal_p (phi_arg, phi_arg2, 0)) + { + /* If the condition in BB or TEST_BB is an NE or EQ comparison like + if (n != N) or if (n == N), it is possible that the corresponding + def of n in the phi function is replaced by N. We should still allow + range test optimization in this case.
*/ + + tree lhs = NULL, rhs = NULL, + lhs2 = NULL, rhs2 = NULL; + bool is_eq_expr = is_cond && (gimple_cond_code (stmt) == NE_EXPR + || gimple_cond_code (stmt) == EQ_EXPR) + && TREE_CODE (phi_arg) == INTEGER_CST; + + if (is_eq_expr) + { +lhs = gimple_cond_lhs (stmt); +rhs = gimple_cond_rhs (stmt); + +if (operand_equal_p (lhs, phi_arg, 0)) + { + tree t = lhs; + lhs = rhs; + rhs = t; + } +if (operand_equal_p (rhs, phi_arg, 0) + && operand_equal_p (lhs, phi_arg2, 0)) + continue; + } + + gimple stmt2 = last_stmt (test_bb); + bool is_eq_expr2 = gimple_code (stmt2) == GIMPLE_COND + && (gimple_cond_code (stmt2) == NE_EXPR + || gimple_cond_code (stmt2) == EQ_EXPR) + && TREE_CODE (phi_arg2) == INTEGER_CST; + + if (is_eq_expr2) + { +lhs2 = gimple_cond_lhs (stmt2); +rhs2 = gimple_cond_rhs (stmt2); + +if (operand_equal_p (lhs2, phi_arg2, 0)) + { + tree t = lhs2; + lhs2 = rhs2; + rhs2 = t; + } +if (operand_equal_p (rhs2, phi_arg2, 0) + && operand_equal_p (lhs2, phi_arg, 0)) + continue; + } + + if (is_eq_expr && is_eq_expr2) + { +if (operand_equal_p (rhs, phi_arg, 0) + && operand_equal_p (rhs2, phi_arg2, 0) + && operand_equal_p (lhs, lhs2, 0)) + continue; + } + /* Otherwise, if one of the blocks doesn't end wit
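For reference, the net effect on the PR58728 example: once the phi-arg substitution is accounted for, reassoc can merge the five tests into a single range check, conceptually equivalent to the following (a sketch of the optimized logic, not actual compiler output):

int
foo_merged (unsigned int n)
{
  if (n > 4)      /* n != 0 && n != 1 && ... && n != 4, merged */
    return n + 1;
  return n;
}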
[PATCH] Small fix: add { dg-require-effective-target vect_int } to testsuite/gcc.dg/vect/pr58508.c
It seems that on some platforms the loops in testsuite/gcc.dg/vect/pr58508.c may not be vectorizable. This small patch adds { dg-require-effective-target vect_int } to make sure all loops can be vectorized. thanks, Cong diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog index 9d0f4a5..3d9916d 100644 --- a/gcc/testsuite/ChangeLog +++ b/gcc/testsuite/ChangeLog @@ -1,3 +1,7 @@ +2013-10-29 Cong Hou + + * gcc.dg/vect/pr58508.c: Update. + 2013-10-15 Cong Hou * gcc.dg/vect/pr58508.c: New test. diff --git a/gcc/testsuite/gcc.dg/vect/pr58508.c b/gcc/testsuite/gcc.dg/vect/pr58508.c index 6484a65..fff7a04 100644 --- a/gcc/testsuite/gcc.dg/vect/pr58508.c +++ b/gcc/testsuite/gcc.dg/vect/pr58508.c @@ -1,3 +1,4 @@ +/* { dg-require-effective-target vect_int } */ /* { dg-do compile } */ /* { dg-options "-O2 -ftree-vectorize -fdump-tree-vect-details" } */
Re: [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.
On Mon, Nov 4, 2013 at 2:06 AM, James Greenhalgh wrote: > On Fri, Nov 01, 2013 at 04:48:53PM +0000, Cong Hou wrote: >> diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi >> index 2a5a2e1..8f5d39a 100644 >> --- a/gcc/doc/md.texi >> +++ b/gcc/doc/md.texi >> @@ -4705,6 +4705,16 @@ wider mode, is computed and added to operand 3. >> Operand 3 is of a mode equal or >> wider than the mode of the product. The result is placed in operand 0, which >> is of the same mode as operand 3. >> >> +@cindex @code{ssad@var{m}} instruction pattern >> +@item @samp{ssad@var{m}} >> +@cindex @code{usad@var{m}} instruction pattern >> +@item @samp{usad@var{m}} >> +Compute the sum of absolute differences of two signed/unsigned elements. >> +Operand 1 and operand 2 are of the same mode. Their absolute difference, >> which >> +is of a wider mode, is computed and added to operand 3. Operand 3 is of a >> mode >> +equal or wider than the mode of the absolute difference. The result is >> placed >> +in operand 0, which is of the same mode as operand 3. >> + >> @cindex @code{ssum_widen@var{m3}} instruction pattern >> @item @samp{ssum_widen@var{m3}} >> @cindex @code{usum_widen@var{m3}} instruction pattern >> diff --git a/gcc/expr.c b/gcc/expr.c >> index 4975a64..1db8a49 100644 > > I'm not sure I follow, and if I do - I don't think it matches what > you have implemented for i386. > > From your text description I would guess the series of operations to be: > > v1 = widen (operands[1]) > v2 = widen (operands[2]) > v3 = abs (v1 - v2) > operands[0] = v3 + operands[3] > > But if I understand the behaviour of PSADBW correctly, what you have > actually implemented is: > > v1 = widen (operands[1]) > v2 = widen (operands[2]) > v3 = abs (v1 - v2) > v4 = reduce_plus (v3) > operands[0] = v4 + operands[3] > > To my mind, synthesizing the reduce_plus step will be wasteful for targets > who do not get this for free with their Absolute Difference step. Imagine a > simple loop where we have synthesized the reduce_plus, we compute partial > sums each loop iteration, though we would be better to leave the reduce_plus > step until after the loop. "REDUC_PLUS_EXPR" would be the appropriate > Tree code for this. What do you mean when you use "synthesizing" here? For each pattern, the only synthesized operation is the one being returned from the pattern recognizer. In this case, it is USAD_EXPR. The recognition of reduce sum is necessary as we need corresponding prolog and epilog for reductions, which is already done before pattern recognition. Note that reduction is not a pattern but is a type of vector definition. A vectorization pattern can still be a reduction operation as long as STMT_VINFO_RELATED_STMT of this pattern is a reduction operation. You can check the other two reduction patterns: widen_sum_pattern and dot_prod_pattern for reference. Thank you for your comment! Cong > > I would prefer to see this Tree code not imply the reduce_plus. > > Thanks, > James >
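A scalar model of the reduction scheme described above, assuming four lanes and a trip count that is a multiple of four (purely illustrative; the vectorizer emits the epilogue automatically once the reduction is recognized):

int
sad_model (const unsigned char *x, const unsigned char *y, int n)
{
  int acc[4] = { 0, 0, 0, 0 };   /* the vector reduction variable       */
  int i, k;
  for (i = 0; i < n; i += 4)     /* one USAD_EXPR per vector iteration  */
    for (k = 0; k < 4; ++k)
      acc[k] += __builtin_abs ((int) x[i + k] - (int) y[i + k]);
  /* Reduction epilogue generated after the loop:  */
  return acc[0] + acc[1] + acc[2] + acc[3];
}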
Re: [PATCH] Fixing PR59006 and PR58921 by delaying loop invariant hoisting in vectorizer.
On Wed, Nov 27, 2013 at 1:53 AM, Richard Biener wrote: > On Fri, 22 Nov 2013, Cong Hou wrote: > >> Hi >> >> Currently in GCC vectorization, some loop invariant may be detected >> after aliasing checks, which can be hoisted outside of the loop. The >> current method in GCC may break the information built during the >> analysis phase, causing some crash (see PR59006 and PR58921). >> >> This patch improves the loop invariant hoisting by delaying it until >> all statements are vectorized, thereby keeping all built information. >> But those loop invariant statements won't be vectorized, and if a >> variable is defined by one of those loop invariant, it is treated as >> an external definition. >> >> Bootstrapped and testes on an x86-64 machine. > > Hmm. I'm still thinking that we should handle this during the regular > transform step. > > Like with the following incomplete patch. Missing is adjusting > the rest of the vectorizable_* functions to handle the case where all defs > are dt_external or constant by setting their own STMT_VINFO_DEF_TYPE to > dt_external. From the gcc.dg/vect/pr58508.c we get only 4 hoists > instead of 8 because of this (I think). > > Also gcc.dg/vect/pr52298.c ICEs for yet unanalyzed reason. > > I can take over the bug if you like. > > Thanks, > Richard. > > Index: gcc/tree-vect-data-refs.c > === > *** gcc/tree-vect-data-refs.c (revision 205435) > --- gcc/tree-vect-data-refs.c (working copy) > *** again: > *** 3668,3673 > --- 3668,3682 > } > STMT_VINFO_STRIDE_LOAD_P (stmt_info) = true; > } > + else if (loop_vinfo > + && integer_zerop (DR_STEP (dr))) > + { > + /* All loads from a non-varying address will be disambiguated > +by data-ref analysis or via a runtime alias check and thus > +they will become invariant. Force them to be vectorized > +as external. */ > + STMT_VINFO_DEF_TYPE (stmt_info) = vect_external_def; > + } > } > > /* If we stopped analysis at the first dataref we could not analyze I agree that setting the statement that loads a data-ref with zero step as vect_external_def early at this point is a good idea. This avoids two loop analyses seeing inconsistent def-info if we do this later. Note with this change the following loop in PR59006 will not be vectorized: int a[8], b; void fn1(void) { int c; for (; b; b++) { int d = a[b]; c = a[0] ? d : 0; a[b] = c; } } This is because the load to a[0] is now treated as an external def, in which case vectype cannot be found for the condition of the conditional expression, while vectorizable_condition requires that comp_vectype should be set properly. We can treat it as a missed optimization. > Index: gcc/tree-vect-loop-manip.c > === > *** gcc/tree-vect-loop-manip.c (revision 205435) > --- gcc/tree-vect-loop-manip.c (working copy) > *** vect_loop_versioning (loop_vec_info loop > *** 2269,2275 > > /* Extract load statements on memrefs with zero-stride accesses. */ > > ! if (LOOP_REQUIRES_VERSIONING_FOR_ALIAS (loop_vinfo)) > { > /* In the loop body, we iterate each statement to check if it is a > load. > Then we check the DR_STEP of the data reference. If DR_STEP is zero, > --- 2269,2275 > > /* Extract load statements on memrefs with zero-stride accesses. */ > > ! if (0 && LOOP_REQUIRES_VERSIONING_FOR_ALIAS (loop_vinfo)) > { > /* In the loop body, we iterate each statement to check if it is a > load. > Then we check the DR_STEP of the data reference. 
If DR_STEP is zero, > Index: gcc/tree-vect-loop.c > === > *** gcc/tree-vect-loop.c(revision 205435) > --- gcc/tree-vect-loop.c(working copy) > *** vect_transform_loop (loop_vec_info loop_ > *** 5995,6000 > --- 5995,6020 > } > } > > + /* If the stmt is loop invariant simply move it. */ > + if (STMT_VINFO_DEF_TYPE (stmt_info) == vect_external_def) > + { > + if (dump_enabled_p ()) > + { > + dump_printf_loc (MSG_NOTE, vect_location, > + "hoisting out of the vectorized loop: "); > + dump_gimple_stmt (MSG_NOTE, TDF_SLIM, stmt, 0); > + dump_printf (MSG_NOTE, "
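In source terms, the zero-step case this hunk targets looks like the loop below; with the change, the invariant load becomes an external def and the splat can be emitted on the preheader edge (a sketch, not compiler output):

void
foo (int *a, int *b)
{
  int i;
  for (i = 0; i < 1024; ++i)
    a[i] = *b + 1;   /* *b has DR_STEP zero: invariant in the versioned
                        no-alias loop  */
}

/* Conceptually, after versioning for alias:
     t  = *b;                  -- invariant load, before the loop
     vt = { t, t, t, t };      -- splat on the preheader edge
     loop body: a[i:i+3] = vt + { 1, 1, 1, 1 };  */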
Re: [PATCH] Support addsub/subadd as non-isomorphic operations for SLP vectorizer.
Any comment on this patch? thanks, Cong On Fri, Nov 22, 2013 at 11:40 AM, Cong Hou wrote: > On Fri, Nov 22, 2013 at 3:57 AM, Marc Glisse wrote: >> On Thu, 21 Nov 2013, Cong Hou wrote: >> >>> On Thu, Nov 21, 2013 at 4:39 PM, Marc Glisse wrote: >>>> >>>> On Thu, 21 Nov 2013, Cong Hou wrote: >>>> >>>>> While I added the new define_insn_and_split for vec_merge, a bug is >>>>> exposed: in config/i386/sse.md, [ define_expand "xop_vmfrcz2" ] >>>>> only takes one input, but the corresponding builtin functions have two >>>>> inputs, which are shown in i386.c: >>>>> >>>>> { OPTION_MASK_ISA_XOP, CODE_FOR_xop_vmfrczv4sf2, >>>>> "__builtin_ia32_vfrczss", IX86_BUILTIN_VFRCZSS, UNKNOWN, >>>>> (int)MULTI_ARG_2_SF }, >>>>> { OPTION_MASK_ISA_XOP, CODE_FOR_xop_vmfrczv2df2, >>>>> "__builtin_ia32_vfrczsd", IX86_BUILTIN_VFRCZSD, UNKNOWN, >>>>> (int)MULTI_ARG_2_DF }, >>>>> >>>>> In consequence, the ix86_expand_multi_arg_builtin() function tries to >>>>> check two args but based on the define_expand of xop_vmfrcz2, >>>>> the content of insn_data[CODE_FOR_xop_vmfrczv4sf2].operand[2] may be >>>>> incorrect (because it only needs one input). >>>>> >>>>> The patch below fixed this issue. >>>>> >>>>> Bootstrapped and tested on ax x86-64 machine. Note that this patch >>>>> should be applied before the one I sent earlier (sorry for sending >>>>> them in wrong order). >>>> >>>> >>>> >>>> This is PR 56788. Your patch seems strange to me and I don't think it >>>> fixes the real issue, but I'll let more knowledgeable people answer. >>> >>> >>> >>> Thank you for pointing out the bug report. This patch is not intended >>> to fix PR56788. >> >> >> IMHO, if PR56788 was fixed, you wouldn't have this issue, and if PR56788 >> doesn't get fixed, I'll post a patch to remove _mm_frcz_sd and the >> associated builtin, which would solve your issue as well. > > > I agree. Then I will wait until your patch is merged to the trunk, > otherwise my patch could not pass the test. > > >> >> >>> For your function: >>> >>> #include >>> __m128d f(__m128d x, __m128d y){ >>> return _mm_frcz_sd(x,y); >>> } >>> >>> Note that the second parameter is ignored intentionally, but the >>> prototype of this function contains two parameters. My fix is >>> explicitly telling GCC that the optab xop_vmfrczv4sf3 should have >>> three operands instead of two, to let it have the correct information >>> in insn_data[CODE_FOR_xop_vmfrczv4sf3].operand[2] which is used to >>> match the type of the second parameter in the builtin function in >>> ix86_expand_multi_arg_builtin(). >> >> >> I disagree that this is intentional, it is a bug. AFAIK there is no AMD >> documentation that could be used as a reference for what _mm_frcz_sd is >> supposed to do. The only existing documentations are by Microsoft (which >> does *not* ignore the second argument) and by LLVM (which has a single >> argument). Whatever we chose for _mm_frcz_sd, the builtin should take a >> single argument, and if necessary we'll use 2 builtins to implement >> _mm_frcz_sd. >> > > > I also only found the one by Microsoft.. If the second argument is > ignored, we could just remove it, as long as there is no "standard" > that requires two arguments. Hopefully it won't break current projects > using _mm_frcz_sd. > > Thank you for your comments! > > > Cong > > >> -- >> Marc Glisse
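For context, the kind of non-isomorphic SLP group the patch in the subject line targets, since no example appears in this sub-thread (a sketch; on x86 the motivating mapping is SSE3 ADDSUBPD, which subtracts in the low lane and adds in the high lane):

double a[2], b[2], c[2];

void
addsub (void)
{
  a[0] = b[0] - c[0];   /* lane 0 subtracts ...           */
  a[1] = b[1] + c[1];   /* ... lane 1 adds: one ADDSUBPD  */
}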
Re: [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.
Hi Richard Could you please take a look at this patch and see if it is ready for the trunk? The patch is pasted as a text file here again. Thank you very much! Cong On Mon, Nov 11, 2013 at 11:25 AM, Cong Hou wrote: > Hi James > > Sorry for the late reply. > > > On Fri, Nov 8, 2013 at 2:55 AM, James Greenhalgh > wrote: >>> On Tue, Nov 5, 2013 at 9:58 AM, Cong Hou wrote: >>> > Thank you for your detailed explanation. >>> > >>> > Once GCC detects a reduction operation, it will automatically >>> > accumulate all elements in the vector after the loop. In the loop the >>> > reduction variable is always a vector whose elements are reductions of >>> > corresponding values from other vectors. Therefore in your case the >>> > only instruction you need to generate is: >>> > >>> > VABAL ops[3], ops[1], ops[2] >>> > >>> > It is OK if you accumulate the elements into one in the vector inside >>> > of the loop (if one instruction can do this), but you have to make >>> > sure other elements in the vector should remain zero so that the final >>> > result is correct. >>> > >>> > If you are confused about the documentation, check the one for >>> > udot_prod (just above usad in md.texi), as it has very similar >>> > behavior as usad. Actually I copied the text from there and did some >>> > changes. As those two instruction patterns are both for vectorization, >>> > their behavior should not be difficult to explain. >>> > >>> > If you have more questions or think that the documentation is still >>> > improper please let me know. >> >> Hi Cong, >> >> Thanks for your reply. >> >> I've looked at Dorit's original patch adding WIDEN_SUM_EXPR and >> DOT_PROD_EXPR and I see that the same ambiguity exists for >> DOT_PROD_EXPR. Can you please add a note in your tree.def >> that SAD_EXPR, like DOT_PROD_EXPR can be expanded as either: >> >> tmp = WIDEN_MINUS_EXPR (arg1, arg2) >> tmp2 = ABS_EXPR (tmp) >> arg3 = PLUS_EXPR (tmp2, arg3) >> >> or: >> >> tmp = WIDEN_MINUS_EXPR (arg1, arg2) >> tmp2 = ABS_EXPR (tmp) >> arg3 = WIDEN_SUM_EXPR (tmp2, arg3) >> >> Where WIDEN_MINUS_EXPR is a signed MINUS_EXPR, returning a >> a value of the same (widened) type as arg3. >> > > > I have added it, although we currently don't have WIDEN_MINUS_EXPR (I > mentioned it in tree.def). > > >> Also, while looking for the history of DOT_PROD_EXPR I spotted this >> patch: >> >> [autovect] [patch] detect mult-hi and sad patterns >> http://gcc.gnu.org/ml/gcc-patches/2005-10/msg01394.html >> >> I wonder what the reason was for that patch to be dropped? >> > > It has been 8 years.. I have no idea why this patch is not accepted > finally. There is even no reply in that thread. But I believe the SAD > pattern is very important to be recognized. ARM also provides > instructions for it. > > > Thank you for your comment again! > > > thanks, > Cong > > > >> Thanks, >> James >> diff --git a/gcc/ChangeLog b/gcc/ChangeLog index 6bdaa31..37ff6c4 100644 --- a/gcc/ChangeLog +++ b/gcc/ChangeLog @@ -1,4 +1,24 @@ -2013-11-01 Trevor Saunders +2013-10-29 Cong Hou + + * tree-vect-patterns.c (vect_recog_sad_pattern): New function for SAD + pattern recognition. + (type_conversion_p): PROMOTION is true if it's a type promotion + conversion, and false otherwise. Return true if the given expression + is a type conversion one. + * tree-vectorizer.h: Adjust the number of patterns. + * tree.def: Add SAD_EXPR. + * optabs.def: Add sad_optab. + * cfgexpand.c (expand_debug_expr): Add SAD_EXPR case. + * expr.c (expand_expr_real_2): Likewise. 
+ * gimple-pretty-print.c (dump_ternary_rhs): Likewise. + * gimple.c (get_gimple_rhs_num_ops): Likewise. + * optabs.c (optab_for_tree_code): Likewise. + * tree-cfg.c (estimate_operator_cost): Likewise. + * tree-ssa-operands.c (get_expr_operands): Likewise. + * tree-vect-loop.c (get_initial_def_for_reduction): Likewise. + * config/i386/sse.md: Add SSE2 and AVX2 expand for SAD. + * doc/generic.texi: Add document for SAD_EXPR. + * doc/md.texi: Add document for ssad and usad. * function.c (reorder_blocks): Convert block_stack to a stack_vec. * gimplify.c (gimplify_compound_lval): Likewise. diff --git a/gcc/cfgexpand.c b/gcc/cfgexpand.c index fb05ce7..1f824fb 100644 --- a/gcc/cfgexpand.c +++ b/gcc/cfgex
[PATCH] Enhancing the widen-mult pattern in vectorization.
Hi The current widen-mult pattern only considers two operands with the same size. However, operands with different sizes can also benefit from this pattern. The following loop shows such an example: char a[N]; short b[N]; int c[N]; for (int i = 0; i < N; ++i) c[i] = a[i] * b[i]; In this case, we can convert a[i] into short type and then perform widen-mult on b[i] and the converted value: for (int i = 0; i < N; ++i) { short t = a[i]; c[i] = t w* b[i]; } This patch adds such support. In addition, the following loop fails to be recognized as a widen-mult pattern because the widening operation from char to int is not directly supported by the target: char a[N], b[N]; int c[N]; for (int i = 0; i < N; ++i) c[i] = a[i] * b[i]; In this case, we can still perform widen-mult on a[i] and b[i], and get a result of short type, then convert it to int: char a[N], b[N]; int c[N]; for (int i = 0; i < N; ++i) { short t = a[i] w* b[i]; c[i] = (int) t; } Currently GCC does not allow multi-step conversions for binary widening operations. This patch removes this restriction and uses VEC_UNPACK_LO_EXPR/VEC_UNPACK_HI_EXPR to arrange the data after the widen-mult is performed for the widen-mult pattern. This can reduce several unpacking instructions (for this example, the number of packings/unpackings is reduced from 12 to 8; for SSE2, the inefficient multiplication between two V4SI vectors can also be avoided). Bootstrapped and tested on an x86_64 machine. thanks, Cong diff --git a/gcc/ChangeLog b/gcc/ChangeLog index f298c0b..44ed204 100644 --- a/gcc/ChangeLog +++ b/gcc/ChangeLog @@ -1,3 +1,12 @@ +2013-12-02 Cong Hou + + * tree-vect-patterns.c (vect_recog_widen_mult_pattern): Enhance + the widen-mult pattern by handling two operands with different + sizes. + * tree-vect-stmts.c (vectorizable_conversion): Allow multi-steps + conversions after widening mult operation. + (supportable_widening_operation): Likewise. + 2013-11-22 Jakub Jelinek PR sanitizer/59061 diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog index 12d2c90..611ae1c 100644 --- a/gcc/testsuite/ChangeLog +++ b/gcc/testsuite/ChangeLog @@ -1,3 +1,8 @@ +2013-12-02 Cong Hou + + * gcc.dg/vect/vect-widen-mult-u8-s16-s32.c: New test. + * gcc.dg/vect/vect-widen-mult-u8-u32.c: New test. + 2013-11-22 Jakub Jelinek * c-c++-common/asan/no-redundant-instrumentation-7.c: Fix diff --git a/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u8-s16-s32.c b/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u8-s16-s32.c new file mode 100644 index 000..9f9081b --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u8-s16-s32.c @@ -0,0 +1,48 @@ +/* { dg-require-effective-target vect_int } */ + +#include +#include "tree-vect.h" + +#define N 64 + +unsigned char X[N] __attribute__ ((__aligned__(__BIGGEST_ALIGNMENT__))); +short Y[N] __attribute__ ((__aligned__(__BIGGEST_ALIGNMENT__))); +int result[N]; + +/* unsigned char * short -> int widening-mult. */ +__attribute__ ((noinline)) int +foo1(int len) { + int i; + + for (i=0; i +#include "tree-vect.h" + +#define N 64 + +unsigned char X[N] __attribute__ ((__aligned__(__BIGGEST_ALIGNMENT__))); +unsigned char Y[N] __attribute__ ((__aligned__(__BIGGEST_ALIGNMENT__))); +unsigned int result[N]; + +/* unsigned char-> unsigned int widening-mult. */ +__attribute__ ((noinline)) int +foo1(int len) { + int i; + + for (i=0; i *stmts, return NULL; } + /* If the two arguments have different sizes, convert the one with + the smaller type into the larger type.
*/ + if (TYPE_PRECISION (half_type0) != TYPE_PRECISION (half_type1)) +{ + tree* oprnd = NULL; + gimple def_stmt = NULL; + + if (TYPE_PRECISION (half_type0) < TYPE_PRECISION (half_type1)) + { + def_stmt = def_stmt0; + half_type0 = half_type1; + oprnd = &oprnd0; + } + else + { + def_stmt = def_stmt1; + half_type1 = half_type0; + oprnd = &oprnd1; + } + + if (STMT_VINFO_RELATED_STMT (vinfo_for_stmt (def_stmt))) + { + gimple new_stmt = STMT_VINFO_RELATED_STMT (vinfo_for_stmt (def_stmt)); + /* Check if the already created pattern stmt is what we need. */ + if (!is_gimple_assign (new_stmt) + || gimple_assign_rhs_code (new_stmt) != NOP_EXPR + || TREE_TYPE (gimple_assign_lhs (new_stmt)) != half_type0) +return NULL; + + stmts->safe_push (def_stmt); + *oprnd = gimple_assign_lhs (new_stmt); + } + else + { + tree old_oprnd = gimple_assign_rhs1 (def_stmt); + tree new_oprnd = make_ssa_name (half_type0, NULL); + gimple new_stmt = gimple_build_assign_with_ops (NOP_EXPR, new_oprnd, + old_oprnd, NULL_TREE); + STMT_VINFO_RELATED_STMT (vinfo_for_stmt (def_stmt)) = new_stmt; + stmts->safe_push (def_stmt); + *oprnd = new_oprnd; + } +} + /* Handle unsigned case. Look for S6 u_prod_T = (unsigned TYPE) prod_T; Use unsigned TYPE as the type for WIDEN_MULT_EXPR. */ diff --git a/gcc/tree-vect-stmts.c b/gc
Re: [PATCH] Hoist loop invariant statements containing data refs with zero-step during loop-versioning in vectorization.
Hi Richard You mentioned that Micha has a patch pending that enables vectorization of zero-step stores. What is the status of this patch? I could not find it through searching "Micha". Thank you! Cong On Wed, Oct 16, 2013 at 2:02 AM, Richard Biener wrote: > On Tue, 15 Oct 2013, Cong Hou wrote: > >> Thank you for your reminder, Jeff! I just noticed Richard's comment. I >> have modified the patch according to that. >> >> The new patch is attached. > > (posting patches inline is easier for review, now you have to deal > with no quoting markers ;)) > > Comments inline. > > diff --git a/gcc/ChangeLog b/gcc/ChangeLog > index 8a38316..2637309 100644 > --- a/gcc/ChangeLog > +++ b/gcc/ChangeLog > @@ -1,3 +1,8 @@ > +2013-10-15 Cong Hou > + > + * tree-vect-loop-manip.c (vect_loop_versioning): Hoist loop invariant > + statement that contains data refs with zero-step. > + > 2013-10-14 David Malcolm > > * dumpfile.h (gcc::dump_manager): New class, to hold state > diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog > index 075d071..9d0f4a5 100644 > --- a/gcc/testsuite/ChangeLog > +++ b/gcc/testsuite/ChangeLog > @@ -1,3 +1,7 @@ > +2013-10-15 Cong Hou > + > + * gcc.dg/vect/pr58508.c: New test. > + > 2013-10-14 Tobias Burnus > > PR fortran/58658 > diff --git a/gcc/testsuite/gcc.dg/vect/pr58508.c > b/gcc/testsuite/gcc.dg/vect/pr58508.c > new file mode 100644 > index 000..cb22b50 > --- /dev/null > +++ b/gcc/testsuite/gcc.dg/vect/pr58508.c > @@ -0,0 +1,20 @@ > +/* { dg-do compile } */ > +/* { dg-options "-O2 -ftree-vectorize -fdump-tree-vect-details" } */ > + > + > +/* The GCC vectorizer generates loop versioning for the following loop > + since there may exist aliasing between A and B. The predicate checks > + if A may alias with B across all iterations. Then for the loop in > + the true body, we can assert that *B is a loop invariant so that > + we can hoist the load of *B before the loop body. */ > + > +void foo (int* a, int* b) > +{ > + int i; > + for (i = 0; i < 10; ++i) > +a[i] = *b + 1; > +} > + > + > +/* { dg-final { scan-tree-dump-times "hoist" 2 "vect" } } */ > +/* { dg-final { cleanup-tree-dump "vect" } } */ > diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c > index 574446a..f4fdec2 100644 > --- a/gcc/tree-vect-loop-manip.c > +++ b/gcc/tree-vect-loop-manip.c > @@ -2477,6 +2477,92 @@ vect_loop_versioning (loop_vec_info loop_vinfo, >adjust_phi_and_debug_stmts (orig_phi, e, PHI_RESULT (new_phi)); > } > > > Note that applying this kind of transform at this point invalidates > some of the earlier analysis the vectorizer performed (namely the > def-kind which now effectively gets vect_external_def from > vect_internal_def). In this case it doesn't seem to cause any > issues (we re-compute the def-kind everytime we need it (how wasteful)). > > + /* Extract load and store statements on pointers with zero-stride > + accesses. */ > + if (LOOP_REQUIRES_VERSIONING_FOR_ALIAS (loop_vinfo)) > +{ > + /* In the loop body, we iterate each statement to check if it is a load > +or store. Then we check the DR_STEP of the data reference. If > +DR_STEP is zero, then we will hoist the load statement to the loop > +preheader, and move the store statement to the loop exit. */ > > We don't move the store yet. Micha has a patch pending that enables > vectorization of zero-step stores.
> > + for (gimple_stmt_iterator si = gsi_start_bb (loop->header); > + !gsi_end_p (si);) > > While technically ok now (vectorized loops contain a single basic block) > please use LOOP_VINFO_BBS () to get at the vector of basic-blcoks > and iterate over them like other code does. > > + { > + gimple stmt = gsi_stmt (si); > + stmt_vec_info stmt_info = vinfo_for_stmt (stmt); > + struct data_reference *dr = STMT_VINFO_DATA_REF (stmt_info); > + > + if (dr && integer_zerop (DR_STEP (dr))) > + { > + if (DR_IS_READ (dr)) > + { > + if (dump_enabled_p ()) > + { > + dump_printf_loc > + (MSG_NOTE, vect_location, > + "hoist the statement to outside of the loop "); > > "hoisting out of the vectorized loop: " > > + dump_gimple_stmt (MSG_NOTE, TDF_SLIM, stmt, 0); > + dump_printf (MSG_NOTE, "\n"); > +
Re: [PATCH] Enhancing the widen-mult pattern in vectorization.
After further reviewing this patch, I found I don't have to change the code in tree-vect-stmts.c to allow further type conversion after the widen-mult operation. Instead, I detect the following pattern in vect_recog_widen_mult_pattern(): T1 a, b; ai = (T2) a; bi = (T2) b; c = ai * bi; where T2 is more than double the size of T1 (e.g. T1 is char and T2 is int). In this case I just create a new type T3 whose size is double that of T1, then get an intermediate result of type T3 from widen-mult. Then I add a new statement to STMT_VINFO_PATTERN_DEF_SEQ converting the result into type T2. This strategy makes the patch cleaner. Bootstrapped and tested on an x86-64 machine. thanks, Cong diff --git a/gcc/ChangeLog b/gcc/ChangeLog index f298c0b..12990b2 100644 --- a/gcc/ChangeLog +++ b/gcc/ChangeLog @@ -1,3 +1,10 @@ +2013-12-02 Cong Hou + + * tree-vect-patterns.c (vect_recog_widen_mult_pattern): Enhance + the widen-mult pattern by handling two operands with different + sizes, and operands whose size is smaller than half of the result + type. + 2013-11-22 Jakub Jelinek PR sanitizer/59061 diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog index 12d2c90..611ae1c 100644 --- a/gcc/testsuite/ChangeLog +++ b/gcc/testsuite/ChangeLog @@ -1,3 +1,8 @@ +2013-12-02 Cong Hou + + * gcc.dg/vect/vect-widen-mult-u8-s16-s32.c: New test. + * gcc.dg/vect/vect-widen-mult-u8-u32.c: New test. + 2013-11-22 Jakub Jelinek * c-c++-common/asan/no-redundant-instrumentation-7.c: Fix diff --git a/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u8-s16-s32.c b/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u8-s16-s32.c new file mode 100644 index 000..9f9081b --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u8-s16-s32.c @@ -0,0 +1,48 @@ +/* { dg-require-effective-target vect_int } */ + +#include +#include "tree-vect.h" + +#define N 64 + +unsigned char X[N] __attribute__ ((__aligned__(__BIGGEST_ALIGNMENT__))); +short Y[N] __attribute__ ((__aligned__(__BIGGEST_ALIGNMENT__))); +int result[N]; + +/* unsigned char * short -> int widening-mult. */ +__attribute__ ((noinline)) int +foo1(int len) { + int i; + + for (i=0; i +#include "tree-vect.h" + +#define N 64 + +unsigned char X[N] __attribute__ ((__aligned__(__BIGGEST_ALIGNMENT__))); +unsigned char Y[N] __attribute__ ((__aligned__(__BIGGEST_ALIGNMENT__))); +unsigned int result[N]; + +/* unsigned char-> unsigned int widening-mult. */ +__attribute__ ((noinline)) int +foo1(int len) { + int i; + + for (i=0; i + If the result of WIDEN_MULT needs to be converted to a larger type, the + returned stmt will be this type conversion stmt. */ static gimple @@ -606,8 +610,8 @@ vect_recog_widen_mult_pattern (vec *stmts, gimple def_stmt0, def_stmt1; tree oprnd0, oprnd1; tree type, half_type0, half_type1; - gimple pattern_stmt; - tree vectype, vectype_out = NULL_TREE; + gimple new_stmt = NULL, pattern_stmt = NULL; + tree vectype, vecitype; tree var; enum tree_code dummy_code; int dummy_int; @@ -661,6 +665,33 @@ vect_recog_widen_mult_pattern (vec *stmts, return NULL; } + /* If the two arguments have different sizes, convert the one with + the smaller type into the larger type.
*/ + if (TYPE_PRECISION (half_type0) != TYPE_PRECISION (half_type1)) +{ + tree* oprnd = NULL; + gimple def_stmt = NULL; + + if (TYPE_PRECISION (half_type0) < TYPE_PRECISION (half_type1)) + { + def_stmt = def_stmt0; + half_type0 = half_type1; + oprnd = &oprnd0; + } + else + { + def_stmt = def_stmt1; + half_type1 = half_type0; + oprnd = &oprnd1; + } + +tree old_oprnd = gimple_assign_rhs1 (def_stmt); +tree new_oprnd = make_ssa_name (half_type0, NULL); +new_stmt = gimple_build_assign_with_ops (NOP_EXPR, new_oprnd, + old_oprnd, NULL_TREE); +*oprnd = new_oprnd; +} + /* Handle unsigned case. Look for S6 u_prod_T = (unsigned TYPE) prod_T; Use unsigned TYPE as the type for WIDEN_MULT_EXPR. */ @@ -692,6 +723,15 @@ vect_recog_widen_mult_pattern (vec *stmts, if (!types_compatible_p (half_type0, half_type1)) return NULL; + /* If TYPE is more than twice larger than HALF_TYPE, we use WIDEN_MULT + to get an intermediate result of type ITYPE. In this case we need + to build a statement to convert this intermediate result to type TYPE. */ + tree itype = type; + if (TYPE_PRECISION (type) > TYPE_PRECISION (half_type0) * 2) +itype = build_nonstandard_integer_type + (GET_MODE_BITSIZE (TYPE_MODE (half_type0)) * 2, + TYPE_UNSIGNED (type)); + /* Pattern detected. */ if (dump_enabled_p ()) dump_printf_loc (MSG_NOTE, vect_location, @@ -699,23 +739,56 @@ vect_recog_widen_mult_pattern (vec *stmts, /* Check target support */ vectype = get_vectype_for_scalar_type (half_type0); - vectype_out = ge
Re: [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.
Ping? thanks, Cong On Mon, Dec 2, 2013 at 5:06 PM, Cong Hou wrote: > Hi Richard > > Could you please take a look at this patch and see if it is ready for > the trunk? The patch is pasted as a text file here again. > > Thank you very much! > > > Cong > > > On Mon, Nov 11, 2013 at 11:25 AM, Cong Hou wrote: >> Hi James >> >> Sorry for the late reply. >> >> >> On Fri, Nov 8, 2013 at 2:55 AM, James Greenhalgh >> wrote: >>>> On Tue, Nov 5, 2013 at 9:58 AM, Cong Hou wrote: >>>> > Thank you for your detailed explanation. >>>> > >>>> > Once GCC detects a reduction operation, it will automatically >>>> > accumulate all elements in the vector after the loop. In the loop the >>>> > reduction variable is always a vector whose elements are reductions of >>>> > corresponding values from other vectors. Therefore in your case the >>>> > only instruction you need to generate is: >>>> > >>>> > VABAL ops[3], ops[1], ops[2] >>>> > >>>> > It is OK if you accumulate the elements into one in the vector inside >>>> > of the loop (if one instruction can do this), but you have to make >>>> > sure other elements in the vector should remain zero so that the final >>>> > result is correct. >>>> > >>>> > If you are confused about the documentation, check the one for >>>> > udot_prod (just above usad in md.texi), as it has very similar >>>> > behavior as usad. Actually I copied the text from there and did some >>>> > changes. As those two instruction patterns are both for vectorization, >>>> > their behavior should not be difficult to explain. >>>> > >>>> > If you have more questions or think that the documentation is still >>>> > improper please let me know. >>> >>> Hi Cong, >>> >>> Thanks for your reply. >>> >>> I've looked at Dorit's original patch adding WIDEN_SUM_EXPR and >>> DOT_PROD_EXPR and I see that the same ambiguity exists for >>> DOT_PROD_EXPR. Can you please add a note in your tree.def >>> that SAD_EXPR, like DOT_PROD_EXPR can be expanded as either: >>> >>> tmp = WIDEN_MINUS_EXPR (arg1, arg2) >>> tmp2 = ABS_EXPR (tmp) >>> arg3 = PLUS_EXPR (tmp2, arg3) >>> >>> or: >>> >>> tmp = WIDEN_MINUS_EXPR (arg1, arg2) >>> tmp2 = ABS_EXPR (tmp) >>> arg3 = WIDEN_SUM_EXPR (tmp2, arg3) >>> >>> Where WIDEN_MINUS_EXPR is a signed MINUS_EXPR, returning a >>> a value of the same (widened) type as arg3. >>> >> >> >> I have added it, although we currently don't have WIDEN_MINUS_EXPR (I >> mentioned it in tree.def). >> >> >>> Also, while looking for the history of DOT_PROD_EXPR I spotted this >>> patch: >>> >>> [autovect] [patch] detect mult-hi and sad patterns >>> http://gcc.gnu.org/ml/gcc-patches/2005-10/msg01394.html >>> >>> I wonder what the reason was for that patch to be dropped? >>> >> >> It has been 8 years.. I have no idea why this patch is not accepted >> finally. There is even no reply in that thread. But I believe the SAD >> pattern is very important to be recognized. ARM also provides >> instructions for it. >> >> >> Thank you for your comment again! >> >> >> thanks, >> Cong >> >> >> >>> Thanks, >>> James >>>
Re: [PATCH] Support addsub/subadd as non-isomorphic operations for SLP vectorizer.
Ping? thanks, Cong On Mon, Dec 2, 2013 at 5:02 PM, Cong Hou wrote: > Any comment on this patch? > > > thanks, > Cong > > > On Fri, Nov 22, 2013 at 11:40 AM, Cong Hou wrote: >> On Fri, Nov 22, 2013 at 3:57 AM, Marc Glisse wrote: >>> On Thu, 21 Nov 2013, Cong Hou wrote: >>> >>>> On Thu, Nov 21, 2013 at 4:39 PM, Marc Glisse wrote: >>>>> >>>>> On Thu, 21 Nov 2013, Cong Hou wrote: >>>>> >>>>>> While I added the new define_insn_and_split for vec_merge, a bug is >>>>>> exposed: in config/i386/sse.md, [ define_expand "xop_vmfrcz2" ] >>>>>> only takes one input, but the corresponding builtin functions have two >>>>>> inputs, which are shown in i386.c: >>>>>> >>>>>> { OPTION_MASK_ISA_XOP, CODE_FOR_xop_vmfrczv4sf2, >>>>>> "__builtin_ia32_vfrczss", IX86_BUILTIN_VFRCZSS, UNKNOWN, >>>>>> (int)MULTI_ARG_2_SF }, >>>>>> { OPTION_MASK_ISA_XOP, CODE_FOR_xop_vmfrczv2df2, >>>>>> "__builtin_ia32_vfrczsd", IX86_BUILTIN_VFRCZSD, UNKNOWN, >>>>>> (int)MULTI_ARG_2_DF }, >>>>>> >>>>>> In consequence, the ix86_expand_multi_arg_builtin() function tries to >>>>>> check two args but based on the define_expand of xop_vmfrcz2, >>>>>> the content of insn_data[CODE_FOR_xop_vmfrczv4sf2].operand[2] may be >>>>>> incorrect (because it only needs one input). >>>>>> >>>>>> The patch below fixed this issue. >>>>>> >>>>>> Bootstrapped and tested on ax x86-64 machine. Note that this patch >>>>>> should be applied before the one I sent earlier (sorry for sending >>>>>> them in wrong order). >>>>> >>>>> >>>>> >>>>> This is PR 56788. Your patch seems strange to me and I don't think it >>>>> fixes the real issue, but I'll let more knowledgeable people answer. >>>> >>>> >>>> >>>> Thank you for pointing out the bug report. This patch is not intended >>>> to fix PR56788. >>> >>> >>> IMHO, if PR56788 was fixed, you wouldn't have this issue, and if PR56788 >>> doesn't get fixed, I'll post a patch to remove _mm_frcz_sd and the >>> associated builtin, which would solve your issue as well. >> >> >> I agree. Then I will wait until your patch is merged to the trunk, >> otherwise my patch could not pass the test. >> >> >>> >>> >>>> For your function: >>>> >>>> #include >>>> __m128d f(__m128d x, __m128d y){ >>>> return _mm_frcz_sd(x,y); >>>> } >>>> >>>> Note that the second parameter is ignored intentionally, but the >>>> prototype of this function contains two parameters. My fix is >>>> explicitly telling GCC that the optab xop_vmfrczv4sf3 should have >>>> three operands instead of two, to let it have the correct information >>>> in insn_data[CODE_FOR_xop_vmfrczv4sf3].operand[2] which is used to >>>> match the type of the second parameter in the builtin function in >>>> ix86_expand_multi_arg_builtin(). >>> >>> >>> I disagree that this is intentional, it is a bug. AFAIK there is no AMD >>> documentation that could be used as a reference for what _mm_frcz_sd is >>> supposed to do. The only existing documentations are by Microsoft (which >>> does *not* ignore the second argument) and by LLVM (which has a single >>> argument). Whatever we chose for _mm_frcz_sd, the builtin should take a >>> single argument, and if necessary we'll use 2 builtins to implement >>> _mm_frcz_sd. >>> >> >> >> I also only found the one by Microsoft.. If the second argument is >> ignored, we could just remove it, as long as there is no "standard" >> that requires two arguments. Hopefully it won't break current projects >> using _mm_frcz_sd. >> >> Thank you for your comments! >> >> >> Cong >> >> >>> -- >>> Marc Glisse
Re: [PATCH] Fixing PR59006 and PR58921 by delaying loop invariant hoisting in vectorizer.
I noticed that LIM could not hoist vector invariants, and that is why my first implementation tried to hoist them all. In addition, there are two disadvantages to the hoist-invariant-load + LIM method: First, for some instructions the scalar version is faster than the vector version, and in this case hoisting scalar instructions before vectorization is better. Those instructions include data packing/unpacking, integer multiplication with SSE2, etc. Second, it may use more SIMD registers. The following code shows a simple example: char *a, *b, *c; for (int i = 0; i < N; ++i) a[i] = b[0] * c[0] + a[i]; Vectorizing b[0]*c[0] is worse than loading the result of b[0]*c[0] into a vector. thanks, Cong On Mon, Jan 13, 2014 at 5:37 AM, Richard Biener wrote: > On Wed, 27 Nov 2013, Jakub Jelinek wrote: > >> On Wed, Nov 27, 2013 at 10:53:56AM +0100, Richard Biener wrote: >> > Hmm. I'm still thinking that we should handle this during the regular >> > transform step. >> >> I wonder if it can't be done instead just in vectorizable_load, >> if LOOP_REQUIRES_VERSIONING_FOR_ALIAS (loop_vinfo) and the load is >> invariant, just emit the (broadcasted) load not inside of the loop, but on >> the loop preheader edge. > > So this implements this suggestion, XFAILing the no longer handled cases. > For example we get > > _94 = *b_8(D); > vect_cst_.18_95 = {_94, _94, _94, _94}; > _99 = prolog_loop_adjusted_niters.9_132 * 4; > vectp_a.22_98 = a_6(D) + _99; > ivtmp.43_77 = (unsigned long) vectp_a.22_98; > > : > # ivtmp.41_67 = PHI > # ivtmp.43_71 = PHI > vect__10.19_97 = vect_cst_.18_95 + { 1, 1, 1, 1 }; > _76 = (void *) ivtmp.43_71; > MEM[base: _76, offset: 0B] = vect__10.19_97; > > ... > > instead of having hoisted *b_8 + 1 as scalar computation. Not sure > why LIM doesn't hoist the vector variant later. > > vect__10.19_97 = vect_cst_.18_95 + vect_cst_.20_96; > invariant up to level 1, cost 1. > > ah, the cost thing. Should be "improved" to see that hoisting > reduces the number of live SSA names in the loop. > > Eventually lower_vector_ssa could optimize vector to scalar > code again ... (ick). > > Bootstrap / regtest running on x86_64. > > Comments? > > Thanks, > Richard. > > 2014-01-13 Richard Biener > > PR tree-optimization/58921 > PR tree-optimization/59006 > * tree-vect-loop-manip.c (vect_loop_versioning): Remove code > hoisting invariant stmts. > * tree-vect-stmts.c (vectorizable_load): Insert the splat of > invariant loads on the preheader edge if possible. > > * gcc.dg/torture/pr58921.c: New testcase. > * gcc.dg/torture/pr59006.c: Likewise. > * gcc.dg/vect/pr58508.c: XFAIL no longer handled cases. > > Index: gcc/tree-vect-loop-manip.c > === > *** gcc/tree-vect-loop-manip.c (revision 206576) > --- gcc/tree-vect-loop-manip.c (working copy) > *** vect_loop_versioning (loop_vec_info loop > *** 2435,2507 > } > } > > - > - /* Extract load statements on memrefs with zero-stride accesses. */ > - > - if (LOOP_REQUIRES_VERSIONING_FOR_ALIAS (loop_vinfo)) > - { > - /* In the loop body, we iterate each statement to check if it is a > load. > -Then we check the DR_STEP of the data reference. If DR_STEP is zero, > -then we will hoist the load statement to the loop preheader.
*/ > - > - basic_block *bbs = LOOP_VINFO_BBS (loop_vinfo); > - int nbbs = loop->num_nodes; > - > - for (int i = 0; i < nbbs; ++i) > - { > - for (gimple_stmt_iterator si = gsi_start_bb (bbs[i]); > - !gsi_end_p (si);) > - { > - gimple stmt = gsi_stmt (si); > - stmt_vec_info stmt_info = vinfo_for_stmt (stmt); > - struct data_reference *dr = STMT_VINFO_DATA_REF (stmt_info); > - > - if (is_gimple_assign (stmt) > - && (!dr > - || (DR_IS_READ (dr) && integer_zerop (DR_STEP (dr) > - { > - bool hoist = true; > - ssa_op_iter iter; > - tree var; > - > - /* We hoist a statement if all SSA uses in it are defined > -outside of the loop. */ > - FOR_EACH_SSA_TREE_OPERAND (var, stmt, iter, SSA_OP_USE) > - { > - gimple def = SSA_NAME_DEF_STMT (var); > - if (!gimple_nop_p (def) > - && flow_bb_inside_loop_p (loop, gimple_bb (def))) > - { > - hoist = false; > - break; > - } > - } > - > - if (hoist) > - { > - if (dr) > - gimple_set_vuse (stmt, NULL); > - > - gsi_remove (&si, false); > - gsi_i
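A sketch of the preferred shape for the example above (illustrative only): computing the invariant product once in scalar code avoids the inefficient V16QI multiply on SSE2 and leaves only one splat vector live across the loop:

#define N 1024
char a[N], b[N], c[N];

void
f (void)
{
  char t = b[0] * c[0];      /* one scalar multiply, hoisted          */
  int i;
  for (i = 0; i < N; ++i)    /* vectorizer only needs to splat t once */
    a[i] = t + a[i];
}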
Re: [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.
Thank you for your detailed explanation. Once GCC detects a reduction operation, it will automatically accumulate all elements in the vector after the loop. In the loop the reduction variable is always a vector whose elements are reductions of corresponding values from other vectors. Therefore in your case the only instruction you need to generate is: VABAL ops[3], ops[1], ops[2] It is OK if you accumulate the elements into one in the vector inside of the loop (if one instruction can do this), but you have to make sure the other elements in the vector remain zero so that the final result is correct. If you are confused about the documentation, check the one for udot_prod (just above usad in md.texi), as it has very similar behavior to usad. Actually I copied the text from there and made some changes. As those two instruction patterns are both for vectorization, their behavior should not be difficult to explain. If you have more questions or think that the documentation is still improper, please let me know. Thank you very much! Cong On Tue, Nov 5, 2013 at 1:53 AM, James Greenhalgh wrote: > On Mon, Nov 04, 2013 at 06:30:55PM +0000, Cong Hou wrote: >> On Mon, Nov 4, 2013 at 2:06 AM, James Greenhalgh >> wrote: >> > On Fri, Nov 01, 2013 at 04:48:53PM +0000, Cong Hou wrote: >> >> diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi >> >> index 2a5a2e1..8f5d39a 100644 >> >> --- a/gcc/doc/md.texi >> >> +++ b/gcc/doc/md.texi >> >> @@ -4705,6 +4705,16 @@ wider mode, is computed and added to operand 3. >> >> Operand 3 is of a mode equal or >> >> wider than the mode of the product. The result is placed in operand 0, >> >> which >> >> is of the same mode as operand 3. >> >> >> >> +@cindex @code{ssad@var{m}} instruction pattern >> >> +@item @samp{ssad@var{m}} >> >> +@cindex @code{usad@var{m}} instruction pattern >> >> +@item @samp{usad@var{m}} >> >> +Compute the sum of absolute differences of two signed/unsigned elements. >> >> +Operand 1 and operand 2 are of the same mode. Their absolute difference, >> >> which >> >> +is of a wider mode, is computed and added to operand 3. Operand 3 is of >> >> a mode >> >> +equal or wider than the mode of the absolute difference. The result is >> >> placed >> >> +in operand 0, which is of the same mode as operand 3. >> >> + >> >> @cindex @code{ssum_widen@var{m3}} instruction pattern >> >> @item @samp{ssum_widen@var{m3}} >> >> @cindex @code{usum_widen@var{m3}} instruction pattern >> >> diff --git a/gcc/expr.c b/gcc/expr.c >> >> index 4975a64..1db8a49 100644 >> > >> > I'm not sure I follow, and if I do - I don't think it matches what >> > you have implemented for i386. >> > >> > From your text description I would guess the series of operations to be: >> > >> > v1 = widen (operands[1]) >> > v2 = widen (operands[2]) >> > v3 = abs (v1 - v2) >> > operands[0] = v3 + operands[3] >> > >> > But if I understand the behaviour of PSADBW correctly, what you have >> > actually implemented is: >> > >> > v1 = widen (operands[1]) >> > v2 = widen (operands[2]) >> > v3 = abs (v1 - v2) >> > v4 = reduce_plus (v3) >> > operands[0] = v4 + operands[3] >> > >> > To my mind, synthesizing the reduce_plus step will be wasteful for targets >> > who do not get this for free with their Absolute Difference step. Imagine a >> > simple loop where we have synthesized the reduce_plus, we compute partial >> > sums each loop iteration, though we would be better to leave the >> > reduce_plus >> > step until after the loop. "REDUC_PLUS_EXPR" would be the appropriate >> > Tree code for this.
>> >> What do you mean when you use "synthesizing" here? For each pattern, >> the only synthesized operation is the one being returned from the >> pattern recognizer. In this case, it is USAD_EXPR. The recognition of >> reduce sum is necessary as we need corresponding prolog and epilog for >> reductions, which is already done before pattern recognition. Note >> that reduction is not a pattern but is a type of vector definition. A >> vectorization pattern can still be a reduction operation as long as >> STMT_VINFO_RELATED_STMT of this pattern is a reduction operation. You >> can check the other two reduction patterns: widen_sum_pattern and >> dot_prod_pattern for reference. > > My apologies for not
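For context, the scalar kernel that the ssad/usad patterns are meant to recognize has this shape (a representative example, not code taken from the patch):

  int sad (const unsigned char *a, const unsigned char *b, int n)
  {
    int sum = 0;
    for (int i = 0; i < n; ++i)
      sum += __builtin_abs (a[i] - b[i]);  /* widen, subtract, abs, accumulate */
    return sum;
  }

On x86 the loop body maps roughly to PSADBW plus a vector add; the reduction epilogue discussed above then collapses the vector accumulator into the final scalar sum.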
Re: [PATCH] Handling == or != comparisons that may affect range test optimization.
It seems there have been some changes in GCC since then. But if you change the type of n to signed int, the issue appears again: int foo(int n) { if (n != 0) if (n != 1) if (n != 2) if (n != 3) if (n != 4) return ++n; return n; } ifcombine also suffers from the same issue here. thanks, Cong On Tue, Nov 5, 2013 at 12:53 PM, Jakub Jelinek wrote: > On Tue, Nov 05, 2013 at 01:23:00PM -0700, Jeff Law wrote: >> On 10/31/13 18:03, Cong Hou wrote: >> >(This patch is for the bug 58728: >> >http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58728) >> > >> >As in the bug report, consider the following loop: >> > >> >int foo(unsigned int n) >> >{ >> > if (n != 0) >> > if (n != 1) >> > if (n != 2) >> > if (n != 3) >> > if (n != 4) >> > return ++n; >> > return n; >> >} >> > >> >The range test optimization should be able to merge all those five >> >conditions into one in reassoc pass, but I fails to do so. The reason >> >is that the phi arg of n is replaced by the constant it compares to in >> >case of == or != comparisons (in vrp pass). GCC checks there is no >> >side effect on n between any two neighboring conditions by examining >> >if they defined the same phi arg in the join node. But as the phi arg >> >is replace by a constant, the check fails. > > I can't reproduce this, at least not on x86_64-linux with -O2, > the ifcombine pass already merges those. > > Jakub
Re: [PATCH] Handling == or != comparisons that may affect range test optimization.
On Tue, Nov 5, 2013 at 12:23 PM, Jeff Law wrote: > On 10/31/13 18:03, Cong Hou wrote: >> >> (This patch is for the bug 58728: >> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58728) >> >> As in the bug report, consider the following loop: >> >> int foo(unsigned int n) >> { >>if (n != 0) >>if (n != 1) >>if (n != 2) >>if (n != 3) >>if (n != 4) >> return ++n; >>return n; >> } >> >> The range test optimization should be able to merge all those five >> conditions into one in reassoc pass, but I fails to do so. The reason >> is that the phi arg of n is replaced by the constant it compares to in >> case of == or != comparisons (in vrp pass). GCC checks there is no >> side effect on n between any two neighboring conditions by examining >> if they defined the same phi arg in the join node. But as the phi arg >> is replace by a constant, the check fails. >> >> This patch deals with this situation by considering the existence of >> == or != comparisons, which is attached below (a text file is also >> attached with proper tabs). Bootstrap and make check both get passed. >> >> Any comment? > > > + bool is_eq_expr = is_cond && (gimple_cond_code (stmt) == NE_EXPR > + || gimple_cond_code (stmt) == > EQ_EXPR) > + && TREE_CODE (phi_arg) == INTEGER_CST; > + > + if (is_eq_expr) > + { > + lhs = gimple_cond_lhs (stmt); > + rhs = gimple_cond_rhs (stmt); > + > + if (operand_equal_p (lhs, phi_arg, 0)) > + { > + tree t = lhs; > + lhs = rhs; > + rhs = t; > + } > + if (operand_equal_p (rhs, phi_arg, 0) > + && operand_equal_p (lhs, phi_arg2, 0)) > + continue; > + } > + > + gimple stmt2 = last_stmt (test_bb); > + bool is_eq_expr2 = gimple_code (stmt2) == GIMPLE_COND > +&& (gimple_cond_code (stmt2) == NE_EXPR > +|| gimple_cond_code (stmt2) == EQ_EXPR) > +&& TREE_CODE (phi_arg2) == INTEGER_CST; > + > + if (is_eq_expr2) > + { > + lhs2 = gimple_cond_lhs (stmt2); > + rhs2 = gimple_cond_rhs (stmt2); > + > + if (operand_equal_p (lhs2, phi_arg2, 0)) > + { > + tree t = lhs2; > + lhs2 = rhs2; > + rhs2 = t; > + } > + if (operand_equal_p (rhs2, phi_arg2, 0) > + && operand_equal_p (lhs2, phi_arg, 0)) > + continue; > + } > > Can you factor those two hunks of nearly identical code into a single > function and call it twice? I'm also curious if you really need the code to > swap lhs/rhs. When can the LHS of a cond be an integer constant? Don't we > canonicalize it as ? I was not aware that a comparison between a variable and a constant will always be canonicalized with the constant on the RHS. Then I will remove the swap, and since the code becomes much smaller, it may not be necessary to factor it into a function. > > I'd probably write the ChangeLog as: > > * tree-ssa-reassoc.c (suitable_cond_bb): Handle constant PHI > operands resulting from propagation of edge equivalences. > > OK, much better than mine ;) > I'm also curious -- did this code show up in a particular benchmark, if so, > which one? I didn't find this problem in any benchmark; it came from another concern, about loop upper bound estimation. Look at the following code: int foo(unsigned int n, int r) { int i; if (n > 0) if (n < 4) { do { --n; r *= 2; } while (n > 0); } return r+n; } In order to get the upper bound of the loop in this function, GCC traverses conditions n<4 and n>0 separately and tries to get any useful information. But as those two conditions cannot be combined into one due to this issue (note that n>0 will be transformed into n!=0), when GCC sees n<4, it will consider the possibility that n may be equal to 0, in which case the upper bound is UINT_MAX.
If those two conditions can be combined into one, namely the unsigned test n-1<=2, then we can get the correct upper bound of the loop. thanks, Cong > > jeff
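For reference, both merges discussed in this thread come down to the standard unsigned range-test trick (a sketch; the constants follow directly from the ranges above):

  /* n != 0 && n != 1 && n != 2 && n != 3 && n != 4, for unsigned n,
     merges into one comparison:  */
  if (n > 4) return n + 1;
  return n;

  /* n > 0 && n < 4, for unsigned n, merges into the following, where
     wrap-around makes n == 0 fail the test, exposing the bound of at
     most 3 iterations:  */
  if (n - 1 <= 2) { /* loop body */ }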
Re: [PATCH] Small fix: add { dg-require-effective-target vect_int } to testsuite/gcc.dg/vect/pr58508.c
Ping. OK for the trunk? thanks, Cong On Fri, Nov 1, 2013 at 10:47 AM, Cong Hou wrote: > It seems that on some platforms the loops in > testsuite/gcc.dg/vect/pr58508.c may be unable to be vectorized. This > small patch added { dg-require-effective-target vect_int } to make > sure all loops can be vectorized. > > > thanks, > Cong > > > diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog > index 9d0f4a5..3d9916d 100644 > --- a/gcc/testsuite/ChangeLog > +++ b/gcc/testsuite/ChangeLog > @@ -1,3 +1,7 @@ > +2013-10-29 Cong Hou > + > + * gcc.dg/vect/pr58508.c: Update. > + > 2013-10-15 Cong Hou > > * gcc.dg/vect/pr58508.c: New test. > diff --git a/gcc/testsuite/gcc.dg/vect/pr58508.c > b/gcc/testsuite/gcc.dg/vect/pr58508.c > index 6484a65..fff7a04 100644 > --- a/gcc/testsuite/gcc.dg/vect/pr58508.c > +++ b/gcc/testsuite/gcc.dg/vect/pr58508.c > @@ -1,3 +1,4 @@ > +/* { dg-require-effective-target vect_int } */ > /* { dg-do compile } */ > /* { dg-options "-O2 -ftree-vectorize -fdump-tree-vect-details" } */
Re: [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.
Now is this patch OK for the trunk? Thank you! thanks, Cong On Tue, Nov 5, 2013 at 9:58 AM, Cong Hou wrote: > Thank you for your detailed explanation. > > Once GCC detects a reduction operation, it will automatically > accumulate all elements in the vector after the loop. In the loop the > reduction variable is always a vector whose elements are reductions of > corresponding values from other vectors. Therefore in your case the > only instruction you need to generate is: > > VABAL ops[3], ops[1], ops[2] > > It is OK if you accumulate the elements into one in the vector inside > of the loop (if one instruction can do this), but you have to make > sure other elements in the vector should remain zero so that the final > result is correct. > > If you are confused about the documentation, check the one for > udot_prod (just above usad in md.texi), as it has very similar > behavior as usad. Actually I copied the text from there and did some > changes. As those two instruction patterns are both for vectorization, > their behavior should not be difficult to explain. > > If you have more questions or think that the documentation is still > improper please let me know. > > Thank you very much! > > > Cong > > > On Tue, Nov 5, 2013 at 1:53 AM, James Greenhalgh > wrote: >> On Mon, Nov 04, 2013 at 06:30:55PM +, Cong Hou wrote: >>> On Mon, Nov 4, 2013 at 2:06 AM, James Greenhalgh >>> wrote: >>> > On Fri, Nov 01, 2013 at 04:48:53PM +, Cong Hou wrote: >>> >> diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi >>> >> index 2a5a2e1..8f5d39a 100644 >>> >> --- a/gcc/doc/md.texi >>> >> +++ b/gcc/doc/md.texi >>> >> @@ -4705,6 +4705,16 @@ wider mode, is computed and added to operand 3. >>> >> Operand 3 is of a mode equal or >>> >> wider than the mode of the product. The result is placed in operand 0, >>> >> which >>> >> is of the same mode as operand 3. >>> >> >>> >> +@cindex @code{ssad@var{m}} instruction pattern >>> >> +@item @samp{ssad@var{m}} >>> >> +@cindex @code{usad@var{m}} instruction pattern >>> >> +@item @samp{usad@var{m}} >>> >> +Compute the sum of absolute differences of two signed/unsigned elements. >>> >> +Operand 1 and operand 2 are of the same mode. Their absolute >>> >> difference, which >>> >> +is of a wider mode, is computed and added to operand 3. Operand 3 is of >>> >> a mode >>> >> +equal or wider than the mode of the absolute difference. The result is >>> >> placed >>> >> +in operand 0, which is of the same mode as operand 3. >>> >> + >>> >> @cindex @code{ssum_widen@var{m3}} instruction pattern >>> >> @item @samp{ssum_widen@var{m3}} >>> >> @cindex @code{usum_widen@var{m3}} instruction pattern >>> >> diff --git a/gcc/expr.c b/gcc/expr.c >>> >> index 4975a64..1db8a49 100644 >>> > >>> > I'm not sure I follow, and if I do - I don't think it matches what >>> > you have implemented for i386. >>> > >>> > From your text description I would guess the series of operations to be: >>> > >>> > v1 = widen (operands[1]) >>> > v2 = widen (operands[2]) >>> > v3 = abs (v1 - v2) >>> > operands[0] = v3 + operands[3] >>> > >>> > But if I understand the behaviour of PSADBW correctly, what you have >>> > actually implemented is: >>> > >>> > v1 = widen (operands[1]) >>> > v2 = widen (operands[2]) >>> > v3 = abs (v1 - v2) >>> > v4 = reduce_plus (v3) >>> > operands[0] = v4 + operands[3] >>> > >>> > To my mind, synthesizing the reduce_plus step will be wasteful for targets >>> > who do not get this for free with their Absolute Difference step. 
Imagine >>> > a >>> > simple loop where we have synthesized the reduce_plus, we compute partial >>> > sums each loop iteration, though we would be better to leave the >>> > reduce_plus >>> > step until after the loop. "REDUC_PLUS_EXPR" would be the appropriate >>> > Tree code for this. >>> >>> What do you mean when you use "synthesizing" here? For each pattern, >>> the only synthesized operation is the one being returned from the >>> pattern recognizer. In this case, it is USAD_EXPR. The recognition of &
[PATCH] Bug fix for PR59050
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59050 This is my bad. I forgot to check the test results for gfortran. With this patch the bug should be fixed (tested on x86-64). thanks, Cong diff --git a/gcc/ChangeLog b/gcc/ChangeLog index 90b01f2..e62c672 100644 --- a/gcc/ChangeLog +++ b/gcc/ChangeLog @@ -1,3 +1,8 @@ +2013-11-08 Cong Hou + + PR tree-optimization/59050 + * tree-vect-data-refs.c (comp_dr_addr_with_seg_len_pair): Bug fix. + 2013-11-07 Cong Hou * tree-vect-loop-manip.c (vect_create_cond_for_alias_checks): diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c index b2a31b1..b7eb926 100644 --- a/gcc/tree-vect-data-refs.c +++ b/gcc/tree-vect-data-refs.c @@ -2669,9 +2669,9 @@ comp_dr_addr_with_seg_len_pair (const void *p1_, const void *p2_) if (comp_res != 0) return comp_res; } - if (tree_int_cst_compare (p11.offset, p21.offset) < 0) + else if (tree_int_cst_compare (p11.offset, p21.offset) < 0) return -1; - if (tree_int_cst_compare (p11.offset, p21.offset) > 0) + else if (tree_int_cst_compare (p11.offset, p21.offset) > 0) return 1; if (TREE_CODE (p12.offset) != INTEGER_CST || TREE_CODE (p22.offset) != INTEGER_CST) @@ -2680,9 +2680,9 @@ comp_dr_addr_with_seg_len_pair (const void *p1_, const void *p2_) if (comp_res != 0) return comp_res; } - if (tree_int_cst_compare (p12.offset, p22.offset) < 0) + else if (tree_int_cst_compare (p12.offset, p22.offset) < 0) return -1; - if (tree_int_cst_compare (p12.offset, p22.offset) > 0) + else if (tree_int_cst_compare (p12.offset, p22.offset) > 0) return 1; return 0;
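To spell out the shape of the bug: without the else, the tree_int_cst_compare calls were still reached when an offset was not an INTEGER_CST (for example a PLUS_EXPR, as in the gfortran ICE), even though the compare_tree branch had already handled that case. A minimal sketch of the fixed control flow (abbreviated from the comparator above):

  /* Offsets are INTEGER_CSTs in the common case; anything else must go
     through the generic compare_tree and must NOT fall through to the
     INTEGER_CST-only comparisons below.  */
  if (TREE_CODE (p11.offset) != INTEGER_CST
      || TREE_CODE (p21.offset) != INTEGER_CST)
    {
      int comp_res = compare_tree (p11.offset, p21.offset);
      if (comp_res != 0)
        return comp_res;
    }
  else if (tree_int_cst_compare (p11.offset, p21.offset) < 0)
    return -1;
  else if (tree_int_cst_compare (p11.offset, p21.offset) > 0)
    return 1;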
Re: [PATCH] Reducing number of alias checks in vectorization.
Thank you for the report. I have submitted a bug-fix patch, which is waiting to be reviewed. thanks, Cong On Fri, Nov 8, 2013 at 5:26 AM, Dominique Dhumieres wrote: > According to http://gcc.gnu.org/ml/gcc-regression/2013-11/msg00197.html > revision 204538 is breaking several tests. On x86_64-apple-darwin* the > failures I have looked at are of the kind > > /opt/gcc/work/gcc/testsuite/gfortran.dg/typebound_operator_9.f03: In function > 'nabla2_cart2d': > /opt/gcc/work/gcc/testsuite/gfortran.dg/typebound_operator_9.f03:272:0: > internal compiler error: tree check: expected integer_cst, have plus_expr in > tree_int_cst_lt, at tree.c:7083 >function nabla2_cart2d (obj) > > TIA > > Dominique
Re: [PATCH] Bug fix for PR59050
Yes, I think so. The bug is that the arguments of tree_int_cst_compare() may not be constant integers. This patch should take care of it. thanks, Cong On Fri, Nov 8, 2013 at 12:06 PM, H.J. Lu wrote: > On Fri, Nov 8, 2013 at 10:34 AM, Cong Hou wrote: >> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59050 >> >> This is my bad. I forget to check the test result for gfortran. With >> this patch the bug should be fixed (tested on x86-64). >> >> >> thanks, >> Cong >> >> >> diff --git a/gcc/ChangeLog b/gcc/ChangeLog >> index 90b01f2..e62c672 100644 >> --- a/gcc/ChangeLog >> +++ b/gcc/ChangeLog >> @@ -1,3 +1,8 @@ >> +2013-11-08 Cong Hou >> + >> + PR tree-optimization/59050 >> + * tree-vect-data-refs.c (comp_dr_addr_with_seg_len_pair): Bug fix. >> + > > Many SPEC CPU 2000 tests failed with > > costab.c: In function 'HandleCoinc2': > costab.c:1565:17: internal compiler error: tree check: expected > integer_cst, have plus_expr in tree_int_cst_lt, at tree.c:7083 > voidHandleCoinc2 ( cos1, cos2, hdfactor ) > ^ > 0xb6e084 tree_check_failed(tree_node const*, char const*, int, char const*, > ...) > ../../src-trunk/gcc/tree.c:9477 > 0xb6ffe4 tree_check > ../../src-trunk/gcc/tree.h:2914 > 0xb6ffe4 tree_int_cst_lt(tree_node const*, tree_node const*) > ../../src-trunk/gcc/tree.c:7083 > 0xb70020 tree_int_cst_compare(tree_node const*, tree_node const*) > ../../src-trunk/gcc/tree.c:7093 > 0xe53f1c comp_dr_addr_with_seg_len_pair > ../../src-trunk/gcc/tree-vect-data-refs.c:2672 > 0xe5cbb5 vec vl_embed>::qsort(int (*)(void const*, void const*)) > ../../src-trunk/gcc/vec.h:941 > 0xe5cbb5 vec::qsort(int > (*)(void const*, void const*)) > ../../src-trunk/gcc/vec.h:1620 > 0xe5cbb5 vect_prune_runtime_alias_test_list(_loop_vec_info*) > ../../src-trunk/gcc/tree-vect-data-refs.c:2845 > 0xb39382 vect_analyze_loop_2 > ../../src-trunk/gcc/tree-vect-loop.c:1716 > 0xb39382 vect_analyze_loop(loop*) > ../../src-trunk/gcc/tree-vect-loop.c:1807 > 0xb4f78f vectorize_loops() > ../../src-trunk/gcc/tree-vectorizer.c:360 > Please submit a full bug report, > with preprocessed source if appropriate. > Please include the complete backtrace with any bug report. > See <http://gcc.gnu.org/bugs.html> for instructions. > specmake[3]: *** [costab.o] Error 1 > specmake[3]: *** Waiting for unfinished jobs > > Will this patch fix them? > > > -- > H.J.
Re: [PATCH] Bug fix for PR59050
Hi Jeff I have committed the fix. Please update your repo. Thank you! Cong On Mon, Nov 11, 2013 at 10:32 AM, Jeff Law wrote: > On 11/11/13 02:32, Richard Biener wrote: >> >> On Fri, 8 Nov 2013, Cong Hou wrote: >> >>> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59050 >>> >>> This is my bad. I forget to check the test result for gfortran. With >>> this patch the bug should be fixed (tested on x86-64). >> >> >> Ok. >> >> Btw, requirements are to bootstrap and test with all default >> languages enabled (that is, without any --enable-languages or >> --enable-languages=all). That >> gets you c,c++,objc,java,fortran,lto and misses obj-c++ ada and go. >> I am personally using --enable-languages=all,ada,obj-c++. > > FWIW, I bootstrapped with Cong's patch to keep my own test results clean. > So it's already been through those tests. > > If Cong doesn't get to it soon, I'll check it in myself. > > jeff >
Re: [PATCH] Bug fix for PR59050
Thank you for your advice! I will follow this instruction in future. thanks, Cong On Mon, Nov 11, 2013 at 1:32 AM, Richard Biener wrote: > On Fri, 8 Nov 2013, Cong Hou wrote: > >> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59050 >> >> This is my bad. I forget to check the test result for gfortran. With >> this patch the bug should be fixed (tested on x86-64). > > Ok. > > Btw, requirements are to bootstrap and test with all default > languages enabled (that is, without any --enable-languages or > --enable-languages=all). That > gets you c,c++,objc,java,fortran,lto and misses obj-c++ ada and go. > I am personally using --enable-languages=all,ada,obj-c++. > > Thanks, > Richard. > >> thanks, >> Cong >> >> >> diff --git a/gcc/ChangeLog b/gcc/ChangeLog >> index 90b01f2..e62c672 100644 >> --- a/gcc/ChangeLog >> +++ b/gcc/ChangeLog >> @@ -1,3 +1,8 @@ >> +2013-11-08 Cong Hou >> + >> + PR tree-optimization/59050 >> + * tree-vect-data-refs.c (comp_dr_addr_with_seg_len_pair): Bug fix. >> + >> 2013-11-07 Cong Hou >> >> * tree-vect-loop-manip.c (vect_create_cond_for_alias_checks): >> diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c >> index b2a31b1..b7eb926 100644 >> --- a/gcc/tree-vect-data-refs.c >> +++ b/gcc/tree-vect-data-refs.c >> @@ -2669,9 +2669,9 @@ comp_dr_addr_with_seg_len_pair (const void *p1_, >> const void *p2_) >>if (comp_res != 0) >> return comp_res; >> } >> - if (tree_int_cst_compare (p11.offset, p21.offset) < 0) >> + else if (tree_int_cst_compare (p11.offset, p21.offset) < 0) >> return -1; >> - if (tree_int_cst_compare (p11.offset, p21.offset) > 0) >> + else if (tree_int_cst_compare (p11.offset, p21.offset) > 0) >> return 1; >>if (TREE_CODE (p12.offset) != INTEGER_CST >>|| TREE_CODE (p22.offset) != INTEGER_CST) >> @@ -2680,9 +2680,9 @@ comp_dr_addr_with_seg_len_pair (const void *p1_, >> const void *p2_) >>if (comp_res != 0) >> return comp_res; >> } >> - if (tree_int_cst_compare (p12.offset, p22.offset) < 0) >> + else if (tree_int_cst_compare (p12.offset, p22.offset) < 0) >> return -1; >> - if (tree_int_cst_compare (p12.offset, p22.offset) > 0) >> + else if (tree_int_cst_compare (p12.offset, p22.offset) > 0) >> return 1; >> >>return 0; >> >> > > -- > Richard Biener > SUSE / SUSE Labs > SUSE LINUX Products GmbH - Nuernberg - AG Nuernberg - HRB 16746 > GF: Jeff Hawn, Jennifer Guild, Felix Imend
Re: [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.
Hi James Sorry for the late reply. On Fri, Nov 8, 2013 at 2:55 AM, James Greenhalgh wrote: >> On Tue, Nov 5, 2013 at 9:58 AM, Cong Hou wrote: >> > Thank you for your detailed explanation. >> > >> > Once GCC detects a reduction operation, it will automatically >> > accumulate all elements in the vector after the loop. In the loop the >> > reduction variable is always a vector whose elements are reductions of >> > corresponding values from other vectors. Therefore in your case the >> > only instruction you need to generate is: >> > >> > VABAL ops[3], ops[1], ops[2] >> > >> > It is OK if you accumulate the elements into one in the vector inside >> > of the loop (if one instruction can do this), but you have to make >> > sure other elements in the vector should remain zero so that the final >> > result is correct. >> > >> > If you are confused about the documentation, check the one for >> > udot_prod (just above usad in md.texi), as it has very similar >> > behavior as usad. Actually I copied the text from there and did some >> > changes. As those two instruction patterns are both for vectorization, >> > their behavior should not be difficult to explain. >> > >> > If you have more questions or think that the documentation is still >> > improper please let me know. > > Hi Cong, > > Thanks for your reply. > > I've looked at Dorit's original patch adding WIDEN_SUM_EXPR and > DOT_PROD_EXPR and I see that the same ambiguity exists for > DOT_PROD_EXPR. Can you please add a note in your tree.def > that SAD_EXPR, like DOT_PROD_EXPR can be expanded as either: > > tmp = WIDEN_MINUS_EXPR (arg1, arg2) > tmp2 = ABS_EXPR (tmp) > arg3 = PLUS_EXPR (tmp2, arg3) > > or: > > tmp = WIDEN_MINUS_EXPR (arg1, arg2) > tmp2 = ABS_EXPR (tmp) > arg3 = WIDEN_SUM_EXPR (tmp2, arg3) > > Where WIDEN_MINUS_EXPR is a signed MINUS_EXPR, returning a > a value of the same (widened) type as arg3. > I have added it, although we currently don't have WIDEN_MINUS_EXPR (I mentioned it in tree.def). > Also, while looking for the history of DOT_PROD_EXPR I spotted this > patch: > > [autovect] [patch] detect mult-hi and sad patterns > http://gcc.gnu.org/ml/gcc-patches/2005-10/msg01394.html > > I wonder what the reason was for that patch to be dropped? > It has been 8 years... I have no idea why that patch was never accepted; there is not even a reply in that thread. But I believe it is important for the SAD pattern to be recognized. ARM also provides instructions for it. Thank you for your comment again! thanks, Cong > Thanks, > James > diff --git a/gcc/ChangeLog b/gcc/ChangeLog index 6bdaa31..37ff6c4 100644 --- a/gcc/ChangeLog +++ b/gcc/ChangeLog @@ -1,4 +1,24 @@ -2013-11-01 Trevor Saunders +2013-10-29 Cong Hou + + * tree-vect-patterns.c (vect_recog_sad_pattern): New function for SAD + pattern recognition. + (type_conversion_p): PROMOTION is true if it's a type promotion + conversion, and false otherwise. Return true if the given expression + is a type conversion one. + * tree-vectorizer.h: Adjust the number of patterns. + * tree.def: Add SAD_EXPR. + * optabs.def: Add sad_optab. + * cfgexpand.c (expand_debug_expr): Add SAD_EXPR case. + * expr.c (expand_expr_real_2): Likewise. + * gimple-pretty-print.c (dump_ternary_rhs): Likewise. + * gimple.c (get_gimple_rhs_num_ops): Likewise. + * optabs.c (optab_for_tree_code): Likewise. + * tree-cfg.c (estimate_operator_cost): Likewise. + * tree-ssa-operands.c (get_expr_operands): Likewise. + * tree-vect-loop.c (get_initial_def_for_reduction): Likewise.
+ * config/i386/sse.md: Add SSE2 and AVX2 expand for SAD. + * doc/generic.texi: Add document for SAD_EXPR. + * doc/md.texi: Add document for ssad and usad. * function.c (reorder_blocks): Convert block_stack to a stack_vec. * gimplify.c (gimplify_compound_lval): Likewise. diff --git a/gcc/cfgexpand.c b/gcc/cfgexpand.c index fb05ce7..1f824fb 100644 --- a/gcc/cfgexpand.c +++ b/gcc/cfgexpand.c @@ -2740,6 +2740,7 @@ expand_debug_expr (tree exp) { case COND_EXPR: case DOT_PROD_EXPR: + case SAD_EXPR: case WIDEN_MULT_PLUS_EXPR: case WIDEN_MULT_MINUS_EXPR: case FMA_EXPR: diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md index 9094a1c..af73817 100644 --- a/gcc/config/i386/sse.md +++ b/gcc/config/i386/sse.md @@ -7278,6 +7278,36 @@ DONE; }) +(define_expand "usadv16qi" + [(match_operand:V4SI 0 "register_operand") + (match_operand:V16QI 1 "register_operand") + (match_
Re: [PATCH] Small fix: add { dg-require-effective-target vect_int } to testsuite/gcc.dg/vect/pr58508.c
Hi Jakub Thank you for pointing it out. The updated patch is pasted below. I will pay attention to it in the future. thanks, Cong diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog index 3d9916d..32a6ff7 100644 --- a/gcc/testsuite/ChangeLog +++ b/gcc/testsuite/ChangeLog @@ -1,3 +1,7 @@ +2013-11-12 Cong Hou + + * gcc.dg/vect/pr58508.c: Remove dg-options as vect_int is indicated. + 2013-10-29 Cong Hou * gcc.dg/vect/pr58508.c: Update. diff --git a/gcc/testsuite/gcc.dg/vect/pr58508.c b/gcc/testsuite/gcc.dg/vect/pr58508.c index fff7a04..c4921bb 100644 --- a/gcc/testsuite/gcc.dg/vect/pr58508.c +++ b/gcc/testsuite/gcc.dg/vect/pr58508.c @@ -1,6 +1,5 @@ /* { dg-require-effective-target vect_int } */ /* { dg-do compile } */ -/* { dg-options "-O2 -ftree-vectorize -fdump-tree-vect-details" } */ /* The GCC vectorizer generates loop versioning for the following loop On Tue, Nov 12, 2013 at 6:05 AM, Jakub Jelinek wrote: > On Thu, Nov 07, 2013 at 06:24:55PM -0800, Cong Hou wrote: >> Ping. OK for the trunk? >> On Fri, Nov 1, 2013 at 10:47 AM, Cong Hou wrote: >> > It seems that on some platforms the loops in >> > testsuite/gcc.dg/vect/pr58508.c may be unable to be vectorized. This >> > small patch added { dg-require-effective-target vect_int } to make >> > sure all loops can be vectorized. >> > diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog >> > index 9d0f4a5..3d9916d 100644 >> > --- a/gcc/testsuite/ChangeLog >> > +++ b/gcc/testsuite/ChangeLog >> > @@ -1,3 +1,7 @@ >> > +2013-10-29 Cong Hou >> > + >> > + * gcc.dg/vect/pr58508.c: Update. >> > + >> > 2013-10-15 Cong Hou >> > >> > * gcc.dg/vect/pr58508.c: New test. >> > diff --git a/gcc/testsuite/gcc.dg/vect/pr58508.c >> > b/gcc/testsuite/gcc.dg/vect/pr58508.c >> > index 6484a65..fff7a04 100644 >> > --- a/gcc/testsuite/gcc.dg/vect/pr58508.c >> > +++ b/gcc/testsuite/gcc.dg/vect/pr58508.c >> > @@ -1,3 +1,4 @@ >> > +/* { dg-require-effective-target vect_int } */ >> > /* { dg-do compile } */ >> > /* { dg-options "-O2 -ftree-vectorize -fdump-tree-vect-details" } */ > > This isn't the only bug in the testcase. Another one is using > dg-options in gcc.dg/vect/, you should just leave that out, > the default options already include those options, but explicit dg-options > mean that other required options like -msse2 on i?86 aren't added. > > Jakub
Re: [PATCH] Small fix: add { dg-require-effective-target vect_int } to testsuite/gcc.dg/vect/pr58508.c
Got it! thanks, Cong On Tue, Nov 12, 2013 at 10:05 AM, Jakub Jelinek wrote: > On Tue, Nov 12, 2013 at 10:04:15AM -0800, Cong Hou wrote: >> Thank you for pointing it out. The updated patch is pasted below. I >> will pay attention to it in the future. > > Ok, thanks. > Note, you can use dg-additional-options if needed in g*.dg/vect/, just not > dg-options. > >> --- a/gcc/testsuite/ChangeLog >> +++ b/gcc/testsuite/ChangeLog >> @@ -1,3 +1,7 @@ >> +2013-11-12 Cong Hou >> + >> + * gcc.dg/vect/pr58508.c: Remove dg-options as vect_int is indicated. >> + >> 2013-10-29 Cong Hou >> >> * gcc.dg/vect/pr58508.c: Update. >> diff --git a/gcc/testsuite/gcc.dg/vect/pr58508.c >> b/gcc/testsuite/gcc.dg/vect/pr58508.c >> index fff7a04..c4921bb 100644 >> --- a/gcc/testsuite/gcc.dg/vect/pr58508.c >> +++ b/gcc/testsuite/gcc.dg/vect/pr58508.c >> @@ -1,6 +1,5 @@ >> /* { dg-require-effective-target vect_int } */ >> /* { dg-do compile } */ >> -/* { dg-options "-O2 -ftree-vectorize -fdump-tree-vect-details" } */ >> >> >> /* The GCC vectorizer generates loop versioning for the following loop > > Jakub
[PATCH] [Vectorization] Fixing a bug in alias checks merger.
The current alias check merger does not consider the DR_STEP of data refs when sorting them. For the following loop: for (i = 0; i < N; ++i) a[i] = b[0] + b[i] + b[1]; The data refs b[0] and b[i] have the same DR_INIT and DR_OFFSET, so after sorting the three DR pairs, the following order is a possible result: (a[i], b[0]), (a[i], b[i]), (a[i], b[1]) This prevents the alias checks for (a[i], b[0]) and (a[i], b[1]) from being merged. This patch adds a comparison of the DR_STEPs of two data refs to the sort. The test case is also updated. The previous one used explicit dg-options, which blocked the default vect testsuite options (such as -msse2 where required). The test case also assumed a vector can hold at least 4 integers of int type, which may not be true on some targets. The patch is pasted below. Bootstrapped and tested on an x86-64 machine. thanks, Cong diff --git a/gcc/ChangeLog b/gcc/ChangeLog index 2c0554b..5faa5ca 100644 --- a/gcc/ChangeLog +++ b/gcc/ChangeLog @@ -1,3 +1,14 @@ +2013-11-12 Cong Hou + + * tree-vectorizer.h (struct dr_with_seg_len): Remove the base + address field as it can be obtained from dr. Rename the struct. + * tree-vect-data-refs.c (comp_dr_with_seg_len_pair): Consider + steps of data references during sort. + (vect_prune_runtime_alias_test_list): Adjust with the change to + struct dr_with_seg_len. + * tree-vect-loop-manip.c (vect_create_cond_for_alias_checks): + Adjust with the change to struct dr_with_seg_len. + 2013-11-12 Jeff Law * tree-ssa-threadedge.c (thread_around_empty_blocks): New diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog index 09c7f20..8075409 100644 --- a/gcc/testsuite/ChangeLog +++ b/gcc/testsuite/ChangeLog @@ -1,3 +1,7 @@ +2013-11-12 Cong Hou + + * gcc.dg/vect/vect-alias-check.c: Update. + 2013-11-12 Balaji V. Iyer * gcc.dg/cilk-plus/cilk-plus.exp: Added a check for LTO before running diff --git a/gcc/testsuite/gcc.dg/vect/vect-alias-check.c b/gcc/testsuite/gcc.dg/vect/vect-alias-check.c index 64a4e0c..c1bffed 100644 --- a/gcc/testsuite/gcc.dg/vect/vect-alias-check.c +++ b/gcc/testsuite/gcc.dg/vect/vect-alias-check.c @@ -1,17 +1,17 @@ /* { dg-require-effective-target vect_int } */ /* { dg-do compile } */ -/* { dg-options "-O2 -ftree-vectorize --param=vect-max-version-for-alias-checks=2 -fdump-tree-vect-details" } */ +/* { dg-additional-options "--param=vect-max-version-for-alias-checks=2" } */ -/* A test case showing three potential alias checks between - a[i] and b[i], b[i+7], b[i+14]. With alias checks merging - enabled, those tree checks can be merged into one, and the - loop will be vectorized with vect-max-version-for-alias-checks=2. */ +/* A test case showing four potential alias checks between a[i] and b[0], b[1], + b[i+1] and b[i+2]. With alias check merging enabled, those four checks + can be merged into two, and the loop will be vectorized with + vect-max-version-for-alias-checks=2. */ void foo (int *a, int *b) { int i; for (i = 0; i < 1000; ++i) -a[i] = b[i] + b[i+7] + b[i+14]; +a[i] = b[0] + b[1] + b[i+1] + b[i+2]; } /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */ diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c index c479775..7f0920d 100644 --- a/gcc/tree-vect-data-refs.c +++ b/gcc/tree-vect-data-refs.c @@ -2620,7 +2620,7 @@ vect_analyze_data_ref_accesses (loop_vec_info loop_vinfo, bb_vec_info bb_vinfo) } -/* Operator == between two dr_addr_with_seg_len objects.
This equality operator is used to make sure two data refs are the same one so that we will consider to combine the @@ -2628,62 +2628,51 @@ vect_analyze_data_ref_accesses (loop_vec_info loop_vinfo, bb_vec_info bb_vinfo) refs. */ static bool -operator == (const dr_addr_with_seg_len& d1, - const dr_addr_with_seg_len& d2) +operator == (const dr_with_seg_len& d1, + const dr_with_seg_len& d2) { - return operand_equal_p (d1.basic_addr, d2.basic_addr, 0) - && compare_tree (d1.offset, d2.offset) == 0 - && compare_tree (d1.seg_len, d2.seg_len) == 0; + return operand_equal_p (DR_BASE_ADDRESS (d1.dr), + DR_BASE_ADDRESS (d2.dr), 0) + && compare_tree (d1.offset, d2.offset) == 0 + && compare_tree (d1.seg_len, d2.seg_len) == 0; } -/* Function comp_dr_addr_with_seg_len_pair. +/* Function comp_dr_with_seg_len_pair. - Comparison function for sorting objects of dr_addr_with_seg_len_pair_t + Comparison function for sorting objects of dr_with_seg_len_pair_t so that we can combine aliasing checks in one scan. */ static int -comp_dr_addr_with_seg_len_pair (const void *p1_, const void *p2_) +comp_dr_with_seg_len_pair (const void *p1_, const void *p2_) { - const dr_addr_with_seg_len_pair_t* p1 = -(const dr_addr_with_seg_len_pair_t *) p1_; - const dr_addr_with_seg_len_pair_t* p2 = -(const dr_addr
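For intuition about what the new sort order enables: once the pairs against b[0] and b[1] are adjacent, their runtime checks can share one segment. Roughly (illustrative pointer arithmetic, not the exact generated condition):

  /* Two separate checks against the 4-byte invariant accesses:
       (a_end <= &b[0] || &b[0] + 4 <= a_begin)
    && (a_end <= &b[1] || &b[1] + 4 <= a_begin)
     merge into a single check over the segment covering b[0..1]:
       (a_end <= &b[0] || &b[0] + 8 <= a_begin)  */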
[PATCH] Do not set flag_complex_method to 2 for C++ by default.
This patch is for PR58963. In the patch http://gcc.gnu.org/ml/gcc-patches/2005-02/msg00560.html, the builtin function is used to perform complex multiplication and division. This is to comply with the C99 standard, but I am wondering if C++ also needs this. There is no complex keyword in C++, and there is nothing in the C++ standard about the behavior of operations on complex types. The <complex> header file is implemented entirely in source code, including complex multiplication and division. GCC should not do too much for them by emitting builtin calls by default (although we can set -fcx-limited-range to prevent GCC from doing this), which has a big impact on performance (there may exist vectorization opportunities). In this patch flag_complex_method will not be set to 2 for C++. Bootstrapped and tested on an x86-64 machine. thanks, Cong Index: gcc/c-family/c-opts.c === --- gcc/c-family/c-opts.c (revision 204712) +++ gcc/c-family/c-opts.c (working copy) @@ -198,8 +198,10 @@ c_common_init_options_struct (struct gcc opts->x_warn_write_strings = c_dialect_cxx (); opts->x_flag_warn_unused_result = true; - /* By default, C99-like requirements for complex multiply and divide. */ - opts->x_flag_complex_method = 2; + /* By default, C99-like requirements for complex multiply and divide. + But for C++ this should not be required. */ + if (c_language != clk_cxx && c_language != clk_objcxx) +opts->x_flag_complex_method = 2; } /* Common initialization before calling option handlers. */ Index: gcc/c-family/ChangeLog === --- gcc/c-family/ChangeLog (revision 204712) +++ gcc/c-family/ChangeLog (working copy) @@ -1,3 +1,8 @@ +2013-11-13 Cong Hou + + * c-opts.c (c_common_init_options_struct): Don't let C++ comply with + C99-like requirements for complex multiply and divide. + 2013-11-12 Joseph Myers * c-common.c (c_common_reswords): Add _Thread_local.
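For readers unfamiliar with the difference: with flag_complex_method == 2, GCC calls a libgcc helper (__mulsc3 for float, __muldc3 for double) instead of inlining the naive formula. A sketch of the semantic gap (illustrative, not the libgcc source):

  /* Naive multiply, as the <complex> header spells it out:
     four multiplies, an add and a subtract; straight-line and
     vectorizable.  */
  re = a_re * b_re - a_im * b_im;
  im = a_re * b_im + a_im * b_re;

  /* C99 Annex G semantics: if the naive result is NaN, the helper
     rechecks for infinite operands and recomputes, so that e.g.
     (inf + 0i) * (1 + 0i) yields inf rather than NaN.  This extra
     control flow is what blocks vectorization.  */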
Re: [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.
Ping? thanks, Cong On Mon, Nov 11, 2013 at 11:25 AM, Cong Hou wrote: > Hi James > > Sorry for the late reply. > > > On Fri, Nov 8, 2013 at 2:55 AM, James Greenhalgh > wrote: >>> On Tue, Nov 5, 2013 at 9:58 AM, Cong Hou wrote: >>> > Thank you for your detailed explanation. >>> > >>> > Once GCC detects a reduction operation, it will automatically >>> > accumulate all elements in the vector after the loop. In the loop the >>> > reduction variable is always a vector whose elements are reductions of >>> > corresponding values from other vectors. Therefore in your case the >>> > only instruction you need to generate is: >>> > >>> > VABAL ops[3], ops[1], ops[2] >>> > >>> > It is OK if you accumulate the elements into one in the vector inside >>> > of the loop (if one instruction can do this), but you have to make >>> > sure other elements in the vector should remain zero so that the final >>> > result is correct. >>> > >>> > If you are confused about the documentation, check the one for >>> > udot_prod (just above usad in md.texi), as it has very similar >>> > behavior as usad. Actually I copied the text from there and did some >>> > changes. As those two instruction patterns are both for vectorization, >>> > their behavior should not be difficult to explain. >>> > >>> > If you have more questions or think that the documentation is still >>> > improper please let me know. >> >> Hi Cong, >> >> Thanks for your reply. >> >> I've looked at Dorit's original patch adding WIDEN_SUM_EXPR and >> DOT_PROD_EXPR and I see that the same ambiguity exists for >> DOT_PROD_EXPR. Can you please add a note in your tree.def >> that SAD_EXPR, like DOT_PROD_EXPR can be expanded as either: >> >> tmp = WIDEN_MINUS_EXPR (arg1, arg2) >> tmp2 = ABS_EXPR (tmp) >> arg3 = PLUS_EXPR (tmp2, arg3) >> >> or: >> >> tmp = WIDEN_MINUS_EXPR (arg1, arg2) >> tmp2 = ABS_EXPR (tmp) >> arg3 = WIDEN_SUM_EXPR (tmp2, arg3) >> >> Where WIDEN_MINUS_EXPR is a signed MINUS_EXPR, returning a >> a value of the same (widened) type as arg3. >> > > > I have added it, although we currently don't have WIDEN_MINUS_EXPR (I > mentioned it in tree.def). > > >> Also, while looking for the history of DOT_PROD_EXPR I spotted this >> patch: >> >> [autovect] [patch] detect mult-hi and sad patterns >> http://gcc.gnu.org/ml/gcc-patches/2005-10/msg01394.html >> >> I wonder what the reason was for that patch to be dropped? >> > > It has been 8 years.. I have no idea why this patch is not accepted > finally. There is even no reply in that thread. But I believe the SAD > pattern is very important to be recognized. ARM also provides > instructions for it. > > > Thank you for your comment again! > > > thanks, > Cong > > > >> Thanks, >> James >>
Re: [PATCH] Do not set flag_complex_method to 2 for C++ by default.
See the following code (the element type of foo/bar was lost in archiving; complex<float> is assumed here): #include <complex> using std::complex; template <typename _Tp, typename _Up> complex<_Tp>& mult_assign (complex<_Tp>& __y, const complex<_Up>& __z) { _Tp& _M_real = __y.real(); _Tp& _M_imag = __y.imag(); const _Tp __r = _M_real * __z.real() - _M_imag * __z.imag(); _M_imag = _M_real * __z.imag() + _M_imag * __z.real(); _M_real = __r; return __y; } void foo (complex<float>& c1, complex<float>& c2) { c1 *= c2; } void bar (complex<float>& c1, complex<float>& c2) { mult_assign(c1, c2); } The function mult_assign is written almost by copying the implementation of operator*= from <complex>. It has exactly the same behavior from the point of view of the source code. However, the compiled results of foo() and bar() are different: foo() uses the builtin function for multiplication but bar() does not. Just because of a name change, the final behavior changes? This is not how a compiler should work. thanks, Cong On Thu, Nov 14, 2013 at 10:17 AM, Andrew Pinski wrote: > On Thu, Nov 14, 2013 at 8:25 AM, Xinliang David Li wrote: >> Can we revisit the decision for this? Here are the reasons: >> >> 1) It seems that the motivation to make C++ consistent with c99 is to >> avoid confusing users who build the C source with both C and C++ >> compilers. Why should C++'s default behavior be tuned for this niche >> case? > > It is not a niche case. It is confusing for people who write C++ code > to rewrite their code to C99 and find that C is much slower because of > correctness? I think they have this backwards here. C++ should be > consistent with C here. > >> 2) It is very confusing for users who see huge performance difference >> between compiler generated code for Complex multiplication vs manually >> expanded code > > I don't see why this is an issue if they understand how complex > multiplication works for correctness. I am sorry but correctness over > speed is a good argument of why this should stay this way. > >> 3) The default setting can also block potential vectorization >> opportunities for complex operations > > Yes so again this is about a correctness issue over a speed issue. > >> 4) GCC is about the only compiler which has this default -- very few > user knows about GCC's strict default, and will think GCC performs >> poorly. > > > Correctness over speed is better. I am sorry GCC is the only one > which gets it correct here. If people don't like there is a flag to > disable it. > > Thanks, > Andrew Pinski > >> >> thanks, >> >> David >> >> >> On Wed, Nov 13, 2013 at 9:07 PM, Andrew Pinski wrote: >>> On Wed, Nov 13, 2013 at 5:26 PM, Cong Hou wrote: >>>> This patch is for PR58963. >>>> >>>> In the patch http://gcc.gnu.org/ml/gcc-patches/2005-02/msg00560.html, >>>> the builtin function is used to perform complex multiplication and >>>> division. This is to comply with C99 standard, but I am wondering if >>>> C++ also needs this. >>>> >>>> There is no complex keyword in C++, and no content in C++ standard >>>> about the behavior of operations on complex types. The >>>> header file is all written in source code, including complex >>>> multiplication and division. GCC should not do too much for them by >>>> using builtin calls by default (although we can set -fcx-limited-range >>>> to prevent GCC doing this), which has a big impact on performance >>>> (there may exist vectorization opportunities). >>>> >>>> In this patch flag_complex_method will not be set to 2 for C++. >>>> Bootstraped and tested on an x86-64 machine.
>>> >>> I think you need to look into this issue deeper as the original patch >>> only enabled it for C99: >>> http://gcc.gnu.org/ml/gcc-patches/2005-02/msg01483.html . >>> >>> Just a little deeper will find >>> http://gcc.gnu.org/ml/gcc/2007-07/msg00124.html which says yes C++ >>> needs this. >>> >>> Thanks, >>> Andrew Pinski >>> >>>> >>>> >>>> thanks, >>>> Cong >>>> >>>> >>>> Index: gcc/c-family/c-opts.c >>>> === >>>> --- gcc/c-family/c-opts.c (revision 204712) >>>> +++ gcc/c-family/c-opts.c (working copy) >>>> @@ -198,8 +198,10 @@ c_common_init_options_struct (struct gcc >>>>opts->x_warn_write_strings = c_dialect_cxx (); >>>>opts->x_flag_warn_unused_result = true; >>>> >>>> -
[PATCH] Support addsub/subadd as non-isomorphic operations for SLP vectorizer.
Hi This patch adds support for two non-isomorphic operations, addsub and subadd, to the SLP vectorizer. More non-isomorphic operations can be added later, but the limitation is that operations on even/odd elements should still be isomorphic. Once such an operation is detected, the operation code to be used in the vectorized code is stored and will later be used during statement transformation. Two new GIMPLE operations, VEC_ADDSUB_EXPR and VEC_SUBADD_EXPR, are defined, along with new optabs for them. They are also documented. Target support for SSE/SSE2/SSE3/AVX is added for those two new operations on floating-point types. SSE3/AVX provides the ADDSUBPD and ADDSUBPS instructions. For SSE/SSE2, those two operations are emulated using two instructions (selectively negate, then add). With this patch the following functions will be SLP vectorized: float a[4], b[4], c[4]; // double also OK. void subadd () { c[0] = a[0] - b[0]; c[1] = a[1] + b[1]; c[2] = a[2] - b[2]; c[3] = a[3] + b[3]; } void addsub () { c[0] = a[0] + b[0]; c[1] = a[1] - b[1]; c[2] = a[2] + b[2]; c[3] = a[3] - b[3]; } Bootstrapped and tested on an x86-64 machine. thanks, Cong diff --git a/gcc/ChangeLog b/gcc/ChangeLog index 2c0554b..656d5fb 100644 --- a/gcc/ChangeLog +++ b/gcc/ChangeLog @@ -1,3 +1,31 @@ +2013-11-14 Cong Hou + + * tree-vect-slp.c (vect_create_new_slp_node): Initialize + SLP_TREE_OP_CODE. + (slp_supported_non_isomorphic_op): New function. Check if the + non-isomorphic operation is supported or not. + (vect_build_slp_tree_1): Consider non-isomorphic operations. + (vect_build_slp_tree): Change argument. + * tree-vect-stmts.c (vectorizable_operation): Consider the opcode + for non-isomorphic operations. + * optabs.def (vec_addsub_optab, vec_subadd_optab): New optabs. + * tree.def (VEC_ADDSUB_EXPR, VEC_SUBADD_EXPR): New operations. + * expr.c (expand_expr_real_2): Add support to VEC_ADDSUB_EXPR and + VEC_SUBADD_EXPR. + * gimple-pretty-print.c (dump_binary_rhs): Likewise. + * optabs.c (optab_for_tree_code): Likewise. + * tree-cfg.c (verify_gimple_assign_binary): Likewise. + * tree-vectorizer.h (struct _slp_tree): New data member. + * config/i386/i386-protos.h (ix86_sse_expand_fp_addsub_operator): + New funtion. Expand addsub/subadd operations for SSE2. + * config/i386/i386.c (ix86_sse_expand_fp_addsub_operator): Likewise. + * config/i386/sse.md (UNSPEC_SUBADD, UNSPEC_ADDSUB): New RTL operation. + (vec_subadd_v4sf3, vec_subadd_v2df3, vec_subadd_3, + vec_addsub_v4sf3, vec_addsub_v2df3, vec_addsub_3): + Expand addsub/subadd operations for SSE/SSE2/SSE3/AVX. + * doc/generic.texi (VEC_ADDSUB_EXPR, VEC_SUBADD_EXPR): New doc. + * doc/md.texi (vec_addsub_@var{m}3, vec_subadd_@var{m}3): New doc.
+ 2013-11-12 Jeff Law * tree-ssa-threadedge.c (thread_around_empty_blocks): New diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h index fdf9d58..b02b757 100644 --- a/gcc/config/i386/i386-protos.h +++ b/gcc/config/i386/i386-protos.h @@ -117,6 +117,7 @@ extern rtx ix86_expand_adjust_ufix_to_sfix_si (rtx, rtx *); extern enum ix86_fpcmp_strategy ix86_fp_comparison_strategy (enum rtx_code); extern void ix86_expand_fp_absneg_operator (enum rtx_code, enum machine_mode, rtx[]); +extern void ix86_sse_expand_fp_addsub_operator (bool, enum machine_mode, rtx[]); extern void ix86_expand_copysign (rtx []); extern void ix86_split_copysign_const (rtx []); extern void ix86_split_copysign_var (rtx []); diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c index 5287b49..76f38f5 100644 --- a/gcc/config/i386/i386.c +++ b/gcc/config/i386/i386.c @@ -18702,6 +18702,51 @@ ix86_expand_fp_absneg_operator (enum rtx_code code, enum machine_mode mode, emit_insn (set); } +/* Generate code for addsub or subadd on fp vectors for sse/sse2. The flag + SUBADD indicates if we are generating code for subadd or addsub. */ + +void +ix86_sse_expand_fp_addsub_operator (bool subadd, enum machine_mode mode, +rtx operands[]) +{ + rtx mask; + rtx neg_mask32 = GEN_INT (0x80000000); + rtx neg_mask64 = GEN_INT ((HOST_WIDE_INT)1 << 63); + + switch (mode) +{ +case V4SFmode: + if (subadd) + mask = gen_rtx_CONST_VECTOR (V4SImode, gen_rtvec (4, + neg_mask32, const0_rtx, neg_mask32, const0_rtx)); + else + mask = gen_rtx_CONST_VECTOR (V4SImode, gen_rtvec (4, + const0_rtx, neg_mask32, const0_rtx, neg_mask32)); + break; + +case V2DFmode: + if (subadd) + mask = gen_rtx_CONST_VECTOR (V2DImode, gen_rtvec (2, + neg_mask64, const0_rtx)); + else + mask = gen_rtx_CONST_VECTOR (V2DImode, gen_rtvec (2, + const0_rtx, neg_mask64)); + break; + +default: + gcc_unreachable (); +} + + rtx tmp = gen_reg_rtx (mode); + convert_move (tmp, mask, false); + + rtx tmp2 = gen_reg_rtx (mode); + tmp2 = expand_simple_binop (mode, XOR, tmp, operands[2], + tmp2, 0, OPTAB_DIRECT); + expand_simple_binop (mode, PLUS, operands[1], tmp2, + operands[0], 0, OPTAB_
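For reference, the intended lane-wise semantics of the two new tree codes, written as scalar C (this matches the subadd()/addsub() examples above; the loop form is just illustrative):

  /* VEC_SUBADD_EXPR: subtract in even lanes, add in odd lanes.  */
  for (i = 0; i < n; ++i)
    c[i] = (i & 1) ? a[i] + b[i] : a[i] - b[i];

  /* VEC_ADDSUB_EXPR: add in even lanes, subtract in odd lanes.  */
  for (i = 0; i < n; ++i)
    c[i] = (i & 1) ? a[i] - b[i] : a[i] + b[i];

The SSE/SSE2 fallback in the patch implements these by XORing the sign bit into the lanes to be subtracted and then doing a plain vector add.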
Re: [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.
Any more comments? thanks, Cong On Wed, Nov 13, 2013 at 6:06 PM, Cong Hou wrote: > Ping? > > > thanks, > Cong > > > On Mon, Nov 11, 2013 at 11:25 AM, Cong Hou wrote: >> Hi James >> >> Sorry for the late reply. >> >> >> On Fri, Nov 8, 2013 at 2:55 AM, James Greenhalgh >> wrote: >>>> On Tue, Nov 5, 2013 at 9:58 AM, Cong Hou wrote: >>>> > Thank you for your detailed explanation. >>>> > >>>> > Once GCC detects a reduction operation, it will automatically >>>> > accumulate all elements in the vector after the loop. In the loop the >>>> > reduction variable is always a vector whose elements are reductions of >>>> > corresponding values from other vectors. Therefore in your case the >>>> > only instruction you need to generate is: >>>> > >>>> > VABAL ops[3], ops[1], ops[2] >>>> > >>>> > It is OK if you accumulate the elements into one in the vector inside >>>> > of the loop (if one instruction can do this), but you have to make >>>> > sure other elements in the vector should remain zero so that the final >>>> > result is correct. >>>> > >>>> > If you are confused about the documentation, check the one for >>>> > udot_prod (just above usad in md.texi), as it has very similar >>>> > behavior as usad. Actually I copied the text from there and did some >>>> > changes. As those two instruction patterns are both for vectorization, >>>> > their behavior should not be difficult to explain. >>>> > >>>> > If you have more questions or think that the documentation is still >>>> > improper please let me know. >>> >>> Hi Cong, >>> >>> Thanks for your reply. >>> >>> I've looked at Dorit's original patch adding WIDEN_SUM_EXPR and >>> DOT_PROD_EXPR and I see that the same ambiguity exists for >>> DOT_PROD_EXPR. Can you please add a note in your tree.def >>> that SAD_EXPR, like DOT_PROD_EXPR can be expanded as either: >>> >>> tmp = WIDEN_MINUS_EXPR (arg1, arg2) >>> tmp2 = ABS_EXPR (tmp) >>> arg3 = PLUS_EXPR (tmp2, arg3) >>> >>> or: >>> >>> tmp = WIDEN_MINUS_EXPR (arg1, arg2) >>> tmp2 = ABS_EXPR (tmp) >>> arg3 = WIDEN_SUM_EXPR (tmp2, arg3) >>> >>> Where WIDEN_MINUS_EXPR is a signed MINUS_EXPR, returning a >>> a value of the same (widened) type as arg3. >>> >> >> >> I have added it, although we currently don't have WIDEN_MINUS_EXPR (I >> mentioned it in tree.def). >> >> >>> Also, while looking for the history of DOT_PROD_EXPR I spotted this >>> patch: >>> >>> [autovect] [patch] detect mult-hi and sad patterns >>> http://gcc.gnu.org/ml/gcc-patches/2005-10/msg01394.html >>> >>> I wonder what the reason was for that patch to be dropped? >>> >> >> It has been 8 years.. I have no idea why this patch is not accepted >> finally. There is even no reply in that thread. But I believe the SAD >> pattern is very important to be recognized. ARM also provides >> instructions for it. >> >> >> Thank you for your comment again! >> >> >> thanks, >> Cong >> >> >> >>> Thanks, >>> James >>>
Re: [PATCH] Support addsub/subadd as non-isomorphic operations for SLP vectorizer.
I tried your method and it works well for doubles, but for float there is an issue. For the following gimple code: c1 = a - b; c2 = a + b; c = VEC_PERM It needs two instructions to implement the VEC_PERM operation on SSE2-SSE4, one of which should be shufps, which is represented by the following pattern in RTL: (define_insn "sse_shufps_<mode>" [(set (match_operand:VI4F_128 0 "register_operand" "=x,x") (vec_select:VI4F_128 (vec_concat: (match_operand:VI4F_128 1 "register_operand" "0,x") (match_operand:VI4F_128 2 "nonimmediate_operand" "xm,xm")) (parallel [(match_operand 3 "const_0_to_3_operand") (match_operand 4 "const_0_to_3_operand") (match_operand 5 "const_4_to_7_operand") (match_operand 6 "const_4_to_7_operand")])))] ...) Note that it contains two RTL operations. Together with the minus, the plus, and one more shuffling instruction, we have at least five instructions for the addsub pattern. I think during the combine pass only up to four instructions are considered for combination, right? So unless we compress those five instructions into four or fewer, we cannot use this method for float values. What do you think? thanks, Cong On Fri, Nov 15, 2013 at 12:53 AM, Richard Biener wrote: > On Thu, 14 Nov 2013, Cong Hou wrote: > >> Hi >> >> This patch adds the support to two non-isomorphic operations addsub >> and subadd for SLP vectorizer. More non-isomorphic operations can be >> added later, but the limitation is that operations on even/odd >> elements should still be isomorphic. Once such an operation is >> detected, the code of the operation used in vectorized code is stored >> and later will be used during statement transformation. Two new GIMPLE >> opeartions VEC_ADDSUB_EXPR and VEC_SUBADD_EXPR are defined. And also >> new optabs for them. They are also documented. >> >> The target supports for SSE/SSE2/SSE3/AVX are added for those two new >> operations on floating points. SSE3/AVX provides ADDSUBPD and ADDSUBPS >> instructions. For SSE/SSE2, those two operations are emulated using >> two instructions (selectively negate then add). >> >> With this patch the following function will be SLP vectorized: >> >> >> float a[4], b[4], c[4]; // double also OK. >> >> void subadd () >> { >> c[0] = a[0] - b[0]; >> c[1] = a[1] + b[1]; >> c[2] = a[2] - b[2]; >> c[3] = a[3] + b[3]; >> } >> >> void addsub () >> { >> c[0] = a[0] + b[0]; >> c[1] = a[1] - b[1]; >> c[2] = a[2] + b[2]; >> c[3] = a[3] - b[3]; >> } >> >> >> Boostrapped and tested on an x86-64 machine. > > I managed to do this without adding new tree codes or optabs by > vectorizing the above as > >c1 = a + b; >c2 = a - b; >c = VEC_PERM > > which then matches sse3_addsubv4sf3 if you fix that pattern to > not use vec_merge (or fix PR56766). Doing it this way also > means that the code is vectorizable if you don't have a HW > instruction for that but can do the VEC_PERM efficiently. > > So, I'd like to avoid new tree codes and optabs whenever possible > and here I've already proved (with a patch) that it is possible. > Didn't have time to clean it up, and it likely doesn't apply anymore > (and PR56766 blocks it but it even has a patch). > > Btw, this was PR56902 where I attached my patch. > > Richard. > >> >> thanks, >> Cong >> >> >> >> >> >> diff --git a/gcc/ChangeLog b/gcc/ChangeLog >> index 2c0554b..656d5fb 100644 >> --- a/gcc/ChangeLog >> +++ b/gcc/ChangeLog >> @@ -1,3 +1,31 @@ >> +2013-11-14 Cong Hou >> + >> + * tree-vect-slp.c (vect_create_new_slp_node): Initialize >> + SLP_TREE_OP_CODE. >> + (slp_supported_non_isomorphic_op): New function.
Check if the >> + non-isomorphic operation is supported or not. >> + (vect_build_slp_tree_1): Consider non-isomorphic operations. >> + (vect_build_slp_tree): Change argument. >> + * tree-vect-stmts.c (vectorizable_operation): Consider the opcode >> + for non-isomorphic operations. >> + * optabs.def (vec_addsub_optab, vec_subadd_optab): New optabs. >> + * tree.def (VEC_ADDSUB_EXPR, VEC_SUBADD_EXPR): New operations. >> + * expr.c (expand_expr_real_2): Add support to VEC_ADDSUB_EXPR and >> + VEC_SUBADD_EXPR. >> + * gimple-pretty-print.c (dump_binary_rhs): Likewise. >> + * optabs.c (optab_for_tree_code): Likewise. >> + * tree-cfg.c (verify_gimple_assign_binary): Likewise. >> + * tree-vectorizer.h (struct _slp_tree): New data member
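To make the VEC_PERM approach concrete for V4SF (a gimple-like sketch; the selector follows the addsub convention of subtract-in-even, add-in-odd lanes):

  c1 = a - b;                                   /* a0-b0 a1-b1 a2-b2 a3-b3 */
  c2 = a + b;                                   /* a0+b0 a1+b1 a2+b2 a3+b3 */
  c  = VEC_PERM_EXPR <c1, c2, { 0, 5, 2, 7 }>;  /* a0-b0 a1+b1 a2-b2 a3+b3 */

On SSE3 this whole sequence corresponds to a single addsubps; without SSE3 the permutation itself must be synthesized, which is where the shufps pattern and combine's four-instruction limit discussed above come in.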
Re: [PATCH] Support addsub/subadd as non-isomorphic operations for SLP vectorizer.
On Fri, Nov 15, 2013 at 1:20 AM, Uros Bizjak wrote: > Hello! > >> This patch adds the support to two non-isomorphic operations addsub >> and subadd for SLP vectorizer. More non-isomorphic operations can be >> added later, but the limitation is that operations on even/odd >> elements should still be isomorphic. Once such an operation is >> detected, the code of the operation used in vectorized code is stored >> and later will be used during statement transformation. Two new GIMPLE >> opeartions VEC_ADDSUB_EXPR and VEC_SUBADD_EXPR are defined. And also >> new optabs for them. They are also documented. >> >> The target supports for SSE/SSE2/SSE3/AVX are added for those two new >> operations on floating points. SSE3/AVX provides ADDSUBPD and ADDSUBPS >> instructions. For SSE/SSE2, those two operations are emulated using >> two instructions (selectively negate then add). > >;; SSE3 >UNSPEC_LDDQU > + UNSPEC_SUBADD > + UNSPEC_ADDSUB > > No! Please avoid unspecs. OK, got it. > > +(define_expand "vec_subadd_v4sf3" > + [(set (match_operand:V4SF 0 "register_operand") > + (unspec:V4SF > + [(match_operand:V4SF 1 "register_operand") > + (match_operand:V4SF 2 "nonimmediate_operand")] UNSPEC_SUBADD))] > + "TARGET_SSE" > +{ > + if (TARGET_SSE3) > +emit_insn (gen_sse3_addsubv4sf3 (operands[0], operands[1], operands[2])); > + else > +ix86_sse_expand_fp_addsub_operator (true, V4SFmode, operands); > + DONE; > +}) > > Make the expander pattern look like correspondig sse3 insn and: > ... > { > if (!TARGET_SSE3) > { > ix86_sse_expand_fp_...(); > DONE; > } > } > You mean I should write two expanders for SSE and SSE3 respectively? Thank you for your comment! Cong > Uros.
Re: [PATCH] Support addsub/subadd as non-isomorphic operations for SLP vectorizer.
On Fri, Nov 15, 2013 at 10:18 AM, Richard Earnshaw wrote:
> On 15/11/13 02:06, Cong Hou wrote:
>> Hi
>>
>> This patch adds support for two non-isomorphic operations, addsub
>> and subadd, to the SLP vectorizer. More non-isomorphic operations can be
>> added later, but the limitation is that operations on even/odd
>> elements should still be isomorphic. Once such an operation is
>> detected, the opcode used in the vectorized code is stored
>> and later will be used during statement transformation. Two new GIMPLE
>> operations VEC_ADDSUB_EXPR and VEC_SUBADD_EXPR are defined, along with
>> new optabs for them. They are also documented.
>>
>
> Notwithstanding what Richi has already said on this subject, you
> certainly don't need both VEC_ADDSUB_EXPR and VEC_SUBADD_EXPR. The
> latter can always be formed by vec-negating the second operand and
> passing it to VEC_ADDSUB_EXPR.
>

Right. But I also considered targets without support for addsub instructions. There we could still selectively negate the odd/even elements using masks and then use PLUS_EXPR (at most two instructions). If I instead implement VEC_SUBADD_EXPR by negating the second operand and then using VEC_ADDSUB_EXPR, I end up with one more instruction.

thanks,
Cong

> R.
>
>
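To make the two-instruction fallback concrete: on plain SSE/SSE2 the selective negation is just an XOR of the sign bit in the chosen lanes, followed by an ordinary vector add. A hedged intrinsics sketch (an illustrative reconstruction, not code from the patch):

#include <xmmintrin.h>  /* SSE */

/* subadd without ADDSUBPS: flip the sign of b in the even lanes by
   XORing with -0.0f there, then do a plain vector add.
   Lane i gets a[i] - b[i] for even i and a[i] + b[i] for odd i.  */
void
subadd_sse2 (const float *a, const float *b, float *c)
{
  const __m128 negate_even = _mm_setr_ps (-0.0f, 0.0f, -0.0f, 0.0f);
  __m128 va = _mm_loadu_ps (a);
  __m128 vb = _mm_xor_ps (_mm_loadu_ps (b), negate_even);
  _mm_storeu_ps (c, _mm_add_ps (va, vb));
}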
Re: [PATCH] Support addsub/subadd as non-isomorphic operations for SLP vectorizer.
On Mon, Nov 18, 2013 at 12:27 PM, Uros Bizjak wrote:
> On Mon, Nov 18, 2013 at 9:15 PM, Cong Hou wrote:
>
>>>> This patch adds support for two non-isomorphic operations, addsub
>>>> and subadd, to the SLP vectorizer. More non-isomorphic operations can be
>>>> added later, but the limitation is that operations on even/odd
>>>> elements should still be isomorphic. Once such an operation is
>>>> detected, the opcode used in the vectorized code is stored
>>>> and later will be used during statement transformation. Two new GIMPLE
>>>> operations VEC_ADDSUB_EXPR and VEC_SUBADD_EXPR are defined, along with
>>>> new optabs for them. They are also documented.
>>>>
>>>> Target support for SSE/SSE2/SSE3/AVX is added for those two new
>>>> operations on floating points. SSE3/AVX provides the ADDSUBPD and ADDSUBPS
>>>> instructions. For SSE/SSE2, those two operations are emulated using
>>>> two instructions (selectively negate then add).
>>>
>>> +(define_expand "vec_subadd_v4sf3"
>>> +  [(set (match_operand:V4SF 0 "register_operand")
>>> +	(unspec:V4SF
>>> +	  [(match_operand:V4SF 1 "register_operand")
>>> +	   (match_operand:V4SF 2 "nonimmediate_operand")] UNSPEC_SUBADD))]
>>> +  "TARGET_SSE"
>>> +{
>>> +  if (TARGET_SSE3)
>>> +    emit_insn (gen_sse3_addsubv4sf3 (operands[0], operands[1],
>>> operands[2]));
>>> +  else
>>> +    ix86_sse_expand_fp_addsub_operator (true, V4SFmode, operands);
>>> +  DONE;
>>> +})
>>>
>>> Make the expander pattern look like the corresponding sse3 insn and:
>>> ...
>>> {
>>>   if (!TARGET_SSE3)
>>>     {
>>>       ix86_sse_expand_fp_...();
>>>       DONE;
>>>     }
>>> }
>>>
>>
>> You mean I should write two expanders for SSE and SSE3 respectively?
>
> No, please use the same approach as you did for the abs2 expander.
> For !TARGET_SSE3, call the helper function (ix86_sse_expand...),
> otherwise expand through the pattern. Also, it looks to me that you should
> partially expand in the pattern before calling the helper function, mainly
> to avoid a bunch of "if (...)" at the beginning of the helper
> function.
>

I know what you mean. Then I have to change the pattern being detected for sse3_addsubv4sf3, so that it can handle VEC_ADDSUB_EXPR for SSE3. Currently I am considering using Richard's method without creating new tree nodes and optabs, based on pattern matching. I will handle SSE2 and SSE3 separately by define_expand and define_insn. The current problem is that the pattern may contain more than four instructions, which cannot be processed by the combine pass. I am considering how to reduce the number of instructions in the pattern to four.

Thank you very much!

Cong

> Uros.
Re: [PATCH] Support addsub/subadd as non-isomorphic operations for SLP vectorizer.
On Tue, Nov 19, 2013 at 1:45 AM, Richard Biener wrote:
>
> On Mon, 18 Nov 2013, Cong Hou wrote:
>
> > I tried your method and it works well for doubles. But for float,
> > there is an issue. For the following gimple code:
> >
> >    c1 = a - b;
> >    c2 = a + b;
> >    c = VEC_PERM
> >
> > It needs two instructions to implement the VEC_PERM operation in
> > SSE2-4, one of which should be shufps, represented by
> > the following pattern in RTL:
> >
> > (define_insn "sse_shufps_"
> >   [(set (match_operand:VI4F_128 0 "register_operand" "=x,x")
> >         (vec_select:VI4F_128
> >           (vec_concat:
> >             (match_operand:VI4F_128 1 "register_operand" "0,x")
> >             (match_operand:VI4F_128 2 "nonimmediate_operand" "xm,xm"))
> >           (parallel [(match_operand 3 "const_0_to_3_operand")
> >                      (match_operand 4 "const_0_to_3_operand")
> >                      (match_operand 5 "const_4_to_7_operand")
> >                      (match_operand 6 "const_4_to_7_operand")])))]
> >   ...)
> >
> > Note that it contains two RTL instructions.
>
> It's a single instruction as far as combine is concerned (RTL
> instructions have arbitrary complexity).

Even if it is one instruction, we will end up with four RTL statements, which still cannot be combined, as there are restrictions on combining four instructions (loads of constants or binary operations involving a constant). Note that vec_select instead of vec_merge is used here because currently vec_merge is emitted only if SSE4 is enabled (thus blend instructions can be used; if you look at ix86_expand_vec_perm_const_1() in i386.c, you can find that vec_merge is generated in expand_vec_perm_1() with SSE4). Without SSE4 support, in most cases a vec_merge statement cannot be translated into one SSE instruction.

>
> > Together with minus, plus,
> > and one more shuffling instruction, we have at least five instructions
> > for the addsub pattern. I think during the combine pass, only four
> > instructions are considered to be combined, right? So unless we
> > compress those five instructions into four or fewer, we could not use
> > this method for float values.
>
> At the moment addsubv4sf looks like
>
> (define_insn "sse3_addsubv4sf3"
>   [(set (match_operand:V4SF 0 "register_operand" "=x,x")
> 	(vec_merge:V4SF
> 	  (plus:V4SF
> 	    (match_operand:V4SF 1 "register_operand" "0,x")
> 	    (match_operand:V4SF 2 "nonimmediate_operand" "xm,xm"))
> 	  (minus:V4SF (match_dup 1) (match_dup 2))
> 	  (const_int 10)))]
>
> to match this it's best to have the VEC_SHUFFLE retained as
> vec_merge and thus support arbitrary(?) vec_merge for the aid
> of combining until reload(?) after which we can split it.
>

You mean VEC_PERM (this is generated in gimple by your patch)? Note, as I mentioned above, that without SSE4 it is difficult to translate VEC_PERM into vec_merge. Even if we can do it, we still need to define a split to convert one vec_merge into two or more other statements later. ADDSUB instructions are provided by SSE3, and I think we should not rely on SSE4 to perform this transformation, right?

To sum up, if we use vec_select instead of vec_merge, we may have four RTL statements for float types, in which case they cannot be combined. If we use vec_merge, we need to define the split for it without SSE4 support, and we also need to change the behavior of ix86_expand_vec_perm_const_1().

> > What do you think?
>
> Besides addsub, are there other instructions that can be expressed
> similarly? Thus, how far should the combiner pattern go?
>

I think your method is quite flexible. Besides blending add/sub, we could blend other combinations of two operations, and even one operation and a no-op. For example, consider vectorizing the complex conjugate operation:

for (int i = 0; i < N; i += 2)
  {
    a[i] = b[i];
    a[i+1] = -b[i+1];
  }

This loop is better vectorized by hybrid SLP. The second statement has a unary minus operation but there is no operation in the first one. We can improve our SLP grouping algorithm to let GCC SLP vectorize it.

thanks,
Cong

> Richard.
>
> >
> > thanks,
> > Cong
> >
> >
> > On Fri, Nov 15, 2013 at 12:53 AM, Richard Biener wrote:
> > > On Thu, 14 Nov 2013, Cong Hou wrote:
> > >
> > >> Hi
> > >>
> > >> This patch adds support for two non-isomorphic operations, addsub
> > >> and subadd, to the SLP vectorizer. More non-isomorphic operations can be
> > >> add
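The conjugate case is in fact cheaper than addsub: blending a no-op lane with a negated lane reduces to a single XOR with a sign-bit mask. A hedged intrinsics sketch of the vectorized form (illustrative only, not from any posted patch):

#include <xmmintrin.h>

/* a[i] = b[i], a[i+1] = -b[i+1]: negate only the odd lanes.
   One XOR per vector; no add, sub, or shuffle needed.
   Assumes n is a multiple of 4 for brevity.  */
void
conjugate_sse (const float *b, float *a, int n)
{
  const __m128 negate_odd = _mm_setr_ps (0.0f, -0.0f, 0.0f, -0.0f);
  for (int i = 0; i < n; i += 4)
    _mm_storeu_ps (a + i, _mm_xor_ps (_mm_loadu_ps (b + i), negate_odd));
}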
Re: [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.
Ping... thanks, Cong On Fri, Nov 15, 2013 at 9:52 AM, Cong Hou wrote: > Any more comments? > > > > thanks, > Cong > > > On Wed, Nov 13, 2013 at 6:06 PM, Cong Hou wrote: >> Ping? >> >> >> thanks, >> Cong >> >> >> On Mon, Nov 11, 2013 at 11:25 AM, Cong Hou wrote: >>> Hi James >>> >>> Sorry for the late reply. >>> >>> >>> On Fri, Nov 8, 2013 at 2:55 AM, James Greenhalgh >>> wrote: >>>>> On Tue, Nov 5, 2013 at 9:58 AM, Cong Hou wrote: >>>>> > Thank you for your detailed explanation. >>>>> > >>>>> > Once GCC detects a reduction operation, it will automatically >>>>> > accumulate all elements in the vector after the loop. In the loop the >>>>> > reduction variable is always a vector whose elements are reductions of >>>>> > corresponding values from other vectors. Therefore in your case the >>>>> > only instruction you need to generate is: >>>>> > >>>>> > VABAL ops[3], ops[1], ops[2] >>>>> > >>>>> > It is OK if you accumulate the elements into one in the vector inside >>>>> > of the loop (if one instruction can do this), but you have to make >>>>> > sure other elements in the vector should remain zero so that the final >>>>> > result is correct. >>>>> > >>>>> > If you are confused about the documentation, check the one for >>>>> > udot_prod (just above usad in md.texi), as it has very similar >>>>> > behavior as usad. Actually I copied the text from there and did some >>>>> > changes. As those two instruction patterns are both for vectorization, >>>>> > their behavior should not be difficult to explain. >>>>> > >>>>> > If you have more questions or think that the documentation is still >>>>> > improper please let me know. >>>> >>>> Hi Cong, >>>> >>>> Thanks for your reply. >>>> >>>> I've looked at Dorit's original patch adding WIDEN_SUM_EXPR and >>>> DOT_PROD_EXPR and I see that the same ambiguity exists for >>>> DOT_PROD_EXPR. Can you please add a note in your tree.def >>>> that SAD_EXPR, like DOT_PROD_EXPR can be expanded as either: >>>> >>>> tmp = WIDEN_MINUS_EXPR (arg1, arg2) >>>> tmp2 = ABS_EXPR (tmp) >>>> arg3 = PLUS_EXPR (tmp2, arg3) >>>> >>>> or: >>>> >>>> tmp = WIDEN_MINUS_EXPR (arg1, arg2) >>>> tmp2 = ABS_EXPR (tmp) >>>> arg3 = WIDEN_SUM_EXPR (tmp2, arg3) >>>> >>>> Where WIDEN_MINUS_EXPR is a signed MINUS_EXPR, returning a >>>> a value of the same (widened) type as arg3. >>>> >>> >>> >>> I have added it, although we currently don't have WIDEN_MINUS_EXPR (I >>> mentioned it in tree.def). >>> >>> >>>> Also, while looking for the history of DOT_PROD_EXPR I spotted this >>>> patch: >>>> >>>> [autovect] [patch] detect mult-hi and sad patterns >>>> http://gcc.gnu.org/ml/gcc-patches/2005-10/msg01394.html >>>> >>>> I wonder what the reason was for that patch to be dropped? >>>> >>> >>> It has been 8 years.. I have no idea why this patch is not accepted >>> finally. There is even no reply in that thread. But I believe the SAD >>> pattern is very important to be recognized. ARM also provides >>> instructions for it. >>> >>> >>> Thank you for your comment again! >>> >>> >>> thanks, >>> Cong >>> >>> >>> >>>> Thanks, >>>> James >>>>
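For context, the kind of scalar reduction the SAD pattern recognizes looks like the following typical motion-estimation kernel; this sketch is illustrative and not taken from the patch:

/* Sum of absolute differences over two blocks of unsigned bytes.
   With the SAD_EXPR pattern the vectorizer can map this reduction to
   a dedicated instruction, e.g. PSADBW on x86.  */
int
sad (const unsigned char *p1, const unsigned char *p2, int n)
{
  int sum = 0;
  for (int i = 0; i < n; i++)
    {
      int diff = p1[i] - p2[i];
      sum += diff >= 0 ? diff : -diff;
    }
  return sum;
}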
Re: [PATCH] Support addsub/subadd as non-isomorphic operations for SLP vectorizer.
On Thu, Nov 21, 2013 at 4:39 PM, Marc Glisse wrote:
> On Thu, 21 Nov 2013, Cong Hou wrote:
>
>> While I added the new define_insn_and_split for vec_merge, a bug was
>> exposed: in config/i386/sse.md, [ define_expand "xop_vmfrcz2" ]
>> only takes one input, but the corresponding builtin functions have two
>> inputs, which are shown in i386.c:
>>
>> { OPTION_MASK_ISA_XOP, CODE_FOR_xop_vmfrczv4sf2,
>>   "__builtin_ia32_vfrczss", IX86_BUILTIN_VFRCZSS, UNKNOWN,
>>   (int)MULTI_ARG_2_SF },
>> { OPTION_MASK_ISA_XOP, CODE_FOR_xop_vmfrczv2df2,
>>   "__builtin_ia32_vfrczsd", IX86_BUILTIN_VFRCZSD, UNKNOWN,
>>   (int)MULTI_ARG_2_DF },
>>
>> In consequence, the ix86_expand_multi_arg_builtin() function tries to
>> check two args, but based on the define_expand of xop_vmfrcz2,
>> the content of insn_data[CODE_FOR_xop_vmfrczv4sf2].operand[2] may be
>> incorrect (because it only needs one input).
>>
>> The patch below fixes this issue.
>>
>> Bootstrapped and tested on an x86-64 machine. Note that this patch
>> should be applied before the one I sent earlier (sorry for sending
>> them in the wrong order).
>
> This is PR 56788. Your patch seems strange to me and I don't think it
> fixes the real issue, but I'll let more knowledgeable people answer.

Thank you for pointing out the bug report. This patch is not intended to fix PR56788. For your function:

#include <x86intrin.h>
__m128d f(__m128d x, __m128d y){
  return _mm_frcz_sd(x,y);
}

Note that the second parameter is ignored intentionally, but the prototype of this function contains two parameters. My fix explicitly tells GCC that the optab xop_vmfrczv4sf3 should have three operands instead of two, to let it have the correct information in insn_data[CODE_FOR_xop_vmfrczv4sf3].operand[2], which is used to match the type of the second parameter of the builtin function in ix86_expand_multi_arg_builtin().

thanks,
Cong

>
> --
> Marc Glisse
Re: [PATCH] Support addsub/subadd as non-isomorphic operations for SLP vectorizer.
On Fri, Nov 22, 2013 at 1:32 AM, Uros Bizjak wrote: > Hello! > >> In consequence, the ix86_expand_multi_arg_builtin() function tries to >> check two args but based on the define_expand of xop_vmfrcz2, >> the content of insn_data[CODE_FOR_xop_vmfrczv4sf2].operand[2] may be >> incorrect (because it only needs one input). > > ;; scalar insns > -(define_expand "xop_vmfrcz2" > +(define_expand "xop_vmfrcz3" >[(set (match_operand:VF_128 0 "register_operand") > (vec_merge:VF_128 > (unspec:VF_128 >[(match_operand:VF_128 1 "nonimmediate_operand")] >UNSPEC_FRCZ) > - (match_dup 3) > + (match_operand:VF_128 2 "register_operand") > (const_int 1)))] >"TARGET_XOP" > { > - operands[3] = CONST0_RTX (mode); > + operands[2] = CONST0_RTX (mode); > }) > > No, just use (match_dup 2) in the RTX in addition to operands[2] > change. Do not rename patterns. If I use match_dup 2, GCC still thinks this optab has one input argument instead of two, which won't fix the current issue. Marc suggested we should remove the second argument. This also works. Thank you! Cong > > Uros.
Re: [PATCH] Support addsub/subadd as non-isomorphic operations for SLP vectorizer.
On Fri, Nov 22, 2013 at 3:57 AM, Marc Glisse wrote:
> On Thu, 21 Nov 2013, Cong Hou wrote:
>
>> On Thu, Nov 21, 2013 at 4:39 PM, Marc Glisse wrote:
>>>
>>> On Thu, 21 Nov 2013, Cong Hou wrote:
>>>
>>>> While I added the new define_insn_and_split for vec_merge, a bug was
>>>> exposed: in config/i386/sse.md, [ define_expand "xop_vmfrcz2" ]
>>>> only takes one input, but the corresponding builtin functions have two
>>>> inputs, which are shown in i386.c:
>>>>
>>>> { OPTION_MASK_ISA_XOP, CODE_FOR_xop_vmfrczv4sf2,
>>>>   "__builtin_ia32_vfrczss", IX86_BUILTIN_VFRCZSS, UNKNOWN,
>>>>   (int)MULTI_ARG_2_SF },
>>>> { OPTION_MASK_ISA_XOP, CODE_FOR_xop_vmfrczv2df2,
>>>>   "__builtin_ia32_vfrczsd", IX86_BUILTIN_VFRCZSD, UNKNOWN,
>>>>   (int)MULTI_ARG_2_DF },
>>>>
>>>> In consequence, the ix86_expand_multi_arg_builtin() function tries to
>>>> check two args, but based on the define_expand of xop_vmfrcz2,
>>>> the content of insn_data[CODE_FOR_xop_vmfrczv4sf2].operand[2] may be
>>>> incorrect (because it only needs one input).
>>>>
>>>> The patch below fixes this issue.
>>>>
>>>> Bootstrapped and tested on an x86-64 machine. Note that this patch
>>>> should be applied before the one I sent earlier (sorry for sending
>>>> them in the wrong order).
>>>
>>> This is PR 56788. Your patch seems strange to me and I don't think it
>>> fixes the real issue, but I'll let more knowledgeable people answer.
>>
>> Thank you for pointing out the bug report. This patch is not intended
>> to fix PR56788.
>
> IMHO, if PR56788 was fixed, you wouldn't have this issue, and if PR56788
> doesn't get fixed, I'll post a patch to remove _mm_frcz_sd and the
> associated builtin, which would solve your issue as well.

I agree. Then I will wait until your patch is merged into the trunk; otherwise my patch could not pass the tests.

>
>> For your function:
>>
>> #include <x86intrin.h>
>> __m128d f(__m128d x, __m128d y){
>>   return _mm_frcz_sd(x,y);
>> }
>>
>> Note that the second parameter is ignored intentionally, but the
>> prototype of this function contains two parameters. My fix explicitly
>> tells GCC that the optab xop_vmfrczv4sf3 should have
>> three operands instead of two, to let it have the correct information
>> in insn_data[CODE_FOR_xop_vmfrczv4sf3].operand[2], which is used to
>> match the type of the second parameter of the builtin function in
>> ix86_expand_multi_arg_builtin().
>
> I disagree that this is intentional; it is a bug. AFAIK there is no AMD
> documentation that could be used as a reference for what _mm_frcz_sd is
> supposed to do. The only existing documentation is by Microsoft (which
> does *not* ignore the second argument) and by LLVM (which has a single
> argument). Whatever we choose for _mm_frcz_sd, the builtin should take a
> single argument, and if necessary we'll use 2 builtins to implement
> _mm_frcz_sd.
>

I also only found the one by Microsoft. If the second argument is ignored, we could just remove it, as long as there is no "standard" that requires two arguments. Hopefully it won't break current projects using _mm_frcz_sd.

Thank you for your comments!

Cong

> --
> Marc Glisse
[PATCH] Fixing PR59006 and PR58921 by delaying loop invariant hoisting in vectorizer.
Hi

Currently in GCC vectorization, some loop invariants may be detected after the aliasing checks, and these can then be hoisted outside of the loop. The current method in GCC may break the information built during the analysis phase, causing crashes (see PR59006 and PR58921). This patch improves the loop invariant hoisting by delaying it until all statements are vectorized, thereby keeping all built information. But those loop invariant statements won't be vectorized, and if a variable is defined by one of those loop invariants, it is treated as an external definition.

Bootstrapped and tested on an x86-64 machine.

thanks,
Cong

diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 2c0554b..0614bab 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,18 @@
+2013-11-22  Cong Hou
+
+	PR tree-optimization/58921
+	PR tree-optimization/59006
+	* tree-vectorizer.h (struct _stmt_vec_info): New data member
+	loop_invariant.
+	* tree-vect-loop-manip.c (vect_loop_versioning): Delay hoisting loop
+	invariants until all statements are vectorized.
+	* tree-vect-loop.c (vect_hoist_loop_invariants): New function.
+	(vect_transform_loop): Hoist loop invariants after all statements
+	are vectorized. Do not vectorize loop invariant stmts.
+	* tree-vect-stmts.c (vect_get_vec_def_for_operand): Treat a loop
+	invariant as an external definition.
+	(new_stmt_vec_info): Initialize new data member.
+
 2013-11-12  Jeff Law
 
 	* tree-ssa-threadedge.c (thread_around_empty_blocks): New

diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
index 09c7f20..447625b 100644
--- a/gcc/testsuite/ChangeLog
+++ b/gcc/testsuite/ChangeLog
@@ -1,3 +1,10 @@
+2013-11-22  Cong Hou
+
+	PR tree-optimization/58921
+	PR tree-optimization/59006
+	* gcc.dg/vect/pr58921.c: New test.
+	* gcc.dg/vect/pr59006.c: New test.
+
 2013-11-12  Balaji V. Iyer
 
 	* gcc.dg/cilk-plus/cilk-plus.exp: Added a check for LTO before running

diff --git a/gcc/testsuite/gcc.dg/vect/pr58921.c b/gcc/testsuite/gcc.dg/vect/pr58921.c
new file mode 100644
index 000..ee3694a
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr58921.c
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target vect_int } */
+
+int a[7];
+int b;
+
+void
+fn1 ()
+{
+  for (; b; b++)
+    a[b] = ((a[b] <= 0) == (a[0] != 0));
+}
+
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
+/* { dg-final { cleanup-tree-dump "vect" } } */

diff --git a/gcc/testsuite/gcc.dg/vect/pr59006.c b/gcc/testsuite/gcc.dg/vect/pr59006.c
new file mode 100644
index 000..95d90a9
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr59006.c
@@ -0,0 +1,24 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target vect_int } */
+
+int a[8], b;
+
+void fn1 (void)
+{
+  int c;
+  for (; b; b++)
+    {
+      int d = a[b];
+      c = a[0] ? d : 0;
+      a[b] = c;
+    }
+}
+
+void fn2 ()
+{
+  for (; b <= 0; b++)
+    a[b] = a[0] || b;
+}
+
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 2 "vect" } } */
+/* { dg-final { cleanup-tree-dump "vect" } } */

diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
index 15227856..3adc73d 100644
--- a/gcc/tree-vect-loop-manip.c
+++ b/gcc/tree-vect-loop-manip.c
@@ -2448,8 +2448,12 @@ vect_loop_versioning (loop_vec_info loop_vinfo,
       FOR_EACH_SSA_TREE_OPERAND (var, stmt, iter, SSA_OP_USE)
 	{
 	  gimple def = SSA_NAME_DEF_STMT (var);
+	  stmt_vec_info def_stmt_info;
+
 	  if (!gimple_nop_p (def)
-	      && flow_bb_inside_loop_p (loop, gimple_bb (def)))
+	      && flow_bb_inside_loop_p (loop, gimple_bb (def))
+	      && !((def_stmt_info = vinfo_for_stmt (def))
+		   && STMT_VINFO_LOOP_INVARIANT_P (def_stmt_info)))
 	    {
 	      hoist = false;
 	      break;
@@ -2458,21 +2462,8 @@
 
 	  if (hoist)
 	    {
-	      if (dr)
-		gimple_set_vuse (stmt, NULL);
-
-	      gsi_remove (&si, false);
-	      gsi_insert_on_edge_immediate (loop_preheader_edge (loop),
-					    stmt);
-
-	      if (dump_enabled_p ())
-		{
-		  dump_printf_loc
-		    (MSG_NOTE, vect_location,
-		     "hoisting out of the vectorized loop: ");
-		  dump_gimple_stmt (MSG_NOTE, TDF_SLIM, stmt, 0);
-		  dump_printf (MSG_NOTE, "\n");
-		}
+	      STMT_VINFO_LOOP_INVARIANT_P (stmt_info) = true;
+	      gsi_next (&si);
 	      continue;
 	    }
 	}
@@ -2481,6 +2472,7 @@
 	}
     }
+
   /* End loop-exit-fixes after versioning. */
 
   if (cond_expr_stmt_list)

diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 292e771..148f9f1 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -5572,6 +5572,49 @@ vect_loop_kill_debug_uses (struct loop *loop, gimple stmt)
     }
 }
 
+/* Find all loop invariants detected after alias checks, and hoist them
+   before the loop preheader.  */
+
+static void
+vect_hoist_loop_invariants (loop_vec_info loop_vinfo)
+{
+  struct
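To picture the scenario at the source level: with pointer arguments, an expression like c[0] is provably loop-invariant only in the loop version guarded by the runtime alias check, which is why the invariance is discovered late and why moving the statement carelessly can invalidate the analysis-phase stmt_vec_info. A hedged sketch (illustrative, not from the patch):

/* Illustrative only: c[0] may alias a[], so it can be treated as a
   loop invariant only inside the loop version where the runtime alias
   check (generated by vect_loop_versioning) has passed.  There the
   load can conceptually be hoisted as t = c[0] into the preheader.  */
void
fn (int *a, const int *c, int n)
{
  for (int i = 0; i < n; i++)
    a[i] = c[0] + i;
}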
Re: [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.
It has been 8 months since this patch was posted. I have addressed all comments on it. The SAD pattern is very useful for multimedia code such as ffmpeg, and this patch will greatly improve the performance of such code. Could you please have a look again and check whether it is OK for the trunk? If necessary, I can re-post this patch in a new thread.

Thank you!

Cong

On Tue, Dec 17, 2013 at 10:04 AM, Cong Hou wrote:
>
> Ping?
>
>
> thanks,
> Cong
>
>
> On Mon, Dec 2, 2013 at 5:06 PM, Cong Hou wrote:
> > Hi Richard
> >
> > Could you please take a look at this patch and see if it is ready for
> > the trunk? The patch is pasted as a text file here again.
> >
> > Thank you very much!
> >
> >
> > Cong
> >
> >
> > On Mon, Nov 11, 2013 at 11:25 AM, Cong Hou wrote:
> >> Hi James
> >>
> >> Sorry for the late reply.
> >>
> >>
> >> On Fri, Nov 8, 2013 at 2:55 AM, James Greenhalgh
> >> wrote:
> >>>> On Tue, Nov 5, 2013 at 9:58 AM, Cong Hou wrote:
> >>>> > Thank you for your detailed explanation.
> >>>> >
> >>>> > Once GCC detects a reduction operation, it will automatically
> >>>> > accumulate all elements in the vector after the loop. In the loop the
> >>>> > reduction variable is always a vector whose elements are reductions of
> >>>> > corresponding values from other vectors. Therefore in your case the
> >>>> > only instruction you need to generate is:
> >>>> >
> >>>> > VABAL ops[3], ops[1], ops[2]
> >>>> >
> >>>> > It is OK if you accumulate the elements into one in the vector inside
> >>>> > of the loop (if one instruction can do this), but you have to make
> >>>> > sure other elements in the vector should remain zero so that the final
> >>>> > result is correct.
> >>>> >
> >>>> > If you are confused about the documentation, check the one for
> >>>> > udot_prod (just above usad in md.texi), as it has very similar
> >>>> > behavior to usad. Actually I copied the text from there and did some
> >>>> > changes. As those two instruction patterns are both for vectorization,
> >>>> > their behavior should not be difficult to explain.
> >>>> >
> >>>> > If you have more questions or think that the documentation is still
> >>>> > improper please let me know.
> >>>
> >>> Hi Cong,
> >>>
> >>> Thanks for your reply.
> >>>
> >>> I've looked at Dorit's original patch adding WIDEN_SUM_EXPR and
> >>> DOT_PROD_EXPR and I see that the same ambiguity exists for
> >>> DOT_PROD_EXPR. Can you please add a note in your tree.def
> >>> that SAD_EXPR, like DOT_PROD_EXPR, can be expanded as either:
> >>>
> >>>   tmp = WIDEN_MINUS_EXPR (arg1, arg2)
> >>>   tmp2 = ABS_EXPR (tmp)
> >>>   arg3 = PLUS_EXPR (tmp2, arg3)
> >>>
> >>> or:
> >>>
> >>>   tmp = WIDEN_MINUS_EXPR (arg1, arg2)
> >>>   tmp2 = ABS_EXPR (tmp)
> >>>   arg3 = WIDEN_SUM_EXPR (tmp2, arg3)
> >>>
> >>> Where WIDEN_MINUS_EXPR is a signed MINUS_EXPR, returning a
> >>> value of the same (widened) type as arg3.
> >>>
> >>
> >>
> >> I have added it, although we currently don't have WIDEN_MINUS_EXPR (I
> >> mentioned it in tree.def).
> >>
> >>
> >>> Also, while looking for the history of DOT_PROD_EXPR I spotted this
> >>> patch:
> >>>
> >>> [autovect] [patch] detect mult-hi and sad patterns
> >>> http://gcc.gnu.org/ml/gcc-patches/2005-10/msg01394.html
> >>>
> >>> I wonder what the reason was for that patch to be dropped?
> >>>
> >>
> >> It has been 8 years... I have no idea why that patch was never accepted;
> >> there was not even a reply in that thread. But I believe the SAD
> >> pattern is very important to be recognized. ARM also provides
> >> instructions for it.
> >>
> >>
> >> Thank you for your comment again!
> >>
> >>
> >> thanks,
> >> Cong
> >>
> >>
> >>
> >>> Thanks,
> >>> James
> >>>
Re: [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.
OK. Thank you very much for your review, Richard! thanks, Cong On Tue, Jun 24, 2014 at 4:19 AM, Richard Biener wrote: > On Tue, Dec 3, 2013 at 2:06 AM, Cong Hou wrote: >> Hi Richard >> >> Could you please take a look at this patch and see if it is ready for >> the trunk? The patch is pasted as a text file here again. > > (found it) > > The patch is ok for trunk. (please consider re-testing before you commit) > > Thanks, > Richard. > >> Thank you very much! >> >> >> Cong >> >> >> On Mon, Nov 11, 2013 at 11:25 AM, Cong Hou wrote: >>> Hi James >>> >>> Sorry for the late reply. >>> >>> >>> On Fri, Nov 8, 2013 at 2:55 AM, James Greenhalgh >>> wrote: >>>>> On Tue, Nov 5, 2013 at 9:58 AM, Cong Hou wrote: >>>>> > Thank you for your detailed explanation. >>>>> > >>>>> > Once GCC detects a reduction operation, it will automatically >>>>> > accumulate all elements in the vector after the loop. In the loop the >>>>> > reduction variable is always a vector whose elements are reductions of >>>>> > corresponding values from other vectors. Therefore in your case the >>>>> > only instruction you need to generate is: >>>>> > >>>>> > VABAL ops[3], ops[1], ops[2] >>>>> > >>>>> > It is OK if you accumulate the elements into one in the vector inside >>>>> > of the loop (if one instruction can do this), but you have to make >>>>> > sure other elements in the vector should remain zero so that the final >>>>> > result is correct. >>>>> > >>>>> > If you are confused about the documentation, check the one for >>>>> > udot_prod (just above usad in md.texi), as it has very similar >>>>> > behavior as usad. Actually I copied the text from there and did some >>>>> > changes. As those two instruction patterns are both for vectorization, >>>>> > their behavior should not be difficult to explain. >>>>> > >>>>> > If you have more questions or think that the documentation is still >>>>> > improper please let me know. >>>> >>>> Hi Cong, >>>> >>>> Thanks for your reply. >>>> >>>> I've looked at Dorit's original patch adding WIDEN_SUM_EXPR and >>>> DOT_PROD_EXPR and I see that the same ambiguity exists for >>>> DOT_PROD_EXPR. Can you please add a note in your tree.def >>>> that SAD_EXPR, like DOT_PROD_EXPR can be expanded as either: >>>> >>>> tmp = WIDEN_MINUS_EXPR (arg1, arg2) >>>> tmp2 = ABS_EXPR (tmp) >>>> arg3 = PLUS_EXPR (tmp2, arg3) >>>> >>>> or: >>>> >>>> tmp = WIDEN_MINUS_EXPR (arg1, arg2) >>>> tmp2 = ABS_EXPR (tmp) >>>> arg3 = WIDEN_SUM_EXPR (tmp2, arg3) >>>> >>>> Where WIDEN_MINUS_EXPR is a signed MINUS_EXPR, returning a >>>> a value of the same (widened) type as arg3. >>>> >>> >>> >>> I have added it, although we currently don't have WIDEN_MINUS_EXPR (I >>> mentioned it in tree.def). >>> >>> >>>> Also, while looking for the history of DOT_PROD_EXPR I spotted this >>>> patch: >>>> >>>> [autovect] [patch] detect mult-hi and sad patterns >>>> http://gcc.gnu.org/ml/gcc-patches/2005-10/msg01394.html >>>> >>>> I wonder what the reason was for that patch to be dropped? >>>> >>> >>> It has been 8 years.. I have no idea why this patch is not accepted >>> finally. There is even no reply in that thread. But I believe the SAD >>> pattern is very important to be recognized. ARM also provides >>> instructions for it. >>> >>> >>> Thank you for your comment again! >>> >>> >>> thanks, >>> Cong >>> >>> >>> >>>> Thanks, >>>> James >>>>