On Tue, 20 Mar 2018, Rainer Orth wrote:

> Hi Tom,
> 
> > On 03/19/2018 10:11 AM, Richard Biener wrote:
> >> On Fri, 16 Mar 2018, Tom de Vries wrote:
> >>
> >>> On 03/16/2018 12:55 PM, Richard Biener wrote:
> >>>> On Fri, 16 Mar 2018, Tom de Vries wrote:
> >>>>
> >>>>> On 02/27/2018 01:42 PM, Richard Biener wrote:
> >>>>>> Index: gcc/testsuite/gcc.dg/tree-ssa/pr84512.c
> >>>>>> ===================================================================
> >>>>>> --- gcc/testsuite/gcc.dg/tree-ssa/pr84512.c    (nonexistent)
> >>>>>> +++ gcc/testsuite/gcc.dg/tree-ssa/pr84512.c    (working copy)
> >>>>>> @@ -0,0 +1,15 @@
> >>>>>> +/* { dg-do compile } */
> >>>>>> +/* { dg-options "-O3 -fdump-tree-optimized" } */
> >>>>>> +
> >>>>>> +int foo()
> >>>>>> +{
> >>>>>> +  int a[10];
> >>>>>> +  for(int i = 0; i < 10; ++i)
> >>>>>> +    a[i] = i*i;
> >>>>>> +  int res = 0;
> >>>>>> +  for(int i = 0; i < 10; ++i)
> >>>>>> +    res += a[i];
> >>>>>> +  return res;
> >>>>>> +}
> >>>>>> +
> >>>>>> +/* { dg-final { scan-tree-dump "return 285;" "optimized" } } */
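> >>>>>>
> >>>>>> (For reference, the expected constant is just the sum of squares
> >>>>>> 0 + 1 + 4 + 9 + 16 + 25 + 36 + 49 + 64 + 81 = 285.)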
> >>>>>
> >>>>> This fails for nvptx, because it doesn't have the required vector
> >>>>> operations. To fix the FAIL, I've made the test require the
> >>>>> vect_int_mult effective target.
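> >>>>>
> >>>>> That is, along these lines (a sketch; this is the stock DejaGnu
> >>>>> directive for such requirements):
> >>>>> ...
> >>>>> /* { dg-require-effective-target vect_int_mult } */
> >>>>> ...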
> >>>>
> >>>> On targets that do not vectorize you should see the scalar loops unrolled
> >>>> instead.  Or do you have only one loop vectorized?
> >>>
> >>> Sort of. Loop vectorization has no effect, and the scalar loops are
> >>> completely unrolled. But then SLP vectorization vectorizes the stores.
> >>>
> >>> So at optimized we have:
> >>> ...
> >>>    MEM[(int *)&a] = { 0, 1 };
> >>>    MEM[(int *)&a + 8B] = { 4, 9 };
> >>>    MEM[(int *)&a + 16B] = { 16, 25 };
> >>>    MEM[(int *)&a + 24B] = { 36, 49 };
> >>>    MEM[(int *)&a + 32B] = { 64, 81 };
> >>>    _6 = a[0];
> >>>    _28 = a[1];
> >>>    res_29 = _6 + _28;
> >>>    _35 = a[2];
> >>>    res_36 = res_29 + _35;
> >>>    _42 = a[3];
> >>>    res_43 = res_36 + _42;
> >>>    _49 = a[4];
> >>>    res_50 = res_43 + _49;
> >>>    _56 = a[5];
> >>>    res_57 = res_50 + _56;
> >>>    _63 = a[6];
> >>>    res_64 = res_57 + _63;
> >>>    _70 = a[7];
> >>>    res_71 = res_64 + _70;
> >>>    _77 = a[8];
> >>>    res_78 = res_71 + _77;
> >>>    _2 = a[9];
> >>>    res_11 = _2 + res_78;
> >>>    a ={v} {CLOBBER};
> >>>    return res_11;
> >>> ...
> >>>
> >>> The stores and loads are eliminated by dse1 in the RTL phase, and in
> >>> the end we have:
> >>> ...
> >>> .visible .func (.param.u32 %value_out) foo
> >>> {
> >>>          .reg.u32 %value;
> >>>          .local .align 16 .b8 %frame_ar[48];
> >>>          .reg.u64 %frame;
> >>>          cvta.local.u64 %frame, %frame_ar;
> >>>          mov.u32 %value, 285;
> >>>          st.param.u32    [%value_out], %value;
> >>>          ret;
> >>> }
> >>> ...
> >>>
> >>>> That's precisely
> >>>> what the PR was about...  which means it isn't fixed for nvptx :/
> >>>
> >>> Indeed the assembly is not optimal; it would be, if we had optimal
> >>> code in the optimized dump.
> >>>
> >>> FWIW, using this patch we generate optimal code at optimized:
> >>> ...
> >>> diff --git a/gcc/passes.def b/gcc/passes.def
> >>> index 3ebcfc30349..6b64f600c4a 100644
> >>> --- a/gcc/passes.def
> >>> +++ b/gcc/passes.def
> >>> @@ -325,6 +325,7 @@ along with GCC; see the file COPYING3.  If not see
> >>>         NEXT_PASS (pass_tracer);
> >>>         NEXT_PASS (pass_thread_jumps);
> >>>         NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
> >>> +      NEXT_PASS (pass_fre);
> >>>         NEXT_PASS (pass_strlen);
> >>>         NEXT_PASS (pass_thread_jumps);
> >>>         NEXT_PASS (pass_vrp, false /* warn_array_bounds_p */);
> >>> ...
> >>>
> >>> and we get:
> >>> ...
> >>> .visible .func (.param.u32 %value_out) foo
> >>> {
> >>>          .reg.u32 %value;
> >>>          mov.u32 %value, 285;
> >>>          st.param.u32    [%value_out], %value;
> >>>          ret;
> >>> }
> >>> ...
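> >>>
> >>> and at the tree level the optimized dump folds to the constant the
> >>> test scans for (illustrative):
> >>> ...
> >>>   return 285;
> >>> ...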
> >>>
> >>> I could file a missed-optimization PR for nvptx, but I'm not sure
> >>> where this should be fixed.
> >>
> >> Ah, yeah... the usual issue then.
> >>
> >> Can you please XFAIL the test on nvptx instead of requiring vect_int_mult?
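> >> Something like (a sketch, untested):
> >>
> >> /* { dg-final { scan-tree-dump "return 285;" "optimized" { xfail nvptx*-*-* } } } */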
> >>
> >
> > Done.
> >
> > Committed as attached.
> 
> this caused the test to FAIL on 64-bit (only) sparc-sun-solaris2.11:
> 
> FAIL: gcc.dg/tree-ssa/pr84512.c scan-tree-dump optimized "return 285;"
> 
> where it was UNSUPPORTED before.

So it failed before Tom's original patch.  Please add sparc-solaris
to the list of XFAILed targets.
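
For instance (a sketch only; the exact selector may need adjusting, and
since the FAIL is 64-bit only, an lp64 conjunction might be more
precise):

/* { dg-final { scan-tree-dump "return 285;" "optimized" { xfail nvptx*-*-* sparc*-*-solaris2* } } } */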

> The dump has
> 
> ;; Function foo (foo, funcdef_no=0, decl_uid=1557, cgraph_uid=0, symbol_order=0)
> 
> foo ()
> {
>   int res;
>   int a[10];
>   int _2;
>   int _6;
>   int _28;
>   int _35;
>   int _42;
>   int _49;
>   int _56;
>   int _63;
>   int _70;
>   int _77;
> 
>   <bb 2> [local count: 97603132]:
>   MEM[(int *)&a] = { 0, 1 };
>   MEM[(int *)&a + 8B] = { 4, 9 };
>   MEM[(int *)&a + 16B] = { 16, 25 };
>   MEM[(int *)&a + 24B] = { 36, 49 };
>   MEM[(int *)&a + 32B] = { 64, 81 };
>   _6 = a[0];
>   _28 = a[1];
>   res_29 = _6 + _28;
>   _35 = a[2];
>   res_36 = res_29 + _35;
>   _42 = a[3];
>   res_43 = res_36 + _42;
>   _49 = a[4];
>   res_50 = res_43 + _49;
>   _56 = a[5];
>   res_57 = res_50 + _56;
>   _63 = a[6];
>   res_64 = res_57 + _63;
>   _70 = a[7];
>   res_71 = res_64 + _70;
>   _77 = a[8];
>   res_78 = res_71 + _77;
>   _2 = a[9];
>   res_11 = _2 + res_78;
>   a ={v} {CLOBBER};
>   return res_11;
> 
> }
> 
>       Rainer
> 
> 

-- 
Richard Biener <rguent...@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)
