https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87288

--- Comment #7 from rguenther at suse dot de <rguenther at suse dot de> ---
On Wed, 19 Sep 2018, amker at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87288
> 
> --- Comment #6 from bin cheng <amker at gcc dot gnu.org> ---
> (In reply to Richard Biener from comment #5)
> > it's set here:
> > 
> >   if (!is_gimple_val (niters_vector))
> >     {
> >       var = create_tmp_var (type, "bnd");
> >       gimple_seq stmts = NULL;
> >       niters_vector = force_gimple_operand (niters_vector, &stmts, true,
> > var);
> >       gsi_insert_seq_on_edge_immediate (pe, stmts);
> >       /* Peeling algorithm guarantees that vector loop bound is at least 
> > ONE,
> >          we set range information to make niters analyzer's life easier.  */
> >       if (stmts != NULL && log_vf)
> >         set_range_info (niters_vector, VR_RANGE,
> >                         wi::to_wide (build_int_cst (type, 1)),
> >                         wi::to_wide (fold_build2 (RSHIFT_EXPR, type,
> >                                                   TYPE_MAX_VALUE (type),
> >                                                   log_vf)));
> > 
> > and the loop is
> > 
> >   <bb 5> [local count: 105119325]:
> >   niters.0_25 = (unsigned int) n_15;
> >   ni_gap.1_36 = niters.0_25 + 4294967295;
> >   # RANGE [1, 2147483647] NONZERO 2147483647
> >   bnd.2_37 = ni_gap.1_36 >> 1;
> > 
> >   <bb 4> [local count: 567644349]:
> >   # ivtmp_50 = PHI <ivtmp_51(6), 0(5)>
> >   ivtmp_51 = ivtmp_50 + 1;
> >   if (ivtmp_51 >= bnd.2_37)
> >     goto <bb 12>; [16.67%]
> >   else
> >     goto <bb 6>; [83.33%]
> > 
> >   <bb 6> [local count: 473036958]:
> >   goto <bb 4>; [100.00%]
> > 
> > which looks good according to the comment.  So the number of iterations
> > _is_ bnd.2_37 - 1 (that number may be zero).
> 
> Not really.  The code was added under assumption that vector code is guarded 
> by
> condition in peeling.
> 
> Dump for vect is like:
>   <bb 2> [local count: 161061274]:
>   # PT = { D.2425 } (escaped, escaped heap)
>   # USE = nonlocal null { D.2425 } (escaped, escaped heap)
>   # CLB = nonlocal null { D.2425 } (escaped, escaped heap)
>   _10 = operator new [] (32);
>   _10->_M_elems0 = 0.0;
>   MEM[(struct array2 *)_10 + 16B]._M_elems0 = 0.0;
>   # RANGE [-2147483648, 2147483647] NONZERO 4294967294
>   n_14 = argc_13(D) * 2;
>   if (n_14 <= 0)
>     goto <bb 3>; [15.00%]
>   else
>     goto <bb 5>; [85.00%]
> 
>   <bb 3> [local count: 161061274]:
>   # USE = nonlocal null { D.2425 } (escaped, escaped heap)
>   # CLB = nonlocal null { D.2425 } (escaped, escaped heap)
>   operator delete [] (_10);
>   jacobianTransposeds ={v} {CLOBBER};
>   return 0;
> 
>   <bb 5> [local count: 136902083]:
>   niters.0_25 = (unsigned int) n_14;
>   ni_gap.1_36 = niters.0_25 + 4294967295;
>   # RANGE [1, 2147483647] NONZERO 2147483647
>   bnd.2_37 = ni_gap.1_36 >> 1;
> 
>   <bb 4> [local count: 739271244]:
>   # RANGE [0, 2147483647] NONZERO 2147483647
>   # i_21 = PHI <i_16(6), 0(5)>
>   # PT = null { D.2425 } (escaped, escaped heap)
>   # ALIGN = 8, MISALIGN = 0
>   # vectp.5_40 = PHI <vectp.5_41(6), _10(5)>
>   # PT = { D.2388 }
>   # ALIGN = 16, MISALIGN = 0
>   # vectp_jacobianTransposeds.9_47 = PHI <vectp_jacobianTransposeds.9_48(6),
> &jacobianTransposeds(5)>
>   # ivtmp_50 = PHI <ivtmp_51(6), 0(5)>
>   # RANGE [0, 2147483646] NONZERO 2147483647
>   _1 = (long unsigned int) i_21;
>   # RANGE [0, 34359738336] NONZERO 34359738352
>   _2 = _1 * 16;
>   # PT = null { D.2425 } (escaped, escaped heap)
>   _3 = _10 + _2;
>   _6 = _1 * 8;
>   # PT = { D.2388 }
>   # ALIGN = 8, MISALIGN = 0
>   _4 = &jacobianTransposeds + _6;
>   vect__5.7_42 = MEM[(double *)vectp.5_40];
>   # PT = null { D.2425 } (escaped, escaped heap)
>   # ALIGN = 8, MISALIGN = 0
>   vectp.5_43 = vectp.5_40 + 16;
>   vect__5.8_44 = MEM[(double *)vectp.5_43];
>   vect_perm_even_45 = VEC_PERM_EXPR <vect__5.7_42, vect__5.8_44, { 0, 2 }>;
>   vect_perm_odd_46 = VEC_PERM_EXPR <vect__5.7_42, vect__5.8_44, { 1, 3 }>;
>   _5 = _3->_M_elems0;
>   MEM[(double &)vectp_jacobianTransposeds.9_47] = vect_perm_even_45;
>   # RANGE [1, 2147483647] NONZERO 2147483647
>   i_16 = i_21 + 1;
>   # PT = null { D.2425 } (escaped, escaped heap)
>   vectp.5_41 = vectp.5_43 + 16;
>   # PT = { D.2388 }
>   # ALIGN = 16, MISALIGN = 0
>   vectp_jacobianTransposeds.9_48 = vectp_jacobianTransposeds.9_47 + 16;
>   ivtmp_51 = ivtmp_50 + 1;
>   if (ivtmp_51 >= bnd.2_37)
>     goto <bb 12>; [16.67%]
>   else
>     goto <bb 6>; [83.33%]
> 
>   <bb 9> [local count: 136902081]:
>   goto <bb 3>; [100.00%]
> 
>   <bb 6> [local count: 616059372]:
>   goto <bb 4>; [100.00%]
> 
>   <bb 12> [local count: 136902083]:
>   niters_vector_mult_vf.3_38 = bnd.2_37 << 1;
>   tmp.4_39 = (int) niters_vector_mult_vf.3_38;
> 
>   <bb 13> [local count: 912680552]:
>   # RANGE [0, 2147483647] NONZERO 2147483647
>   # i_24 = PHI <i_29(14), tmp.4_39(12)>
>   # RANGE [0, 2147483646] NONZERO 2147483647
>   _19 = (long unsigned int) i_24;
>   # RANGE [0, 34359738336] NONZERO 34359738352
>   _33 = _19 * 16;
>   # PT = null { D.2425 } (escaped, escaped heap)
>   _26 = _10 + _33;
>   _27 = _19 * 8;
>   # PT = { D.2388 }
>   # ALIGN = 8, MISALIGN = 0
>   _28 = &jacobianTransposeds + _27;
>   _31 = _26->_M_elems0;
>   MEM[(double &)_28] = _31;
>   # RANGE [1, 2147483647] NONZERO 2147483647
>   i_29 = i_24 + 1;
>   if (n_14 <= i_29)
>     goto <bb 9>; [15.00%]
>   else
>     goto <bb 14>; [85.00%]
> 
>   <bb 14> [local count: 775778470]:
>   goto <bb 13>; [100.00%]
> 
> This is no peeling guard condition skipping vector loop anymore.  In case of
> "argc_13(D) == 1", the vector loop body is executed exactly once 
> (corresponding
> 2 times before vectorization);

Hmm, I think it looks all a bit bogus - clearly the vector loop should
execute once for argc == 1 but we compute bnd.2_37 == 0 (argc*2 - 1 / 2).
That doesn't fit with the <= bnd.2_37 use for the exit condition.  For
argc == 2 we'd compute bnd.2_37 == 1 and execute the body once as well...

> after vector loop, the epilog loop body is
> executed 2 times again (for the same iteration as done in vector loop).  There
> is two problems here:
> A) it's at least inefficient when ("argc_13(D) == 1" && !SVE).
> B) Given the vector loop is not guarded by peeling condition anymore, range
> info as you noted should not be set for bnd.2_37, because it could take value
> ZERO.

But how does the exit condition make sense there?!  If we're peeling
for gaps then we should not execute the loop.  I guess we optimize
the condition immediately because of the bogus range info.

> Change in peeling is made by revision 256635, specifically, by below code
> changes:
> 
> +  poly_uint64 bound_epilog = 0;
> +  if (!LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)
> +      && LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo))
> +    bound_epilog += vf - 1;
> +  if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))
> +    bound_epilog += 1;
> 
> //......
> 
> @@ -2577,10 +2593,8 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters,
> tree nitersm1,
>        if (skip_vector)
>         {
>           /* Additional epilogue iteration is peeled if gap exists.  */
> -         bool peel_for_gaps = LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo);
>           tree t = vect_gen_scalar_loop_niters (niters_prolog, prolog_peeling,
> -                                               bound_prolog,
> -                                               peel_for_gaps ? vf : vf - 1,
> +                                               bound_prolog, bound_epilog,
> 
> Now bound_epilog == 1 is passed into vect_gen_scalar_loop_niters, rather than
> vf (== 2), this causes no peeling condition is generated.
> 
> Either below condition is too strict here or we need to identify and skip
> setting range info in this case:
> +  if (!LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)
> +      && LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo))
> +    bound_epilog += vf - 1;

Thanks for the further analysis.  The above allows us to blame Richard.

Reply via email to