On Sep 20, 2017, at 3:49 AM, Richard Sandiford <richard.sandif...@linaro.org> wrote:
> 
> Bill Schmidt <wschm...@linux.vnet.ibm.com> writes:
>> On Sep 19, 2017, at 4:21 PM, Bill Schmidt <wschm...@linux.vnet.ibm.com> wrote:
>>> On Sep 19, 2017, at 3:58 PM, Richard Sandiford
>>> <richard.sandif...@linaro.org> wrote:
>>>> Bill Schmidt <wschm...@linux.vnet.ibm.com> writes:
>>>>> Index: gcc/tree-vect-stmts.c
>>>>> ===================================================================
>>>>> --- gcc/tree-vect-stmts.c (revision 252760)
>>>>> +++ gcc/tree-vect-stmts.c (working copy)
>>>>> @@ -1091,8 +1091,19 @@ vect_model_load_cost (stmt_vec_info stmt_info, int
>>>>>                   prologue_cost_vec, body_cost_vec, true);
>>>>> if (memory_access_type == VMAT_ELEMENTWISE
>>>>>     || memory_access_type == VMAT_STRIDED_SLP)
>>>>> -    inside_cost += record_stmt_cost (body_cost_vec, ncopies, vec_construct,
>>>>> -                                     stmt_info, 0, vect_body);
>>>>> +    {
>>>>> +      int group_size = GROUP_SIZE (stmt_info);
>>>>> +      int nunits = TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info));
>>>>> +      if (group_size < nunits)
>>>>> + {
>>>>> +   if (dump_enabled_p ())
>>>>> +     dump_printf_loc (MSG_NOTE, vect_location,
>>>>> +                      "vect_model_load_cost: vec_construct required");
>>>>> +   inside_cost += record_stmt_cost (body_cost_vec, ncopies,
>>>>> +                                    vec_construct, stmt_info, 0,
>>>>> +                                    vect_body);
>>>>> + }
>>>>> +    }
>>>>> 
>>>>> if (dump_enabled_p ())
>>>>>   dump_printf_loc (MSG_NOTE, vect_location,
>>>> 
>>>> This feels like we've probably got the wrong memory_access_type.
>>>> If it's a just a contiguous load then it should be VMAT_CONTIGUOUS
>>>> instead.
>>> 
>>> I tend to agree -- that will take more surgery to the code that
>>> detects strided loads, in both the cost model analysis and in
>>> vectorizable_load.
>>> 
>> It's just a little less clear how to clean this up in
>> vect_analyze_data_refs.  The too-simplistic test there now is:
>> 
>>      else if (is_a <loop_vec_info> (vinfo)
>>               && TREE_CODE (DR_STEP (dr)) != INTEGER_CST)
>>        {
>>          if (nested_in_vect_loop_p (loop, stmt))
>>            {
>>              if (dump_enabled_p ())
>>              {
>>                  dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>>                                   "not vectorized: not suitable for strided "
>>                                   "load ");
>>                  dump_gimple_stmt (MSG_MISSED_OPTIMIZATION, TDF_SLIM,
>>                                    stmt, 0);
>>                }
>>              return false;
>>            }
>>          STMT_VINFO_STRIDED_P (stmt_info) = true;
>>        }
>> 
>> Furthermore, this is being set while processing each data reference,
>> not after all information has been gathered and analyzed.  So this is
>> too early.
> 
> Yeah, agree we can't tell at this stage.
> 
>> Next place to fix this would be in vect_analyze_slp_cost_1, where
>> 
>>      vect_memory_access_type memory_access_type
>>        = (STMT_VINFO_STRIDED_P (stmt_info)
>>           ? VMAT_STRIDED_SLP
>>           : VMAT_CONTIGUOUS);
>> 
>> is too simplistic.  I guess we could do something here, but it would
>> require unpacking all the grouped accesses and duplicating all the
>> logic from scratch to determine whether it's a single load or not.
> 
> I think we can use:
> 
>    SLP_INSTANCE_UNROLLING_FACTOR (instance) * ncopies_for_cost
> 
> to calculate the number of times that the vector SLP statements repeat
> the original scalar group.  We should then only use STRIDED_SLP if
> that's greater than 1.
> 
> This affects stores as well as loads, so fixing it here will work
> for both.

Thanks, Richard, that makes sense.  Let me play around with it a bit.
> 
> But I think this shows up another problem.  In the vectorised loop,
> we have 1 copy of the load and 4 copies of the ABS (after unpacking).
> But vect_analyze_slp_cost_1 is being called with ncopies_for_cost == 4
> even for the loads.  So I think it's a double whammy: we're quadrupling
> the real cost via the excessive ncopies_for_cost, and we're using the
> wrong access type too.

Absolutely right, I failed to pay attention to this!  Good catch.
> 
> The patch below should fix the second problem but isn't much use without
> the first.  I think we need to adjust ncopies_for_cost in the recursive
> calls as well.

Just to be clear, I think you mean it fixes the original problem (wrong
access type).  I'll look into how to fix the excessive ncopies_for_cost.
> 
> Of course, fixing an overestimate here is likely to expose an
> underestimate elsewhere. :-)

Right...it was ever thus... 

Very much appreciate your assistance!

Bill
> 
> Interesting test case though.  It seems counterproductive to unroll
> a loop like that before vectorisation.
> 
> Thanks,
> Richard
> 
> 
> Index: gcc/tree-vect-slp.c
> ===================================================================
> --- gcc/tree-vect-slp.c       (revision 252986)
> +++ gcc/tree-vect-slp.c       (working copy)
> @@ -1687,8 +1687,10 @@
>   stmt_info = vinfo_for_stmt (stmt);
>   if (STMT_VINFO_GROUPED_ACCESS (stmt_info))
>     {
> +      unsigned int group_ncopies = (SLP_INSTANCE_UNROLLING_FACTOR (instance)
> +                                * ncopies_for_cost);
>       vect_memory_access_type memory_access_type
> -     = (STMT_VINFO_STRIDED_P (stmt_info)
> +     = (STMT_VINFO_STRIDED_P (stmt_info) && group_ncopies > 1
>          ? VMAT_STRIDED_SLP
>          : VMAT_CONTIGUOUS);
>       if (DR_IS_WRITE (STMT_VINFO_DATA_REF (stmt_info)))
