On Sep 20, 2017, at 3:49 AM, Richard Sandiford <richard.sandif...@linaro.org> wrote:
>
> Bill Schmidt <wschm...@linux.vnet.ibm.com> writes:
>> On Sep 19, 2017, at 4:21 PM, Bill Schmidt <wschm...@linux.vnet.ibm.com> wrote:
>>> On Sep 19, 2017, at 3:58 PM, Richard Sandiford
>>> <richard.sandif...@linaro.org> wrote:
>>>> Bill Schmidt <wschm...@linux.vnet.ibm.com> writes:
>>>>> Index: gcc/tree-vect-stmts.c
>>>>> ===================================================================
>>>>> --- gcc/tree-vect-stmts.c	(revision 252760)
>>>>> +++ gcc/tree-vect-stmts.c	(working copy)
>>>>> @@ -1091,8 +1091,19 @@ vect_model_load_cost (stmt_vec_info stmt_info, int
>>>>>  			 prologue_cost_vec, body_cost_vec, true);
>>>>>    if (memory_access_type == VMAT_ELEMENTWISE
>>>>>        || memory_access_type == VMAT_STRIDED_SLP)
>>>>> -    inside_cost += record_stmt_cost (body_cost_vec, ncopies, vec_construct,
>>>>> -				     stmt_info, 0, vect_body);
>>>>> +    {
>>>>> +      int group_size = GROUP_SIZE (stmt_info);
>>>>> +      int nunits = TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info));
>>>>> +      if (group_size < nunits)
>>>>> +	{
>>>>> +	  if (dump_enabled_p ())
>>>>> +	    dump_printf_loc (MSG_NOTE, vect_location,
>>>>> +			     "vect_model_load_cost: vec_construct required");
>>>>> +	  inside_cost += record_stmt_cost (body_cost_vec, ncopies,
>>>>> +					   vec_construct, stmt_info, 0,
>>>>> +					   vect_body);
>>>>> +	}
>>>>> +    }
>>>>>
>>>>>    if (dump_enabled_p ())
>>>>>      dump_printf_loc (MSG_NOTE, vect_location,
>>>>
>>>> This feels like we've probably got the wrong memory_access_type.
>>>> If it's just a contiguous load then it should be VMAT_CONTIGUOUS
>>>> instead.
>>>
>>> I tend to agree -- that will take more surgery to the code that
>>> detects strided loads in both the cost model analysis and in
>>> vectorizable_load.
>>>
>> It's just a little less clear how to clean this up in
>> vect_analyze_data_refs.
>> The too-simplistic test there now is:
>>
>>   else if (is_a <loop_vec_info> (vinfo)
>> 	   && TREE_CODE (DR_STEP (dr)) != INTEGER_CST)
>>     {
>>       if (nested_in_vect_loop_p (loop, stmt))
>> 	{
>> 	  if (dump_enabled_p ())
>> 	    {
>> 	      dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>> 			       "not vectorized: not suitable for strided "
>> 			       "load ");
>> 	      dump_gimple_stmt (MSG_MISSED_OPTIMIZATION, TDF_SLIM, stmt, 0);
>> 	    }
>> 	  return false;
>> 	}
>>       STMT_VINFO_STRIDED_P (stmt_info) = true;
>>     }
>>
>> Furthermore, this is being set while processing each data reference,
>> not after all information has been gathered and analyzed.  So this is
>> too early.
>
> Yeah, agree we can't tell at this stage.
>
>> Next place to fix this would be in vect_analyze_slp_cost_1, where
>>
>>   vect_memory_access_type memory_access_type
>>     = (STMT_VINFO_STRIDED_P (stmt_info)
>>        ? VMAT_STRIDED_SLP
>>        : VMAT_CONTIGUOUS);
>>
>> is too simplistic.  I guess we could do something here, but it would
>> require unpacking all the grouped accesses and duplicating all the
>> logic from scratch to determine whether it's a single load or not.
>
> I think we can use:
>
>   SLP_INSTANCE_UNROLLING_FACTOR (instance) * ncopies_for_cost
>
> to calculate the number of times that the vector SLP statements repeat
> the original scalar group.  We should then only use STRIDED_SLP if
> that's greater than 1.
>
> This affects stores as well as loads, so fixing it here will work
> for both.
Thanks, Richard, that makes sense.  Let me play around with it a bit.

> But I think this shows up another problem.  In the vectorised loop,
> we have 1 copy of the load and 4 copies of the ABS (after unpacking).
> But vect_analyze_slp_cost_1 is being called with ncopies_for_cost == 4
> even for the loads.  So I think it's a double whammy: we're quadrupling
> the real cost via the excessive ncopies_for_cost, and we're using the
> wrong access type too.

Absolutely right, I failed to pay attention to this!  Good catch.

> The patch below should fix the second problem but isn't much use without
> the first.  I think we need to adjust ncopies_for_cost in the recursive
> calls as well.

Just to be clear, I think you mean it fixes the original problem (wrong
access type).  I'll look into how to fix the excessive ncopies_for_cost.

> Of course, fixing an overestimate here is likely to expose an
> underestimate elsewhere. :-)

Right...it was ever thus...

Very much appreciate your assistance!

Bill

> Interesting test case though.  It seems counterproductive to unroll
> a loop like that before vectorisation.
>
> Thanks,
> Richard
>
>
> Index: gcc/tree-vect-slp.c
> ===================================================================
> --- gcc/tree-vect-slp.c	(revision 252986)
> +++ gcc/tree-vect-slp.c	(working copy)
> @@ -1687,8 +1687,10 @@
>        stmt_info = vinfo_for_stmt (stmt);
>        if (STMT_VINFO_GROUPED_ACCESS (stmt_info))
> 	{
> +	  unsigned int group_ncopies = (SLP_INSTANCE_UNROLLING_FACTOR (instance)
> +					* ncopies_for_cost);
> 	  vect_memory_access_type memory_access_type
> -	    = (STMT_VINFO_STRIDED_P (stmt_info)
> +	    = (STMT_VINFO_STRIDED_P (stmt_info) && group_ncopies > 1
> 	       ? VMAT_STRIDED_SLP
> 	       : VMAT_CONTIGUOUS);
> 	  if (DR_IS_WRITE (STMT_VINFO_DATA_REF (stmt_info)))