Bill Schmidt <wschm...@linux.vnet.ibm.com> writes:
> On Sep 19, 2017, at 4:21 PM, Bill Schmidt <wschm...@linux.vnet.ibm.com> wrote:
>> On Sep 19, 2017, at 3:58 PM, Richard Sandiford
>> <richard.sandif...@linaro.org> wrote:
>>> Bill Schmidt <wschm...@linux.vnet.ibm.com> writes:
>>>> Index: gcc/tree-vect-stmts.c
>>>> ===================================================================
>>>> --- gcc/tree-vect-stmts.c	(revision 252760)
>>>> +++ gcc/tree-vect-stmts.c	(working copy)
>>>> @@ -1091,8 +1091,19 @@ vect_model_load_cost (stmt_vec_info stmt_info, int
>>>>                              prologue_cost_vec, body_cost_vec, true);
>>>>    if (memory_access_type == VMAT_ELEMENTWISE
>>>>        || memory_access_type == VMAT_STRIDED_SLP)
>>>> -    inside_cost += record_stmt_cost (body_cost_vec, ncopies, vec_construct,
>>>> -                                     stmt_info, 0, vect_body);
>>>> +    {
>>>> +      int group_size = GROUP_SIZE (stmt_info);
>>>> +      int nunits = TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info));
>>>> +      if (group_size < nunits)
>>>> +        {
>>>> +          if (dump_enabled_p ())
>>>> +            dump_printf_loc (MSG_NOTE, vect_location,
>>>> +                             "vect_model_load_cost: vec_construct required");
>>>> +          inside_cost += record_stmt_cost (body_cost_vec, ncopies,
>>>> +                                           vec_construct, stmt_info, 0,
>>>> +                                           vect_body);
>>>> +        }
>>>> +    }
>>>>
>>>>    if (dump_enabled_p ())
>>>>      dump_printf_loc (MSG_NOTE, vect_location,
>>>
>>> This feels like we've probably got the wrong memory_access_type.
>>> If it's just a contiguous load then it should be VMAT_CONTIGUOUS
>>> instead.
>>
>> I tend to agree -- that will take more surgery to the code that
>> detects strided loads in both the cost model analysis and in
>> vectorizable_load.
>>
> It's just a little less clear how to clean this up in
> vect_analyze_data_refs.  The too-simplistic test there now is:
>
>       else if (is_a <loop_vec_info> (vinfo)
>                && TREE_CODE (DR_STEP (dr)) != INTEGER_CST)
>         {
>           if (nested_in_vect_loop_p (loop, stmt))
>             {
>               if (dump_enabled_p ())
>                 {
>                   dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>                                    "not vectorized: not suitable for strided "
>                                    "load ");
>                   dump_gimple_stmt (MSG_MISSED_OPTIMIZATION, TDF_SLIM,
>                                     stmt, 0);
>                 }
>               return false;
>             }
>           STMT_VINFO_STRIDED_P (stmt_info) = true;
>         }
>
> Furthermore, this is being set while processing each data reference, not
> after all information has been gathered and analyzed.  So this is too
> early.

Yeah, agree we can't tell at this stage.
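To make that concrete, a kernel of roughly this shape (a made-up
example, not the test case from this thread) trips the test above:

  /* DR_STEP for these loads is stride * 4 ints, which is not an
     INTEGER_CST, so vect_analyze_data_refs marks them
     STMT_VINFO_STRIDED_P.  But the four loads in each group are
     contiguous, and with V4SI they fill a whole vector, so each
     group needs only a single vector load and no vec_construct.  */
  void
  foo (int *restrict out, int *restrict in, int stride, int n)
  {
    for (int i = 0; i < n; i++)
      out[i] = in[i * stride] + in[i * stride + 1]
               + in[i * stride + 2] + in[i * stride + 3];
  }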
> Next place to fix this would be in vect_analyze_slp_cost_1, where
>
>       vect_memory_access_type memory_access_type
>         = (STMT_VINFO_STRIDED_P (stmt_info)
>            ? VMAT_STRIDED_SLP
>            : VMAT_CONTIGUOUS);
>
> is too simplistic.  I guess we could do something here, but it would
> require unpacking all the grouped accesses and duplicating all the
> logic from scratch to determine whether it's a single load or not.

I think we can use:

  SLP_INSTANCE_UNROLLING_FACTOR (instance) * ncopies_for_cost

to calculate the number of times that the vector SLP statements repeat
the original scalar group.  We should then only use STRIDED_SLP if
that's greater than 1.  This affects stores as well as loads, so fixing
it here will work for both.  (There are some worked example numbers
after the patch below.)

But I think this shows up another problem.  In the vectorised loop, we
have 1 copy of the load and 4 copies of the ABS (after unpacking).  But
vect_analyze_slp_cost_1 is being called with ncopies_for_cost == 4 even
for the loads.  So I think it's a double whammy: we're quadrupling the
real cost via the excessive ncopies_for_cost, and we're using the wrong
access type too.

The patch below should fix the second problem but isn't much use
without the first.  I think we need to adjust ncopies_for_cost in the
recursive calls as well.

Of course, fixing an overestimate here is likely to expose an
underestimate elsewhere. :-)

Interesting test case though.  It seems counterproductive to unroll a
loop like that before vectorisation.

Thanks,
Richard


Index: gcc/tree-vect-slp.c
===================================================================
--- gcc/tree-vect-slp.c	(revision 252986)
+++ gcc/tree-vect-slp.c	(working copy)
@@ -1687,8 +1687,10 @@
       stmt_info = vinfo_for_stmt (stmt);
       if (STMT_VINFO_GROUPED_ACCESS (stmt_info))
         {
+          unsigned int group_ncopies = (SLP_INSTANCE_UNROLLING_FACTOR (instance)
+                                        * ncopies_for_cost);
           vect_memory_access_type memory_access_type
-            = (STMT_VINFO_STRIDED_P (stmt_info)
+            = (STMT_VINFO_STRIDED_P (stmt_info) && group_ncopies > 1
                ? VMAT_STRIDED_SLP
                : VMAT_CONTIGUOUS);
           if (DR_IS_WRITE (STMT_VINFO_DATA_REF (stmt_info)))
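To put numbers on the group_ncopies test (made-up sizes, and assuming
I've got the unrolling-factor arithmetic right):

  /* Group of 4 ints, V4SI: each vector statement covers exactly one
     scalar group, so a contiguous load per group is enough.
       SLP_INSTANCE_UNROLLING_FACTOR == 1, ncopies_for_cost == 1
       group_ncopies == 1  ->  VMAT_CONTIGUOUS

     Group of 4 chars, V16QI: each vector statement spans 4 scalar
     groups that sit at non-constant steps from one another, so the
     elements really do have to be gathered piecewise.
       SLP_INSTANCE_UNROLLING_FACTOR == 4, ncopies_for_cost == 1
       group_ncopies == 4  ->  VMAT_STRIDED_SLP  */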