On Sep 21, 2017, at 2:37 AM, Richard Biener <richard.guent...@gmail.com> wrote:
>
> On Wed, Sep 20, 2017 at 9:17 PM, Bill Schmidt
> <wschm...@linux.vnet.ibm.com> wrote:
>> On Sep 20, 2017, at 2:14 PM, Bill Schmidt <wschm...@linux.vnet.ibm.com>
>> wrote:
>>>
>>> On Sep 20, 2017, at 3:49 AM, Richard Sandiford
>>> <richard.sandif...@linaro.org> wrote:
>>>>
>>>> But I think this shows up another problem.  In the vectorised loop,
>>>> we have 1 copy of the load and 4 copies of the ABS (after unpacking).
>>>> But vect_analyze_slp_cost_1 is being called with ncopies_for_cost == 4
>>>> even for the loads.  So I think it's a double whammy: we're quadrupling
>>>> the real cost via the excessive ncopies_for_cost, and we're using the
>>>> wrong access type too.
>>>>
>>>> The patch below should fix the second problem but isn't much use
>>>> without the first.  I think we need to adjust ncopies_for_cost in the
>>>> recursive calls as well.
>>>
>>> Unfortunately we have to deal with the fact that ncopies_for_cost is
>>> set once for the whole SLP instance, which is not ideal since we know
>>> the number of loads and stores doesn't follow the same rules.
>>>
>>> What about the following?  I *think* that group_size / nunits is
>>> sufficient for VMAT_STRIDED_SLP and VMAT_CONTIGUOUS, for which
>>> permuted loads are not possible (already handled differently), but I
>>> could well be missing something.
>>
>> Uh, I need to do the obvious rounding of (group_size + nunits - 1) /
>> nunits...
>
> I don't think any such simple approach will result in appropriate
> costing.  The current costing is conservative, and you definitely don't
> want to be overly optimistic here ;)
>
> You'd have to fully emulate the vectorizable_load code, which means we
> should probably refactor the code-gen part
>
>   if (memory_access_type == VMAT_ELEMENTWISE
>       || memory_access_type == VMAT_STRIDED_SLP)
>     {
>       ...
>
> where it computes ltype, nloads, etc.: do that computation up-front in
> the analysis phase, store the result somewhere in stmt_vinfo (err, or
> slp_node, or both -- think of hybrid SLP and/or the stmt used in
> different SLP contexts), and use the info during transform.  And hope
> to have enough info to statically compute the number of loads and their
> size/alignment at costing time.
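[For concreteness, a standalone sketch of the 1-load/4-ABS mismatch Richard
Sandiford describes above.  The element counts are hypothetical -- one
V16QI-sized load unpacked into four V4SI-sized ABS statements -- and this
is plain C arithmetic, not vectorizer code.]

  /* Hypothetical counts matching the 1-load/4-ABS scenario quoted
     above; not taken from the actual testcase.  */
  #include <stdio.h>

  int
  main (void)
  {
    int group_size = 16;   /* scalar elements in the SLP group        */
    int load_nunits = 16;  /* elements per vector load (e.g. V16QI)   */
    int abs_nunits = 4;    /* elements per vector ABS (e.g. V4SI)     */

    /* Vector statements needed for each operation, rounded up.  */
    int load_copies = (group_size + load_nunits - 1) / load_nunits;
    int abs_copies = (group_size + abs_nunits - 1) / abs_nunits;

    /* Prints "load copies: 1, ABS copies: 4" -- so charging the loads
       with ncopies_for_cost == 4 quadruples their real cost.  */
    printf ("load copies: %d, ABS copies: %d\n", load_copies, abs_copies);
    return 0;
  }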
I suppose so.  I was thinking that if these are truly strided loads, then
the cost is predictable -- but only for targets that have hardware strided
loads available, which are few, so this isn't conservative enough.  The
code in question didn't deal with VMAT_ELEMENTWISE, so the factoring is
even more complicated -- the structures of the costing logic and the
code-gen logic don't match well at all.

I'm unlikely to have time for a complex solution right now, so I'll stick
a pointer to this conversation in the bugzilla and take my name off in
case somebody gets to it sooner.  But I'll probably keep poking at it in
spare moments.

Thanks,
Bill

> Or go the full way of restructuring costing to have a
> prologue/body_cost_vec in stmt_info and slp_node so that we can compute
> and store the cost from vectorizable_* rather than duplicating the hard
> parts in SLP costing code.
>
> Richard.
>
>> Bill
>>>
>>> Index: gcc/tree-vect-slp.c
>>> ===================================================================
>>> --- gcc/tree-vect-slp.c (revision 252760)
>>> +++ gcc/tree-vect-slp.c (working copy)
>>> @@ -1662,12 +1662,14 @@ vect_analyze_slp_cost_1 (slp_instance instance, sl
>>>        stmt_info = vinfo_for_stmt (stmt);
>>>        if (STMT_VINFO_GROUPED_ACCESS (stmt_info))
>>>          {
>>> +          int group_size = GROUP_SIZE (stmt_info);
>>> +          int nunits = TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info));
>>>            vect_memory_access_type memory_access_type
>>> -            = (STMT_VINFO_STRIDED_P (stmt_info)
>>> +            = (STMT_VINFO_STRIDED_P (stmt_info) && group_size > nunits
>>>                ? VMAT_STRIDED_SLP
>>>                : VMAT_CONTIGUOUS);
>>>            if (DR_IS_WRITE (STMT_VINFO_DATA_REF (stmt_info)))
>>> -            vect_model_store_cost (stmt_info, ncopies_for_cost,
>>> +            vect_model_store_cost (stmt_info, group_size / nunits,
>>>                                     memory_access_type, vect_uninitialized_def,
>>>                                     node, prologue_cost_vec, body_cost_vec);
>>>            else
>>> @@ -1714,6 +1716,8 @@ vect_analyze_slp_cost_1 (slp_instance instance, sl
>>>                                  + nunits - 1) / nunits);
>>>                ncopies_for_cost *= SLP_INSTANCE_UNROLLING_FACTOR (instance);
>>>              }
>>> +          else
>>> +            ncopies_for_cost = group_size / nunits;
>>>            /* Record the cost for the vector loads.  */
>>>            vect_model_load_cost (stmt_info, ncopies_for_cost,
>>>                                  memory_access_type, node, prologue_cost_vec,
>>>
>>> Bill
>>>
>>
>
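[A minimal sketch of the rounding issue Bill corrects above: the quoted
patch's plain group_size / nunits undercounts whenever the group size is
not a multiple of the vector width, which is why the ceiling form
(group_size + nunits - 1) / nunits is needed.  The values below are
arbitrary examples, not from any testcase.]

  #include <stdio.h>

  int
  main (void)
  {
    int group_size = 6;  /* arbitrary example: 6 scalars in the group */
    int nunits = 4;      /* 4 elements per vector                     */

    int truncated = group_size / nunits;               /* 1: too few  */
    int rounded = (group_size + nunits - 1) / nunits;  /* 2: correct  */

    /* Two vectors are needed to hold six elements, so the truncated
       count would under-cost the vector loads/stores.  */
    printf ("truncated: %d, rounded up: %d\n", truncated, rounded);
    return 0;
  }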