On Thu, May 13, 2021 at 9:04 AM Kewen.Lin <li...@linux.ibm.com> wrote:
>
> Hi!
>
> >>> But in the end the vector code shouldn't end up worse than the
> >>> scalar code with respect to IVs - the cases where it would should
> >>> be already costed.  So I wonder if you have specific examples
> >>> where things go worse enough for the heuristic to trigger?
> >>>
> >>
> >> One typical case that I worked on to reuse this density check is the
> >> function mat_times_vec of src file block_solver.fppized.f of SPEC2017
> >> 503.bwaves_r; the density with the existing heuristic is 83 (doesn't
> >> exceed the threshold, unluckily).  The interesting loop is the innermost
> >> one, with the option set "-O2 -mcpu=power8 -ffast-math -ftree-vectorize".
> >> We have verified that this loop isn't profitable to vectorize at
> >> O2 (without loop-interchange).
> >
> > Yeah, but that's because the loop only runs 5 iterations, not because
> > of some "density" (which suggests AGU overloading or some such)?
> > Because if you modify it so it iterates more, then with the
> > "density" measurement kept constant you suddenly become profitable?
> >
>
> Yes, I agree this isn't a perfect example showing how the density check
> matters, though it led me to find this check.  I tried to run the SPEC2017
> benchmarks w/ and w/o this density heuristic to catch some "expected" case,
> but unluckily failed to.  It may be worth trying with some more option
> sets, or even testing with the previous SPECs later.
>
> I hacked the innermost loop iteration count from 5 to 20, but the baseline
> run didn't stop (I killed it after more than 7 hrs); I suspect it became
> endless because of some garbage (out of bound) data.
>
> But the current cost modeling for this loop on Power is still bad: the
> min profitable iteration count (both static and run time) is evaluated
> as 2, while in reality even 5 iterations aren't profitable.
>
> > The loop does have quite many memory streams, so optimizing
> > the (few) arithmetic ops by vectorizing them might not be worth
> > the trouble, esp. since most of the loads are "strided" (composed
> > from scalars) when no interchange is performed.  So it's probably
> > more a "density" of # memory streams vs. # arithmetic ops, and
> > esp. with any non-consecutive vector loads this balance being
> > worse in the vector case?
> >
>
> Yeah, these many scalar "strided" loads make things worse.  The fed
> vector CTORs have to wait until all of their required loads are ready,
> and these vector CTORs are in turn required by further multiplications.
>
> I posted one patch[1] on this, which tries to model it with
> some counts: nload (total load number), nload_ctor (strided
> load number fed into a CTOR) and nctor_strided (CTOR number fed
> by strided loads).
>
> The penalization is restricted by considering some factors (see the
> sketch after this list):
> 1) vector density ratio: if there are many vector instructions,
>    the stalls from loads easily impact the subsequent computation.
> 2) total load number: if nload is small, it's unlikely to bother
>    the load/store units much.
> 3) pct. of strided loads fed into CTORs: if a high portion of the
>    strided loads feed into CTORs, they are very likely to block
>    the CTORs and their subsequent chains.
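>
> To make those three factors concrete, the gating in the target's
> finish_cost hook could look roughly like the sketch below.  This is
> illustration only: density_pct and all the *_THRESHOLD / *_PENALTY
> names are hypothetical placeholders, not necessarily what the posted
> patch [1] uses.
>
>   /* Sketch only: DENSITY_PCT_THRESHOLD, LOAD_NUM_THRESHOLD,
>      STRIDED_LOAD_PCT_THRESHOLD and STRIDED_CTOR_PENALTY are
>      hypothetical tunables.  */
>   unsigned load_pct = data->nload_ctor * 100 / MAX (data->nload, 1);
>   if (density_pct > DENSITY_PCT_THRESHOLD        /* factor 1)  */
>       && data->nload > LOAD_NUM_THRESHOLD        /* factor 2)  */
>       && load_pct > STRIDED_LOAD_PCT_THRESHOLD)  /* factor 3)  */
>     /* Penalize proportionally to the number of CTORs fed by strided
>        loads, since those stall their subsequent computation chains.  */
>     data->cost[vect_body] += data->nctor_strided * STRIDED_CTOR_PENALTY;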
>
> btw, as per your previous comments on add_stmt_cost, the load/strided/ctor
> statistics should be gathered there instead, like:
>
>   if (!data->costing_for_scalar && data->loop_info && where == vect_body)
>     {
>       if (kind == scalar_load || kind == vector_load
>           || kind == unaligned_load || kind == vector_gather_load)
>         data->nload += count;
>
>       if (stmt_info && STMT_VINFO_STRIDED_P (stmt_info))
>         {
>           if (kind == scalar_load || kind == unaligned_load)
>             data->nload_ctor += count;
>           else if (kind == vec_construct)
>             data->nctor_strided += count;
>         }
>     }
>
> [1] https://gcc.gnu.org/pipermail/gcc-patches/2021-May/569791.html
>
> > The x86 add_stmt_cost has
> >
> >       /* If we do elementwise loads into a vector then we are bound by
> >          latency and execution resources for the many scalar loads
> >          (AGU and load ports).  Try to account for this by scaling the
> >          construction cost by the number of elements involved.  */
> >       if ((kind == vec_construct || kind == vec_to_scalar)
> >           && stmt_info
> >           && (STMT_VINFO_TYPE (stmt_info) == load_vec_info_type
> >               || STMT_VINFO_TYPE (stmt_info) == store_vec_info_type)
> >           && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_ELEMENTWISE
> >           && TREE_CODE (DR_STEP (STMT_VINFO_DATA_REF (stmt_info)))
> >              != INTEGER_CST)
> >         {
> >           stmt_cost = ix86_builtin_vectorization_cost (kind, vectype,
> >                                                        misalign);
> >           stmt_cost *= (TYPE_VECTOR_SUBPARTS (vectype) + 1);
> >         }
> >
> > so it penalizes VMAT_ELEMENTWISE with variable step for both loads and
> > stores.  The above materialized over PRs 84037, 85491 and 87561, so not
> > specifically for the bwaves case.  IIRC on x86 bwaves at -O2 is slower
> > with vectorization as well.
>
> Thanks for the pointer!  rs6000 probably can follow this way instead.
> IIUC, this cost adjustment is applied for each individual
> vec_construct/vec_to_scalar; is it better to first collect the data in
> add_stmt_cost and then consider it from the view of the whole loop?
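>
> To be concrete about "follow this way", the rs6000 analogue of that
> hunk might look roughly like the below - purely an untested sketch on
> my side, and whether VMAT_ELEMENTWISE alone (or also VMAT_STRIDED_SLP,
> see below) is the right trigger is exactly what I'm unsure about:
>
>   /* Hypothetical sketch mirroring the x86 hunk: scale the
>      construction cost by the number of elements involved.  */
>   if (kind == vec_construct
>       && stmt_info
>       && STMT_VINFO_TYPE (stmt_info) == load_vec_info_type
>       && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_ELEMENTWISE)
>     {
>       stmt_cost = rs6000_builtin_vectorization_cost (kind, vectype,
>                                                      misalign);
>       stmt_cost *= (TYPE_VECTOR_SUBPARTS (vectype) + 1);
>     }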
It depends - when you look at the whole loop you'd like to see data
dependences which are not obvious for the vector code.  So for the
PRs mentioned the simplest thing was to do it per-stmt (since x86
doesn't yet have any special code in finish_cost).

> This individual way seems easy to overkill: if there are just a few
> VMAT_ELEMENTWISE loads, the resource hazard isn't so bad.

True, but then the pessimization is easily cancelled by the
vectorization benefit of the other vectorized stmts.

> And as you mentioned above, bwaves_r mainly suffers from strided loads;
> shouldn't this tuning be applied for VMAT_STRIDED_SLP as well?  Is the
> reason it only considers the variable step that constant-step loads can
> be grouped on x86 (like paired load/store fusion)?

Possibly a speciality of x86 and its complex addressing modes, where
those need extra AGU resources (and those proved to be the bottlenecks
for the PRs IIRC).  So it's just a heuristic and you shouldn't read too
much into the details of the implementation ;)  (just realized I do the
same for the rs6000 one...)

Richard.

>
> BR,
> Kewen
>
> > Oh, and yes - the info the vectorizer presents the target with for
> > vector loads/stores leaves a lot to be desired ...
> >
> > Richard.
> >