On 2021/5/17 4:55 PM, Richard Biener wrote:
> On Thu, May 13, 2021 at 9:04 AM Kewen.Lin <li...@linux.ibm.com> wrote:
>>
>> Hi!
>>
>>>>> But in the end the vector code shouldn't end up worse than the
>>>>> scalar code with respect to IVs - the cases where it would should
>>>>> be already costed.  So I wonder if you have specific examples
>>>>> where things go worse enough for the heuristic to trigger?
>>>>>
>>>>
>>>> One typical case that I worked on to reuse this density check is the
>>>> function mat_times_vec of src file block_solver.fppized.f of SPEC2017
>>>> 503.bwaves_r; the density with the existing heuristic is 83 (which
>>>> doesn't exceed the threshold, so the heuristic is unlikely to
>>>> trigger).  The interesting loop is the innermost one, with the option
>>>> set "-O2 -mcpu=power8 -ffast-math -ftree-vectorize".  We have
>>>> verified that this loop isn't profitable to vectorize at O2 (without
>>>> loop-interchange).
>>>
>>> Yeah, but that's because the loop only runs 5 iterations, not because
>>> of some "density" (which suggests AGU overloading or some such)?
>>> Because if you modify it so it iterates more, then with the "density"
>>> measurement kept constant you suddenly become profitable?
>>>
>>
>> Yes, I agree this isn't a perfect example showing how the density
>> check matters, though it led me to find this check.  I tried to run
>> the SPEC2017 benchmarks w/ and w/o this density heuristic to catch
>> some "expected" case, but unluckily failed to.  It may be worth trying
>> some more option sets or even testing with the previous SPEC suites
>> later.
>>
>> I hacked the innermost loop iteration count from 5 to 20, but the
>> baseline run didn't stop (after more than 7 hrs I killed it); I
>> suspect it became endless because of some garbage (out-of-bound) data.
>>
>> But the current cost modeling for this loop on Power is still bad: the
>> min profitable iterations (both static and runtime) are evaluated as
>> 2, while the reality shows that 5 iterations at least isn't
>> profitable.
>>
>>
>>> The loop does have quite many memory streams so optimizing
>>> the (few) arithmetic ops by vectorizing them might not be worth
>>> the trouble, esp. since most of the loads are "strided" (composed
>>> from scalars) when no interchange is performed.  So it's probably
>>> more a "density" of # memory streams vs. # arithmetic ops, and
>>> esp. with any non-consecutive vector loads this balance being
>>> worse in the vector case?
>>>
>>
>> Yeah, these many scalar "strided" loads make things worse.  The fed
>> vector CTORs have to wait until all of their required loads are ready,
>> and these vector CTORs are required by further multiplications.
>>
>> I posted one patch[1] on this, which tries to model it with
>> some counts: nload (total load number), nload_ctor (strided
>> load number fed into CTORs) and nctor_strided (CTOR number fed
>> by strided loads).
>>
>> The penalization is restricted by considering some factors:
>>   1) vector density ratio: if there are many vector instructions,
>>      the stalls from loads easily impact the subsequent computation.
>>   2) total load number: if nload is small, it's unlikely to bother
>>      the load/store units much.
>>   3) percentage of strided loads fed into CTORs: if a high portion
>>      of the loads are strided loads feeding CTORs, they are very
>>      likely to block the CTORs and their subsequent chains.
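
To make the intent of those three factors a bit more concrete, the shape
is roughly like below (just a simplified sketch; the field, threshold and
penalty names here are placeholders, see the patch in [1] for the real
details):

  /* At finish_cost time, once the per-stmt counts have been gathered;
     vec_density_pct stands for the density ratio that the existing
     density check already computes.  */
  if (vec_density_pct > DENSITY_PCT_THRESHOLD          /* factor 1  */
      && data->nload > DENSITY_LOAD_NUM_THRESHOLD      /* factor 2  */
      && data->nload_ctor * 100
         > data->nload * STRIDED_LOAD_PCT_THRESHOLD)   /* factor 3  */
    /* Each CTOR fed by strided loads has to wait for all of its element
       loads, so charge some extra cost per such CTOR to the loop body.  */
    data->cost[vect_body] += data->nctor_strided * STRIDED_CTOR_PENALTY;
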
>>
>> btw, as per your previous comments on add_stmt_cost, the
>> load/strided/ctor statistics should be gathered there instead, like:
>>
>>   if (!data->costing_for_scalar && data->loop_info && where == vect_body)
>>     {
>>       if (kind == scalar_load || kind == vector_load
>>           || kind == unaligned_load || kind == vector_gather_load)
>>         data->nload += count;
>>
>>       if (stmt_info && STMT_VINFO_STRIDED_P (stmt_info))
>>         {
>>           if (kind == scalar_load || kind == unaligned_load)
>>             data->nload_ctor += count;
>>           else if (kind == vec_construct)
>>             data->nctor_strided += count;
>>         }
>>     }
>>
>> [1] https://gcc.gnu.org/pipermail/gcc-patches/2021-May/569791.html
>>
>>> The x86 add_stmt_cost has
>>>
>>>   /* If we do elementwise loads into a vector then we are bound by
>>>      latency and execution resources for the many scalar loads
>>>      (AGU and load ports).  Try to account for this by scaling the
>>>      construction cost by the number of elements involved.  */
>>>   if ((kind == vec_construct || kind == vec_to_scalar)
>>>       && stmt_info
>>>       && (STMT_VINFO_TYPE (stmt_info) == load_vec_info_type
>>>           || STMT_VINFO_TYPE (stmt_info) == store_vec_info_type)
>>>       && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_ELEMENTWISE
>>>       && TREE_CODE (DR_STEP (STMT_VINFO_DATA_REF (stmt_info)))
>>>          != INTEGER_CST)
>>>     {
>>>       stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
>>>       stmt_cost *= (TYPE_VECTOR_SUBPARTS (vectype) + 1);
>>>     }
>>>
>>> so it penalizes VMAT_ELEMENTWISE with a variable step for both loads
>>> and stores.  The above materialized over PRs 84037, 85491 and 87561,
>>> so not specifically for the bwaves case.  IIRC on x86 bwaves at -O2
>>> is slower with vectorization as well.
>>>
>>
>> Thanks for the pointer!  rs6000 can probably follow this way instead.
>> IIUC, this cost adjustment is done for each individual
>> vec_construct/vec_to_scalar; is it better to first collect the data
>> in add_stmt_cost and then consider it from the view of the whole loop?
>
> It depends - when you look at the whole loop you'd like to see data
> dependences which are not obvious for the vector code.  So for the PRs
> mentioned the simplest thing was to do it per-stmt (since x86 doesn't
> yet have any special code in finish_cost).
>
>> This individual way seems easy to become overkill: if there are just a
>> few VMAT_ELEMENTWISE loads, then the resource hazard isn't so bad.
>
> True, but then the pessimization is cancelled easily by the
> vectorization benefit of other vectorized stmts.
>
>> And as you mentioned above, bwaves_r mainly suffers from strided
>> loads, so it looks like this tuning should be applied for
>> VMAT_STRIDED_SLP as well?  Is the reason why it only considers the
>> variable step that constant-step loads can be grouped on x86 (like
>> paired load/store fusion)?
>
> Possibly a speciality of x86 and its complex addressing modes where
> those need extra AGU resources (and those proved the bottlenecks for
> the PRs IIRC).  So it's just a heuristic and you shouldn't read too
> much into the details of the implementation ;)  (just realized I do
> the same for the rs6000 one...)
>

Thanks for all the comments above.  I'll do some experiments referring
to the i386 way, to see if it's fine without considering the density
ratio; a rough sketch of the direction is in the p.s. below.

btw, I guess you meant "do the same for the i386 one"? ;)

BR,
Kewen
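
p.s. The per-stmt shape I have in mind, modelled on the i386 hunk quoted
above, is something like below (untested sketch; covering
VMAT_STRIDED_SLP as well and the exact scaling are assumptions that the
experiments need to confirm):

  /* In rs6000_add_stmt_cost: the scalar element loads feeding the
     construction keep the load/store units busy, so scale the
     construction cost by the number of elements involved.  */
  if ((kind == vec_construct || kind == vec_to_scalar)
      && stmt_info
      && STMT_VINFO_TYPE (stmt_info) == load_vec_info_type
      && (STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_ELEMENTWISE
          || STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_STRIDED_SLP))
    stmt_cost *= (TYPE_VECTOR_SUBPARTS (vectype) + 1);

This drops the variable-step restriction of the i386 version for now;
whether Power needs it (i.e. whether constant-step strided loads are
cheap enough there too) is part of what the experiments should tell.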