On 2021/5/17 4:55 PM, Richard Biener wrote:
> On Thu, May 13, 2021 at 9:04 AM Kewen.Lin <li...@linux.ibm.com> wrote:
>>
>> Hi!
>>
>>>>> But in the end the vector code shouldn't end up worse than the
>>>>> scalar code with respect to IVs - the cases where it would should
>>>>> be already costed.  So I wonder if you have specific examples
>>>>> where things go worse enough for the heuristic to trigger?
>>>>>
>>>>
>>>> One typical case that I worked on to reuse this density check is the
>>>> function mat_times_vec of src file block_solver.fppized.f of SPEC2017
>>>> 503.bwaves_r; the density computed by the existing heuristic is 83
>>>> (which, unluckily, doesn't exceed the threshold).  The interesting
>>>> loop is the innermost one, and the option set is "-O2 -mcpu=power8
>>>> -ffast-math -ftree-vectorize".  We have verified that this loop isn't
>>>> profitable to vectorize at O2 (without loop-interchange).
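For reference, the general shape of such a short-trip-count mat-times-vec inner loop can be sketched as follows (an illustrative 5x5 C++ example, not the actual bwaves Fortran source): many memory streams, few arithmetic ops, and only five iterations in the innermost loop.

```cpp
// Illustrative sketch only -- not the actual bwaves mat_times_vec code.
// The point: many memory streams (a and x), little arithmetic, and an
// innermost loop that runs just 5 iterations.
void mat_times_vec (double y[5], const double a[5][5], const double x[5])
{
  for (int i = 0; i < 5; i++)
    {
      double t = 0.0;
      for (int j = 0; j < 5; j++)  // innermost: only 5 iterations
        t += a[i][j] * x[j];       // loads dominate the few mults/adds
      y[i] = t;
    }
}
```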
>>>
>>> Yeah, but that's because the loop only runs 5 iterations, not because
>>> of some "density" (which suggests AGU overloading or some such)?
>>> Because if you modify it so it iterates more then with keeping the
>>> "density" measurement constant you suddenly become profitable?
>>>
>>
>> Yes, I agree this isn't a perfect one showing how the density check
>> matters, though it led me to find this check.  I tried to run SPEC2017
>> bmks w/ and w/o this density heuristic to catch some "expected" case,
>> but unluckily failed to.  It may be worth trying with some more
>> option sets, or even testing with the previous SPECs later.
>>
>> I hacked the innermost loop iteration count from 5 to 20, but the
>> baseline run didn't stop (I killed it after more than 7 hours); I
>> suspect it became endless because of some garbage (out-of-bound) data.
>>
>> But the current cost modeling for this loop on Power is still bad: the
>> minimum profitable iteration count (both static and runtime) is
>> evaluated as 2, while in reality even 5 isn't profitable.
>>
>>
>>> The loop does have quite many memory streams so optimizing
>>> the (few) arithmetic ops by vectorizing them might not be worth
>>> the trouble, esp. since most of the loads are "strided" (composed
>>> from scalars) when no interchange is performed.  So it's probably
>>> more a "density" of # memory streams vs. # arithmetic ops, and
>>> esp. with any non-consecutive vector loads this balance being
>>> worse in the vector case?
>>>
>>
>> Yeah, these many scalar "strided" loads make things worse.  The
>> vector CTORs they feed have to wait until all of their required loads
>> are ready, and these vector CTORs are in turn required by further
>> multiplications.
>>
>> I posted one patch [1] on this, which tries to model it with
>> some counts: nload (total number of loads), nload_ctor (number of
>> strided loads fed into CTORs) and nctor_strided (number of CTORs
>> fed by strided loads).
>>
>> The penalization is restricted by considering some factors:
>>   1) vector density ratio: if there are many vector instructions,
>>      the stalls from loads easily impact the subsequent
>>      computation.
>>   2) total load number: if nload is small, it's unlikely to
>>      bother the load/store units much.
>>   3) percentage of strided loads fed into CTORs: if a high portion
>>      of the strided loads feed into CTORs, they are very likely to
>>      block the CTORs and their subsequent chains.
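The three-factor gating above can be sketched as a small self-contained predicate.  The threshold values below are made-up placeholders for illustration only, not the numbers from the actual patch:

```cpp
// Illustrative sketch of the gating described above; the thresholds
// are invented placeholders, not the values from the posted patch.
struct loop_load_stats
{
  unsigned nload;         // total number of loads in the loop body
  unsigned nload_ctor;    // strided loads feeding vector CTORs
  unsigned nctor_strided; // vector CTORs fed by strided loads
};

bool
should_penalize (const loop_load_stats &s, unsigned density_pct)
{
  const unsigned DENSITY_MIN = 45;  // 1) enough vector insns that load
                                    //    stalls hit later computation
  const unsigned NLOAD_MIN = 20;    // 2) enough loads to pressure the
                                    //    load/store units
  const unsigned CTOR_PCT_MIN = 80; // 3) high portion of strided loads
                                    //    feeding CTORs blocks the chain
  if (density_pct < DENSITY_MIN || s.nload < NLOAD_MIN)
    return false;
  return s.nload_ctor * 100 >= s.nload * CTOR_PCT_MIN;
}
```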
>>
>> btw, as per your previous comments on add_stmt_cost, the
>> load/strided/ctor statistics should be gathered there instead, like:
>>
>>   if (!data->costing_for_scalar && data->loop_info && where == vect_body)
>>     {
>>       if (kind == scalar_load || kind == vector_load
>>           || kind == unaligned_load || kind == vector_gather_load)
>>         data->nload += count;
>>       if (stmt_info && STMT_VINFO_STRIDED_P (stmt_info))
>>         {
>>           if (kind == scalar_load || kind == unaligned_load)
>>             data->nload_ctor += count;
>>           else if (kind == vec_construct)
>>             data->nctor_strided += count;
>>         }
>>     }
>>
>> [1] https://gcc.gnu.org/pipermail/gcc-patches/2021-May/569791.html
>>
>>> The x86 add_stmt_cost has
>>>
>>>   /* If we do elementwise loads into a vector then we are bound by
>>>      latency and execution resources for the many scalar loads
>>>      (AGU and load ports).  Try to account for this by scaling the
>>>      construction cost by the number of elements involved.  */
>>>   if ((kind == vec_construct || kind == vec_to_scalar)
>>>       && stmt_info
>>>       && (STMT_VINFO_TYPE (stmt_info) == load_vec_info_type
>>>           || STMT_VINFO_TYPE (stmt_info) == store_vec_info_type)
>>>       && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_ELEMENTWISE
>>>       && TREE_CODE (DR_STEP (STMT_VINFO_DATA_REF (stmt_info)))
>>>          != INTEGER_CST)
>>>     {
>>>       stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
>>>       stmt_cost *= (TYPE_VECTOR_SUBPARTS (vectype) + 1);
>>>     }
>>>
>>> so it penalizes VMAT_ELEMENTWISE with a variable step for both loads
>>> and stores.
>>> The above materialized over PRs 84037, 85491 and 87561, so not specifically
>>> for the bwaves case.  IIRC on x86 bwaves at -O2 is slower with vectorization
>>> as well.
>>>
>>
>> Thanks for the pointer!  rs6000 can probably follow this way instead.
>> IIUC, this cost adjustment is applied to each individual
>> vec_construct/vec_to_scalar; would it be better to first collect the
>> data in add_stmt_cost and then consider it from the view of the whole
>> loop?
> 
> It depends - when you look at the whole loop you'd like to see data
> dependences, which are not obvious for the vector code.  So for the
> PRs mentioned, the simplest thing was to do it per-stmt (since x86
> doesn't yet have any special code in finish_cost).
> 
>> This individual way seems easy to over-penalize: if there are just a
>> few VMAT_ELEMENTWISE loads, the resource hazard isn't so bad.
> 
> True, but then the pessimization is easily cancelled by the
> vectorization benefit of other vectorized stmts.
> 
>>  And as you mentioned above, bwaves_r mainly
>> suffers from strided loads, so it looks like this tuning should be
>> applied for VMAT_STRIDED_SLP as well?  Is the reason why it only
>> considers the variable step that constant-step loads can be grouped
>> on x86 (like paired load/store fusion)?
> 
> Possibly a speciality of x86 and its complex addressing modes, where
> those need extra AGU resources (and those proved to be the bottlenecks
> for the PRs, IIRC).  So it's just a heuristic and you shouldn't read
> too much into the details of the implementation ;)  (just realize I do
> the same for the rs6000 one...)
> 

Thanks for all the comments above.  I'll do some experiments referring to
the i386 way, to see if it's fine without considering the density ratio.
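For what it's worth, the scaling in the quoted i386 hook boils down to multiplying the construction cost by the element count plus one.  Pulled out as a plain function for illustration (base_cost standing in for the builtin_vectorization_cost result, nunits for TYPE_VECTOR_SUBPARTS):

```cpp
// Sketch of the per-statement scaling in the quoted i386 hook, pulled
// out as a plain function.  base_cost stands in for whatever the
// target's builtin_vectorization_cost hook returns; nunits stands in
// for TYPE_VECTOR_SUBPARTS (vectype).
unsigned
elementwise_construct_cost (unsigned base_cost, unsigned nunits)
{
  // Scale the construction cost by the number of elements involved,
  // plus one, to account for AGU/load-port pressure from the many
  // scalar loads.
  return base_cost * (nunits + 1);
}
```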

btw, I guess you meant "do the same for the i386 one"?  ;)

BR,
Kewen
