Hi Richard,

on 2020/7/2 1:20 PM, Kewen.Lin via Gcc-patches wrote:
> on 2020/7/1 11:17 PM, Richard Sandiford wrote:
>> "Kewen.Lin" <li...@linux.ibm.com> writes:
>>> on 2020/7/1 3:53 AM, Richard Sandiford wrote:
>>>> "Kewen.Lin" <li...@linux.ibm.com> writes:

[...]

>> Hmm, OK.  But in that case can you update the names of the variables
>> to match?  It's confusing to have some nscalars_* variables actually
>> count scalars (and thus have “nitems” equivalents) and other nscalars_*
>> variables count something else (and thus effectively be nitems_*
>> variables themselves).
>>
>
> OK.  I'll update the names like nscalars_total/nscalars_step and
> equivalents to nitems_total/... (or nunits_total better?)
>
Please ignore this part, I have used nitems_ for the names. :)

>>>>> +  /* Work out how many bits we need to represent the length limit.  */
>>>>> +  unsigned int nscalars_per_iter_ft = rgl->max_nscalars_per_iter *
>>>>> rgl->factor;
>>>>
>>>> I think this breaks the abstraction.  There's no guarantee that the
>>>> factor is the same for each rgroup_control, so there's no guarantee
>>>> that the maximum bytes per iter comes from the last entry.  (Also, it'd
>>>> be better to avoid talking about bytes if we're trying to be general.)
>>>> I think we should take the maximum of each entry instead.
>>>>
>>>
>>> Agree!  I guess the above "maximum bytes per iter" is a typo, and you
>>> meant "maximum elements per iter"?  Yes, the code is for length in
>>> bytes, so checking the last entry is only reasonable for that.  Will
>>> update it to check all entries instead.
>>
>> I meant bytes, since that's what the code is effectively calculating
>> (at least for Power).  I.e. I think this breaks the abstraction even
>> if we assume the Power scheme of measuring length, since in principle
>> it's possible to mix different vector sizes in the same vector region.
>>
>
> Sorry, I didn't catch the meaning of "it's possible to mix different
> vector sizes in the same vector region".  I guess if we are counting
> bytes, the max nunits per iteration should come from the last entry,
> since the last one holds the max bytes, which is the result of
> max_nscalars_per_iter * factor.  But I agree that it breaks the
> abstraction here, since it doesn't apply to length in lanes.
>

On further thought, I guess you meant we can have different vector
sizes for the same loop in the future?  Yes, the assumption doesn't
hold then.

>>>>> +  /* Decide whether to use fully-masked approach.  */
>>>>> +  if (vect_verify_full_masking (loop_vinfo))
>>>>> +    LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = true;
>>>>> +  /* Decide whether to use length-based approach.  */
>>>>> +  else if (vect_verify_loop_lens (loop_vinfo))
>>>>> +    {
>>>>> +      if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
>>>>> +          || LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo))
>>>>> +        {
>>>>> +          if (dump_enabled_p ())
>>>>> +            dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>>>>> +                             "can't vectorize this loop with length-based"
>>>>> +                             " partial vectors approach because peeling"
>>>>> +                             " for alignment or gaps is required.\n");
>>>>> +          LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = false;
>>>>> +        }
>>>>
>>>> Why are these peeling cases necessary?  Peeling for gaps should
>>>> just mean subtracting one scalar iteration from the iteration count
>>>> and shouldn't otherwise affect the main loop.  Similarly, peeling for
>>>> alignment can be handled in the normal way, with a scalar prologue loop.
>>>>
>>>
>>> I was thinking to relax this later, to avoid handling too many cases
>>> in the first enablement patch.  Since Power hardware at the level that
>>> supports vector with length also supports unaligned load/store, we
>>> would need to construct some cases for them.  May I postpone it a bit?
>>> Or do you prefer me to support it here?
>>
>> I've no objection to postponing it if there are specific known
>> problems that make it difficult, but I think we should at least
>> say what they are.  On the face of it, I'm not sure why it doesn't
>> Just Work, since the way that we control the main loop should be
>> mostly orthogonal to how we handle peeled prologue iterations
>> and how we handle a single peeled epilogue iteration.
>>
>
> OK, I will remove it to see the impact.  By the way, do you think using
> partial vectors for the prologue is something worth trying in the
> future?
>

I tested the updated patch with this restriction relaxed; the
LOOP_VINFO_PEELING_FOR_GAPS part looks fine, but
LOOP_VINFO_PEELING_FOR_ALIGNMENT caused one case to fail at execution
with vect-partial-vector-usage=2.  So far the patch doesn't handle any
niters_skip cases.
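As an aside on the earlier rgroup point: the suggestion to take the
maximum of max_nscalars_per_iter * factor over every entry, rather than
trusting the last one, could be sketched roughly as below.  The struct
and function names here are hypothetical simplifications for
illustration, not GCC's actual rgroup_controls type:

```cpp
#include <algorithm>
#include <vector>

// Hypothetical, simplified stand-in for the vectorizer's per-rgroup data;
// the field names mirror the patch, but this is not GCC's real type.
struct rgroup_info
{
  unsigned int max_nscalars_per_iter;
  unsigned int factor; // e.g. bytes per item for length-in-bytes targets
};

// Take the maximum of max_nscalars_per_iter * factor over every rgroup
// entry.  Since the factor can differ between rgroups, the largest value
// is not guaranteed to come from the last entry.
unsigned int
max_nitems_per_iter (const std::vector<rgroup_info> &rgroups)
{
  unsigned int result = 0;
  for (const rgroup_info &rgl : rgroups)
    result = std::max (result, rgl.max_nscalars_per_iter * rgl.factor);
  return result;
}
```

For example, with entries {4, 8}, {16, 1}, {2, 4}, the last entry only
yields 8, while the true maximum (32) comes from the first.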
I think if we want to support it, we have to add some handling like
what we have for masking, such as mask_skip_niters,
vect_prepare_for_masked_peels, etc.  Do you prefer me to extend the
support in this patch series?

>>> Sorry, I might be missing something, but all undetermined lengths are
>>> generated here; are the other places you meant the doc or elsewhere?
>>
>> For example, we'd need to start querying the length operand of the optabs
>> to see what length precision the target uses, since it would be invalid
>> to do this optimisation for IVs that are wider than that precision.
>> The routine above doesn't seem the right place to do that.
>>
>
> OK, but it seems it's acceptable if the IV is wider than the precision,
> since we allow it to be out of range?
>

Please ignore this question; I agree that we have to avoid that case.
Sorry that I was misunderstanding it before. :)

BR,
Kewen