On April 11, 2017 4:57:29 PM GMT+02:00, "Bin.Cheng" <amker.ch...@gmail.com> 
wrote:
>On Tue, Apr 11, 2017 at 3:38 PM, Robin Dapp <rd...@linux.vnet.ibm.com>
>wrote:
>> Hi,
>>
>> when looking at various vectorization examples on s390x I noticed that
>> we still peel vf/2 iterations for alignment, even though the
>> vectorization costs of unaligned loads and stores are the same as those
>> of normal loads/stores.
>>
>> A simple example is
>>
>> void foo(int *restrict a, int *restrict b, unsigned int n)
>> {
>>   for (unsigned int i = 0; i < n; i++)
>>     {
>>       b[i] = a[i] * 2 + 1;
>>     }
>> }
>>
>> which gets peeled unless __builtin_assume_aligned (a, 8) is used.
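
For reference, the hinted variant would look roughly like this (just a sketch;
note that __builtin_assume_aligned returns the pointer with the alignment
assumption attached, so its result has to be used):

void foo(int *restrict a, int *restrict b, unsigned int n)
{
  /* Promise the compiler that a is at least 8-byte aligned.  */
  a = __builtin_assume_aligned (a, 8);
  for (unsigned int i = 0; i < n; i++)
    {
      b[i] = a[i] * 2 + 1;
    }
}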
>>
>> In tree-vect-data-refs.c there are several checks that involve costs in
>> the peeling decision, none of which seems to suffice in this case. For a
>> loop with only read DRs there is a check that has been triggering (i.e.
>> disabling peeling) since we implemented the vectorization costs.
>>
>> Here, we have DR_MISALIGNMENT (dr) == -1 for all DRs, but the costs
>> should still dictate never to peel. I attached a tentative patch for
>> discussion which fixes the problem by checking the costs for npeel = 0
>> and npeel = vf/2 after ensuring we support all misalignments. Is there a
>> better way and place to do it? Are we missing something somewhere else
>> that would preclude the peeling from happening?
>>
>> This is not intended for stage 4, obviously :)
>Hi Robin,
>Seems Richi added the code below, comparing costs between aligned and
>unaligned loads, and only peeling if it's beneficial:
>
>  /* In case there are only loads with different unknown misalignments, use
>     peeling only if it may help to align other accesses in the loop or
>     if it may help improving load bandwidth when we'd end up using
>     unaligned loads.  */
>  tree dr0_vt = STMT_VINFO_VECTYPE (vinfo_for_stmt (DR_STMT (dr0)));
>  if (!first_store
>      && !STMT_VINFO_SAME_ALIGN_REFS (
>              vinfo_for_stmt (DR_STMT (dr0))).length ()
>      && (vect_supportable_dr_alignment (dr0, false)
>          != dr_unaligned_supported
>          || (builtin_vectorization_cost (vector_load, dr0_vt, 0)
>              == builtin_vectorization_cost (unaligned_load, dr0_vt, -1))))
>    do_peeling = false;
>
>I think similar code can be added for the store case too.

Note I was very conservative here, to allow store-bandwidth-starved CPUs to 
benefit from aligning a store.

I think it would be reasonable to apply the same heuristic to the store case: 
peel for equal cost only if peeling would align at least two refs.

Richard.
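
A rough sketch of how that might sit next to the existing load check
(hypothetical code, simply mirroring the snippet quoted above; first_store is
the store DR tracked in vect_enhance_data_refs_alignment, and vector_store /
unaligned_store are the corresponding entries for the target cost hook):

      /* Sketch: if an unaligned store is supported and costs the same as an
         aligned one, only peel for the store when aligning it would also
         align at least one other reference.  */
      tree st_vt = STMT_VINFO_VECTYPE (vinfo_for_stmt (DR_STMT (first_store)));
      if (first_store
          && !STMT_VINFO_SAME_ALIGN_REFS (
                  vinfo_for_stmt (DR_STMT (first_store))).length ()
          && vect_supportable_dr_alignment (first_store, false)
               == dr_unaligned_supported
          && (builtin_vectorization_cost (vector_store, st_vt, 0)
              == builtin_vectorization_cost (unaligned_store, st_vt, -1)))
        do_peeling = false;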

>Thanks,
>bin
>>
>> Regards
>>  Robin
