On Thu, Apr 9, 2015 at 9:52 AM, Richard Biener
<richard.guent...@gmail.com> wrote:
> On Wed, Apr 8, 2015 at 5:14 PM, James Greenhalgh
> <james.greenha...@arm.com> wrote:
>> On Thu, Apr 02, 2015 at 04:20:06AM +0100, Ekanathan, Saravanan wrote:
>>> (I had sent this mail to gcc-help a week ago. Not sure, all GCC developers
>>> are subscribed to gcc-help, so re-sending to GCC development mailing list)
>>>
>>> Hi,
>>>
>>> This looks like a missed vectorization opportunity for one of the 'Fortran'
>>> hot loops in cactusADM (CPU2006 benchmark) when compiled with
>>> "-mcpu=cortex-a57 -Ofast".  Interestingly, the 'generic' model (compiled 
>>> with
>>> plain "-Ofast or -O3" and without -mcpu option) vectorizes this hot loop,
>>> hence there is good runtime performance improvement noticed on native 
>>> Aarch64
>>> platform.
>>>
>>> I don't have a small reproducible testcase, hence quoting cactusADM 
>>> benchmark
>>> here.  The hot loop is present in Bench_StaggeredLeapfrog2() in
>>> StaggeredLeapfrog2.F file.
>>>
>>> For cortex-a57, vectorization report clearly mentions that scalar cost <
>>> vector_cost/vectorization_factor, hence didn't vectorize.
>>>
>>> For generic case, due to un-tuned vector cost model, the scalar cost >
>>> vector_cost/vectorization_factor  (since scalar_cost = vector_cost), so the
>>> loop got vectorized
>>>
>>>    << Output of  generic vectorized case>>
>>>      
>>> StaggeredLeapfrog2.fppized.f.130t.vect:StaggeredLeapfrog2.fppized.f:362:0: 
>>> note: LOOP VECTORIZED
>>>
>>> I have also played around with cortexa57_vector_cost table(esp.,
>>> scalar_stmt_cost, vector_stmt_cost, vec_unaligned_cost  etc..,), which
>>> influences the vectorization decision in this case.  The
>>> cortexa57_vector_cost table directly maps to the cost mentioned in
>>> "Cortex-A57 Software Optimisation Guide".  But, it looks like there is
>>> further scope of tuning the cortexa57 vector cost to vectorize such cases.
>>>
>>> Any comments on this missed opportunity ?
>>
>> When I added the vector costs for Cortex-A57, I followed the Cortex-A57
>> Software Optimisation Guide [1] you mentioned above. I took a lower-bound
>> estimate for each cost, which will certainly underestimate the
>> floating-point scalar costs.
>>
>> So, I can believe that the costs will not be optimal for all test code
>> you can give them, and I'm happy to look at patches which improve the
>> vector costs. If you are planning to look at this, please feel free to
>> raise a bugzilla issue and assign it to yourself so we can track things.
>>
>> Please be sure to test any changes across a range of workloads - from
>> time to time I've seen issues with the Cortex-A57 vector costs where
>> we have been too eager to vectorize and would have been better keeping
>> to scalar code.
>
> Speaking of AARCH64 vector costs there is the completely bogus
>
> /* Implement targetm.vectorize.add_stmt_cost.  */
> static unsigned
> aarch64_add_stmt_cost (void *data, int count, enum vect_cost_for_stmt kind,
>                        struct _stmt_vec_info *stmt_info, int misalign,
>                        enum vect_cost_model_location where)
> {
> ...
>       /* Statements in an inner loop relative to the loop being
>          vectorized are weighted more heavily.  The value here is
>          a function (linear for now) of the loop nest level.  */
>       if (where == vect_body && stmt_info && stmt_in_inner_loop_p (stmt_info))
>         {
>           loop_vec_info loop_info = STMT_VINFO_LOOP_VINFO (stmt_info);
>           struct loop *loop =  LOOP_VINFO_LOOP (loop_info);
>           unsigned nest_level = loop_depth (loop);
>
>           count *= nest_level;
>         }
>
> where somebody appearantly tried to be clever, not noticing that the
> code present
> in every other target is that odd because it matches the vectorizer
> code.  You should
> really replace this with the code from arm.c for example, that is,
>
> Index: gcc/config/aarch64/aarch64.c
> ===================================================================
> --- gcc/config/aarch64/aarch64.c        (revision 221918)
> +++ gcc/config/aarch64/aarch64.c        (working copy)
> @@ -6588,13 +6588,7 @@ aarch64_add_stmt_cost (void *data, int c
>          vectorized are weighted more heavily.  The value here is
>          a function (linear for now) of the loop nest level.  */
>        if (where == vect_body && stmt_info && stmt_in_inner_loop_p 
> (stmt_info))
> -       {
> -         loop_vec_info loop_info = STMT_VINFO_LOOP_VINFO (stmt_info);
> -         struct loop *loop =  LOOP_VINFO_LOOP (loop_info);
> -         unsigned nest_level = loop_depth (loop);
> -
> -         count *= nest_level;
> -       }
> +       count *= 50;  /* FIXME.  */
>
>        retval = (unsigned) (count * stmt_cost);
>        cost[where] += retval;
>
> see tree-vect-loop.c:vect_get_single_scalar_iteration_cost:
>
> ...
>   /* FORNOW.  */
>   innerloop_iters = 1;
>   if (loop->inner)
>     innerloop_iters = 50; /* FIXME */
> ...
>
> yes, that '50' should be a parameter somewhere in loop_vec_info.

I see the broken code is still in aarch64.c - can someone please test
& apply the above
patch?

Thanks,
Richard.

> Richard.
>
>> Thanks,
>> James
>>
>> ---
>> [1]: Cortex-57 Software Optimisation Guide
>>   
>> http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.uan0015a/index.html
>>

Reply via email to