On Thu, Apr 9, 2015 at 9:52 AM, Richard Biener <richard.guent...@gmail.com> wrote: > On Wed, Apr 8, 2015 at 5:14 PM, James Greenhalgh > <james.greenha...@arm.com> wrote: >> On Thu, Apr 02, 2015 at 04:20:06AM +0100, Ekanathan, Saravanan wrote: >>> (I had sent this mail to gcc-help a week ago. Not sure, all GCC developers >>> are subscribed to gcc-help, so re-sending to GCC development mailing list) >>> >>> Hi, >>> >>> This looks like a missed vectorization opportunity for one of the 'Fortran' >>> hot loops in cactusADM (CPU2006 benchmark) when compiled with >>> "-mcpu=cortex-a57 -Ofast". Interestingly, the 'generic' model (compiled >>> with >>> plain "-Ofast or -O3" and without -mcpu option) vectorizes this hot loop, >>> hence there is good runtime performance improvement noticed on native >>> Aarch64 >>> platform. >>> >>> I don't have a small reproducible testcase, hence quoting cactusADM >>> benchmark >>> here. The hot loop is present in Bench_StaggeredLeapfrog2() in >>> StaggeredLeapfrog2.F file. >>> >>> For cortex-a57, vectorization report clearly mentions that scalar cost < >>> vector_cost/vectorization_factor, hence didn't vectorize. >>> >>> For generic case, due to un-tuned vector cost model, the scalar cost > >>> vector_cost/vectorization_factor (since scalar_cost = vector_cost), so the >>> loop got vectorized >>> >>> << Output of generic vectorized case>> >>> >>> StaggeredLeapfrog2.fppized.f.130t.vect:StaggeredLeapfrog2.fppized.f:362:0: >>> note: LOOP VECTORIZED >>> >>> I have also played around with cortexa57_vector_cost table(esp., >>> scalar_stmt_cost, vector_stmt_cost, vec_unaligned_cost etc..,), which >>> influences the vectorization decision in this case. The >>> cortexa57_vector_cost table directly maps to the cost mentioned in >>> "Cortex-A57 Software Optimisation Guide". But, it looks like there is >>> further scope of tuning the cortexa57 vector cost to vectorize such cases. >>> >>> Any comments on this missed opportunity ? >> >> When I added the vector costs for Cortex-A57, I followed the Cortex-A57 >> Software Optimisation Guide [1] you mentioned above. I took a lower-bound >> estimate for each cost, which will certainly underestimate the >> floating-point scalar costs. >> >> So, I can believe that the costs will not be optimal for all test code >> you can give them, and I'm happy to look at patches which improve the >> vector costs. If you are planning to look at this, please feel free to >> raise a bugzilla issue and assign it to yourself so we can track things. >> >> Please be sure to test any changes across a range of workloads - from >> time to time I've seen issues with the Cortex-A57 vector costs where >> we have been too eager to vectorize and would have been better keeping >> to scalar code. > > Speaking of AARCH64 vector costs there is the completely bogus > > /* Implement targetm.vectorize.add_stmt_cost. */ > static unsigned > aarch64_add_stmt_cost (void *data, int count, enum vect_cost_for_stmt kind, > struct _stmt_vec_info *stmt_info, int misalign, > enum vect_cost_model_location where) > { > ... > /* Statements in an inner loop relative to the loop being > vectorized are weighted more heavily. The value here is > a function (linear for now) of the loop nest level. */ > if (where == vect_body && stmt_info && stmt_in_inner_loop_p (stmt_info)) > { > loop_vec_info loop_info = STMT_VINFO_LOOP_VINFO (stmt_info); > struct loop *loop = LOOP_VINFO_LOOP (loop_info); > unsigned nest_level = loop_depth (loop); > > count *= nest_level; > } > > where somebody appearantly tried to be clever, not noticing that the > code present > in every other target is that odd because it matches the vectorizer > code. You should > really replace this with the code from arm.c for example, that is, > > Index: gcc/config/aarch64/aarch64.c > =================================================================== > --- gcc/config/aarch64/aarch64.c (revision 221918) > +++ gcc/config/aarch64/aarch64.c (working copy) > @@ -6588,13 +6588,7 @@ aarch64_add_stmt_cost (void *data, int c > vectorized are weighted more heavily. The value here is > a function (linear for now) of the loop nest level. */ > if (where == vect_body && stmt_info && stmt_in_inner_loop_p > (stmt_info)) > - { > - loop_vec_info loop_info = STMT_VINFO_LOOP_VINFO (stmt_info); > - struct loop *loop = LOOP_VINFO_LOOP (loop_info); > - unsigned nest_level = loop_depth (loop); > - > - count *= nest_level; > - } > + count *= 50; /* FIXME. */ > > retval = (unsigned) (count * stmt_cost); > cost[where] += retval; > > see tree-vect-loop.c:vect_get_single_scalar_iteration_cost: > > ... > /* FORNOW. */ > innerloop_iters = 1; > if (loop->inner) > innerloop_iters = 50; /* FIXME */ > ... > > yes, that '50' should be a parameter somewhere in loop_vec_info.
I see the broken code is still in aarch64.c - can someone please test & apply the above patch? Thanks, Richard. > Richard. > >> Thanks, >> James >> >> --- >> [1]: Cortex-57 Software Optimisation Guide >> >> http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.uan0015a/index.html >>