On April 11, 2017 4:57:29 PM GMT+02:00, "Bin.Cheng" <amker.ch...@gmail.com> wrote:
>On Tue, Apr 11, 2017 at 3:38 PM, Robin Dapp <rd...@linux.vnet.ibm.com> wrote:
>> Hi,
>>
>> when looking at various vectorization examples on s390x I noticed that
>> we still peel vf/2 iterations for alignment even though the
>> vectorization costs of unaligned loads and stores are the same as
>> those of normal loads/stores.
>>
>> A simple example is
>>
>> void foo(int *restrict a, int *restrict b, unsigned int n)
>> {
>>   for (unsigned int i = 0; i < n; i++)
>>     {
>>       b[i] = a[i] * 2 + 1;
>>     }
>> }
>>
>> which gets peeled unless __builtin_assume_aligned (a, 8) is used.
>>
>> In tree-vect-data-refs.c there are several checks that involve costs
>> in the peeling decision, none of which seems to suffice in this case.
>> For a loop with only read DRs there is a check that has been
>> triggering (i.e. disabling peeling) since we implemented the
>> vectorization costs.
>>
>> Here, we have DR_MISALIGNMENT (dr) == -1 for all DRs, but the costs
>> should still dictate never to peel. I attached a tentative patch for
>> discussion which fixes the problem by checking the costs for
>> npeel = 0 and npeel = vf/2 after ensuring we support all
>> misalignments. Is there a better way and place to do it? Are we
>> missing something somewhere else that would preclude the peeling
>> from happening?
>>
>> This is not intended for stage 4 obviously :)
>Hi Robin,
>It seems Richi added the code below, which compares the costs of
>aligned and unaligned loads and only peels if it's beneficial:
>
>/* In case there are only loads with different unknown misalignments, use
>   peeling only if it may help to align other accesses in the loop or
>   if it may help improving load bandwidth when we'd end up using
>   unaligned loads.  */
>tree dr0_vt = STMT_VINFO_VECTYPE (vinfo_for_stmt (DR_STMT (dr0)));
>if (!first_store
>    && !STMT_VINFO_SAME_ALIGN_REFS (
>          vinfo_for_stmt (DR_STMT (dr0))).length ()
>    && (vect_supportable_dr_alignment (dr0, false)
>        != dr_unaligned_supported
>        || (builtin_vectorization_cost (vector_load, dr0_vt, 0)
>            == builtin_vectorization_cost (unaligned_load, dr0_vt, -1))))
>  do_peeling = false;
>
>I think similar code can be added for the store case too.
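A hypothetical sketch of that store-side analog, mirroring the load
check above (illustrative only, not an actual patch; the surrounding
guards in vect_enhance_data_refs_alignment may need to differ):

  /* Hypothetical store-side analog: if the single store has unknown
     misalignment, peeling would align no other refs, and the target
     charges the same for an unaligned vector store as for an aligned
     one, peeling for the store is not beneficial either.  */
  if (first_store
      && !STMT_VINFO_SAME_ALIGN_REFS (
            vinfo_for_stmt (DR_STMT (first_store))).length ())
    {
      tree fs_vt = STMT_VINFO_VECTYPE (vinfo_for_stmt (DR_STMT (first_store)));
      if (vect_supportable_dr_alignment (first_store, false)
            != dr_unaligned_supported
          || (builtin_vectorization_cost (vector_store, fs_vt, 0)
              == builtin_vectorization_cost (unaligned_store, fs_vt, -1)))
        do_peeling = false;
    }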
Note I was very conservative here to allow store-bandwidth-starved CPUs
to benefit from aligning a store. I think it would be reasonable to
apply the same heuristic to the store case: only peel for equal cost if
peeling would align at least two refs.

Richard.

>Thanks,
>bin
>>
>> Regards
>> Robin
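Expressed as code, Richard's refinement might look like the following
sketch (same caveats as the sketch above; refs_aligned_by_peeling is an
illustrative name, not an existing variable):

  /* Hypothetical refinement: when aligned and unaligned store costs
     are equal and unaligned stores are supported, keep peeling only
     if it would align the store plus at least one other ref, i.e. at
     least two refs in total.  */
  if (first_store)
    {
      tree fs_vt = STMT_VINFO_VECTYPE (vinfo_for_stmt (DR_STMT (first_store)));
      bool same_cost
        = (builtin_vectorization_cost (vector_store, fs_vt, 0)
           == builtin_vectorization_cost (unaligned_store, fs_vt, -1));
      unsigned refs_aligned_by_peeling
        = 1 + STMT_VINFO_SAME_ALIGN_REFS (
                vinfo_for_stmt (DR_STMT (first_store))).length ();
      if (same_cost
          && vect_supportable_dr_alignment (first_store, false)
               == dr_unaligned_supported
          && refs_aligned_by_peeling < 2)
        do_peeling = false;
    }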