"Kewen.Lin" <li...@linux.ibm.com> writes: > Hi Richard, > > on 2020/7/27 下午9:10, Richard Sandiford wrote: >> "Kewen.Lin" <li...@linux.ibm.com> writes: >>> Hi, >>> >>> As Richard S. suggested in the thread: >>> >>> https://gcc.gnu.org/pipermail/gcc-patches/2020-July/550633.html >>> >>> this patch is separated from the one of that thread, mainly to refactor the >>> existing peel_iters_{pro,epi}logue cost model handlings. >>> >>> I've addressed Richard S.'s review comments there, moreover because of one >>> failure of aarch64 testing, I updated it a bit more to keep the logic >>> unchanged >>> as before first (refactor_cost.diff). >> >> Heh, nice when a clean-up exposes an existing bug. ;-) I agree the >> updates look correct. E.g. if vf is 1, we should assume that there >> are no peeled iterations even if the number of iterations is unknown. >> >> Both patches are OK with some very minor nits (sorry): > > Thanks for the comments! I've addressed them and commit refactor_cost.diff > via r11-2378. > > For adjust.diff, it works well on powerpc64le-linux-gnu (bootstrap/regtest), > but got one failure on aarch64 as below: > > PASS->FAIL: gcc.target/aarch64/sve/cost_model_2.c -march=armv8.2-a+sve > scan-assembler-times \\tst1w\\tz 1 > > I spent some time investigating it and thought it's expected. As you > mentioned > above, the patch can fix the cases with VF 1, the VF of this case is exactly > one. :) > > Without the proposed adjustment, the cost for V16QI looks like: > > Vector inside of loop cost: 1 > Vector prologue cost: 4 > Vector epilogue cost: 3 > Scalar iteration cost: 4 > Scalar outside cost: 6 > Vector outside cost: 7 > prologue iterations: 0 > epilogue iterations: 0 > > After the change, the cost becomes to: > > Vector inside of loop cost: 1 > Vector prologue cost: 1 > Vector epilogue cost: 0 > Scalar iteration cost: 4 > Scalar outside cost: 6 > Vector outside cost: 1 > prologue iterations: 0 > epilogue iterations: 0 > > which outperforms VNx16QI that isn't affected by this patch with partial > vectors.
Hmm. V16QI outperforming VNx16QI is kind-of bogus in this case, since the store is a full-vector store for both Advanced SIMD and 128-bit SVE. But we do currently still generate the SVE loop in a more general form (since doing otherwise introduces other problems) so I guess the result is self-consistent. More importantly, even if the costs were equal, V16QI would still win for 128-bit SVE. So yeah, while I had my doubts at first, I agree this is the right fix. > gcc/ChangeLog: > > * tree-vect-loop.c (vect_get_known_peeling_cost): Don't consider branch > taken costs for prologue and epilogue if they don't exist. > (vect_estimate_min_profitable_iters): Likewise. > > gcc/testsuite/ChangeLog: > > * gcc.target/aarch64/sve/cost_model_2.c: Adjust due to cost model > change. OK, thanks. Richard