on 2020/7/31 6:57 PM, Richard Sandiford wrote:
> "Kewen.Lin" <li...@linux.ibm.com> writes:
>> Hi Richard,
>>
>> on 2020/7/27 9:10 PM, Richard Sandiford wrote:
>>> "Kewen.Lin" <li...@linux.ibm.com> writes:
>>>> Hi,
>>>>
>>>> As Richard S. suggested in the thread:
>>>>
>>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-July/550633.html
>>>>
>>>> this patch is separated from the one in that thread, mainly to refactor
>>>> the existing peel_iters_{pro,epi}logue cost model handling.
>>>>
>>>> I've addressed Richard S.'s review comments there; moreover, because of
>>>> one failure in aarch64 testing, I updated it a bit more to first keep the
>>>> logic unchanged (refactor_cost.diff).
>>>
>>> Heh, nice when a clean-up exposes an existing bug. ;-)  I agree the
>>> updates look correct.  E.g. if vf is 1, we should assume that there
>>> are no peeled iterations even if the number of iterations is unknown.
>>>
>>> Both patches are OK with some very minor nits (sorry):
>>
>> Thanks for the comments!  I've addressed them and committed
>> refactor_cost.diff via r11-2378.
>>
>> For adjust.diff, it works well on powerpc64le-linux-gnu
>> (bootstrap/regtest), but got one failure on aarch64 as below:
>>
>>   PASS->FAIL: gcc.target/aarch64/sve/cost_model_2.c -march=armv8.2-a+sve
>>     scan-assembler-times \\tst1w\\tz 1
>>
>> I spent some time investigating it and concluded it's expected.  As you
>> mentioned above, the patch can fix the cases with VF 1, and the VF of
>> this case is exactly one. :)
>>
>> Without the proposed adjustment, the cost for V16QI looks like:
>>
>>   Vector inside of loop cost: 1
>>   Vector prologue cost: 4
>>   Vector epilogue cost: 3
>>   Scalar iteration cost: 4
>>   Scalar outside cost: 6
>>   Vector outside cost: 7
>>   prologue iterations: 0
>>   epilogue iterations: 0
>>
>> After the change, the cost becomes:
>>
>>   Vector inside of loop cost: 1
>>   Vector prologue cost: 1
>>   Vector epilogue cost: 0
>>   Scalar iteration cost: 4
>>   Scalar outside cost: 6
>>   Vector outside cost: 1
>>   prologue iterations: 0
>>   epilogue iterations: 0
>>
>> so V16QI now outperforms VNx16QI, which isn't affected by this patch
>> with partial vectors.
>
> Hmm.  V16QI outperforming VNx16QI is kind-of bogus in this case,
> since the store is a full-vector store for both Advanced SIMD and
> 128-bit SVE.  But we do currently still generate the SVE loop in
> a more general form (since doing otherwise introduces other problems),
> so I guess the result is self-consistent.
>
> More importantly, even if the costs were equal, V16QI would still
> win for 128-bit SVE.
>
> So yeah, while I had my doubts at first, I agree this is the right fix.
>
Thanks for the explanation!  Committed in r11-2453.

BR,
Kewen
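For readers following along: the cost dumps above feed the vectorizer's profitability check. The sketch below is not GCC's actual code (the real logic lives in vect_estimate_min_profitable_iters in tree-vect-loop.c and accounts for many more terms, e.g. peeling and versioning costs); it is only a simplified model of how outside/inside costs and the VF combine into a minimum profitable iteration count, using the figures quoted in the thread and VF 1 as the thread states.

```python
import math

def min_profitable_iters(scalar_iter_cost, scalar_outside_cost,
                         vec_inside_cost, vec_outside_cost, vf):
    """Smallest scalar iteration count n at which the vector loop wins,
    i.e. where  vec_outside + (n / vf) * vec_inside
              < scalar_outside + n * scalar_iter_cost.

    Simplified model only; GCC's real computation has more terms.
    """
    # Cost saved by one vector iteration covering vf scalar iterations.
    saving_per_vf_iters = scalar_iter_cost * vf - vec_inside_cost
    if saving_per_vf_iters <= 0:
        return None  # vector body never cheaper: not profitable
    # Extra one-off (outside-loop) cost the savings must amortize.
    delta = (vec_outside_cost - scalar_outside_cost) * vf
    return max(0, math.ceil(delta / saving_per_vf_iters))

# Figures from the thread, with vf = 1 as stated there:
# before the patch: vector outside cost 7 -> threshold 1 iteration
# after the patch:  vector outside cost 1 -> profitable from 0 iterations
before = min_profitable_iters(4, 6, 1, 7, 1)
after = min_profitable_iters(4, 6, 1, 1, 1)
```

With the old prologue/epilogue costs the vector loop only breaks even after one iteration; with the refactored costs the vector outside cost drops below the scalar outside cost, so vectorization is profitable immediately, which matches why V16QI now wins the cost comparison.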