on 2020/7/31 下午6:57, Richard Sandiford wrote:
> "Kewen.Lin" <li...@linux.ibm.com> writes:
>> Hi Richard,
>>
>> on 2020/7/27 下午9:10, Richard Sandiford wrote:
>>> "Kewen.Lin" <li...@linux.ibm.com> writes:
>>>> Hi,
>>>>
>>>> As Richard S. suggested in the thread:
>>>>
>>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-July/550633.html
>>>>
>>>> this patch is separated from the one of that thread, mainly to refactor the
>>>> existing peel_iters_{pro,epi}logue cost model handlings.
>>>>
>>>> I've addressed Richard S.'s review comments there, moreover because of one
>>>> failure of aarch64 testing, I updated it a bit more to keep the logic 
>>>> unchanged
>>>> as before first (refactor_cost.diff).
>>>
>>> Heh, nice when a clean-up exposes an existing bug. ;-)  I agree the
>>> updates look correct.  E.g. if vf is 1, we should assume that there
>>> are no peeled iterations even if the number of iterations is unknown.
>>>
>>> Both patches are OK with some very minor nits (sorry):
>>
>> Thanks for the comments!  I've addressed them and commit refactor_cost.diff
>> via r11-2378.
>>
>> For adjust.diff, it works well on powerpc64le-linux-gnu (bootstrap/regtest),
>> but got one failure on aarch64 as below:
>>
>> PASS->FAIL: gcc.target/aarch64/sve/cost_model_2.c -march=armv8.2-a+sve  
>> scan-assembler-times \\tst1w\\tz 1
>>
>> I spent some time investigating it and thought it's expected.  As you 
>> mentioned
>> above, the patch can fix the cases with VF 1, the VF of this case is exactly 
>> one.  :)
>>
>> Without the proposed adjustment, the cost for V16QI looks like:
>>
>>   Vector inside of loop cost: 1
>>   Vector prologue cost: 4
>>   Vector epilogue cost: 3
>>   Scalar iteration cost: 4
>>   Scalar outside cost: 6
>>   Vector outside cost: 7
>>   prologue iterations: 0
>>   epilogue iterations: 0
>>
>> After the change, the cost becomes to:
>>
>>     Vector inside of loop cost: 1
>>     Vector prologue cost: 1
>>     Vector epilogue cost: 0
>>     Scalar iteration cost: 4
>>     Scalar outside cost: 6
>>     Vector outside cost: 1
>>     prologue iterations: 0
>>     epilogue iterations: 0
>>
>> which outperforms VNx16QI that isn't affected by this patch with partial 
>> vectors.
> 
> Hmm.  V16QI outperforming VNx16QI is kind-of bogus in this case,
> since the store is a full-vector store for both Advanced SIMD and
> 128-bit SVE.  But we do currently still generate the SVE loop in
> a more general form (since doing otherwise introduces other problems)
> so I guess the result is self-consistent.
> 
> More importantly, even if the costs were equal, V16QI would still
> win for 128-bit SVE.
> 
> So yeah, while I had my doubts at first, I agree this is the right fix.
> 

Thanks for the explanation!  Committed in r11-2453.

BR,
Kewen

Reply via email to