On Mon, Jan 30, 2017 at 3:49 PM, Maxim Kuvyrkov
<maxim.kuvyr...@linaro.org> wrote:
>> On Jan 27, 2017, at 6:59 PM, Andrew Pinski <apin...@cavium.com> wrote:
>>
>> On Fri, Jan 27, 2017 at 4:11 AM, Richard Biener
>> <richard.guent...@gmail.com> wrote:
>>> On Fri, Jan 27, 2017 at 1:10 PM, Richard Biener
>>> <richard.guent...@gmail.com> wrote:
>>>> On Thu, Jan 26, 2017 at 9:56 PM, Andrew Pinski <apin...@cavium.com> wrote:
>>>>> Hi,
>>>>>  This patch enables -fprefetch-loop-arrays for -mcpu=thunderxt88 and
>>>>> -mcpu=thunderxt88p1.  I filled out the tuning structures for both
>>>>> thunderx and thunderx2t99.  No other core currently enables software
>>>>> prefetching, so I set them to 0, which does not change the default
>>>>> parameters.
>>>>>
>>>>> OK?  Bootstrapped and tested on both ThunderX2 CN99xx and ThunderX
>>>>> CN88xx with no regressions.  I got a 2x improvement for 462.libquantum
>>>>> on CN88xx, and overall a 10% improvement on SPEC INT on CN88xx at -Ofast.
>>>>> CN99xx's SPEC did not change.
>>>>
>>>> Heh, quite impressive for this kind of bit-rotten (and broken?) pass ;)
>>>
>>> And I wonder if most of the benefit comes from the unrolling the pass
>>> might do rather than from the prefetches...
>>
>> Not in this case.  The main reason I know is that the number of
>> L1 and L2 misses drops a lot.
>
> I can confirm this.  In my experiments, loop unrolling hurts several tests.
>
> The prefetching approach I'm testing for -O2 includes disabling loop
> unrolling to prevent code bloat.
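
For reference, a rough sketch of how the option under discussion would be
exercised by hand; the cores and flags are the ones named above, but the
--param values are placeholders of mine, not the ones the ThunderX tuning
structures actually install:

  gcc -Ofast -mcpu=thunderxt88 -fprefetch-loop-arrays \
      --param prefetch-latency=400 \
      --param simultaneous-prefetches=4 \
      --param l1-cache-line-size=128 \
      test.c

The per-CPU tuning structures presumably install values for these same
parameters, which would be why leaving them at 0 for the other cores keeps
the defaults untouched.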

How do you get at the desired prefetching distance then?  Is it enough
to seed the HW prefetcher by prefetching once before the loop?
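
To make the distinction concrete, here is a rough sketch (mine, not from
the patch) of the two schemes at the source level: explicit prefetches at
a fixed distance inside the loop, which is roughly what
-fprefetch-loop-arrays emits, versus a single prefetch ahead of the loop
meant only to trigger the HW prefetcher.  The distance of 64 elements is
an arbitrary placeholder, not a tuned value.

  #define PREFETCH_DISTANCE 64

  /* In-loop prefetching at a fixed distance, similar in spirit to what
     the pass generates.  */
  void
  sum_in_loop (const double *a, double *out, long n)
  {
    double s = 0.0;
    for (long i = 0; i < n; i++)
      {
        /* Read prefetch, low temporal-locality hint, for the element we
           will need PREFETCH_DISTANCE iterations from now.  */
        __builtin_prefetch (&a[i + PREFETCH_DISTANCE], 0, 0);
        s += a[i];
      }
    *out = s;
  }

  /* One prefetch before the loop to "seed" the stream, relying on the
     HW prefetcher to stay ahead afterwards.  */
  void
  sum_seed_hw (const double *a, double *out, long n)
  {
    __builtin_prefetch (&a[0], 0, 0);
    double s = 0.0;
    for (long i = 0; i < n; i++)
      s += a[i];
    *out = s;
  }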

> --
> Maxim Kuvyrkov
> www.linaro.org
