> On Jan 27, 2017, at 6:59 PM, Andrew Pinski <apin...@cavium.com> wrote:
>
> On Fri, Jan 27, 2017 at 4:11 AM, Richard Biener
> <richard.guent...@gmail.com> wrote:
>> On Fri, Jan 27, 2017 at 1:10 PM, Richard Biener
>> <richard.guent...@gmail.com> wrote:
>>> On Thu, Jan 26, 2017 at 9:56 PM, Andrew Pinski <apin...@cavium.com> wrote:
>>>> Hi,
>>>>   This patch enables -fprefetch-loop-arrays for -mcpu=thunderxt88 and
>>>> -mcpu=thunderxt88p1.  I filled out the tuning structures for both
>>>> thunderx and thunderx2t99.  No other core currently enables software
>>>> prefetching, so I set them to 0, which does not change the default
>>>> parameters.
>>>>
>>>> OK?  Bootstrapped and tested on both ThunderX2 CN99xx and ThunderX
>>>> CN88xx with no regressions.  I got a 2x improvement for 462.libquantum
>>>> on CN88xx and overall a 10% improvement on SPEC INT on CN88xx at -Ofast.
>>>> CN99xx's SPEC did not change.
>>>
>>> Heh, quite impressive for this kind of bit-rotten (and broken?) pass ;)
>>
>> And I wonder if most of the benefit comes from the unrolling the pass might do
>> rather than from the prefetches...
>
> Not in this case.  The main reason I know is that the number of
> L1 and L2 misses drops a lot.
I can confirm this.  In my experiments loop unrolling hurts several tests.
The prefetching approach I'm testing for -O2 includes disabling loop
unrolling to prevent code bloat.

--
Maxim Kuvyrkov
www.linaro.org
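
[Editor's note: for readers unfamiliar with the pass under discussion, below
is a minimal sketch of the kind of transformation -fprefetch-loop-arrays
performs.  The saxpy loop and the prefetch distance of 64 elements are
illustrative placeholders, not taken from the patch or from any ThunderX
tuning values; __builtin_prefetch is the GCC built-in the pass lowers to.]

/* Illustrative sketch only.  For a loop that streams through arrays,
   the pass inserts prefetches a fixed number of iterations ahead of
   the current access so the data is in cache by the time it is used.
   The distance below is an arbitrary placeholder.  */

#define PREFETCH_DISTANCE 64

void
saxpy (float *restrict y, const float *restrict x, float a, long n)
{
  for (long i = 0; i < n; i++)
    {
      /* Roughly what the pass emits ahead of the load of x[i] and the
         read-modify-write of y[i].  Second argument of __builtin_prefetch:
         0 = prefetch for read, 1 = prefetch for write; third argument is
         the temporal-locality hint (3 = keep in all cache levels).  */
      __builtin_prefetch (&x[i + PREFETCH_DISTANCE], 0, 3);
      __builtin_prefetch (&y[i + PREFETCH_DISTANCE], 1, 3);
      y[i] = a * x[i] + y[i];
    }
}

[This is also why the unrolling question above matters: the pass may unroll
the loop so that only one prefetch per cache line is issued, and that
unrolling, rather than the prefetches themselves, could account for part of
a speedup.]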