On Sun, Sep 17, 2017 at 4:41 PM, Kugan Vivekanandarajah
<kugan.vivekanandara...@linaro.org> wrote:
> Hi Andrew,
>
> On 15 September 2017 at 13:36, Andrew Pinski <pins...@gmail.com> wrote:
>> On Thu, Sep 14, 2017 at 6:33 PM, Kugan Vivekanandarajah
>> <kugan.vivekanandara...@linaro.org> wrote:
>>> This patch adds aarch64_loop_unroll_adjust to limit partial unrolling
>>> in rtl based on strided loads in the loop.
>>
>> Can you expand on this some more? Like give an example of where this
>> helps? I am trying to better understand your counting scheme, since
>> it seems the count is based on the number of loads and not on cache
>> lines.
>
> This is a simplified model: I am assuming that the prefetcher tunes
> itself based on the memory accesses it sees. I don't have access to
> the internals of how this is implemented in different
> microarchitectures, but I am assuming (in a simplified sense) that the
> hardware logic detects memory access patterns and uses them to
> prefetch cache lines. Memory accesses like the ones you show, which
> fall within the same cache line, may be combined, but the prefetcher
> still needs to detect them and tune for them, and detecting them at
> compile time is not always easy. So this is a simplified model.
>
>> What do you mean by a strided load?
>> Doesn't this function overcount when you have:
>> for (int i = 1; i < 1024; i++)
>> {
>>   t += a[i-1] * a[i];
>> }
>> if it is counting based on cache lines rather than based on load
>> addresses?
>
> Sorry for my terminology. What I mean by a strided access is any
> memory access of the form memory[iv]. I am counting memory[iv] and
> memory[iv+1] as two different streams; they may or may not fall into
> the same cache line.
>
>>
>> It also seems to do some weird counting when you have:
>> for (int i = 1; i < 1024; i++)
>> {
>>   t += a[(i-1)*N + i] * a[i*N + i];
>> }
>>
>> That is:
>> (PLUS (REG) (REG))
>>
>> It also seems to overcount when loading from the same pointer twice.
>
> If you prefer to count on a cache-line basis, then yes, I am counting
> it twice intentionally.
>
>>
>> In my micro-arch, the number of prefetch slots is based on cache-line
>> misses, so this would be overcounting by a factor of 2.
>
> I am not entirely sure this will be useful for all cores. It is shown
> to be beneficial for falkor, based on what is done in LLVM.
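To make the disagreement above concrete, here is a minimal standalone C sketch (not the patch itself) contrasting the two counting schemes being discussed: per-access counting, where a[i-1] and a[i] are two streams, versus cache-line counting, where accesses landing in the same line merge. The line size of 16 elements (64-byte line, 4-byte elements) is an assumption for illustration.

```c
#include <assert.h>

/* Assumed geometry: 64-byte cache line holding 4-byte elements.  */
#define CACHE_LINE_ELEMS 16

/* Per-access counting, as in the patch under discussion: every
   memory[iv + k] access is counted as its own stream, whether or not
   two accesses share a cache line.  */
int count_per_access (const int *offs, int n)
{
  (void) offs;
  return n;
}

/* Cache-line-based counting, as Andrew describes his micro-arch:
   accesses whose offsets fall in the same line are merged into one
   stream.  Offsets are assumed non-negative for simplicity.  */
int count_per_line (const int *offs, int n)
{
  int lines[64];
  int nlines = 0;
  for (int i = 0; i < n; i++)
    {
      int line = offs[i] / CACHE_LINE_ELEMS;
      int seen = 0;
      for (int j = 0; j < nlines; j++)
        if (lines[j] == line)
          {
            seen = 1;
            break;
          }
      if (!seen)
        lines[nlines++] = line;
    }
  return nlines;
}
```

For the t += a[i-1] * a[i] loop, the access offsets relative to the induction variable are {-1, 0}, normalized here to {0, 1}: per-access counting gives 2 streams, cache-line counting gives 1, which is the factor-of-2 overcount Andrew points out.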
Can you share at least one benchmark or microbenchmark which shows the
benefit? I can't tell how the falkor core's hardware prefetcher
behaves, so I can't see whether this is beneficial even there.

Thanks,
Andrew

> Thanks,
> Kugan
>>
>> Thanks,
>> Andrew
>>
>>>
>>> Thanks,
>>> Kugan
>>>
>>> gcc/ChangeLog:
>>>
>>> 2017-09-12  Kugan Vivekanandarajah  <kug...@linaro.org>
>>>
>>>     * cfgloop.h (iv_analyze_biv): Export.
>>>     * loop-iv.c (iv_analyze_biv): Likewise.
>>>     * config/aarch64/aarch64.c (strided_load_p): New.
>>>     (insn_has_strided_load): New.
>>>     (count_strided_load_rtl): New.
>>>     (aarch64_loop_unroll_adjust): New.
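For readers following along: the actual patch is not quoted in this thread, but the general shape of a loop_unroll_adjust hook like the one named in the ChangeLog can be sketched as below. The formula, the prefetch_slots parameter, and the function name adjust_unroll are assumptions for illustration, not the real aarch64_loop_unroll_adjust.

```c
#include <assert.h>

/* Hypothetical sketch of an unroll clamp: if unrolling NUNROLL times
   would create more concurrent load streams than the hardware
   prefetcher can track, reduce the unroll factor.  Each strided load
   in the loop body becomes one stream per unrolled copy, so the
   budget is prefetch_slots / strided_loads copies.  */
unsigned adjust_unroll (unsigned nunroll, unsigned strided_loads,
                        unsigned prefetch_slots)
{
  if (strided_loads == 0)
    return nunroll;  /* no strided loads: leave the factor alone */

  unsigned max_unroll = prefetch_slots / strided_loads;
  if (max_unroll == 0)
    max_unroll = 1;  /* never clamp below the non-unrolled body */

  return nunroll < max_unroll ? nunroll : max_unroll;
}
```

With, say, 8 prefetch slots and 2 strided loads in the body, a requested unroll factor of 8 would be clamped to 4; with no strided loads it is left untouched. Whether per-access or per-cache-line counting feeds strided_loads is exactly the question raised in the thread.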