On Sat, Aug 6, 2016 at 3:47 PM, Richard Henderson <r...@twiddle.net> wrote:
> On 08/02/2016 03:50 PM, vijay.kil...@gmail.com wrote:
>>
>> +#define VEC_PREFETCH(base, index) \
>> +        asm volatile ("prfm pldl1strm, [%x[a]]\n" : : [a]"r"(&base[(index)]))
>
> Is this not __builtin_prefetch(base + index)?
>
> I.e. you can define this generically for all targets.
__builtin_prefetch() is available only in gcc 5.3 for arm64.

>> +#if defined (__aarch64__)
>> +    do_prefetch = is_thunder_pass2_cpu();
>> +    if (do_prefetch) {
>> +        VEC_PREFETCH(p, 8);
>> +        VEC_PREFETCH(p, 16);
>> +        VEC_PREFETCH(p, 24);
>> +    }
>> +#endif
>> +
>>     for (i = BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR;
>>          i < len / sizeof(VECTYPE);
>>          i += BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR) {
>> +
>> +#if defined (__aarch64__)
>> +        if (do_prefetch) {
>> +            VEC_PREFETCH(p, i+32);
>> +            VEC_PREFETCH(p, i+40);
>> +        }
>> +#endif
>> +
>
> Surely we shouldn't be adding a conditional around a prefetch inside of
> the inner loop. Does it really hurt to perform this prefetch for all
> aarch64 cpus?

The prefetch is only required for the ThunderX pass 2 platform. I will
remove this condition check.

> I'll note that you're also prefetching too much, off the end of the
> block, and that you're probably not prefetching far enough. You'd need
> to break off the last iteration(s) of the loop.
>
> I'll note that you're also prefetching too close. The loop operates on
> 8*vecsize units. In the case of aarch64, 128 byte units. Both i+32 and
> i+40 are within the current loop.

The 128-byte unit is specific to ThunderX; I will move this to a
ThunderX-specific function. I am also dropping the i+40 prefetch.

> I believe you want to prefetch at
>
>   i + BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR * N
>
> where N is the number of iterations in advance to be fetched. Probably
> N is 1 or 2, unless the memory controller is really slow.
>
> r~