On 08/16/2016 04:45 PM, Vijay Kilari wrote:
On Tue, Aug 16, 2016 at 11:32 PM, Richard Henderson <r...@twiddle.net> wrote:
On 08/16/2016 05:02 AM, vijay.kil...@gmail.com wrote:
+static inline void prefetch_vector_loop(const VECTYPE *p, int index)
+{
+#if defined(__aarch64__)
+ if (is_thunderx_pass2_cpu()) {
+ /* Prefetch 4 cache lines ahead from index */
+ VEC_PREFETCH(p, index + (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR * 4));
+ }
+#endif
+}
Oh come now. This is even worse than before. A function call protecting a
mere prefetch within the main body of an inner loop?
Did you not understand what I was asking for?
No. Could you please detail the problem?
The thunderx check, *if it even needs to exist at all*, must happen outside the
loop. Preferably not more than once, at startup time.
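
To illustrate what "outside the loop" could look like, here is a minimal sketch, not the actual patch. The names cpu_wants_prefetch(), init_prefetch_policy(), first_nonzero_index() and PREFETCH_AHEAD are hypothetical, the probe is a stub standing in for whatever the real CPU check would be, and __builtin_prefetch assumes GCC or Clang. The probe runs once at startup; the loop only reads a cached, loop-invariant flag.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: prefetch distance in elements, chosen only for illustration. */
#define PREFETCH_AHEAD 32

/* Hypothetical stand-in for the CPU probe; a real one would inspect a CPU ID. */
static bool cpu_wants_prefetch(void)
{
    return true;
}

/* Cached decision, written once at startup and only read afterwards. */
static bool need_prefetch;

/* Call once, before any buffer is scanned. */
static void init_prefetch_policy(void)
{
    need_prefetch = cpu_wants_prefetch();
}

/* Return the index of the first nonzero element, or n if all are zero. */
static size_t first_nonzero_index(const uint64_t *p, size_t n)
{
    /* Loop-invariant local copy: no function call inside the loop. */
    const bool prefetch = need_prefetch;

    for (size_t i = 0; i < n; i++) {
        /* Stay inside the buffer so the address computation is valid. */
        if (prefetch && i + PREFETCH_AHEAD < n) {
            __builtin_prefetch(&p[i + PREFETCH_AHEAD]);
        }
        if (p[i] != 0) {
            return i;
        }
    }
    return n;
}
```

Because the flag does not change inside the loop, the compiler is free to hoist or unswitch the test; whether it actually does depends on the compiler and optimization level.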
I strongly suspect that you do not need any check at all. That even for cpus
which automatically detect the streaming loop, adding a prefetch will not hurt.
You should repeat your same benchmark, with and without the prefetch, on (1) an
A57 or suchlike, and (2) an x86 of some variety.
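
A rough sketch of such a comparison, under stated assumptions: scan_plain(), scan_prefetch(), time_scan() and PREFETCH_DISTANCE are hypothetical names, the scan is a simplified scalar loop rather than the vectorized one in the patch, and __builtin_prefetch assumes GCC or Clang. The same harness would be built and run unchanged on each machine.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Assumption: prefetch distance in elements, for illustration only. */
#define PREFETCH_DISTANCE 32

/* Baseline: scan for the first nonzero element without any prefetch. */
static size_t scan_plain(const uint64_t *p, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (p[i] != 0) {
            return i;
        }
    }
    return n;
}

/* Same scan with an unconditional prefetch, no CPU check at all. */
static size_t scan_prefetch(const uint64_t *p, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n) {
            __builtin_prefetch(&p[i + PREFETCH_DISTANCE]);
        }
        if (p[i] != 0) {
            return i;
        }
    }
    return n;
}

/* Run one variant `reps` times; return elapsed CPU seconds. */
static double time_scan(size_t (*scan)(const uint64_t *, size_t),
                        const uint64_t *p, size_t n, int reps,
                        size_t *result)
{
    volatile size_t sink = 0;   /* keeps the calls from being optimized out */
    clock_t start = clock();

    for (int r = 0; r < reps; r++) {
        sink = scan(p, n);
    }
    *result = sink;
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling time_scan() with scan_plain and then scan_prefetch over a buffer much larger than the last-level cache gives the two numbers to compare per machine.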
r~