On 08/16/2016 04:45 PM, Vijay Kilari wrote:
On Tue, Aug 16, 2016 at 11:32 PM, Richard Henderson <r...@twiddle.net> wrote:
On 08/16/2016 05:02 AM, vijay.kil...@gmail.com wrote:

+static inline void prefetch_vector_loop(const VECTYPE *p, int index)
+{
+#if defined(__aarch64__)
+    if (is_thunderx_pass2_cpu()) {
+        /* Prefetch 4 cache lines ahead from index */
+        VEC_PREFETCH(p, index + (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR * 4));
+    }
+#endif
+}


Oh come now.  This is even worse than before.  A function call protecting a
mere prefetch within the main body of an inner loop?

Did you not understand what I was asking for?

No. Could you please detail the problem?

The thunderx check, *if it even needs to exist at all*, must happen outside the loop. Preferably not more than once, at startup time.
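
A minimal sketch of that arrangement, assuming GCC or Clang (for __builtin_prefetch). Every name here is hypothetical and not taken from the patch: the CPU probe is replaced by a plain boolean argument, and PREFETCH_AHEAD is an illustrative distance, not a tuned value. The point is only that the decision is made once and the inner loops contain no check.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative prefetch distance in elements; not a tuned value. */
#define PREFETCH_AHEAD 64

/* Plain scan: true if all n words are zero. */
static bool is_zero_plain(const uint64_t *p, size_t n)
{
    uint64_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc |= p[i];
    }
    return acc == 0;
}

/* Same scan with an unconditional prefetch in the main loop.
 * The tail is handled separately so the prefetched address
 * always lies inside the buffer. */
static bool is_zero_prefetch(const uint64_t *p, size_t n)
{
    uint64_t acc = 0;
    size_t i = 0;
    for (; i + PREFETCH_AHEAD < n; i++) {
        __builtin_prefetch(&p[i + PREFETCH_AHEAD]);
        acc |= p[i];
    }
    for (; i < n; i++) {
        acc |= p[i];
    }
    return acc == 0;
}

typedef bool (*is_zero_fn)(const uint64_t *, size_t);

/* Called once at startup. want_prefetch stands in for whatever
 * CPU probe is used (hypothetical; the probe itself is not shown). */
static is_zero_fn select_is_zero(bool want_prefetch)
{
    return want_prefetch ? is_zero_prefetch : is_zero_plain;
}
```

Callers would store the result of select_is_zero() in a static function pointer during initialization and call through it afterwards, so the probe never runs on the hot path.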

I strongly suspect that you do not need any check at all: even on CPUs that automatically detect the streaming loop, adding a prefetch will not hurt.

You should repeat your same benchmark, with and without the prefetch, on (1) an A57 or suchlike, and (2) an x86 of some variety.
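
One way such a comparison could be set up, as a rough sketch rather than the benchmark used for the patch: build the same file twice, with -DUSE_PREFETCH=1 and -DUSE_PREFETCH=0, and compare the timings on each machine. The names (USE_PREFETCH, PREFETCH_AHEAD, buffer_all_zero, time_scans) are hypothetical, the prefetch distance is not tuned, and GCC or Clang is assumed for __builtin_prefetch and the inline asm barrier.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

#ifndef USE_PREFETCH
#define USE_PREFETCH 1   /* build with -DUSE_PREFETCH=0 for the baseline */
#endif

#if USE_PREFETCH
#define PREFETCH(addr) __builtin_prefetch(addr)
#else
#define PREFETCH(addr) ((void)0)
#endif

/* Illustrative prefetch distance in elements; not a tuned value. */
#define PREFETCH_AHEAD 64

/* True if all n words are zero. No CPU check anywhere in the loop. */
static bool buffer_all_zero(const uint64_t *p, size_t n)
{
    uint64_t acc = 0;
    size_t i = 0;
    for (; i + PREFETCH_AHEAD < n; i++) {
        PREFETCH(&p[i + PREFETCH_AHEAD]);
        acc |= p[i];
    }
    for (; i < n; i++) {
        acc |= p[i];
    }
    return acc == 0;
}

/* Scan the buffer reps times; return elapsed CPU seconds and store
 * whether every scan found the buffer to be all zero. */
static double time_scans(const uint64_t *p, size_t n, int reps,
                         bool *all_zero)
{
    bool result = true;
    clock_t start = clock();
    for (int r = 0; r < reps; r++) {
        /* Compiler barrier so the scan is not hoisted out of the loop. */
        __asm__ volatile("" ::: "memory");
        result = buffer_all_zero(p, n) && result;
    }
    *all_zero = result;
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

The buffer should be much larger than the last-level cache for the prefetch to matter at all, and each configuration should be run several times, since a single timing is noisy.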


r~
