On 08/02/2016 03:50 PM, vijay.kil...@gmail.com wrote:
+#define VEC_PREFETCH(base, index) \
+        asm volatile ("prfm pldl1strm, [%x[a]]\n" : : [a]"r"(&base[(index)]))

Is this not __builtin_prefetch(base + index) ?

I.e. you can define this generically for all targets.
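
Something like the following, as a sketch only; the 0, 0 arguments ask for a read prefetch with no temporal locality, which ought to map to roughly the same streaming hint as pldl1strm on aarch64, though I haven't checked the generated code:

  /* Generic version, assuming the same (base, index) usage as the patch.
     The compiler emits the appropriate prefetch instruction for the
     target, or nothing at all if the target has none. */
  #define VEC_PREFETCH(base, index) \
      __builtin_prefetch(&(base)[(index)], 0, 0)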

+#if defined (__aarch64__)
+    do_prefetch = is_thunder_pass2_cpu();
+    if (do_prefetch) {
+        VEC_PREFETCH(p, 8);
+        VEC_PREFETCH(p, 16);
+        VEC_PREFETCH(p, 24);
+    }
+#endif
+
     for (i = BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR;
          i < len / sizeof(VECTYPE);
          i += BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR) {
+
+#if defined (__aarch64__)
+        if (do_prefetch) {
+            VEC_PREFETCH(p, i+32);
+            VEC_PREFETCH(p, i+40);
+        }
+#endif
+

Surely we shouldn't be adding a conditional around a prefetch inside the inner loop. Does it really hurt to perform this prefetch for all aarch64 CPUs?

I'll note that you're also prefetching too much: the prefetches run off the end of the block. You'd need to break off the last iteration(s) of the loop to avoid that.

I'll also note that you're prefetching too close. The loop operates on units of 8 * vecsize, which on aarch64 is 128 bytes, so both i+32 and i+40 fall within the current iteration. I believe you want to prefetch at

  i + BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR * N

where N is the number of iterations in advance to be fetched. Probably N is 1 or 2, unless the memory controller is really slow.
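
Putting that together, something along these lines, as a sketch only. I've used plain unsigned long instead of VECTYPE and a fixed unroll of 8 so it stands alone, dropped the do_prefetch conditional, and picked N = 2 out of thin air; the tail loop runs the last iterations without prefetching so nothing is fetched past the end of the buffer:

  /* Sketch: prefetch N unrolled iterations ahead, unconditionally,
     and peel off the last iterations so the prefetch address never
     passes the end of the buffer. */
  #include <stddef.h>

  #define UNROLL   8   /* stands in for BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR */
  #define N_AHEAD  2   /* unrolled iterations to fetch in advance; tune this */

  /* Returns the index of the first nonzero word, or nwords if none. */
  static size_t find_nonzero_sketch(const unsigned long *p, size_t nwords)
  {
      size_t i = 0;
      /* Stop the main loop N_AHEAD iterations early. */
      size_t main_end = nwords > N_AHEAD * UNROLL
                      ? nwords - N_AHEAD * UNROLL : 0;

      for (; i + UNROLL <= main_end; i += UNROLL) {
          __builtin_prefetch(&p[i + N_AHEAD * UNROLL], 0, 0);
          for (size_t j = 0; j < UNROLL; j++) {
              if (p[i + j]) {
                  return i + j;
              }
          }
      }
      /* Tail: the last iterations, with no prefetch. */
      for (; i < nwords; i++) {
          if (p[i]) {
              return i;
          }
      }
      return nwords;
  }

The real function would of course keep its VECTYPE arithmetic; only the loop bound and the prefetch distance change.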


r~
