On 08/02/2016 03:50 PM, vijay.kil...@gmail.com wrote:
+#define VEC_PREFETCH(base, index) \
+        asm volatile ("prfm pldl1strm, [%x[a]]\n" : : [a]"r"(&base[(index)]))

Is this not __builtin_prefetch(base + index) ?

I.e. you can define this generically for all targets.
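
Something like the following, as a sketch only; the 0, 0 arguments ask for a read prefetch with no temporal locality, which ought to map to roughly the same streaming hint as pldl1strm on aarch64, though I haven't checked the generated code:

  /* Generic version, assuming the same (base, index) usage as the patch.
     The compiler emits the appropriate prefetch instruction for the
     target, or nothing at all if the target has none. */
  #define VEC_PREFETCH(base, index) \
      __builtin_prefetch(&(base)[(index)], 0, 0)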

+#if defined (__aarch64__)
+    do_prefetch = is_thunder_pass2_cpu();
+    if (do_prefetch) {
+        VEC_PREFETCH(p, 8);
+        VEC_PREFETCH(p, 16);
+        VEC_PREFETCH(p, 24);
+    }
+#endif
+
     for (i = BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR;
          i < len / sizeof(VECTYPE);
          i += BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR) {
+
+#if defined (__aarch64__)
+        if (do_prefetch) {
+            VEC_PREFETCH(p, i+32);
+            VEC_PREFETCH(p, i+40);
+        }
+#endif
+

Surely we shouldn't be adding a conditional around a prefetch inside the inner loop. Does it really hurt to perform this prefetch for all aarch64 CPUs?

I'll note that you're also prefetching too much: the prefetches run off the end of the block. You'd need to break off the last iteration(s) of the loop to avoid that.

I'll also note that you're prefetching too close. The loop operates on units of 8 * vecsize, which on aarch64 is 128 bytes, so both i+32 and i+40 fall within the current iteration. I believe you want to prefetch at

  i + BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR * N

where N is the number of iterations in advance to be fetched. Probably N is 1 or 2, unless the memory controller is really slow.
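
Putting that together, something along these lines, as a sketch only. I've used plain unsigned long instead of VECTYPE and a fixed unroll of 8 so it stands alone, dropped the do_prefetch conditional, and picked N = 2 out of thin air; the tail loop runs the last iterations without prefetching so nothing is fetched past the end of the buffer:

  /* Sketch: prefetch N unrolled iterations ahead, unconditionally,
     and peel off the last iterations so the prefetch address never
     passes the end of the buffer. */
  #include <stddef.h>

  #define UNROLL   8   /* stands in for BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR */
  #define N_AHEAD  2   /* unrolled iterations to fetch in advance; tune this */

  /* Returns the index of the first nonzero word, or nwords if none. */
  static size_t find_nonzero_sketch(const unsigned long *p, size_t nwords)
  {
      size_t i = 0;
      /* Stop the main loop N_AHEAD iterations early. */
      size_t main_end = nwords > N_AHEAD * UNROLL
                      ? nwords - N_AHEAD * UNROLL : 0;

      for (; i + UNROLL <= main_end; i += UNROLL) {
          __builtin_prefetch(&p[i + N_AHEAD * UNROLL], 0, 0);
          for (size_t j = 0; j < UNROLL; j++) {
              if (p[i + j]) {
                  return i + j;
              }
          }
      }
      /* Tail: the last iterations, with no prefetch. */
      for (; i < nwords; i++) {
          if (p[i]) {
              return i;
          }
      }
      return nwords;
  }

The real function would of course keep its VECTYPE arithmetic; only the loop bound and the prefetch distance change.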


r~
