As part of my netmap investigations I was looking at how
expensive memory copies are, and here are a couple of findings
(the first one is obvious, the second one less so).

1. Especially on 64-bit machines, always use multiples of at
   least 8 bytes (possibly even larger units). The bcopy code
   on amd64 seems to waste an extra 20ns (on a 3.4 GHz machine)
   when processing blocks of size 8n + {4,5,6,7}.
   The difference is significant; on that machine I measured

        bcopy(src, dst,  1) ~12.9ns     (data in L1 cache)
        bcopy(src, dst,  3) ~12.9ns     (data in L1 cache)
        bcopy(src, dst,  4) ~33.4ns     (data in L1 cache) <--- NOTE
        bcopy(src, dst, 32) ~12.9ns     (data in L1 cache)
        bcopy(src, dst, 63) ~33.4ns     (data in L1 cache) <--- NOTE
        bcopy(src, dst, 64) ~12.9ns     (data in L1 cache)
   Note how the two marked lines are much slower than the others.
   The same thing happens with data not in the L1 cache:

        bcopy(src, dst, 64) ~ 22ns      (not in L1)
        bcopy(src, dst, 63) ~ 44ns      (not in L1)
                ...

   Continuing the tests with larger sizes (relevant to the next item):
        bcopy(src, dst,256) ~19.8ns     (data in L1 cache)
        bcopy(src, dst,512) ~28.8ns     (data in L1 cache)
        bcopy(src, dst,1K)  ~39.6ns     (data in L1 cache)
        bcopy(src, dst,4K)  ~95.2ns     (data in L1 cache)


   An older P4 running FreeBSD 4/32-bit seems less sensitive
   to odd operand sizes.
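
   For what it's worth, the numbers above come from a tight-loop
   microbenchmark. A minimal sketch of such a harness (not the exact
   code used here; it uses memcpy, with bcopy as a drop-in on BSD,
   and a GCC asm barrier to keep the compiler from hoisting the copy)
   could be:

        /* Sketch of a timing harness; build with -O2. */
        #include <stdio.h>
        #include <string.h>
        #include <time.h>

        #define ITERS 1000000L

        static char src[4096], dst[4096];

        static double
        ns_per_copy(size_t len)
        {
                struct timespec t0, t1;
                long i;

                clock_gettime(CLOCK_MONOTONIC, &t0);
                for (i = 0; i < ITERS; i++) {
                        memcpy(dst, src, len);  /* bcopy(src, dst, len) on BSD */
                        /* keep the compiler from eliding repeated copies */
                        asm volatile("" : : "r"(dst) : "memory");
                }
                clock_gettime(CLOCK_MONOTONIC, &t1);
                return ((t1.tv_sec - t0.tv_sec) * 1e9 +
                    (t1.tv_nsec - t0.tv_nsec)) / (double)ITERS;
        }

        int
        main(void)
        {
                size_t sizes[] = { 1, 3, 4, 32, 63, 64 };
                size_t i;

                for (i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++)
                        printf("bcopy len %3zu: ~%.1f ns\n", sizes[i],
                            ns_per_copy(sizes[i]));
                return 0;
        }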

2. Apparently, bcopy is not the fastest way to copy memory.
   For small blocks whose size is a multiple of 32-64 bytes, I noticed
   that the following is a lot faster (breaking even at about 1 KByte):

        /* XXX copies in 32-byte chunks: l is rounded up to a
         * multiple of 32, and the regions must not overlap. */
        static inline void
        fast_bcopy(void *_src, void *_dst, int l)
        {
                uint64_t *src = _src;
                uint64_t *dst = _dst;

                for (; l > 0; l -= 32) {
                        *dst++ = *src++;
                        *dst++ = *src++;
                        *dst++ = *src++;
                        *dst++ = *src++;
                }
        }

        fast_bcopy(src, dst, 32) ~ 1.8ns        (data in L1 cache)
        fast_bcopy(src, dst, 64) ~ 2.9ns        (data in L1 cache)
        fast_bcopy(src, dst,256) ~10.1ns        (data in L1 cache)
        fast_bcopy(src, dst,512) ~19.5ns        (data in L1 cache)
        fast_bcopy(src, dst,1K)  ~38.4ns        (data in L1 cache)
        fast_bcopy(src, dst,4K) ~152.0ns        (data in L1 cache)

        fast_bcopy(src, dst, 32) ~15.3ns        (not in L1)
        fast_bcopy(src, dst,256) ~38.7ns        (not in L1)
                ...

   The old P4/32-bit machine shows similar results.

Conclusion: if you have to copy packets, you might be better off
padding the length to a multiple of 32 and using the following
function to get the best of both worlds.

Sprinkle some prefetch() for better taste.

        // XXX only for multiples of 32 bytes, non overlapped.
        static inline void
        good_bcopy(void *_src, void *_dst, int l)
        {
                uint64_t *src = _src;
                uint64_t *dst = _dst;
        #define likely(x)       __builtin_expect(!!(x), 1)
        #define unlikely(x)     __builtin_expect(!!(x), 0)
                if (unlikely(l >= 1024)) {
                        bcopy(src, dst, l);
                        return;
                }
                for (; l > 0; l-=32) {
                        *dst++ = *src++;
                        *dst++ = *src++;
                        *dst++ = *src++;
                        *dst++ = *src++;
                }
        }
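
For the record, a version with the prefetch sprinkled in might look
like this. This is only a sketch using GCC's __builtin_prefetch
(roughly what the kernel's prefetch() boils down to on amd64), with
the same constraints as above: length a multiple of 32 and
non-overlapping buffers.

        #include <stdint.h>

        /* XXX only for multiples of 32 bytes, non overlapped. */
        static inline void
        pf_bcopy(void *_src, void *_dst, int l)
        {
                uint64_t *src = _src;
                uint64_t *dst = _dst;

                for (; l > 0; l -= 32) {
                        /* start fetching the next 64-byte line early */
                        __builtin_prefetch(src + 8);
                        *dst++ = *src++;
                        *dst++ = *src++;
                        *dst++ = *src++;
                        *dst++ = *src++;
                }
        }

A caller padding a packet length l can round it up with
((l + 31) & ~31) before passing it in.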

cheers
luigi
_______________________________________________
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current