On Wed, Jan 23, 2013 at 8:03 AM, Luigi Rizzo <ri...@iet.unipi.it> wrote:
> > I'm even doubtful that it's always a win on FreeBSD. You have a
> > threshold to fall back to bcopy() and who knows what the "best" value
> > for various CPUs is.
>
> indeed.
> With the attached program (which however might be affected by the
> fact that data is not used after copying) it seems that on a recent
> linux (using gcc 4.6.2) the fastest is __builtin_memcpy()
>
>     ./testlock -m __builtin_memcpy -l 64
>
> (by a factor of 2 or more) whereas all the other methods have
> approximately the same speed.

never mind, pilot error: in my test program I had swapped the source
and destination operands for __builtin_memcpy(), and this substantially
changed the memory access pattern.

With the correct operands, __builtin_memcpy() == memcpy() == bcopy()
on both FreeBSD and Linux, and none of them is ever faster than
pkt_copy(). On FreeBSD pkt_copy() (the unrolled loop) still beats the
other methods for small packets, whereas on Linux they are equivalent.

If you are curious why swapping source and dst changed things so
dramatically: the test was supposed to read from a large chunk of
memory (over 1GB) to avoid always hitting L1 or L2. Swapping the
operands causes the reads to always hit the same line, thus saving
a lot of misses. The difference between the two machines is then
probably due to how the cache is used on writes.

cheers
luigi

-- 
-----------------------------------------+-------------------------------
 Prof. Luigi RIZZO, ri...@iet.unipi.it  . Dip. di Ing. dell'Informazione
 http://www.iet.unipi.it/~luigi/        . Universita` di Pisa
 TEL      +39-050-2211611               . via Diotisalvi 2
 Mobile   +39-338-6809875               . 56122 PISA (Italy)
-----------------------------------------+-------------------------------
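
For readers who want to see the effect of the swapped operands in isolation, here is a minimal sketch in C. It is not the attached test program: the function name lines_read(), the 64-byte line size, and the buffer layout (one large buffer walked packet by packet, one small fixed buffer) are all assumptions made for illustration. It performs the copies and counts how many distinct line-sized chunks the reads touch, with line indices taken relative to the start of the buffer being read.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define LINE 64	/* assumed cache line size in bytes */

/*
 * Copy n packets of len bytes and return the number of distinct
 * line-sized chunks that the reads touch.
 *
 * swapped == 0: memcpy(small, big + off, len), reads walk through `big`
 * swapped == 1: memcpy(big + off, small, len), reads always hit `small`
 *
 * Returns 0 if memory allocation fails.
 */
static size_t
lines_read(size_t big_size, size_t len, size_t n, int swapped)
{
	char *big, *small;
	unsigned char *seen;
	size_t off = 0, distinct = 0, i, l, first, last;

	assert(len > 0 && len <= big_size);

	big = malloc(big_size);
	small = malloc(len);
	seen = calloc(big_size / LINE + 1, 1);
	if (big == NULL || small == NULL || seen == NULL) {
		free(big);
		free(small);
		free(seen);
		return 0;
	}
	memset(big, 1, big_size);
	memset(small, 2, len);

	for (i = 0; i < n; i++) {
		if (swapped)
			memcpy(big + off, small, len);
		else
			memcpy(small, big + off, len);

		/* Lines covered by this read, relative to the read buffer. */
		first = swapped ? 0 : off / LINE;
		last = swapped ? (len - 1) / LINE : (off + len - 1) / LINE;
		for (l = first; l <= last; l++) {
			if (!seen[l]) {
				seen[l] = 1;
				distinct++;
			}
		}

		/* Advance to the next packet slot, wrapping at the end. */
		off += len;
		if (off + len > big_size)
			off = 0;
	}

	free(big);
	free(small);
	free(seen);
	return distinct;
}
```

With a 1 MB buffer and 1000 copies of 64 bytes, lines_read(1 << 20, 64, 1000, 0) counts 1000 distinct lines read, while the swapped variant lines_read(1 << 20, 64, 1000, 1) counts just one. The sketch only models which addresses are read; it says nothing about timing or about how each machine handles the writes.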