On Thu, Jan 24, 2013 at 6:35 PM, Luigi Rizzo <ri...@iet.unipi.it> wrote:
> On Thu, Jan 24, 2013 at 09:54:19AM +0100, Stefan Hajnoczi wrote:
>> On Wed, Jan 23, 2013 at 06:55:59PM -0800, Luigi Rizzo wrote:
>> > On Wed, Jan 23, 2013 at 8:03 AM, Luigi Rizzo <ri...@iet.unipi.it> wrote:
>> >
>> > > > I'm even doubtful that it's always a win on FreeBSD. You have a
>> > > > threshold to fall back to bcopy() and who knows what the "best" value
>> > > > for various CPUs is.
>> > >
>> > > indeed.
>> > > With the attached program (which however might be affected by the
>> > > fact that data is not used after copying) it seems that on a recent
>> > > linux (using gcc 4.6.2) the fastest is __builtin_memcpy()
>> > >
>> > > ./testlock -m __builtin_memcpy -l 64
>> > >
>> > > (by a factor of 2 or more) whereas all the other methods have
>> > > approximately the same speed.
>> > >
>> >
>> > never mind, pilot error. in my test program i had swapped the
>> > arguments to __builtin_memcpy(). With the correct ones,
>> > __builtin_memcpy() == bcopy == memcpy on both machines,
>> > and never faster than the pkt_copy().
>>
>> Are the bcopy()/memcpy() calls given a length that is a multiple of 64 bytes?
>>
>> IIUC pkt_copy() assumes 64-byte multiple lengths and that optimization
>> can be matched with memcpy(dst, src, (len + 63) & ~63). Maybe it helps and
>> at least ensures they are doing equal amounts of byte copying.
>
> the length is a parameter from the command line.
> For short packets, at least on the i7-2600 and freebsd the pkt_copy()
> is only slightly faster than memcpy on multiples of 64, and *a lot*
> faster when the length is not a multiple.

How about dropping pkt_copy() and instead rounding the memcpy() length
up to the next 64-byte multiple?

Using memcpy() is more future-proof IMO, that's why I'm pushing for this.

Stefan
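
To make the suggestion concrete, here is a minimal sketch of the rounded-length memcpy(). It is not code from the patch under discussion: round_up_64() and copy_rounded() are hypothetical names used only for illustration. It assumes both buffers are allocated with at least the rounded-up size, since the copy deliberately touches up to 63 bytes beyond len.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define COPY_GRANULE 64 /* copy granularity discussed in the thread */

/* Round len up to the next multiple of 64, i.e. (len + 63) & ~63. */
static size_t round_up_64(size_t len)
{
    return (len + (COPY_GRANULE - 1)) & ~(size_t)(COPY_GRANULE - 1);
}

/*
 * Hypothetical helper: copy len bytes using a length padded to a
 * 64-byte multiple. ASSUMPTION: dst and src each have room for
 * round_up_64(len) bytes and do not overlap; otherwise this reads
 * and writes past the end of the buffers.
 */
static void copy_rounded(void *dst, const void *src, size_t len)
{
    size_t padded = round_up_64(len);

    assert(padded >= len); /* guard against size_t wrap-around */
    memcpy(dst, src, padded);
}
```

A caller with a 60-byte packet would end up copying 64 bytes, so the buffer padding requirement is the price for keeping the copy length a multiple of 64. Whether this matches pkt_copy() in speed would still need to be measured with the test program.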