luigi wrote:

> even more orthogonal:
>
> I found that copying 8n + (5, 6 or 7) bytes was much much slower than
> copying a multiple of 8 bytes. For n=0, 1, 2, 4, 8 bytes are efficient,
> other cases are slow (turned into 2 or 3 different writes).
>
> The netmap code uses a pkt_copy routine that does exactly this
> rounding, gaining some 10-20ns per packet for small sizes.
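For reference, the rounding described in the quote can be sketched in C roughly as follows. This is only a sketch, not the netmap pkt_copy source: the names round_up8 and copy_round8 and the cap parameter are hypothetical, and it assumes that both buffers are padded to a multiple of 8 bytes and do not overlap, so that copying up to 7 bytes past len is harmless.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: round len up to the next multiple of 8. */
static size_t
round_up8(size_t len)
{
	return ((len + 7) & ~(size_t)7);
}

/*
 * Hypothetical sketch: copy len bytes as whole 8-byte chunks.
 * Assumption: both buffers hold at least cap bytes, cap is at least
 * round_up8(len), and the buffers do not overlap.  Up to 7 bytes
 * beyond len are copied as well.
 */
static void
copy_round8(void *dst, const void *src, size_t len, size_t cap)
{
	unsigned char *d = dst;
	const unsigned char *s = src;
	size_t n = round_up8(len);
	size_t i;

	assert(n <= cap);
	for (i = 0; i < n; i += 8)
		memcpy(d + i, s + i, 8);	/* fixed size, usually inlined */
}
```

With len = 6 this does one 8-byte copy instead of a 4-byte copy followed by a 2-byte copy. Whether that saves anything measurable is the question discussed below.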
I don't believe 10-20ns for just the extra bytes. memcpy() ends up with a movsb to copy the extra bytes. This can be slow, but I don't believe 10-20ns (except on machines running at i486 speeds, of course).

% ENTRY(memcpy)
% 	pushl	%edi
% 	pushl	%esi
% 	movl	12(%esp),%edi
% 	movl	16(%esp),%esi
% 	movl	20(%esp),%ecx
% 	movl	%edi,%eax
% 	shrl	$2,%ecx				/* copy by 32-bit words */
% 	cld					/* nope, copy forwards */
% 	rep
% 	movsl
% 	movl	20(%esp),%ecx
% 	andl	$3,%ecx				/* any bytes left? */

This avoids a branch. Some optimization manuals say that the branch is actually better for some machines. The above 2 instructions have a throughput of 1 per cycle each on modern x86. Latency might be 6 cycles.

% 	rep

Maybe 5-15 cycles throughput.

% 	movsb

Now hopefully at most 1 cycle/byte. Some hardware might combine the bytes as much as possible, so the whole function should use a single "rep movsb" and let the hardware do it all.

% 	popl	%esi
% 	popl	%edi
% 	ret

Well, it's easy to get a latency of 20 cycles (5-10 ns) and maybe even a throughput of that. But all of this is out of order on modern x86. The extra cycles for the movsb might not cost anything if nothing soon accesses the part of the target that they were written to.

With builtin memcpy, 6 bytes would be done using load/store of 4+2 bytes and thus take the same time as 8 bytes (4+4) on i386, but on amd64 8 bytes (a single 64-bit load/store) would be faster.

Bruce
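For reference, the word/tail split that the assembly memcpy performs can be written out in C roughly as follows. This is only an illustrative sketch, not the libc source: the function name copy_words_then_bytes is hypothetical, and the comments map each step onto the instructions quoted above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical C rendering of the i386 memcpy shown above. */
static void *
copy_words_then_bytes(void *dst, const void *src, size_t len)
{
	unsigned char *d = dst;
	const unsigned char *s = src;
	size_t words = len >> 2;	/* shrl $2,%ecx: 32-bit word count */
	size_t tail = len & 3;		/* andl $3,%ecx: 0-3 leftover bytes */

	assert(d != NULL && s != NULL);
	while (words-- > 0) {		/* rep movsl */
		uint32_t w;

		memcpy(&w, s, sizeof(w));
		memcpy(d, &w, sizeof(w));
		s += 4;
		d += 4;
	}
	while (tail-- > 0)		/* rep movsb, executed even if tail is 0 */
		*d++ = *s++;
	return (dst);			/* movl %edi,%eax */
}
```

For len = 6 the split is one 32-bit word plus 2 tail bytes; for len = 8 it is two words and an empty tail, so the only difference between the two cases is the work done by the final byte loop.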