On 22/10/2015 16:37, Eric Blake wrote:
>> > +  /* Check first 16 bytes manually.  */
>> > +  for (len = 0; len < 16; len++)
>> > +    {
>> > +      if (! bufsize)
>> > +        return true;
>> > +      if (*p)
>> > +        return false;
>> > +      p++;
>> > +      bufsize--;
>> > +    }
>> > +
>> > +  /* Now we know that's zero, memcmp with self.  */
>> > +  return memcmp (buf, p, bufsize) == 0;
>> >  }
> Cool trick of using a suitably-aligned overlap-to-self check to then
> trigger platform-specific speedups without having to rewrite them by
> hand!  qemu is doing a similar check in util/cutils.c:buffer_is_zero()
> that could probably benefit from the same idea.

Nice trick indeed.  On the other hand, the first 16 bytes are enough to
rule out 99.99% (number out of thin air) of the non-zero blocks, so
that's where you want to optimize.  Checking them an unsigned long at a
time, or fetching a few unsigned longs and ORing them together, would
probably be the best of both worlds, because you then only use the FPU
in the rare case of a zero buffer.

Paolo
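
A minimal sketch of the idea above, for illustration only.  It is not
the gnulib patch or qemu's buffer_is_zero(); the name is_zero_sketch and
the HEAD constant are hypothetical, and it assumes sizeof (unsigned long)
divides 16, as on common 32-bit and 64-bit platforms.  The first 16 bytes
are ORed together an unsigned long at a time, and memcmp is only reached
when that prefix is all zero.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return true if the BUFSIZE bytes at BUF are all
   zero.  */
bool
is_zero_sketch (const void *buf, size_t bufsize)
{
  enum { HEAD = 16 };
  const unsigned char *p = buf;
  unsigned long acc = 0;
  size_t i;

  /* Assumption: an unsigned long evenly divides the 16-byte prefix.  */
  static_assert (HEAD % sizeof (unsigned long) == 0,
                 "unsigned long must divide the 16-byte prefix");

  /* Short buffers: a plain byte loop is enough.  */
  if (bufsize < HEAD)
    {
      for (i = 0; i < bufsize; i++)
        if (p[i])
          return false;
      return true;
    }

  /* OR the first 16 bytes together, one unsigned long at a time.
     memcpy sidesteps alignment and aliasing problems; compilers
     typically turn it into a plain load.  */
  for (i = 0; i < HEAD; i += sizeof (unsigned long))
    {
      unsigned long w;
      memcpy (&w, p + i, sizeof w);
      acc |= w;
    }
  if (acc)
    return false;

  /* The prefix is zero, so comparing the buffer with itself shifted by
     16 bytes proves every later byte equals one 16 bytes earlier, hence
     zero.  Only this rare path reaches the platform's memcmp.  */
  return memcmp (p, p + HEAD, bufsize - HEAD) == 0;
}
```

Most non-zero blocks return from the accumulator test without ever
calling memcmp, which is the case worth optimizing.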