On 08-01-2015 23:56, Anton Blanchard wrote:
> I noticed ksm spending quite a lot of time in memcmp on a large
> KVM box. The current memcmp loop is very unoptimised - byte at a
> time compares with no loop unrolling. We can do much much better.
>
> Optimise the loop in a few ways:
>
> - Unroll the byte at a time loop
>
> - For large (at least 32 byte) comparisons that are also 8 byte
>   aligned, use an unrolled modulo scheduled loop using 8 byte
>   loads. This is similar to our glibc memcmp.
>
> A simple microbenchmark testing 10000000 iterations of an 8192 byte
> memcmp was used to measure the performance:
>
> baseline: 29.93 s
>
> modified: 1.70 s
>
> Just over 17x faster.
>
> Signed-off-by: Anton Blanchard <an...@samba.org>
>

Why not use the glibc implementations instead? All of them (ppc64,
power4, and power7) avoid byte-at-a-time compares for unaligned
inputs, while showing the same performance as this new implementation
for aligned ones. To give you an example, an 8192-byte compare with
input alignments of 63/18 shows:

__memcmp_power7:  320 cycles
__memcmp_power4:  320 cycles
__memcmp_ppc64:   340 cycles
this memcmp:     3185 cycles

_______________________________________________
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev
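
For anyone following the thread without the patch at hand, here is a rough C sketch of the idea in the quoted description: take an 8-byte-load path only when the compare is at least 32 bytes and both pointers are 8-byte aligned, and fall back to a byte loop otherwise. It is an illustration only, not the posted powerpc assembly and not any of the glibc routines. The name sketch_memcmp is made up, and the unrolling and modulo scheduling are left out. It also shows the gap the numbers above point at: a misaligned pair such as 63/18 never reaches the word path.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical illustration; not the kernel patch and not glibc code.
 * Returns <0, 0 or >0 like memcmp.
 */
static int sketch_memcmp(const void *s1, const void *s2, size_t n)
{
    const unsigned char *a = s1;
    const unsigned char *b = s2;

    /* Word path: only for large compares where both pointers are 8-byte aligned. */
    if (n >= 32 && (((uintptr_t)a | (uintptr_t)b) & 7) == 0) {
        while (n >= 8) {
            uint64_t wa, wb;

            /* memcpy keeps the 8-byte loads free of aliasing problems. */
            memcpy(&wa, a, sizeof(wa));
            memcpy(&wb, b, sizeof(wb));
            if (wa != wb)
                break;  /* the byte loop below finds the differing byte */
            a += 8;
            b += 8;
            n -= 8;
        }
    }

    /* Byte path: unaligned inputs, short compares, the tail, or a mismatching word. */
    while (n--) {
        if (*a != *b)
            return (int)*a - (int)*b;
        a++;
        b++;
    }
    return 0;
}
```

Dropping back to the byte loop on a mismatching word keeps the sketch independent of endianness, at the cost of re-reading up to 8 bytes. Calling sketch_memcmp(buf1 + 63, buf2 + 18, 8192) goes through the byte loop for the whole length, which is the case the glibc versions handle with shifted word loads instead.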