On Wednesday 21 January 2015 12:27:38 Anton Blanchard wrote:
> I noticed ksm spending quite a lot of time in memcmp on a large
> KVM box. The current memcmp loop is very unoptimised - byte at a
> time compares with no loop unrolling. We can do much much better.
>
> Optimise the loop in a few ways:
>
> - Unroll the byte at a time loop
>
> - For large (at least 32 byte) comparisons that are also 8 byte
>   aligned, use an unrolled modulo scheduled loop using 8 byte
>   loads. This is similar to our glibc memcmp.
>
> A simple microbenchmark testing 10000000 iterations of an 8192 byte
> memcmp was used to measure the performance:
>
> baseline: 29.93 s
>
> modified:  1.70 s
>
> Just over 17x faster.
>
> v2: Incorporated some suggestions from Segher:
>
> - Use andi. instead of rldicl.
>
> - Convert bdnzt eq, to bdnz. It was just duplicating the earlier
>   compare and was a relic from a previous version.
>
> - Don't use cr5, we have plans to use that CR field for fast local
>   atomics.
>
> Signed-off-by: Anton Blanchard <an...@samba.org>
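
In C terms, the strategy described in the changelog is roughly the
following. This is only a minimal sketch of the approach - the patch
itself is hand-scheduled PowerPC assembly, and the function name and
structure below are illustrative, not taken from the patch:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch only: for large, mutually 8-byte-aligned buffers compare a
 * doubleword at a time, then let the byte loop find the first
 * differing byte, which also provides the return ordering.
 */
static int memcmp_sketch(const void *s1, const void *s2, size_t n)
{
	const unsigned char *p1 = s1, *p2 = s2;

	if (n >= 32 && !(((uintptr_t)p1 | (uintptr_t)p2) & 7)) {
		while (n >= 8) {
			uint64_t a, b;

			memcpy(&a, p1, 8);
			memcpy(&b, p2, 8);
			if (a != b)
				break;	/* byte loop below reports the order */
			p1 += 8;
			p2 += 8;
			n -= 8;
		}
	}

	while (n--) {
		if (*p1 != *p2)
			return *p1 < *p2 ? -1 : 1;
		p1++;
		p2++;
	}
	return 0;
}

The assembly version additionally unrolls both loops and
modulo-schedules the 8 byte loads, as the changelog describes.
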
Would it help to also add a way for an architecture to override
memcmp_pages() with its own implementation? That way you could skip
the unaligned part, hardcode the loop counter and avoid the
preempt_disable() in kmap_atomic().

	Arnd
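
A minimal sketch of what such an override could look like, assuming a
__weak default in mm/ksm.c that an architecture may replace at link
time - the hook mechanism and file placement are assumptions here, not
existing code:

#include <linux/highmem.h>
#include <linux/mm.h>
#include <linux/string.h>

/* mm/ksm.c: generic default, same body as today but made overridable.
 * The __weak annotation is the assumed override mechanism in this sketch.
 */
int __weak memcmp_pages(struct page *page1, struct page *page2)
{
	char *addr1, *addr2;
	int ret;

	addr1 = kmap_atomic(page1);
	addr2 = kmap_atomic(page2);
	ret = memcmp(addr1, addr2, PAGE_SIZE);
	kunmap_atomic(addr2);
	kunmap_atomic(addr1);
	return ret;
}

/* arch/powerpc (hypothetical override): on 64-bit the pages are in the
 * linear mapping and PAGE_SIZE aligned, so no kmap_atomic() and hence
 * no preempt_disable() is needed, the unaligned prologue can be skipped,
 * and the length is a compile-time constant.
 */
int memcmp_pages(struct page *page1, struct page *page2)
{
	return memcmp(page_address(page1), page_address(page2), PAGE_SIZE);
}

The strong arch definition would then replace the weak generic one at
link time, as with other per-architecture string routines.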