On Mon, Mar 19, 2018 at 12:39:55PM -0700, Andiry Xu wrote: > On Sun, Mar 11, 2018 at 12:22 PM, Eric Biggers <ebigge...@gmail.com> wrote: > > On Sun, Mar 11, 2018 at 02:00:13PM +0200, Nikolay Borisov wrote: > >> [Adding Herbert Xu to CC since he is the maintainer of the crypto subsys > >> maintainer] > >> > >> On 10.03.2018 20:17, Andiry Xu wrote: > >> <snip> > >> > >> > +static inline u32 nova_crc32c(u32 crc, const u8 *data, size_t len) > >> > +{ > >> > + u8 *ptr = (u8 *) data; > >> > + u64 acc = crc; /* accumulator, crc32c value in lower 32b */ > >> > + u32 csum; > >> > + > >> > + /* x86 instruction crc32 is part of SSE-4.2 */ > >> > + if (static_cpu_has(X86_FEATURE_XMM4_2)) { > >> > + /* This inline assembly implementation should be equivalent > >> > + * to the kernel's crc32c_intel_le_hw() function used by > >> > + * crc32c(), but this performs better on test machines. > >> > + */ > >> > + while (len > 8) { > >> > + asm volatile(/* 64b quad words */ > >> > + "crc32q (%1), %0" > >> > + : "=r" (acc) > >> > + : "r" (ptr), "0" (acc) > >> > + ); > >> > + ptr += 8; > >> > + len -= 8; > >> > + } > >> > + > >> > + while (len > 0) { > >> > + asm volatile(/* trailing bytes */ > >> > + "crc32b (%1), %0" > >> > + : "=r" (acc) > >> > + : "r" (ptr), "0" (acc) > >> > + ); > >> > + ptr++; > >> > + len--; > >> > + } > >> > + > >> > + csum = (u32) acc; > >> > + } else { > >> > + /* The kernel's crc32c() function should also detect and use > >> > the > >> > + * crc32 instruction of SSE-4.2. But calling in to this > >> > function > >> > + * is about 3x to 5x slower than the inline assembly version > >> > on > >> > + * some test machines. > >> > >> That is really odd. Did you try to characterize why this is the case? Is > >> it purely the overhead of dispatching to the correct backend function? > >> That's a rather big performance hit. > >> > >> > + */ > >> > + csum = crc32c(crc, data, len); > >> > + } > >> > + > >> > + return csum; > >> > +} > >> > + > > > > Are you sure that CONFIG_CRYPTO_CRC32C_INTEL was enabled during your tests > > and > > that the accelerated version was being called? Or, perhaps > > CRC32C_PCL_BREAKEVEN > > (defined in arch/x86/crypto/crc32c-intel_glue.c) needs to be adjusted. > > Please > > don't hack around performance problems like this; if they exist, they need > > to be > > fixed for everyone. > > > > I have performed the crc32c test on a Xeon X5647 at 2.93GHz, 14G DDR3 > memory at 1066MHz platform. > You are right that enabling CONFIG_CRYPTO_CRC32C_INTEL improves the > performance significantly. nova_crc32c() is still slightly faster than > crc32c() with the flag enabled. > > Result numbers are follows: data size in bytes, latency in ns, column > 3 is crc32c() with CONFIG_CRYPTO_CRC32C_INTEL enabled and column 4 > disabled. > > data size (bytes) nova_crc32c() crc32c() -enabled > crc32c() -disabled > 64 19 21 56 > 128 28 29 99 > 256 46 43 182 > 512 82 149 354 > 1024 157 232 728 > 2048 305 415 1440 > 4096 603 725 2869 >
Probably CRC32C_PCL_BREAKEVEN needs to be adjusted for that CPU, as I suggested may be the case; notice that your measured speeds are about the same before 512 (CRC32C_PCL_BREAKEVEN) bytes, but the crypto API version is slower at >= 512 bytes. It would be possible to set the breakeven point in crc32c_intel_mod_init() depending on the CPU. Again, if the performance is not good enough you need to fix it for everyone, not hack around it. Thanks, Eric