Hi Bruce,

On my laptop (Intel(R) Core(TM) i5-3320M CPU, Ivy Bridge) I still see
slightly worse performance with this patch.  Please excuse the ugly graphs;
I don't have a better graphing tool set up at this time:
https://people.freebsd.org/~cem/crc32/sse42_bde.png
https://people.freebsd.org/~cem/crc32/sse42_bde_log.png

Best,
Conrad

On Mon, Feb 27, 2017 at 6:27 PM, Bruce Evans <b...@optusnet.com.au> wrote:
> On Mon, 27 Feb 2017, Conrad Meyer wrote:
>
>> On Thu, Feb 2, 2017 at 12:29 PM, Bruce Evans <b...@optusnet.com.au> wrote:
>>>
>>> I've almost finished fixing and optimizing this.  I didn't manage to fix
>>> all the compiler pessimizations, but the result is within 5% of optimal
>>> for buffers larger than a few K.
>>
>> Did you ever get to a final patch that you are satisfied with?  It
>> would be good to get this improvement into the tree.
>
> I'm happy with this version (attached and partly enclosed).  You need to
> test it in the kernel and commit it (I only did simple correctness tests
> in userland).
>
> X Index: conf/files.amd64
> X ===================================================================
> X --- conf/files.amd64	(revision 314363)
> X +++ conf/files.amd64	(working copy)
> X @@ -545,6 +545,9 @@
> X  isa/vga_isa.c			optional	vga
> X  kern/kern_clocksource.c	standard
> X  kern/link_elf_obj.c		standard
> X +libkern/x86/crc32_sse42.c	standard
> X +libkern/memmove.c		standard
> X +libkern/memset.c		standard
>
> Also fix some nearby disorder.
>
> X ...
> X Index: libkern/x86/crc32_sse42.c
> X ===================================================================
> X --- libkern/x86/crc32_sse42.c	(revision 314363)
> X +++ libkern/x86/crc32_sse42.c	(working copy)
> X @@ -31,15 +31,41 @@
> X   */
> X  #ifdef USERSPACE_TESTING
> X  #include <stdint.h>
> X +#include <stdlib.h>
> X  #else
> X  #include <sys/param.h>
> X +#include <sys/systm.h>
> X  #include <sys/kernel.h>
> X -#include <sys/libkern.h>
> X -#include <sys/systm.h>
> X  #endif
>
> Also fix minor #include errors.
>
> X
> X -#include <nmmintrin.h>
> X +static __inline uint32_t
> X +_mm_crc32_u8(uint32_t x, uint8_t y)
> X +{
> X +	/*
> X +	 * clang (at least 3.9.[0-1]) pessimizes "rm" (y) and "m" (y)
> X +	 * significantly and "r" (y) a lot by copying y to a different
> X +	 * local variable (on the stack or in a register), so only use
> X +	 * the latter.  This costs a register and an instruction but
> X +	 * not a uop.
> X +	 */
> X +	__asm("crc32b %1,%0" : "+r" (x) : "r" (y));
> X +	return (x);
> X +}
>
> Using intrinsics avoids the silly copying via the stack, and allows more
> unrolling.  Old gcc does more unrolling with just asms.  Unrolling is
> almost useless (some details below).
>
> X @@ -47,12 +73,14 @@
> X   * Block sizes for three-way parallel crc computation.  LONG and SHORT must
> X   * both be powers of two.
> X   */
> X -#define LONG	8192
> X -#define SHORT	256
> X +#define LONG	128
> X +#define SHORT	64
>
> These are aggressively low.
>
> Note that small buffers aren't handled very well.  SHORT = 64 means that
> a buffer of size 3 * 64 = 192 is handled entirely by the "SHORT" loop.
> 192 is not very small, but any smaller than that and overheads for
> adjustment at the end of the loop are too large for the "SHORT" loop
> to be worth doing.  Almost any value of LONG larger than 128 works OK
> now, but if LONG is large then it gives too much work for the "SHORT"
> loop, since normal buffer sizes are not a multiple of 3.  E.g., with
> the old LONG and SHORT, a buffer of size 128K was decomposed as 5 * 24K
> (done almost optimally by the "LONG" loop) + 10 * 768 (done a bit less
> optimally by the "SHORT" loop) + 64 * 8 (done pessimally).
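
To make that decomposition concrete, here is a throwaway userland sketch
(mine, not part of Bruce's patch) of how a length splits across the three
stages, using the old LONG = 8192 and SHORT = 256 and assuming the usual
8-byte word-at-a-time tail on amd64:

#include <stddef.h>
#include <stdio.h>

#define	OLD_LONG	8192
#define	OLD_SHORT	256

int
main(void)
{
	size_t len = 128 * 1024;
	size_t nlong, nshort, nwords;

	nlong = len / (OLD_LONG * 3);		/* "LONG" loop iterations */
	len -= nlong * OLD_LONG * 3;
	nshort = len / (OLD_SHORT * 3);		/* "SHORT" loop iterations */
	len -= nshort * OLD_SHORT * 3;
	nwords = len / 8;			/* 8-byte words in the tail */
	len -= nwords * 8;
	printf("%zu LONG*3 + %zu SHORT*3 + %zu words + %zu bytes\n",
	    nlong, nshort, nwords, len);
	return (0);
}

For 128K it prints 5, 10, 64 and 0, i.e., 5 * 24K + 10 * 768 + 64 * 8,
matching the breakdown above.
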
> I didn't get around to ifdefing this for i386.  On i386, the loops take
> twice as many crc32 instructions for a given byte count, so the timing
> is satisfied by a byte count half as large, so SHORT and LONG can be
> reduced by a factor of 2 to give faster handling for small buffers without
> affecting the speed for large buffers significantly.
>
> X
> X  /* Tables for hardware crc that shift a crc by LONG and SHORT zeros. */
> X  static uint32_t crc32c_long[4][256];
> X +static uint32_t crc32c_2long[4][256];
> X  static uint32_t crc32c_short[4][256];
> X +static uint32_t crc32c_2short[4][256];
>
> I didn't get around to updating the comment.  2long shifts by 2*LONG zeros,
> etc.
>
> Shifts by 3N are done by adding shifts by 1N and 2N in parallel.  I couldn't
> get the direct 3N shift to run any faster.
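
As I read the shift tables, crc32c_shift(crc32c_long, c) is equivalent to
running c through LONG zero bytes, and the whole 3-way combination leans on
CRC being linear over GF(2).  A tiny self-contained check of that identity
with a plain bitwise CRC-32C (my sketch, not related to the patch; the usual
~ inversions are omitted because they don't matter for the identity):

#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bitwise CRC-32C (Castagnoli), reflected, no pre/post inversion. */
static uint32_t
soft_crc32c(uint32_t crc, const unsigned char *buf, size_t len)
{
	size_t i;
	int bit;

	for (i = 0; i < len; i++) {
		crc ^= buf[i];
		for (bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ ((crc & 1) ? 0x82f63b78 : 0);
	}
	return (crc);
}

int
main(void)
{
	unsigned char a[64], b[64], zeros[64], ab[128];
	uint32_t direct, split;
	size_t i;

	memset(zeros, 0, sizeof(zeros));
	for (i = 0; i < 64; i++) {
		a[i] = i;
		b[i] = 255 - i;
	}
	memcpy(ab, a, 64);
	memcpy(ab + 64, b, 64);

	direct = soft_crc32c(0x12345678, ab, 128);
	/* "Shift" crc(a) through 64 zero bytes, then xor in crc(b) from 0. */
	split = soft_crc32c(soft_crc32c(0x12345678, a, 64), zeros, 64) ^
	    soft_crc32c(0, b, 64);
	assert(direct == split);
	return (0);
}

The kernel code applies the zero-byte shift with the 4x256 tables instead of
actually processing zero bytes, but the final xor combine is the same.
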
>
> X @@ -190,7 +220,11 @@
> X  	const size_t align = 4;
> X  #endif
> X  	const unsigned char *next, *end;
> X -	uint64_t crc0, crc1, crc2;	/* need to be 64 bits for crc32q */
> X +#ifdef __amd64__
> X +	uint64_t crc0, crc1, crc2;
> X +#else
> X +	uint32_t crc0, crc1, crc2;
> X +#endif
> X
> X  	next = buf;
> X  	crc0 = crc;
>
> 64 bits of course isn't needed for i386.  It isn't needed for amd64 either.
> I think the crc32 instruction zeros the top 32 bits so they can be ignored.
> However, when I modified the asm to return 32 bits to tell the compiler
> about this (which the intrinsic wouldn't be able to do) and used 32 bits
> here, this was just slightly slower.
>
> For some intermediate crc calculations, only 32 bits are used, and the
> compiler can see this.  clang on amd64 optimizes this better than gcc,
> starting with all the intermediate crc variables declared as 64 bits.
> gcc generates worse code when some of the intermediates are declared as
> 32 bits.  So keep using 64 bits on amd64 here.
>
> X @@ -202,6 +236,7 @@
> X  		len--;
> X  	}
> X
> X +#if LONG > SHORT
> X  	/*
> X  	 * Compute the crc on sets of LONG*3 bytes, executing three independent
> X  	 * crc instructions, each on LONG bytes -- this is optimized for the
>
> LONG = SHORT = 64 works OK on Haswell, but I suspect that slower CPUs
> benefit from larger values and I want to keep SHORT as small as possible
> for the fastest CPUs.
>
> X @@ -209,6 +244,7 @@
> X  	 * have a throughput of one crc per cycle, but a latency of three
> X  	 * cycles.
> X  	 */
> X +	crc = 0;
> X  	while (len >= LONG * 3) {
> X  		crc1 = 0;
> X  		crc2 = 0;
> X @@ -229,16 +265,64 @@
> X  #endif
> X  			next += align;
> X  		} while (next < end);
> X -		crc0 = crc32c_shift(crc32c_long, crc0) ^ crc1;
> X -		crc0 = crc32c_shift(crc32c_long, crc0) ^ crc2;
> X +		/*-
> X +		 * Update the crc.  Try to do it in parallel with the inner
> X +		 * loop.  'crc' is used to accumulate crc0 and crc1
> X +		 * produced by the inner loop so that the next iteration
> X +		 * of the loop doesn't depend on anything except crc2.
> X +		 *
> X +		 * The full expression for the update is:
> X +		 *     crc = S*S*S*crc + S*S*crc0 + S*crc1
> X +		 * where the terms are polynomials modulo the CRC polynomial.
> X +		 * We regroup this subtly as:
> X +		 *     crc = S*S * (S*crc + crc0) + S*crc1.
> X +		 * This has an extra dependency which reduces possible
> X +		 * parallelism for the expression, but it turns out to be
> X +		 * best to intentionally delay evaluation of this expression
> X +		 * so that it competes less with the inner loop.
> X +		 *
> X +		 * We also intentionally reduce parallelism by feeding back
> X +		 * crc2 to the inner loop as crc0 instead of accumulating
> X +		 * it in crc.  This synchronizes the loop with crc update.
> X +		 * CPU and/or compiler schedulers produced bad order without
> X +		 * this.
> X +		 *
> X +		 * Shifts take about 12 cycles each, so 3 here with 2
> X +		 * parallelizable take about 24 cycles and the crc update
> X +		 * takes slightly longer.  8 dependent crc32 instructions
> X +		 * can run in 24 cycles, so the 3-way blocking is worse
> X +		 * than useless for sizes less than 8 * <word size> = 64
> X +		 * on amd64.  In practice, SHORT = 32 confirms these
> X +		 * timing calculations by giving a small improvement
> X +		 * starting at size 96.  Then the inner loop takes about
> X +		 * 12 cycles and the crc update about 24, but these are
> X +		 * partly in parallel so the total time is less than the
> X +		 * 36 cycles that 12 dependent crc32 instructions would
> X +		 * take.
> X +		 *
> X +		 * To have a chance of completely hiding the overhead for
> X +		 * the crc update, the inner loop must take considerably
> X +		 * longer than 24 cycles.  LONG = 64 makes the inner loop
> X +		 * take about 24 cycles, so is not quite large enough.
> X +		 * LONG = 128 works OK.  Unhideable overheads are about
> X +		 * 12 cycles per inner loop.  All assuming timing like
> X +		 * Haswell.
> X +		 */
> X +		crc = crc32c_shift(crc32c_long, crc) ^ crc0;
> X +		crc1 = crc32c_shift(crc32c_long, crc1);
> X +		crc = crc32c_shift(crc32c_2long, crc) ^ crc1;
> X +		crc0 = crc2;
> X  		next += LONG * 2;
> X  		len -= LONG * 3;
> X  	}
> X +	crc0 ^= crc;
> X +#endif /* LONG > SHORT */
> X
> X  	/*
> X  	 * Do the same thing, but now on SHORT*3 blocks for the remaining data
> X  	 * less than a LONG*3 block
> X  	 */
> X +	crc = 0;
> X  	while (len >= SHORT * 3) {
> X  		crc1 = 0;
> X  		crc2 = 0;
>
> See the comment.
>
> X @@ -259,11 +343,14 @@
> X  #endif
> X  			next += align;
>
> When SHORT is about what it is (64), on amd64 the "SHORT" loop has 24 crc32
> instructions and compilers sometimes unroll them all.  This makes little
> difference.
>
> X  		} while (next < end);
> X -		crc0 = crc32c_shift(crc32c_short, crc0) ^ crc1;
> X -		crc0 = crc32c_shift(crc32c_short, crc0) ^ crc2;
> X +		crc = crc32c_shift(crc32c_short, crc) ^ crc0;
> X +		crc1 = crc32c_shift(crc32c_short, crc1);
> X +		crc = crc32c_shift(crc32c_2short, crc) ^ crc1;
> X +		crc0 = crc2;
> X  		next += SHORT * 2;
> X  		len -= SHORT * 3;
> X  	}
> X +	crc0 ^= crc;
>
> The change is perhaps easier to understand without looking at the comment.
> We accumulate changes in crc instead of into crc0, so that the next
> iteration can start without waiting for accumulation.  This requires more
> shifting steps, and we try to arrange these optimally.
>
> X
> X  	/* Compute the crc on the remaining bytes at native word size. */
> X  	end = next + (len - (len & (align - 1)));
>
> The adjustments for alignment are slow if they are not null, and wasteful
> if they are null, but have relatively little cost for the non-small buffers
> that are handled well, so I didn't remove them.
>
> Bruce
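
In case anyone wants to reproduce this kind of comparison, a crude userland
timing loop would look something like the sketch below.  It assumes
crc32_sse42.c is built with -DUSERSPACE_TESTING, that the shift tables get
initialized in that configuration, and that the file exposes the prototype
declared here; check the file for the real signature and adjust.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Assumed prototype of the routine under test; verify against the file. */
uint32_t sse42_crc32c(uint32_t crc, const unsigned char *buf, unsigned len);

int
main(void)
{
	struct timespec t0, t1;
	unsigned char *buf;
	unsigned size, iters, i;
	uint32_t crc;
	double ns;

	buf = malloc(1 << 20);
	if (buf == NULL)
		return (1);
	memset(buf, 0xa5, 1 << 20);
	for (size = 16; size <= (1u << 20); size *= 2) {
		iters = (256u << 20) / size;	/* ~256MB of data per size */
		crc = 0;
		clock_gettime(CLOCK_MONOTONIC, &t0);
		for (i = 0; i < iters; i++)
			crc = sse42_crc32c(crc, buf, size);
		clock_gettime(CLOCK_MONOTONIC, &t1);
		ns = (t1.tv_sec - t0.tv_sec) * 1e9 +
		    (t1.tv_nsec - t0.tv_nsec);
		printf("%8u bytes: %.3f bytes/ns (crc 0x%08x)\n",
		    size, (double)size * iters / ns, (unsigned)crc);
	}
	free(buf);
	return (0);
}

Built with something like "cc -O2 -DUSERSPACE_TESTING crc32_sse42.c bench.c
-o bench", it prints throughput per buffer size, which is the kind of curve
the two graphs above show.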