On Tue, 15 Aug 2023, Jeff Law wrote:
> Because if the compiler can optimize it automatically, then the projects have
> to do literally nothing to take advantage of it. They just compile normally
> and their bitwise CRC gets optimized down to either a table lookup or a clmul
> variant. That's the real goal here.

The only high-profile FOSS project that carries a bitwise CRC implementation
I'm aware of is the 'xz' compression library. There bitwise CRC is used for
populating the lookup table under './configure --enable-small':

https://github.com/tukaani-project/xz/blob/2b871f4dbffe3801d0da3f89806b5935f758d5f3/src/liblzma/check/crc64_small.c

It's a well-reasoned choice and your compiler would be undoing it
(reintroducing the table when the bitwise CRC is employed specifically
to avoid carrying the table).

> One final note. Elsewhere in this thread you described performance concerns.
> Right now clmuls can be implemented in 4c, fully piped.

Pipelining doesn't matter in the implementation being proposed here, because
the builtin is expanded to

   li      a4,quotient
   li      a5,polynomial
   xor     a0,a1,a0
   clmul   a0,a0,a4
   srli    a0,a0,crc_size
   clmul   a0,a0,a5
   slli    a0,a0,GET_MODE_BITSIZE (word_mode) - crc_size
   srli    a0,a0,GET_MODE_BITSIZE (word_mode) - crc_size

making CLMULs data-dependent, so the second can only be started one cycle
after the first finishes, and consecutive invocations of __builtin_crc
are likewise data-dependent (with three cycles between CLMUL). So even
when you get CLMUL down to 3c latency, you'll have two CLMULs and 10 cycles
per input block, while state of the art is one widening CLMUL per input block
(one CLMUL per 32-bit block on a 64-bit CPU) limited by throughput, not latency.

> I fully expect that latency to drop within the next 12-18 months. In that
> world, there's not going to be much benefit to using hand-coded libraries vs
> just letting the compiler do it.

...

Alexander
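
For readers without the xz tree at hand, the pattern described above (a
bitwise CRC loop used only to fill the lookup table at startup, so the
table does not have to be carried in the binary) looks roughly like the
sketch below. It is an illustration rather than a copy of the xz source:
the names crc64_init, crc64 and POLY64 are chosen here for the example,
and the polynomial is assumed to be the reflected CRC-64 constant used by
xz.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: reflected CRC-64 polynomial as used by xz. */
#define POLY64 UINT64_C(0xC96C5795D7870F42)

/* Table lives in .bss and is filled at run time instead of being
   shipped as 2 KiB of constant data. */
static uint64_t crc64_table[256];

/* Bitwise CRC: one conditional shift-and-xor per bit of the index.
   This is the kind of loop a CRC-recognizing optimization would match. */
static void crc64_init(void)
{
    for (size_t b = 0; b < 256; ++b) {
        uint64_t r = b;
        for (int i = 0; i < 8; ++i)
            r = (r & 1) ? (r >> 1) ^ POLY64 : r >> 1;
        crc64_table[b] = r;
    }
}

/* Byte-at-a-time table-driven CRC over a buffer; crc is the running
   value (0 for the first call). */
static uint64_t crc64(const uint8_t *buf, size_t size, uint64_t crc)
{
    crc = ~crc;
    while (size--)
        crc = crc64_table[*buf++ ^ (crc & 0xFF)] ^ (crc >> 8);
    return ~crc;
}
```

If the compiler replaced the inner loop of crc64_init with a lookup into
a table of its own, the binary would end up carrying a constant table
again, which is exactly what the size-optimized configuration is trying
to avoid.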