Hi, I've been looking into alternative CRC32 implementations for architectures that don't have accelerated carry-less multiplication (PCLMUL on x86 SSE/AVX, VMULL on ARM NEON). There are improvements we can make, but they trade memory (both lookup tables and working set) for speed. The "braiding" approach from zlib seems to be the best publicly available algorithm, and I've also been developing my own algorithm, "chorba", which outperforms both slicing and braiding depending on the circumstances (it ends up needing to be tuned per CPU).
Memory requirements:

  Sarwate (Simon's method from RFC 1952): 1kB lookup table
  Slicing-by-8: 8kB lookup table
  Braiding: 8kB lookup table
  Chorba 1952: 2kB working set
  Chorba 57082: 64kB working set
  Chorba 703380: 1MB working set
  Chorba 19824364: 32MB working set

In terms of performance, braiding seems to be 1.2-3x faster than slicing-by-8 across different systems, whereas chorba depends on the CPU: x86_64 improves with a larger working set (even with slicing/braiding at 64kB, better at 1MB or 32MB), ARM seems to be better across the board, and a Raspberry Pi 4 is best at 2kB and gets worse after that.

What sort of parameters would be important here, and what expectations will users/maintainers have? For example, coreutils cksum reads into a rolling 65kB buffer and operates on that, whereas gzip seems to take much larger chunks; are there any systems that would appreciate a low-memory mode? Is it a downside that different architectures would need to choose different parameters to benefit from the speedup?

Cheers
Sam