Hi,

I've been looking into other CRC32 implementations for architectures that
don't have accelerated carry-less multiplication (PCLMULQDQ on x86
SSE/AVX, VMULL on ARM NEON). There are improvements we can make, but they
trade speed for memory (both for lookup tables and for working set). The
"braiding" approach from zlib seems to be the best publicly available
algorithm, and I've also been developing my own algorithm, "chorba", which
outperforms both slicing and braiding depending on the circumstances (it
ends up needing to be tuned per CPU).

Memory requirements:
Sarwate (Simon's method from RFC 1952): 1kB lookup table
Slicing by 8: 8kB lookup table
Braiding: 8kB lookup table
Chorba 1952: 2kB working set
Chorba 57082: 64kB working set
Chorba 703380: 1MB working set
Chorba 19824364: 32MB working set
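
For reference, the 1kB baseline above is the classic byte-at-a-time table
loop (along the lines of the sample code in RFC 1952) — a minimal sketch,
not anyone's exact implementation:

```c
#include <stddef.h>
#include <stdint.h>

/* 256 x 4-byte entries = 1kB table for the reflected polynomial 0xEDB88320 */
static uint32_t crc_table[256];

static void crc32_init(void)
{
    for (uint32_t i = 0; i < 256; i++) {
        uint32_t c = i;
        for (int k = 0; k < 8; k++)
            c = (c & 1) ? 0xEDB88320u ^ (c >> 1) : c >> 1;
        crc_table[i] = c;
    }
}

/* Sarwate: one table lookup per input byte */
static uint32_t crc32_sarwate(uint32_t crc, const unsigned char *buf, size_t len)
{
    crc = ~crc;
    while (len--)
        crc = crc_table[(crc ^ *buf++) & 0xFF] ^ (crc >> 8);
    return ~crc;
}
```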

In terms of performance, braiding seems to be 1.2-3x faster than slicing by
8 across different systems, whereas chorba depends on the CPU: x86_64
improves as the working set grows (roughly on par with braiding at 64kB,
ahead at 1MB and 32MB), ARM seems to be better across the board, and
Raspberry Pi 4 is best at 2kB and gets worse beyond that.
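
For anyone unfamiliar with the 8kB-table approach being compared against,
slicing-by-8 consumes eight bytes per iteration using eight 1kB tables. A
rough, portable sketch (the table layout is illustrative, not zlib's actual
code):

```c
#include <stddef.h>
#include <stdint.h>

/* 8 tables x 256 entries x 4 bytes = 8kB */
static uint32_t tab[8][256];

static void slice8_init(void)
{
    for (uint32_t i = 0; i < 256; i++) {
        uint32_t c = i;
        for (int k = 0; k < 8; k++)
            c = (c & 1) ? 0xEDB88320u ^ (c >> 1) : c >> 1;
        tab[0][i] = c;
    }
    /* tab[t][i] = CRC of byte i followed by t zero bytes */
    for (uint32_t i = 0; i < 256; i++)
        for (int t = 1; t < 8; t++)
            tab[t][i] = (tab[t - 1][i] >> 8) ^ tab[0][tab[t - 1][i] & 0xFF];
}

static uint32_t crc32_slice8(uint32_t crc, const unsigned char *p, size_t len)
{
    crc = ~crc;
    while (len >= 8) {
        /* fold the running CRC into the first four bytes (little-endian order) */
        uint32_t lo = crc ^ ((uint32_t)p[0] | (uint32_t)p[1] << 8 |
                             (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24);
        crc = tab[7][lo & 0xFF] ^ tab[6][(lo >> 8) & 0xFF] ^
              tab[5][(lo >> 16) & 0xFF] ^ tab[4][lo >> 24] ^
              tab[3][p[4]] ^ tab[2][p[5]] ^ tab[1][p[6]] ^ tab[0][p[7]];
        p += 8;
        len -= 8;
    }
    /* Sarwate-style tail for the remaining bytes */
    while (len--)
        crc = tab[0][(crc ^ *p++) & 0xFF] ^ (crc >> 8);
    return ~crc;
}
```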

What sort of parameters would be important and what expectations will
users/maintainers have? For example, coreutils cksum reads a rolling 65kB
buffer and operates on that, whereas gzip seems to take much larger chunks;
are there any systems that would appreciate a low-mem mode? Is it a
downside that different architectures would need to choose different
parameters to benefit from the speedup?

Cheers
Sam
