The current implementation reads 64kB blocks and uses lookup tables for the final 0-31 bytes (normally 16 bytes, meaning 16 lookups). I've replaced this with the smaller folds and Barrett reduction from the Intel paper. Benchmarking is hard as there's a lot of variance, but it appears to give a noticeable improvement for a 4GB ISO: the fastest run is 0.215s user time, compared with a previous fastest of 0.451s, on an AMD Ryzen 5 5600.
Future work is to remove this final reduction from the loop completely: since we're reading in multiples of 32 bytes, we can use the 4-fold method exclusively until we get to the end of the file stream. I'm open to any feedback, especially as I've probably violated the code style somewhere along the line. Copyright: this is all my own work and I have completed the GNU copyright paperwork; the algorithm is based on the Intel paper that the rest of the implementation is also based on.
0001-cksum-use-pclmul-instead-of-slice-by-32-for-final-by.patch