The current implementation reads 64kB blocks and uses lookup tables for the final 0-31 bytes (normally 16 bytes, meaning 16 lookups). I've replaced this with the smaller folds and Barrett reduction from the Intel paper. Benchmarking is hard as there's a lot of variance, but it appears to give a noticeable improvement for a 4GB ISO: the fastest run is 0.215s user time, compared with a previous fastest of 0.451s, on an AMD Ryzen 5 5600.
Future work is to remove this final reduction from the loop completely: since we're reading in multiples of 32 bytes, we can use the 4-fold method exclusively until we get to the end of the file stream. I'm open to any feedback, especially as I've probably violated the code style somewhere along the line. Copyright: this is all my own work and I have completed the GNU copyright paperwork; the algorithm is based on the Intel paper that the rest of the implementation is also based on.
0001-cksum-use-pclmul-instead-of-slice-by-32-for-final-by.patch