What do you get over 10 iterations? There's a lot of variance between runs, and a proper benchmarking tool would give a more accurate result.

It's not the order-of-magnitude speedup we got going from slice-by-8 to pclmul, but I would expect it to be faster than the table lookup. Perhaps it's a <10% improvement overall: the final reduction is only about 1/4096 of the calculations (16 bytes per 64kB block), even if that part is on the order of <100x faster.

There's also value in not needing to load/generate the lookup table when using the pclmul version of the CRC.
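To take some of the variance out of the comparison, something like the following could be used. It is only a minimal sketch: `best_of` is a hypothetical helper name, and it assumes GNU date for the `%N` nanosecond format. It runs a command a given number of times and prints the fastest wall-clock time in milliseconds.

```shell
#!/bin/sh
# best_of N CMD [ARGS...]
# Hypothetical helper (not part of coreutils): run CMD N times and
# print the fastest wall-clock time in milliseconds.
# Assumes GNU date, which supports %N (nanoseconds).
best_of() {
  runs=$1
  shift
  best=
  i=0
  while [ "$i" -lt "$runs" ]; do
    start=$(date +%s%N)
    "$@" >/dev/null
    end=$(date +%s%N)
    # Convert the elapsed nanoseconds to milliseconds.
    elapsed=$(( (end - start) / 1000000 ))
    # Keep the smallest elapsed time seen so far.
    if [ -z "$best" ] || [ "$elapsed" -lt "$best" ]; then
      best=$elapsed
    fi
    i=$((i + 1))
  done
  echo "$best"
}

# Example usage (run before and after applying the patch):
#   best_of 10 taskset -c 0 src/cksum file
```

Taking the fastest of several runs matches how the timings in this thread were reported; a dedicated benchmarking tool would additionally give the mean and spread.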
On Sun, Nov 24, 2024, 15:17 Pádraig Brady <p...@draigbrady.com> wrote:

> On 24/11/2024 11:19, Sam Russell wrote:
> > The current implementation reads 64kB blocks and uses lookup tables for the
> > final 0-31 bytes (normally 16 bytes, meaning 16 lookups). I've replaced
> > this with the smaller folds and Barrett reduction from the intel paper.
> > Benchmarking is hard as there's a lot of variance, but it appears to give
> > around a noticeable improvement for a 4GB ISO (fastest time is 0.215s user
> > compared with fastest 0m0.451s on a AMD Ryzen 5 5600).
> >
> > Future work is to remove this final reduction from the loop completely as
> > we're reading in multiples of 32 bytes and we can use the 4-fold method
> > exclusively until we get to the end of the file stream.
> >
> > Open any feedback, especially as I've probably violated the code style
> > somewhere along the line.
> >
> > Copyright: all my own work and have completed GNU copyright paperwork, the
> > algorithm is based off the Intel paper that the rest of the implementation
> > is also based on.
>
>
> I see a slight perf regression on an i7-5600U CPU @ 2.60GHz:
>
> # truncate -s4G file
>
> # time taskset -c 0 chrt -f 99 src/cksum file
> 4215202376 4294967296 file
> real 0m3.023s
> ...
> real 0m3.005s
> ...
> real 0m3.018s
>
>
> $ patch -p1 < ~/0001-cksum-use-pclmul-instead-of-slice-by-32-for-final-by.patch
> $ ./make --opt
>
> # time taskset -c 0 chrt -f 99 src/cksum file
> 4215202376 4294967296 file
> real 0m3.108s
> ...
> real 0m3.092s
> ...
> real 0m3.143s
>
>
> Now that's a small enough regression on older hardware,
> that a 2x improvement on newer hardware is worth doing.
> However it's a bit surprising, and warrants more testing.
>
> cheers,
> Pádraig
>