Re: [PATCH] cksum: use pclmul instead of slice-by-32 for final bytes

2024-11-24 Thread Pádraig Brady
On 24/11/2024 11:19, Sam Russell wrote: The current implementation reads 64kB blocks and uses lookup tables for the final 0-31 bytes (normally 16 bytes, meaning 16 lookups). I've replaced this with the smaller folds and Barrett reduction from the Intel paper. Benchmarking is hard as there's a lot

[PATCH] cksum: use pclmul instead of slice-by-32 for final bytes

2024-11-24 Thread Sam Russell
The current implementation reads 64kB blocks and uses lookup tables for the final 0-31 bytes (normally 16 bytes, meaning 16 lookups). I've replaced this with the smaller folds and Barrett reduction from the Intel paper. Benchmarking is hard as there's a lot of variance, but it appears to give aroun
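
For readers unfamiliar with the table-lookup path mentioned above, here is a minimal sketch of a byte-at-a-time CRC update using the POSIX cksum polynomial (0x04C11DB7, most significant bit first). It is not the coreutils code: the names crc_table, crc_table_init and crc_update_bytewise are hypothetical, and it is the simplest one-lookup-per-byte form rather than the slice-by-N variant named in the subject line. It only illustrates why a 16-byte tail costs 16 lookups.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names for illustration; not the coreutils identifiers. */
static uint32_t crc_table[256];

/* Build the 256-entry table for polynomial 0x04C11DB7, MSB first. */
static void crc_table_init(void)
{
    for (uint32_t i = 0; i < 256; i++) {
        uint32_t r = i << 24;
        for (int bit = 0; bit < 8; bit++)
            r = (r & 0x80000000u) ? (r << 1) ^ 0x04C11DB7u : (r << 1);
        crc_table[i] = r;
    }
}

/* Fold len bytes into crc, one table lookup per input byte. */
static uint32_t crc_update_bytewise(uint32_t crc, const unsigned char *buf,
                                    size_t len)
{
    while (len--)
        crc = (crc << 8) ^ crc_table[((crc >> 24) ^ *buf++) & 0xFF];
    return crc;
}
```

Because the update is a plain fold over bytes, a buffer can be split at any point and the remainder handled by a different routine, as long as both produce the same running CRC. Note that cksum additionally feeds the input length into the CRC and complements the result, which this sketch leaves out.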

Re: [PATCH] cksum: use pclmul instead of slice-by-32 for final bytes

2024-11-24 Thread Sam Russell
What do you get over 10 iterations? There's a ton of variance and a proper benchmarking tool would give a more accurate result. It's not the order-of-magnitude speedup from slice-by-8 to pclmul, but I would expect it to be faster than the table lookup; perhaps it's a <10% improvement (1/4096 calcula