On 24/11/2024 11:19, Sam Russell wrote:
> The current implementation reads 64kB blocks and uses lookup tables for
> the final 0-31 bytes (normally 16 bytes, meaning 16 lookups). I've
> replaced this with the smaller folds and Barrett reduction from the Intel
> paper. Benchmarking is hard as there's a lot of variance, but it appears
> to give a small improvement.
What do you get over 10 iterations? There's a ton of variance, and a proper
benchmarking tool would give a more accurate result. It's not the order-of-
magnitude speedup you get going from slice-by-8 to pclmul, but I would
expect it to be faster than the table lookup; perhaps it's a <10%
improvement (the tail is only 1/4096 of the calculation).