On Tue, 9 Jun 2026 13:27:12 +0530 Shreesh Adiga <[email protected]> wrote:
> Add a 64-byte loop that maintains 4 fold registers and processes
> 64 bytes at a time. The 4x fold registers is then reduced to 16 byte
> single fold, similar to AVX512 implementation. This technique is
> described in the paper by Intel:
> "Fast CRC Computation for Generic Polynomials Using PCLMULQDQ Instruction"
>
> This results in roughly 50% performance improvement due to better ILP
> for large input sizes like 1024.
>
> Signed-off-by: Shreesh Adiga <[email protected]>
> ---

Looks good, applied to next-net.

A couple of nits from a more detailed AI review that you still might
want to look at:

The current crc_autotest does not exercise the new 64-byte CRC16 path.
Its CRC32 vectors are 1512 and 348 bytes, so the CRC32 4x loop is
covered, but the largest CRC16 vector is 32 bytes (all three CRC16
tests are ≤32 bytes). So the new CRC16 rk1_rk2 (64-byte fold) constants
ship untested in CI. My exhaustive test confirms they're correct, but a
future regression there wouldn't be caught. Suggest adding a CRC16
vector ≥64 bytes, ideally a non-multiple of 64 (e.g. 80 or 100) so it
hits the 4x loop, the single-fold tail, and the partial-bytes path
together.

In partial_bytes the comment /* k = rk1 & rk2 */ is now stale: after
the patch, k holds rk3_rk4 on every path reaching it. Not introduced by
this patch, but the patch is what made it wrong; worth fixing in
passing.
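For the suggested vector, here is a rough sketch of how an expected
value could be generated independently of the SIMD code. It is not
taken from the patch or from the existing test: the function names, the
fill pattern and the 100-byte length are all made up for illustration,
and it assumes the DPDK CRC16-CCITT handler uses the reflected
polynomial 0x8408, initial value 0xFFFF and a final complement. Please
check that against the scalar handler before trusting the output.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical length: at least 64, and not a multiple of 64 or 16. */
#define CRC16_TEST_LEN 100

/*
 * Bit-at-a-time reference CRC16 (illustrative, not the DPDK code).
 * Assumed parameters: reflected polynomial 0x8408, init 0xFFFF,
 * final complement.
 */
static uint16_t
crc16_ccitt_ref(const uint8_t *data, size_t len)
{
	uint16_t crc = 0xFFFF;
	size_t i;
	int bit;

	for (i = 0; i < len; i++) {
		crc ^= data[i];
		for (bit = 0; bit < 8; bit++)
			crc = (crc & 1) ? (crc >> 1) ^ 0x8408 : crc >> 1;
	}
	return (uint16_t)~crc;
}

/* Fill the vector with a simple non-repeating pattern (arbitrary choice). */
static void
crc16_fill_vec(uint8_t *buf, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++)
		buf[i] = (uint8_t)(i * 7 + 3);
}
```

Running crc16_ccitt_ref() over the filled buffer would give a value to
compare against the scalar and SIMD handlers for the same input. Being
bitwise, it shares no tables or fold constants with either of them.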

