On Thu, Jun 11, 2026 at 10:36 PM Stephen Hemminger <[email protected]> wrote:
> On Tue, 9 Jun 2026 13:27:12 +0530
> Shreesh Adiga <[email protected]> wrote:
>
> > Add a 64-byte loop that maintains 4 fold registers and processes
> > 64 bytes at a time. The 4x fold registers is then reduced to 16 byte
> > single fold, similar to AVX512 implementation. This technique is
> > described in the paper by Intel:
> > "Fast CRC Computation for Generic Polynomials Using PCLMULQDQ Instruction"
> >
> > This results in roughly 50% performance improvement due to better ILP
> > for large input sizes like 1024.
> >
> > Signed-off-by: Shreesh Adiga <[email protected]>
> > ---
>
> Looks good applied to next-net.
>
> A couple of nits from more detailed AI review, that you still might want
> to look at:
>
> The current crc_autotest does not exercise the new 64-byte CRC16 path.
> Its CRC32 vectors are 1512 and 348 bytes, so the CRC32 4x loop is
> covered, but the largest CRC16 vector is 32 bytes, all three CRC16
> tests being ≤32. So the new CRC16 rk1_rk2 (64-byte fold) constants ship
> untested in CI. My exhaustive test confirms they're correct, but a
> future regression there wouldn't be caught. Suggest adding a CRC16
> vector ≥64 bytes, ideally a non-multiple of 64 (e.g. 80 or 100) so it
> hits the 4x loop, the single-fold tail, and the partial-bytes path
> together.
>
> In partial_bytes the comment /* k = rk1 & rk2 */ is now stale:
> after the patch k holds rk3_rk4 on every path reaching it.
> Not introduced by this patch, but the patch is what made it wrong;
> worth fixing in passing.
>

I've submitted a couple of follow-up patches that should address the above:
https://patches.dpdk.org/project/dpdk/patch/[email protected]/
https://patches.dpdk.org/project/dpdk/patch/[email protected]/
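
As a rough aid for picking the test vector length, here is a minimal sketch of how a buffer length would split across the three paths named in the review. It assumes a simplified model: whole 64-byte blocks go to the 4x loop, the remainder is folded 16 bytes at a time, and anything under 16 bytes goes to partial_bytes. The struct and function names are hypothetical and not part of DPDK; the real code may differ in detail (for example in how the first or last 16-byte block is handled).

```c
#include <stddef.h>

/* Hypothetical helper, not part of DPDK: models how a buffer length
 * splits across the 64-byte loop, the 16-byte single fold, and the
 * partial-bytes path under the simplified assumptions stated above. */
struct crc_path_split {
	size_t blocks64; /* iterations of the 64-byte (4x fold) loop */
	size_t folds16;  /* 16-byte single-fold steps on the remainder */
	size_t partial;  /* trailing bytes (< 16) left for partial_bytes */
};

static struct crc_path_split
crc_path_split(size_t len)
{
	struct crc_path_split s;
	size_t rem = len % 64; /* bytes left after the 4x loop */

	s.blocks64 = len / 64;
	s.folds16 = rem / 16;
	s.partial = rem % 16;
	return s;
}
```

Under this model a 100-byte vector gives one 64-byte iteration, two 16-byte folds and 4 partial bytes, so it touches all three paths. An 80-byte vector gives one iteration and one fold with no partial bytes, and a 32-byte vector never enters the 64-byte loop.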

