On Wed, Dec 11, 2024 at 02:08:58PM +0700, John Naylor wrote:
> I added a port to x86 and poked at it, with the intent to have an easy
> on-ramp to that at least accelerates computation of CRCs on FPIs.
>
> The 0008 patch only worked on chunks of 1024 at a time. At that size,
> the presence of hardware carryless multiplication is not that
> important. I removed the hard-coded constants in favor of a lookup
> table, so now it can handle anything up to 8400 bytes in a single
> pass.
>
> There are still some "taste" issues, but I like the overall shape here
> and how light it was. With more hardware support, we can go much lower
> than 1024 bytes, but that can be left for future work.

Nice.  I'm curious how this compares to both the existing implementations
and the proposed ones that require new intrinsics.  I like the idea of
avoiding new runtime and config checks, especially if the performance is
somewhat comparable for the most popular cases (i.e., dozens of bytes to a
few thousand bytes).

If we still want to add new intrinsics, would it be easy enough to add
them on top of this patch?  Or would it require further restructuring?

--
nathan