On Mon, Dec 4, 2023 at 2:27 PM Xiang Gao <xiang....@arm.com> wrote:
>
> [v8 patch]

I have a couple of quick thoughts on this:

1. I looked at a couple of implementations of this idea, and found that
the constants used in the carryless multiply are tied to the length of
the blocks. With a lookup table we can do the 3-way algorithm on any
portion of a full block length, rather than immediately falling back to
doing CRC serially. That would be faster on average. See for example
https://github.com/komrad36/CRC/tree/master, but I don't think we
actually have to fully unroll the loop like they do there.

2. With the above, we can use a larger full block size, and so on
average less time would be spent in the carryless multiply. With that,
we could possibly get away with an open-coded loop in normal C rather
than a new intrinsic (also found in the above repo). That would be more
portable.

--
John Naylor
Amazon Web Services
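
P.S. To make point 1 concrete, here is a rough sketch in portable C. It is not taken from the patch or from the repo above. All names in it (crc_update, xpow8n, gf2_mulmod, crc32c_3way, crc32c_serial) are made up for illustration. It splits the input into three equal blocks of n bytes, computes a CRC-32C register value for each, and recombines them by multiplying by x^(8n) mod P. That constant depends only on n, which is why a table indexed by block length would let the 3-way path handle partial blocks. For brevity the sketch recomputes the constant on every call, uses a bitwise multiply in place of a carryless multiply instruction, and processes the three blocks one after another. A real implementation would presumably interleave the three streams in one loop so the hardware CRC instructions can overlap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CRC32C_POLY 0x82F63B78u		/* reflected Castagnoli polynomial */

/* Raw register update, one bit at a time; no initial or final inversion. */
static uint32_t
crc_update(uint32_t crc, const unsigned char *p, size_t len)
{
	while (len--)
	{
		crc ^= *p++;
		for (int k = 0; k < 8; k++)
			crc = (crc & 1) ? (crc >> 1) ^ CRC32C_POLY : crc >> 1;
	}
	return crc;
}

/* x^(8*n) mod P in reflected form. Depends only on n, so it could be tabulated. */
static uint32_t
xpow8n(size_t n)
{
	uint32_t	v = 0x80000000u;	/* x^0 in the reflected representation */

	for (size_t i = 0; i < 8 * n; i++)
		v = (v & 1) ? (v >> 1) ^ CRC32C_POLY : v >> 1;
	return v;
}

/* a * b mod P over GF(2); open-coded stand-in for a carryless multiply. */
static uint32_t
gf2_mulmod(uint32_t a, uint32_t b)
{
	uint32_t	prod = 0;

	for (uint32_t m = 0x80000000u; m != 0; m >>= 1)
	{
		if (a & m)
			prod ^= b;
		/* advance b to b * x mod P for the next coefficient of a */
		b = (b & 1) ? (b >> 1) ^ CRC32C_POLY : b >> 1;
	}
	return prod;
}

/* Reference: plain serial CRC-32C with the usual inversions. */
uint32_t
crc32c_serial(const unsigned char *buf, size_t len)
{
	assert(buf != NULL);
	return ~crc_update(0xFFFFFFFFu, buf, len);
}

/* 3-way variant: three blocks of n bytes each, then a short serial tail. */
uint32_t
crc32c_3way(const unsigned char *buf, size_t len)
{
	assert(buf != NULL);

	size_t		n = len / 3;
	uint32_t	shift = xpow8n(n);	/* would come from a lookup table */

	/* Only the first block carries the initial register value. */
	uint32_t	a = crc_update(0xFFFFFFFFu, buf, n);
	uint32_t	b = crc_update(0, buf + n, n);
	uint32_t	c = crc_update(0, buf + 2 * n, n);

	/* Shift a past blocks B and C, shift b past block C, then fold. */
	uint32_t	crc = gf2_mulmod(gf2_mulmod(a, shift) ^ b, shift) ^ c;

	/* At most two leftover bytes are handled serially. */
	crc = crc_update(crc, buf + 3 * n, len - 3 * n);
	return ~crc;
}
```

For any buffer and length, crc32c_3way() should return the same value as crc32c_serial(). Only the lookup of the shift constant changes when the block length changes, so nothing forces a fallback to the serial path for inputs shorter than a full block.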