On Thu, 3 Aug 2023, Jeff Law wrote:
> The end goal here is to actually detect bitwise CRC implementations in the > gimple optimizers and turn them into table lookups or carryless multiplies in > RTL. > > Mariam has that working end-to-end and has proposed a talk for the Cauldron on > the topic. > > The idea here is to carve out the RTL side which we think provides potential > value to end users (the ability to use the builtin to get an performant CRC > implementation) and to get community feedback on the implementation. Jeff, as I understand this all is happening only because Coremark contains use of bitwise CRC that affects benchmark scores. In another universe where - Coremark was careful to checksum outputs outside of timed sections, or - implemented CRC in a manner that is not transparent to the compiler, or - did not use CRC at all we would not be spending effort on this, correct? At best we might have a discussion on providing a __builtin_clmul for carry-less multiplication (which _is_ a fundamental primitive, unlike __builtin_crc), and move on. Note that proposed __builtin_crc does not match how a high-performance CRC over a variable-size array is implemented. You don't want to do two back-to-back CLMULs to compute a new CRC given an old CRC. That makes your loop latency-constrained to 2*L*N where L is latency of the CLMUL instruction and N is the number of loop iterations. Instead, efficient CRC loops have the following structure: - they carry an unreduced remainder in the loop, performing final reduction modulo polynomial only once after the loop — this halves the CLMUL count; - they overlap multiple CLMUL chains to make the loop throughput-bound rather than latency-bound. The typical unroll factor is about 4x-8x. A relatively easy to follow explanation is provided by Pete Cawley at https://www.corsix.org/content/alternative-exposition-crc32_4k_pclmulqdq (there are other sources for similar optimization of table-based CRC). Also note that in __builtin_crc care is needed regarding how the polynomial is specified (which term is dropped, and what bit order is used). Hence, I am concerned that proposed __builtin_crc is not useful for FOSS that actually needs high-performance CRC (the Linux kernel, compression and image libraries). I think this crosses the line of "cheating in benchmarks" and not something we should do in GCC. Alexander