On Thu, 3 Aug 2023, Jeff Law wrote:
> The end goal here is to actually detect bitwise CRC implementations in the
> gimple optimizers and turn them into table lookups or carryless multiplies in
> RTL.
> 
> Mariam has that working end-to-end and has proposed a talk for the Cauldron on
> the topic.
> 
> The idea here is to carve out the RTL side which we think provides potential
> value to end users (the ability to use the builtin to get an performant CRC
> implementation) and to get community feedback on the implementation.

Jeff, as I understand this all is happening only because Coremark contains
use of bitwise CRC that affects benchmark scores. In another universe where

- Coremark was careful to checksum outputs outside of timed sections, or
- implemented CRC in a manner that is not transparent to the compiler, or
- did not use CRC at all

we would not be spending effort on this, correct? At best we might have
a discussion on providing a __builtin_clmul for carry-less multiplication
(which _is_ a fundamental primitive, unlike __builtin_crc), and move on.

Note that proposed __builtin_crc does not match how a high-performance CRC
over a variable-size array is implemented. You don't want to do two
back-to-back CLMULs to compute a new CRC given an old CRC. That makes your
loop latency-constrained to 2*L*N where L is latency of the CLMUL instruction
and N is the number of loop iterations.

Instead, efficient CRC loops have the following structure:

- they carry an unreduced remainder in the loop, performing final reduction
  modulo polynomial only once after the loop — this halves the CLMUL count;

- they overlap multiple CLMUL chains to make the loop throughput-bound
  rather than latency-bound. The typical unroll factor is about 4x-8x.

A relatively easy to follow explanation is provided by Pete Cawley at
https://www.corsix.org/content/alternative-exposition-crc32_4k_pclmulqdq
(there are other sources for similar optimization of table-based CRC).

Also note that in __builtin_crc care is needed regarding how the
polynomial is specified (which term is dropped, and what bit order is used).

Hence, I am concerned that proposed __builtin_crc is not useful for FOSS
that actually needs high-performance CRC (the Linux kernel, compression
and image libraries).

I think this crosses the line of "cheating in benchmarks" and not something
we should do in GCC.

Alexander

Reply via email to