On Wed, 16 Aug 2023 at 21:10, Alexander Monakov <amona...@ispras.ru> wrote:
>
>
> On Tue, 15 Aug 2023, Jeff Law wrote:
>
> > Because if the compiler can optimize it automatically, then the projects have
> > to do literally nothing to take advantage of it.  They just compile normally
> > and their bitwise CRC gets optimized down to either a table lookup or a clmul
> > variant.  That's the real goal here.
>
> The only high-profile FOSS project that carries a bitwise CRC implementation
> I'm aware of is the 'xz' compression library. There bitwise CRC is used for
> populating the lookup table under './configure --enable-small':
>
> https://github.com/tukaani-project/xz/blob/2b871f4dbffe3801d0da3f89806b5935f758d5f3/src/liblzma/check/crc64_small.c
>
> It's a well-reasoned choice and your compiler would be undoing it
> (reintroducing the table when the bitwise CRC is employed specifically
> to avoid carrying the table).
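
For readers who have not opened that file, here is a minimal sketch of the
pattern being described: the bitwise loop is used once at startup to fill a
256-entry table, so the table never has to be stored in the binary. This is
written from memory in the spirit of the linked file, not copied from it; the
names crc64_table, crc64_init and crc64 are my own, and the polynomial is the
reflected ECMA-182 constant I believe xz uses.

```c
#include <stddef.h>
#include <stdint.h>

static uint64_t crc64_table[256];

/* Fill the table at run time using the bitwise (reflected) algorithm. */
static void crc64_init(void)
{
    /* Assumption: reflected ECMA-182 polynomial, as used by CRC-64/XZ. */
    const uint64_t poly64 = UINT64_C(0xC96C5795D7870F42);

    for (size_t b = 0; b < 256; ++b) {
        uint64_t r = b;
        for (int i = 0; i < 8; ++i)
            r = (r & 1) ? (r >> 1) ^ poly64 : r >> 1;
        crc64_table[b] = r;
    }
}

/* Byte-at-a-time CRC using the table; crc64_init() must run first. */
static uint64_t crc64(const uint8_t *buf, size_t size, uint64_t crc)
{
    crc = ~crc;
    while (size--)
        crc = crc64_table[(*buf++ ^ crc) & 0xFF] ^ (crc >> 8);
    return ~crc;
}
```

Calling crc64_init() once and then crc64(buf, size, 0) gives the checksum of
a buffer.
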
>
> > One final note.  Elsewhere in this thread you described performance concerns.
> > Right now clmuls can be implemented in 4c, fully piped.
>
> Pipelining doesn't matter in the implementation being proposed here, because
> the builtin is expanded to
>
>    li      a4,quotient
>    li      a5,polynomial
>    xor     a0,a1,a0
>    clmul   a0,a0,a4
>    srli    a0,a0,crc_size
>    clmul   a0,a0,a5
>    slli    a0,a0,GET_MODE_BITSIZE (word_mode) - crc_size
>    srli    a0,a0,GET_MODE_BITSIZE (word_mode) - crc_size
>
> making CLMULs data-dependent, so the second can only be started one cycle
> after the first finishes, and consecutive invocations of __builtin_crc
> are likewise data-dependent (with three cycles between CLMUL). So even
> when you get CLMUL down to 3c latency, you'll have two CLMULs and 10 cycles
> per input block, while state of the art is one widening CLMUL per input block
> (one CLMUL per 32-bit block on a 64-bit CPU) limited by throughput, not latency.
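
To make the data dependency in that sequence concrete, here is a portable C
model of it as I read it. Assumptions on my part: the CRC is MSB-first, the
data block is as wide as the CRC, "polynomial" is the generator without its
top bit, and "quotient" is floor(x^(2n) / P) for an n-bit CRC. clmul64,
barrett_quotient, crc_step_clmul and crc_step_bitwise are names made up for
this sketch, and clmul64 is a slow software stand-in for the instruction.

```c
#include <stdint.h>

/* Software stand-in for clmul: low 64 bits of the carry-less product. */
static uint64_t clmul64(uint64_t a, uint64_t b)
{
    uint64_t r = 0;
    for (int i = 0; i < 64; ++i)
        if ((b >> i) & 1)
            r ^= a << i;
    return r;
}

/* floor(x^(2n) / P) over GF(2), with P = x^n + poly and n <= 32. */
static uint64_t barrett_quotient(uint32_t poly, unsigned n)
{
    const uint64_t p = ((uint64_t)1 << n) | poly;
    uint64_t q = 0, r = 0;

    /* Long division of x^(2n): only the top dividend bit is set. */
    for (int i = 2 * (int)n; i >= 0; --i) {
        r <<= 1;
        if (i == 2 * (int)n)
            r |= 1;
        q <<= 1;
        if ((r >> n) & 1) {
            r ^= p;
            q |= 1;
        }
    }
    return q;
}

/* One block step following the quoted sequence; crc and data must be
   below 2^n.  Each line consumes the result of the line before it. */
static uint32_t crc_step_clmul(uint32_t crc, uint32_t data,
                               uint32_t poly, unsigned n, uint64_t quotient)
{
    uint64_t a = crc ^ data;                        /* xor          */
    a = clmul64(a, quotient);                       /* first clmul  */
    a >>= n;                                        /* srli         */
    a = clmul64(a, poly);                           /* second clmul */
    return (uint32_t)(a & (((uint64_t)1 << n) - 1)); /* slli + srli  */
}

/* Reference: the same step computed one bit at a time. */
static uint32_t crc_step_bitwise(uint32_t crc, uint32_t data,
                                 uint32_t poly, unsigned n)
{
    uint64_t r = crc ^ data;
    for (unsigned i = 0; i < n; ++i) {
        r <<= 1;
        if ((r >> n) & 1)
            r ^= ((uint64_t)1 << n) | poly;
    }
    return (uint32_t)r;
}
```

The second clmul64 call cannot begin until the shift after the first one has
produced its result, and the next block's xor needs the value returned here,
which is the latency chain described above.
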
>
> > I fully expect that latency to drop within the next 12-18 months.  In that
> > world, there's not going to be much benefit to using hand-coded libraries vs
> > just letting the compiler do it.

I would also hope that the hand-coded libraries would eventually gain
a code path for compilers that support the built-in.
For what it's worth, there is now a CRC library in Boost:
https://www.boost.org/doc/libs/1_83_0/doc/html/crc.html
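
A rough sketch of what such a code path could look like in a C library,
purely as an illustration. The macro MYLIB_CRC32_STEP_BUILTIN is hypothetical:
it stands for whatever interface a compiler ends up exposing, and I am not
assuming any particular builtin name or signature here. Without it, the
library falls back to its own bitwise step. The parameters are those of
CRC-32/MPEG-2 (MSB-first, initial value all ones, no final XOR).

```c
#include <stddef.h>
#include <stdint.h>

#define CRC32_POLY UINT32_C(0x04C11DB7)

/* Portable fallback: one MSB-first bitwise CRC step over a single byte. */
static uint32_t crc32_step_bitwise(uint32_t crc, uint8_t byte)
{
    crc ^= (uint32_t)byte << 24;
    for (int i = 0; i < 8; ++i)
        crc = (crc & UINT32_C(0x80000000)) ? (crc << 1) ^ CRC32_POLY
                                           : crc << 1;
    return crc;
}

/* HYPOTHETICAL hook: a build system would define this macro to the
   compiler's CRC built-in once one exists; it is not a real interface. */
#if defined(MYLIB_CRC32_STEP_BUILTIN)
#  define crc32_step(crc, byte) \
          MYLIB_CRC32_STEP_BUILTIN((crc), (byte), CRC32_POLY)
#else
#  define crc32_step(crc, byte) crc32_step_bitwise((crc), (byte))
#endif

/* CRC-32/MPEG-2 over a buffer, using whichever step was selected. */
static uint32_t crc32_mpeg2(const uint8_t *buf, size_t len)
{
    uint32_t crc = UINT32_C(0xFFFFFFFF);
    for (size_t i = 0; i < len; ++i)
        crc = crc32_step(crc, buf[i]);
    return crc;
}
```

Callers only ever see crc32_mpeg2(), so the choice between the two paths
stays an internal detail of the library.
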

Cheers,
philipp.
