On Tue, 8 Aug 2023, Jeff Law wrote:
> If the compiler can identify a CRC and collapse it down to a table or
> clmul, that's a major win and such code does exist in the real world.
> That was the whole point behind the Fedora experiment -- to determine
> if these things are showing up in the real world or if this is just a
> benchmarking exercise.

Can you share the results of the experiment and give your estimate of
what sort of real-world improvement is expected?

I already listed the popular FOSS projects where CRC performance is
important: the Linux kernel and a few compression libraries. Those
projects do not use a bitwise CRC loop, except sometimes for table
generation on startup (which needs less time than a page fault that may
be necessary to bring in a hardcoded table).

For those projects that need a better CRC, why is the chosen solution
to optimize it in the compiler instead of offering them a library they
could use with any compiler?

Was there any thought given to embedded projects that use bitwise CRC
exactly because they have little space to spare for a hardcoded table?

> > Useful to whom? The Linux kernel? zlib, bzip2, xz-utils? ffmpeg?
> > These consumers need high-performance blockwise CRC, offering them
> > a latency-bound elementwise CRC primitive is a disservice. And what
> > should they use as a fallback when __builtin_crc is unavailable?
>
> The point is builtin_crc would always be available. If there is no
> clmul, then the RTL backend can expand to a table lookup version.

No, not if the compiler is not GCC, or its version is less than 14. And
those projects are not going to sacrifice their portability just for
__builtin_crc.

> > I think offering a conventional library for CRC has substantial
> > advantages.
>
> That's not what I asked. If you think there's room for improvement to
> a builtin API, I'd love to hear it.
>
> But it seems you don't think this is worth the effort at all. That's
> unfortunate, but if that's the consensus, then so be it.
I think it's a strange application of development effort. You'd get
more done coding a library.

> I'll note LLVM is likely going forward with CRC detection and
> optimization at some point in the next ~6 months (effectively moving
> the implementation from the hexagon port into the generic parts of
> their loop optimizer).

I don't see CRC detection in the Hexagon port. There is a recognizer
for polynomial multiplication (CRC is division, not multiplication).

Alexander
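P.S. For concreteness, here is the shape of loop being discussed. This
is a minimal, generic CRC-32 sketch (reflected form, the polynomial
used by zlib and the kernel), not code taken from any of the projects
named above; function names are mine. The first variant is the
bit-serial polynomial division a compiler would have to recognize; the
second is the startup-time table generation I mentioned, which trades
16 table-build loop iterations per entry for one lookup per input byte
afterwards.

```c
#include <stdint.h>
#include <stddef.h>

/* Bitwise (reflected) CRC-32: the loop shape a compiler would need to
 * recognize in order to rewrite it as a table lookup or a clmul
 * sequence.  The inner loop performs polynomial division over GF(2):
 * one conditional xor with the (reflected) polynomial per message bit. */
uint32_t crc32_bitwise(const uint8_t *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ ((crc & 1) ? 0xEDB88320u : 0);
    }
    return ~crc;
}

/* Startup-time alternative: generate the 256-entry table once (same
 * bit-serial division as above, applied to each possible byte), then
 * process one byte per lookup. */
static uint32_t crc_table[256];

void crc32_init_table(void)
{
    for (uint32_t n = 0; n < 256; n++) {
        uint32_t c = n;
        for (int bit = 0; bit < 8; bit++)
            c = (c >> 1) ^ ((c & 1) ? 0xEDB88320u : 0);
        crc_table[n] = c;
    }
}

uint32_t crc32_table(const uint8_t *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++)
        crc = (crc >> 8) ^ crc_table[(crc ^ buf[i]) & 0xFFu];
    return ~crc;
}
```

Both variants compute the same function; for this parameterization the
standard check value of the input "123456789" is 0xCBF43926.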