> I have one use-case (although not published so less > important) where the table size increase is a problem, and would prefer > to use the current smaller crc.c (or even better, a smaller one that > generate the tables on the fly, I forgot that it is possible).
There's another approach for table-less lookups that is slower than slice-by-4 but faster than the Sarwate algorithm we currently use, described in "Chi, M., He, D., & Liu, J. (2018). Fast Software-based Table-less Algorithm for CRC Generation." You can do the polynomial multiplication approach but manually execute it with 27 shifts and XORs: uint32_t crc32_update_no_xor_slice_by_4 (uint32_t crc, const char *buf) { uint32_t local_buf; memcpy(&local_buf, buf, 4); local_buf = le64toh(local_buf) ^ crc; crc = local_buf ^ \ (local_buf << 6) ^ \ (local_buf << 9) ^ \ (local_buf << 10) ^ \ (local_buf << 12) ^ \ (local_buf << 16) ^ \ (local_buf << 24) ^ \ (local_buf << 25) ^ \ (local_buf << 26) ^ \ (local_buf << 28) ^ \ (local_buf << 29) ^ \ (local_buf << 30) ^ \ (local_buf << 31); crc = crc ^ \ (crc >> 1) ^ \ (crc >> 2) ^ \ (crc >> 4) ^ \ (crc >> 5) ^ \ (crc >> 7) ^ \ (crc >> 8) ^ \ (crc >> 10) ^ \ (crc >> 11) ^ \ (crc >> 12) ^ \ (crc >> 16) ^ \ (crc >> 22) ^ \ (crc >> 23) ^ \ (crc >> 26); return crc; } Is there a use case where we would need this? Either way it is faster than the current implementation so it might be worth considering for platforms that don't want slice-by-8. Performance: # single byte lookup (current) $ gltests/bench-crc 1000000 real 7.502692 user 7.503 sys 0.000 # manual polynomial multiplication $ gltests/bench-crc 1000000 real 6.627648 user 6.627 sys 0.000 So a 12% speedup on my machine, and would be a larger improvement with a greater penalty for memory lookups vs. XOR/shift instructions On Thu, 17 Oct 2024 at 09:33, Bruno Haible <br...@clisp.org> wrote: > Simon Josefsson wrote: > > If we are splitting this module up into several, maybe it even make more > > sense to have a 'crc-slice-by-8' module for that particular portable > > optimization. I have one use-case (although not published so less > > important) where the table size increase is a problem, and would prefer > > to use the current smaller crc.c (or even better, a smaller one that > > generate the tables on the fly, I forgot that it is possible). > > We use different modules for different functionality. (So, for example, > CRC with a different polynomial could be a different module. Or it > could be in the same module if most of the tables and code are the same.) > > For optimizations, we have other techniques available: > - For optimizations that the package developer can enable: > That package can invoke a particular Autoconf macro in its > configure.ac. > The effect of this macro is typically to initialize a configure-time > variable. > - For optimizations that the installer / distributor can enable: > An --enable/--disable option, through AC_ARG_ENABLE. > - For optimizations that the user can enable: > An environment variable. > - For optimizations that are enabled by default, depending on CPU > features: A feature check at runtime, whose result is cached in a > static variable. (/proc/cpuinfo, getauxval(AT_HWCAP), cpuid > instruction, > __x86_get_cpuid_feature_leaf, `ld.so --list-diagnostics`) > > Bruno > > > >