> I have one use-case (although not published so less
> important) where the table size increase is a problem, and would prefer
> to use the current smaller crc.c (or even better, a smaller one that
> generate the tables on the fly, I forgot that it is possible).

There's another approach for table-less lookups that is slower than
slice-by-4 but faster than the Sarwate algorithm we currently use,
described in "Chi, M., He, D., & Liu, J. (2018). Fast Software-based
Table-less Algorithm for CRC Generation." You can do the polynomial
multiplication approach but manually execute it with 27 shifts and XORs:

uint32_t
crc32_update_no_xor_slice_by_4 (uint32_t crc, const char *buf)
{
  uint32_t local_buf;
  memcpy(&local_buf, buf, 4);
  local_buf = le64toh(local_buf) ^ crc;
    crc = local_buf ^ \
          (local_buf << 6) ^ \
          (local_buf << 9) ^ \
          (local_buf << 10) ^ \
          (local_buf << 12) ^ \
          (local_buf << 16) ^ \
          (local_buf << 24) ^ \
          (local_buf << 25) ^ \
          (local_buf << 26) ^ \
          (local_buf << 28) ^ \
          (local_buf << 29) ^ \
          (local_buf << 30) ^ \
          (local_buf << 31);

    crc = crc ^ \
          (crc >> 1) ^ \
          (crc >> 2) ^ \
          (crc >> 4) ^ \
          (crc >> 5) ^ \
          (crc >> 7) ^ \
          (crc >> 8) ^ \
          (crc >> 10) ^ \
          (crc >> 11) ^ \
          (crc >> 12) ^ \
          (crc >> 16) ^ \
          (crc >> 22) ^ \
          (crc >> 23) ^ \
          (crc >> 26);

  return crc;
}

Is there a use case where we would need this? Either way it is faster than
the current implementation so it might be worth considering for platforms
that don't want slice-by-8.

Performance:

# single byte lookup (current)
$ gltests/bench-crc 1000000
real   7.502692
user   7.503
sys    0.000

# manual polynomial multiplication
$ gltests/bench-crc 1000000
real   6.627648
user   6.627
sys    0.000

So a 12% speedup on my machine, and would be a larger improvement with a
greater penalty for memory lookups vs. XOR/shift instructions

On Thu, 17 Oct 2024 at 09:33, Bruno Haible <br...@clisp.org> wrote:

> Simon Josefsson wrote:
> > If we are splitting this module up into several, maybe it even make more
> > sense to have a 'crc-slice-by-8' module for that particular portable
> > optimization.  I have one use-case (although not published so less
> > important) where the table size increase is a problem, and would prefer
> > to use the current smaller crc.c (or even better, a smaller one that
> > generate the tables on the fly, I forgot that it is possible).
>
> We use different modules for different functionality. (So, for example,
> CRC with a different polynomial could be a different module. Or it
> could be in the same module if most of the tables and code are the same.)
>
> For optimizations, we have other techniques available:
>   - For optimizations that the package developer can enable:
>     That package can invoke a particular Autoconf macro in its
> configure.ac.
>     The effect of this macro is typically to initialize a configure-time
>     variable.
>   - For optimizations that the installer / distributor can enable:
>     An --enable/--disable option, through AC_ARG_ENABLE.
>   - For optimizations that the user can enable:
>     An environment variable.
>   - For optimizations that are enabled by default, depending on CPU
>     features: A feature check at runtime, whose result is cached in a
>     static variable. (/proc/cpuinfo, getauxval(AT_HWCAP), cpuid
> instruction,
>     __x86_get_cpuid_feature_leaf, `ld.so --list-diagnostics`)
>
> Bruno
>
>
>
>

Reply via email to