Hi Sam, Thanks for working on this!
Sam Russell wrote: > 85% time reduction on AMD Ryzen 5 5600: > > $ ./gltests/bench-crc 1000000 > real 1.740296 > user 1.740 > sys 0.000 > > $ ../bench-crc-pclmul 1000000 > real 0.248324 > user 0.248 > sys 0.000 > > This translates to a 13% time reduction for gzip: > > $ time ./gzip_sliceby8 -k -d -c large_file.gz > /dev/null > > real 0m0.310s > user 0m0.310s > sys 0m0.000s > > $ time ./gzip_pclmul -k -d -c large_file.gz > /dev/null > > real 0m0.267s > user 0m0.267s > sys 0m0.000s This indicates that crc is no longer one of the biggest time consumer for gzip. Still, a 13% speedup is something we would like to include. Let me comment regarding the module structure and build system. I'll let Simon speak up for the rest. * I would suggest to rename the main source file from crc-pclmul.c to crc-x86_64.c. Rationale: So that immediately clear that the code is specific to the x86_64 CPUs. Not everyone is an assembly language hacker, and even some assembly language hackers (like me) don't know about the 'pclmul' instruction and of which ISA it is part of. Whereas everyone knows what x86_64 is. * Similarly, I suggest to rename the module 'crc-pclmul' to 'crc-x86_64'. * I like the fact that you have put the new code in a separate module. This will allow packages to invoke gnulib-tool with options --avoid=crc-x86_64 crc if they don't want to include the CPU specific optimization. * In the patch, one of the diffs ends with \ No newline at end of file Could you please add a newline at the end of that source file? * In crc.c: Would it make sense to cache the value of pclmul_enabled in a static variable? You can probably guess it by looking at the assembly-language code that gcc -O2 produces for the two __builtin_cpu_supports invocations. * Are the options -mpclmul -mavx understood by both gcc and clang? Or does clang use different options for the same thing? * In the module description, when you write CFLAGS += -mpclmul -mavx it has an effect on all compilations of the same Makefile (e.g. in the case of coreutils, on all the coreutils programs). Which means that the resulting binaries will *NOT WORK* on CPUs that don't have AVX extensions. So, these CFLAGS changes should be limited to the one compilation unit that needs it. Unfortunately, this is not easy to do with Automake (it requires an intermediate convenience library to be declared), and I don't know whether gnulib-tool already supports this contortion. But before I look into it in detail: Is there an alternative to these command-line options? Maybe there are some #pragmas that have the same effect? > but could someone recommend a method for > detecting alignment so we can safely upscale to a __m128i pointer? How about ((uintptr_t) ptr & 0xF) == 0 or something like that? Bruno