Hi Bruno,

Presumably you've read Pádraig's comment in the other thread that I mistakenly created. A few interesting things come out of it:
- coreutils is already GNU, so no copyright review is required. However, the code lives inside the cksum utility, so it isn't in a position to be included directly if we were to reference coreutils as a submodule.
- coreutils uses the slice-by-8 approach by default (and the original paper predates the ubiquity of 64-bit CPUs), so maybe we could ignore the bitness question and just rely on a single on/off flag?
- the comment about using pclmul instructions (available on x86 since around 2010, so quite old by now) is also worth looking into, as that approach doesn't need 8 KB of tables in memory and should be in the same performance ballpark.

As for your question on speed: between zstd (which falls back to zlib for decompressing .gz files) and gzip I see an improvement of maybe 30-40% when decompressing a 100 MB file (part of this is due to multithreading, though), and gprof shows the CRC calculation taking maybe 40-50% of the CPU cycles. So, by Amdahl's law, a 3x improvement in CRC speed (as per the original Intel slice-by-8 paper) would translate to roughly a 30% reduction in the time required to decompress a large file.

For next steps, I can add the #defines for HOST_CPU_C_ABI_32BIT and an option to enable/disable the whole thing (is opt-in or opt-out the better approach for a new feature like this?), and then we make sure everything is in a position to be merged. As a future step, the coreutils pclmul implementation is also quite interesting; that seems simple enough to gate on -mavx and -mpclmul and manage in the makefiles too.

Cheers
Sam

On Mon, 14 Oct 2024 at 15:53, Bruno Haible <br...@clisp.org> wrote:
> Hi Sam,
>
> Thanks for the contribution offer!
>
> > I've noticed that GZIP trails behind zlib in performance and part of this
> > is down to the fact that zlib is using a more efficient CRC32
> > implementation.
>
> How much of a speedup do you obtain in gzip overall (not in CRC32 alone)
> for large files, through these slice-by-4 and slice-by-8 techniques?
>
> > the tables are generated by
> > extending the code from RFC 1952 to generate the lookups for partial
> > bitfields, this can be provided on request but it's not my finest work).
>
> Yes, if we include your code, we would also include the generator. Doesn't
> matter in which programming language it is written: C is fine, Python is
> fine, Lisp is fine.
>
> > - free software/patents: the Intel paper is from 2008; zlib had an
> > independently-discovered version of slice-by-4 from 2002 so it seems
> > unlikely there should be any patents encumbering this.
>
> Good.
>
> > the software is my
> > own design and I am happy for it to be transferred to GNU and licensed
> > accordingly
>
> Yes, we would need a copyright assignment to the FSF for this code, as it
> contains 40 lines of hand-written code (-> legally significant).
>
> > Requests for help:
> > - how do I get someone to review my code and then get it added to the
> > codebase? is this done via Savannah, or the mailing list, do we email
> > around PATCH files etc?
>
> We do it through the mailing list. Patches in 'diff -u' or 'git diff'
> format are fine, as are patch files produced by 'git format-patch'.
> As attachments, please, not inline. Please provide a ChangeLog entry;
> we reuse that ChangeLog entry for the commit message.
>
> > - I assume we'll want to gate this functionality based on CPU ability
> > (32-bit arithmetic is presumably fine but 64-bit makes no sense unless
> > running on a system with a native 64-bit bus). What is the convention
> > here for enforcing this at compile-time?
>
> You could add the file m4/host-cpu-c-abi.m4 to the module, invoke
> gl_HOST_CPU_C_ABI_32BIT at configure time, and then dispatch on
> HOST_CPU_C_ABI_32BIT.
>
> Probably, for embedded systems which don't want to spend 4 KB of cache,
> it would be useful to have a #define for the entire optimization.
>
> Bruno