> Thanks! This looks nice, but please add code that generated the tables > which is important for reproduction.
Of course. > I am worried about the size increase with the new tables, what do you > think about making the new approach optional with some #ifdef, which may > be off by default to prefer your new optimized variant? Bruno mentioned this, this makes perfect sense, and thanks for confirming your preference for having this opt-out rather than opt-in. It's about 8KB of extra tables and these days you either have MBs of L2 cache or you're embedded and memory is at a massive premium so this should be an easy decision for any platform. > You make several changes to the test vectors: please make them as > ADDITIONS instead. We need some confidence that the old test vectors > work. No problem, I can restore the old tests, the new ones are generated based on the output of the old code but I can gladly add more if you like :) > Since the new code works via alignment, please add one test > vector per string size: 0, 1, 2, 3, ... up to say 20. Alignment is a good idea, the 14-byte example tests both the slice-by-8 and the final part but you're right in that we can have multiple 15-byte samples that start at xxxxx0, xxxxx1, xxxxx2 etc and they all need to return the same result. > Did you test this > on 32-bit and 64-bit platforms? Use valgrind to QA further. More good feedback, I haven't personally and don't have a real 32-bit platform here so any ideas for how best to go about this is useful. It's worth noting that the 2008 paper that raised this was done for a 32-bit system and they reported a 3x speedup so I am not expecting any nasty surprises here. > Maybe there is more room for optimization... SSE4.2+ has a hardware instruction for CRC. Unfortunately the polynomial is fixed on this instruction and different to the gzip CRC polynomial so it doesn't help us here. Pádraig has pointed out that SSE4.1+ has hardware support for carry-less multiplication and coreutils has an implementation using this, so this would be a nice addition further down the line. I'd prefer to add slice-by-8 first and get that working as this will give improvements on any 32-bit system, and then look into porting the pclmul implementation next as a separate change. > If you want to write an IETF RFC on this, maybe we can collaborate :) I would be very interested. We'd need an extensive lit-search as there has been a ton of work in the space, but definitely it would be good to standardise some approaches as I've seen 10 different implementations across 5 different open source projects in the last 2 days. On Mon, 14 Oct 2024 at 20:26, Simon Josefsson <[email protected]> wrote: > Sam Russell <[email protected]> writes: > > > I've noticed that GZIP trails behind zlib in performance and part of this > > is down to the fact that zlib is using a more efficient CRC32 > > implementation. I've written an implementation of this for gnulib based > off > > the Intel paper at > > > https://static.aminer.org/pdf/PDF/000/432/446/a_systematic_approach_to_building_high_performance_software_based_crc.pdf > > (the code is mine, written based on the paper, the tables are generated > by > > extending the code from RFC 1952 to generate the lookups for partial > > bitfields, this can be provided on request but it's not my finest work). > > > > The code: > > > https://github.com/samrussell/gnulib/commit/2d5f5d0e131feea6e04cb48d56589537506f91a8 > > (yes, I am aware you don't take contributions via github. I have an open > > ticket with GNU to get my SSH access fixed so I can have non-anonymous > > access to the repository on Savannah). > > Thanks! This looks nice, but please add code that generated the tables > which is important for reproduction. > > I am worried about the size increase with the new tables, what do you > think about making the new approach optional with some #ifdef, which may > be off by default to prefer your new optimized variant? > > You make several changes to the test vectors: please make them as > ADDITIONS instead. We need some confidence that the old test vectors > work. Since the new code works via alignment, please add one test > vector per string size: 0, 1, 2, 3, ... up to say 20. Did you test this > on 32-bit and 64-bit platforms? Use valgrind to QA further. > > Maybe there is more room for optimization... SSE4.2+ has a hardware > instruction for CRC. Support for other CRC-32 would be nice too. A > reasonable specification for it would be nice too, I can find plenty of > definitions in various RFC's but they are all duplicated. > https://en.wikipedia.org/wiki/Cyclic_redundancy_check is informative. > If you want to write an IETF RFC on this, maybe we can collaborate :) > > /Simon >
