Bastien, pigz (a parallel version of gzip) has a variable buffer size. The -b or --blocksize option allows up to 512 MB buffers, defaulting to 128K. See http://zlib.net/pigz/
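Since pigz writes an ordinary gzip stream, zlib-based downstream tools can
still read the output, and on a multi-core machine it should also cut the
~4.5 hr gzip -1 time considerably. For example (the -b argument is in KiB,
so this asks for 4 MiB blocks; the file name is just a placeholder):

  pigz -1 -b 4096 reads.fastq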
Mark

> On Mar 29, 2016, at 4:03 PM, Chevreux, Bastien <bastien.chevr...@dsm.com> wrote:
>
>> From: meyer...@gmail.com [mailto:meyer...@gmail.com] On Behalf Of Jim Meyering
>> [...]
>> However, I suggest that you consider using xz in place of gzip.
>> Not only can it compress better, it also works faster for comparable
>> compression ratios.
>
> xz is not a viable alternative in this case: the use case is not archiving.
> There is a plethora of programs out there with zlib support compiled in, and
> these won't work on xz-packed data. Furthermore, gzip -1 is approximately 4
> times faster than xz -1 on FASTQ files (sequencing data), and the use case
> here is "temporary results, so ok-ish compression in a comparatively short
> amount of time". Gzip is ideal in that respect, as even at -1 it compresses
> down to ~25-35% ... and that already helps a lot when you need only ~350 GiB
> of hard disk instead of 1 TiB. Gzip -1 takes ~4.5 hrs, xz -1 almost a day.
>
>> That said, if you find that setting gzip.h's INBUFSIZ or OUTBUFSIZ to larger
>> values makes a significant difference, we'd like to hear about the results
>> and how you measured.
>
> Changing INBUFSIZ did not have the hoped-for effect, as it only sets the
> buffer size allocated by gzip: in the end gzip uses at most 64k of it, and
> the read() calls to the file system end up requesting only 32k each.
>
> I traced this down through multiple layers to the function fill_window() in
> deflate.c, where things get really intricate, with multiple pre-set
> variables, defines, and memcpy()s. It became clear that the code is geared
> towards using a 64k buffer with a rolling window of 32k, i.e. optimised for
> 16-bit machines.
>
> There are a few mentions of SMALL_MEM, MEDIUM_MEM and BIG_MEM variants via
> defines. However, code comments say that BIG_MEM would work on a complete
> file loaded into memory ... which is a no-go for files in the area of 15 to
> 30 GiB. I'm not even sure the code would do what the comments say.
>
> Long story short: I do not feel expert enough to touch said functions and
> change them to provide for larger input buffering. If I were forced to
> implement something, I'd try an outer buffering layer, but I'm not sure it
> would be elegant or even efficient.
>
> Best,
>   Bastien
>
> PS: Then again, I'm toying with the idea of writing a simple gzip-packer
> replacement which simply buffers data and passes it to zlib.
>
> --
> DSM Nutritional Products Microbia Inc | Bioinformatics
> 60 Westview Street | Lexington, MA 02421 | United States
> Phone +1 781 259 7613 | Fax +1 781 259 0615
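As for the PS: a packer that buffers in the application and hands the data to
zlib is only a few calls. deflateInit2() with windowBits = 15 + 16 selects a
gzip (rather than zlib) wrapper, and the buffer size fed to deflate() can be
whatever you like. A rough, untested sketch along the lines of zlib's zpipe.c
example, with an arbitrary 4 MiB buffer and compression level 1 to match the
gzip -1 use case:

#include <stdio.h>
#include <zlib.h>

#define BUFSIZE (4 * 1024 * 1024)   /* arbitrary 4 MiB buffers */

int main(void)
{
    static unsigned char in[BUFSIZE], out[BUFSIZE];
    z_stream s;
    int flush;

    s.zalloc = Z_NULL;
    s.zfree  = Z_NULL;
    s.opaque = Z_NULL;
    /* windowBits = 15 + 16 makes deflate() emit a gzip wrapper. */
    if (deflateInit2(&s, 1, Z_DEFLATED, 15 + 16, 8,
                     Z_DEFAULT_STRATEGY) != Z_OK)
        return 1;

    do {
        s.avail_in = fread(in, 1, BUFSIZE, stdin);
        s.next_in  = in;
        flush = feof(stdin) ? Z_FINISH : Z_NO_FLUSH;
        do {                    /* drain all output for this input chunk */
            s.avail_out = BUFSIZE;
            s.next_out  = out;
            deflate(&s, flush);
            fwrite(out, 1, BUFSIZE - s.avail_out, stdout);
        } while (s.avail_out == 0);
    } while (flush != Z_FINISH);

    deflateEnd(&s);
    return 0;
}

That said, pigz already does exactly this buffering (plus threads), so the
pigz invocation above may make writing anything new unnecessary.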