Bastien, pigz (a parallel version of gzip) has a variable buffer size. The -b or --blocksize option allows up to 512 MB buffers, defaulting to 128K. See http://zlib.net/pigz/
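Since pigz writes an ordinary gzip stream, zlib-based downstream tools can
still read the output, and on a multi-core machine it should also cut the
~4.5 hr gzip -1 time considerably. For example (the -b argument is in KiB,
so this asks for 4 MiB blocks; the file name is just a placeholder):

  pigz -1 -b 4096 reads.fastq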
Mark

> On Mar 29, 2016, at 4:03 PM, Chevreux, Bastien <bastien.chevr...@dsm.com> wrote:
>
>> From: meyer...@gmail.com [mailto:meyer...@gmail.com] On Behalf Of Jim Meyering
>> [...]
>> However, I suggest that you consider using xz in place of gzip.
>> Not only can it compress better, it also works faster for comparable
>> compression ratios.
>
> xz is not a viable alternative in this case: the use case is not archiving.
> There is a plethora of programs out there with zlib support compiled in, and
> these won't work on xz-packed data. Furthermore, gzip -1 is approximately 4
> times faster than xz -1 on FASTQ files (sequencing data), and the use case
> here is "temporary results, so ok-ish compression in a comparatively short
> amount of time". Gzip is ideal in that respect, as even at -1 it compresses
> down to ~25-35% ... and that already helps a lot when you need only ~350 GiB
> of hard disk instead of 1 TiB. Gzip -1 takes ~4.5 hrs, xz -1 almost a day.
>
>> That said, if you find that setting gzip.h's INBUFSIZ or OUTBUFSIZ to larger
>> values makes a significant difference, we'd like to hear about the results
>> and how you measured.
>
> Changing INBUFSIZ did not have the hoped-for effect, as it only sets the
> buffer size allocated by gzip: in the end gzip uses at most 64k of it, and
> the read() calls to the file system end up requesting only 32k each.
>
> I traced this down through multiple layers to the function fill_window() in
> deflate.c, where things get really intricate, with multiple pre-set
> variables, defines, and memcpy()s. It became clear that the code is geared
> towards using a 64k buffer with a rolling window of 32k, i.e. optimised for
> 16-bit machines.
>
> There are a few mentions of SMALL_MEM, MEDIUM_MEM and BIG_MEM variants via
> defines. However, code comments say that BIG_MEM would work on a complete
> file loaded into memory ... which is a no-go for files in the area of 15 to
> 30 GiB. I'm not even sure the code would do what the comments say.
>
> Long story short: I do not feel expert enough to touch said functions and
> change them to provide for larger input buffering. If I were forced to
> implement something, I'd try an outer buffering layer, but I'm not sure it
> would be elegant or even efficient.
>
> Best,
>   Bastien
>
> PS: Then again, I'm toying with the idea of writing a simple gzip-packer
> replacement which simply buffers data and passes it to zlib.
>
> --
> DSM Nutritional Products Microbia Inc | Bioinformatics
> 60 Westview Street | Lexington, MA 02421 | United States
> Phone +1 781 259 7613 | Fax +1 781 259 0615
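As for the PS: a packer that buffers in the application and hands the data to
zlib is only a few calls. deflateInit2() with windowBits = 15 + 16 selects a
gzip (rather than zlib) wrapper, and the buffer size fed to deflate() can be
whatever you like. A rough, untested sketch along the lines of zlib's zpipe.c
example, with an arbitrary 4 MiB buffer and compression level 1 to match the
gzip -1 use case:

#include <stdio.h>
#include <zlib.h>

#define BUFSIZE (4 * 1024 * 1024)   /* arbitrary 4 MiB buffers */

int main(void)
{
    static unsigned char in[BUFSIZE], out[BUFSIZE];
    z_stream s;
    int flush;

    s.zalloc = Z_NULL;
    s.zfree  = Z_NULL;
    s.opaque = Z_NULL;
    /* windowBits = 15 + 16 makes deflate() emit a gzip wrapper. */
    if (deflateInit2(&s, 1, Z_DEFLATED, 15 + 16, 8,
                     Z_DEFAULT_STRATEGY) != Z_OK)
        return 1;

    do {
        s.avail_in = fread(in, 1, BUFSIZE, stdin);
        s.next_in  = in;
        flush = feof(stdin) ? Z_FINISH : Z_NO_FLUSH;
        do {                    /* drain all output for this input chunk */
            s.avail_out = BUFSIZE;
            s.next_out  = out;
            deflate(&s, flush);
            fwrite(out, 1, BUFSIZE - s.avail_out, stdout);
        } while (s.avail_out == 0);
    } while (flush != Z_FINISH);

    deflateEnd(&s);
    return 0;
}

That said, pigz already does exactly this buffering (plus threads), so the
pigz invocation above may make writing anything new unnecessary.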