[compress] XZ support and inconsistencies in the existing compressors

Lasse Collin Wed, 03 Aug 2011 12:23:19 -0700

Hi!

I have been working on XZ data compression implementation in Java
<http://tukaani.org/xz/java.html>. I was told that it could be nice
to get XZ support into Commons Compress.


I looked at the APIs and code in Commons Compress to see how XZ
support could be added. I was especially looking for details where
one would need to be careful to make different compressors behave
consistently compared to each other. I found a few possible problems
in the existing code:

(1) CompressorOutputStream should have finish(). Now
    BZip2CompressorOutputStream has finish() but
    GzipCompressorOutputStream doesn't. This should be easy to
    fix because java.util.zip.GZIPOutputStream supports finish().

(2) BZip2CompressorOutputStream.flush() calls out.flush() but it
    doesn't flush data buffered by BZip2CompressorOutputStream.
    Thus not all data written to the Bzip2 stream will be available
    in the underlying output stream after flushing. This kind of
    flush() implementation doesn't seem very useful.

    GzipCompressorOutputStream.flush() is the default version
    from InputStream and thus does nothing. Adding flush()
    into GzipCompressorOutputStream is hard because
    java.util.zip.GZIPOutputStream and java.util.zip.Deflater don't
    support sync flushing before Java 7. To get Gzip flushing in
    older Java versions one might need a complete reimplementation
    of the Deflate algorithm which isn't necessarily practical.

(3) BZip2CompressorOutputStream has finalize() that finishes a stream
    that hasn't been explicitly finished or closed. This doesn't seem
    useful. GzipCompressorOutputStream doesn't have an equivalent
    finalize().

(4) The decompressor streams don't support concatenated .gz and .bz2
    files. This can be OK when compressed data is used inside another
    file format or protocol, but with regular (standalone) .gz and
    .bz2 files it is bad to stop after the first compressed stream
    and silently ignore the remaining compressed data.

    Fixing this in BZip2CompressorInputStream should be relatively
    easy because it stops right after the last byte of the compressed
    stream. Fixing GzipCompressorInputStream is harder because the
    problem is inherited from java.util.zip.GZIPInputStream
    which reads input past the end of the first stream. One
    might need to reimplement .gz container support on top of
    java.util.zip.InflaterInputStream or java.util.zip.Inflater.

The XZ compressor supports finish() and flush(). The XZ decompressor
supports concatenated .xz files, but there is also a single-stream
version that behaves similarly to the current version of
BZip2CompressorInputStream.

Assuming that there will be some interest in adding XZ support into
Commons Compress, is it OK make Commons Compress depend on the XZ
package org.tukaani.xz, or should the XZ code be modified so that
it could be included as an internal part in Commons Compress? I
would prefer depending on org.tukaani.xz because then there is
just one code base to keep up to date.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
For additional commands, e-mail: dev-h...@commons.apache.org

[compress] XZ support and inconsistencies in the existing compressors

Reply via email to