Hi!

I have been working on an XZ data compression implementation in Java <http://tukaani.org/xz/java.html>. I was told that it could be nice to get XZ support into Commons Compress.

I looked at the APIs and code in Commons Compress to see how XZ support could be added. I was especially looking for details where one would need to be careful to make the different compressors behave consistently with each other.

I found a few possible problems in the existing code:

(1) CompressorOutputStream should have finish(). Currently BZip2CompressorOutputStream has finish() but GzipCompressorOutputStream doesn't. This should be easy to fix because java.util.zip.GZIPOutputStream supports finish().

(2) BZip2CompressorOutputStream.flush() calls out.flush() but it doesn't flush data buffered by BZip2CompressorOutputStream itself. Thus not all data written to the Bzip2 stream will be available in the underlying output stream after flushing. This kind of flush() implementation doesn't seem very useful.

GzipCompressorOutputStream.flush() is the default version from OutputStream and thus does nothing. Adding flush() into GzipCompressorOutputStream is hard because java.util.zip.GZIPOutputStream and java.util.zip.Deflater don't support sync flushing before Java 7. To get Gzip flushing in older Java versions one might need a complete reimplementation of the Deflate algorithm, which isn't necessarily practical.

(3) BZip2CompressorOutputStream has a finalize() that finishes a stream that hasn't been explicitly finished or closed. This doesn't seem useful. GzipCompressorOutputStream doesn't have an equivalent finalize().

(4) The decompressor streams don't support concatenated .gz and .bz2 files. This can be OK when compressed data is used inside another file format or protocol, but with regular (standalone) .gz and .bz2 files it is bad to stop after the first compressed stream and silently ignore the remaining compressed data.

Fixing this in BZip2CompressorInputStream should be relatively easy because it stops right after the last byte of the compressed stream.
Fixing GzipCompressorInputStream is harder because the problem is inherited from java.util.zip.GZIPInputStream, which reads input past the end of the first stream. One might need to reimplement .gz container support on top of java.util.zip.InflaterInputStream or java.util.zip.Inflater.

The XZ compressor supports finish() and flush(). The XZ decompressor supports concatenated .xz files, but there is also a single-stream version that behaves similarly to the current version of BZip2CompressorInputStream.

Assuming that there will be some interest in adding XZ support into Commons Compress, is it OK to make Commons Compress depend on the XZ package org.tukaani.xz, or should the XZ code be modified so that it could be included as an internal part in Commons Compress? I would prefer depending on org.tukaani.xz because then there is just one code base to keep up to date.

--
Lasse Collin | IRC: Larhzu @ IRCnet & Freenode
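
For illustration, the idea in point (4) of handling the .gz container directly on top of java.util.zip.Inflater could look roughly like the sketch below. It is only a sketch under stated assumptions, not how Commons Compress or org.tukaani.xz does it: the names ConcatenatedGzipSketch, decompressAll and gzip are made up for this example, the whole input is held in a byte array instead of being streamed, members with optional header fields (FLG != 0) are rejected, and the CRC32/ISIZE trailer is skipped rather than verified.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.GZIPOutputStream;
import java.util.zip.Inflater;

// Hypothetical example class; not part of Commons Compress or org.tukaani.xz.
final class ConcatenatedGzipSketch {

    /** Decompresses every gzip member found in the input, one after another. */
    static byte[] decompressAll(byte[] in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int pos = 0;

        while (pos < in.length) {
            // Fixed 10-byte header: magic (0x1f 0x8b), method (8 = Deflate),
            // flags, mtime (4 bytes), extra flags, OS. The trailer is 8 bytes.
            if (in.length - pos < 18
                    || (in[pos] & 0xFF) != 0x1f
                    || (in[pos + 1] & 0xFF) != 0x8b
                    || in[pos + 2] != 8) {
                throw new IOException("Not a gzip member at offset " + pos);
            }
            if (in[pos + 3] != 0) {
                // Assumption of this sketch: no FEXTRA/FNAME/FCOMMENT/FHCRC.
                throw new IOException("Optional header fields are not handled");
            }
            pos += 10;

            // nowrap = true: raw Deflate data without the zlib wrapper.
            Inflater inflater = new Inflater(true);
            try {
                inflater.setInput(in, pos, in.length - pos);
                while (!inflater.finished()) {
                    int n = inflater.inflate(buf);
                    if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                        throw new IOException("Truncated or corrupt Deflate data");
                    }
                    out.write(buf, 0, n);
                }
                // getRemaining() tells how many input bytes were NOT consumed,
                // so this is the exact end of the Deflate data of this member.
                pos = in.length - inflater.getRemaining();
            } catch (DataFormatException e) {
                throw new IOException("Corrupt Deflate data", e);
            } finally {
                inflater.end();
            }

            if (in.length - pos < 8) {
                throw new IOException("Truncated gzip trailer");
            }
            pos += 8; // Skip CRC32 and ISIZE (not verified in this sketch).
        }
        return out.toByteArray();
    }

    /** Helper for the demo: compresses data into a single gzip member. */
    static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        GZIPOutputStream gz = new GZIPOutputStream(sink);
        gz.write(data, 0, data.length);
        gz.finish(); // finish() completes the member without needing close()
        return sink.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] a = gzip("hello ".getBytes(StandardCharsets.UTF_8));
        byte[] b = gzip("world".getBytes(StandardCharsets.UTF_8));

        // Equivalent of "cat a.gz b.gz > both.gz".
        byte[] both = new byte[a.length + b.length];
        System.arraycopy(a, 0, both, 0, a.length);
        System.arraycopy(b, 0, both, a.length, b.length);

        System.out.println(new String(decompressAll(both), StandardCharsets.UTF_8));
    }
}
```

The key point is that Inflater reports how much input it left unconsumed, so the reader knows exactly where one member ends and can check whether another one follows. A streaming version would need the same bookkeeping on top of a buffered InputStream.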