On 2019-10-18, Gary Gregory wrote:

> BZip2FileObject does not implement doGetContentSize() and always returns
> -1, which causes VFS to blow up if you try to read. Can this kind of
> content only be streamed?

First, an "I'm not an expert in the bzip2 file format" disclaimer.

From what I can tell, the file format does not contain the information
about the uncompressed size.

BZip2 files consist of a series of blocks, each of which holds the
result of compressing a multiple of 100000 uncompressed bytes. The
multiple (the block size, a number between 1 and 9) is part of the
metadata. All blocks except for the last have compressed the same
number of original bytes.
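
As a minimal sketch of where that multiple lives: a bzip2 stream starts
with the three bytes "BZh" followed by an ASCII digit from 1 to 9. The
helper name bzip2_block_size is made up for illustration, and the value
it returns is only the nominal block size. As far as I know the limit
applies after bzip2's initial run-length encoding step, so the number of
original bytes per block can differ.

```python
import bz2


def bzip2_block_size(header: bytes) -> int:
    """Return the nominal block size in bytes from a bzip2 stream header.

    Hypothetical helper for illustration; it only inspects the first
    four bytes and does not validate the rest of the stream.
    """
    if len(header) < 4 or header[:3] != b"BZh":
        raise ValueError("not a bzip2 stream header")
    level = header[3] - ord("0")  # ASCII digit '1'..'9'
    if not 1 <= level <= 9:
        raise ValueError("invalid bzip2 block size digit")
    return level * 100_000


compressed = bz2.compress(b"hello world", compresslevel=5)
print(bzip2_block_size(compressed[:4]))
```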

So you could count the blocks for an estimate and uncompress the last
block for the exact uncompressed size, but in the end you have to
uncompress at least some part of the content to get the uncompressed
size.
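
The simplest way to get an exact figure is to stream through the whole
file and count bytes without keeping them. Here is a rough sketch using
Python's standard bz2 module rather than VFS or Commons Compress; the
function name uncompressed_size and the 64 KiB chunk size are arbitrary
choices for illustration. It decompresses everything, so it is exact but
not cheap.

```python
import bz2
import io


def uncompressed_size(fileobj, chunk_size=64 * 1024) -> int:
    """Count the uncompressed bytes of a bzip2 file by streaming it.

    Only one chunk is held in memory at a time. BZ2File also reads
    through concatenated bzip2 streams in the same file.
    """
    total = 0
    with bz2.BZ2File(fileobj, "rb") as stream:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total


payload = b"some text that repeats\n" * 10_000
print(uncompressed_size(io.BytesIO(bz2.compress(payload))) == len(payload))
```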

Also, I believe blocks don't need to start on byte boundaries, so even
counting the blocks will be a bit more tricky. There are parallel
implementations of bzip2 in the Hadoop ecosystem (uncompressing blocks
in parallel) which must have solved this part, though.
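
To illustrate the bit-alignment problem, here is a naive sketch that
slides a 48-bit window over the data one bit at a time and looks for the
block marker 0x314159265359. This is not how the Hadoop code does it (I
have not checked); the function name is made up, and the same bit
pattern can in principle occur by chance inside compressed data, so the
result is a list of candidates rather than a reliable block count.

```python
import bz2

BLOCK_MAGIC = 0x314159265359  # 48-bit marker at the start of each block
WINDOW_MASK = (1 << 48) - 1


def find_block_bit_offsets(data: bytes) -> list:
    """Return the bit offsets at which the block marker appears.

    Hypothetical helper: offsets are counted in bits from the start of
    the data, and matches may include false positives.
    """
    offsets = []
    window = 0
    bits_seen = 0
    for byte in data:
        for shift in range(7, -1, -1):  # bzip2 packs bits MSB first
            window = ((window << 1) | ((byte >> shift) & 1)) & WINDOW_MASK
            bits_seen += 1
            if bits_seen >= 48 and window == BLOCK_MAGIC:
                offsets.append(bits_seen - 48)
    return offsets


# 256000 bytes without long runs, compressed with 100000-byte blocks
sample = bytes(range(256)) * 1000
offsets = find_block_bit_offsets(bz2.compress(sample, compresslevel=1))
print(len(offsets), [offset % 8 for offset in offsets])
```

The first block always follows the 4-byte stream header, so its offset
is 32 bits; later offsets are generally not multiples of 8.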

Stefan

