On 2019-10-18, Gary Gregory wrote:

> BZip2FileObject does not implement doGetContentSize() and always returns
> -1, which causes VFS to blow up if you try to read. Can this kind of
> content only be streamed?

First, a disclaimer: I'm not an expert in the bzip2 file format.

From what I can tell, the file format does not record the uncompressed size. BZip2 files consist of a series of blocks, each of which holds the result of compressing a multiple of 100,000 uncompressed bytes. The multiple (the block size, a number between 1 and 9) is part of the metadata. All blocks except for the last have compressed the same number of original bytes. So you could count the blocks for an estimate and uncompress the last block for the exact uncompressed size, but in the end you have to uncompress at least some part of the content to get the uncompressed size.

Also, I believe blocks don't need to start on byte boundaries, so even counting the blocks will be a bit more tricky. There are parallel implementations of bzip2 in the Hadoop ecosystem (uncompressing blocks in parallel) which must have solved this part, though.

Stefan
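
For illustration, here is a minimal sketch of the two ideas above: getting the exact size by streaming through the decompressor, and counting blocks by scanning for block headers at arbitrary bit offsets. It is written in Python only because its standard library ships a bzip2 decoder; it is not VFS or Commons Compress code. The helper names uncompressed_size and count_blocks are made up for this example. The scan assumes that each block starts with the 48-bit marker 0x314159265359 and that bits are packed most significant bit first; the same bit pattern could in principle also occur by chance inside compressed data, so treat the block count as an estimate.

```python
import bz2
import io

# Assumption: 48-bit marker that starts each compressed block.
BLOCK_MAGIC = 0x314159265359


def uncompressed_size(source, chunk_size=64 * 1024):
    """Return the exact uncompressed size by decompressing and discarding the data.

    `source` may be a file name or a binary file object.
    """
    total = 0
    with bz2.open(source, "rb") as stream:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total


def count_blocks(compressed):
    """Count block markers by sliding a 48-bit window over the data bit by bit.

    Blocks are not byte aligned, so every bit offset has to be checked.
    """
    mask = (1 << 48) - 1
    window = 0
    bits_seen = 0
    blocks = 0
    for byte in compressed:
        for shift in range(7, -1, -1):  # most significant bit first
            window = ((window << 1) | ((byte >> shift) & 1)) & mask
            bits_seen += 1
            if bits_seen >= 48 and window == BLOCK_MAGIC:
                blocks += 1
    return blocks


if __name__ == "__main__":
    payload = bytes(range(256)) * 1000  # 256,000 bytes without long runs
    compressed = bz2.compress(payload, compresslevel=1)
    print(uncompressed_size(io.BytesIO(compressed)), count_blocks(compressed))
```

Note that uncompressed_size reads the whole stream once, so its cost grows with the size of the file; count_blocks never decompresses anything but only tells you how many blocks there are.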