Hiroshi Miura <miur...@linux.com> added the comment:
Lasse Collin gave me an explanation of the LZMA1 data format and a suggestion for how to implement this.
I'd like to change this issue to a documentation issue, to add more description of the limitations of FORMAT_ALONE and LZMA1.

Lasse's suggestion is as follows:

> liblzma cannot be used to decode data from .7z files except in certain
> cases. This isn't a bug, it's a missing feature.
>
> The raw encoder and decoder APIs only support streams that contain an
> end of payload marker (EOPM) alias end of stream (EOS) marker. .7z
> files use LZMA1 without such an end marker. Instead, the end is handled
> by the decoder knowing the exact uncompressed size of the data.
>
> The API of liblzma supports LZMA1 without end marker via
> lzma_alone_decoder(). That API can be abused to properly decode raw
> LZMA1 with known uncompressed size by feeding the decoder a fake 13-byte
> header. Everything else in the public API requires some end marker.
>
> Decoding LZMA1 without BCJ or other extra filters from .7z with
> lzma_raw_decoder() kind of works but you will notice that it will never
> return LZMA_STREAM_END, only LZMA_OK. This is because it will never see
> an end marker. A minor downside is that it won't then do a small
> integrity check at the end either (one variable in the range decoder
> state will be 0 at the end of any valid LZMA1 stream);
> lzma_alone_decoder() does this check even when end marker is missing.
>
> If you use lzma_raw_decoder() for end-markerless LZMA1, make sure that
> you never give it more output space than the real uncompressed size. In
> rare cases this could result in extra output or an error since the
> decoder would try to decode more output using the input it has gotten
> so far. Overall I think the hack with lzma_alone_decoder() is a better
> way with the current API.
>
> BCJ filters process the input data in chunks of a few bytes long, thus
> they need to hold a few bytes of look-ahead buffer.
> With some filters
> like ARM the look-ahead is aligned and if the uncompressed size is a
> multiple of this alignment, lzma_raw_decoder() will give you all the
> data even when the LZMA1 layer doesn't have an end marker. The x86
> filter has one-byte alignment but needs to see five bytes from the
> future before producing output. When LZMA1 layer doesn't return
> LZMA_STREAM_END, the x86 filter doesn't know that the end was reached
> and cannot flush the last bytes out.
>
> Using liblzma to decode .7z works in these cases:
>
>   - LZMA1 using a fake 13-byte header with lzma_alone_decoder():
>
>         1 byte   LZMA properties byte that encodes lc/lp/pb
>         4 bytes  dictionary size as little endian uint32_t
>         8 bytes  uncompressed size as little endian uint64_t;
>                  UINT64_MAX means unknown and then (and only then)
>                  EOPM must be present

----------
title: LZMADecompressor.decompress(FORMAT_RAW) truncate output when input is paticular LZMA+BCJ data -> Docs: More description of reason about LZMA1 data handling with FORMAT_ALONE

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue41210>
_______________________________________
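
For illustration, the fake 13-byte header that Lasse describes could be built from Python roughly as follows. This is only a minimal sketch: the helper names `make_alone_header` and `decode_raw_lzma1` are hypothetical, the lc/lp/pb and dictionary size values are assumed to be known from the archive's coder properties, and it assumes that `lzma.FORMAT_ALONE` is backed by lzma_alone_decoder().

```python
import lzma
import struct

UINT64_MAX = 0xFFFFFFFFFFFFFFFF  # "uncompressed size unknown" in the header


def make_alone_header(lc, lp, pb, dict_size, uncompressed_size=None):
    """Build a 13-byte FORMAT_ALONE header (hypothetical helper).

    Layout, all little endian:
      1 byte  properties byte encoding lc/lp/pb
      4 bytes dictionary size (uint32)
      8 bytes uncompressed size (uint64), UINT64_MAX if unknown
    """
    props = (pb * 5 + lp) * 9 + lc
    size = UINT64_MAX if uncompressed_size is None else uncompressed_size
    return struct.pack("<BIQ", props, dict_size, size)


def decode_raw_lzma1(raw, lc, lp, pb, dict_size, uncompressed_size=None):
    """Decode raw LZMA1 data by prepending a fake header (hypothetical helper)."""
    header = make_alone_header(lc, lp, pb, dict_size, uncompressed_size)
    decompressor = lzma.LZMADecompressor(format=lzma.FORMAT_ALONE)
    return decompressor.decompress(header + raw)
```

For data taken from a .7z file, `uncompressed_size` would be the exact size recorded in the archive, since such streams carry no end marker. With `uncompressed_size=None` the header holds UINT64_MAX, so the stream must then end with an EOPM.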