[issue41210] Docs: More description of reason about LZMA1 data handling with FORMAT_ALONE

Hiroshi Miura Sun, 12 Jul 2020 19:35:14 -0700


Hiroshi Miura <miur...@linux.com> added the comment:


Lasse Collin gave me an explanation of the LZMA1 data format and a suggestion 
on how to implement it.

I'd like to change this issue to a documentation issue, to add more description 
of the limitations of FORMAT_ALONE and LZMA1.

A suggestion from Lasse is as follows:

> liblzma cannot be used to decode data from .7z files except in certain
> cases. This isn't a bug, it's a missing feature.
>
> The raw encoder and decoder APIs only support streams that contain an
> end of payload marker (EOPM) alias end of stream (EOS) marker. .7z
> files use LZMA1 without such an end marker. Instead, the end is handled
> by the decoder knowing the exact uncompressed size of the data.
>
> The API of liblzma supports LZMA1 without end marker via
> lzma_alone_decoder(). That API can be abused to properly decode raw
> LZMA1 with known uncompressed size by feeding the decoder a fake 13-byte
> header. Everything else in the public API requires some end marker.
>
> Decoding LZMA1 without BCJ or other extra filters from .7z with
> lzma_raw_decoder() kind of works but you will notice that it will never
> return LZMA_STREAM_END, only LZMA_OK. This is because it will never see
> an end marker. A minor downside is that it won't then do a small
> integrity check at the end either (one variable in the range decoder
> state will be 0 at the end of any valid LZMA1 stream);
> lzma_alone_decoder() does this check even when end marker is missing.
>
> If you use lzma_raw_decoder() for end-markerless LZMA1, make sure that
> you never give it more output space than the real uncompressed size. In
> rare cases this could result in extra output or an error since the
> decoder would try to decode more output using the input it has gotten
> so far. Overall I think the hack with lzma_alone_decoder() is a better
> way with the current API.
>
> BCJ filters process the input data in chunks a few bytes long, thus
> they need to hold a few bytes of look-ahead buffer. With some filters
> like ARM the look-ahead is aligned and if the uncompressed size is a
> multiple of this alignment, lzma_raw_decoder() will give you all the
> data even when the LZMA1 layer doesn't have an end marker. The x86
> filter has one-byte alignment but needs to see five bytes from the
> future before producing output. When LZMA1 layer doesn't return
> LZMA_STREAM_END, the x86 filter doesn't know that the end was reached
> and cannot flush the last bytes out.
>
> Using liblzma to decode .7z works in these cases:
>
> - LZMA1 using a fake 13-byte header with lzma_alone_decoder():
>
>       1 byte   LZMA properties byte that encodes lc/lp/pb
>       4 bytes  dictionary size as little endian uint32_t
>       8 bytes  uncompressed size as little endian uint64_t;
>                UINT64_MAX means unknown and then (and only then)
>                EOPM must be present
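
For the documentation, the fake-header approach could be illustrated with something like the following. This is only a minimal sketch: the helper names build_alone_header() and decode_lzma1_with_fake_header() are made up for this example and are not part of the lzma module. It assumes the caller already knows lc/lp/pb, the dictionary size and (for data without an end marker) the uncompressed size, for example from the .7z coder properties. The demo at the bottom passes no size, because data produced by Python's own encoder carries an end marker; for end-markerless data the real uncompressed size would be passed instead.

```python
import lzma
import struct

# UINT64_MAX in the size field means "size unknown"; an end marker must then be present.
UNKNOWN_SIZE = 0xFFFFFFFFFFFFFFFF


def build_alone_header(lc, lp, pb, dict_size, uncompressed_size=None):
    """Build the 13-byte FORMAT_ALONE (.lzma) header (hypothetical helper).

    Layout: 1 properties byte, 4-byte little-endian dictionary size,
    8-byte little-endian uncompressed size.
    """
    if not (0 <= lc <= 8 and 0 <= lp <= 4 and 0 <= pb <= 4):
        raise ValueError("lc/lp/pb out of range")
    props = (pb * 5 + lp) * 9 + lc  # standard LZMA properties encoding
    size = UNKNOWN_SIZE if uncompressed_size is None else uncompressed_size
    # "<" disables padding, so the result is exactly 1 + 4 + 8 = 13 bytes.
    return struct.pack("<BIQ", props, dict_size, size)


def decode_lzma1_with_fake_header(raw, lc, lp, pb, dict_size,
                                  uncompressed_size=None):
    """Decode raw LZMA1 data by prepending a fake header (hypothetical helper)."""
    header = build_alone_header(lc, lp, pb, dict_size, uncompressed_size)
    decoder = lzma.LZMADecompressor(format=lzma.FORMAT_ALONE)
    return decoder.decompress(header + raw)


if __name__ == "__main__":
    payload = b"example payload " * 64
    # Strip the real header from a .lzma stream to obtain raw LZMA1 data
    # (this data still contains an end marker, so the size stays unknown).
    alone = lzma.compress(payload, format=lzma.FORMAT_ALONE)
    raw = alone[13:]
    # lc=3, lp=0, pb=2 and an 8 MiB dictionary match the default preset.
    restored = decode_lzma1_with_fake_header(raw, 3, 0, 2, 1 << 23)
    print(restored == payload)
```

Going through FORMAT_ALONE rather than FORMAT_RAW follows Lasse's recommendation above: the alone decoder can stop at a known uncompressed size and still performs the final integrity check.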

----------
title: LZMADecompressor.decompress(FORMAT_RAW) truncate output when input is 
particular LZMA+BCJ data -> Docs: More description of reason about LZMA1 data 
handling with FORMAT_ALONE

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue41210>
_______________________________________
