Re: RFR: 8322256: Define and document GZIPInputStream concatenated stream semantics [v8]

Archie Cobbs Wed, 31 Jul 2024 15:54:19 -0700

On Sat, 27 Jul 2024 15:00:51 GMT, Archie Cobbs <aco...@openjdk.org> wrote:


>> `GZIPInputStream` supports reading data from multiple concatenated GZIP data 
>> streams since [JDK-4691425](https://bugs.openjdk.org/browse/JDK-4691425). In 
>> order to do this, after a GZIP trailer frame is read, it attempts to read a 
>> GZIP header frame and, if successful, proceeds onward to decompress the new 
>> stream. If the attempt to decode a GZIP header frame fails, or happens to 
>> trigger an `IOException`, it just ignores the trailing garbage and/or the 
>> `IOException` and returns EOF.
>> 
>> There are several issues with this:
>> 
>> 1. The behaviors of (a) supporting concatenated streams and (b) ignoring 
>> trailing garbage are not documented, much less precisely specified.
>> 2. Ignoring trailing garbage is dubious because it could easily hide errors 
>> or other data corruption that an application would rather be notified about. 
>> Moreover, the API claims that a `ZipException` will be thrown when corrupt 
>> data is read, but obviously that doesn't happen in the trailing garbage 
>> scenario (so N concatenated streams where the last one has a corrupted 
>> header frame is indistinguishable from N-1 valid streams).
>> 3. There's no way to create a `GZIPInputStream` that does _not_ support 
>> stream concatenation.
>> 
>> On the other hand, `GZIPInputStream` is an old class with lots of existing 
>> usage, so it's important to preserve the existing behavior, warts and all 
>> (note: my the definition of "existing behavior" here includes the bug fix in 
>> [JDK-7036144](https://bugs.openjdk.org/browse/JDK-7036144)).
>> 
>> So this patch adds a new constructor that takes two new parameters 
>> `allowConcatenation` and `allowTrailingGarbage`. The legacy behavior is 
>> enabled by setting both to true; otherwise, they do what they sound like. In 
>> particular, when `allowTrailingGarbage` is false, then the underlying input 
>> must contain exactly one (if `allowConcatenation` is false) or exactly N (if 
>> `allowConcatenation` is true) concatenated GZIP data streams, otherwise an 
>> exception is guaranteed.
>
> Archie Cobbs has updated the pull request incrementally with two additional 
> commits since the last revision:
> 
>  - Refactor to eliminate "lenient" mode (still failng test: GZIPInZip.java).
>  - Add GZIP input test for various gzip(1) command outputs.

It seems like we're going around in circles a bit here... but maybe the circles 
are getting smaller? But I'm also running out of steam...

> We don't want to change this long standing behavior as it has the potential 
> of breaking existing applications and it is consistent with gzip and also 
> winzip.

Sounds great. In fact my original plan was to not change the existing behavior 
at all (by putting all of the new/improved behavior behind a new constructor)...

> So through this PR, we should clarify the javadoc as to what current 
> GZIPInputStream implementation does and add additional tests to expand the 
> coverage

Event just doing this has some subtlety to it... Note, none of the current 
behavior is actually specified (the entirety of the Javadoc says: "This class 
implements a stream filter for reading compressed data in the GZIP file 
format"), so any description we add creates a new actual specification for the 
future.

Are you sure we want to do that for all of the current behavior?

I would think maybe for some we do and for some we don't:
* Swallowing `IOException`'s - this we should not specify, so that we may 
someday eliminate it to make this class reliable
* Concatenated GZIP's - this would be good to document
* Reading extra bytes - the only thing to document would be that "there is a 
chance extra bytes will be read" which is not very useful, so what's the point?

What of that would you want to specify?

My original idea was to only add specification for the new/improved 
functionality, and leave the existing functionality. If we go this route, then 
the only thing to decide would be what new/improved functionality we want to 
add.

> A separate discussion can take place to discuss the merits of whether there 
> is perceived value in throwing an IOException when trailing garbage is 
> encountered as well as any benefit of not supporting concatenated gzip files. 
> It will also allow time for further review of other tools/apis that support 
> gzip to see what they may or may not do.

Guess I thought we were sort-of having those discussions now...

What do we agree on?
* Supporting concatenation is a good thing
* Supporting ignoring trailing garbage is a good thing

What is there debate about?
* Swallowing trailing garbage `IOException`'s - I think it is a bad thing, but 
it's required for backward compatibility
* Precise stops when `markSupported() = true` - I think it is a good thing (and 
it is backward compatible)
* Support for reading only one GZIP stream - I think it is a good thing, 
especially useful when combined with previous

IMO doing nothing is a bad option, because as it stands today the class is 
fundamentally unreliable, and we should at minimum provide some new way to get 
reliable behavior (doing so doesn't require changing the existing behavior).

-------------

PR Comment: https://git.openjdk.org/jdk/pull/18385#issuecomment-2261597142

Re: RFR: 8322256: Define and document GZIPInputStream concatenated stream semantics [v8]

Reply via email to