Re: RFR: 8322256: Define and document GZIPInputStream concatenated stream semantics [v5]

Jaikiran Pai Mon, 15 Jul 2024 03:05:37 -0700

On Thu, 6 Jun 2024 17:03:57 GMT, Archie Cobbs <[email protected]> wrote:


>> `GZIPInputStream` supports reading data from multiple concatenated GZIP data 
>> streams since [JDK-4691425](https://bugs.openjdk.org/browse/JDK-4691425). In 
>> order to do this, after a GZIP trailer frame is read, it attempts to read a 
>> GZIP header frame and, if successful, proceeds onward to decompress the new 
>> stream. If the attempt to decode a GZIP header frame fails, or happens to 
>> trigger an `IOException`, it just ignores the trailing garbage and/or the 
>> `IOException` and returns EOF.
>> 
>> There are several issues with this:
>> 
>> 1. The behaviors of (a) supporting concatenated streams and (b) ignoring 
>> trailing garbage are not documented, much less precisely specified.
>> 2. Ignoring trailing garbage is dubious because it could easily hide errors 
>> or other data corruption that an application would rather be notified about. 
>> Moreover, the API claims that a `ZipException` will be thrown when corrupt 
>> data is read, but obviously that doesn't happen in the trailing garbage 
>> scenario (so N concatenated streams where the last one has a corrupted 
>> header frame is indistinguishable from N-1 valid streams).
>> 3. There's no way to create a `GZIPInputStream` that does _not_ support 
>> stream concatenation.
>> 
>> On the other hand, `GZIPInputStream` is an old class with lots of existing 
>> usage, so it's important to preserve the existing behavior, warts and all 
>> (note: my the definition of "existing behavior" here includes the bug fix in 
>> [JDK-7036144](https://bugs.openjdk.org/browse/JDK-7036144)).
>> 
>> So this patch adds a new constructor that takes two new parameters 
>> `allowConcatenation` and `allowTrailingGarbage`. The legacy behavior is 
>> enabled by setting both to true; otherwise, they do what they sound like. In 
>> particular, when `allowTrailingGarbage` is false, then the underlying input 
>> must contain exactly one (if `allowConcatenation` is false) or exactly N (if 
>> `allowConcatenation` is true) concatenated GZIP data streams, otherwise an 
>> exception is guaranteed.
>
> Archie Cobbs has updated the pull request with a new target base due to a 
> merge or a rebase. The pull request now contains 12 commits:
> 
>  - Bump @since from 23 to 24.
>  - Merge branch 'master' into JDK-8322256
>  - Relabel "trailing garbage" as "extra bytes" to sound less accusatory.
>  - Simplify code by eliminating an impossible case.
>  - Field name change and Javadoc wording tweaks.
>  - Merge branch 'master' into JDK-8322256
>  - Javadoc wording tweaks.
>  - Merge branch 'master' into JDK-8322256
>  - Clarify exceptions: sometimes ZipException, sometimes EOFException.
>  - Merge branch 'master' into JDK-8322256
>  - ... and 2 more: https://git.openjdk.org/jdk/compare/75dc2f85...f845a75b

Hello Archie, sorry it has taken long to review this PR.

The proposal here is to have `java.util.zip.GZIPInputStream` specify (and 
improve) how it deals with concatenated GZIP stream(s) and also allow for 
applications instantiating `GZIPInputStream` to decide whether or not the 
underlying `InputStream` passed to the `GZIPInputStream` instance is expected 
to contain a concatenated GZIP stream(s).

I'll include here the text that I had sent in a private communication to Archie:

I think this comes down to introducing a new optional boolean parameter to the 
constructor of GZIPInputStream.

The boolean when "true" will attempt to read a potential additional GZIP stream 
after the previous GZIP stream's trailer has been read. In that case we need to 
handle 2 cases - one where we successfully find the header of the next 
(concatenated) GZIP stream and the other where we don't find a valid GZIP 
stream header. In the first case where we do find a header successfully, then 
we continue reading the stream as usual and return the uncompressed data from 
the "read()" call. In the case where we fail to find a valid header and yet 
there were bytes past the previous GZIP stream, then I think the 
GZIPInputStream.read() should throw an IOException, since that stream no longer 
represents a GZIP stream (remember, we are talking about this only when the 
GZIPInputStream was constructed with the new boolean parameter = true).

Coming to the case where the GZIPInputStream was constructed using boolean = 
false - In this case when we reach and read the trailer of a GZIP stream, if 
there is no more bytes, then we consider this a completed GZIP stream and 
return the uncompressed data. If there is any more bytes past the GZIP stream's 
trailer, then I think we should throw a IOException (since we aren't expecting 
any concatenated GZIP stream).

As for the default value of this optional boolean parameter, I think it should 
default to "true" implying it will read any concatenated GZIP streams. That 
would match the current implementation of GZIPInputStream.read() which has the 
ability to read (valid) GZIP concatenated input stream.

I think this would then allow us to keep the implementation simple as well as 
allow the calling application to have control over whether or not the passed 
should be considered as having a concatenated GZIP stream.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/18385#issuecomment-2228129529

Re: RFR: 8322256: Define and document GZIPInputStream concatenated stream semantics [v5]

Reply via email to