Re: RFR: 8322256: Define and document GZIPInputStream concatenated stream semantics [v7]

Jaikiran Pai Fri, 26 Jul 2024 09:15:22 -0700

On Wed, 24 Jul 2024 14:57:05 GMT, Archie Cobbs <aco...@openjdk.org> wrote:


>> `GZIPInputStream` supports reading data from multiple concatenated GZIP data 
>> streams since [JDK-4691425](https://bugs.openjdk.org/browse/JDK-4691425). In 
>> order to do this, after a GZIP trailer frame is read, it attempts to read a 
>> GZIP header frame and, if successful, proceeds onward to decompress the new 
>> stream. If the attempt to decode a GZIP header frame fails, or happens to 
>> trigger an `IOException`, it just ignores the trailing garbage and/or the 
>> `IOException` and returns EOF.
>> 
>> There are several issues with this:
>> 
>> 1. The behaviors of (a) supporting concatenated streams and (b) ignoring 
>> trailing garbage are not documented, much less precisely specified.
>> 2. Ignoring trailing garbage is dubious because it could easily hide errors 
>> or other data corruption that an application would rather be notified about. 
>> Moreover, the API claims that a `ZipException` will be thrown when corrupt 
>> data is read, but obviously that doesn't happen in the trailing garbage 
>> scenario (so N concatenated streams where the last one has a corrupted 
>> header frame is indistinguishable from N-1 valid streams).
>> 3. There's no way to create a `GZIPInputStream` that does _not_ support 
>> stream concatenation.
>> 
>> On the other hand, `GZIPInputStream` is an old class with lots of existing 
>> usage, so it's important to preserve the existing behavior, warts and all 
>> (note: my the definition of "existing behavior" here includes the bug fix in 
>> [JDK-7036144](https://bugs.openjdk.org/browse/JDK-7036144)).
>> 
>> So this patch adds a new constructor that takes two new parameters 
>> `allowConcatenation` and `allowTrailingGarbage`. The legacy behavior is 
>> enabled by setting both to true; otherwise, they do what they sound like. In 
>> particular, when `allowTrailingGarbage` is false, then the underlying input 
>> must contain exactly one (if `allowConcatenation` is false) or exactly N (if 
>> `allowConcatenation` is true) concatenated GZIP data streams, otherwise an 
>> exception is guaranteed.
>
> Archie Cobbs has updated the pull request with a new target base due to a 
> merge or a rebase. The pull request now contains 16 commits:
> 
>  - Merge branch 'master' into JDK-8322256
>  - Wording tweak.
>  - Change concatenation policy config from booleans to new ConcatPolicy enum.
>  - Merge branch 'master' into JDK-8322256
>  - Bump @since from 23 to 24.
>  - Merge branch 'master' into JDK-8322256
>  - Relabel "trailing garbage" as "extra bytes" to sound less accusatory.
>  - Simplify code by eliminating an impossible case.
>  - Field name change and Javadoc wording tweaks.
>  - Merge branch 'master' into JDK-8322256
>  - ... and 6 more: https://git.openjdk.org/jdk/compare/332df83e...72b60092

Hello Archie, I don't think we need any enum in the API. GZIPInputStream class 
should only allow parsing GZIP streams. The InputStream passed to the 
GZIPInputStream constructor must be a InputStream which has either 1 GZIP 
stream or N GZIP streams (concatenated). The user should be allowed to state 
whether the passed InputStream is expected contain more than 1 GZIP stream. 
This is the only addition that I think we should be introducing. 

The way to allow users to state that the InputStream being passed is allowed to 
have more than 1 GZIP stream is through the new proposed boolean to the 
constructor. The boolean can be named "allowConcatenated".

When allowConcatenated is true, then during the read() operation on the 
GZIPInputStream, when we internally read a GZIP stream trailer, we should check 
if the underlying InputStream has a next GZIP stream header:
        - If there's no bytes present after the trailer, we just return 
whatever data we have inflated so far and that also would be the end-of-stream 
of the GZIPInputStream. 
        - If there's bytes after the trailer, then those next bytes MUST be the 
next GZIP stream header. If those bytes aren't a GZIP stream header, then we 
throw an IOException. This implies whatever data we had deflated and any 
additional bytes that we had read after the trailer will not be returned to the 
user. That's fine because the underlying InputStream was "corrupt" - the 
GZIPInputStream shouldn't expect any other data other than GZIP stream(s) in 
the underlying InputStream.

When allowConcatenated is false, then during the read() operation on the 
GZIPInputStream, when we internally read a GZIP stream trailer, we should check 
if the underlying InputStream has any more bytes remaining:
        - If there are bytes remaining then we consider that an error and raise 
an IOException. We do this because we were instructed not to allow concatenated 
GZIP streams. So presence of any data after the first GZIP stream trailer 
should be an exception.
        - If there are no more bytes remaining then we have reached the 
end-of-stream and we return back whatever GZIP data we have inflated during the 
read operation and that also would be the end-of-stream of the GZIPStream 
instance.

The default value for allowConcatenated should be true to allow for supporting 
concatenated GZIP streams (like we do now). There however will be a change in 
behaviour if the stream content after the trailer isn't a GZIP stream. With 
this new proposed change we will throw an IOException (unlike previously). I 
think that's fine and in fact the correct thing to do. It of course will need a 
CSR and a release note to note this change.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/18385#issuecomment-2253075317

Re: RFR: 8322256: Define and document GZIPInputStream concatenated stream semantics [v7]

Reply via email to