On Sat, 27 Jul 2024 15:00:51 GMT, Archie Cobbs <aco...@openjdk.org> wrote:

>> `GZIPInputStream` supports reading data from multiple concatenated GZIP data 
>> streams since [JDK-4691425](https://bugs.openjdk.org/browse/JDK-4691425). In 
>> order to do this, after a GZIP trailer frame is read, it attempts to read a 
>> GZIP header frame and, if successful, proceeds onward to decompress the new 
>> stream. If the attempt to decode a GZIP header frame fails, or happens to 
>> trigger an `IOException`, it just ignores the trailing garbage and/or the 
>> `IOException` and returns EOF.
>> 
>> There are several issues with this:
>> 
>> 1. The behaviors of (a) supporting concatenated streams and (b) ignoring 
>> trailing garbage are not documented, much less precisely specified.
>> 2. Ignoring trailing garbage is dubious because it could easily hide errors 
>> or other data corruption that an application would rather be notified about. 
>> Moreover, the API claims that a `ZipException` will be thrown when corrupt 
>> data is read, but obviously that doesn't happen in the trailing garbage 
>> scenario (so N concatenated streams where the last one has a corrupted 
>> header frame is indistinguishable from N-1 valid streams).
>> 3. There's no way to create a `GZIPInputStream` that does _not_ support 
>> stream concatenation.
>> 
>> On the other hand, `GZIPInputStream` is an old class with lots of existing 
>> usage, so it's important to preserve the existing behavior, warts and all 
>> (note: my the definition of "existing behavior" here includes the bug fix in 
>> [JDK-7036144](https://bugs.openjdk.org/browse/JDK-7036144)).
>> 
>> So this patch adds a new constructor that takes two new parameters 
>> `allowConcatenation` and `allowTrailingGarbage`. The legacy behavior is 
>> enabled by setting both to true; otherwise, they do what they sound like. In 
>> particular, when `allowTrailingGarbage` is false, then the underlying input 
>> must contain exactly one (if `allowConcatenation` is false) or exactly N (if 
>> `allowConcatenation` is true) concatenated GZIP data streams, otherwise an 
>> exception is guaranteed.
>
> Archie Cobbs has updated the pull request incrementally with two additional 
> commits since the last revision:
> 
>  - Refactor to eliminate "lenient" mode (still failng test: GZIPInZip.java).
>  - Add GZIP input test for various gzip(1) command outputs.

> > With this new proposed change we will throw an IOException (unlike 
> > previously). I think that's fine and in fact the correct thing to do.
> 
> I've run into a problem.
> 
> With this change, `GZIPInZip.java` fails; the test is annotated like this:
> 
> ```java
> /* @test
>  * @bug 7021870 8023431 8026756
>  * @summary Reading last gzip chain member must not close the input stream.
>  *          Garbage following gzip entry must be ignored.
>  */
> ```
> 
> Note the "Garbage following gzip entry must be ignored" line. This must be 
> what bugs 8023431 8026756 refer to, but those bugs are not accessible ("You 
> can't view this issue. It may have been deleted or you don't have permission 
> to view it.") so I can't see what the problem was that required ignoring the 
> extra garbage. Maybe you guys have access or know what this was about? 
> Thanks. See also [this stackoverflow 
> question](https://stackoverflow.com/questions/4928560/how-can-i-work-with-gzip-files-which-contain-extra-data)
>  - it seems to be a thing with ZIP files.


I don't know the full history,  but 
[JDK-4691425](https://bugs.openjdk.org/browse/JDK-4691425) added support for 
concatenated gzip files and also the code for ignoring extra trailing garbage 
bytes.

As part of the fix for  
[JDK-8026756](https://bugs.openjdk.org/browse/JDK-8026756) a test was added for 
validating that trailing garbage  bytes after the end header are ignored,  but 
was not mentioned in the bug.

In [gzip manual](https://www.gnu.org/software/gzip/manual/gzip.html):

> 6 Using gzip on tapes
> 
> When writing compressed data to a tape, it is generally necessary to pad the 
> output with zeroes up to a block > boundary. When the data is read and the 
> whole block is passed to gunzip for decompression, gunzip detects
 > that there is extra trailing garbage after the compressed data and emits a 
 > warning by default if the garbage
 > contains nonzero bytes. You can use the --quiet option to suppress the 
 > warning.

Based on the above and the test that was added via 
[JDK-8026756](https://bugs.openjdk.org/browse/JDK-8026756),  the existing 
behavior for the handling of extra garbage bytes should be left as is.

So where does that leave us:

- Keep the code as is and document the current behavior
- Continue to add additional test coverage for the current API
- We probably do not need a new constructor given it probably adds no new 
additional value the existing API

-------------

PR Comment: https://git.openjdk.org/jdk/pull/18385#issuecomment-2255888787

Reply via email to