Re: RFR: 8322256: Define and document GZIPInputStream concatenated stream semantics [v10]

Eirik Bjørsnøs Thu, 29 Aug 2024 06:59:27 -0700

On Wed, 28 Aug 2024 21:56:50 GMT, Archie Cobbs <aco...@openjdk.org> wrote:


>> `GZIPInputStream` supports reading data from multiple concatenated GZIP data 
>> streams since [JDK-4691425](https://bugs.openjdk.org/browse/JDK-4691425). In 
>> order to do this, after a GZIP trailer frame is read, it attempts to read a 
>> GZIP header frame and, if successful, proceeds onward to decompress the new 
>> stream. If the attempt to decode a GZIP header frame fails, or happens to 
>> trigger an `IOException`, it just ignores the trailing garbage and/or the 
>> `IOException` and returns EOF.
>> 
>> There are several issues with this:
>> 
>> 1. The behaviors of (a) supporting concatenated streams and (b) ignoring 
>> trailing garbage are not documented, much less precisely specified.
>> 2. Ignoring trailing garbage is dubious because it could easily hide errors 
>> or other data corruption that an application would rather be notified about. 
>> Moreover, the API claims that a `ZipException` will be thrown when corrupt 
>> data is read, but obviously that doesn't happen in the trailing garbage 
>> scenario (so N concatenated streams where the last one has a corrupted 
>> header frame is indistinguishable from N-1 valid streams).
>> 3. There's no way to create a `GZIPInputStream` that does _not_ support 
>> stream concatenation.
>> 
>> On the other hand, `GZIPInputStream` is an old class with lots of existing 
>> usage, so it's important to preserve the existing behavior, warts and all 
>> (note: my the definition of "existing behavior" here includes the bug fix in 
>> [JDK-7036144](https://bugs.openjdk.org/browse/JDK-7036144)).
>> 
>> So this patch adds a new constructor that takes two new parameters 
>> `allowConcatenation` and `allowTrailingGarbage`. The legacy behavior is 
>> enabled by setting both to true; otherwise, they do what they sound like. In 
>> particular, when `allowTrailingGarbage` is false, then the underlying input 
>> must contain exactly one (if `allowConcatenation` is false) or exactly N (if 
>> `allowConcatenation` is true) concatenated GZIP data streams, otherwise an 
>> exception is guaranteed.
>
> Archie Cobbs has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Shorten the description of concatenation behavior per review comments.

I think the `testScenario` method would be easier to read if we dropped the 
assigments to the `input` and `output` variables, and if the `testDecomp` 
method was split into separate methods for expecting a successful vs. failing 
deompression. See comments for details.

test/jdk/java/util/zip/GZIP/GZIPInputStreamConcat.java line 48:

> 46:         // Test concat vs. non-concat, garbage vs. no-garbage, and 
> various buffer sizes on random data
> 47:         Random random = new Random();
> 48:         final ArrayList<List<Object>> scenarios = new ArrayList<>();

Suggestion:

        List<List<Object>> scenarios = new ArrayList<>();

test/jdk/java/util/zip/GZIP/GZIPInputStreamConcat.java line 49:

> 47:         Random random = new Random();
> 48:         final ArrayList<List<Object>> scenarios = new ArrayList<>();
> 49:         for (int size = 1; size < 1024; size += random.nextInt(32) + 1) {

It's not obviously clear that `size` is the buffer size passed to 
GZIPInputStream here and not related to the size of the uncompressed input data.

Perhaps rename this to `bufferSize` and/or document the parameters via a 
comment?

`// Parameters: byte[] data, int bufferSize`

test/jdk/java/util/zip/GZIP/GZIPInputStreamConcat.java line 69:

> 67:         // Decompress a single stream with no extra garbage - should 
> always work
> 68:         byte[] input = compressed;
> 69:         byte[] output = uncompressed;

The assignment of `input` and `output` variables in this method add a lot of 
noise. Could this be simplified to something like this?


        // Compress the test data
        byte[] compressed = deflate(uncompressed);

        // Decompress a single stream with no extra garbage - should always work
        testDecomp(compressed, uncompressed);

        // Decompress a truncated GZIP header
        testDecompFail(oneByteShort(gzipHeader()), EOFException.class);

        // Decompress a single stream that is one byte short - should always 
fail
        testDecompFail(oneByteShort(compressed), EOFException.class);

        // Decompress a single stream with one byte of extra garbage (trying 
all 256 possible values)
        for (int extra = 0; extra < 0x100; extra++) {
            testDecomp(oneByteLong(compressed, extra), uncompressed);
        }

        // Decompress a single stream followed by a truncated GZIP header
        testDecomp(concat(compressed, oneByteShort(gzipHeader())), 
uncompressed);

        // Decompress a single stream followed by another stream that is one 
byte short
        testDecompFail(concat(compressed, oneByteShort(compressed)), 
IOException.class);

        // Decompress two streams concatenated
        testDecomp(concat(compressed, compressed), concat(uncompressed, 
uncompressed));

        // Decompress three streams concatenated
        testDecomp(concat(compressed, compressed, compressed),
                concat(uncompressed, uncompressed, uncompressed));

        // Decompress three streams concatenated followed by a truncated GZIP 
header
        testDecomp(concat(compressed, compressed, compressed, 
oneByteShort(gzipHeader())),
                concat(uncompressed, uncompressed, uncompressed));

test/jdk/java/util/zip/GZIP/GZIPInputStreamConcat.java line 142:

> 140:         System.arraycopy(array, 0, array2, 0, array.length);
> 141:         array2[array.length] = (byte)value;
> 142:         return array2;

Would it make sense to do the following here?

Suggestion:

        return concat(array, new byte[] {(byte) value});

test/jdk/java/util/zip/GZIP/GZIPInputStreamConcat.java line 167:

> 165: 
> 166:     // GZIP compress data
> 167:     public static byte[] deflate(byte[] data) throws IOException {

The returned data is in the GZIP format, not the DEFLATE format. So could be 
`gzip` rather than `deflate`.

test/jdk/java/util/zip/GZIP/GZIPInputStreamConcat.java line 176:

> 174: 
> 175:     // GZIP decompress data
> 176:     public byte[] inflate(InputStream in) throws IOException {

The stream here is in the GZIP format, not the DEFLATE format. So could be 
`gunzip` rather than `inflate`.

-------------

PR Review: https://git.openjdk.org/jdk/pull/18385#pullrequestreview-2268596523
PR Review Comment: https://git.openjdk.org/jdk/pull/18385#discussion_r1736128032
PR Review Comment: https://git.openjdk.org/jdk/pull/18385#discussion_r1736160530
PR Review Comment: https://git.openjdk.org/jdk/pull/18385#discussion_r1736229023
PR Review Comment: https://git.openjdk.org/jdk/pull/18385#discussion_r1736125278
PR Review Comment: https://git.openjdk.org/jdk/pull/18385#discussion_r1736119267
PR Review Comment: https://git.openjdk.org/jdk/pull/18385#discussion_r1736120221

Re: RFR: 8322256: Define and document GZIPInputStream concatenated stream semantics [v10]

Reply via email to