Re: RFR: 8322256: Define and document GZIPInputStream concatenated stream semantics [v7]

Archie Cobbs Fri, 26 Jul 2024 08:48:08 -0700

On Fri, 26 Jul 2024 13:36:43 GMT, Lance Andersen <lan...@openjdk.org> wrote:


>> Archie Cobbs has updated the pull request with a new target base due to a 
>> merge or a rebase. The pull request now contains 16 commits:
>> 
>>  - Merge branch 'master' into JDK-8322256
>>  - Wording tweak.
>>  - Change concatenation policy config from booleans to new ConcatPolicy enum.
>>  - Merge branch 'master' into JDK-8322256
>>  - Bump @since from 23 to 24.
>>  - Merge branch 'master' into JDK-8322256
>>  - Relabel "trailing garbage" as "extra bytes" to sound less accusatory.
>>  - Simplify code by eliminating an impossible case.
>>  - Field name change and Javadoc wording tweaks.
>>  - Merge branch 'master' into JDK-8322256
>>  - ... and 6 more: https://git.openjdk.org/jdk/compare/332df83e...72b60092
>
> First thank you for your efforts and patience with this PR Archie.
> 
> I am trying to better grasp the problem to be solved outside of better 
> clarity surrounding how GZIPInputStream handles concatenated gzip files
> 
> - I could see possibly  providing an enum which indicates to stop 
> decompressing after processing the first gzip file or to decompress until the 
> end of input.  
> - This would be similar to what  the commons-compress 
> GzipCompressorInputStream class does.
> - It is also what Eirik and Jai are suggesting I believe.  
> - If support for processing multiple gzip files is enabled, which would be 
> the default, then the existing behavior is not changed.  
> - As I have mentioned previously, we should add more testing including taking 
> a couple of concatenated examples created via the gzip tool (or equivalent) 
> give how light our coverage is.
> 
> I don't believe we want to do more than is needed given the age of the API 
> and there is been minimal requests for improvements at this time.

Hi @LanceAndersen,

Thanks for taking a look. Let's achieve consensus on what the goal(s) should be 
here, and then I think a proper API for that will come naturally. Up until now 
the discussion has been a shifting blend of API design and API goals which 
makes things a little bit chaotic (probably my fault).

**Proposed Goal №1:**

This one seems relatively non-controversial: Allow creation of a 
`GZIPInputStream` that only decompresses one GZIP stream.

OK but what does that mean exactly?
* Option A: The `GZIPInputStream` will never read past the end of the GZIP 
trailer frame
  * Sounds great, but how exactly would that work?
    * Option A.1: Caller must provide a `BufferedInputStream`, so we can back 
it up if any extra bytes have been read past the trailer frame
    * Option A.2: `GZIPInputStream` provides a way for caller to access the 
extra bytes read once EOF is detected
    * Option A.3: `GZIPInputStream` never buffers more than 8 bytes (size of 
trailer frame)
* Option B: `GZIPInputStream` attempts to read the byte after the trailer frame 
and throws `IOException` if EOF is not returned

My opinion: All of the A options are unsatisfactory (although A.1 is the least 
bad), so it seems option B is the answer here.

**Proposed Goal №2:**

This one seems more controversial: Allow the caller to configure 
`GZIPInputStream` for "strict mode".

Just to be clear and restate what this is referring to. Currently, 
`GZIPInputStream` is hard-wired for "lenient mode". This means that it is 
indeterminate (a) how many additional bytes (if any) were read beyond the GZIP 
trailer frame, and (b) whether reading stopped due to EOF, invalid data, or an 
underlying`IOException`.

My opinion: this is, at best, a random intermittent bug - **and at worst a 
giant security hole** - waiting to happen.

Reasoning: If the caller is OK with lenient mode, then the caller obviously 
doesn't (or shouldn't) care about the data that follows the input stream (if 
any), because the caller won't even know whether there was any, how much of it 
has been discarded (if any), or whether it was actually another valid GZIP 
stream that got corrupted or threw an`IOException`.

So why would any caller actually want to use this feature to concatenate 
anything useful after a GZIP trailer frame?

In particular: if an underlying `IOException` occurs at just the wrong time, an 
application that wrote `GZIP-STREAM-1`, `GZIP-STREAM-2`, `GZIP-STREAM-3` could 
read back `GZIP-STREAM-1`, `GZIP-STREAM-2` and never know that anything was 
wrong.

How useful is concatenated GZIP stream support if it's not even reliable??

This just seems completely wonky to me (and oh, did I mention it's a potential 
security hole waiting to happen? :)

It seems like if concatenation is going to be a supported feature, then it 
should be possible to access a _reliable_ version of it. In other words, there 
should be a way for an application to say "I want to decode exactly ONE (or 
exactly N) whole GZIP streams _or else guarantee me an exception_".

Maybe the answer is simply to tell people "Don't use this class, it's not 
reliable". I'd be OK with that I guess but it seems a bit of a cop out.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/18385#issuecomment-2253031572

Re: RFR: 8322256: Define and document GZIPInputStream concatenated stream semantics [v7]

Reply via email to