Re: RFR: 8282648: Problems due to conflicting specification of Inflater::inflate(..) and InflaterInputStream::read(..) [v10]

Mark Reinhold Thu, 21 Jul 2022 10:49:33 -0700

On Mon, 9 May 2022 09:56:19 GMT, Volker Simonis <simo...@openjdk.org> wrote:


>> Add an API note to `InflaterInputStream::read(byte[] b, int off, int len)` 
>> to highlight that it might  write more bytes than the returned number of 
>> inflated bytes into the buffer `b`.
>> 
>> The superclass `java.io.InputStream` specifies that `read(byte[] b, int off, 
>> int len)` will leave the content beyond the last read byte in the read 
>> buffer `b` unaffected. However, the overridden `read` method in 
>> `InflaterInputStream` passes the read buffer `b` to 
>> `Inflater::inflate(byte[] b, int off, int len)` which doesn't provide this 
>> guarantee. Depending on implementation details, `Inflater::inflate` might 
>> write more than the returned number of inflated bytes into the buffer `b`.
>> 
>> ### TL;DR
>> 
>> `java.util.zip.Inflater` is the Java wrapper class for zlib's inflater 
>> functionality. `Inflater::inflate(byte[] output, int off, int len)` 
>> currently calls zlib's native `inflate(..)` function and passes the address 
>> of `output[off]` and `len` to it via JNI.
>> 
>> The specification of zlib's `inflate(..)` function (i.e. the [API 
>> documentation in the original zlib 
>> implementation](https://github.com/madler/zlib/blob/cacf7f1d4e3d44d871b605da3b647f07d718623f/zlib.h#L400))
>>  doesn't give any guarantees with regard to usage of the output buffer. It 
>> only states that upon completion the function will return the number of 
>> bytes that have been written (i.e. "inflated") into the output buffer.
>> 
>> The original zlib implementation only wrote as many bytes into the output 
>> buffer as it inflated. However, this is not a hard requirement and newer, 
>> more performant implementations of the zlib library like 
>> [zlib-chromium](https://chromium.googlesource.com/chromium/src/third_party/zlib/)
>>  or [zlib-cloudflare](https://github.com/cloudflare/zlib) can use more bytes 
>> of the output buffer than they actually inflate as a scratch buffer. See 
>> https://github.com/simonis/zlib-chromium for a more detailed description of 
>> their approach and its performance benefit.
>> 
>> These new zlib versions can still be used transparently from Java (e.g. by 
>> putting them into the `LD_LIBRARY_PATH` or by using `LD_PRELOAD`), because 
>> they still fully comply to specification of `Inflater::inflate(..)`. 
>> However, we might run into problems when using the `Inflater` functionality 
>> from the `InflaterInputStream` class. `InflaterInputStream` is derived from 
>> from `InputStream` and as such, its `read(byte[] b, int off, int len)` 
>> method is quite constrained. It specifically specifies that if *k* bytes 
>> have been read, then "these bytes will be stored in elements `b[off]` 
>> through `b[off+`*k*`-1]`, leaving elements `b[off+`*k*`]` through 
>> `b[off+len-1]` **unaffected**". But `InflaterInputStream::read(byte[] b, int 
>> off, int len)` (which is constrained by `InputStream::read(..)`'s 
>> specification) calls `Inflater::inflate(byte[] b, int off, int len)` and 
>> directly passes its output buffer down to the native zlib `inflate(..)` 
>> method which is free to change the bytes beyond `b[off+`
 *k*`]` (where *k* is the number of inflated bytes).
>> 
>> From a practical point of view, I don't see this as a big problem, because 
>> callers of `InflaterInputStream::read(byte[] b, int off, int len)` can never 
>> know how many bytes will be written into the output buffer `b` (and in fact 
>> its content can always be completely overwritten). It therefore makes no 
>> sense to depend on any data there being untouched after the call. Also, 
>> having used zlib-cloudflare productively for about two years, we haven't 
>> seen real-world issues because of this behavior yet. However, from a 
>> specification point of view it is easy to artificially construct a program 
>> which violates `InflaterInputStream::read(..)`'s postcondition if using one 
>> of the alterantive zlib implementations. A recently integrated JTreg test 
>> (test/jdk/jdk/nio/zipfs/ZipFSOutputStreamTest.java) "unintentionally" fails 
>> with zlib-chromium but can fixed easily to run with alternative 
>> implementations as well (see 
>> [JDK-8283756](https://bugs.openjdk.java.net/browse/JDK-8283756)).
>
> Volker Simonis has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Updated wording based on @JoeDarcy's third CSR review

First, I’ll observe that there is no specification conflict here.  The 
specifications of `InflaterInputStream::read(byte[] b, int off, int len)` and 
`Inflater::inflate(byte[] b, int off, int len)` are not in conflict at all.  
The latter is simply weaker than the former.

What is true is that if you build the JDK in the default manner, without any 
patches, then what you wind up with is an `InflaterInputStream::read(...)` 
implementation that satisfies the inherited `InputStream::read(...)` 
specification, including the constraint that a read operation must not 
_scribble_ in the caller’s output buffer, _i.e._, modify any byte in that 
buffer beyond the returned byte count.  (That constraint is, so far as we know, 
satisfied by every `InputStream` subclass defined in the JDK.)

A default JDK build’s `InflaterInputStream::read(...)` method doesn’t scribble 
because

  - The implementation of `InflaterInputStream::read(...)` assumes that the 
implementation of `Inflater::inflate(...)` doesn’t scribble, and

  - The implementation of `Inflater::inflate(...)` satisfies that assumption 
because it invokes the corresponding function in the [standard native Zip 
implementation](https://zlib.net/), which itself doesn’t scribble.

So, not only does the specification forbid `InflaterInputStream::read(...)` 
from scribbling, but in the official Java SE Reference Implementation — which 
is simply a default JDK build — that method does not scribble.  (Perhaps by 
accident, but that is immaterial here.)

Therefore, if in your own JDK build you substitute an alternative Zip 
implementation whose `inflate` function does scribble then you wind up with a 
JDK that both violates the specification and behaves differently than the RI.  
It is incompatible.  If you use such a JDK in production then you risk breaking 
existing code.

Now, compatibility is one of the key values of the Java Platform, but 
performance is not unimportant.  Is there something we can do to accommodate 
[alternative, faster, scribbling Zip implementations][faster-zlib]?

@simonis lists three options in the [issue].  I’ll cast the options somewhat 
differently.

  1. Do not change the specification in any way.  The specification text that 
forbids scribbling dates back to Java 1.1, released over 25 years ago.  
(Arguably that text over-specifies the `InputStream::read(...)` method, but 
that is also immaterial.)  It’s all too possible that existing code — no matter 
how odd or poorly written we think it might be — depends upon this constraint.

     With this option, a JDK build that uses a scribbling Zip implementation 
would have to buffer that implementation’s output and copy only the returned 
number of bytes into the caller’s buffer.  This would degrade any performance 
benefit.

  2. Weaken the specification of `InflaterInputStream` — and, thereby, that of 
its subclasses `GZipInputStream` and `ZipInputStream` — to allow scribbling.  
This is essentially @simonis’s initial suggestion, as embodied in [63ae357] 
earlier in this PR.

     This option risks breaking existing code that depends upon the 25-year-old 
promise that `InflaterInputStream::read(...)` will not scribble.  Changing the 
specification of an indirect subclass of `InputStream`, moreover, is apt to go 
unnoticed by many developers, who routinely deal with instances of 
`InputStream` without any knowledge of those instances’ concrete classes.  
(@jaikiran gives one example above.)

  3. Weaken the specification of the `InputStream::read(...)` method to allow 
scribbling, as [suggested by @jddarcy in the CSR][jddarcy].

     This option is at least as risky to existing code as the second option, 
though it’s marginally more likely that developers writing new code will notice 
the specification change.

     What’s worse, however, is that this option gives developers writing new 
subclasses of `InputStream` permission to deviate from both that class’s 
longstanding specification and the actual behavior of every `InputStream` 
subclass defined in the JDK.  In the long term, that could wreak subtle bugs 
throughout the ecosystem, even breaking code that does absolutely nothing with 
Zip-related input streams.

In favor of both the second and third options is @simonis’s statement above 
that his employer (Amazon) has used JDK builds with a scribbling Zip 
implementation in production for two years.  He has further elaborated 
(privately) that this includes hundreds of services built with many of the 
usual popular frameworks and libraries (Spring, Hibernate, ASM, etc.).  This 
suggests that the practical impact, out in the wild, of allowing scribbling Zip 
implementations is acceptable.

On that basis I am — just barely — comfortable with the second option, _i.e._, 
weakening the specification of `InflaterInputStream` to allow scribbling, 
thereby contradicting the specification of its `InputStream` superclass.

Explicitly specifying a subclass in a way that’s inconsistent with the 
specification of one of its superclasses is generally a really bad idea.  It 
violates the [Liskov Substitution Principle][lsp], which enables instances of a 
subclass to be substituted freely for instances of one of its superclasses.  
Thus we do not condone such changes lightly.  There are, however, precedents in 
a handful of special cases in the Java Platform; _e.g._, in the 
[`java.util.IdentityHashMap`][ihm] and [`java.sql.Timestamp`][ts] classes.

The third option above would not violate the Liskov substitution principle, but 
in my view could prove more harmful in the long run.

Recommendations:

  - Revert this PR back to [63ae357], and make corresponding updates to the 
[issue] and the [CSR].

  - Additionally, find every method in every `java.*` class that’s declared to 
return an `InputStream` (_e.g._, `ZipFile::getInputStream(...)` but actually 
returns an instance of `InflaterInputStream` or one of its subclasses.  Add an 
API note to each such method to let developers know that the returned input 
stream might scribble, per @jaikiran’s suggestion above.  These notes can all 
use the same text, linking to the revised `InflaterInputStream::read(...)` 
specification.

  - Revise the summary of this PR, the issue, and the CSR to reflect the true 
nature of the proposed change rather than assert a nonexistent conflict.  
Consider “Weaken the `InflaterInputStream` specification in order to allow 
faster Zip implementations”.

  - Change the issue type from a bug — which it’s not — to an enhancement.

  - Create a related release-note issue, as @RogerRiggs suggests above.


[faster-zlib]: 
https://github.com/simonis/zlib-chromium#inflater-improvements-in-zlib-chromium
[issue]: https://bugs.openjdk.org/browse/JDK-8282648
[63ae357]: 
https://github.com/openjdk/jdk/pull/7986/commits/63ae3572da674a0926bb9d0adcea88d89bb56707
[jddarcy]: 
https://bugs.openjdk.org/browse/JDK-8283758?focusedCommentId=14492930&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14492930
[CSR]: https://bugs.openjdk.org/browse/JDK-8283758
[lsp]: https://en.wikipedia.org/wiki/Liskov_substitution_principle
[ihm]: 
https://docs.oracle.com/en/java/javase/18/docs/api/java.base/java/util/IdentityHashMap.html
[ts]: 
https://docs.oracle.com/en/java/javase/18/docs/api/java.sql/java/sql/Timestamp.html

-------------

PR: https://git.openjdk.org/jdk/pull/7986

Re: RFR: 8282648: Problems due to conflicting specification of Inflater::inflate(..) and InflaterInputStream::read(..) [v10]

Reply via email to