Re: RFR: 8282648: Weaken the InflaterInputStream specification in order to allow faster Zip implementations [v11]

Mark Reinhold Thu, 25 Aug 2022 13:54:31 -0700

On Wed, 27 Jul 2022 10:47:44 GMT, Volker Simonis <simo...@openjdk.org> wrote:


>> Add an API note to `InflaterInputStream::read(byte[] b, int off, int len)` 
>> to highlight that it might  write more bytes than the returned number of 
>> inflated bytes into the buffer `b`.
>> 
>> The superclass `java.io.InputStream` specifies that `read(byte[] b, int off, 
>> int len)` will leave the content beyond the last read byte in the read 
>> buffer `b` unaffected. However, the overridden `read` method in 
>> `InflaterInputStream` passes the read buffer `b` to 
>> `Inflater::inflate(byte[] b, int off, int len)` which doesn't provide this 
>> guarantee. Depending on implementation details, `Inflater::inflate` might 
>> write more than the returned number of inflated bytes into the buffer `b`.
>> 
>> ### TL;DR
>> 
>> `java.util.zip.Inflater` is the Java wrapper class for zlib's inflater 
>> functionality. `Inflater::inflate(byte[] output, int off, int len)` 
>> currently calls zlib's native `inflate(..)` function and passes the address 
>> of `output[off]` and `len` to it via JNI.
>> 
>> The specification of zlib's `inflate(..)` function (i.e. the [API 
>> documentation in the original zlib 
>> implementation](https://github.com/madler/zlib/blob/cacf7f1d4e3d44d871b605da3b647f07d718623f/zlib.h#L400))
>>  doesn't give any guarantees with regard to usage of the output buffer. It 
>> only states that upon completion the function will return the number of 
>> bytes that have been written (i.e. "inflated") into the output buffer.
>> 
>> The original zlib implementation only wrote as many bytes into the output 
>> buffer as it inflated. However, this is not a hard requirement and newer, 
>> more performant implementations of the zlib library like 
>> [zlib-chromium](https://chromium.googlesource.com/chromium/src/third_party/zlib/)
>>  or [zlib-cloudflare](https://github.com/cloudflare/zlib) can use more bytes 
>> of the output buffer than they actually inflate as a scratch buffer. See 
>> https://github.com/simonis/zlib-chromium for a more detailed description of 
>> their approach and its performance benefit.
>> 
>> These new zlib versions can still be used transparently from Java (e.g. by 
>> putting them into the `LD_LIBRARY_PATH` or by using `LD_PRELOAD`), because 
>> they still fully comply to specification of `Inflater::inflate(..)`. 
>> However, we might run into problems when using the `Inflater` functionality 
>> from the `InflaterInputStream` class. `InflaterInputStream` is derived from 
>> from `InputStream` and as such, its `read(byte[] b, int off, int len)` 
>> method is quite constrained. It specifically specifies that if *k* bytes 
>> have been read, then "these bytes will be stored in elements `b[off]` 
>> through `b[off+`*k*`-1]`, leaving elements `b[off+`*k*`]` through 
>> `b[off+len-1]` **unaffected**". But `InflaterInputStream::read(byte[] b, int 
>> off, int len)` (which is constrained by `InputStream::read(..)`'s 
>> specification) calls `Inflater::inflate(byte[] b, int off, int len)` and 
>> directly passes its output buffer down to the native zlib `inflate(..)` 
>> method which is free to change the bytes beyond `b[off+`
 *k*`]` (where *k* is the number of inflated bytes).
>> 
>> From a practical point of view, I don't see this as a big problem, because 
>> callers of `InflaterInputStream::read(byte[] b, int off, int len)` can never 
>> know how many bytes will be written into the output buffer `b` (and in fact 
>> its content can always be completely overwritten). It therefore makes no 
>> sense to depend on any data there being untouched after the call. Also, 
>> having used zlib-cloudflare productively for about two years, we haven't 
>> seen real-world issues because of this behavior yet. However, from a 
>> specification point of view it is easy to artificially construct a program 
>> which violates `InflaterInputStream::read(..)`'s postcondition if using one 
>> of the alterantive zlib implementations. A recently integrated JTreg test 
>> (test/jdk/jdk/nio/zipfs/ZipFSOutputStreamTest.java) "unintentionally" fails 
>> with zlib-chromium but can fixed easily to run with alternative 
>> implementations as well (see 
>> [JDK-8283756](https://bugs.openjdk.java.net/browse/JDK-8283756)).
>
> Volker Simonis has updated the pull request with a new target base due to a 
> merge or a rebase. The pull request now contains eight commits:
> 
>  - Add apiNote to methods which return InflaterInputStreams as InputStreams
>  - Updated wording based on @JoeDarcy's CSR review
>  - Updated wording based on @LanceAndersen's review
>  - Adapted wording based on @AlanBateman's review, removed implementation 
> note on ZipFile::getInputStream and aligned wording for all ::read methods
>  - Adapted wording and copyrights based on @jaikiran review
>  - Added API-refinement to GZIPInputStream::read()/ZipInputStream::read() and 
> an Implementation note to ZipFile::getInputStream()
>  - Extended API-doc based on reviewer feedback
>  - 8282648: Weaken the InflaterInputStream specification in order to allow 
> faster Zip implementations

Please update the CSR once you’ve finalized the specification changes.

src/java.base/share/classes/java/net/URLConnection.java line 848:

> 846:      * last inflated byte undefined after a read operation (see {@link
> 847:      * java.util.zip.InflaterInputStream#read(byte[], int, int)
> 848:      * InflaterInputStream.read(byte[], int, int)}).

Consider this wording, which makes the danger clearer, here and elsewhere:

     * @apiNote The {@code InputStream} returned by this method can wrap an
     * {@link java.util.zip.InflaterInputStream InflaterInputStream}, whose
     * {@link java.util.zip.InflaterInputStream#read(byte[], int, int)
     * read(byte[], int, int)} method can modify any element of the output
     * buffer.

src/java.base/share/classes/java/util/zip/InflaterInputStream.java line 136:

> 134:      * are undefined (an implementation is free to change them during 
> the inflate
> 135:      * operation). If the return value is -1 or an exception is thrown, 
> then the content of
> 136:      * {@code b[off]} to {@code b[off+}<i>len</i>{@code -1]} is 
> undefined.

Consider this more precise wording, both here and in `GZIPInputStream`, 
`ZipInputStream`, and `JarInputStream`:

     * If this method returns a nonzero integer <i>n</i> then {@code buf[off]}
     * through {@code buf[off+}<i>n</i>{@code -1]} contain the uncompressed
     * data.  The content of elements {@code buf[off+}<i>n</i>{@code ]} through
     * {@code buf[off+}<i>len</i>{@code -1]} is undefined, contrary to the
     * specification of the {@link java.io.InputStream InputStream} superclass,
     * so an implementation is free to modify these elements during the inflate
     * operation. If this method returns {@code -1} or throws an exception then
     * the content of {@code buf[off]} through {@code buf[off+}<i>len</i>{@code
     * -1]} is undefined.

src/java.base/share/classes/java/util/zip/InflaterInputStream.java line 141:

> 139:      * @param off the start offset in the destination array {@code b}
> 140:      * @param len the maximum number of bytes read
> 141:      * @return the actual number of inflated bytes, or -1 if the end of 
> the

s/inflated bytes/bytes inflated/

-------------

Changes requested by mr (Lead).

PR: https://git.openjdk.org/jdk/pull/7986

Re: RFR: 8282648: Weaken the InflaterInputStream specification in order to allow faster Zip implementations [v11]

Reply via email to