On Mon, 31 Oct 2022 09:52:06 GMT, Alan Bateman <al...@openjdk.org> wrote:
>> **TL;DR:** Your optimization is now contained in this PR and outperforms the >> previous code. Nevertheless I plead for Alternative C (double buffering) >> instead, which is even faster but much simpler. >> >> **Justification** >> >> I took some time and did intensive JMH testing (with two buffer sizes and >> with four percental start positions within the buffer to simulate >> non-pre-read, slightly-pre-read, mostly-pre-read and fully-pre-read buffer >> situations) with three variants: The current code with early buffer resize >> (which you say is not necessary), the current code without early buffer >> resize (which you propose instead), and the so-called Alternative C (which >> uses double-buffering instead of buffer-replacing). >> >> * Alternative B: Optimization outperforms unoptimized code >> The results are shown below, and as you can see, early buffer resize is only >> a problem in *some*, but not all, cases. Nevertheless I agree that should be >> avoided, so I already pushed it as commit >> https://github.com/openjdk/jdk/pull/10525/commits/9c2f502c06083694deefe6a9d4c8ee4b77a92c68 >> and hope that you are fine with this latest piece of code. >> >> * Alternative C: Outperforms Alternative B (optimized *and* unoptimized >> code) unless buffer is huge >> Interesting to see BTW is that Alternative C (double-buffering instead of >> buffer-replacing) even outperforms that optimized code for the default >> buffer size (8K), and is only slightly slower in the case of large (1M) >> buffers (apparently `System.arraycopy` is faster than `Arrays.fill` unless >> the array is huge, which I did not expect originally). As the code is >> super-simple compared to the current code, and as the originally assumed >> risk of OOME does not exists (if it would, the `fill()` method would fail >> already, as it uses double-buffering, too, and is called by the `read` >> happening *before* `transferTo`) I would propose that we go with this code >> instead of the optimized Alternative B: >> >> private long implTransferTo(OutputStream out) throws IOException { >> if (getClass() == BufferedInputStream.class && markpos == -1) { >> int avail = count - pos; >> if (avail > 0) { >> byte[] buffer = new byte[avail]; // Double buffering prevents >> poisoning and leaking >> System.arraycopy(getBufIfOpen(), pos, buffer, 0, avail); >> out.write(buffer); >> pos = count; >> } >> return avail + getInIfOpen().transferTo(out); >> } else { >> return super.transferTo(out); >> } >> } >> >> >> Here are the numbers in full length for your reference: >> >> Alternative B optimized (no early buffer resize) >> >> Benchmark (bufSize) (posPercent) >> Mode Cnt Score Error Units >> BufferedInputStreamTransferToPerformance.transferTo 8192 0 >> thrpt 25 4289,949 ┬▒ 217,452 ops/s >> BufferedInputStreamTransferToPerformance.transferTo 8192 10 >> thrpt 25 3823,443 ┬▒ 314,171 ops/s >> BufferedInputStreamTransferToPerformance.transferTo 8192 90 >> thrpt 25 4122,949 ┬▒ 52,131 ops/s >> BufferedInputStreamTransferToPerformance.transferTo 8192 100 >> thrpt 25 4675,491 ┬▒ 46,194 ops/s >> BufferedInputStreamTransferToPerformance.transferTo 1048576 0 >> thrpt 25 3477,497 ┬▒ 111,229 ops/s >> BufferedInputStreamTransferToPerformance.transferTo 1048576 10 >> thrpt 25 4767,220 ┬▒ 18,274 ops/s >> BufferedInputStreamTransferToPerformance.transferTo 1048576 90 >> thrpt 25 43900,923 ┬▒ 799,471 ops/s >> BufferedInputStreamTransferToPerformance.transferTo 1048576 100 >> thrpt 25 1680163,414 ┬▒ 29823,617 ops/s >> >> Alternative B original (with early buffer resize) >> >> Benchmark (bufSize) (posPercent) >> Mode Cnt Score Error Units >> BufferedInputStreamTransferToPerformance.transferTo 8192 0 >> thrpt 25 4346,835 ┬▒ 273,140 ops/s >> BufferedInputStreamTransferToPerformance.transferTo 8192 10 >> thrpt 25 3816,744 ┬▒ 220,669 ops/s >> BufferedInputStreamTransferToPerformance.transferTo 8192 90 >> thrpt 25 3624,021 ┬▒ 340,279 ops/s >> BufferedInputStreamTransferToPerformance.transferTo 8192 100 >> thrpt 25 4484,060 ┬▒ 95,201 ops/s >> BufferedInputStreamTransferToPerformance.transferTo 1048576 0 >> thrpt 25 3507,334 ┬▒ 140,604 ops/s >> BufferedInputStreamTransferToPerformance.transferTo 1048576 10 >> thrpt 25 4800,050 ┬▒ 44,987 ops/s >> BufferedInputStreamTransferToPerformance.transferTo 1048576 90 >> thrpt 25 44327,841 ┬▒ 256,322 ops/s >> BufferedInputStreamTransferToPerformance.transferTo 1048576 100 >> thrpt 25 1699685,163 ┬▒ 44257,421 ops/s >> >> Alternative C: Double buffering >> >> Benchmark (bufSize) (posPercent) >> Mode Cnt Score Error Units >> BufferedInputStreamTransferToPerformance.transferTo 8192 0 >> thrpt 25 4555,523 ┬▒ 83,370 ops/s >> BufferedInputStreamTransferToPerformance.transferTo 8192 10 >> thrpt 25 3972,562 ┬▒ 489,380 ops/s >> BufferedInputStreamTransferToPerformance.transferTo 8192 90 >> thrpt 25 4181,716 ┬▒ 49,527 ops/s >> BufferedInputStreamTransferToPerformance.transferTo 8192 100 >> thrpt 25 4602,409 ┬▒ 125,641 ops/s >> BufferedInputStreamTransferToPerformance.transferTo 1048576 0 >> thrpt 25 3373,734 ┬▒ 161,989 ops/s >> BufferedInputStreamTransferToPerformance.transferTo 1048576 10 >> thrpt 25 4684,583 ┬▒ 87,257 ops/s >> BufferedInputStreamTransferToPerformance.transferTo 1048576 90 >> thrpt 25 44082,232 ┬▒ 364,210 ops/s >> BufferedInputStreamTransferToPerformance.transferTo 1048576 100 >> thrpt 25 1690855,778 ┬▒ 53885,439 ops/s >> >> (N.B.: Huge buffers appear to have better numbers than small buffers simply >> because the test always transported 1 MB of data, so with a huge buffer this >> happens in few steps, while with small buffers this happens in many steps.) > >> **TL;DR:** Your optimization is now contained in this PR and outperforms the >> previous code. Nevertheless I plead for Alternative C (double buffering) >> instead, which is even faster but much simpler. > > You had dismissed this previously so it's useful to have results now to > compare. I agree it's much simpler, it can probably use Arrays.copyOfRange > too. > > For the results, can you say what the underlying input stream is and what the > target output stream is? Also it would be useful to get a baseline. I suspect > you have that, it's just not pasted in the table above. @AlanBateman Alternative C+ Arrays.copyOfRange() is implemented in https://github.com/openjdk/jdk/pull/10525/commits/92ed32cd28dab1ef5a03c6a8b1e13fef04a52753. So I think we finally got it? :-) ------------- PR: https://git.openjdk.org/jdk/pull/10525