I have a suggestion for a performance improvement in sun.nio.cs.UTF_8,
the workhorse for stream-based UTF-8 encoding and decoding, but I don't
know if this has been discussed before.
Let me explain the idea for the decoding case:
Claes Redestad describes in his blog
https://cl4es.github.io/2021/02/23/Faster-Charset-Decoding.html how he
used SIMD intrinsics (now JavaLangAccess.decodeASCII) to speed up UTF-8
decoding when the buffers are backed by arrays:
https://github.com/openjdk/jdk/blob/0258d9998ebc523a6463818be00353c6ac8b7c9c/src/java.base/share/classes/sun/nio/cs/UTF_8.java#L231
* first, a call to JLA.decodeASCII harvests all ASCII characters
(= 1-byte UTF-8 sequences) at the beginning of the input
* then the decoder enters the slow loop that looks at UTF-8 byte
sequences in the input buffer one by one and writes the decoded chars
to the output buffer (this is basically the old implementation)
If the input is all ASCII, all decoding work is done in JLA.decodeASCII,
resulting in an extreme performance boost. But as soon as the input
contains a non-ASCII byte, decoding falls back to the slow array loop.
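If I read the code correctly, the relevant part of the current
decodeArrayLoop looks roughly like this (a simplified sketch of the
linked source, not a verbatim quote):

    int n = JLA.decodeASCII(sa, sp, da, dp, Math.min(sl - sp, dl - dp));
    sp += n;
    dp += n;
    while (sp < sl) {
        int b1 = sa[sp];
        if (b1 >= 0) {
            // 1 byte, 7 bits: 0xxxxxxx
            if (dp >= dl)
                return xflow(src, sp, sl, dst, dp, 1);
            da[dp++] = (char) b1;  // back in the loop: one ASCII char at a time
            sp++;
        } else if ((b1 >> 5) == -2 && (b1 & 0x1e) != 0) {
            // 2-, 3- and 4-byte sequences handled as before
            ...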
Now here is my idea: why not call JLA.decodeASCII every time an ASCII
byte is seen in the slow loop:
while (sp < sl) {
    int b1 = sa[sp];
    if (b1 >= 0) {
        // 1 byte, 7 bits: 0xxxxxxx
        if (dp >= dl)
            return xflow(src, sp, sl, dst, dp, 1);
        // my change
        int n = JLA.decodeASCII(sa, sp, da, dp, Math.min(sl - sp, dl - dp));
        sp += n;
        dp += n;
    } else if ((b1 >> 5) == -2 && (b1 & 0x1e) != 0) {
I set up a small improvised benchmark to get an idea of the impact:
Benchmark                     (data)   Mode  Cnt        Score   Error  Units
DecoderBenchmark.jdkDecoder  TD_8000  thrpt    2  2045960,037          ops/s
DecoderBenchmark.jdkDecoder  TD_3999  thrpt    2   263744,675          ops/s
DecoderBenchmark.jdkDecoder   TD_999  thrpt    2   154232,940          ops/s
DecoderBenchmark.jdkDecoder   TD_499  thrpt    2   142239,763          ops/s
DecoderBenchmark.jdkDecoder    TD_99  thrpt    2   128678,229          ops/s
DecoderBenchmark.jdkDecoder     TD_9  thrpt    2   127388,649          ops/s
DecoderBenchmark.jdkDecoder     TD_4  thrpt    2   119834,183          ops/s
DecoderBenchmark.jdkDecoder     TD_2  thrpt    2   111733,115          ops/s
DecoderBenchmark.jdkDecoder     TD_1  thrpt    2   102397,455          ops/s
DecoderBenchmark.newDecoder  TD_8000  thrpt    2  2022997,518          ops/s
DecoderBenchmark.newDecoder  TD_3999  thrpt    2  2909450,005          ops/s
DecoderBenchmark.newDecoder   TD_999  thrpt    2  2140307,712          ops/s
DecoderBenchmark.newDecoder   TD_499  thrpt    2  1171970,809          ops/s
DecoderBenchmark.newDecoder    TD_99  thrpt    2   686771,614          ops/s
DecoderBenchmark.newDecoder     TD_9  thrpt    2    95181,541          ops/s
DecoderBenchmark.newDecoder     TD_4  thrpt    2    65656,184          ops/s
DecoderBenchmark.newDecoder     TD_2  thrpt    2    45439,240          ops/s
DecoderBenchmark.newDecoder     TD_1  thrpt    2    36994,738          ops/s
(The benchmark uses only in-memory buffers; each test input is a UTF-8
encoded byte buffer that decodes to 8000 chars and consists of runs of
pure ASCII bytes of varying length, each followed by a 2-byte UTF-8
sequence producing a non-ASCII char:
TD_8000: 8000 ASCII bytes -> 1 call to JLA.decodeASCII
TD_3999: 3999 ASCII bytes + 2 non-ASCII bytes, repeated 2 times
         -> 2 calls to JLA.decodeASCII
...
TD_1:    1 ASCII byte + 2 non-ASCII bytes, repeated 4000 times
         -> 4000 calls to JLA.decodeASCII)
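For illustration only, test data of this shape could be generated
roughly like this (a sketch, not the actual benchmark code; the class
and method names are made up, and 'ä' stands in for any char with a
2-byte UTF-8 encoding):

    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.charset.StandardCharsets;

    class DecoderBenchmarkDataSketch {
        // Build an input that decodes to 8000 chars: blocks of `asciiRun`
        // ASCII bytes, each followed by one 2-byte UTF-8 sequence
        // ('\u00e4' encodes to 0xC3 0xA4).
        static ByteBuffer testData(int asciiRun) {
            StringBuilder sb = new StringBuilder(8000);
            while (sb.length() < 8000) {
                for (int i = 0; i < asciiRun && sb.length() < 8000; i++)
                    sb.append('a');
                if (sb.length() < 8000)
                    sb.append('\u00e4');
            }
            return ByteBuffer.wrap(sb.toString().getBytes(StandardCharsets.UTF_8));
        }

        public static void main(String[] args) {
            // The measured operation is essentially one decode of such a buffer:
            CharBuffer out = CharBuffer.allocate(8000);
            StandardCharsets.UTF_8.newDecoder().decode(testData(3999), out, true);
            System.out.println(out.flip().remaining());  // 8000
        }
    }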
Interpretation:
* Input is all ASCII: same performance as before.
* Input contains pure ASCII runs of considerable length, interrupted by
non-ASCII bytes: huge performance improvements, similar to the pure
ASCII case.
* Input consists of lots of short ASCII runs interrupted by non-ASCII
bytes: at some point performance drops below the current
implementation, presumably because every short run now pays the
overhead of a JLA.decodeASCII call.
Thanks for reading and happy to hear your opinions,
Johannes