I have a suggestion for a performance improvement in sun.nio.cs.UTF_8, the workhorse for stream-based UTF-8 encoding and decoding, but I don't know if this has been discussed before.
I will explain my idea for the decoding case:
Claes Redestad describes in his blog https://cl4es.github.io/2021/02/23/Faster-Charset-Decoding.html how he used SIMD intrinsics (now JavaLangAccess.decodeASCII) to speed up UTF-8 decoding when buffers are backed by arrays:

https://github.com/openjdk/jdk/blob/0258d9998ebc523a6463818be00353c6ac8b7c9c/src/java.base/share/classes/sun/nio/cs/UTF_8.java#L231

 * first, a call to JLA.decodeASCII harvests all ASCII characters
   (= 1-byte UTF-8 sequences) at the beginning of the input
 * then the decoder enters the slow loop that examines UTF-8 byte
   sequences in the input buffer and writes chars to the output buffer
   (this is basically the old implementation)

If the input is all ASCII, all decoding work is done in JLA.decodeASCII, resulting in an extreme performance boost. But as soon as the input contains a non-ASCII byte, the decoder falls back to the slow array loop for everything that follows.
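
To make that fallback concrete, here is a minimal sketch of the current two-phase structure (a simplified model, not the real JDK code; only 1- and 2-byte sequences are handled and all error handling is omitted):

```java
public class CurrentShape {
    // Simplified model: the ASCII fast path runs exactly once, at the start
    // of the input, so a single early non-ASCII byte pushes all remaining
    // work into the byte-at-a-time loop.
    static int decode(byte[] sa, char[] da) {
        int sp = 0, dp = 0;
        // phase 1: harvest the leading ASCII run (stands in for JLA.decodeASCII)
        while (sp < sa.length && sa[sp] >= 0) {
            da[dp++] = (char) sa[sp++];
        }
        // phase 2: slow loop for the rest, ASCII included
        while (sp < sa.length) {
            int b1 = sa[sp];
            if (b1 >= 0) {
                da[dp++] = (char) b1;   // ASCII, one byte per iteration
                sp++;
            } else if ((b1 >> 5) == -2 && (b1 & 0x1e) != 0) {
                // 2 bytes, 11 bits: 110xxxxx 10xxxxxx (same bit trick as UTF_8.java)
                int b2 = sa[sp + 1];
                da[dp++] = (char) (((b1 << 6) ^ b2) ^ (((byte) 0xC0 << 6) ^ ((byte) 0x80)));
                sp += 2;
            } else {
                throw new IllegalArgumentException("sequence not handled in this sketch");
            }
        }
        return dp; // number of chars written
    }
}
```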

Now here is my idea: Why not call JLA.decodeASCII whenever an ASCII byte is seen:

while (sp < sl) {
    int b1 = sa[sp];
    if (b1 >= 0) {
        // 1 byte, 7 bits: 0xxxxxxx
        if (dp >= dl)
            return xflow(src, sp, sl, dst, dp, 1);
        // my change
*        int n = JLA.decodeASCII(sa, sp, da, dp, Math.min(sl - sp, dl - dp));
        sp += n;
        dp += n;
*    } else if ((b1 >> 5) == -2 && (b1 & 0x1e) != 0) {
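
Stripped of the JDK's buffer bookkeeping, the proposed loop can be sketched as a self-contained model (decodeAsciiRun is a hypothetical scalar stand-in for the intrinsified JLA.decodeASCII; only 1- and 2-byte sequences are handled and error handling is omitted):

```java
public class AsciiRunDecoder {
    // Stand-in for JLA.decodeASCII: copies ASCII bytes from sa[sp..] to
    // da[dp..] until a non-ASCII byte or the length limit is hit, and
    // returns the number of bytes copied (the real method is intrinsified).
    static int decodeAsciiRun(byte[] sa, int sp, char[] da, int dp, int len) {
        int n = 0;
        while (n < len && sa[sp + n] >= 0) {
            da[dp + n] = (char) sa[sp + n];
            n++;
        }
        return n;
    }

    static int decode(byte[] sa, char[] da) {
        int sp = 0, dp = 0;
        int sl = sa.length, dl = da.length;
        while (sp < sl) {
            int b1 = sa[sp];
            if (b1 >= 0) {
                // ASCII seen: harvest the whole run at once instead of one byte
                int n = decodeAsciiRun(sa, sp, da, dp, Math.min(sl - sp, dl - dp));
                sp += n;
                dp += n;
            } else if ((b1 >> 5) == -2 && (b1 & 0x1e) != 0) {
                // 2 bytes, 11 bits: 110xxxxx 10xxxxxx (same bit trick as UTF_8.java)
                int b2 = sa[sp + 1];
                da[dp++] = (char) (((b1 << 6) ^ b2) ^ (((byte) 0xC0 << 6) ^ ((byte) 0x80)));
                sp += 2;
            } else {
                throw new IllegalArgumentException("sequence not handled in this sketch");
            }
        }
        return dp; // number of chars written
    }
}
```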

I set up a small improvised benchmark to get an idea of the impact:

Benchmark                     (data)   Mode  Cnt        Score  Error  Units
DecoderBenchmark.jdkDecoder  TD_8000  thrpt    2  2045960.037         ops/s
DecoderBenchmark.jdkDecoder  TD_3999  thrpt    2   263744.675         ops/s
DecoderBenchmark.jdkDecoder   TD_999  thrpt    2   154232.940         ops/s
DecoderBenchmark.jdkDecoder   TD_499  thrpt    2   142239.763         ops/s
DecoderBenchmark.jdkDecoder    TD_99  thrpt    2   128678.229         ops/s
DecoderBenchmark.jdkDecoder     TD_9  thrpt    2   127388.649         ops/s
DecoderBenchmark.jdkDecoder     TD_4  thrpt    2   119834.183         ops/s
DecoderBenchmark.jdkDecoder     TD_2  thrpt    2   111733.115         ops/s
DecoderBenchmark.jdkDecoder     TD_1  thrpt    2   102397.455         ops/s
DecoderBenchmark.newDecoder  TD_8000  thrpt    2  2022997.518         ops/s
DecoderBenchmark.newDecoder  TD_3999  thrpt    2  2909450.005         ops/s
DecoderBenchmark.newDecoder   TD_999  thrpt    2  2140307.712         ops/s
DecoderBenchmark.newDecoder   TD_499  thrpt    2  1171970.809         ops/s
DecoderBenchmark.newDecoder    TD_99  thrpt    2   686771.614         ops/s
DecoderBenchmark.newDecoder     TD_9  thrpt    2    95181.541         ops/s
DecoderBenchmark.newDecoder     TD_4  thrpt    2    65656.184         ops/s
DecoderBenchmark.newDecoder     TD_2  thrpt    2    45439.240         ops/s
DecoderBenchmark.newDecoder     TD_1  thrpt    2    36994.738         ops/s

(The benchmark uses only memory buffers; each test input is a UTF-8 encoded byte buffer that decodes to 8000 chars and consists of ASCII runs of varying length, each followed by a 2-byte UTF-8 sequence producing a non-ASCII char:
TD_8000: 8000 ASCII bytes -> 1 call to JLA.decodeASCII
TD_3999: 3999 ASCII bytes + 2 non-ASCII bytes, repeated 2 times -> 2 calls to JLA.decodeASCII
...
TD_1: 1 ASCII byte + 2 non-ASCII bytes, repeated 4000 times -> 4000 calls to JLA.decodeASCII)
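
For reference, the TD_n inputs as described could be reconstructed roughly like this (a hypothetical sketch, not the benchmark's own code; the method name and the choice of 'é' as the 2-byte sequence are assumptions, and the char count rounds down when n+1 does not divide 8000):

```java
import java.nio.charset.StandardCharsets;

public class TestData {
    // Builds a TD_n style input: runs of n ASCII bytes, each followed by a
    // 2-byte UTF-8 sequence (here 'é' = 0xC3 0xA9), repeated so the decoded
    // output is (about) 8000 chars. TD_8000 is pure ASCII with no 2-byte tail.
    static byte[] makeInput(int asciiRunLength) {
        if (asciiRunLength >= 8000) {
            return "a".repeat(8000).getBytes(StandardCharsets.US_ASCII);
        }
        int repeats = 8000 / (asciiRunLength + 1); // each block decodes to n+1 chars
        StringBuilder sb = new StringBuilder(8000);
        for (int r = 0; r < repeats; r++) {
            sb.append("a".repeat(asciiRunLength));
            sb.append('\u00e9');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }
}
```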

Interpretation:

 * Input is all ASCII: same performance as before.
 * Input contains pure ASCII sequences of considerable length,
   interrupted by non-ASCII bytes: now shows huge performance
   improvements, similar to the pure ASCII case.
 * Input has lots of short ASCII runs interrupted by non-ASCII bytes:
   at some point performance drops below the current implementation.

Thanks for reading and happy to hear your opinions,
Johannes
