I have a suggestion for a performance improvement in sun.nio.cs.UTF_8,
the workhorse for stream-based UTF-8 encoding and decoding, but I don't
know if this has been discussed before.
Let me explain the idea for the decoding case:
Claes Redestad describes in his blog
https://cl4es.github.io/2021/02/23/Faster-Charset-Decoding.html how he
used SIMD intrinsics (now JavaLangAccess.decodeASCII) to speed up UTF-8
decoding when the buffers are backed by arrays:
https://github.com/openjdk/jdk/blob/0258d9998ebc523a6463818be00353c6ac8b7c9c/src/java.base/share/classes/sun/nio/cs/UTF_8.java#L231
* first, a call to JLA.decodeASCII harvests all ASCII characters
(= 1-byte UTF-8 sequences) at the beginning of the input
* then the decoder enters the slow loop that looks at UTF-8 byte
sequences in the input buffer one by one and writes the decoded chars
to the output buffer (this is basically the old implementation)
If the input is all ASCII, all decoding work is done in JLA.decodeASCII,
resulting in an extreme performance boost. But as soon as the input
contains a non-ASCII byte, decoding falls back to the slow array loop.
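If I read the code correctly, the relevant part of the current
decodeArrayLoop looks roughly like this (a simplified sketch of the
linked source, not a verbatim quote):

    int n = JLA.decodeASCII(sa, sp, da, dp, Math.min(sl - sp, dl - dp));
    sp += n;
    dp += n;
    while (sp < sl) {
        int b1 = sa[sp];
        if (b1 >= 0) {
            // 1 byte, 7 bits: 0xxxxxxx
            if (dp >= dl)
                return xflow(src, sp, sl, dst, dp, 1);
            da[dp++] = (char) b1;  // back in the loop: one ASCII char at a time
            sp++;
        } else if ((b1 >> 5) == -2 && (b1 & 0x1e) != 0) {
            // 2-, 3- and 4-byte sequences handled as before
            ...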
Now here is my idea: why not call JLA.decodeASCII every time an ASCII
byte is seen in the slow loop:
while (sp < sl) {
    int b1 = sa[sp];
    if (b1 >= 0) {
        // 1 byte, 7 bits: 0xxxxxxx
        if (dp >= dl)
            return xflow(src, sp, sl, dst, dp, 1);
        // my change
        int n = JLA.decodeASCII(sa, sp, da, dp, Math.min(sl - sp, dl - dp));
        sp += n;
        dp += n;
    } else if ((b1 >> 5) == -2 && (b1 & 0x1e) != 0) {
I set up a small improvised benchmark to get an idea of the impact:
Benchmark                     (data)   Mode  Cnt        Score   Error  Units
DecoderBenchmark.jdkDecoder  TD_8000  thrpt    2  2045960,037          ops/s
DecoderBenchmark.jdkDecoder  TD_3999  thrpt    2   263744,675          ops/s
DecoderBenchmark.jdkDecoder   TD_999  thrpt    2   154232,940          ops/s
DecoderBenchmark.jdkDecoder   TD_499  thrpt    2   142239,763          ops/s
DecoderBenchmark.jdkDecoder    TD_99  thrpt    2   128678,229          ops/s
DecoderBenchmark.jdkDecoder     TD_9  thrpt    2   127388,649          ops/s
DecoderBenchmark.jdkDecoder     TD_4  thrpt    2   119834,183          ops/s
DecoderBenchmark.jdkDecoder     TD_2  thrpt    2   111733,115          ops/s
DecoderBenchmark.jdkDecoder     TD_1  thrpt    2   102397,455          ops/s
DecoderBenchmark.newDecoder  TD_8000  thrpt    2  2022997,518          ops/s
DecoderBenchmark.newDecoder  TD_3999  thrpt    2  2909450,005          ops/s
DecoderBenchmark.newDecoder   TD_999  thrpt    2  2140307,712          ops/s
DecoderBenchmark.newDecoder   TD_499  thrpt    2  1171970,809          ops/s
DecoderBenchmark.newDecoder    TD_99  thrpt    2   686771,614          ops/s
DecoderBenchmark.newDecoder     TD_9  thrpt    2    95181,541          ops/s
DecoderBenchmark.newDecoder     TD_4  thrpt    2    65656,184          ops/s
DecoderBenchmark.newDecoder     TD_2  thrpt    2    45439,240          ops/s
DecoderBenchmark.newDecoder     TD_1  thrpt    2    36994,738          ops/s
(The benchmark uses only in-memory buffers; each test input is a UTF-8
encoded byte buffer that decodes to 8000 chars and consists of runs of
pure ASCII bytes of varying length, each followed by a 2-byte UTF-8
sequence producing a non-ASCII char:
TD_8000: 8000 ASCII bytes -> 1 call to JLA.decodeASCII
TD_3999: 3999 ASCII bytes + 2 non-ASCII bytes, repeated 2 times
         -> 2 calls to JLA.decodeASCII
...
TD_1:    1 ASCII byte + 2 non-ASCII bytes, repeated 4000 times
         -> 4000 calls to JLA.decodeASCII)
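For illustration only, test data of this shape could be generated
roughly like this (a sketch, not the actual benchmark code; the class
and method names are made up, and 'ä' stands in for any char with a
2-byte UTF-8 encoding):

    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.charset.StandardCharsets;

    class DecoderBenchmarkDataSketch {
        // Build an input that decodes to 8000 chars: blocks of `asciiRun`
        // ASCII bytes, each followed by one 2-byte UTF-8 sequence
        // ('\u00e4' encodes to 0xC3 0xA4).
        static ByteBuffer testData(int asciiRun) {
            StringBuilder sb = new StringBuilder(8000);
            while (sb.length() < 8000) {
                for (int i = 0; i < asciiRun && sb.length() < 8000; i++)
                    sb.append('a');
                if (sb.length() < 8000)
                    sb.append('\u00e4');
            }
            return ByteBuffer.wrap(sb.toString().getBytes(StandardCharsets.UTF_8));
        }

        public static void main(String[] args) {
            // The measured operation is essentially one decode of such a buffer:
            CharBuffer out = CharBuffer.allocate(8000);
            StandardCharsets.UTF_8.newDecoder().decode(testData(3999), out, true);
            System.out.println(out.flip().remaining());  // 8000
        }
    }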
Interpretation:
* Input is all ASCII: same performance as before.
* Input contains pure ASCII runs of considerable length, interrupted by
non-ASCII bytes: huge performance improvements, similar to the pure
ASCII case.
* Input consists of lots of short ASCII runs interrupted by non-ASCII
bytes: at some point performance drops below the current
implementation, presumably because every short run now pays the
overhead of a JLA.decodeASCII call.
Thanks for reading and happy to hear your opinions,
Johannes