Re: String encodeUTF8 latin1 with negatives

Roger Riggs Mon, 28 Jul 2025 13:18:45 -0700

Hi Brett,

Extra care is needed if the input array might be modified concurrently with the method execution. When control flow decisions are made based on array contents, the integrity of the result depends on reading each byte of the array exactly once.
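
One way to read this concern, as a minimal sketch rather than a description of what the JDK does: if a scan decides that a prefix is all positive and a later bulk copy reads those bytes again, a concurrent writer could change a byte between the two reads, and a raw value of 0x80 or above would land in the output unencoded. A loop that loads each byte into a local once and decides on that local avoids the second read. The class and method names below are hypothetical, and `new byte[val.length * 2]` is assumed as a stand-in for the internal `StringUTF16.newBytesFor`.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class SingleReadSketch {

    // Encodes Latin-1 bytes as UTF-8, reading each source byte exactly once.
    // Both the branch and the emitted bytes are derived from the same local.
    static byte[] encodeSinglePass(byte[] val) {
        byte[] dst = new byte[val.length * 2]; // worst case: every byte expands to two
        int dp = 0;
        for (int i = 0; i < val.length; i++) {
            byte c = val[i]; // the only read of val[i]
            if (c < 0) {
                // Code points 0x80..0xFF become a two-byte UTF-8 sequence.
                dst[dp++] = (byte) (0xc0 | ((c & 0xff) >> 6));
                dst[dp++] = (byte) (0x80 | (c & 0x3f));
            } else {
                dst[dp++] = c;
            }
        }
        return Arrays.copyOf(dst, dp); // trim to the bytes actually written
    }

    public static void main(String[] args) {
        String s = "caf\u00e9";
        byte[] latin1 = s.getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = encodeSinglePass(latin1);
        System.out.println(Arrays.equals(utf8, s.getBytes(StandardCharsets.UTF_8)));
    }
}
```

The trade-off is that a pure single pass gives up the bulk scan and copy of the positive prefix, which is where the proposal gets its speedup.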


Regards, Roger



On 7/27/25 4:45 PM, Brett Okken wrote:

In String.encodeUTF8, when the coder is latin1, there is a call to StringCoding.hasNegatives to determine if any special handling is needed. If not, a clone of the val is returned. If there are negative values, it then loops, from the beginning, through all the values to handle any individual negative values.

Would it be better to call StringCoding.countPositives? If the result equals the length, the clone can still be returned. But if it does not, all the values which are positive can be simply copied to the target byte[] and only values beyond that point need to be checked again.


https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/lang/String.java#L1287-L1300

        if (!StringCoding.hasNegatives(val, 0, val.length)) {
            return val.clone();
        }

        int dp = 0;
        byte[] dst = StringUTF16.newBytesFor(val.length);
        for (byte c : val) {
            if (c < 0) {
                dst[dp++] = (byte) (0xc0 | ((c & 0xff) >> 6));
                dst[dp++] = (byte) (0x80 | (c & 0x3f));
            } else {
                dst[dp++] = c;
            }
        }


Can be changed to look like:

        int positives = StringCoding.countPositives(val, 0, val.length);
        if (positives == val.length) {
            return val.clone();
        }

        int dp = positives;
        byte[] dst = StringUTF16.newBytesFor(val.length);
        if (positives > 0) {
            System.arraycopy(val, 0, dst, 0, positives);
        }
        for (int i=dp; i<val.length; ++i) {
            byte c = val[i];
            if (c < 0) {
                dst[dp++] = (byte) (0xc0 | ((c & 0xff) >> 6));
                dst[dp++] = (byte) (0x80 | (c & 0x3f));
            } else {
                dst[dp++] = c;
            }
        }
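
For anyone who wants to try the two shapes outside the JDK, here is a self-contained sketch that checks both against `String.getBytes(UTF_8)`. It rests on several assumptions: `countPositives` below is a plain-Java stand-in for the internal `StringCoding.countPositives` (the real one is intrinsified), `new byte[val.length * 2]` stands in for `StringUTF16.newBytesFor(val.length)`, and the final `Arrays.copyOf` trim is added because the snippets above stop before the return. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Latin1Utf8Sketch {

    // Stand-in (assumption): number of leading non-negative bytes in val[off, off+len).
    static int countPositives(byte[] val, int off, int len) {
        for (int i = 0; i < len; i++) {
            if (val[off + i] < 0) {
                return i;
            }
        }
        return len;
    }

    // Writes the UTF-8 form of val[from..] into dst starting at dp; returns the new dp.
    static int encodeTail(byte[] val, int from, byte[] dst, int dp) {
        for (int i = from; i < val.length; i++) {
            byte c = val[i];
            if (c < 0) {
                dst[dp++] = (byte) (0xc0 | ((c & 0xff) >> 6));
                dst[dp++] = (byte) (0x80 | (c & 0x3f));
            } else {
                dst[dp++] = c;
            }
        }
        return dp;
    }

    // Current shape: yes/no check, then re-scan everything from index 0.
    static byte[] encodeBaseline(byte[] val) {
        if (countPositives(val, 0, val.length) == val.length) {
            return val.clone();
        }
        byte[] dst = new byte[val.length * 2];
        return Arrays.copyOf(dst, encodeTail(val, 0, dst, 0));
    }

    // Proposed shape: bulk-copy the positive prefix, then scan only the rest.
    static byte[] encodeWithCountPositives(byte[] val) {
        int positives = countPositives(val, 0, val.length);
        if (positives == val.length) {
            return val.clone();
        }
        byte[] dst = new byte[val.length * 2];
        System.arraycopy(val, 0, dst, 0, positives);
        return Arrays.copyOf(dst, encodeTail(val, positives, dst, positives));
    }

    public static void main(String[] args) {
        String[] samples = {"", "plain ascii", "caf\u00e9", "\u00e9 first", "a \u00fc b \u00f1 c"};
        for (String s : samples) {
            byte[] latin1 = s.getBytes(StandardCharsets.ISO_8859_1);
            byte[] expected = s.getBytes(StandardCharsets.UTF_8);
            if (!Arrays.equals(expected, encodeBaseline(latin1))
                    || !Arrays.equals(expected, encodeWithCountPositives(latin1))) {
                throw new AssertionError("mismatch for: " + s);
            }
        }
        System.out.println("both variants match String.getBytes(UTF_8)");
    }
}
```

This only checks that the two variants agree on output; it says nothing about performance, which depends on the intrinsic.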

I have done a bit of testing with the StringEncode jmh benchmark on my local Windows device.


encodeLatin1LongEnd speeds up significantly (~70%)
encodeLatin1LongStart slows down (~20%)
encodeLatin1Mixed speeds up by ~30%

The remaining tests do not show much difference either way.

Brett
