Dmitry Konstantinov created CASSANDRA-21075:
-----------------------------------------------

             Summary: Optimize UTF8Validator.validate for ASCII prefixed Strings
                 Key: CASSANDRA-21075
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-21075
             Project: Apache Cassandra
          Issue Type: Improvement
          Components: CQL/Interpreter
            Reporter: Dmitry Konstantinov
            Assignee: Dmitry Konstantinov


In UTF8Validator.validate we can apply the same optimization that Guava and
the JDK use: a plain loop checks whether each byte is ASCII before falling
back to the more complicated UTF-8 parsing:
 * Guava, com.google.common.base.Utf8: [https://github.com/google/guava/blob/master/guava/src/com/google/common/base/Utf8.java#L123]

{code:java}
for (int i = off; i < end; i++) {
    if (bytes[i] < 0) {
        return isWellFormedSlowPath(bytes, i, end);
    }
}
{code}

 * java.lang.StringCoding#decodeUTF8 

{code:java}
// ascii-bias, which has a relative impact to the non-ascii-only bytes
if (COMPACT_STRINGS && !hasNegatives(src, sp, len))
    return resultCached().with(Arrays.copyOfRange(src, sp, sp + len),
                                   LATIN1);
return decodeUTF8_0(src, sp, len, doReplace);

// where hasNegatives is defined as:

public static boolean hasNegatives(byte[] ba, int off, int len) {
    for (int i = off; i < off + len; i++) {
        if (ba[i] < 0) {
            return true;
        }
    }
    return false;
}
{code}
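
For illustration, the same fast path applied to a ByteBuffer-based check could look roughly like the sketch below. The class AsciiPrefixUtf8Check and its methods are hypothetical names, not Cassandra's actual UTF8Validator code, and the slow path is a simplified strict UTF-8 check that stands in for the existing multi-byte validation (whose exact rules may differ, for example in how it treats overlong encodings).

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not Cassandra's UTF8Validator: it only illustrates the ASCII fast path.
final class AsciiPrefixUtf8Check {
    private AsciiPrefixUtf8Check() {}

    /** Returns true if the bytes between position and limit are well-formed UTF-8. */
    static boolean isValid(ByteBuffer buf) {
        int end = buf.limit();
        // Fast path: ASCII bytes are non-negative when read as signed Java bytes.
        for (int i = buf.position(); i < end; i++) {
            if (buf.get(i) < 0) {
                // First non-ASCII byte: hand the remainder to the full check.
                return isValidSlowPath(buf, i, end);
            }
        }
        return true; // every byte was ASCII
    }

    // Simplified strict UTF-8 check standing in for the real multi-byte validation.
    private static boolean isValidSlowPath(ByteBuffer buf, int i, int end) {
        while (i < end) {
            int b = buf.get(i++) & 0xFF;
            if (b < 0x80) continue;            // ASCII byte after the prefix
            int extra;                         // continuation bytes expected
            int min;                           // smallest code point allowed for this length
            int cp;                            // code point being assembled
            if (b >= 0xC2 && b <= 0xDF)      { extra = 1; min = 0x80;    cp = b & 0x1F; }
            else if (b >= 0xE0 && b <= 0xEF) { extra = 2; min = 0x800;   cp = b & 0x0F; }
            else if (b >= 0xF0 && b <= 0xF4) { extra = 3; min = 0x10000; cp = b & 0x07; }
            else return false;                 // stray continuation or invalid lead byte
            if (end - i < extra) return false; // truncated sequence
            for (int k = 0; k < extra; k++) {
                int c = buf.get(i++) & 0xFF;
                if ((c & 0xC0) != 0x80) return false; // not a continuation byte
                cp = (cp << 6) | (c & 0x3F);
            }
            // Reject overlong encodings, surrogates, and values above U+10FFFF.
            if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return false;
        }
        return true;
    }
}
```

With this shape, a value that is entirely ASCII never enters the slow path, and a value with an ASCII prefix pays the full parsing cost only from its first non-ASCII byte onward. The loop uses absolute ByteBuffer.get(int) reads, so the buffer's position is left unchanged.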

See also: https://lemire.me/blog/2018/10/16/validating-utf-8-bytes-java-edition/
