[ https://issues.apache.org/jira/browse/CASSANDRA-21075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dmitry Konstantinov updated CASSANDRA-21075:
--------------------------------------------
Description:
In my batch write test, UTF8 validation contributes 2.1% of CPU:
[^before_cpu.html]
In UTF8Validator.validate we can apply the same optimization that Guava and the
JDK use: a plain loop checks whether the bytes are ASCII, and only the first
non-ASCII byte sends the code into the more complicated UTF-8 parsing:
* [https://github.com/google/guava/blob/master/guava/src/com/google/common/base/Utf8.java#L123]
{code:java}
for (int i = off; i < end; i++) {
    if (bytes[i] < 0) {
        return isWellFormedSlowPath(bytes, i, end);
    }
}
{code}
* java.lang.StringCoding#decodeUTF8
{code:java}
// ascii-bias, which has a relative impact to the non-ascii-only bytes
if (COMPACT_STRINGS && !hasNegatives(src, sp, len))
    return resultCached().with(Arrays.copyOfRange(src, sp, sp + len),
                               LATIN1);
return decodeUTF8_0(src, sp, len, doReplace);

// where:
public static boolean hasNegatives(byte[] ba, int off, int len) {
    for (int i = off; i < off + len; i++) {
        if (ba[i] < 0) {
            return true;
        }
    }
    return false;
}
{code}
See also: https://lemire.me/blog/2018/10/16/validating-utf-8-bytes-java-edition/
Additionally, ValueAccessor is not free; by avoiding it we can get an extra
boost, especially in non-monomorphic cases.
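
To make the idea concrete, here is a minimal sketch of an ASCII fast path placed in front of a UTF-8 check that works directly on a byte[]. The class name AsciiFastPathUtf8 and both method names are hypothetical, and the slow path is a plain strict UTF-8 check written for illustration only; it does not reproduce the state machine in Cassandra's UTF8Validator, whose accepted byte sequences may differ.

```java
// Hypothetical illustration, not the Cassandra, Guava, or JDK implementation.
final class AsciiFastPathUtf8 {
    private AsciiFastPathUtf8() {}

    /** Returns true if bytes[off, off + len) is well-formed (strict) UTF-8. */
    static boolean isValid(byte[] bytes, int off, int len) {
        int end = off + len;
        // Fast path: ASCII bytes are non-negative when read as signed Java bytes.
        for (int i = off; i < end; i++) {
            if (bytes[i] < 0) {
                // First non-ASCII byte: validate the remainder in the slow path.
                return isValidSlowPath(bytes, i, end);
            }
        }
        return true;
    }

    private static boolean isContinuation(int b) {
        return (b & 0xC0) == 0x80;
    }

    private static boolean isValidSlowPath(byte[] bytes, int i, int end) {
        while (i < end) {
            int b0 = bytes[i++] & 0xFF;
            if (b0 < 0x80) {
                continue; // ASCII
            }
            if (b0 < 0xC2) {
                return false; // stray continuation byte or overlong 2-byte lead
            }
            if (b0 < 0xE0) { // 2-byte sequence
                if (i >= end || !isContinuation(bytes[i++])) return false;
            } else if (b0 < 0xF0) { // 3-byte sequence
                if (i + 1 >= end) return false;
                int b1 = bytes[i++] & 0xFF;
                if (!isContinuation(b1)) return false;
                if (b0 == 0xE0 && b1 < 0xA0) return false;  // overlong encoding
                if (b0 == 0xED && b1 >= 0xA0) return false; // UTF-16 surrogate range
                if (!isContinuation(bytes[i++])) return false;
            } else if (b0 < 0xF5) { // 4-byte sequence
                if (i + 2 >= end) return false;
                int b1 = bytes[i++] & 0xFF;
                if (!isContinuation(b1)) return false;
                if (b0 == 0xF0 && b1 < 0x90) return false;  // overlong encoding
                if (b0 == 0xF4 && b1 >= 0x90) return false; // above U+10FFFF
                if (!isContinuation(bytes[i++])) return false;
                if (!isContinuation(bytes[i++])) return false;
            } else {
                return false; // 0xF5..0xFF never appear in UTF-8
            }
        }
        return true;
    }
}
```

For an all-ASCII value the method never leaves the first loop, which is the case the batch write profile suggests is common. Because it takes a byte[] with an offset and length, the hot loop also involves no accessor indirection.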
was:
In my batch write test, UTF8 validation contributes 2.1% of CPU:
[^before_cpu.html]
In UTF8Validator.validate we can apply the same optimization that Guava and the
JDK use: a plain loop checks whether the bytes are ASCII, and only the first
non-ASCII byte sends the code into the more complicated UTF-8 parsing:
* [https://github.com/google/guava/blob/master/guava/src/com/google/common/base/Utf8.java#L123]
{code:java}
for (int i = off; i < end; i++) {
    if (bytes[i] < 0) {
        return isWellFormedSlowPath(bytes, i, end);
    }
}
{code}
* java.lang.StringCoding#decodeUTF8
{code:java}
// ascii-bias, which has a relative impact to the non-ascii-only bytes
if (COMPACT_STRINGS && !hasNegatives(src, sp, len))
    return resultCached().with(Arrays.copyOfRange(src, sp, sp + len),
                               LATIN1);
return decodeUTF8_0(src, sp, len, doReplace);

// where:
public static boolean hasNegatives(byte[] ba, int off, int len) {
    for (int i = off; i < off + len; i++) {
        if (ba[i] < 0) {
            return true;
        }
    }
    return false;
}
{code}
See also: https://lemire.me/blog/2018/10/16/validating-utf-8-bytes-java-edition/
> Optimize UTF8Validator.validate for ASCII prefixed Strings
> ----------------------------------------------------------
>
> Key: CASSANDRA-21075
> URL: https://issues.apache.org/jira/browse/CASSANDRA-21075
> Project: Apache Cassandra
> Issue Type: Improvement
> Components: CQL/Interpreter
> Reporter: Dmitry Konstantinov
> Assignee: Dmitry Konstantinov
> Priority: Normal
> Fix For: 5.x
>
> Attachments: before_cpu.html
>