[
https://issues.apache.org/jira/browse/CASSANDRA-21075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18045495#comment-18045495
]
Dmitry Konstantinov edited comment on CASSANDRA-21075 at 12/16/25 3:23 PM:
---------------------------------------------------------------------------
More tests for different kinds of Strings, Temurin-17.0.12+7:
https://github.com/netudima/cassandra/blob/CASSANDRA-21075-trunk-experiments/test/microbench/org/apache/cassandra/test/microbench/UTF8ValidatorBench.java
{code:java}
[java] Benchmark                                            (stringType)                  Mode  Cnt     Score     Error  Units
[java] UTF8ValidatorBench.testOldBimorphic                  short ASCII                   avgt   15    29.167 ±   1.170  ns/op
[java] UTF8ValidatorBench.testNewBimorphic                  short ASCII                   avgt   15    11.354 ±   0.181  ns/op
[java] UTF8ValidatorBench.testOldBimorphic                  long ASCII                    avgt   15   967.613 ±  23.208  ns/op
[java] UTF8ValidatorBench.testNewBimorphic                  long ASCII                    avgt   15   597.225 ±  33.805  ns/op
[java] UTF8ValidatorBench.testOldBimorphic                  short ASCII prefix non-ASCII  avgt   15   462.977 ±  30.108  ns/op
[java] UTF8ValidatorBench.testNewBimorphic                  short ASCII prefix non-ASCII  avgt   15   181.694 ±   5.890  ns/op
[java] UTF8ValidatorBench.testOldBimorphic                  short non-ASCII               avgt   15   211.181 ±   8.713  ns/op
[java] UTF8ValidatorBench.testNewBimorphic                  short non-ASCII               avgt   15   168.981 ±   2.655  ns/op
[java] UTF8ValidatorBench.testOldBimorphic                  long non-ASCII                avgt   15  3377.540 ± 275.862  ns/op
[java] UTF8ValidatorBench.testNewBimorphic                  long non-ASCII                avgt   15  2664.422 ±  32.996  ns/op
[java] UTF8ValidatorBench.testOldMonomorphicArray           short ASCII                   avgt   15    18.870 ±   2.499  ns/op
[java] UTF8ValidatorBench.testNewMonomorphicArray           short ASCII                   avgt   15     9.554 ±   0.106  ns/op
[java] UTF8ValidatorBench.testOldMonomorphicArray           long ASCII                    avgt   15   800.848 ±   9.572  ns/op
[java] UTF8ValidatorBench.testNewMonomorphicArray           long ASCII                    avgt   15   503.032 ±   2.471  ns/op
[java] UTF8ValidatorBench.testOldMonomorphicArray           short ASCII prefix non-ASCII  avgt   15   182.673 ±   1.329  ns/op
[java] UTF8ValidatorBench.testNewMonomorphicArray           short ASCII prefix non-ASCII  avgt   15   159.818 ±   9.852  ns/op
[java] UTF8ValidatorBench.testOldMonomorphicArray           short non-ASCII               avgt   15   142.486 ±  32.870  ns/op
[java] UTF8ValidatorBench.testNewMonomorphicArray           short non-ASCII               avgt   15   146.139 ±   7.165  ns/op
[java] UTF8ValidatorBench.testOldMonomorphicArray           long non-ASCII                avgt   15  2048.347 ±  33.241  ns/op
[java] UTF8ValidatorBench.testNewMonomorphicArray           long non-ASCII                avgt   15  2183.468 ± 254.497  ns/op
[java] UTF8ValidatorBench.testOldMonomorphicHeapByteBuffer  short ASCII                   avgt   15    28.627 ±   0.471  ns/op
[java] UTF8ValidatorBench.testNewMonomorphicHeapByteBuffer  short ASCII                   avgt   15    11.703 ±   0.140  ns/op
[java] UTF8ValidatorBench.testOldMonomorphicHeapByteBuffer  long ASCII                    avgt   15   885.982 ±   8.191  ns/op
[java] UTF8ValidatorBench.testNewMonomorphicHeapByteBuffer  long ASCII                    avgt   15   709.429 ±  10.381  ns/op
[java] UTF8ValidatorBench.testOldMonomorphicHeapByteBuffer  short ASCII prefix non-ASCII  avgt   15   251.002 ±   2.288  ns/op
[java] UTF8ValidatorBench.testNewMonomorphicHeapByteBuffer  short ASCII prefix non-ASCII  avgt   15   185.308 ±   2.353  ns/op
[java] UTF8ValidatorBench.testOldMonomorphicHeapByteBuffer  short non-ASCII               avgt   15   185.362 ±  11.047  ns/op
[java] UTF8ValidatorBench.testNewMonomorphicHeapByteBuffer  short non-ASCII               avgt   15   164.486 ±   5.407  ns/op
[java] UTF8ValidatorBench.testOldMonomorphicHeapByteBuffer  long non-ASCII                avgt   15  2439.308 ±  41.196  ns/op
[java] UTF8ValidatorBench.testNewMonomorphicHeapByteBuffer  long non-ASCII                avgt   15  3025.147 ± 210.834  ns/op
{code}
In most cases the new logic gives better results; for short ASCII strings the
improvement is about 2x-2.6x. For non-ASCII strings the results are almost the
same, except for a slight degradation for long non-ASCII in the
MonomorphicHeapByteBuffer case (2439.308 vs. 3025.147 ns/op). I cannot explain
it so far, because short non-ASCII shows no such degradation in the same case.
> Optimize UTF8Validator.validate for ASCII prefixed Strings
> ----------------------------------------------------------
>
> Key: CASSANDRA-21075
> URL: https://issues.apache.org/jira/browse/CASSANDRA-21075
> Project: Apache Cassandra
> Issue Type: Improvement
> Components: CQL/Interpreter
> Reporter: Dmitry Konstantinov
> Assignee: Dmitry Konstantinov
> Priority: Normal
> Fix For: 5.x
>
> Attachments: before_cpu.html
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> During a write we validate every string received from a client. The String
> (text) type is very popular, and frequently, although a column is declared as
> text, many of its values are actually ASCII or ASCII-prefixed. For example, a
> table with a partition key, a clustering key and 5 value columns needs 7
> validations per row, so a 10-row batch needs 70 validations. More complicated
> table structures with UDTs/collections are not rare, and in that case the
> number of string values to validate can be quite high. So even a small
> improvement here can be beneficial.
> In my batch write test, UTF8 validation contributes 2.1% of CPU:
> [^before_cpu.html]
> In UTF8Validator.validate we can apply the same optimization that Guava and
> the JDK use: a plain loop checks whether each byte is ASCII before falling
> back to the more complicated UTF-8 parsing:
> * [https://github.com/google/guava/blob/master/guava/src/com/google/common/base/Utf8.java#L123]
> {code:java}
> for (int i = off; i < end; i++) {
>     if (bytes[i] < 0) {
>         return isWellFormedSlowPath(bytes, i, end);
>     }
> }
> {code}
> * java.lang.StringCoding#decodeUTF8
> {code:java}
> // ascii-bais, which has a relative impact to the non-ascii-only bytes
> if (COMPACT_STRINGS && !hasNegatives(src, sp, len))
>     return resultCached().with(Arrays.copyOfRange(src, sp, sp + len), LATIN1);
> return decodeUTF8_0(src, sp, len, doReplace);
>
> // where:
> public static boolean hasNegatives(byte[] ba, int off, int len) {
>     for (int i = off; i < off + len; i++) {
>         if (ba[i] < 0) {
>             return true;
>         }
>     }
>     return false;
> }
> {code}
> See also:
> [https://lemire.me/blog/2018/10/16/validating-utf-8-bytes-java-edition/]
> Additionally, going through ValueAccessor is not free, and by avoiding it we
> can get an extra boost, especially in non-monomorphic cases.
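
As a rough illustration of the ASCII fast path quoted in the issue description,
here is a minimal standalone sketch in Java. It is not the CASSANDRA-21075 patch
and not Cassandra's UTF8Validator: the class name AsciiFastPathUtf8 and both
method names are hypothetical, it works on a plain byte[] rather than through
ValueAccessor, and the slow path follows the strict well-formedness table from
the Unicode standard, so it may reject byte sequences that Cassandra's own
validator accepts.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch: strict UTF-8 validation with an ASCII fast path. */
public final class AsciiFastPathUtf8 {
    private AsciiFastPathUtf8() {}

    /** Returns true if bytes[off, off + len) is well-formed UTF-8. */
    public static boolean isWellFormed(byte[] bytes, int off, int len) {
        int end = off + len;
        // Fast path: ASCII bytes are 0x00..0x7F, i.e. non-negative as Java bytes.
        for (int i = off; i < end; i++) {
            if (bytes[i] < 0) {
                // First non-ASCII byte: continue with the full check from here.
                return isWellFormedSlowPath(bytes, i, end);
            }
        }
        return true;
    }

    /** Full check; lead-byte and second-byte ranges follow the Unicode standard. */
    private static boolean isWellFormedSlowPath(byte[] bytes, int i, int end) {
        while (i < end) {
            int b0 = bytes[i++] & 0xFF;
            if (b0 < 0x80) {
                continue; // ASCII
            }
            int needed; // number of continuation bytes after the lead byte
            int lo;     // allowed range for the first continuation byte
            int hi;
            if (b0 >= 0xC2 && b0 <= 0xDF) {
                needed = 1; lo = 0x80; hi = 0xBF;
            } else if (b0 == 0xE0) {
                needed = 2; lo = 0xA0; hi = 0xBF; // excludes overlong forms
            } else if (b0 == 0xED) {
                needed = 2; lo = 0x80; hi = 0x9F; // excludes surrogates
            } else if (b0 >= 0xE1 && b0 <= 0xEF) {
                needed = 2; lo = 0x80; hi = 0xBF;
            } else if (b0 == 0xF0) {
                needed = 3; lo = 0x90; hi = 0xBF; // excludes overlong forms
            } else if (b0 >= 0xF1 && b0 <= 0xF3) {
                needed = 3; lo = 0x80; hi = 0xBF;
            } else if (b0 == 0xF4) {
                needed = 3; lo = 0x80; hi = 0x8F; // stays within U+10FFFF
            } else {
                return false; // 0x80..0xC1 and 0xF5..0xFF cannot start a sequence
            }
            if (end - i < needed) {
                return false; // truncated sequence
            }
            int b1 = bytes[i++] & 0xFF;
            if (b1 < lo || b1 > hi) {
                return false;
            }
            for (int k = 1; k < needed; k++) {
                if ((bytes[i++] & 0xC0) != 0x80) {
                    return false; // not a continuation byte
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] ascii = "plain ascii".getBytes(StandardCharsets.UTF_8);
        byte[] mixed = "ascii prefix, then \u00e9\u20ac".getBytes(StandardCharsets.UTF_8);
        System.out.println(isWellFormed(ascii, 0, ascii.length));
        System.out.println(isWellFormed(mixed, 0, mixed.length));
    }
}
```

For all-ASCII input the method returns after a single pass of sign checks; for
ASCII-prefixed input the slow path starts at the first non-ASCII byte instead of
at the beginning of the value.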