bart-samwel commented on code in PR #45816:
URL: https://github.com/apache/spark/pull/45816#discussion_r1551312118
##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -447,6 +447,37 @@ private UTF8String toUpperCaseSlow() {
return fromString(toString().toUpperCase());
}
+ /**
+ * Optimized lowercase comparison for UTF8_BINARY_LCASE collation
+ */
+ public int compareLowercase(UTF8String other) {
+ int curr;
+ for (curr = 0; curr < numBytes && curr < other.numBytes; curr++) {
+ byte left = getByte(curr);
+ byte right = other.getByte(curr);
+ if (numBytesForFirstByte(left) != 1 || numBytesForFirstByte(right) != 1) {
Review Comment:
This is expensive: you don't need the number of bytes here, only whether it
is more than 1. If you look at the UTF-8 spec, every byte of a multibyte
character has the high bit set, and every single-byte (ASCII) character has
the high bit unset. So you could just test the high bit. Assuming that
`toLowerCase` never maps an ASCII character to a non-ASCII one, this also
gets rid of the next check in line 463, which is in essence also a test for
the high bit.
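
To make the suggestion concrete, here is a minimal sketch of the high-bit test. It is not the PR's code: it works on plain `byte[]` instead of `UTF8String`, and the class and method names (`HighBitCheck`, `isAscii`, `asciiToLower`) are hypothetical. The slow path goes through `String`, which is also only an assumption about how a fallback might look.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class HighBitCheck {
  // A UTF-8 byte is a complete single-byte (ASCII) character
  // exactly when its high bit is unset.
  public static boolean isAscii(byte b) {
    return (b & 0x80) == 0;
  }

  // Lowercases an ASCII byte; the caller must have checked isAscii(b).
  public static byte asciiToLower(byte b) {
    return (b >= 'A' && b <= 'Z') ? (byte) (b + ('a' - 'A')) : b;
  }

  // Case-insensitive comparison with an ASCII fast path (illustrative only).
  public static int compareLowercase(byte[] left, byte[] right) {
    int n = Math.min(left.length, right.length);
    for (int i = 0; i < n; i++) {
      byte l = left[i];
      byte r = right[i];
      // One test covers both sides: the high bit of (l | r) is set
      // if either byte belongs to a multibyte character.
      if (((l | r) & 0x80) != 0) {
        // Assumed slow path: decode, lowercase, and compare as Strings.
        String ls = new String(left, StandardCharsets.UTF_8).toLowerCase(Locale.ROOT);
        String rs = new String(right, StandardCharsets.UTF_8).toLowerCase(Locale.ROOT);
        return ls.compareTo(rs);
      }
      int diff = asciiToLower(l) - asciiToLower(r);
      if (diff != 0) {
        return diff;
      }
    }
    // All shared bytes matched; the shorter input sorts first.
    return left.length - right.length;
  }
}
```

Note that the String-based fallback compares UTF-16 code units, which can order some characters differently from a byte-wise comparison, so it only stands in for whatever slow path the real method uses.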
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]