Re: [PR] [SPARK-47693][SQL] Add optimization for lowercase comparison of UTF8String used in UTF8_BINARY_LCASE collation [spark]

via GitHub Thu, 04 Apr 2024 02:26:05 -0700


bart-samwel commented on code in PR #45816:
URL: https://github.com/apache/spark/pull/45816#discussion_r1551312118



##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -447,6 +447,37 @@ private UTF8String toUpperCaseSlow() {
     return fromString(toString().toUpperCase());
   }
 
+  /**
+   * Optimized lowercase comparison for UTF8_BINARY_LCASE collation
+   */
+  public int compareLowercase(UTF8String other) {
+    int curr;
+    for (curr = 0; curr < numBytes && curr < other.numBytes; curr++) {
+      byte left = getByte(curr);
+      byte right = other.getByte(curr);
+      if (numBytesForFirstByte(left) != 1 || numBytesForFirstByte(right) != 1) {

Review Comment:
   This is expensive: you don't need the exact number of bytes, you only need 
to know whether it is more than 1. Per the UTF-8 spec, every byte of a 
multibyte character has the high bit set, and every single-byte (ASCII) 
character has the high bit unset, so you can just test for the high bit. 
Assuming that `toLowerCase` never maps an ASCII character to a non-ASCII one, 
this also gets rid of the next check in line 463, which is in essence also a 
test for the high bit. 
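
   For illustration, here is a minimal sketch of the high-bit test. It is not 
the PR's code: the class `HighBitCheck`, the helpers `isSingleByte` and 
`compareLowercaseSlow`, and the use of raw `byte[]` in place of `UTF8String` 
are all hypothetical, and the `String`-based fallback is only a simplified 
stand-in for whatever slow path the real method would use.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical stand-in for the relevant part of UTF8String; works on raw UTF-8 bytes.
final class HighBitCheck {

    // A byte is a complete single-byte (ASCII) character iff its high bit is unset.
    // Lead bytes and continuation bytes of multibyte characters all have it set.
    static boolean isSingleByte(byte b) {
        return (b & 0x80) == 0;
    }

    // Sketch of a lowercase comparison with an ASCII fast path.
    static int compareLowercase(byte[] left, byte[] right) {
        int n = Math.min(left.length, right.length);
        for (int i = 0; i < n; i++) {
            if (!isSingleByte(left[i]) || !isSingleByte(right[i])) {
                // Non-ASCII byte seen: give up on the fast path.
                return compareLowercaseSlow(left, right);
            }
            // Both bytes are ASCII, so lowercasing stays within ASCII.
            int diff = Character.toLowerCase((int) left[i])
                     - Character.toLowerCase((int) right[i]);
            if (diff != 0) {
                return diff;
            }
        }
        return left.length - right.length;
    }

    // Assumed fallback: decode, lowercase, compare as Java strings.
    static int compareLowercaseSlow(byte[] left, byte[] right) {
        String l = new String(left, StandardCharsets.UTF_8).toLowerCase(Locale.ROOT);
        String r = new String(right, StandardCharsets.UTF_8).toLowerCase(Locale.ROOT);
        return l.compareTo(r);
    }

    public static void main(String[] args) {
        // "a" followed by U+00E9, which encodes as the two bytes 0xC3 0xA9.
        byte[] s = "a\u00e9".getBytes(StandardCharsets.UTF_8);
        System.out.println(isSingleByte(s[0])); // true
        System.out.println(isSingleByte(s[1])); // false (lead byte)
        System.out.println(isSingleByte(s[2])); // false (continuation byte)
    }
}
```

   Since Java bytes are signed, `b >= 0` would be an equivalent test; the mask 
just makes the intent explicit.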




