Re: [PR] Specialize ASCII case for substr() [datafusion]

via GitHub Fri, 13 Sep 2024 13:25:31 -0700


alamb commented on code in PR #12444:
URL: https://github.com/apache/datafusion/pull/12444#discussion_r1759454152



##########
datafusion/functions/src/unicode/substr.rs:
##########
@@ -197,6 +213,29 @@ fn string_view_substr(
 
     let start_array = as_int64_array(&args[0])?;
 
+    // Notes for ASCII-only optimization:
+    //
+    // String characters are variable length encoded in UTF-8, `substr()` 
function's
+    // arguments are character-based, converting them into byte-based indices
+    // requires expensive decoding.
+    // However, checking if a string is ASCII-only is relatively cheap.
+    // If strings are ASCII only, use byte-based indices instead.
+    //
+    // A common pattern to call `substr()` is taking a small prefix of a long
+    // string, such as `substr(long_str_with_1k_chars, 1, 32)`.
+    // In such case the overhead of ASCII-validation may not be worth it, so
+    // skip the validation for long strings for now.
+    // TODO: A better heuristic is to use the ratio to decide whether to 
validate
+    // like `(start + count) / estimate_avg_strlen > threshold`, but it 
requires
+    // specialized implementation for `ScalarValue` input.
+    let estimate_avg_strlen =

Review Comment:
   Can we please pull this check into its own method (so it is easier to find), 
perhaps something like
   
   ```rust
   fn enable_ascii_fast_path(arr: &StringViewArray) -> bool { 
   ...
   }
   ```
   



##########
datafusion/functions/src/unicode/substr.rs:
##########
@@ -197,6 +213,29 @@ fn string_view_substr(
 
     let start_array = as_int64_array(&args[0])?;
 
+    // Notes for ASCII-only optimization:
+    //
+    // String characters are variable length encoded in UTF-8, `substr()` 
function's
+    // arguments are character-based, converting them into byte-based indices
+    // requires expensive decoding.
+    // However, checking if a string is ASCII-only is relatively cheap.
+    // If strings are ASCII only, use byte-based indices instead.
+    //
+    // A common pattern to call `substr()` is taking a small prefix of a long
+    // string, such as `substr(long_str_with_1k_chars, 1, 32)`.
+    // In such case the overhead of ASCII-validation may not be worth it, so
+    // skip the validation for long strings for now.

Review Comment:
   I am not quite sure if it is the same question that @findepi  is asking, but 
I wonder if we could get back the performance loss by also using the 
information on the # bytes are we requesting?  Like if the prefix length is 
less than 32 say, don't bother checking for ascii. 🤔 
   
   I thinking short prefixes are likely common (looking for `http://` as a url 
prefix, for example). 🤔 
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Specialize ASCII case for substr() [datafusion]

Reply via email to