alamb commented on code in PR #12444:
URL: https://github.com/apache/datafusion/pull/12444#discussion_r1759454152
##########
datafusion/functions/src/unicode/substr.rs:
##########
@@ -197,6 +213,29 @@ fn string_view_substr(
let start_array = as_int64_array(&args[0])?;
+ // Notes for ASCII-only optimization:
+ //
+ // String characters are variable length encoded in UTF-8, `substr()`
function's
+ // arguments are character-based, converting them into byte-based indices
+ // requires expensive decoding.
+ // However, checking if a string is ASCII-only is relatively cheap.
+ // If strings are ASCII only, use byte-based indices instead.
+ //
+ // A common pattern to call `substr()` is taking a small prefix of a long
+ // string, such as `substr(long_str_with_1k_chars, 1, 32)`.
+ // In such case the overhead of ASCII-validation may not be worth it, so
+ // skip the validation for long strings for now.
+ // TODO: A better heuristic is to use the ratio to decide whether to
validate
+ // like `(start + count) / estimate_avg_strlen > threshold`, but it
requires
+ // specialized implementation for `ScalarValue` input.
+ let estimate_avg_strlen =
Review Comment:
Can we please pull this check into its own method (so it is easier to find),
perhaps something like
```rust
fn enable_ascii_fast_path(arr: &StringViewArray) -> bool {
...
}
```
##########
datafusion/functions/src/unicode/substr.rs:
##########
@@ -197,6 +213,29 @@ fn string_view_substr(
let start_array = as_int64_array(&args[0])?;
+ // Notes for ASCII-only optimization:
+ //
+ // String characters are variable length encoded in UTF-8, `substr()`
function's
+ // arguments are character-based, converting them into byte-based indices
+ // requires expensive decoding.
+ // However, checking if a string is ASCII-only is relatively cheap.
+ // If strings are ASCII only, use byte-based indices instead.
+ //
+ // A common pattern to call `substr()` is taking a small prefix of a long
+ // string, such as `substr(long_str_with_1k_chars, 1, 32)`.
+ // In such case the overhead of ASCII-validation may not be worth it, so
+ // skip the validation for long strings for now.
Review Comment:
I am not quite sure if it is the same question that @findepi is asking, but
I wonder if we could get back the performance loss by also using the
information on the # bytes are we requesting? Like if the prefix length is
less than 32 say, don't bother checking for ascii. 🤔
I thinking short prefixes are likely common (looking for `http://` as a url
prefix, for example). 🤔
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]