2010YOUY01 opened a new issue, #12306:
URL: https://github.com/apache/datafusion/issues/12306
### Is your feature request related to a problem or challenge?
String operations on UTF-8 data are relatively expensive because UTF-8 is a
variable-length encoding: each character is encoded with 1 to 4 bytes.
For example, the in-memory representation of the UTF-8 string "Hello🌏世界" is
(one x per byte):
```
[x][x][x][x][x][xxxx][xxx][xxx]
```
Some seemingly cheap operations like `substr(utf8_col, i, j)` and
`character_length(utf8_col)` actually have to decode the whole string instead
of doing an O(1) operation. If we can assume that a string column batch is
ASCII-only, those operations become cheap.
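To make the cost difference concrete, here is a minimal sketch in plain Rust (standard library only, not DataFusion code). The helper names `byte_len` and `char_len` are made up for illustration.

```rust
/// Byte length: O(1), read directly from the string's stored length.
fn byte_len(s: &str) -> usize {
    s.len()
}

/// Character count: O(n), because every byte has to be inspected to find
/// the character boundaries of a variable-length UTF-8 string.
fn char_len(s: &str) -> usize {
    s.chars().count()
}
```

For "Hello🌏世界" the two functions disagree (15 bytes vs. 8 characters), while for an ASCII-only string such as "Hello" they always return the same value. That equality is what an ASCII fast path relies on.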
However:
- Much real-world data is ASCII-only (the 1-byte subset of UTF-8), which
covers the most common English characters, digits, etc.
- Validating whether a string array is ASCII-only is fast
- The validation implementation is compiler/CPU friendly and can run at close
to memory bandwidth
- The check can be done in batch, once per string array
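A rough sketch of what the batch check above could look like. `StringBatch` is a hypothetical stand-in for an Arrow-style string array (one contiguous values buffer plus offsets), not the actual arrow-rs type. It ignores nulls and sliced arrays, which a real implementation would need to handle.

```rust
/// Hypothetical stand-in for an Arrow-style string array: all rows are
/// stored back to back in `values`, and `offsets[i]..offsets[i + 1]` is
/// the byte range of row `i`.
struct StringBatch {
    offsets: Vec<usize>,
    values: Vec<u8>,
}

impl StringBatch {
    fn from_strs(rows: &[&str]) -> Self {
        let mut offsets = vec![0];
        let mut values = Vec::new();
        for row in rows {
            values.extend_from_slice(row.as_bytes());
            offsets.push(values.len());
        }
        StringBatch { offsets, values }
    }

    /// Returns row `i` as a string slice.
    fn value(&self, i: usize) -> &str {
        let bytes = &self.values[self.offsets[i]..self.offsets[i + 1]];
        std::str::from_utf8(bytes).expect("rows were built from valid UTF-8")
    }

    /// One linear scan over the contiguous buffer validates the whole
    /// batch at once; no per-row work is needed.
    fn is_ascii(&self) -> bool {
        self.values.is_ascii()
    }
}
```

Scanning the values buffer as a whole, rather than row by row, keeps the check a simple loop over bytes, which is the kind of code the bullets above describe as compiler/CPU friendly.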
So these functions can run the check first and, if the string array is
ASCII-only, take a specialized path. In the general case, the performance gain
should outweigh the ASCII validation overhead.
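As an illustration of the check-then-dispatch idea, here is a minimal sketch for `character_length()`. It models a batch as a plain slice of `&str`, which is an assumption made to keep the example self-contained; it is not how DataFusion represents string arrays.

```rust
/// Sketch of `character_length` over a batch, with an ASCII fast path.
fn character_length(batch: &[&str]) -> Vec<usize> {
    // Validation pass: a cheap byte scan over every value in the batch.
    let ascii_only = batch.iter().all(|s| s.is_ascii());

    if ascii_only {
        // Fast path: one byte per character, so the byte length is
        // already the character count.
        batch.iter().map(|s| s.len()).collect()
    } else {
        // General path: decode each string to count its characters.
        batch.iter().map(|s| s.chars().count()).collect()
    }
}
```

The decision is made once per batch rather than once per row, so the inner loops of both paths stay branch-free.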
This appears to be a common trick: it has been implemented in
[Velox](https://vldb.org/pvldb/vol15/p3372-pedreira.pdf) and
[Photon](https://people.eecs.berkeley.edu/~matei/papers/2022/sigmod_photon.pdf),
as their papers mention.
[Figure: numbers from Velox]

I did a quick experiment on the `character_length()` and `substr()` scalar
functions and got some speedup in the ASCII cases; the validation overhead is
very small.
On top of https://github.com/apache/datafusion/pull/12044, `substr()` can get
another 80% faster in some microbenchmarks with a string length of 128 B.
### Describe the solution you'd like
For scalar functions that can benefit from ASCII specialization, first
validate inside the function implementation whether the string array is
ASCII-only, and if so, take the fast path.
Functions that could be sped up: `character_length()`, `substr()`, `lower()`,
`upper()`
(And maybe some more, such as the regex functions; this needs further
investigation)
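A hedged sketch of what the specialized paths for `substr()` and `upper()` might look like. For simplicity it uses a 0-based character index and a length, which differs from SQL's 1-based `substr` semantics. All function names here are hypothetical and not taken from the DataFusion codebase.

```rust
/// Byte offset of the character at `char_idx`, or `s.len()` if out of range.
/// This walks the string from the start, so it is O(n).
fn byte_offset(s: &str, char_idx: usize) -> usize {
    s.char_indices().nth(char_idx).map_or(s.len(), |(i, _)| i)
}

/// General path: `start` and `len` are measured in characters, so both
/// boundaries must be located by decoding the string.
fn substr_utf8(s: &str, start: usize, len: usize) -> &str {
    let begin = byte_offset(s, start);
    let end = byte_offset(s, start.saturating_add(len));
    &s[begin..end]
}

/// ASCII fast path: character positions are byte positions, so the
/// substring is a direct slice with no decoding.
fn substr_ascii(s: &str, start: usize, len: usize) -> &str {
    debug_assert!(s.is_ascii());
    let begin = start.min(s.len());
    let end = start.saturating_add(len).min(s.len());
    &s[begin..end]
}

/// ASCII fast path for `upper()`: a per-byte case conversion, with no
/// Unicode case-mapping tables involved.
fn upper_ascii(s: &str) -> String {
    debug_assert!(s.is_ascii());
    s.to_ascii_uppercase()
}
```

On ASCII-only input the fast paths return the same results as the general ones, so a caller that has already validated the batch can switch between them freely.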
### Describe alternatives you've considered
Add an option that lets users specify whether a column is fully ASCII.
Since the always-validate approach is easier to use and not very expensive,
we can leave this for the future.
### Additional context
_No response_