2010YOUY01 opened a new issue, #12306:
URL: https://github.com/apache/datafusion/issues/12306

   ### Is your feature request related to a problem or challenge?
   
   String operations on UTF-8 data are relatively expensive because UTF-8 is a 
variable-length encoding: each character is encoded with 1 to 4 bytes.
   
   For example, the in-memory representation of the UTF-8 string "Hello🌏世界" 
is (each x is 1 byte):
   ```
   [x][x][x][x][x][xxxx][xxx][xxx]
   ```
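
   A minimal Rust sketch of the layout above, using only the standard library. 
The helper name `utf8_widths` is hypothetical and exists only for illustration.

```rust
/// Returns how many bytes each character of `s` occupies in UTF-8.
/// (Hypothetical helper, not part of DataFusion.)
fn utf8_widths(s: &str) -> Vec<usize> {
    s.chars().map(char::len_utf8).collect()
}

/// Byte length is stored with the string, so this is O(1).
fn byte_len(s: &str) -> usize {
    s.len()
}

/// Character count has to walk the bytes to find character boundaries, so it is O(n).
fn char_len(s: &str) -> usize {
    s.chars().count()
}
```

   For "Hello🌏世界" the widths are 1, 1, 1, 1, 1, 4, 3, 3: 15 bytes for 8 
characters, so the byte length alone does not give the character count.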
   Some seemingly cheap operations like `substr(utf8_col, i, j)` and 
`character_length(utf8_col)` will actually decode the whole string instead of 
doing an O(1) operation. If we can assume a string column batch is ASCII-only, 
then those operations are indeed cheap.
   
   However:
   - Much real-world data is ASCII-encoded (the 1-byte subset of UTF-8), which 
covers the most common English characters, digits, etc.
   - Validating whether a string array is ASCII-only is fast
       - The validation loop is compiler/CPU friendly and can run at close to 
memory bandwidth
       - The check can be done per batch, once for each string array
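
   A rough sketch of such a per-batch check. `StringBatch` is a simplified, 
hypothetical stand-in for an Arrow string array (one contiguous values buffer 
plus offsets), not the real arrow-rs type. The loop body has no branches, which 
is what makes it easy for the compiler to vectorize; the standard library's 
`<[u8]>::is_ascii` gives the same answer.

```rust
/// Hypothetical, simplified model of a string array: all strings are
/// concatenated in `values`, and `offsets[i]..offsets[i + 1]` is string `i`.
struct StringBatch {
    values: Vec<u8>,
    offsets: Vec<usize>,
}

impl StringBatch {
    fn from_strs(strs: &[&str]) -> Self {
        let mut values = Vec::new();
        let mut offsets = vec![0];
        for s in strs {
            values.extend_from_slice(s.as_bytes());
            offsets.push(values.len());
        }
        StringBatch { values, offsets }
    }

    /// Number of strings in the batch.
    fn len(&self) -> usize {
        self.offsets.len() - 1
    }

    /// One pass over the contiguous buffer for the whole batch.
    /// OR all bytes together and test the high bit once at the end:
    /// any non-ASCII byte in UTF-8 has its high bit set.
    fn is_ascii(&self) -> bool {
        let acc = self.values.iter().fold(0u8, |acc, &b| acc | b);
        acc & 0x80 == 0
    }
}
```

   Checking the values buffer once per batch avoids a separate check per 
string.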
   
   So these functions can first run the check and, if the string array is 
ASCII-only, take a specialized path. The ASCII validation overhead should be 
outweighed by the performance gain in the common case.
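
   The check-then-dispatch idea could look roughly like this for 
`character_length()`. This is a sketch over a plain slice of `&str`, not the 
actual DataFusion kernel; the function name `character_length_batch` is made up 
for the example.

```rust
/// Character length of every string in a batch.
/// (Hypothetical function, operating on `&[&str]` instead of an Arrow array.)
fn character_length_batch(strings: &[&str]) -> Vec<usize> {
    // Validate the batch first.
    let all_ascii = strings.iter().all(|s| s.is_ascii());

    if all_ascii {
        // Fast path: every character is 1 byte, so the character count
        // equals the byte length, which is available in O(1).
        strings.iter().map(|s| s.len()).collect()
    } else {
        // General path: decode each string to count its characters.
        strings.iter().map(|s| s.chars().count()).collect()
    }
}
```

   Both paths return the same result for ASCII input, so the fast path is 
purely an optimization and falling back is always safe.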
   
   This should be a common trick; according to their papers, it has been 
implemented in 
[Velox](https://vldb.org/pvldb/vol15/p3372-pedreira.pdf) and 
[Photon](https://people.eecs.berkeley.edu/~matei/papers/2022/sigmod_photon.pdf).
   Below are the numbers from Velox:
   
![image](https://github.com/user-attachments/assets/080be523-bd5e-432f-a2d4-1fdd61ef5810)
   
   I did a quick experiment on the `character_length()` and `substr()` scalar 
functions and got some speedup for ASCII cases; the validation overhead is 
very small.
   `substr()` can get another 80% faster on top of 
https://github.com/apache/datafusion/pull/12044 for some microbenchmarks with 
string length 128B
   
   ### Describe the solution you'd like
   
   For scalar functions that can benefit from ASCII specialization, first 
validate inside the function implementation whether the string array is 
ASCII-only, and if so take the fast path.
   Functions that could be sped up: `character_length()`, `substr()`, `lower()`, 
`upper()`
   (And maybe some more, like regex functions; this needs further investigation)
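
   For `substr()`, the two paths might look like the sketch below. It is 
simplified on purpose: `substr_chars` is a hypothetical name, `start` is a 
0-based character index (SQL `substr` is 1-based), and the `is_ascii` flag is 
assumed to come from a validation pass done once per batch.

```rust
/// Returns up to `count` characters of `s`, starting at the 0-based
/// character index `start`. `is_ascii` must only be `true` if `s` has
/// already been validated as ASCII-only.
fn substr_chars(s: &str, start: usize, count: usize, is_ascii: bool) -> &str {
    if is_ascii {
        // Fast path: character index == byte index, so slice directly.
        let begin = start.min(s.len());
        let end = begin.saturating_add(count).min(s.len());
        &s[begin..end]
    } else {
        // General path: walk the character boundaries
        // (the byte offset of every character, then the end of the string).
        let mut bounds = s
            .char_indices()
            .map(|(i, _)| i)
            .chain(std::iter::once(s.len()));
        let begin = bounds.nth(start).unwrap_or(s.len());
        let end = if count == 0 {
            begin
        } else {
            // `bounds` now sits just past `begin`, so skipping `count - 1`
            // more boundaries lands on the one at `start + count`.
            bounds.nth(count - 1).unwrap_or(s.len())
        };
        &s[begin..end]
    }
}
```

   The fast path does a constant amount of work per string, while the general 
path scans up to `start + count` characters.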
   
   ### Describe alternatives you've considered
   
   Add an option to let users specify whether a column is fully ASCII.
   Since the always-validate approach is easier to use and not very expensive, 
we can leave this for the future.
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

