GitHub user pepijnve closed a discussion: Optimising regex functions (and 
literal function arguments in general)

I'm very new to both Rust and the DataFusion codebase, so apologies up front if 
I'm misinterpreting the code.

If I understand it correctly, regexes are currently compiled at least once per 
batch and sometimes cached per batch (e.g. in 
`datafusion_functions::regex::regexpcount::compile_and_cache_regex`). The cache 
lookup seems to allocate a new `String` per hash table lookup. My 
intuition says this is probably not going to be very fast, and the simple 
experiments I ran today seem to confirm that.
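
To illustrate the allocation point, here is a minimal sketch of a cache lookup that only allocates on a miss. It is not the DataFusion code: `Compiled`, `compile` and `get_or_compile` are hypothetical names, and `Compiled` is a stand-in for a real compiled regex so that the example needs nothing beyond the standard library.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a compiled regex (a real version would hold
/// something like `regex::Regex`).
#[derive(Debug)]
struct Compiled {
    pattern: String,
}

/// Pretends to compile a pattern and counts how often that happens, so the
/// effect of the cache is observable.
fn compile(pattern: &str, compile_count: &mut usize) -> Compiled {
    *compile_count += 1;
    Compiled { pattern: pattern.to_owned() }
}

/// Returns the cached entry for `pattern`, compiling it on first use.
fn get_or_compile<'a>(
    cache: &'a mut HashMap<String, Compiled>,
    pattern: &str,
    compile_count: &mut usize,
) -> &'a Compiled {
    // `contains_key` accepts a `&str` because `String: Borrow<str>`,
    // so a cache hit does not allocate a new `String`.
    if !cache.contains_key(pattern) {
        // Only a miss pays for the owned key.
        let compiled = compile(pattern, compile_count);
        cache.insert(pattern.to_owned(), compiled);
    }
    &cache[pattern]
}
```

A hit costs two hash lookups here, but no allocation. A key that combines the pattern with flags would need a bit more care to look up by reference.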

If the regex argument is a string literal, it seems like it should in theory be 
feasible to optimize this by compiling the regex up front and reusing the 
compiled version for every invocation of the function. I spent some time today 
digging around the codebase trying to figure out how to implement this idea, 
but it wasn't immediately obvious to me if it's actually possible within the 
constraints of the current APIs.

The best option I came up with was to instantiate a curried version of the 
scalar function in the `simplify` implementation for the function in question. 
The part I couldn't quite figure out is what the appropriate place would be to 
store the compiled regex. Would a custom `PhysicalExpr` be the way to go?
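
Roughly, the shape I have in mind looks like the sketch below. It is deliberately detached from the actual DataFusion traits: `Matcher` and `CountLiteral` are hypothetical names, and a plain substring counter stands in for regex matching so the example needs no external crates. The point is only that the literal is bound once at construction and every batch reuses it.

```rust
/// Hypothetical stand-in for a matcher that is expensive to build,
/// such as a compiled regex.
struct Matcher {
    needle: String,
}

impl Matcher {
    /// The "expensive" step that should run once per call site.
    fn compile(pattern: &str) -> Self {
        Matcher { needle: pattern.to_owned() }
    }

    /// Counts non-overlapping occurrences of the needle in `haystack`.
    fn count(&self, haystack: &str) -> usize {
        haystack.matches(self.needle.as_str()).count()
    }
}

/// A "curried" function: the literal argument is already bound, so only
/// the per-batch values remain as input.
struct CountLiteral {
    matcher: Matcher,
}

impl CountLiteral {
    /// Would be called once, when the literal argument is detected.
    fn new(literal_pattern: &str) -> Self {
        CountLiteral { matcher: Matcher::compile(literal_pattern) }
    }

    /// Called once per batch; no compilation and no cache lookup.
    fn evaluate(&self, batch: &[&str]) -> Vec<usize> {
        batch.iter().map(|value| self.matcher.count(value)).collect()
    }
}
```

Whatever holds the equivalent of `CountLiteral` would have to live as long as the physical plan, which is why I suspect the expression itself is the natural owner.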

Taking a step back, the more general question I had is if there is already some 
facility present in the codebase that supports performing some initial 
precomputation work per individual call site of a function as a way to improve 
performance.
The other instance of this I was looking at was string functions like 
`starts_with` and `ends_with` when invoked with a literal. Currently the code 
expands the scalar to an array per batch. Perhaps that array could be created 
once and then sliced to the appropriate length per invocation instead.
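
As a sketch of the "expand once, slice per batch" idea, using plain `Vec` and slices instead of Arrow arrays: `ExpandedLiteral` and `starts_with_batch` are hypothetical names, and I assume the Arrow equivalent would be a zero-copy slice of a pre-built array rather than a `&[String]`.

```rust
/// Hypothetical holder for a literal that has been expanded ahead of time.
struct ExpandedLiteral {
    literal: String,
    values: Vec<String>,
}

impl ExpandedLiteral {
    /// Expands the literal once, up front, to an initial capacity.
    fn new(literal: &str, capacity: usize) -> Self {
        ExpandedLiteral {
            literal: literal.to_owned(),
            values: vec![literal.to_owned(); capacity],
        }
    }

    /// Returns a view of `len` copies of the literal. The backing buffer
    /// only grows when a batch is larger than any batch seen so far.
    fn for_batch(&mut self, len: usize) -> &[String] {
        if len > self.values.len() {
            self.values.resize(len, self.literal.clone());
        }
        &self.values[..len]
    }
}

/// Element-wise `starts_with` over two equally long inputs.
fn starts_with_batch(values: &[&str], prefixes: &[String]) -> Vec<bool> {
    values
        .iter()
        .zip(prefixes)
        .map(|(value, prefix)| value.starts_with(prefix.as_str()))
        .collect()
}
```

Of course a kernel that takes the scalar directly would avoid the expansion altogether, so this only makes sense where the array form is required.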

Does any of this make sense or am I looking at this problem the wrong way?

GitHub link: https://github.com/apache/datafusion/discussions/13852
