GitHub user pepijnve closed a discussion: Optimising regex functions (and literal function arguments in general)
I'm very new to both Rust and the DataFusion codebase, so apologies up front if I'm misinterpreting the code.

If I understand it correctly, regexes are currently compiled at least once per batch and sometimes cached per batch (e.g. in `datafusion_functions::regex::regexpcount::compile_and_cache_regex`). The cache lookup seems to allocate a new `String` per hash table lookup. My intuition says this is probably not going to be very fast, and the simple experiments I ran today seem to confirm that.

If the regex argument is a string literal, it seems like it should in theory be feasible to optimize this by compiling the regex up front and reusing the compiled version for every invocation of the function. I spent some time today digging around the codebase trying to figure out how to implement this idea, but it wasn't immediately obvious to me whether it's actually possible within the constraints of the current APIs. The best option I came up with was to instantiate a curried version of the scalar function in the `simplify` implementation for the function in question. The part I couldn't quite figure out is where the compiled regex should be stored. Would a custom `PhysicalExpr` be the way to go?

Taking a step back, the more general question I had is whether the codebase already has some facility for doing initial precomputation work per individual call site of a function as a way to improve performance. Another instance of this I was looking at was string functions like `starts_with` and `ends_with` when invoked with a literal. Currently the code expands the scalar to an array per batch. Perhaps that array could be created once and then sliced to the appropriate length per invocation instead.

Does any of this make sense, or am I looking at this problem the wrong way?

GitHub link: https://github.com/apache/datafusion/discussions/13852
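To make the idea a bit more concrete, here is a rough, std-only sketch of what I have in mind. None of these names exist in DataFusion: `Compiled` is a hypothetical stand-in for a compiled regex (it just does substring counting so the example has no dependencies), `LiteralCallSite` is a made-up holder for per-call-site state, and `count_cached` is an illustrative fallback for non-literal patterns that looks the cache up by `&str` and only allocates a `String` key on a miss.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a compiled regex. A real version would wrap
/// something like a regex object; here "compiling" just stores the needle.
struct Compiled {
    needle: String,
}

impl Compiled {
    fn compile(pattern: &str) -> Compiled {
        Compiled { needle: pattern.to_owned() }
    }

    /// Counts non-overlapping occurrences of the needle in `haystack`.
    fn count(&self, haystack: &str) -> usize {
        haystack.matches(self.needle.as_str()).count()
    }
}

/// Hypothetical per-call-site state for a literal pattern argument:
/// the pattern is compiled once, when the call site is constructed.
struct LiteralCallSite {
    compiled: Compiled,
}

impl LiteralCallSite {
    fn new(pattern: &str) -> LiteralCallSite {
        LiteralCallSite { compiled: Compiled::compile(pattern) }
    }

    /// Evaluates one batch without any compilation or cache lookup.
    fn eval_batch(&self, values: &[&str]) -> Vec<usize> {
        values.iter().map(|v| self.compiled.count(v)).collect()
    }
}

/// Illustrative fallback for non-literal patterns: the lookup borrows the
/// key as `&str`, so a `String` is only allocated when the pattern is new.
fn count_cached(
    cache: &mut HashMap<String, Compiled>,
    pattern: &str,
    value: &str,
) -> usize {
    if !cache.contains_key(pattern) {
        cache.insert(pattern.to_owned(), Compiled::compile(pattern));
    }
    cache[pattern].count(value)
}
```

With something like this, a call site with a literal pattern would build one `LiteralCallSite` and call `eval_batch` for every incoming batch, while `count_cached` would remain for the case where the pattern comes from a column. Where such state would actually live in DataFusion is exactly the part I'm unsure about.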
