Omega359 opened a new issue, #10744:
URL: https://github.com/apache/datafusion/issues/10744
### Is your feature request related to a problem or challenge?
During work on adding a 'safe' mode to to_timestamp and to_date functions
I've come across an issue that I would like feedback on before proceeding.
### The feature
Currently for timestamp and date parsing if a source string cannot be parsed
using any of the provided chrono formats datafusion will return an error. This
is normal for a database-type solution however it is not ideal for a system
that is parsing billions of dates in batches - some of which are human entered.
Systems such as Spark default to a null value for anything that cannot be
parsed and this feature enables a mode ('safe' to mirror the name and same
behaviour as CastOptions) that allows the to_timestamp* and to_date UDFs to
have the same behaviour (return null on error).
### The problem
Since UDF functions have no context provided and there isn't a way I know of
to statically get access to config to add the avove mentioned functionality I
resorted to using a new constructor function to allow the UDF to switch
behaviour:
```
#[derive(Debug)]
pub struct ToTimestampFunc {
signature: Signature,
/// how to handle cast or parsing failures, either return NULL
(safe=true) or return ERR (safe=false)
pub safe: bool,
}
impl ToTimestampFunc {
pub fn new() -> Self {
Self {
signature: Signature::variadic_any(Volatility::Immutable),
safe: false,
}
}
pub fn new_with_safe(safe: bool) -> Self {
Self {
signature: Signature::variadic_any(Volatility::Immutable),
safe,
}
}
}
```
To use the alternative 'safe' mode for these functions is as simple as
```
let to_timestamp = ToTimestampFunc::new_with_safe(true);
session_context.register_udf(ScalarUDF::new_from_impl(to_timestamp));
```
Unfortunately this only affect sql queries - any calls to the
[to_timestamp(args:
Vec<Expr>)](https://github.com/apache/datafusion/blob/09dde27be39ad054f85dfb5c37b7468a3f68d652/datafusion/functions/src/datetime/mod.rs#L257)
function will not use the new definition as registered in the function
registry. This is because that function and every other function like it use a
static singleton instance that only uses a ::new() call to initial it and there
is no way that I can see to replace that instance.
### Describe the solution you'd like
I see a few possible solutions to this:
- Acknowledge the difference in behaviour in documentation and recommend
using `ctx.udf("to_timestamp").unwrap().call(args)` instead of the
to_timestamp() function anytime 'safe' mode is required. This is less than
ideal imho as it can lead to confusion and unintuitive behavior.
- Provide a mechanism (function argument to invoke perhaps) to provide a
SessionContext to all UDF's so that they can adjust behaviour based on
configuration at call time.
- Update the to_timestamp() function and all similar functions to have
alternatives that do not use the statically defined version of the UDF but
rather create it on-demand or use the definition saved in the session state. As
examples:
```
// the existing function
pub fn to_timestamp(args: Vec<Expr>) -> Expr {
super::to_timestamp().call(args)
}
pub fn to_timestamp_safe(args: Vec<Expr>) -> Expr {
ScalarUDF::new_from_impl(ToTimestampFunc::new_with_safe(true)).call(args)
}
// or
pub fn to_timestamp_from_context(ctx &Arc<SessionContext>, args:
Vec<Expr>) -> Expr {
ctx.udf("to_timestamp").unwrap().call(args)
}
// varargs would have been a really nice solution as well but 🤷
```
- something else I haven't thought of.
Any opinions, suggestions and critiques would be welcome.
### Describe alternatives you've considered
_No response_
### Additional context
_No response_
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]