vbarua commented on code in PR #13511: URL: https://github.com/apache/datafusion/pull/13511#discussion_r1979981659
########## datafusion/expr/src/expr.rs: ########## @@ -295,6 +295,8 @@ pub enum Expr { /// See also [`ExprFunctionExt`] to set these fields. /// /// [`ExprFunctionExt`]: crate::expr_fn::ExprFunctionExt + /// + /// cf. `WITHIN GROUP` is converted to `ORDER BY` internally in `datafusion/sql/src/expr/function.rs` Review Comment: minor/opinionated: I'm not sure if it's worth mentioning this at all here. `WITHIN GROUP` is effectively an `ORDER BY` specified differently. This only matters at the SQL layer, and you handle and explain it there already. ########## datafusion/sql/src/expr/function.rs: ########## @@ -349,15 +365,49 @@ impl<S: ContextProvider> SqlToRel<'_, S> { } else { // User defined aggregate functions (UDAF) have precedence in case it has the same name as a scalar built-in function if let Some(fm) = self.context_provider.get_aggregate_meta(&name) { - let order_by = self.order_by_to_sort_expr( - order_by, - schema, - planner_context, - true, - None, - )?; - let order_by = (!order_by.is_empty()).then_some(order_by); - let args = self.function_args_to_expr(args, schema, planner_context)?; + if fm.is_ordered_set_aggregate() && within_group.is_empty() { + return plan_err!("WITHIN GROUP clause is required when calling ordered set aggregate function({})", fm.name()); + } + + if null_treatment.is_some() && !fm.supports_null_handling_clause() { + return plan_err!( + "[IGNORE | RESPECT] NULLS are not permitted for {}", + fm.name() + ); + } Review Comment: Per my point about `[IGNORE | RESPECT] NULLS` being a property of _window functions_, I don't think we need this check here. ########## datafusion/sql/src/expr/function.rs: ########## @@ -177,8 +180,14 @@ impl FunctionArgs { } } - if !within_group.is_empty() { - return not_impl_err!("WITHIN GROUP is not supported yet: {within_group:?}"); + if !within_group.is_empty() && order_by.is_some() { + return plan_err!("ORDER BY clause is only permitted in WITHIN GROUP clause when a WITHIN GROUP is used"); + } + + if within_group.len() > 1 { + return not_impl_err!( + "Multiple column ordering in WITHIN GROUP clause is not supported" Review Comment: Minor wording suggestion ``` Only a single ordering expression is permitted in a WITHIN GROUP clause ``` which explicitly points users to what they should do, instead of telling them what they can't. ########## datafusion/functions-aggregate/src/approx_percentile_cont.rs: ########## @@ -51,29 +52,39 @@ create_func!(ApproxPercentileCont, approx_percentile_cont_udaf); /// Computes the approximate percentile continuous of a set of numbers pub fn approx_percentile_cont( - expression: Expr, + within_group: Sort, Review Comment: minor/opinionated: I think `order_by` would be a clearer name for this, as the `WITHIN GROUP` is really just a wrapper around the `ORDER BY` clause. ########## datafusion/sql/src/expr/function.rs: ########## @@ -177,8 +180,14 @@ impl FunctionArgs { } } - if !within_group.is_empty() { - return not_impl_err!("WITHIN GROUP is not supported yet: {within_group:?}"); + if !within_group.is_empty() && order_by.is_some() { + return plan_err!("ORDER BY clause is only permitted in WITHIN GROUP clause when a WITHIN GROUP is used"); Review Comment: minor: I just noticed that in the block above there is a check for duplicate order bys. I think it would be good to fold this into that check ```rust FunctionArgumentClause::OrderBy(oby) => { if order_by.is_some() { // can check for within group here return not_impl_err!("Calling {name}: Duplicated ORDER BY clause in function arguments"); } order_by = Some(oby); } ``` to consolidate the handling into one place. ########## datafusion/functions-aggregate/src/approx_percentile_cont.rs: ########## @@ -131,6 +147,19 @@ impl ApproxPercentileCont { args: AccumulatorArgs, ) -> Result<ApproxPercentileAccumulator> { let percentile = validate_input_percentile_expr(&args.exprs[1])?; + + let is_descending = args + .ordering_req + .first() + .map(|sort_expr| sort_expr.options.descending) + .unwrap_or(false); + + let percentile = if is_descending { + 1.0 - percentile + } else { + percentile + }; Review Comment: This seems reasonable to me, but I don't have that much experience on the execution side of things. ########## datafusion/expr/src/udaf.rs: ########## @@ -845,6 +861,19 @@ pub trait AggregateUDFImpl: Debug + Send + Sync { ScalarValue::try_from(data_type) } + /// If this function supports `[IGNORE NULLS | RESPECT NULLS]` clause, return true + /// If the function does not, return false + /// Otherwise return None (the default) + fn supports_null_handling_clause(&self) -> Option<bool> { + None + } Review Comment: This was smelling odd, so I dug a bit deeper. I think you've inadvertantly stumbled into something even weirder than you anticipated The example you've linked is ```sql SELECT FIRST_VALUE(column1) RESPECT NULLS FROM t; ``` which I don't think is a valid query because `first_value` _should not_ be an aggregate function, or at the very least the above query is not valid in most SQL dialects. `first_value` is actually a window function in other engines (eg. [Trino](https://trino.io/docs/current/functions/window.html#first_value), [Postgres](https://www.postgresql.org/docs/17/functions-window.html), [MySQL](https://dev.mysql.com/doc/refman/8.4/en/window-function-descriptions.html#function_first-value)). If you try running something like ```sql SELECT first_value(column1) FROM t; ``` against Postgres you get an error like ``` Query Error: window function first_value requires an OVER clause ``` [dbfiddle](https://www.db-fiddle.com/f/oMU1DDB2915nrHLKqUVj9a/0) The `RESPECT NULLS | IGNORE NULLS` options is only a property of certain _window_ functions, hence we shouldn't need to track it for aggregate functions. I'm going to file a ticket for the above. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org