Re: [PR] Support WITHIN GROUP syntax to standardize certain existing aggregate functions [datafusion]

via GitHub Tue, 04 Mar 2025 12:32:31 -0800


vbarua commented on code in PR #13511:
URL: https://github.com/apache/datafusion/pull/13511#discussion_r1979981659



##########
datafusion/expr/src/expr.rs:
##########
@@ -295,6 +295,8 @@ pub enum Expr {
     /// See also [`ExprFunctionExt`] to set these fields.
     ///
     /// [`ExprFunctionExt`]: crate::expr_fn::ExprFunctionExt
+    ///
+    /// cf. `WITHIN GROUP` is converted to `ORDER BY` internally in 
`datafusion/sql/src/expr/function.rs`

Review Comment:
   minor/opinionated: I'm not sure if it's worth mentioning this at all here. 
`WITHIN GROUP` is effectively an `ORDER BY` specified differently. This only 
matters at the SQL layer, and you handle and explain it there already.



##########
datafusion/sql/src/expr/function.rs:
##########
@@ -349,15 +365,49 @@ impl<S: ContextProvider> SqlToRel<'_, S> {
         } else {
             // User defined aggregate functions (UDAF) have precedence in case 
it has the same name as a scalar built-in function
             if let Some(fm) = self.context_provider.get_aggregate_meta(&name) {
-                let order_by = self.order_by_to_sort_expr(
-                    order_by,
-                    schema,
-                    planner_context,
-                    true,
-                    None,
-                )?;
-                let order_by = (!order_by.is_empty()).then_some(order_by);
-                let args = self.function_args_to_expr(args, schema, 
planner_context)?;
+                if fm.is_ordered_set_aggregate() && within_group.is_empty() {
+                    return plan_err!("WITHIN GROUP clause is required when 
calling ordered set aggregate function({})", fm.name());
+                }
+
+                if null_treatment.is_some() && 
!fm.supports_null_handling_clause() {
+                    return plan_err!(
+                        "[IGNORE | RESPECT] NULLS are not permitted for {}",
+                        fm.name()
+                    );
+                }

Review Comment:
   Per my point about `[IGNORE | RESPECT] NULLS` being a property of _window 
functions_, I don't think we need this check here.



##########
datafusion/sql/src/expr/function.rs:
##########
@@ -177,8 +180,14 @@ impl FunctionArgs {
             }
         }
 
-        if !within_group.is_empty() {
-            return not_impl_err!("WITHIN GROUP is not supported yet: 
{within_group:?}");
+        if !within_group.is_empty() && order_by.is_some() {
+            return plan_err!("ORDER BY clause is only permitted in WITHIN 
GROUP clause when a WITHIN GROUP is used");
+        }
+
+        if within_group.len() > 1 {
+            return not_impl_err!(
+                "Multiple column ordering in WITHIN GROUP clause is not 
supported"

Review Comment:
   Minor wording suggestion
   ```
   Only a single ordering expression is permitted in a WITHIN GROUP clause
   ```
   which explicitly points users to what they should do, instead of telling 
them what they can't.
   



##########
datafusion/functions-aggregate/src/approx_percentile_cont.rs:
##########
@@ -51,29 +52,39 @@ create_func!(ApproxPercentileCont, 
approx_percentile_cont_udaf);
 
 /// Computes the approximate percentile continuous of a set of numbers
 pub fn approx_percentile_cont(
-    expression: Expr,
+    within_group: Sort,

Review Comment:
   minor/opinionated: I think `order_by` would be a clearer name for this, as 
the `WITHIN GROUP` is really just a wrapper around the `ORDER BY` clause.



##########
datafusion/sql/src/expr/function.rs:
##########
@@ -177,8 +180,14 @@ impl FunctionArgs {
             }
         }
 
-        if !within_group.is_empty() {
-            return not_impl_err!("WITHIN GROUP is not supported yet: 
{within_group:?}");
+        if !within_group.is_empty() && order_by.is_some() {
+            return plan_err!("ORDER BY clause is only permitted in WITHIN 
GROUP clause when a WITHIN GROUP is used");

Review Comment:
   minor: I just noticed that in the block above there is a check for duplicate 
order bys. I think it would be good to fold this into that check
   ```rust
                   FunctionArgumentClause::OrderBy(oby) => {
                       if order_by.is_some() { // can check for within group 
here
                           return not_impl_err!("Calling {name}: Duplicated 
ORDER BY clause in function arguments");
                       }
                       order_by = Some(oby);
                   }
   ```
   to consolidate the handling into one place. 
   



##########
datafusion/functions-aggregate/src/approx_percentile_cont.rs:
##########
@@ -131,6 +147,19 @@ impl ApproxPercentileCont {
         args: AccumulatorArgs,
     ) -> Result<ApproxPercentileAccumulator> {
         let percentile = validate_input_percentile_expr(&args.exprs[1])?;
+
+        let is_descending = args
+            .ordering_req
+            .first()
+            .map(|sort_expr| sort_expr.options.descending)
+            .unwrap_or(false);
+
+        let percentile = if is_descending {
+            1.0 - percentile
+        } else {
+            percentile
+        };

Review Comment:
   This seems reasonable to me, but I don't have that much experience on the 
execution side of things.



##########
datafusion/expr/src/udaf.rs:
##########
@@ -845,6 +861,19 @@ pub trait AggregateUDFImpl: Debug + Send + Sync {
         ScalarValue::try_from(data_type)
     }
 
+    /// If this function supports `[IGNORE NULLS | RESPECT NULLS]` clause, 
return true
+    /// If the function does not, return false
+    /// Otherwise return None (the default)
+    fn supports_null_handling_clause(&self) -> Option<bool> {
+        None
+    }

Review Comment:
   This was smelling odd, so I dug a bit deeper. I think you've inadvertantly 
stumbled into something even weirder than you anticipated
   
   The example you've linked is
   ```sql
   SELECT FIRST_VALUE(column1) RESPECT NULLS FROM t;
   ```
   which I don't think is a valid query because `first_value` _should not_ be 
an aggregate function, or at the very least the above query is not valid in 
most SQL dialects. `first_value` is actually a window function in other engines 
(eg. [Trino](https://trino.io/docs/current/functions/window.html#first_value), 
[Postgres](https://www.postgresql.org/docs/17/functions-window.html), 
[MySQL](https://dev.mysql.com/doc/refman/8.4/en/window-function-descriptions.html#function_first-value)).
   
   If you try running something like
   ```sql
   SELECT first_value(column1) FROM t;
   ```
   against Postgres you get an error like
   ```
   Query Error: window function first_value requires an OVER clause
   ```
   [dbfiddle](https://www.db-fiddle.com/f/oMU1DDB2915nrHLKqUVj9a/0)
   
   The `RESPECT NULLS | IGNORE NULLS` options is only a property of certain 
_window_ functions, hence we shouldn't need to track it for aggregate functions.
   
   I'm going to file a ticket for the above.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] Support WITHIN GROUP syntax to standardize certain existing aggregate functions [datafusion]

Reply via email to