Dandandan commented on issue #15676: URL: https://github.com/apache/datafusion/issues/15676#issuecomment-2796042803
> Is it possible to maintain order in group case

The point I was trying to make is that `FIRST` and `LAST` are not deterministic, so users cannot rely on their behavior staying consistent in future versions. [Spark](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.first.html) says, for example: "The function is non-deterministic because its results depends on the order of the rows which may be non-deterministic after a shuffle".

The aggregation function doesn't give **any promise at all** that you'll actually get the first and last values of something. E.g. for `SELECT FIRST(x) GROUP BY t`, nothing in the query specifies that `t` should be ordered in some way or should maintain order. Depending on how `t` is scanned (it might be read sequentially from parquet, or in some order from a database...), the order in which the values are grouped, and how tasks are scheduled, nothing prevents it from returning the rows in a different order.

Based on the semantics of the function, the engine could convert it to a function like `ANY_VALUE` without breaking any contract.

IMO we should at least clearly document this behaviour, and/or maybe either deprecate these functions (users can include said functions into contrib themselves) or move them to a contrib crate.
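To make the ordering point concrete, here is a minimal sketch in plain Python (not DataFusion code). The row data, the `Row` alias, and the `first_per_group` helper are hypothetical and only mimic a "first value per group" aggregate that takes rows in whatever order they arrive.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical row shape: (group key, value).
Row = Tuple[str, int]


def first_per_group(rows: Iterable[Row]) -> Dict[str, int]:
    """Return the first value seen for each key, in arrival order."""
    result: Dict[str, int] = {}
    for key, value in rows:
        # setdefault keeps the value from the first row seen for this key
        # and ignores every later row with the same key.
        result.setdefault(key, value)
    return result


rows: List[Row] = [("a", 1), ("a", 2), ("b", 3), ("b", 4)]

# Same data, two arrival orders (e.g. a different scan or task schedule).
sequential = first_per_group(rows)
reordered = first_per_group(reversed(rows))

print(sequential)  # {'a': 1, 'b': 3}
print(reordered)   # {'b': 4, 'a': 2}
```

Both results are "first" values for their input order, and neither is more correct than the other, which is why treating the function as returning any value from the group would not break a contract.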