Re: [I] Regression in `last_value` functionality [datafusion]

via GitHub Fri, 11 Apr 2025 00:08:30 -0700


Dandandan commented on issue #15676:
URL: https://github.com/apache/datafusion/issues/15676#issuecomment-2796042803


   > Is it possible to maintain order in group case
   
   The point I was trying to make is that `FIRST` and `LAST` are not deterministic, 
so users cannot rely on the behavior staying consistent in future versions.
   
   
[Spark](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.first.html)
says, for example: "The function is non-deterministic because its results 
depends on the order of the rows which may be non-deterministic after a 
shuffle."
   
   The aggregation function gives **no promise at all** that you will 
actually get the first or last value of anything. E.g. for 
`SELECT FIRST(x) GROUP BY t`, nothing in the query specifies that `t` 
should be ordered in any particular way or that its order should be maintained.
   Depending on how `t` is scanned (it might be read sequentially from Parquet 
or in some other order from a database), the order in which the values are 
grouped, and how tasks are scheduled, nothing prevents the engine from 
returning the rows in a different order.
   Based on the semantics of the function, the engine could convert it to a 
function like `ANY_VALUE` without breaking any contract.
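   
   To illustrate the point, here is a minimal Python sketch (not DataFusion 
code). It assumes a hypothetical table of `(t, x)` rows split into two 
partitions, and a "first value per group" aggregation that keeps whichever 
value it sees first. The names `first_per_group`, `part_a`, and `part_b` are 
made up for this example.

```python
def first_per_group(partitions):
    """Keep the first value seen for each key, scanning partitions in the given order."""
    result = {}
    for partition in partitions:
        for key, value in partition:
            # setdefault only stores the value if the key has not been seen yet
            result.setdefault(key, value)
    return result


# Hypothetical partitions of the same table; each row is (t, x).
part_a = [("g1", 1), ("g2", 10)]
part_b = [("g1", 2), ("g2", 20)]

# Same data, same "query", but the partitions are processed in a different order.
run_1 = first_per_group([part_a, part_b])
run_2 = first_per_group([part_b, part_a])

print(run_1)
print(run_2)
```

   Both runs return a value that belongs to each group, so both would be 
acceptable answers for an `ANY_VALUE`-style function, yet they disagree with 
each other because nothing fixed the processing order.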
   
   IMO we should at least clearly document this behaviour, and/or maybe either 
deprecate these functions (users can provide such functions themselves) or 
move them to a contrib crate.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

