Hi,

An issue I have encountered frequently is the need to process data in sorted order per key. Classic MapReduce handles this naturally: the shuffle stage delivers each key's data sorted, so one can do a lot with that. It is of course relatively easy to achieve with the RDD API, but that means moving to RDDs and back, which carries a non-negligible performance penalty (on top of losing all Catalyst optimizations). SQL window functions are not an ideal solution either: beyond being much slower than aggregate functions (since they must produce a result for every record), they cannot be combined; i.e. two window functions over the same window still mean two separate scans.
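For concreteness, the two workarounds I mean look roughly like this (a sketch only; df, key, ts and value are made-up names):

  // Workaround 1: drop to the RDD API and sort each group in memory.
  // Pays the DataFrame<->RDD conversion cost and loses Catalyst;
  // groupByKey also materializes each whole group on one executor.
  val sortedGroups = df.rdd
    .map(r => (r.getAs[String]("key"), r))
    .groupByKey()
    .mapValues(_.toSeq.sortBy(_.getAs[Long]("ts")))

  // Workaround 2: window functions over the same window spec.
  import org.apache.spark.sql.expressions.Window
  import org.apache.spark.sql.functions._

  val w = Window.partitionBy("key").orderBy("ts")
  val withWindows = df
    .withColumn("running_sum", sum("value").over(w))  // one result per record
    .withColumn("idx", row_number().over(w))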
Ideally, I would have liked to see something like df.groupBy(col1).sortBy(col2).agg(...) and have the aggregations run on the sorted data. That would allow using both the existing aggregate functions and UDAFs that can assume the input order (and do any windowing they need inside the function itself, which is relatively easy in many cases).

I have searched for this and found many questions on the subject but no answers. I was hoping I had missed something (I have seen that the SQL CLUSTER BY command repartitions and sorts accordingly, but as far as I understand it does not guarantee that the ordering survives a subsequent groupBy). If I haven't, I believe this would be a worthwhile feature to add (I can open a JIRA if people think it is a good idea).

Assaf.
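P.S. To make the proposed API concrete, here is a hypothetical example (sortBy on grouped data does not exist today, and mySessionUDAF is an imaginary UDAF written assuming its input arrives ordered by ts):

  // Proposed API -- does NOT compile against current Spark:
  df.groupBy(col("key"))
    .sortBy(col("ts"))                       // new: order rows within each group
    .agg(
      collect_list(col("value")),            // built-ins would see sorted input
      mySessionUDAF(col("ts"), col("value")) // UDAF free to assume the ordering
    )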