Hi,
An issue I have encountered frequently is the need to process data per key in
sorted order.
A common way of doing this can be seen in classic MapReduce, where the shuffle
stage delivers each key's values in sorted order, and one can therefore do a
lot with that guarantee.
It is of course relatively easy to achieve this with the RDD API, but moving
to RDDs and back carries a significant performance penalty (beyond the fact
that we lose all Catalyst optimizations). A quick sketch of that round trip
is below.
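
For concreteness, something like this (untested, with made-up column names and
a placeholder "first value per key" computation standing in for the real
order-dependent logic):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("sorted-agg").getOrCreate()
import spark.implicits._

// Hypothetical input: (key, value) pairs.
val df = Seq(("a", 3), ("a", 1), ("b", 2)).toDF("key", "value")

val result = df
  .as[(String, Int)].rdd                 // leave the DataFrame world
  .groupByKey()                          // shuffle all values for each key
  .mapValues(_.toSeq.sorted)             // sort each key's values in memory
  .map { case (k, vs) => (k, vs.head) }  // order-dependent logic goes here
  .toDF("key", "first")                  // back to a DataFrame, Catalyst gone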
We can use SQL window functions, but that is not an ideal solution either.
Beyond the fact that window functions are much slower than aggregate functions
(since they must generate a result for every record rather than one per
group), we also can't combine them: if we have two window functions on the
same window, it is still two separate scans.
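
To illustrate (continuing with the df above): a running sum and a
previous-value lookup over the very same window each produce a value for every
input row, and as far as I can tell they are still evaluated separately:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val w = Window.partitionBy("key").orderBy("value")

// One value per input row, not per key, for each of the two functions.
val windowed = df
  .withColumn("running_sum", sum("value").over(w))
  .withColumn("prev_value", lag("value", 1).over(w))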

Ideally, I would have liked to see something like
df.groupBy(col1).sortBy(col2).agg(...), with the aggregations running over the
sorted data. That would make it possible to use both the existing aggregate
functions and UDAFs that can assume the order (and do any windowing we need as
part of the function itself, which is relatively easy in many cases).
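
The closest DataFrame-only approximation I know of is to collect each group
into an array of (sort-key, value) structs and sort that array; structs compare
field by field, so the array comes back ordered by col2. It works, but it
buffers each group in memory (column names are made up):

import org.apache.spark.sql.functions._

val approx = df
  .groupBy(col("col1"))
  .agg(sort_array(collect_list(struct(col("col2"), col("value")))).as("sorted"))
  // any order-dependent processing then walks the sorted array, e.g. in a UDF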

I have searched for this and seen many questions on the subject, but no
answers.

I was hoping I had missed something. I have seen that the SQL CLUSTER BY
command repartitions and sorts accordingly, but from my understanding it does
not guarantee that this ordering survives a subsequent groupBy (see the sketch
below). If I didn't miss anything, I believe this should be added as a feature
(I can open a JIRA if people think it is a good idea).
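
For reference, my understanding is that the DataFrame counterpart of that
CLUSTER BY behaviour looks like this (again with made-up column names), and
the open question is whether a groupBy on col1 afterwards keeps the
per-partition order:

import org.apache.spark.sql.functions.col

val clustered = df
  .repartition(col("col1"))            // like CLUSTER BY: hash-partition by key
  .sortWithinPartitions(col("col1"), col("col2"))  // sort within each partition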
Assaf.




