It seems that this aggregation is for Dataset operations only. I would have hoped to be able to do DataFrame aggregation as well, something along the lines of: sort_df(df).agg(my_agg_func)
In any case, note that this kind of sorting is less efficient than the sorting done in window functions, for example. Specifically, what happens here is that the data is first shuffled and then the entire partition is sorted. It is possible to do it another way (although I have no idea how to do it in Spark without writing a UDAF, which would probably be very inefficient). The other way would be to collect everything by key in each partition, sort within each key (which would be a lot faster since there are fewer elements) and then merge the results. I was hoping to find something like: Efficient sortByKey to work with…

From: Koert Kuipers [via Apache Spark Developers List] [mailto:ml-node+s1001551n20332...@n3.nabble.com]
Sent: Thursday, December 22, 2016 7:14 AM
To: Mendelson, Assaf
Subject: Re: Aggregating over sorted data

It can also be done with repartition + sortWithinPartitions + mapPartitions. Perhaps not as convenient, but it does not rely on undocumented behavior. I used this approach in spark-sorted. See here:
https://github.com/tresata/spark-sorted/blob/master/src/main/scala/com/tresata/spark/sorted/sql/GroupSortedDataset.scala

On Wed, Dec 21, 2016 at 9:44 PM, Liang-Chi Hsieh <[hidden email]> wrote:

I agree that to make sure this works, you might need to know the Spark internal implementation for APIs such as `groupBy`. But without any further changes to the current Spark implementation, I think this is one possible way to achieve the required function of aggregating over sorted data per key.

-----
Liang-Chi Hsieh | @viirya
Spark Technology Center
http://www.spark.tc/

--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Aggregating-over-sorted-data-tp19999p20331.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
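The repartition + sortWithinPartitions + mapPartitions approach Koert describes could be sketched roughly as follows. This is a minimal sketch, not the spark-sorted implementation: the Event case class, the field names, and the "value at the earliest timestamp per key" aggregation are all hypothetical, chosen only to show the single-pass fold over sorted partitions.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

case class Event(userId: String, timestamp: Long, value: Double)

object SortedAggSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("sorted-agg-sketch")
      .getOrCreate()
    import spark.implicits._

    val events: Dataset[Event] = Seq(
      Event("a", 2L, 9.0),
      Event("a", 1L, 2.0),
      Event("b", 1L, 3.0)
    ).toDS()

    // 1. repartition by key so all rows of a key land in one partition
    // 2. sort each partition by (key, timestamp)
    // 3. fold each partition in a single pass: rows arrive grouped by key
    //    and ordered by timestamp, so no per-key buffering is needed
    val earliestPerKey = events
      .repartition($"userId")
      .sortWithinPartitions($"userId", $"timestamp")
      .mapPartitions { rows =>
        val it = rows.buffered
        new Iterator[(String, Double)] {
          def hasNext: Boolean = it.hasNext
          def next(): (String, Double) = {
            val head = it.next()
            // skip this key's remaining (later) rows
            while (it.hasNext && it.head.userId == head.userId) it.next()
            (head.userId, head.value) // value at the earliest timestamp
          }
        }
      }

    earliestPerKey.show()
    spark.stop()
  }
}
```

Because the sort key is (userId, timestamp), each partition iterator yields all rows of a key consecutively and in timestamp order, so any order-sensitive aggregation (sessionization, first/last, deltas) can run in one streaming pass without holding an entire group in memory.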