It can also be done with repartition + sortWithinPartitions + mapPartitions. Perhaps not as convenient, but it does not rely on undocumented behavior. I used this approach in spark-sorted; see here: https://github.com/tresata/spark-sorted/blob/master/src/main/scala/com/tresata/spark/sorted/sql/GroupSortedDataset.scala
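For illustration, a minimal sketch of that pattern (the Event case class and the userId/time/value names are hypothetical, not taken from spark-sorted): repartition by the key column so all rows for a key land in the same partition, sortWithinPartitions by key plus a secondary ordering column, then make one pass in mapPartitions over each key's adjacent, sorted rows. Here the aggregate is order-dependent (latest value per key) to show why the sort guarantee matters:

import org.apache.spark.sql.Dataset

// Hypothetical record type for illustration.
case class Event(userId: String, time: Long, value: Double)

def latestValuePerUser(ds: Dataset[Event]): Dataset[(String, Double)] = {
  import ds.sparkSession.implicits._
  ds.repartition($"userId")                    // co-locate each key in one partition
    .sortWithinPartitions($"userId", $"time")  // sort by key, then time, within each partition
    .mapPartitions { iter =>
      val buffered = iter.buffered
      new Iterator[(String, Double)] {
        def hasNext: Boolean = buffered.hasNext
        def next(): (String, Double) = {
          // rows for the same key are adjacent and time-ordered, so the
          // last row of each run holds the latest value for that key
          val key = buffered.head.userId
          var last = buffered.next().value
          while (buffered.hasNext && buffered.head.userId == key) {
            last = buffered.next().value
          }
          (key, last)
        }
      }
    }
}

Unlike groupBy, this streams over each key's run without buffering a whole group in memory, which is the point of aggregating over pre-sorted data.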
On Wed, Dec 21, 2016 at 9:44 PM, Liang-Chi Hsieh <vii...@gmail.com> wrote:
> I agreed that to make sure this works, you might need to know the Spark
> internal implementation for APIs such as `groupBy`.
>
> But without any more changes to the current Spark implementation, I think
> this is the one possible way to achieve the required function to aggregate
> on sorted data per key.
>
> -----
> Liang-Chi Hsieh | @viirya
> Spark Technology Center
> http://www.spark.tc/