You can't use existing aggregation functions with that approach. Besides, the execution plan of `mapPartitions` doesn't support whole-stage codegen. Without that and some of the optimizations around aggregation, there could be performance degradation. Also, when a partition holds more than one key, the function you apply to each partition needs to detect the key boundaries itself.
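To make that last point concrete, here is a minimal sketch of the kind of per-partition function you would need: it folds an iterator of (key, value) pairs that is already sorted by key and emits one aggregate per key. The helper name `aggregateSorted` and the tuple shape are illustrative assumptions, not an existing Spark API:

    // Hypothetical helper: fold an iterator of (key, value) pairs that is
    // already sorted by key, emitting one aggregated row per key. Because a
    // partition can hold many keys, the function must track key boundaries
    // itself; whole-stage codegen does not apply to user code like this.
    def aggregateSorted[K, V, A](
        rows: Iterator[(K, V)],
        zero: => A)(merge: (A, V) => A): Iterator[(K, A)] =
      new Iterator[(K, A)] {
        private val buffered = rows.buffered
        def hasNext: Boolean = buffered.hasNext
        def next(): (K, A) = {
          val key = buffered.head._1
          var acc = zero
          // consume all consecutive rows sharing the current key
          while (buffered.hasNext && buffered.head._1 == key)
            acc = merge(acc, buffered.next()._2)
          (key, acc)
        }
      }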
Koert Kuipers wrote:
> it can also be done with repartition + sortWithinPartitions +
> mapPartitions.
> perhaps not as convenient but it does not rely on undocumented behavior.
> i used this approach in spark-sorted. see here:
> https://github.com/tresata/spark-sorted/blob/master/src/main/scala/com/tresata/spark/sorted/sql/GroupSortedDataset.scala
>
> On Wed, Dec 21, 2016 at 9:44 PM, Liang-Chi Hsieh <viirya@> wrote:
>
>> I agreed that to make sure this works, you might need to know the
>> Spark internal implementation for APIs such as `groupBy`.
>>
>> But without any more changes to the current Spark implementation,
>> I think this is one possible way to achieve the required function
>> of aggregating on sorted data per key.
>>
>> -----
>> Liang-Chi Hsieh | @viirya
>> Spark Technology Center
>> http://www.spark.tc/

-----
Liang-Chi Hsieh | @viirya
Spark Technology Center
http://www.spark.tc/
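For reference, a minimal sketch of the repartition + sortWithinPartitions + mapPartitions pattern Koert describes, wired up against the `aggregateSorted` helper sketched above. The dataset contents and column references are assumptions for illustration; this is plain Spark SQL, not the spark-sorted API itself:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("sorted-agg").getOrCreate()
    import spark.implicits._

    // illustrative input: (key, value) pairs
    val events = Seq(("a", 1L), ("a", 2L), ("b", 5L)).toDS()

    val perKeySums = events
      .repartition($"_1")                 // co-locate all rows of a key in one partition
      .sortWithinPartitions($"_1", $"_2") // each partition is now sorted by key, then value
      .mapPartitions(rows => aggregateSorted(rows, 0L)(_ + _))

    perKeySums.show() // one (key, sum) row per key

Note that repartition by column plus sortWithinPartitions is documented behavior, which is the appeal of this pattern over relying on the ordering that `groupBy` happens to preserve internally.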