yes, it's less optimal because an abstraction is missing, and with mapPartitions it is done without optimizations. but Aggregator is not the right abstraction to begin with: it assumes a monoid (values can be merged in any order), which means no ordering guarantees. you need a fold operation.
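for example, a per-key fold over time-sorted data along those lines could look roughly like this (a minimal sketch only; Event, foldSortedPerKey and the running-state fold are made-up names for illustration, not code from spark-sorted):

import org.apache.spark.sql.Dataset

// hypothetical input type, just for illustration
case class Event(key: String, ts: Long, value: Double)

// repartition by key so all rows of a key land in one partition,
// sort within partitions by (key, ts), then walk each partition once
// and fold consecutive rows that share a key, in timestamp order.
def foldSortedPerKey(events: Dataset[Event]): Dataset[(String, Double)] = {
  import events.sparkSession.implicits._

  events
    .repartition($"key")
    .sortWithinPartitions($"key", $"ts")
    .mapPartitions { rows =>
      val buffered = rows.buffered
      new Iterator[(String, Double)] {
        def hasNext: Boolean = buffered.hasNext
        def next(): (String, Double) = {
          val key = buffered.head.key
          var acc = 0.0
          // a partition can hold many keys, so only consume the rows for this key
          while (buffered.hasNext && buffered.head.key == key)
            acc = acc * 0.9 + buffered.next().value  // stand-in for any order-dependent fold
          (key, acc)
        }
      }
    }
}

the buffered iterator is what handles the "more than one key per partition" point from the quoted mail: each call to next() consumes exactly one key's run of sorted rows.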
On Dec 22, 2016 02:20, "Liang-Chi Hsieh" <vii...@gmail.com> wrote:

> You can't use existing aggregation functions with that. Besides, the
> execution plan of `mapPartitions` doesn't support wholestage codegen.
> Without that and some optimization around aggregation, that might be
> possible performance degradation. Also when you have more than one keys
> in a partition, you will need to take care of that in your function
> applied to each partition.
>
> Koert Kuipers wrote
> > it can also be done with repartition + sortWithinPartitions +
> > mapPartitions. perhaps not as convenient but it does not rely on
> > undocumented behavior. i used this approach in spark-sorted. see here:
> > https://github.com/tresata/spark-sorted/blob/master/src/main/scala/com/tresata/spark/sorted/sql/GroupSortedDataset.scala
> >
> > On Wed, Dec 21, 2016 at 9:44 PM, Liang-Chi Hsieh <viirya@...> wrote:
> >
> >> I agreed that to make sure this work, you might need to know the Spark
> >> internal implementation for APIs such as `groupBy`.
> >>
> >> But without any more changes to current Spark implementation, I think
> >> this is the one possible way to achieve the required function to
> >> aggregate on sorted data per key.