yes, it's less optimal because an abstraction is missing, and with mapPartitions it is done without optimizations. but Aggregator is not the right abstraction to begin with: it assumes a monoid (values can be merged in any order), which means no ordering guarantees. you need a fold operation.
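for example, a per-key fold over time-sorted data along those lines could look roughly like this (a minimal sketch only; Event, foldSortedPerKey and the running-state fold are made-up names for illustration, not code from spark-sorted):

import org.apache.spark.sql.Dataset

// hypothetical input type, just for illustration
case class Event(key: String, ts: Long, value: Double)

// repartition by key so all rows of a key land in one partition,
// sort within partitions by (key, ts), then walk each partition once
// and fold consecutive rows that share a key, in timestamp order.
def foldSortedPerKey(events: Dataset[Event]): Dataset[(String, Double)] = {
  import events.sparkSession.implicits._

  events
    .repartition($"key")
    .sortWithinPartitions($"key", $"ts")
    .mapPartitions { rows =>
      val buffered = rows.buffered
      new Iterator[(String, Double)] {
        def hasNext: Boolean = buffered.hasNext
        def next(): (String, Double) = {
          val key = buffered.head.key
          var acc = 0.0
          // a partition can hold many keys, so only consume the rows for this key
          while (buffered.hasNext && buffered.head.key == key)
            acc = acc * 0.9 + buffered.next().value  // stand-in for any order-dependent fold
          (key, acc)
        }
      }
    }
}

the buffered iterator is what handles the "more than one key per partition" point from the quoted mail: each call to next() consumes exactly one key's run of sorted rows.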
On Dec 22, 2016 02:20, "Liang-Chi Hsieh" <vii...@gmail.com> wrote:

> You can't use existing aggregation functions with that. Besides, the
> execution plan of `mapPartitions` doesn't support wholestage codegen.
> Without that and some optimization around aggregation, that might be
> possible performance degradation. Also when you have more than one keys
> in a partition, you will need to take care of that in your function
> applied to each partition.
>
> Koert Kuipers wrote
> > it can also be done with repartition + sortWithinPartitions +
> > mapPartitions. perhaps not as convenient but it does not rely on
> > undocumented behavior. i used this approach in spark-sorted. see here:
> > https://github.com/tresata/spark-sorted/blob/master/src/main/scala/com/tresata/spark/sorted/sql/GroupSortedDataset.scala
> >
> > On Wed, Dec 21, 2016 at 9:44 PM, Liang-Chi Hsieh <viirya@...> wrote:
> >
> >> I agreed that to make sure this work, you might need to know the Spark
> >> internal implementation for APIs such as `groupBy`.
> >>
> >> But without any more changes to current Spark implementation, I think
> >> this is the one possible way to achieve the required function to
> >> aggregate on sorted data per key.