You can't use existing aggregation functions with that approach. Besides, the execution plan of `mapPartitions` doesn't support whole-stage codegen. Without that and some of the optimizations around aggregation, there could be performance degradation. Also, when a partition holds more than one key, the function you apply to each partition needs to detect the key boundaries itself.
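To make that last point concrete, here is a minimal sketch of the kind of per-partition function you would need: it folds an iterator of (key, value) pairs that is already sorted by key and emits one aggregate per key. The helper name `aggregateSorted` and the tuple shape are illustrative assumptions, not an existing Spark API:

    // Hypothetical helper: fold an iterator of (key, value) pairs that is
    // already sorted by key, emitting one aggregated row per key. Because a
    // partition can hold many keys, the function must track key boundaries
    // itself; whole-stage codegen does not apply to user code like this.
    def aggregateSorted[K, V, A](
        rows: Iterator[(K, V)],
        zero: => A)(merge: (A, V) => A): Iterator[(K, A)] =
      new Iterator[(K, A)] {
        private val buffered = rows.buffered
        def hasNext: Boolean = buffered.hasNext
        def next(): (K, A) = {
          val key = buffered.head._1
          var acc = zero
          // consume all consecutive rows sharing the current key
          while (buffered.hasNext && buffered.head._1 == key)
            acc = merge(acc, buffered.next()._2)
          (key, acc)
        }
      }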
Koert Kuipers wrote:
> it can also be done with repartition + sortWithinPartitions +
> mapPartitions.
> perhaps not as convenient but it does not rely on undocumented behavior.
> i used this approach in spark-sorted. see here:
> https://github.com/tresata/spark-sorted/blob/master/src/main/scala/com/tresata/spark/sorted/sql/GroupSortedDataset.scala
>
> On Wed, Dec 21, 2016 at 9:44 PM, Liang-Chi Hsieh <viirya@> wrote:
>
>> I agreed that to make sure this works, you might need to know the
>> Spark internal implementation for APIs such as `groupBy`.
>>
>> But without any more changes to the current Spark implementation,
>> I think this is one possible way to achieve the required function
>> of aggregating on sorted data per key.
>>
>> -----
>> Liang-Chi Hsieh | @viirya
>> Spark Technology Center
>> http://www.spark.tc/

-----
Liang-Chi Hsieh | @viirya
Spark Technology Center
http://www.spark.tc/
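For reference, a minimal sketch of the repartition + sortWithinPartitions + mapPartitions pattern Koert describes, wired up against the `aggregateSorted` helper sketched above. The dataset contents and column references are assumptions for illustration; this is plain Spark SQL, not the spark-sorted API itself:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("sorted-agg").getOrCreate()
    import spark.implicits._

    // illustrative input: (key, value) pairs
    val events = Seq(("a", 1L), ("a", 2L), ("b", 5L)).toDS()

    val perKeySums = events
      .repartition($"_1")                 // co-locate all rows of a key in one partition
      .sortWithinPartitions($"_1", $"_2") // each partition is now sorted by key, then value
      .mapPartitions(rows => aggregateSorted(rows, 0L)(_ + _))

    perKeySums.show() // one (key, sum) row per key

Note that repartition by column plus sortWithinPartitions is documented behavior, which is the appeal of this pattern over relying on the ordering that `groupBy` happens to preserve internally.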