On Wed, Jan 31, 2018 at 1:15 AM, Ruifeng Zheng <ruife...@foxmail.com> wrote:
> HI all:
>
>
>
>        1, Dataset API supports operation “sortWithinPartitions”, but in RDD
> API there is no counterpart (I know there is
> “repartitionAndSortWithinPartitions”, but I don’t want to repartition the
> RDD), I have to convert RDD to Dataset for this function. Would it make
> sense to add a “sortWithinPartitions” for RDD?


Are you concerned about a shuffle here ?
If yes, if you make the partitioners same, you can avoid that and get
sort within partition.




>
>
>
>        2, In “aggregateByKey”/”reduceByKey”, I want to do some special
> operation (like aggregator compression) after local aggregation on each
> partitions. A similar case may be: compute ‘ApproximatePercentile’ for
> different keys by ”reduceByKey”, it may be helpful if
> ‘QuantileSummaries#compress’ is called before network communication. So I
> wonder if it is useful to add a ‘aggregateWithinPartitions’ for RDD?
>


You could override serde of your 'C' to accomplish this ?
The tricky part would be merging in case of spills and differentiating
between map side and reduce side serialization.


Regards,
Mridul


>
>
> Regards,
>
> Ruifeng
>
>
>
>
>
>
>
>

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org

Reply via email to