Right now my code does the following to group by sessionId (the key) and sort
by timestamp, which is the first value in the tuple. The second value in the
tuple is a JSON string.
def getGrpdAndSrtdSessions(rdd: RDD[(String, (Long, String))]): RDD[(String, List[(Long, String)])] = {
  val grpdSessions = rdd.groupByKey()
  val srtdSessions = grpdSessions.mapValues(iter => iter.toList.sortBy(_._1))
  srtdSessions
}
Based on the blog post, should it be something like the following to avoid a
shuffle?
1. Create a class that has sessionId and timeStamp as fields and use it as
the key.
2. The value will be my list of JSON strings, which is the second field in
the tuple.
3. Create a custom partitioner that chooses the partition based on sessionId
only.
4. Write an implicit Ordering for the key that orders by sessionId and then
by timeStamp.
5. And then call repartitionAndSortWithinPartitions.
In this scenario, does the code do the same thing as the groupByKey and
sortBy version above?
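To make sure I understand the steps, here is a rough sketch of what I have in mind (just a sketch, not tested against a cluster; SessionKey and SessionPartitioner are names I made up). In real code SessionPartitioner would extend org.apache.spark.Partitioner; I left it as a plain class here so the snippet stands alone. The important bit is that the partitioner uses only sessionId, while the Ordering uses (sessionId, timestamp), so each session's records end up contiguous and time-sorted within a partition:

```scala
// Composite key: partition on sessionId, sort on (sessionId, timestamp).
case class SessionKey(sessionId: String, timestamp: Long)

object SessionKey {
  // Orders first by sessionId, then by timestamp, so within a partition
  // each session's records come out contiguous and time-ordered.
  implicit val ordering: Ordering[SessionKey] =
    Ordering.by((k: SessionKey) => (k.sessionId, k.timestamp))
}

// Every record of a session must land in the same partition, so the
// partition is derived from sessionId alone, never from the timestamp.
// (In real Spark code this would extend org.apache.spark.Partitioner.)
class SessionPartitioner(val numPartitions: Int) {
  def getPartition(key: Any): Int = {
    val sessionId = key.asInstanceOf[SessionKey].sessionId
    // hashCode can be negative; force a non-negative partition index.
    ((sessionId.hashCode % numPartitions) + numPartitions) % numPartitions
  }
}

// Usage sketch (needs a SparkContext, so shown as comments):
//   val keyed = rdd.map { case (sid, (ts, json)) => (SessionKey(sid, ts), json) }
//   val sorted = keyed.repartitionAndSortWithinPartitions(new SessionPartitioner(numPartitions))
```

So the single repartitionAndSortWithinPartitions shuffle would replace both the groupByKey shuffle and the per-key sort.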
What about if I want to reduce the shuffling when I do a reduceByKey? Do I
just use a custom partitioner and then call reduceByKey? Does using a custom
partitioner before a reduceByKey improve performance?
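My understanding so far (please correct me if wrong): reduceByKey already combines values per partition before the shuffle ("map-side combine"), so only one record per key per partition crosses the network; a custom partitioner mainly pays off when the RDD is already partitioned with that same partitioner, in which case the shuffle can be skipped entirely. A pure-Scala simulation of the map-side combine part (the partition contents and the myPartitioner name are hypothetical):

```scala
// Simulates Spark's map-side combine: within one partition, values for
// the same key are reduced before anything is shuffled, so each
// partition ships at most one record per key to the reducers.
def mapSideCombine[K, V](partition: Seq[(K, V)], reduce: (V, V) => V): Map[K, V] =
  partition.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(reduce) }

// Two hypothetical partitions of (sessionId, count) records:
val p1 = Seq(("s1", 1), ("s1", 1), ("s2", 1))
val p2 = Seq(("s1", 1), ("s2", 1))

// mapSideCombine(p1, _ + _) == Map("s1" -> 2, "s2" -> 1)
// so p1 sends 2 records across the wire instead of 3.

// In Spark itself (sketch, needs a SparkContext and a Partitioner):
//   val reduced = rdd.reduceByKey(myPartitioner, _ + _)
// and if rdd was already partitioned with myPartitioner (e.g. via
// rdd.partitionBy(myPartitioner)), reduceByKey avoids the shuffle.
```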
On Mon, Oct 26, 2015 at 2:51 AM, Adrian Tanase <[email protected]> wrote:
> Do you have a particular concern? You’re always using a partitioner
> (default is HashPartitioner) and the Partitioner interface is pretty light,
> can’t see how it could affect performance.
>
> Used correctly it should improve performance as you can better control
> placement of data and avoid shuffling…
>
> -adrian
>
> From: swetha kasireddy
> Date: Monday, October 26, 2015 at 6:56 AM
> To: Adrian Tanase
> Cc: Bill Bejeck, "[email protected]"
> Subject: Re: Secondary Sorting in Spark
>
> Hi,
>
> Does the use of custom partitioner in Streaming affect performance?
>
> On Mon, Oct 5, 2015 at 1:06 PM, Adrian Tanase <[email protected]> wrote:
>
>> Great article, especially the use of a custom partitioner.
>>
>> Also, sorting by multiple fields by creating a tuple out of them is an
>> awesome, easy to miss, Scala feature.
>>
>> Sent from my iPhone
>>
>> On 04 Oct 2015, at 21:41, Bill Bejeck <[email protected]> wrote:
>>
>> I've written blog post on secondary sorting in Spark and I'd thought I'd
>> share it with the group
>>
>> http://codingjunkie.net/spark-secondary-sort/
>>
>> Thanks,
>> Bill
>>
>>
>