Re: repartitionAndSortWithinPartitions HELP

2016-07-15 Thread Punit Naik
Okay, that clears my doubt! Thanks a lot. On 15-Jul-2016 7:43 PM, "Koert Kuipers" wrote: Spark's shuffle mechanism takes care of this kind of optimization internally when you use the sort-based shuffle (which is the default). On Thu, Jul 14, 2016 at 2:57 PM, Punit Naik wrote: > I meant to say

Re: repartitionAndSortWithinPartitions HELP

2016-07-15 Thread Koert Kuipers
Spark's shuffle mechanism takes care of this kind of optimization internally when you use the sort-based shuffle (which is the default). On Thu, Jul 14, 2016 at 2:57 PM, Punit Naik wrote: > I meant to say that first we can sort the individual partitions and then > sort them again by merging. Sor

Re: repartitionAndSortWithinPartitions HELP

2016-07-15 Thread Koert Kuipers
sortByKey has to use a range partitioner, which is a very particular partitioner, so you cannot supply your own. You should not have to shuffle twice to do a secondary sort. On Thu, Jul 14, 2016 at 2:22 PM, Punit Naik wrote: > Okay. Can't I supply the same partitioner I used for > "re

Re: repartitionAndSortWithinPartitions HELP

2016-07-14 Thread Punit Naik
I meant to say that first we can sort the individual partitions and then sort them again by merging. Sort of a divide and conquer mechanism. Does sortByKey take care of all this internally? On Fri, Jul 15, 2016 at 12:08 AM, Punit Naik wrote: > Can we increase the sorting speed of RDD by doing a
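
To make the "sort the pieces, then merge" idea in this message concrete, here is a minimal plain-Scala sketch. It is only an illustration of the divide-and-conquer pattern being asked about, not Spark's actual shuffle code; the object name `SortThenMergeSketch`, the function `sortThenMerge`, and the sample data are hypothetical.

```scala
import scala.collection.mutable

object SortThenMergeSketch {
  // Sort each chunk independently, then k-way merge the sorted chunks.
  def sortThenMerge(chunks: Seq[Seq[Int]]): Vector[Int] = {
    // Buffered iterators let us peek at the smallest remaining element of each chunk.
    val sortedChunks = chunks.map(_.sorted.iterator.buffered)

    // PriorityQueue is a max-heap, so the ordering is reversed to pop the smallest head first.
    val heap = mutable.PriorityQueue.empty[BufferedIterator[Int]](
      Ordering.by[BufferedIterator[Int], Int](_.head).reverse)
    sortedChunks.filter(_.hasNext).foreach(it => heap.enqueue(it))

    val out = Vector.newBuilder[Int]
    while (heap.nonEmpty) {
      val it = heap.dequeue()
      out += it.next()
      // Re-insert the chunk only if it still has elements left.
      if (it.hasNext) heap.enqueue(it)
    }
    out.result()
  }

  def main(args: Array[String]): Unit =
    println(sortThenMerge(Seq(Seq(5, 1, 4), Seq(2, 6), Seq(3))))
}
```

Each chunk is sorted on its own, and the merge step only ever compares the current heads of the chunks, which is what makes the second pass cheaper than a full re-sort.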

Re: repartitionAndSortWithinPartitions HELP

2016-07-14 Thread Punit Naik
Can we increase the sorting speed of RDD by doing a secondary sort first? On Thu, Jul 14, 2016 at 11:52 PM, Punit Naik wrote: > Okay. Can't I supply the same partitioner I used for > "repartitionAndSortWithinPartitions" as an argument to "sortByKey"? > > On 14-Jul-2016 11:38 PM, "Koert Kuipers"

Re: repartitionAndSortWithinPartitions HELP

2016-07-14 Thread Punit Naik
Okay. Can't I supply the same partitioner I used for "repartitionAndSortWithinPartitions" as an argument to "sortByKey"? On 14-Jul-2016 11:38 PM, "Koert Kuipers" wrote: > repartitionAndSortWithinPartitions partitions the rdd and sorts within > each partition. so each partition is fully sorted, b

Re: repartitionAndSortWithinPartitions HELP

2016-07-14 Thread Koert Kuipers
repartitionAndSortWithinPartitions partitions the RDD and sorts within each partition, so each partition is fully sorted but the RDD as a whole is not. sortByKey is basically the same as repartitionAndSortWithinPartitions, except that it uses a range partitioner so that the entire RDD is sorted. however si
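
The difference described here can be seen by inspecting the keys of each partition. The following is a rough sketch, assuming a local Spark installation on the classpath; the object name `PartitionSortSketch`, the helper `keysPerPartition`, the sample data, the choice of `HashPartitioner`, and the `local[2]` master are all assumptions made for illustration.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object PartitionSortSketch {
  // Returns the keys held by each partition, in partition order, for both approaches.
  def keysPerPartition(sc: SparkContext): (Seq[Seq[Int]], Seq[Seq[Int]]) = {
    val pairs = sc.parallelize(Seq(5 -> "e", 1 -> "a", 4 -> "d", 2 -> "b", 3 -> "c", 6 -> "f"))

    // Hash-partition into 2 partitions and sort by key inside each one.
    // glom() turns every partition into an array so its order can be inspected.
    val withinPartitions = pairs
      .repartitionAndSortWithinPartitions(new HashPartitioner(2))
      .glom().collect().map(_.map(_._1).toSeq).toSeq

    // sortByKey range-partitions the keys, so reading the partitions
    // one after another yields a globally sorted sequence.
    val globallySorted = pairs
      .sortByKey(ascending = true, numPartitions = 2)
      .glom().collect().map(_.map(_._1).toSeq).toSeq

    (withinPartitions, globallySorted)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("partition-sort-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val (within, global) = keysPerPartition(sc)
      println(s"repartitionAndSortWithinPartitions: $within")
      println(s"sortByKey: $global")
    } finally sc.stop()
  }
}
```

With a hash partitioner, each partition comes back sorted but the key ranges of the partitions interleave. With sortByKey, concatenating the partitions in order gives the fully sorted data.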

Re: repartitionAndSortWithinPartitions HELP

2016-07-14 Thread Punit Naik
Hi Koert, I have already used "repartitionAndSortWithinPartitions" for secondary sorting and it works fine. I just wanted to know whether it will sort the entire RDD or not. On Thu, Jul 14, 2016 at 11:25 PM, Koert Kuipers wrote: > repartitionAndSortWithinPartit sort by keys, not values per key, so

Re: repartitionAndSortWithinPartitions HELP

2016-07-14 Thread Koert Kuipers
repartitionAndSortWithinPartit sort by keys, not values per key, so not really secondary sort by itself. for secondary sort also check out: https://github.com/tresata/spark-sorted On Thu, Jul 14, 2016 at 1:09 PM, Punit Naik wrote: > Hi guys > > In my spark/scala code I am implementing secondar
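
Since the method sorts by key only, one common way to get a secondary sort out of it is to move the value into the key and partition on the original key alone. The sketch below illustrates that pattern under stated assumptions; it is not taken from the thread or from spark-sorted. The class `FirstElementPartitioner`, the object `SecondarySortSketch`, the helper `sortedWithinKeys`, and the sample data are hypothetical.

```scala
import org.apache.spark.{Partitioner, SparkConf, SparkContext}

// Hypothetical partitioner: routes a composite (key, value) pair by its first
// element only, so all records of one original key land in the same partition.
class FirstElementPartitioner(override val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int = key match {
    case (k, _) => Math.floorMod(k.hashCode, numPartitions)
    case other  => Math.floorMod(other.hashCode, numPartitions)
  }
}

object SecondarySortSketch {
  // Returns the records of each partition, sorted by key and then by value.
  def sortedWithinKeys(sc: SparkContext): Seq[Seq[(String, Int)]] = {
    val records = sc.parallelize(Seq("a" -> 3, "b" -> 2, "a" -> 1, "b" -> 5, "a" -> 2))
    records
      .map { case (k, v) => ((k, v), ()) } // move the value into the key
      // The implicit tuple Ordering sorts by k first, then by v.
      .repartitionAndSortWithinPartitions(new FirstElementPartitioner(2))
      .keys
      .glom().collect().map(_.toSeq).toSeq
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("secondary-sort-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try sortedWithinKeys(sc).foreach(println)
    finally sc.stop()
  }
}
```

This needs a single shuffle: the partitioner groups records by the original key, and the sort on the composite key orders the values within each group.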

repartitionAndSortWithinPartitions HELP

2016-07-14 Thread Punit Naik
Hi guys, in my Spark/Scala code I am implementing secondary sort. I wanted to know: when I call the "repartitionAndSortWithinPartitions" method, will the entire RDD be sorted, or only the individual partitions? If it's the latter case, will applying a "sortByKey" after "reparti