Okay that clears my doubt! Thanks a lot.
On 15-Jul-2016 7:43 PM, "Koert Kuipers" wrote:
spark's shuffle mechanism takes care of this kind of optimization
internally when you use the sort-based shuffle (which is the default).
On Thu, Jul 14, 2016 at 2:57 PM, Punit Naik wrote:
> I meant to say that first we can sort the individual partitions and then
> sort them again by merging.
spark's shuffle mechanism takes care of this kind of optimization
internally when you use the sort-based shuffle (which is the default).
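
The "sort the pieces, then merge them" idea from the question can be modelled in plain Scala. This is only a sketch of the general pattern, not Spark's shuffle code; `SortThenMerge`, `mergeSortedRuns` and the sample data are made-up names for illustration.

```scala
import scala.collection.mutable

object SortThenMerge {
  /** Merge already-sorted runs into one sorted sequence (k-way merge with a min-heap). */
  def mergeSortedRuns(runs: Seq[Seq[Int]]): Vector[Int] = {
    val iters = runs.map(_.iterator).toVector
    // Heap entries are (value, index of the run it came from).
    // PriorityQueue is a max-heap, so the ordering is reversed to dequeue the smallest value.
    val heap = mutable.PriorityQueue.empty[(Int, Int)](
      Ordering.by[(Int, Int), Int](_._1).reverse)
    for ((it, i) <- iters.zipWithIndex if it.hasNext) heap.enqueue((it.next(), i))

    val out = Vector.newBuilder[Int]
    while (heap.nonEmpty) {
      val (value, i) = heap.dequeue()
      out += value
      // Refill the heap from the run that just gave up its head element.
      if (iters(i).hasNext) heap.enqueue((iters(i).next(), i))
    }
    out.result()
  }

  def main(args: Array[String]): Unit = {
    val chunks = Seq(Seq(9, 3, 7), Seq(4, 8, 1), Seq(6, 2, 5))
    val sortedRuns = chunks.map(_.sorted)   // step 1: sort each piece on its own
    println(mergeSortedRuns(sortedRuns))    // step 2: merge the sorted pieces
  }
}
```

The point of the reply above is that you do not need to write this yourself for an RDD: the sorting and merging happen inside the shuffle.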
On Thu, Jul 14, 2016 at 2:57 PM, Punit Naik wrote:
> I meant to say that first we can sort the individual partitions and then
> sort them again by merging. Sort of a divide and conquer mechanism.
sortByKey needs to use a range partitioner, a very particular partitioner,
so you cannot supply your own partitioner.
you should not have to shuffle twice to do a secondary sort algo
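
sortByKey does not accept a partitioner, but repartitionAndSortWithinPartitions does, so a range partitioner can be passed to it explicitly. Below is a minimal sketch of that, assuming spark-core is on the classpath and a local master; `RangeSortSketch`, `sortWithRangePartitioner` and the sample data are hypothetical names. It should give the same key order as sortByKey with a single shuffle, though I have not benchmarked it.

```scala
import org.apache.spark.{RangePartitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object RangeSortSketch {
  /** Sort the whole RDD by key in one shuffle by supplying a RangePartitioner explicitly. */
  def sortWithRangePartitioner(pairs: RDD[(Int, String)], numPartitions: Int): RDD[(Int, String)] = {
    // RangePartitioner samples the keys to choose contiguous, ordered key ranges,
    // so partition 0 holds the smallest keys, partition 1 the next range, and so on.
    val partitioner = new RangePartitioner(numPartitions, pairs)
    pairs.repartitionAndSortWithinPartitions(partitioner)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("range-sort-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val pairs = sc.parallelize(Seq(5 -> "e", 1 -> "a", 4 -> "d", 2 -> "b", 3 -> "c", 6 -> "f"))
      val viaRange = sortWithRangePartitioner(pairs, 2).collect().toList
      val viaSortByKey = pairs.sortByKey(ascending = true, numPartitions = 2).collect().toList
      // With unique keys, both lists are expected to be in the same order.
      println(viaRange == viaSortByKey)
    } finally sc.stop()
  }
}
```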
On Thu, Jul 14, 2016 at 2:22 PM, Punit Naik wrote:
> Okay. Can't I supply the same partitioner I used for
> "repartitionAndSortWithinPartitions" as an argument to "sortByKey"?
I meant to say that first we can sort the individual partitions and then
sort them again by merging. Sort of a divide and conquer mechanism.
Does sortByKey take care of all this internally?
On Fri, Jul 15, 2016 at 12:08 AM, Punit Naik wrote:
> Can we increase the sorting speed of RDD by doing a secondary sort first?
Can we increase the sorting speed of RDD by doing a secondary sort first?
On Thu, Jul 14, 2016 at 11:52 PM, Punit Naik wrote:
> Okay. Can't I supply the same partitioner I used for
> "repartitionAndSortWithinPartitions" as an argument to "sortByKey"?
>
> On 14-Jul-2016 11:38 PM, "Koert Kuipers" wrote:
Okay. Can't I supply the same partitioner I used for
"repartitionAndSortWithinPartitions" as an argument to "sortByKey"?
On 14-Jul-2016 11:38 PM, "Koert Kuipers" wrote:
> repartitionAndSortWithinPartitions partitions the rdd and sorts within
> each partition. so each partition is fully sorted, but the rdd is not sorted.
repartitionAndSortWithinPartitions partitions the rdd and sorts within each
partition. so each partition is fully sorted, but the rdd is not sorted.
sortByKey is basically the same as repartitionAndSortWithinPartitions
except it uses a range partitioner so that the entire rdd is sorted.
however si
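
To make the difference concrete, here is a small plain-Scala model (no Spark involved) of "partition, then sort within each partition" under a hash-style and a range-style assignment. All names (`RangeVsHash`, `rangeIndex`, `hashIndex`, `partitionAndSort`) are invented for this sketch, and the range bound is hard-coded here, whereas Spark's range partitioner picks its bounds by sampling the data.

```scala
object RangeVsHash {
  /** Range-style index: first bound that is >= key; keys above every bound go to the last partition. */
  def rangeIndex(key: Int, bounds: Seq[Int]): Int = {
    val i = bounds.indexWhere(b => key <= b)
    if (i >= 0) i else bounds.length
  }

  /** Hash-style index: non-negative modulo of the key's hash code. */
  def hashIndex(key: Int, numPartitions: Int): Int =
    ((key.hashCode % numPartitions) + numPartitions) % numPartitions

  /** Route every key to a partition with `index`, then sort inside each partition. */
  def partitionAndSort(keys: Seq[Int], numPartitions: Int)(index: Int => Int): Vector[Vector[Int]] =
    Vector.tabulate(numPartitions)(p => keys.filter(k => index(k) == p).sorted.toVector)

  def main(args: Array[String]): Unit = {
    val keys = Seq(5, 1, 4, 2, 3, 6)
    val byHash  = partitionAndSort(keys, 2)(k => hashIndex(k, 2))
    val byRange = partitionAndSort(keys, 2)(k => rangeIndex(k, Seq(3)))
    println(byHash)   // Vector(Vector(2, 4, 6), Vector(1, 3, 5)): sorted per partition only
    println(byRange)  // Vector(Vector(1, 2, 3), Vector(4, 5, 6)): sorted end to end
  }
}
```

Reading the range-partitioned output partition by partition gives a fully sorted sequence, which is the property sortByKey relies on.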
Hi Koert
I have already used "repartitionAndSortWithinPartitions" for secondary
sorting and it works fine. Just wanted to know whether it will sort the
entire RDD or not.
On Thu, Jul 14, 2016 at 11:25 PM, Koert Kuipers wrote:
> repartitionAndSortWithinPartitions sorts by keys, not values per key, so not
> really secondary sort by itself.
repartitionAndSortWithinPartitions sorts by keys, not values per key, so it is
not really a secondary sort by itself.
for secondary sort also check out:
https://github.com/tresata/spark-sorted
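
For reference, one common way to get a secondary sort out of core Spark alone (this is not the spark-sorted API) is to move the value into a composite key, partition on the primary part only, and let repartitionAndSortWithinPartitions order the composite key. A rough sketch, assuming spark-core on the classpath and a local master; `PrimaryKeyPartitioner`, `SecondarySortSketch`, `secondarySort` and the (user, timestamp) data are hypothetical.

```scala
import org.apache.spark.{HashPartitioner, Partitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

/** Hypothetical partitioner: routes a (primary, secondary) key by its primary part only. */
class PrimaryKeyPartitioner(override val numPartitions: Int) extends Partitioner {
  private val delegate = new HashPartitioner(numPartitions)
  override def getPartition(key: Any): Int = key match {
    case (primary, _) => delegate.getPartition(primary)
    case other        => delegate.getPartition(other)
  }
}

object SecondarySortSketch {
  /** One shuffle: records of a user share a partition and arrive ordered by (user, timestamp). */
  def secondarySort(events: RDD[(String, Long)], numPartitions: Int): RDD[(String, Long)] =
    events
      .map { case (user, ts) => ((user, ts), ()) }  // the value becomes part of the key
      .repartitionAndSortWithinPartitions(new PrimaryKeyPartitioner(numPartitions))
      .keys

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("secondary-sort-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val events = sc.parallelize(Seq(("bob", 30L), ("alice", 20L), ("bob", 10L), ("alice", 5L)))
      // Print each partition: every user's timestamps appear together, in ascending order.
      secondarySort(events, 2).glom().collect().foreach(p => println(p.mkString(", ")))
    } finally sc.stop()
  }
}
```

Each partition comes out sorted, but the RDD as a whole is not globally sorted, which matches the description of repartitionAndSortWithinPartitions elsewhere in this thread.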
On Thu, Jul 14, 2016 at 1:09 PM, Punit Naik wrote:
> Hi guys
>
> In my spark/scala code I am implementing secondary sort.
Hi guys
In my spark/scala code I am implementing secondary sort. I wanted to know:
when I call the "repartitionAndSortWithinPartitions" method, will the whole
(entire) RDD be sorted, or only the individual partitions?
If it's the latter case, will applying a "sortByKey" after
"reparti