Turned out that it was sufficient to do repartitionAndSortWithinPartitions
... so far so good ;)
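For the archives, a minimal sketch of what that can look like. The pair types and the routing logic are placeholders; the real MyPartitioner was never posted in full:

import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD

// Hypothetical partitioner; the actual routing logic was not shown in the thread.
class MyPartitioner(partitions: Int) extends Partitioner {
  override def numPartitions: Int = partitions
  override def getPartition(key: Any): Int = {
    val mod = key.hashCode % partitions
    if (mod < 0) mod + partitions else mod  // keep the index non-negative
  }
}

// One shuffle: records are routed by the partitioner and sorted by key
// within each partition while the shuffle runs.
def partitionAndSort(rdd: RDD[(String, String)]): RDD[(String, String)] =
  rdd.repartitionAndSortWithinPartitions(new MyPartitioner(2))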
On Tue, May 5, 2015 at 9:45 AM Marius Danciu wrote:
Hi Imran,
Yes, that's what MyPartitioner does. I do see (using traces from
MyPartitioner) that the key is assigned to partition 0, but then I see
this record arriving in both YARN containers (I see it in the logs).
Basically I need to emulate a Hadoop map-reduce job in Spark, and
groupByKey seemed like the natural way to do that.
Hi Marius,
I am also a little confused -- are you saying that MyPartitioner is
basically something like:

import org.apache.spark.Partitioner

class MyPartitioner extends Partitioner {
  override def numPartitions: Int = 1
  override def getPartition(key: Any): Int = 0  // every key maps to partition 0
}
??
If so, I don't understand how you'd ever end up with data in two partitions.
Indeed, then every record should end up in partition 0.
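A quick way to check where records actually land -- a minimal sketch, assuming sc is an existing SparkContext and MyPartitioner is the single-partition class sketched just above:

val data = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
val placed = data
  .partitionBy(new MyPartitioner)             // the one-partition class above
  .mapPartitionsWithIndex { (idx, iter) =>
    iter.map { case (k, v) => (idx, k, v) }   // tag each pair with its partition index
  }
// With numPartitions = 1, every tuple should print with index 0.
placed.collect().foreach(println)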
From: Marius Danciu
Date: Tuesday, April 28, 2015 at 9:53 AM
To: Silvio Fiorito, user
Subject: Re: Spark partitioning question
Thank you Silvio,
I am aware of groupBy's limitations and it is due to be replaced.
I did try repartitionAndSortWithinPartitions, but then I end up with maybe too ...
> If you need to sort and repartition, try using
> repartitionAndSortWithinPartitions to do it in one shot.
>
> Thanks,
> Silvio
>
> From: Marius Danciu
> Date: Tuesday, April 28, 2015 at 8:10 AM
> To: user
> Subject: Spark partitioning question
From: Marius Danciu
Date: Tuesday, April 28, 2015 at 8:10 AM
To: user
Subject: Spark partitioning question
Hello all,
I have the following Spark (pseudo)code:
rdd = mapPartitionsWithIndex(...)
      .mapPartitionsToPair(...)
      .groupByKey()
      .sortByKey(comparator)
      .partitionBy(myPartitioner)
      .mapPartitionsWithIndex(...)
      .mapPartitionsToPair( *f* )
The input data
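For later readers, a rough self-contained Scala rendering of that chain (HashPartitioner and the line-splitting logic are stand-ins for the elided myPartitioner and functions), mainly to make the shuffle boundaries visible:

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object PipelineSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("sketch").setMaster("local[2]"))
    val rdd = sc.parallelize(Seq("b 1", "a 2", "b 3", "c 4"))
    val out = rdd
      .map(line => (line.split(" ")(0), line))  // stands in for mapPartitionsToPair(...)
      .groupByKey()                             // shuffle #1
      .sortByKey()                              // shuffle #2 (range partitioning)
      .partitionBy(new HashPartitioner(2))      // shuffle #3; the sorted order is
                                                // not guaranteed to survive this
      .mapPartitionsWithIndex { (idx, iter) =>
        iter.map { case (k, vs) => (idx, k, vs.toList) }
      }
    out.collect().foreach(println)
    sc.stop()
  }
}

The last two shuffles are what the repartitionAndSortWithinPartitions suggestion above collapses into a single pass.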