Thanks everyone. As Nathan suggested, I ended up collecting the distinct
keys first and then assigning IDs to each key explicitly.
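Roughly, the approach looks like the sketch below. This is illustrative
rather than the exact code; the sample data is reconstructed from the thread
and the values for 'b' and 'c' are placeholders, since the original snippet
is cut off in the archive.

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    # Sample data reconstructed from the thread; 'b' and 'c' values are placeholders.
    rdd = sc.parallelize([('a', 5), ('d', 8), ('b', 6), ('c', 4)])

    # 1. Collect the distinct keys up front.
    distinct_keys = sorted(rdd.keys().distinct().collect())

    # 2. Assign an explicit ID (partition index) to each key.
    key_to_id = {k: i for i, k in enumerate(distinct_keys)}

    # 3. Partition with one partition per key, using the ID as the partition
    #    index, so records for different keys can never share a partition.
    partitioned = rdd.partitionBy(len(key_to_id),
                                  partitionFunc=lambda k: key_to_id[k])

    # Each inner list is one partition; every partition holds a single key.
    print(partitioned.glom().collect())

This only works when the number of distinct keys is known and small enough
to collect to the driver, which was the case here.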
Regards
Sumit Chawla
On Fri, Jun 22, 2018 at 7:29 AM, Nathan Kronenfeld <
nkronenfeld@uncharted.software> wrote:
> On Thu, Jun 21, 2018 at 4:51 PM, Chawla,Sumit wrote:
Hi
I have been trying to do this simple operation. I want to land all
values with one key in the same partition, and not have values for any
other key in that partition. Is this possible? I am getting b and c always
getting mixed up in the same partition.
Hi Chawla,
There is nothing wrong with your code, nor with Spark.
The situation in which two different keys are mapped to the same partition
is perfectly valid,
since they are mapped to the same 'bucket'.
The promise is that all records with the same key 'k' will be mapped to the
same partition.
It is not possible in general, because the cardinality of the partitioning
key is not known ahead of time (it depends on the data), while the partition
count has to be fixed. If the cardinality turns out to be greater than the
partition count, the system cannot guarantee the requirement.
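To illustrate with a minimal sketch (assuming PySpark's portable_hash, which
partitionBy uses when no partitionFunc is supplied; the 2-partition count is
just an example):

    from pyspark.rdd import portable_hash  # default hash used by RDD.partitionBy

    num_partitions = 2
    for key in ['a', 'b', 'c', 'd']:
        # The partition index is the key's hash modulo the partition count.
        print(key, portable_hash(key) % num_partitions)

    # With 4 distinct keys and only 2 partitions, at least two keys must
    # share a partition index, whatever hash function is used.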
Thanks,
Jungtaek Lim (HeartSaVioR)
On Fri, Jun 22, 2018 at 8:55 AM, Chawla,Sumit wrote:
Based on reading the code, it looks like Spark takes the key's hash modulo
the number of partitions. The keys 'b' and 'c' end up mapping to the same
partition index. What's the best partitioning scheme to deal with this?
Regards
Sumit Chawla
On Thu, Jun 21, 2018 at 4:51 PM, Chawla,Sumit
wrote:
> Hi
>
> I have been trying to do this simple operation.
Hi
I have been trying to do this simple operation. I want to land all values
with one key in the same partition, and not have values for any other key in
that partition. Is this possible? I am getting b and c always getting mixed
up in the same partition.
rdd = sc.parallelize([('a', 5), ('d', 8), ('b
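For reference, a runnable reconstruction of the snippet above (the original
line is cut off in the archive, so the values for 'b' and 'c' and the choice
of 2 partitions are assumptions):

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    # Reconstructed sample data; the values for 'b' and 'c' are placeholders.
    rdd = sc.parallelize([('a', 5), ('d', 8), ('b', 6), ('c', 4)])

    # Partition by key with the default hash partitioner, then print each
    # partition's contents to see which keys end up together.
    for i, part in enumerate(rdd.partitionBy(2).glom().collect()):
        print('partition', i, part)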