Thanks everyone. As Nathan suggested, I ended up collecting the distinct
keys first and then assigning Ids to each key explicitly.
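In case it is useful to anyone else, the rough shape of it is below (key
names and values are illustrative, not my real data):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()
rdd = sc.parallelize([('a', 5), ('d', 8), ('b', 6), ('c', 3)])

# Collect the distinct keys up front and assign each one an explicit id.
key_to_id = {k: i for i, k in enumerate(sorted(rdd.keys().distinct().collect()))}

# One partition per key; the partition function just looks up the
# pre-assigned id, so two different keys can never share a partition.
partitioned = rdd.partitionBy(len(key_to_id), lambda k: key_to_id[k])

print(partitioned.glom().collect())
# [[('a', 5)], [('b', 6)], [('c', 3)], [('d', 8)]]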
Regards
Sumit Chawla
On Fri, Jun 22, 2018 at 7:29 AM, Nathan Kronenfeld <nkronenfeld@uncharted.software> wrote:
> On Thu, Jun 21, 2018 at 4:51 PM, Chawla,Sumit wrote:
>
>> Hi
>>
>> I have been trying to do this simple operation: I want to land all
>> values with one key in the same partition, and not have any different
>> key in the same partition. Is this possible? b and c always end up
>> mixed together in the same partition.
It is not possible in general, because the cardinality of the partitioning
key is not known ahead of time, while the partition count is fixed. When
cardinality > partition count, the requirement simply cannot be met: with,
say, 26 distinct keys and 4 partitions, at least two keys must share a
partition.
Thanks,
Jungtaek Lim (HeartSaVioR)
On Fri, Jun 22, 2018 at 8:55 AM, Chawla,Sumit wrote:
Based on a read of the code, it looks like Spark's default partitioner
takes the hash of the key modulo the number of partitions. The keys b and c
end up mapping to the same partition. What's the best partitioning scheme
to deal with this?
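
The mapping happening under the hood is roughly the following (the
partition count of 2 is just for illustration; PySpark's actual default
uses pyspark.rdd.portable_hash rather than the builtin hash):

# Default behaviour, roughly: partition = hash(key) % numPartitions.
num_partitions = 2
for key in ['a', 'b', 'c', 'd']:
    print(key, '->', hash(key) % num_partitions)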
Regards
Sumit Chawla
On Thu, Jun 21, 2018 at 4:51 PM, Chawla,Sumit wrote:
Hi

I have been trying to do this simple operation: I want to land all
values with one key in the same partition, and not have any different
key in the same partition. Is this possible? b and c always end up
mixed together in the same partition.
rdd = sc.parallelize([('a', 5), ('d', 8), ('b', 6), ('c', 3)])  # b and c values here are placeholders
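
Roughly, the check that shows the problem looks like this (the partition
count of 4 is illustrative; any small number shows the same effect):

# Partition by key with the default hash partitioner, then use glom()
# to turn each partition into a list so we can see which keys landed
# together.
print(rdd.partitionBy(4).glom().collect())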