Re: RepartitionByKey Behavior

2018-06-26 Thread Chawla,Sumit
Thanks everyone. As Nathan suggested, I ended up collecting the distinct keys first and then assigning an ID to each key explicitly. Regards Sumit Chawla On Fri, Jun 22, 2018 at 7:29 AM, Nathan Kronenfeld <nkronenfeld@uncharted.software> wrote: > On Thu, Jun 21, 2018 at 4:51 PM, Chawla,Sumit
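The approach described above (collect the distinct keys, then give each key its own ID) can be sketched in plain Python without a Spark cluster. The function names `build_key_index` and `partition_by_index` and the sample records are hypothetical, chosen only for illustration; they are not from the thread.

```python
from collections import defaultdict

def build_key_index(records):
    """Collect the distinct keys and assign each one a dense ID (0..n-1)."""
    distinct_keys = sorted({key for key, _ in records})
    return {key: idx for idx, key in enumerate(distinct_keys)}

def partition_by_index(records, key_to_id):
    """Place every record in the partition whose number is its key's ID."""
    partitions = defaultdict(list)
    for key, value in records:
        partitions[key_to_id[key]].append((key, value))
    return dict(partitions)

# Hypothetical sample data: three distinct keys, five records.
records = [("a", 1), ("b", 2), ("c", 3), ("b", 4), ("a", 5)]
key_to_id = build_key_index(records)
partitions = partition_by_index(records, key_to_id)

for pid in sorted(partitions):
    print(pid, partitions[pid])
```

In PySpark the same mapping could plausibly be passed to `rdd.partitionBy(len(key_to_id), lambda k: key_to_id[k])`, since `partitionBy` accepts a partition count and a partition function. This assumes the set of distinct keys is small enough to collect on the driver.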

Re: RepartitionByKey Behavior

2018-06-22 Thread Nathan Kronenfeld
> On Thu, Jun 21, 2018 at 4:51 PM, Chawla,Sumit wrote: > Hi > > I have been trying to do this simple operation. I want to land all values with one key in the same partition, and not have any other key in that partition. Is this possible? I am getting b and c alwa

Re: RepartitionByKey Behavior

2018-06-21 Thread Jungtaek Lim
It is not possible in general, because the cardinality of the partitioning key is non-deterministic, while the partition count must be fixed. There's a chance that cardinality > partition count, and then the system can't ensure the requirement. Thanks, Jungtaek Lim (HeartSaVioR) On Fri, Jun 22, 2018 at 8:55 AM, Chawla
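The counting argument above can be checked by brute force for small cases. This is a minimal sketch, not Spark code; `has_exclusive_assignment` is a hypothetical helper name. It tries every possible assignment of keys to partitions and reports whether any of them gives each key a partition of its own.

```python
from itertools import product

def has_exclusive_assignment(num_keys, num_partitions):
    """Return True if some assignment puts every key in its own partition."""
    for assignment in product(range(num_partitions), repeat=num_keys):
        # An assignment is exclusive when no two keys share a partition.
        if len(set(assignment)) == num_keys:
            return True
    return False

# With as many partitions as keys, an exclusive layout exists.
print(has_exclusive_assignment(3, 3))
# With more keys than partitions, no layout can separate them all.
print(has_exclusive_assignment(4, 3))
```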

Re: RepartitionByKey Behavior

2018-06-21 Thread Chawla,Sumit
From reading the code, it looks like Spark takes the key's hash modulo the number of partitions to pick a partition. The keys c and b end up mapping to the same value. What's the best partitioning scheme to deal with this? Regards Sumit Chawla On Thu, Jun 21, 2018 at 4:51 PM, Chawla,Sumit wrote: > Hi > > I have been trying to do this simple o
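A small stand-in for the modulo behaviour described above, in plain Python. It uses integer keys (whose Python hash is the integer itself) rather than the keys from the thread, so the numbers are illustrative only; `modulo_partition` is a hypothetical name and the partition count of 3 is an assumption.

```python
def modulo_partition(key, num_partitions):
    """Hash-then-modulo placement, in the style of a hash partitioner."""
    return hash(key) % num_partitions

# Hypothetical keys: three distinct keys and three partitions.
keys = [1, 4, 7]
num_partitions = 3
assignment = {k: modulo_partition(k, num_partitions) for k in keys}

# Every key lands in the same partition even though there are
# as many partitions as keys; the other partitions stay empty.
print(assignment)
```

The sketch suggests why having enough partitions is not sufficient on its own: distinct keys whose hashes are congruent modulo the partition count still collide, which is why an explicit key-to-partition mapping is needed for strict isolation.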