Re: Imbalanced shuffle read

2014-11-12 Thread ankits
I have made some progress: the partitioning is very uneven, and everything goes to one partition. I see that Spark partitions by key, so I tried this:

// partitioning is done like partitionIdx = f(key) % numPartitions
// we use random keys to get even partitioning
val uniform = other_st
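The salting idea in the (truncated) snippet above can be sketched outside Spark by simulating the hash partitioner's `f(key) % numPartitions` rule. This is a minimal illustration, not Spark's actual `HashPartitioner` code; all names are made up, and the round-robin salt stands in for a random one so the output is deterministic:

```scala
object SaltingSketch {
  // Mimics the rule above: partition = non-negative (key.hashCode % n)
  def partitionOf(key: Any, n: Int): Int = {
    val raw = key.hashCode % n
    if (raw < 0) raw + n else raw
  }

  def main(args: Array[String]): Unit = {
    val n = 8
    // One hot key: every record hashes to the same partition.
    val hot = Seq.fill(1000)("hot").map(partitionOf(_, n)).distinct
    println(s"partitions used by hot key: ${hot.size}")       // 1

    // Salted keys (deterministic round-robin salt for clarity;
    // Random.nextInt(n) behaves the same way on average).
    val salted = (0 until 1000).map(i => partitionOf(i % n, n)).distinct
    println(s"partitions used after salting: ${salted.size}") // 8
  }
}
```

This shows why a single dominant key concentrates all shuffle data on one partition, and why replacing (or prefixing) it with a salt spreads the load.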

Re: Imbalanced shuffle read

2014-11-12 Thread ankits
Adding a call to rdd.repartition() after randomizing the keys has no effect either. Code:

// partitioning is done like partitionIdx = f(key) % numPartitions
// we use random keys to get even partitioning
val uniform = other_stream.transform(rdd => { rdd.map({ kv => val
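For context on why repartition() alone should already balance record counts: it distributes elements round-robin (from a random starting offset), ignoring keys entirely. A rough model of that behavior, not Spark's actual implementation:

```scala
import scala.util.Random

object RepartitionSketch {
  // Approximates how RDD.repartition spreads record counts: records are
  // dealt round-robin to target partitions from a random starting offset,
  // so per-partition counts even out regardless of the keys.
  def roundRobinCounts(numRecords: Int, n: Int): Map[Int, Int] = {
    val start = Random.nextInt(n)
    (0 until numRecords)
      .groupBy(i => (start + i) % n)
      .map { case (p, xs) => p -> xs.size }
  }

  def main(args: Array[String]): Unit = {
    // 1000 records over 8 partitions -> every partition gets exactly 125.
    val counts = roundRobinCounts(1000, 8)
    println(counts.values.toSeq.sorted)
  }
}
```

Since repartition balances counts by construction, an imbalance that survives it is usually reintroduced by a later key-based shuffle (e.g. a groupByKey over skewed keys) rather than by the repartition itself.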

Re: Imbalanced shuffle read

2014-11-12 Thread ankits
I tried that, but it did not resolve the problem. All the executors for partitions except one have no shuffle reads and finish within 20-30 ms; one executor has the complete shuffle read of the previous stage. Any other ideas on debugging this?
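One way to debug this (a hypothetical sketch, not from the thread) is to log per-partition record counts for the RDD feeding the skewed stage, which pinpoints whether the data itself is concentrated before the shuffle. This requires a live SparkContext; `suspectRdd` is a stand-in name:

```scala
// Count records in each partition of the RDD feeding the skewed stage.
// mapPartitionsWithIndex passes (partitionIndex, iterator) per partition.
val perPartition = suspectRdd
  .mapPartitionsWithIndex((idx, it) => Iterator((idx, it.size)))
  .collect()
perPartition.foreach { case (idx, n) => println(s"partition $idx: $n records") }
```

The Spark web UI's stage detail page shows the same story per task (shuffle read size and duration), which matches the symptom described here: one task with the entire shuffle read.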

Re: Imbalanced shuffle read

2014-11-11 Thread Akhil Das
When you call groupByKey(), try providing the number of partitions, e.g. groupByKey(100), depending on your data/cluster size. Thanks Best Regards

On Wed, Nov 12, 2014 at 6:45 AM, ankits wrote:
> I'm running a job that uses groupByKey(), so it generates a lot of shuffle
> data. Then it process
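A caveat on this suggestion, illustrated with the same `f(key) % numPartitions` model as before: raising the partition count helps when keys are spread out, but does little when one key dominates, because all of that key's records still land in a single partition. The numbers below are illustrative:

```scala
object SkewVsPartitions {
  // Non-negative hash-mod, as a stand-in for Spark's hash partitioning.
  def partitionOf(key: Any, n: Int): Int = {
    val raw = key.hashCode % n
    if (raw < 0) raw + n else raw
  }

  def main(args: Array[String]): Unit = {
    // 1000 records under one hot key plus 1000 records with uniform Int keys.
    val keys: Seq[Any] = Seq.fill(1000)("hot") ++ (0 until 1000)
    for (n <- Seq(8, 100)) {
      val maxLoad = keys.groupBy(partitionOf(_, n)).values.map(_.size).max
      println(s"numPartitions=$n -> max partition load = $maxLoad")
    }
  }
}
```

With 8 partitions the hottest partition holds 1125 records; with 100 it still holds 1010, so the heaviest task barely shrinks. If the keys are genuinely skewed, groupByKey(100) will not fix the single slow executor; salting the keys (or avoiding groupByKey in favor of reduceByKey-style pre-aggregation) is needed.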