I have made some progress - the partitioning is very uneven, and everything
goes to one partition. I see that Spark partitions by key, so I tried
randomizing the keys. Adding a call to rdd.repartition() after randomizing the
keys has no effect either. Code -

//partitioning is done like partitionIdx = f(key) % numPartitions
//we use random keys to get even partitioning
val uniform = other_stream.transform(rdd => {
  rdd.map({ kv =>
    val
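
The snippet above is cut off in the archive. A minimal sketch of that kind of
key-randomizing transform, assuming other_stream is a DStream of (key, value)
pairs; the variable names below are hypothetical:

import scala.util.Random

// Replace each record's key with a random Int so the hash partitioner
// spreads records evenly instead of sending everything to one partition.
// other_stream is assumed to be a DStream[(K, V)].
val uniform = other_stream.transform { rdd =>
  rdd.map { kv =>
    val randomKey = Random.nextInt()  // hypothetical: any well-spread key works
    (randomKey, kv)                   // keep the original (key, value) as the value
  }
}

Note that this only evens out the pre-shuffle RDD; a later groupByKey still
hash-partitions on the grouping key, so a single hot key will still land in
one partition.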
I tried that, but it did not resolve the problem. All the executors except one
have no shuffle reads and finish within 20-30 ms; one executor does the
complete shuffle read of the previous stage. Any other ideas on debugging
this?
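
One way to confirm where the skew is (a sketch, not from the original thread):
count the records in each partition of the RDD that feeds the shuffle. Here
rdd is a hypothetical stand-in for that RDD.

// Count records per partition to see how skewed the distribution is.
val counts = rdd
  .mapPartitionsWithIndex { (idx, iter) => Iterator((idx, iter.size)) }
  .collect()
counts.sorted.foreach { case (idx, n) => println(s"partition $idx: $n records") }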
--
When you call groupByKey(), try providing the number of partitions, like
groupByKey(100), depending on your data/cluster size.
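
For example (a sketch; pairs is a hypothetical RDD of key/value pairs):

// Force the grouped result into 100 partitions instead of the default
// parallelism, spreading the shuffle read across more reduce tasks.
val grouped = pairs.groupByKey(100)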
Thanks
Best Regards
On Wed, Nov 12, 2014 at 6:45 AM, ankits wrote:
> I'm running a job that uses groupByKey(), so it generates a lot of shuffle
> data. Then it process