It uses a HashPartitioner to distribute the records to different partitions, but 
the key is just an integer assigned round-robin, so records are spread evenly 
across output partitions.

From the code, each resulting partition will get a very similar number of 
records.
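
For reference, the relevant logic in RDD.coalesce(shuffle = true) (which 
repartition calls) looks roughly like the sketch below. This is a paraphrase 
from memory of the Spark 1.x source, so details may differ slightly:

  // Sketch: before the shuffle, each record is paired with an Int key that
  // starts at a (seeded) random output partition per input partition and then
  // increments round-robin.
  def distributePartition[T](numPartitions: Int)(index: Int, items: Iterator[T]): Iterator[(Int, T)] = {
    var position = new scala.util.Random(index).nextInt(numPartitions)
    items.map { t =>
      position += 1
      (position, t)  // key depends only on record position, never on record size
    }
  }
  // These (Int, record) pairs are then shuffled with new HashPartitioner(numPartitions)
  // and the keys dropped, so each output partition receives a nearly equal
  // number of records regardless of how large the records are.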

Thanks.

Zhan Zhang


On Mar 4, 2015, at 3:47 PM, Du Li <l...@yahoo-inc.com.INVALID> wrote:

Hi,

My RDDs are created from a Kafka stream. After receiving an RDD, I want to 
coalesce/repartition it so that the data will be processed on a set of machines 
in parallel, as evenly as possible. The number of processing nodes is larger 
than the number of receiving nodes.

My question is how coalesce/repartition works. Does it distribute by number of 
records or by number of bytes? In my app, my observation is that the 
distribution is by number of records. The consequence, however, is that some 
executors have to process 1000x as much data when record sizes are very skewed. 
We then have to allocate memory for the worst case.
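
One way to confirm this, as a minimal sketch (assuming an already-received RDD 
named rdd and 16 output partitions, both illustrative), is to print the record 
count per partition after repartitioning:

  // Count records per partition after repartitioning; if the counts are nearly
  // equal while memory usage is skewed, the split is by record count, not bytes.
  val counts = rdd.repartition(16)
    .mapPartitionsWithIndex { (i, iter) => Iterator((i, iter.size)) }
    .collect()
  counts.foreach { case (i, n) => println(s"partition $i: $n records") }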

Is there a way to programmatically affect the coalesce/repartition scheme?

Thanks,
Du
