Hi Sumit,

Can you use
http://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=rdd#pyspark.RDD.mapPartitionsWithIndex
to solve your problem?
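
For example, something along these lines (an untested sketch; the app name and variable names are just placeholders) would give each partition a stable, unique id via the partition index that Spark passes in:

    from pyspark import SparkContext

    sc = SparkContext(appName="partition-id-sketch")
    rdd = sc.parallelize(range(100), numSlices=4)

    def tag_partition(index, iterator):
        # `index` is the partition number Spark assigns; it is unique per
        # partition of this RDD, unlike id(part) evaluated on an executor.
        key = "part_%d" % index
        for record in iterator:
            yield (key, record)

    tagged = rdd.mapPartitionsWithIndex(tag_partition)
    print(tagged.take(5))

The index stays the same for a given partition even when one executor processes several partitions, so it should work as the bookkeeping key for your external system.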

Michael

> On Jan 31, 2017, at 9:08 AM, Chawla,Sumit <sumitkcha...@gmail.com> wrote:
> 
> Hi All
> 
> I have an RDD, which I partition based on some key, and then call sc.runJob for
> each partition.
> Inside this function, I assign each partition a unique key using the following:
> 
> "%s_%s" % (id(part), int(round(time.time()))
> This is to make sure that each partition produces separate bookkeeping state,
> which can be aggregated by an external system. However, I sometimes notice
> multiple partition results pointing to the same partition_id. Is this some
> issue due to the way the above code is serialized by PySpark? What's the best
> way to define a unique id for each partition? I understand that it's the same
> executor getting multiple partitions to process, but I would expect the above
> code to produce a unique id for each partition.
> 
> 
> Regards
> Sumit Chawla
> 
