Hi Sumit,

Can you use http://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=rdd#pyspark.RDD.mapPartitionsWithIndex to solve your problem?
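For example, here's a minimal sketch (not from your code; the RDD contents and the per-partition counting are placeholders) of how the partition index Spark passes in could replace id(part) in your key, since that index is unique per partition within the RDD:

    from pyspark import SparkContext
    import time

    sc = SparkContext(appName="partition-id-sketch")
    rdd = sc.parallelize(range(100), numSlices=8)

    def tag_partition(index, iterator):
        # `index` is the partition number Spark supplies; it is unique within
        # the RDD, unlike id(part), which can repeat when one worker process
        # handles several partitions.
        partition_key = "%s_%s" % (index, int(round(time.time())))
        # Emit one bookkeeping record per partition (here just a record count).
        yield (partition_key, sum(1 for _ in iterator))

    print(rdd.mapPartitionsWithIndex(tag_partition).collect())

Since the index already disambiguates partitions, the timestamp becomes optional; you could keep it purely for bookkeeping or drop it entirely.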
Michael

> On Jan 31, 2017, at 9:08 AM, Chawla,Sumit <sumitkcha...@gmail.com> wrote:
>
> Hi All
>
> I have an RDD, which I partition based on some key, and then call sc.runJob for
> each partition. Inside this function, I assign each partition a unique key using
> the following:
>
>     "%s_%s" % (id(part), int(round(time.time())))
>
> This is to make sure that each partition produces separate bookkeeping data,
> which can be aggregated by an external system. However, I sometimes notice
> multiple partition results pointing to the same partition_id. Is this an issue
> with the way the above code is serialized by PySpark? What's the best way to
> define a unique id for each partition? I understand that it's the same executor
> getting multiple partitions to process, but I would expect the above code to
> produce a unique id for each partition.
>
> Regards
> Sumit Chawla