Hello Mohit, I don't think there's a direct way of bleeding elements across partitions. But you could write it yourself relatively succinctly:
A) Sort the RDD.

B) Look at the sorted RDD's partitions with the .mapPartitionsWithIndex() method. Map each partition to its partition ID and its maximum element. Collect the (partID, maxElement) tuples in the driver.

C) Broadcast the collection of (partID, partition's max element) tuples.

D) Look again at the sorted RDD's partitions with mapPartitionsWithIndex(). For each partition K:

  D1) Find the immediately-preceding partition K-1 and its associated maximum value. Use that to decide which values are missing between the last element of partition K-1 and the first element of partition K.

  D2) Step through partition K's elements and find the rest of the missing elements within that partition.

This approach sidesteps worries you might have over the hack of using .filter to remove the first element (how do you want to handle ties, for instance?), as well as the possible fragility of zipping.

--Brian

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/filling-missing-values-in-a-sequence-tp5708p5846.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
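P.S. The steps above can be sketched in plain Python, with ordinary lists standing in for a sorted RDD's partitions (no Spark needed to see the logic; the function name find_missing is just for illustration). In a real job, steps B/C would be mapPartitionsWithIndex + collect + broadcast, and step D another mapPartitionsWithIndex pass:

```python
def find_missing(partitions):
    """partitions: sorted, non-overlapping lists of ints,
    as produced by sorting the RDD (step A)."""
    # Steps B/C: per-partition maximum -- this small dict is what
    # you would collect to the driver and broadcast.
    part_max = {i: p[-1] for i, p in enumerate(partitions) if p}

    missing = []
    for k, part in enumerate(partitions):
        if not part:
            continue
        # Step D1: gap between partition K-1's max and our first element.
        # (Assumes the preceding partition is non-empty; a real job
        # might need to look further back.)
        if k - 1 in part_max:
            missing.extend(range(part_max[k - 1] + 1, part[0]))
        # Step D2: gaps between consecutive elements inside partition K.
        for a, b in zip(part, part[1:]):
            missing.extend(range(a + 1, b))
    return missing

# Three "partitions" of a sorted sequence with holes:
print(find_missing([[1, 2, 4], [7, 8], [10, 12]]))  # [3, 5, 6, 9, 11]
```

Since each partition only needs its predecessor's maximum, the broadcast stays tiny (one tuple per partition) and step D remains embarrassingly parallel.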