Hello Mohit,

I don't think there's a direct way of bleeding elements across partitions.
But you could write it yourself relatively succinctly:

A) Sort the RDD
B) Look at the sorted RDD's partitions with the .mapPartitionsWithIndex()
method. Map each partition to a (partID, maxElement) pair, and collect
those pairs in the driver.
C) Broadcast the collection of (partID, part's max element) tuples
D) Look again at the sorted RDD's partitions with mapPartitionsWithIndex( ).
For each partition K:
D1) Look up the immediately-preceding partition K-1 and its associated
maximum value. Use that to decide how many values are missing between the
last element of part K-1 and the first element of part K.
D2) Step through part K's elements and find the rest of the missing
elements in that part.
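The steps above can be sketched in plain Python, with a list of sorted
lists standing in for the sorted RDD's partitions (no Spark needed to see
the logic; the function name and sample data are my own, and the broadcast
step collapses to an ordinary dict here):

```python
def fill_missing(partitions):
    """Find integers missing from a sorted sequence split across partitions.

    `partitions` stands in for a sorted RDD's partitions: a list of sorted
    integer lists. Mirrors steps A-D from the message above.
    """
    # B) Map each partition to (partID, max element), skipping empty ones,
    #    and "collect" the result at the driver.
    maxes = {pid: part[-1] for pid, part in enumerate(partitions) if part}

    # C) In Spark, `maxes` would be broadcast to the executors here.

    # D) Per partition: prepend the max of the nearest non-empty preceding
    #    partition, then emit every integer missing between consecutive
    #    elements (D1 covers the boundary pair, D2 the interior pairs).
    missing = []
    for pid, part in enumerate(partitions):
        prev = [maxes[k] for k in range(pid - 1, -1, -1) if k in maxes]
        elems = prev[:1] + part
        for a, b in zip(elems, elems[1:]):
            missing.extend(range(a + 1, b))
    return missing

# Example with four "partitions"; 3, 4, 7, 8, 12 are missing.
print(fill_missing([[1, 2], [5, 6], [], [9, 10, 11, 13]]))
```

In a real job you would express B and D with mapPartitionsWithIndex and
wrap the dict in sc.broadcast(), but the per-partition arithmetic is the
same.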

This approach sidesteps worries you might have over the hack of using
.filter to remove the first element (how do you want to handle ties, for
instance?), as well as the possible fragility of zipping.

--Brian



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/filling-missing-values-in-a-sequence-tp5708p5846.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
