Thanks Sean. Yes, your solution works :-) I did oversimplify my real problem, which has other parameters that go along with the sequence.
On Fri, May 16, 2014 at 3:03 AM, Sean Owen <so...@cloudera.com> wrote:
> Not sure if this is feasible, but this literally does what I think you
> are describing:
>
> sc.parallelize(rdd1.first to rdd1.last)
>
> On Tue, May 13, 2014 at 4:56 PM, Mohit Jaggi <mohitja...@gmail.com> wrote:
> > Hi,
> > I am trying to find a way to fill in missing values in an RDD. The RDD
> > is a sorted sequence.
> > For example, (1, 2, 3, 5, 8, 11, ...)
> > I need to fill in the missing numbers and get (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11).
> >
> > One way to do this is to "slide and zip":
> > rdd1 = sc.parallelize(List(1, 2, 3, 5, 8, 11, ...))
> > x = rdd1.first
> > rdd2 = rdd1 filter (_ != x)
> > rdd3 = rdd2 zip rdd1
> > rdd4 = rdd3 flatMap { (x, y) => generate missing elements between x and y }
> >
> > Another method, which I think is more efficient, is to use mapPartitions() on
> > rdd1 so that I can iterate over the elements of rdd1 within each partition.
> > However, that leaves the boundaries of the partitions "unfilled". Is there a
> > way, within the function passed to mapPartitions, to read the first element
> > of the next partition?
> >
> > The latter approach also appears to work for a general "sliding window"
> > calculation on the RDD. The former technique requires a lot of "sliding and
> > zipping", and I believe it is not efficient. If only I could read the next
> > partition... I have tried passing a pointer to rdd1 into the function passed
> > to mapPartitions, but the rdd1 pointer turns out to be NULL, I guess because
> > Spark cannot deal with a mapper calling another mapper (since it happens on
> > a worker, not the driver).
> >
> > Mohit.
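For later readers: here is a minimal, runnable version of Sean's suggestion. Note that RDD has no .last method, so this sketch uses min() and max() instead (both exist on RDD[Int]); the SparkContext setup is illustrative boilerplate.

    import org.apache.spark.{SparkConf, SparkContext}

    // Sean's idea: a gap-free run of integers is fully determined by its
    // endpoints, so just regenerate it as a range on the driver.
    val sc = new SparkContext(new SparkConf().setAppName("fill-gaps").setMaster("local[*]"))
    val rdd1 = sc.parallelize(Seq(1, 2, 3, 5, 8, 11))
    val filled = sc.parallelize(rdd1.min() to rdd1.max())
    println(filled.collect().mkString(", "))  // 1, 2, 3, ..., 11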
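One caveat on the "slide and zip" sketch in the quoted mail: Spark's zip() requires both RDDs to have the same number of partitions and the same number of elements per partition, so rdd2 zip rdd1 would fail at runtime. One way to pair each element with its successor (my own sketch, not from the thread) is zipWithIndex plus a self-join:

    // Pair consecutive elements via zipWithIndex + join, then expand the gaps.
    val indexed = rdd1.zipWithIndex().map { case (v, i) => (i, v) }  // (index, value)
    val shifted = indexed.map { case (i, v) => (i - 1, v) }          // successor keyed by predecessor's index
    val pairs   = indexed.join(shifted).values                       // (current, next)
    val filled  = pairs.flatMap { case (a, b) => a until b } ++      // [a, b) fills each gap
                  sc.parallelize(Seq(rdd1.max()))                    // `until` drops the global max; add it back
    val sorted  = filled.sortBy(identity)                            // join does not preserve order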
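On the original question (peeking at the next partition from inside mapPartitions): an RDD reference captured in a closure is indeed null on the workers, since RDDs only exist on the driver. A common two-pass workaround, sketched below under the assumption that each partition fits in memory, is to collect the first element of every partition to the driver and broadcast those "heads" so each partition can see its successor's first element:

    // Pass 1: first element of each partition, collected to the driver.
    val heads: Map[Int, Int] = rdd1
      .mapPartitionsWithIndex { (i, it) =>
        if (it.hasNext) Iterator((i, it.next())) else Iterator.empty
      }
      .collect()
      .toMap
    val headsB = sc.broadcast(heads)
    val numParts = rdd1.partitions.length

    // Pass 2: each partition appends the head of the next non-empty
    // partition as a sentinel, then emits [a, b) for consecutive pairs.
    val filled = rdd1.mapPartitionsWithIndex { (i, it) =>
      val sentinel = ((i + 1) until numParts).flatMap(k => headsB.value.get(k)).headOption
      val elems = it.toVector ++ sentinel   // assumes one partition fits in memory
      val body = elems.zip(elems.drop(1)).iterator.flatMap { case (a, b) => a until b }
      // The last non-empty partition has no sentinel; keep its final element.
      if (sentinel.isEmpty && elems.nonEmpty) body ++ Iterator(elems.last) else body
    }

The same trick should generalize to the sliding-window case mentioned in the thread: broadcast the first (window - 1) elements of each partition instead of just the head.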