Not sure if this is feasible, but this literally does what I think you
are describing:

sc.parallelize(rdd1.first to rdd1.max)  // RDDs have no "last"; for a sorted sequence, max is the last element
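
For example, with the data from your message (a quick sketch, assuming the
RDD holds distinct, sorted Ints):

val rdd1 = sc.parallelize(List(1, 2, 3, 5, 8, 11))
val filled = sc.parallelize(rdd1.first to rdd1.max)
filled.collect()  // Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11)

Note this only pulls the two endpoints to the driver, but it rebuilds the
whole range from scratch, so it only works when the result really is every
integer between them.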

On Tue, May 13, 2014 at 4:56 PM, Mohit Jaggi <mohitja...@gmail.com> wrote:
> Hi,
> I am trying to find a way to fill in missing values in an RDD. The RDD is a
> sorted sequence.
> For example, (1, 2, 3, 5, 8, 11, ...)
> I need to fill in the missing numbers and get (1,2,3,4,5,6,7,8,9,10,11)
>
> One way to do this is to "slide and zip":
> rdd1 = sc.parallelize(List(1, 2, 3, 5, 8, 11, ...))
> x = rdd1.first
> rdd2 = rdd1 filter (_ != x)
> rdd3 = rdd2 zip rdd1
> rdd4 = rdd3 flatMap { case (x, y) => generate missing elements between x and y }
>
> Another method, which I think is more efficient, is to use mapPartitions()
> on rdd1 to iterate over the elements of rdd1 within each partition. However,
> that leaves the boundaries of the partitions "unfilled". Is there a way,
> within the function passed to mapPartitions, to read the first element of
> the next partition?
>
> The latter approach also appears to work for a general "sliding window"
> calculation on the RDD. The former technique requires a lot of "sliding and
> zipping", and I believe it is not efficient. If only I could read the next
> partition... I have tried passing a reference to rdd1 into the function
> passed to mapPartitions, but the rdd1 reference turns out to be null, I
> guess because Spark cannot handle a mapper calling another mapper (since it
> runs on a worker, not the driver).
>
> Mohit.
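
A note on the "slide and zip" sketch above: the zip would actually fail,
because zip requires both RDDs to have the same number of elements in every
partition, and rdd2 is one element short. A safer way to pair each element
with its successor is zipWithIndex plus a self-join. A sketch, again
assuming distinct, sorted Ints:

// pair each element with its successor via indices
val withIdx = rdd1.zipWithIndex().map(_.swap)            // (index, value)
val nextVal = withIdx.map { case (i, v) => (i - 1, v) }  // shift indices left by one
val filled = withIdx.join(nextVal)                       // (index, (value, successor))
  .values
  .flatMap { case (v, next) => v until next }            // fill the gap [v, next)
  .union(sc.parallelize(Seq(rdd1.max)))                  // the last element has no successor

join does not preserve order, so tack on a sortBy(identity) at the end if
the output must stay sorted.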
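
On reading across partition boundaries: the function passed to mapPartitions
runs on a worker and cannot touch the RDD itself, which is why your rdd1
reference comes back null. What you can do is a small first pass that
collects the first element of each partition to the driver, then a second
pass that bridges each partition to the next. A sketch, assuming no empty
partitions:

// pass 1: the first element of every partition, keyed by partition index
val firsts = rdd1.mapPartitionsWithIndex { (i, it) =>
  if (it.hasNext) Iterator((i, it.next())) else Iterator.empty
}.collectAsMap()  // tiny: one entry per partition

// pass 2: fill gaps within each partition, bridging to the next partition's first element
val filled = rdd1.mapPartitionsWithIndex { (i, it) =>
  val elems  = it.toVector
  val bridge = firsts.get(i + 1)              // None for the last partition
  val seq    = bridge.fold(elems)(elems :+ _)
  val gaps   = seq.sliding(2).flatMap {
    case Seq(a, b) => a until b               // fill [a, b)
    case _         => Seq.empty[Int]          // single element, nothing to pair
  }
  if (bridge.isEmpty && elems.nonEmpty) gaps ++ Iterator(elems.last) else gaps
}

If the RDD has many partitions you would broadcast firsts instead of
capturing it in the closure, but the two-pass shape is the same. The same
trick generalizes to sliding windows: collect the first (windowSize - 1)
elements of each partition instead of just the first one.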
