Xiangrui,

Thanks for the pointer. I think it should work... for now I cooked up my own, which is similar but built on top of the Spark core APIs. I would suggest moving the sliding-window RDD into the core Spark library: it seems quite general to me, and a cursory look at the code indicates nothing specific to machine learning.
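For anyone who lands on this thread later, here is a minimal sketch of using the MLlib helper Xiangrui pointed to (assuming Spark 1.x with spark-mllib on the classpath; the object and variable names are mine):

import org.apache.spark.{SparkConf, SparkContext}
// The implicit conversion that adds sliding() lives in MLlib, not spark-core.
import org.apache.spark.mllib.rdd.RDDFunctions._

object SlidingSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("sliding-sketch").setMaster("local[2]"))
    val rdd1 = sc.parallelize(Seq(1, 2, 3, 5, 8, 11), numSlices = 2)
    // sliding(2) emits consecutive pairs and handles partition boundaries for us:
    // Array(1,2), Array(2,3), Array(3,5), Array(5,8), Array(8,11)
    rdd1.sliding(2).collect().foreach(w => println(w.mkString("(", ", ", ")")))
    sc.stop()
  }
}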
Mohit.

On Mon, May 19, 2014 at 10:13 PM, Xiangrui Meng <men...@gmail.com> wrote:
> Actually there is a sliding method implemented in
> mllib.rdd.RDDFunctions. Since this is not for general use cases, we
> didn't include it in spark-core. You can take a look at the
> implementation there and see whether it fits. -Xiangrui
>
> On Mon, May 19, 2014 at 10:06 PM, Mohit Jaggi <mohitja...@gmail.com> wrote:
> > Thanks Sean. Yes, your solution works :-) I did oversimplify my real
> > problem, which has other parameters that go along with the sequence.
> >
> > On Fri, May 16, 2014 at 3:03 AM, Sean Owen <so...@cloudera.com> wrote:
> >> Not sure if this is feasible, but this literally does what I think you
> >> are describing:
> >>
> >> sc.parallelize(rdd1.first to rdd1.last)
> >>
> >> On Tue, May 13, 2014 at 4:56 PM, Mohit Jaggi <mohitja...@gmail.com> wrote:
> >> > Hi,
> >> > I am trying to find a way to fill in missing values in an RDD. The
> >> > RDD is a sorted sequence.
> >> > For example: (1, 2, 3, 5, 8, 11, ...)
> >> > I need to fill in the missing numbers and get (1, 2, 3, 4, 5, 6, 7,
> >> > 8, 9, 10, 11).
> >> >
> >> > One way to do this is to "slide and zip":
> >> > rdd1 = sc.parallelize(List(1, 2, 3, 5, 8, 11, ...))
> >> > x = rdd1.first
> >> > rdd2 = rdd1 filter (_ != x)
> >> > rdd3 = rdd2 zip rdd1
> >> > rdd4 = rdd3 flatMap { (x, y) => generate missing elements between x and y }
> >> >
> >> > Another method, which I think is more efficient, is to use
> >> > mapPartitions() on rdd1 so that I can iterate over the elements of
> >> > rdd1 within each partition. However, that leaves the boundaries of
> >> > the partitions "unfilled". Is there a way, within the function passed
> >> > to mapPartitions, to read the first element of the next partition?
> >> >
> >> > The latter approach also appears to work for a general "sliding
> >> > window" calculation on the RDD. The former technique requires a lot
> >> > of "sliding and zipping", and I believe it is not efficient. If only
> >> > I could read the next partition... I have tried passing a pointer to
> >> > rdd1 to the function passed to mapPartitions, but the rdd1 pointer
> >> > turns out to be NULL, I guess because Spark cannot deal with a mapper
> >> > calling another mapper (since it happens on a worker, not on the
> >> > driver).
> >> >
> >> > Mohit.
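P.S. For the archives: once sliding() is available, the original gap-filling question has a short answer, since each window spans a gap even across partition boundaries. A hypothetical sketch (my names; it assumes a sorted RDD of distinct Ints and an existing SparkContext sc):

import org.apache.spark.mllib.rdd.RDDFunctions._ // for sliding()

val rdd1 = sc.parallelize(Seq(1, 2, 3, 5, 8, 11), 2)
val last = rdd1.reduce(math.max) // the final element is not emitted by the flatMap below
val filled = rdd1.sliding(2)
  .flatMap { case Array(a, b) => a until b } // emits a, a+1, ..., b-1 per window
  .union(sc.parallelize(Seq(last)))          // append the last element once
filled.collect() // Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11)

union() just appends partitions, so the collected order stays sorted here, and no custom "read the next partition" machinery is needed.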