Xiangrui,

Thanks for the pointer. I think it should work... for now I cooked up my own, which is similar but built on top of the Spark core APIs. I would suggest moving the sliding-window RDD into the core Spark library: it seems quite general to me, and a cursory look at the code indicates nothing specific to machine learning.
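For anyone who lands on this thread later, here is a minimal sketch of using the MLlib helper Xiangrui pointed to (assuming Spark 1.x with spark-mllib on the classpath; the object and variable names are mine):

import org.apache.spark.{SparkConf, SparkContext}
// The implicit conversion that adds sliding() lives in MLlib, not spark-core.
import org.apache.spark.mllib.rdd.RDDFunctions._

object SlidingSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("sliding-sketch").setMaster("local[2]"))
    val rdd1 = sc.parallelize(Seq(1, 2, 3, 5, 8, 11), numSlices = 2)
    // sliding(2) emits consecutive pairs and handles partition boundaries for us:
    // Array(1,2), Array(2,3), Array(3,5), Array(5,8), Array(8,11)
    rdd1.sliding(2).collect().foreach(w => println(w.mkString("(", ", ", ")")))
    sc.stop()
  }
}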
Mohit.

On Mon, May 19, 2014 at 10:13 PM, Xiangrui Meng <men...@gmail.com> wrote:
> Actually there is a sliding method implemented in
> mllib.rdd.RDDFunctions. Since this is not for general use cases, we
> didn't include it in spark-core. You can take a look at the
> implementation there and see whether it fits. -Xiangrui
>
> On Mon, May 19, 2014 at 10:06 PM, Mohit Jaggi <mohitja...@gmail.com> wrote:
> > Thanks Sean. Yes, your solution works :-) I did oversimplify my real
> > problem, which has other parameters that go along with the sequence.
> >
> > On Fri, May 16, 2014 at 3:03 AM, Sean Owen <so...@cloudera.com> wrote:
> >> Not sure if this is feasible, but this literally does what I think you
> >> are describing:
> >>
> >> sc.parallelize(rdd1.first to rdd1.last)
> >>
> >> On Tue, May 13, 2014 at 4:56 PM, Mohit Jaggi <mohitja...@gmail.com> wrote:
> >> > Hi,
> >> > I am trying to find a way to fill in missing values in an RDD. The
> >> > RDD is a sorted sequence.
> >> > For example: (1, 2, 3, 5, 8, 11, ...)
> >> > I need to fill in the missing numbers and get (1, 2, 3, 4, 5, 6, 7,
> >> > 8, 9, 10, 11).
> >> >
> >> > One way to do this is to "slide and zip":
> >> > rdd1 = sc.parallelize(List(1, 2, 3, 5, 8, 11, ...))
> >> > x = rdd1.first
> >> > rdd2 = rdd1 filter (_ != x)
> >> > rdd3 = rdd2 zip rdd1
> >> > rdd4 = rdd3 flatMap { (x, y) => generate missing elements between x and y }
> >> >
> >> > Another method, which I think is more efficient, is to use
> >> > mapPartitions() on rdd1 so that I can iterate over the elements of
> >> > rdd1 within each partition. However, that leaves the boundaries of
> >> > the partitions "unfilled". Is there a way, within the function passed
> >> > to mapPartitions, to read the first element of the next partition?
> >> >
> >> > The latter approach also appears to work for a general "sliding
> >> > window" calculation on the RDD. The former technique requires a lot
> >> > of "sliding and zipping", and I believe it is not efficient. If only
> >> > I could read the next partition... I have tried passing a pointer to
> >> > rdd1 to the function passed to mapPartitions, but the rdd1 pointer
> >> > turns out to be NULL, I guess because Spark cannot deal with a mapper
> >> > calling another mapper (since it happens on a worker, not on the
> >> > driver).
> >> >
> >> > Mohit.
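P.S. For the archives: once sliding() is available, the original gap-filling question has a short answer, since each window spans a gap even across partition boundaries. A hypothetical sketch (my names; it assumes a sorted RDD of distinct Ints and an existing SparkContext sc):

import org.apache.spark.mllib.rdd.RDDFunctions._ // for sliding()

val rdd1 = sc.parallelize(Seq(1, 2, 3, 5, 8, 11), 2)
val last = rdd1.reduce(math.max) // the final element is not emitted by the flatMap below
val filled = rdd1.sliding(2)
  .flatMap { case Array(a, b) => a until b } // emits a, a+1, ..., b-1 per window
  .union(sc.parallelize(Seq(last)))          // append the last element once
filled.collect() // Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11)

union() just appends partitions, so the collected order stays sorted here, and no custom "read the next partition" machinery is needed.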