At a high level, the suggestion sounds good to me. However, regarding the code, it's best to submit a pull request on the Spark GitHub page for community review. You will find more information here: https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
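In the meantime, one workaround that needs no Spark changes is to replay the stored counts as a one-off first batch: load 'stored.count' into an RDD of (word, count) pairs, wrap it in a queue stream, and union it with the live stream so that updateStateByKey folds the old counts into the new state. A minimal sketch along the lines of StatefulNetworkWordCount (the file path, host, and port are placeholders):

import scala.collection.mutable.Queue
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("SeededNetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(1))
ssc.checkpoint("checkpoint")  // updateStateByKey requires checkpointing

// Stored counts from the previous run, one "word count" pair per line.
val initial = ssc.sparkContext.textFile("stored.count")
  .map(_.split("\\s+"))
  .map(fields => (fields(0), fields(1).toInt))

// Live words, each counted as 1.
val lines = ssc.socketTextStream("localhost", 9999)
val ones = lines.flatMap(_.split(" ")).map(word => (word, 1))

// Emit the stored counts exactly once, as if they were the first batch,
// and let the usual sum-into-state update function absorb them.
val seed = ssc.queueStream(Queue(initial), oneAtATime = true)
val updateFunc = (values: Seq[Int], state: Option[Int]) =>
  Some(values.sum + state.getOrElse(0))

val counts = seed.union(ones).updateStateByKey[Int](updateFunc)
counts.print()

ssc.start()
ssc.awaitTermination()

One caveat: queue streams do not survive checkpoint recovery, so treat this as a stopgap until something like your 'initial' argument is available.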
On Tue, Sep 23, 2014 at 10:11 PM, Soumitra Kumar <kumar.soumi...@gmail.com> wrote:
> I thought I did a good job ;-)
>
> OK, so what is the best way to initialize the updateStateByKey operation? I
> have counts from a previous spark-submit and want to load them in the next
> spark-submit job.
>
> ----- Original Message -----
> From: "Soumitra Kumar" <kumar.soumi...@gmail.com>
> To: "spark users" <user@spark.apache.org>
> Sent: Sunday, September 21, 2014 10:43:01 AM
> Subject: How to initialize updateStateByKey operation
>
> I started with StatefulNetworkWordCount to keep a running count of words
> seen.
>
> I have a file 'stored.count' which contains the word counts:
>
> $ cat stored.count
> a 1
> b 2
>
> I want to initialize StatefulNetworkWordCount with the values in the
> 'stored.count' file. How do I do that?
>
> I looked at the paper 'EECS-2012-259.pdf' by Matei et al.; in figure 2, it
> would be useful to have an initial RDD feeding into 'counts' at 't = 1', as
> below.
>
>                             initial
>                               |
> t = 1: pageView -> ones -> counts
>                               |
> t = 2: pageView -> ones -> counts
> ...
>
> I have also attached the modified figure 2 of
> http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-259.pdf .
>
> I managed to modify the Spark code to achieve this, and I am attaching the
> modified files.
>
> Essentially, I added an argument 'initial: RDD[(K, S)]' to the
> updateStateByKey method:
>
> def updateStateByKey[S: ClassTag](
>     initial: RDD[(K, S)],
>     updateFunc: (Seq[V], Option[S]) => Option[S],
>     partitioner: Partitioner
>   ): DStream[(K, S)]
>
> If this sounds interesting to a larger crowd, I would be happy to clean up
> the code and volunteer to push it into the codebase. I don't know the
> procedure for that, though.
>
> -Soumitra.