I thought I did a good job ;-)

OK, so what is the best way to initialize the updateStateByKey operation? I 
have counts from a previous spark-submit run, and want to load them in the 
next spark-submit job.
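
One workaround I can think of without patching Spark is to broadcast the 
stored counts and seed state from them inside the update function, using the 
key-aware overload of updateStateByKey. A sketch (untested, assuming 
'stored.count' holds one space-separated "word count" pair per line):

    import scala.io.Source
    import org.apache.spark.HashPartitioner

    // Broadcast the stored counts so the update function can read them.
    val stored = ssc.sparkContext.broadcast(
      Source.fromFile("stored.count").getLines().map { line =>
        val Array(word, count) = line.split(" ")
        (word, count.toInt)
      }.toMap)

    val counts = wordDstream.updateStateByKey[Int](
      (iter: Iterator[(String, Seq[Int], Option[Int])]) => iter.map {
        case (word, values, state) =>
          // Seed from the stored counts the first time a key is seen.
          val previous = state.getOrElse(stored.value.getOrElse(word, 0))
          (word, values.sum + previous)
      },
      new HashPartitioner(ssc.sparkContext.defaultParallelism),
      rememberPartitioner = true)

The catch is that a word present only in 'stored.count' and never seen again 
in the stream would not show up in 'counts' at all, which is why an initial 
RDD still seems cleaner.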

----- Original Message -----
From: "Soumitra Kumar" <kumar.soumi...@gmail.com>
To: "spark users" <user@spark.apache.org>
Sent: Sunday, September 21, 2014 10:43:01 AM
Subject: How to initialize updateStateByKey operation

I started with StatefulNetworkWordCount to keep a running count of the words 
seen.

I have a file 'stored.count' which contains the word counts.

$ cat stored.count
a 1
b 2

I want to initialize StatefulNetworkWordCount with the values in the 
'stored.count' file. How do I do that?
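
Parsing the file into an RDD is the easy part. A sketch, assuming one 
space-separated "word count" pair per line:

    import org.apache.spark.rdd.RDD

    // Read 'stored.count' into (word, count) pairs.
    val initial: RDD[(String, Int)] = ssc.sparkContext.textFile("stored.count")
      .map { line =>
        val Array(word, count) = line.split(" ")
        (word, count.toInt)
      }

The hard part is feeding 'initial' into the running counts.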

I looked at the paper 'EECS-2012-259.pdf' by Matei et al.; in figure 2, it 
would be useful to have an initial RDD feeding into 'counts' at 't = 1', as 
below.

                           initial
                             |
t = 1: pageView -> ones -> counts
                             |
t = 2: pageView -> ones -> counts
...

I have also attached the modified figure 2 of 
http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-259.pdf .

I managed to hack the Spark code to achieve this, and am attaching the 
modified files.

Essentially, I added an argument 'initial: RDD[(K, S)]' to the 
updateStateByKey method, as
    def updateStateByKey[S: ClassTag](
        initial: RDD[(K, S)],
        updateFunc: (Seq[V], Option[S]) => Option[S],
        partitioner: Partitioner
      ): DStream[(K, S)]
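
With that change, StatefulNetworkWordCount can be seeded as below. This is a 
sketch against my patched signature, not the released API:

    // Same update logic as the stock example.
    val updateFunc = (values: Seq[Int], state: Option[Int]) =>
      Some(values.sum + state.getOrElse(0))

    // 'initial' is the RDD[(String, Int)] parsed from 'stored.count'.
    val stateDstream = wordDstream.updateStateByKey[Int](
      initial,
      updateFunc,
      new HashPartitioner(ssc.sparkContext.defaultParallelism))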

If it sounds interesting to a larger crowd, I would be happy to clean up the 
code and volunteer to push it upstream. I don't know the procedure for that, 
though.

-Soumitra.
