Maybe you are looking for updateStateByKey? http://spark.apache.org/docs/latest/streaming-programming-guide.html#transformations-on-dstreams
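For reference, a minimal sketch of what that might look like (Scala; the socket source, port, and checkpoint directory are illustrative assumptions, not anything from this thread):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._  // pair-DStream ops on Spark 1.x

    val conf = new SparkConf().setAppName("RunningCounts")
    val ssc  = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint("checkpoint")  // updateStateByKey requires a checkpoint directory

    // Hypothetical input: words arriving on a socket, counted per batch.
    val pairs = ssc.socketTextStream("localhost", 9999).map(w => (w, 1))

    // Fold each batch's new values into the running total kept per key.
    val totals = pairs.updateStateByKey[Int] { (newValues: Seq[Int], state: Option[Int]) =>
      Some(newValues.sum + state.getOrElse(0))
    }

    totals.print()
    ssc.start()
    ssc.awaitTermination()

The state lives inside the DStream itself, so it survives across batches without any driver-side mutable variable.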
You can use broadcast to efficiently send info to all the workers, if you have some other data that's immutable, like in a local file, that needs to be distributed.

On Tue, Nov 4, 2014 at 8:38 PM, Steve Reinhardt <s...@yarcdata.com> wrote:
>
> -----Original Message-----
> From: Sean Owen <so...@cloudera.com>
>
>> On Tue, Nov 4, 2014 at 8:02 PM, spr <s...@yarcdata.com> wrote:
>>> To state this another way, it seems like there's no way to straddle the
>>> streaming world and the non-streaming world; to get input from both a
>>> (vanilla, Linux) file and a stream. Is that true?
>>>
>>> If so, it seems I need to turn my (vanilla file) data into a second
>>> stream.
>>
>> Hm, why do you say that? Nothing prevents that at all. You can do
>> anything you like in your local code, or in functions you send to
>> remote workers. (Of course, if those functions depend on a local file,
>> it has to exist locally on the workers.) You do have to think about
>> the distributed model here, but what executes locally/remotely isn't
>> mysterious. It is the things in calls to Spark API methods that will be
>> executed remotely.
>
> The distinction I was calling out was temporal, not local/distributed,
> though that is another important dimension. It sounds like I can do
> anything I want in the code before ssc.start(), but that code runs
> once at the beginning of the program. What I'm searching for is some way
> to have code that runs repeatedly and potentially updates a variable that
> the Streaming code will see. Broadcast() almost does that, but apparently
> the underlying variable should be immutable. I'm not aware of any (Spark)
> way to have code run repeatedly other than as part of the Spark Streaming
> API, but that doesn't look at vanilla files.
>
> The distributed angle you raise makes my "vanilla file" approach not quite
> credible, in that the vanilla file would have to be distributed to all the
> nodes for the updates to be seen. So maybe the simplest way to do that is
> to have a vanilla Linux process monitoring the vanilla file (on a client
> node) and sending any changes to it into a (distinct) stream. If so, the
> remote code would need to monitor both that stream and the main data
> stream. Does that make sense?
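If it helps, here is a rough sketch of that two-stream idea (the specifics are assumptions, not from this thread: the main data arrives on a socket, and the monitoring process drops each change to the vanilla file as a new file into a directory watched by textFileStream):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("TwoStreams")
    val ssc  = new StreamingContext(conf, Seconds(10))

    val mainStream    = ssc.socketTextStream("localhost", 9999)       // primary data stream
    val controlStream = ssc.textFileStream("hdfs:///control-updates") // file changes as a second stream

    // Tag records with their origin and handle both streams in one place;
    // per-key state across batches could then be kept with updateStateByKey.
    val combined = mainStream.map(line => ("data", line))
      .union(controlStream.map(line => ("control", line)))

    combined.foreachRDD { rdd =>
      rdd.take(10).foreach(println)  // placeholder for the real processing
    }

    ssc.start()
    ssc.awaitTermination()

That keeps everything inside the streaming model: the file-change stream is just another DStream, so the remote code sees updates batch by batch without needing a mutable broadcast variable.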