-----Original Message-----
From: Sean Owen <so...@cloudera.com>

>On Tue, Nov 4, 2014 at 8:02 PM, spr <s...@yarcdata.com> wrote:
>> To state this another way, it seems like there's no way to straddle the
>> streaming world and the non-streaming world; to get input from both a
>> (vanilla, Linux) file and a stream. Is that true?
>>
>> If so, it seems I need to turn my (vanilla file) data into a second
>> stream.
>
>Hm, why do you say that? nothing prevents that at all. You can do
>anything you like in your local code, or in functions you send to
>remote workers. (Of course, if those functions depend on a local file,
>it has to exist locally on the workers.) You do have to think about
>the distributed model here, but what executes locally/remotely isn't
>mysterious. It is things in calls to Spark API methods that will be
>executed remotely.
The distinction I was calling out was temporal, not local/distributed, though that is another important dimension. It sounds like I can do anything I want in the code before ssc.start(), but that code runs once, at the beginning of the program. What I'm searching for is some way to have code that runs repeatedly and potentially updates a variable that the streaming code will see. broadcast() almost does that, but apparently the underlying variable should be immutable. I'm not aware of any (Spark) way to have code run repeatedly other than as part of the Spark Streaming API, and that doesn't look at vanilla files.

The distributed angle you raise makes my "vanilla file" approach not quite credible, in that the vanilla file would have to be distributed to all the nodes for the updates to be seen. So maybe the simplest approach is to have a vanilla Linux process monitoring the vanilla file (on a client node) and sending any changes to it into a (distinct) stream. If so, the remote code would need to monitor both that stream and the main data stream. Does that make sense?
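For what it's worth, here is a minimal sketch of that two-stream idea, assuming an external process (e.g. `tail -f lookup.txt | nc -lk 9998`) pushes changes to the file onto a socket. The hostnames, ports, line format, and join logic are all hypothetical placeholders, not a definitive recipe; the point is just that the side stream's latest values can be carried as per-key state and consulted against each batch of the main stream.

    // Sketch only: union of a main data stream with a "file change" side stream.
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object TwoStreamSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("TwoStreamSketch")
        val ssc  = new StreamingContext(conf, Seconds(10))
        ssc.checkpoint("/tmp/two-stream-sketch")   // required by updateStateByKey

        // Main data stream (placeholder source and host/port).
        val data    = ssc.socketTextStream("datahost", 9999)
        // Side stream carrying updates to the formerly "vanilla" file.
        val updates = ssc.socketTextStream("confighost", 9998)

        // Keep the most recent value seen for each key on the side stream
        // (assumes "key,value" lines).
        val state = updates
          .map { line => val Array(k, v) = line.split(","); (k, v) }
          .updateStateByKey[String] { (newVals, old) =>
            newVals.lastOption.orElse(old)
          }

        // Each batch of main data can then be joined against the current state.
        val keyedData = data.map(line => (line.split(",")(0), line))
        keyedData.leftOuterJoin(state).print()

        ssc.start()
        ssc.awaitTermination()
      }
    }

Note that two socket streams mean two receivers, so the job needs enough cores (e.g. local[3] or more when testing locally) for the processing to make progress.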