Yes, foreachRDD will execute your function once for each RDD, which is
what you get for each mini-batch of input.

The operations you express on a DStream (or JavaDStream) are all,
really, "for each RDD", including print(). Logging is a little harder
to reason about, since it will happen on a potentially remote
receiver. I am not sure whether this explains your observed behavior;
it depends on what you were logging.
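
Something like this is the usual pattern (a minimal sketch against the
Spark 1.x Java API; "lines" here is a hypothetical JavaDStream<String>,
not taken from your program):

  import org.apache.spark.api.java.JavaRDD;
  import org.apache.spark.api.java.function.Function;

  lines.foreachRDD(new Function<JavaRDD<String>, Void>() {
    @Override
    public Void call(JavaRDD<String> rdd) {
      // This body executes on the driver, once per batch interval.
      long count = rdd.count(); // an action; the work runs on the executors
      System.out.println("batch count = " + count); // printed on the driver
      return null;
    }
  });

Anything you compute or print inside call() happens once per mini-batch
on the driver, while the RDD operations themselves run on the workers.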

On Wed, Oct 1, 2014 at 6:51 PM, Andy Davidson
<a...@santacruzintegration.com> wrote:
> Hi
>
> I am new to Spark Streaming. Can I think of JavaDStream<> foreachRDD() as
> being "for each mini-batch"? The Javadoc does not say much about this
> function.
>
> Here is the background. I am writing a little test program to figure out how
> to use streams. At some point I wanted to calculate an aggregate. In my
> first attempt my driver did a couple of transformations and tried to get the
> count(). I wrote this code like I would a Spark Core app and added some
> logging. My log statements were only executed once; however,
> JavaDStream<>.print() appears to run on each mini-batch.
>
> When I put my logging and aggregation code inside foreachRDD(), things work
> as expected: my aggregate and logging appear to be executed on each
> mini-batch.
>
> I am running on a 4-machine cluster. I create a message with the value of
> each mini-batch aggregate and use logging and System.out. Given that the
> message shows up on my console, is it safe to assume that this output code is
> executing in my driver? The IP shows it is running on my master.
>
> I thought maybe the message is showing up here because I do not have enough
> data in the stream to force load onto the workers?
>
>
> Any insights would be greatly appreciated.
>
> Andy
