Hi 

I am new to Spark Streaming. Can I think of JavaDStream<>.foreachRDD() as
meaning 'for each mini batch'? The Javadoc does not say much about this
function.

Here is the background. I am writing a little test program to figure out how
to use streams. At some point I wanted to calculate an aggregate. In my
first attempt my driver did a couple of transformations and tried to get the
count(). I wrote this code like I would a Spark core app and added some
logging. My log statements were only executed once; however,
JavaDStream<>.print() appears to run on each mini batch.

When I put my logging and aggregation code inside foreachRDD(), things work
as expected: my aggregate and logging appear to be executed on each
mini batch.
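To make the question concrete, here is a minimal sketch of what I mean (names, the socket source, and host/port are just placeholders; this obviously needs a Spark Streaming dependency and a running cluster or local master):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class ForEachRddSketch {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("foreachRDD-demo");
        JavaStreamingContext jssc =
                new JavaStreamingContext(conf, Durations.seconds(1));

        JavaDStream<String> lines = jssc.socketTextStream("localhost", 9999);

        // This statement runs ONCE, at stream-graph construction time,
        // on the driver -- which is what I saw with my first attempt.
        System.out.println("setting up the stream graph");

        // The function passed to foreachRDD() is invoked for EVERY mini
        // batch; actions inside it (count, collect, ...) kick off
        // distributed jobs on the workers.
        lines.foreachRDD((JavaRDD<String> rdd) -> {
            long n = rdd.count();                     // per-batch action
            System.out.println("batch size = " + n);  // printed on the driver
        });

        jssc.start();
        jssc.awaitTermination();
    }
}
```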

I am running on a 4 machine cluster. I create a message with the value of
each mini batch aggregate and use logging and System.out. Given that the
message shows up on my console, is it safe to assume that this output code is
executing in my driver? The IP shows it is running on my master.

I thought maybe the message is showing up here because I do not have enough
data in the stream to force load onto the workers?


Any insights would be greatly appreciated.

Andy

