Yes, foreachRDD will run your function once for each RDD, and each RDD is what you get for each mini-batch of input.
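For example, here is a minimal sketch against the Spark 1.x Java API (the socket source, host/port, class name, and 1-second batch interval are placeholders, not anything from your program):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class PerBatchAggregate {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("PerBatchAggregate");
    // 1-second batch interval; each batch becomes one RDD
    JavaStreamingContext jssc =
        new JavaStreamingContext(conf, new Duration(1000));

    // placeholder source; any DStream behaves the same way
    JavaDStream<String> lines = jssc.socketTextStream("localhost", 9999);

    // this function is invoked once per mini-batch, in the driver
    lines.foreachRDD(new Function<JavaRDD<String>, Void>() {
      @Override
      public Void call(JavaRDD<String> rdd) {
        // count() is an action: the work runs on the executors, but the
        // result comes back here, so this println executes in the driver
        long count = rdd.count();
        System.out.println("records in this mini-batch: " + count);
        return null;
      }
    });

    jssc.start();
    jssc.awaitTermination();
  }
}

Note the distinction inside the foreachRDD function: the body of the function runs in the driver, while anything you pass into an RDD operation (a map function, for instance) is shipped to the executors, and its logging ends up in the executor logs, not on your console.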
The operations you express on a DStream (or JavaDStream) are all, really, "for each RDD", including print(). Logging is a little harder to reason about, since log statements inside DStream operations run on potentially remote executors rather than in the driver. I am not sure whether this explains your observed behavior; it depends on what you were logging.

On Wed, Oct 1, 2014 at 6:51 PM, Andy Davidson <a...@santacruzintegration.com> wrote:
> Hi
>
> I am new to Spark Streaming. Can I think of JavaDStream<> foreachRDD() as
> being "for each mini-batch"? The Javadoc does not say much about this
> function.
>
> Here is the background. I am writing a little test program to figure out
> how to use streams. At some point I wanted to calculate an aggregate. In
> my first attempt my driver did a couple of transformations and tried to
> get the count(). I wrote this code like I would a Spark Core app and added
> some logging. My log statements were only executed once, however
> JavaDStream<>.print() appears to run on each mini-batch.
>
> When I put my logging and aggregation code inside foreachRDD(), things
> worked as expected: my aggregate and logging appear to be executed on
> each mini-batch.
>
> I am running on a 4-machine cluster. I create a message with the value of
> each mini-batch aggregate and use logging and System.out. Given that the
> message shows up on my console, is it safe to assume that this output
> code is executing in my driver? The IP shows it is running on my master.
>
> I thought maybe the message is showing up here because I do not have
> enough data in the stream to force load onto the workers?
>
> Any insights would be greatly appreciated.
>
> Andy