Good explanation, Chris :)
On Fri, Sep 5, 2014 at 12:42 PM, Chris Fregly <ch...@fregly.com> wrote:

> good question, soumitra. it's a bit confusing.
>
> to break TD's code down a bit:
>
> dstream.count() is a transformation operation (it returns a new DStream).
> it executes lazily, runs in the cluster on the underlying RDDs that come
> through in each batch, and returns a new DStream whose single element per
> batch is the number of elements in that batch's RDD.
>
> dstream.foreachRDD() is an output/action operation (it returns something
> other than a DStream - nothing, in this case). it triggers the lazy
> execution above, returns the results to the driver, and increments the
> globalCount locally in the driver.
>
> per your specific question, RDD.count() is different in that it's an
> output/action operation that materializes the RDD and collects the count
> of its elements locally in the driver. confusing, indeed.
>
> accumulators are updated in parallel on the worker nodes across the
> cluster and are read locally in the driver.
>
> On Fri, Aug 8, 2014 at 7:36 AM, Soumitra Kumar <kumar.soumi...@gmail.com> wrote:
>
>> I want to keep track of the events processed in a batch.
>>
>> How come 'globalCount' works for a DStream? I think a similar construct
>> won't work for an RDD - that's why there are accumulators.
>>
>> On Fri, Aug 8, 2014 at 12:52 AM, Tathagata Das <tathagata.das1...@gmail.com> wrote:
>>
>>> Do you mean that you want a continuously updated count as more
>>> events/records are received in the DStream (remember, a DStream is a
>>> continuous stream of data)? Assuming that is what you want, you can use
>>> a global counter:
>>>
>>> var globalCount = 0L
>>>
>>> dstream.count().foreachRDD(rdd => { globalCount += rdd.first() })
>>>
>>> This globalCount variable will reside in the driver and will keep being
>>> updated after every batch.
>>>
>>> TD
>>>
>>> On Thu, Aug 7, 2014 at 10:16 PM, Soumitra Kumar <kumar.soumi...@gmail.com> wrote:
>>>
>>>> Hello,
>>>>
>>>> I want to count the number of elements in a DStream, like RDD.count().
>>>> Since there is no such method on DStream, I thought of using
>>>> DStream.count and an accumulator.
>>>>
>>>> How do I do DStream.count() to count the number of elements in a
>>>> DStream?
>>>>
>>>> How do I create a shared variable in Spark Streaming?
>>>>
>>>> -Soumitra.
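
[Editor's note] For anyone who wants to see TD's driver-side counter pattern work without a cluster, here is a minimal plain-Scala sketch. `GlobalCountSketch`, `countAll`, and `batches` are illustrative names that do not appear in the thread: each inner `Seq` stands in for one batch's RDD, and the loop body mimics `dstream.count().foreachRDD(rdd => globalCount += rdd.first())`.

```scala
// Plain-Scala sketch (no Spark required) of TD's driver-side counter.
// Each inner Seq models one batch's RDD; the object and method names are
// illustrative, not Spark API.
object GlobalCountSketch {
  // Simulates: var globalCount = 0L
  //            dstream.count().foreachRDD(rdd => { globalCount += rdd.first() })
  def countAll(batches: Seq[Seq[String]]): Long = {
    var globalCount = 0L                  // lives in the "driver"
    batches.foreach { batch =>            // one iteration per batch interval
      val batchCount = batch.size.toLong  // what dstream.count() would emit
      globalCount += batchCount           // what foreachRDD does in the driver
    }
    globalCount
  }

  def main(args: Array[String]): Unit = {
    val batches = Seq(Seq("a", "b", "c"), Seq("d"), Seq.empty, Seq("e", "f"))
    println(countAll(batches))  // prints 6
  }
}
```

The key point the sketch preserves: the per-batch count is computed "in the cluster" (here, `batch.size`), while the running total is only ever mutated in one place, the driver.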
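
[Editor's note] Chris's last point - accumulators are updated in parallel on the workers and read only in the driver - can also be sketched in plain Scala. `AtomicLong` and the threads below are stand-ins for `sc.accumulator(0L)` and Spark tasks; this is not the Spark API itself, just the same update/read contract.

```scala
import java.util.concurrent.atomic.AtomicLong

// Plain-Scala sketch of accumulator semantics: many "tasks" add in
// parallel, only the "driver" reads the merged value. AtomicLong and
// Thread are illustrative stand-ins for sc.accumulator(0L) and tasks.
object AccumulatorSketch {
  def countElements(partitions: Seq[Seq[String]]): Long = {
    val acc = new AtomicLong(0L)  // plays the role of sc.accumulator(0L)
    val workers = partitions.map { part =>
      new Thread(() => { acc.addAndGet(part.size.toLong); () })  // task does acc += part.size
    }
    workers.foreach(_.start())
    workers.foreach(_.join())     // wait for the "batch" to complete
    acc.get()                     // driver-side read, like acc.value
  }

  def main(args: Array[String]): Unit =
    println(countElements(Seq(Seq("a", "b"), Seq("c"), Seq("d", "e", "f"))))  // prints 6
}
```

Note the asymmetry that makes this work for Soumitra's question: workers only add, and the single merged value is read after the parallel work finishes, so there is no read-modify-write race the way there would be with a plain closed-over `var` on the workers.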