Good explanation, Chris :)
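
For anyone finding this thread later, here is a rough sketch of the accumulator approach Chris describes (untested; assumes the Spark 1.x SparkContext.accumulator API and a StreamingContext named ssc, which are illustrative):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("CountEvents")
val ssc = new StreamingContext(conf, Seconds(10))

// driver-side accumulator; workers can only add to it
val eventCount = ssc.sparkContext.accumulator(0L, "eventCount")

val dstream = ssc.socketTextStream("localhost", 9999) // example source

dstream.foreachRDD { rdd =>
  // the increment runs in parallel on the worker nodes
  rdd.foreach { _ => eventCount += 1 }
  // the value is only readable here, locally in the driver
  println(s"events so far: ${eventCount.value}")
}

ssc.start()
ssc.awaitTermination()
```

The var-based globalCount in TD's snippet works too, since foreachRDD runs its body in the driver; the accumulator variant is mainly useful when the updates themselves happen inside worker-side closures, as above.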


On Fri, Sep 5, 2014 at 12:42 PM, Chris Fregly <ch...@fregly.com> wrote:

> good question, soumitra.  it's a bit confusing.
>
> to break TD's code down a bit:
>
> dstream.count() is a transformation operation (returns a new DStream),
> executes lazily, runs in the cluster on the underlying RDDs that come
> through in that batch, and returns a new DStream whose single element per
> batch is the count of elements in that batch's underlying RDD.
>
> dstream.foreachRDD() is an output/action operation (returns something
> other than a DStream - nothing in this case), triggers the lazy execution
> above, returns the results to the driver, and increments the globalCount
> locally in the driver.
>
> per your specific question, RDD.count() is different in that it's an
> output/action operation that materializes the RDD and collects the count of
> elements in the RDD locally in the driver.  confusing, indeed.
>
> accumulators are updated in parallel on the worker nodes across the cluster
> and are read locally in the driver.
>
>
>
>
> On Fri, Aug 8, 2014 at 7:36 AM, Soumitra Kumar <kumar.soumi...@gmail.com>
> wrote:
>
>> I want to keep track of the events processed in a batch.
>>
>> How come 'globalCount' works for a DStream? I think a similar construct
>> won't work for an RDD; that's why accumulators exist.
>>
>>
>> On Fri, Aug 8, 2014 at 12:52 AM, Tathagata Das <
>> tathagata.das1...@gmail.com> wrote:
>>
>>> Do you mean that you want a continuously updated count as more
>>> events/records are received in the DStream (remember, DStream is a
>>> continuous stream of data)? Assuming that is what you want, you can use a
>>> global counter
>>>
>>> var globalCount = 0L
>>>
>>> dstream.count().foreachRDD(rdd => { globalCount += rdd.first() } )
>>>
>>> This globalCount variable will reside in the driver and will keep being
>>> updated after every batch.
>>>
>>> TD
>>>
>>>
>>> On Thu, Aug 7, 2014 at 10:16 PM, Soumitra Kumar <
>>> kumar.soumi...@gmail.com> wrote:
>>>
>>>> Hello,
>>>>
>>>> I want to count the number of elements in the DStream, like
>>>> RDD.count(). Since there is no such method in DStream, I thought of
>>>> using DStream.count and an accumulator.
>>>>
>>>> How do I do DStream.count() to count the number of elements in a
>>>> DStream?
>>>>
>>>> How do I create a shared variable in Spark Streaming?
>>>>
>>>> -Soumitra.
>>>>
>>>
>>>
>>
>
