The data coming from dstream have the same keys that are in myRDD, so
the reduceByKey
after union keeps the overall tuple count in myRDD fixed. Or even with
fixed tuple count, it will keep consuming more resources?

On 9 July 2015 at 16:19, Tathagata Das <t...@databricks.com> wrote:

> If you are continuously unioning RDDs, then you are accumulating ever
> increasing data, and you are processing ever increasing amount of data in
> every batch. Obviously this is going to not last for very long. You
> fundamentally cannot keep processing ever increasing amount of data with
> finite resources, isnt it?
>
> On Thu, Jul 9, 2015 at 3:17 AM, Anand Nalya <anand.na...@gmail.com> wrote:
>
>> Thats from the Streaming tab for Spark 1.4 WebUI.
>>
>> On 9 July 2015 at 15:35, Michel Hubert <mich...@vsnsystemen.nl> wrote:
>>
>>>  Hi,
>>>
>>>
>>>
>>> I was just wondering how you generated to second image with the charts.
>>>
>>> What product?
>>>
>>>
>>>
>>> *From:* Anand Nalya [mailto:anand.na...@gmail.com]
>>> *Sent:* donderdag 9 juli 2015 11:48
>>> *To:* spark users
>>> *Subject:* Breaking lineage and reducing stages in Spark Streaming
>>>
>>>
>>>
>>> Hi,
>>>
>>>
>>>
>>> I've an application in which an rdd is being updated with tuples coming
>>> from RDDs in a DStream with following pattern.
>>>
>>>
>>>
>>> dstream.foreachRDD(rdd => {
>>>
>>>   myRDD = myRDD.union(rdd.filter(myfilter)).reduceByKey(_+_)
>>>
>>> })
>>>
>>>
>>>
>>> I'm using cache() and checkpointin to cache results. Over the time, the
>>> lineage of myRDD keeps increasing and stages in each batch of dstream keeps
>>> increasing, even though all the earlier stages are skipped. When the number
>>> of stages grow big enough, the overall delay due to scheduling delay starts
>>> increasing. The processing time for each batch is still fixed.
>>>
>>>
>>>
>>> Following figures illustrate the problem:
>>>
>>>
>>>
>>> Job execution: https://i.imgur.com/GVHeXH3.png?1
>>>
>>> [image: Image removed by sender.]
>>>
>>> Delays: https://i.imgur.com/1DZHydw.png?1
>>>
>>> [image: Image removed by sender.]
>>>
>>> Is there some pattern that I can use to avoid this?
>>>
>>>
>>>
>>> Regards,
>>>
>>> Anand
>>>
>>
>>
>

Reply via email to