spark.streaming.unpersist = false // in order for Spark Streaming to not drop the raw RDD data

spark.cleaner.ttl = <some reasonable value in seconds>

 

Why is the above suggested, given that the persist/cache operation on the constantly unionized Batch RDD will have to be invoked anyway (after every union with a DStream RDD)? Besides, it will result in DStream RDDs accumulating in RAM unnecessarily for the duration of the TTL.
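
For reference, a minimal sketch of the pattern I am referring to (batchRDD and myFilter are placeholder names, not from this thread):

  // Placeholder names; the point is only that cache() already runs after every union.
  dstream.foreachRDD { rdd =>
    batchRDD = batchRDD.union(rdd.filter(myFilter)).cache()
  }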

 

Re:

 

“A more reliable way would be to do dstream.window(...) for the length of time 
you want to keep the data and then union that data with your RDD for further 
processing using transform.”

 

I think the actual requirement here is picking up and adding specific messages from EVERY DStream RDD to the Batch RDD, rather than “preserving” the messages from a specific sliding window and adding them to the Batch RDD.

 

This should be defined as the frequency of updates to the Batch RDD, and then dstream.window() should be used with a window length equal to that frequency.
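
A sketch of what I mean, with placeholder names (updateInterval, isInteresting, batchRDD) and an illustrative 15-minute update frequency:

  import org.apache.spark.streaming.Seconds

  val updateInterval = Seconds(900)
  dstream
    .filter(isInteresting)                  // pick the specific messages from every DStream RDD
    .window(updateInterval, updateInterval) // non-overlapping windows equal to the update frequency
    .foreachRDD { windowed =>
      batchRDD = batchRDD.union(windowed)   // fold each window into the Batch RDD exactly once
    }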

 

Can you also elaborate on why you consider the dstream.window approach more “reliable”?

 

From: Gerard Maas [mailto:gerard.m...@gmail.com] 
Sent: Tuesday, July 7, 2015 12:56 PM
To: Anand Nalya
Cc: spark users
Subject: Re:

 

Anand,

 

AFAIK, you will need to change two settings:

 

spark.streaming.unpersist = false // in order for Spark Streaming to not drop the raw RDD data

spark.cleaner.ttl = <some reasonable value in seconds>

 

Also be aware that the lineage of your union RDD will grow with each batch 
interval. You will need to break lineage often with cache(), and rely on the 
ttl for clean up.
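
In SparkConf terms, that is roughly the following (the TTL value is illustrative only):

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .set("spark.streaming.unpersist", "false") // keep the raw DStream RDDs around
    .set("spark.cleaner.ttl", "3600")          // seconds; the periodic cleaner reclaims them afterwards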

You will probably be on some tricky ground with this approach.

 

A more reliable way would be to do dstream.window(...) for the length of time 
you want to keep the data and then union that data with your RDD for further 
processing using transform.

Something like:

dstream.window(Seconds(900), Seconds(900)).transform(rdd => rdd union otherRdd)...

 

If you need an unbounded number of DStream batch intervals, consider writing the data to secondary storage instead.
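
For example, something along these lines (the HDFS path is just a placeholder):

  dstream.foreachRDD { (rdd, time) =>
    if (!rdd.isEmpty()) {
      rdd.saveAsTextFile(s"hdfs:///archive/batch-${time.milliseconds}")
    }
  }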

 

-kr, Gerard.

 

 

 

On Tue, Jul 7, 2015 at 1:34 PM, Anand Nalya <anand.na...@gmail.com> wrote:

Hi,

 

Suppose I have an RDD that is loaded from some file and then I also have a DStream that has data coming from some stream. I want to keep unioning some of the tuples from the DStream into my RDD. For this I can use something like this:

 

  var myRDD: RDD[(String, Long)] = sc.fromText...
  dstream.foreachRDD { rdd =>
    myRDD = myRDD.union(rdd.filter(myfilter))
  }
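
A self-contained version of the above, for context only; the file path, record format, socket source, and myFilter predicate are placeholders:

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.rdd.RDD
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  object UnionStreamIntoBatch {
    def main(args: Array[String]): Unit = {
      val sc  = new SparkContext(new SparkConf().setAppName("union-stream-into-batch"))
      val ssc = new StreamingContext(sc, Seconds(10))

      // Batch RDD loaded once from a file of tab-separated "key<TAB>count" lines (placeholder path).
      var myRDD: RDD[(String, Long)] = sc.textFile("hdfs:///data/seed.txt")
        .map(_.split("\t"))
        .map(a => (a(0), a(1).toLong))

      val myFilter: ((String, Long)) => Boolean = _._2 > 0 // placeholder predicate

      val dstream = ssc.socketTextStream("localhost", 9999)
        .map(_.split("\t"))
        .map(a => (a(0), a(1).toLong))

      // Keep unioning the filtered stream tuples into the batch RDD (a driver-side var).
      dstream.foreachRDD { rdd =>
        myRDD = myRDD.union(rdd.filter(myFilter))
      }

      ssc.start()
      ssc.awaitTermination()
    }
  }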

 

My question is: for how long will Spark keep the RDDs underlying the DStream around? Is there some configuration knob that can control that?

 

Regards,

Anand

 
