Requirements – see my abstracted interpretation below; what else do you need in
terms of requirements? The original question:
“Suppose I have an RDD that is loaded from some file and then I also have a
DStream that has data coming from some stream. I want to keep unioning some of the
tuples from the DStream into my RDD. For this I can use something like this:”
A formal requirements spec derived from the above: I think the actual
requirement here is picking up and adding Specific (filtered) Messages from
EVERY DStream RDD to the Batch RDD, rather than “preserving” (on top of that)
all messages from a sliding window and adding them to the Batch RDD. Such a
requirement should be defined as the Frequency of Updates to the Batch RDD and
what these updates are (e.g. specific filtered messages); dstream.window() can
then be made equal to that frequency.
Essentially, the update frequency can range from the filtered messages of Every
Single DStream RDD to the filtered messages of a SLIDING WINDOW (see the sketch below).
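As an illustration only, a minimal sketch of that mapping; the names filteredUpdates, keep, batchRDD and the Duration parameter are assumptions, not taken from the code in this thread:

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.Duration
import org.apache.spark.streaming.dstream.DStream

// Maps the "update frequency" requirement onto window parameters.
// updateFrequency is the agreed frequency of updates to the Batch RDD;
// keep is the predicate selecting the specific messages (e.g. myfilter).
def filteredUpdates(stream: DStream[(String, Long)],
                    batchRDD: RDD[(String, Long)],
                    updateFrequency: Duration,
                    keep: ((String, Long)) => Boolean): DStream[(String, Long)] =
  stream
    .filter(keep)                               // only the specific (filtered) messages
    .window(updateFrequency, updateFrequency)   // non-overlapping window = one update per frequency
    .transform(rdd => rdd.union(batchRDD))      // Batch RDD plus the latest filtered update

// Setting updateFrequency equal to the stream's batch interval gives the
// "every single DStream RDD" extreme; a longer duration gives the sliding-window one.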
Secondly, what do you call “mutable unioning”?
That was his initial code:

var myRDD: RDD[(String, Long)] = sc.fromText...
dstream.foreachRDD { rdd =>
  myRDD = myRDD.union(rdd.filter(myfilter))
}
Here is how it looks when persisting the result from every union, which is
supposed to produce a NEW, PERSISTENT, IMMUTABLE Batch RDD. Why is that supposed
to be less “stable/reliable”, and what are the exact technical reasons for that?

var myRDD: RDD[(String, Long)] = sc.fromText...
dstream.foreachRDD { rdd =>
  myRDD = myRDD.union(rdd.filter(myfilter)).cache()
}
From: Gerard Maas [mailto:[email protected]]
Sent: Tuesday, July 7, 2015 1:55 PM
To: Evo Eftimov
Cc: Anand Nalya; spark users
Subject: Re:
Evo,
I'd let the OP clarify the question. I'm not in a position to clarify his
requirements beyond what's written in the question.
Regarding window vs mutable union: window is a well-supported feature that
accumulates messages over time. The mutable unioning of RDDs is bound to cause
operational trouble, as there are no guarantees around data preservation and
it's unclear how one can produce 'cuts' of that union ready to be served to
some process/computation. Intuitively, it will 'explode' at some point.
-kr, Gerard.
On Tue, Jul 7, 2015 at 2:06 PM, Evo Eftimov <[email protected]> wrote:
spark.streaming.unpersist = false // in order for Spark Streaming to not drop the
raw RDD data
spark.cleaner.ttl = <some reasonable value in seconds>

Why is the above suggested, provided the persist/cache operation on the
constantly unioned Batch RDD will have to be invoked anyway (after every
union with a DStream RDD)? Besides, it will result in DStream RDDs accumulating
in RAM unnecessarily for the duration of the TTL.
re
“A more reliable way would be to do dstream.window(...) for the length of time
you want to keep the data and then union that data with your RDD for further
processing using transform.”
I think the actual requirement here is picking up and adding Specific Messages
from EVERY DStream RDD to the Batch RDD, rather than “preserving” messages from
a specific sliding window and adding them to the Batch RDD.
This should be defined as the Frequency of Updates to the Batch RDD, and
dstream.window() can then be set equal to that frequency.
Can you also elaborate on why you consider the dstream.window approach more
“reliable”?
From: Gerard Maas [mailto:[email protected]]
Sent: Tuesday, July 7, 2015 12:56 PM
To: Anand Nalya
Cc: spark users
Subject: Re:
Anand,
AFAIK, you will need to change two settings:
spark.streaming.unpersist = false // in order for Spark Streaming to not drop the
raw RDD data
spark.cleaner.ttl = <some reasonable value in seconds>
Also be aware that the lineage of your union RDD will grow with each batch
interval. You will need to break lineage often with cache(), and rely on the
ttl for clean up.
You will probably be in some tricky ground with this approach.
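Purely as an illustrative sketch of that approach (the app name, batch interval, TTL value, socket source and myfilter predicate below are assumptions, not from this thread):

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("union-example")                 // hypothetical app name
  .set("spark.streaming.unpersist", "false")   // keep the raw DStream RDDs around
  .set("spark.cleaner.ttl", "3600")            // rely on the TTL-based cleaner to reclaim them

val ssc = new StreamingContext(conf, Seconds(60))

var myRDD: RDD[(String, Long)] = ssc.sparkContext.emptyRDD[(String, Long)]  // stand-in for the file-loaded RDD
val myfilter = (t: (String, Long)) => t._2 > 0                              // hypothetical filter

val dstream = ssc.socketTextStream("localhost", 9999).map(line => (line, 1L))

dstream.foreachRDD { rdd =>
  // cache() after each union so the ever-growing lineage is not recomputed,
  // leaving cleanup of old data to spark.cleaner.ttl.
  myRDD = myRDD.union(rdd.filter(myfilter)).cache()
}

ssc.start()
ssc.awaitTermination()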
A more reliable way would be to do dstream.window(...) for the length of time
you want to keep the data and then union that data with your RDD for further
processing using transform.
Something like:
dstream.window(Seconds(900), Seconds(900)).transform(rdd => rdd union otherRdd)...
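To make that concrete, a minimal sketch of the pattern (the 900-second values come from the line above; windowedUnion and otherRdd are placeholder names):

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.dstream.DStream

// Keeps the last 15 minutes of stream data and unions each window with a static RDD,
// producing a fresh immutable union per window instead of mutating a shared var.
def windowedUnion(dstream: DStream[(String, Long)],
                  otherRdd: RDD[(String, Long)]): DStream[(String, Long)] =
  dstream
    .window(Seconds(900), Seconds(900))      // 15-minute window, sliding every 15 minutes
    .transform(rdd => rdd union otherRdd)    // each window's RDD unioned with the static RDD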
If you need an unbounded amount of dstream batch intervals, consider writing
the data to secondary storage instead.
-kr, Gerard.
On Tue, Jul 7, 2015 at 1:34 PM, Anand Nalya <[email protected]> wrote:
Hi,
Suppose I have an RDD that is loaded from some file and then I also have a
DStream that has data coming from some stream. I want to keep unioning some of the
tuples from the DStream into my RDD. For this I can use something like this:
var myRDD: RDD[(String, Long)] = sc.fromText...
dstream.foreachRDD { rdd =>
  myRDD = myRDD.union(rdd.filter(myfilter))
}
My question is: for how long will Spark keep the RDDs underlying the DStream
around? Is there some configuration knob that can control that?
Regards,
Anand