Yes, I mean there's nothing to keep you from using them together other than their very different lifetimes. That's probably the key here: if you need the streaming data to live a long time, it has to live in persistent storage first.
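For concreteness, here is a minimal sketch of that "persistent storage first" idea (Scala API; the socket source, 10-second batch interval, and HDFS paths are made up for illustration): each micro-batch is written out, and a plain batch RDD can be created over the accumulated files at any later time.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("ArchiveStream")
    val ssc = new StreamingContext(conf, Seconds(10))
    val sc = ssc.sparkContext

    val stream = ssc.socketTextStream("localhost", 9999)

    // Persist each micro-batch so its data outlives the batch interval.
    stream.foreachRDD { (rdd, time) =>
      if (!rdd.isEmpty()) {
        rdd.saveAsTextFile(s"hdfs:///data/stream-archive/batch-${time.milliseconds}")
      }
    }

    // At any later point (in this job or another), re-read everything
    // accumulated so far as an ordinary RDD.
    val archived = sc.textFile("hdfs:///data/stream-archive/*")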
I do exactly this and what you describe, for the same purpose. I don't believe there's any need for threads; an RDD is just bookkeeping about partitions, and that has to be re-assessed when the underlying data grows. But making a new RDD on the fly is easy. It's a "reference" to the data only. (Well, that changes if you cache the results, in which case you very much care about unpersisting the RDD before getting a different reference to all of the same data and more.)

On Wed, Apr 15, 2015 at 8:06 PM, Evo Eftimov <evo.efti...@isecc.com> wrote:
> Hi Sean, well there is certainly a difference between a "batch" RDD and a
> "streaming" RDD, and in the previous reply you have already outlined some.
> Other differences are in the object-oriented model / API of Spark, which
> also matters besides the RDD / Spark cluster platform architecture.
>
> Secondly, in the previous email I clearly described what I mean by
> "update": it is the result of an RDD transformation and hence a new RDD
> derived from the previously joined/unioned/cogrouped one - i.e. not
> "mutating" an existing RDD.
>
> Let's also leave aside the architectural goal of why I want to keep
> updating a batch RDD with new data coming from DStream RDDs - FYI it is
> NOT to "make streaming RDDs long-lived".
>
> Let me now go back to the overall objective - the app context is a Spark
> Streaming job. I want to "update" / "add" the content of incoming
> streaming RDDs (e.g. from a JavaDStream) to an already loaded (e.g. from
> an HDFS file) batch RDD, e.g. a JavaRDD. The only way to union / join /
> cogroup a DStream RDD with a batch RDD is via the "transform" method,
> which always returns a DStream RDD, NOT a batch RDD - check the API.
>
> On a separate note - your suggestion to keep reloading a batch RDD from a
> file may have some applications in other scenarios, so let's drill down
> into it. In the context of a Spark Streaming app, where the driver
> launches a DAG pipeline and then essentially just hangs, I guess the only
> way to keep reloading a batch RDD from a file is from a separate thread
> still using the same Spark context. The thread would reload the batch RDD
> under the same reference, i.e. reassign the reference to the newly
> instantiated/loaded batch RDD. Is that what you mean by reloading a batch
> RDD from a file?
>
> -----Original Message-----
> From: Sean Owen [mailto:so...@cloudera.com]
> Sent: Wednesday, April 15, 2015 7:43 PM
> To: Evo Eftimov
> Cc: user@spark.apache.org
> Subject: Re: adding new elements to batch RDD from DStream RDD
>
> What do you mean by "batch RDD"? They're just RDDs, though they store
> their data in different ways and come from different sources. You can
> union an RDD from an HDFS file with one from a DStream.
>
> It sounds like you want streaming data to live longer than its batch
> interval, but that's not something you can expect the streaming framework
> to provide. It's perfectly possible to save the RDD's data to persistent
> store and use it later.
>
> You can't update RDDs; they're immutable. You can re-read data from
> persistent store by making a new RDD at any time.
>
> On Wed, Apr 15, 2015 at 7:37 PM, Evo Eftimov <evo.efti...@isecc.com> wrote:
>> The only way to join / union / cogroup a DStream RDD with a batch RDD is
>> via the "transform" method, which returns another DStream RDD and
>> hence it gets discarded at the end of the micro-batch.
>>
>> Is there any way to e.g. union a DStream RDD with a batch RDD which
>> produces a new batch RDD containing the elements of both the DStream
>> RDD and the batch RDD?
>>
>> And once such a batch RDD is created in the above way, can it be used by
>> other DStream RDDs to e.g. join with, since this time the result can be
>> another DStream RDD?
>>
>> Effectively the functionality described above would result in periodic
>> updates (additions) of elements to a batch RDD - the additional elements
>> keep coming from the DStream RDDs which keep streaming in with every
>> micro-batch. Also, newly arriving DStream RDDs would be able to join with
>> the thus previously updated batch RDD and produce a resulting DStream RDD.
>>
>> Something almost like that can be achieved with updateStateByKey, but is
>> there a way to do it as described here?
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/adding-new-elements-to-batch-RDD-from-DStream-RDD-tp22504.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
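To tie the thread together, here is a sketch of the pattern Sean describes, under stated assumptions (Scala API; the socket source, HDFS path, and comma-separated key layout are hypothetical): the "batch" RDD is simply re-created inside transform each interval, so it picks up whatever has been appended to persistent storage since the last micro-batch, and no separate reloader thread is needed.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("JoinStreamWithBatch")
    val ssc = new StreamingContext(conf, Seconds(10))
    val sc = ssc.sparkContext

    // Streaming side, keyed by the first comma-separated field (made-up format).
    val stream = ssc.socketTextStream("localhost", 9999)
      .map(line => (line.split(",")(0), line))

    // The function passed to transform runs on the driver once per micro-batch,
    // so the "batch" RDD can be re-created there. It is only lineage metadata
    // until an action runs, and re-creating it picks up newly written files.
    val joined = stream.transform { rdd =>
      val batch = sc.textFile("hdfs:///data/reference/*")
        .map(line => (line.split(",")(0), line))
      rdd.join(batch)
    }

    joined.print()
    ssc.start()
    ssc.awaitTermination()

If the re-created batch RDD is cached, the previously cached copy should be unpersisted first, as noted above. Combined with the first sketch (appending each micro-batch's output to the same store), this gives the effect of a batch dataset that keeps growing with the stream while every new micro-batch can still join against it.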