Yes, I mean there's nothing to keep you from using them together other than their very different lifetimes. That's probably the key here: if you need the streaming data to live a long time, it has to live in persistent storage first.
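For concreteness, here is a minimal sketch of that "persistent storage first" idea (Scala API; the socket source, 10-second batch interval, and HDFS paths are made up for illustration): each micro-batch is written out, and a plain batch RDD can be created over the accumulated files at any later time.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("ArchiveStream")
    val ssc = new StreamingContext(conf, Seconds(10))
    val sc = ssc.sparkContext

    val stream = ssc.socketTextStream("localhost", 9999)

    // Persist each micro-batch so its data outlives the batch interval.
    stream.foreachRDD { (rdd, time) =>
      if (!rdd.isEmpty()) {
        rdd.saveAsTextFile(s"hdfs:///data/stream-archive/batch-${time.milliseconds}")
      }
    }

    // At any later point (in this job or another), re-read everything
    // accumulated so far as an ordinary RDD.
    val archived = sc.textFile("hdfs:///data/stream-archive/*")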
I do exactly this and what you describe, for the same purpose. I don't believe there's any need for threads; an RDD is just bookkeeping about partitions, and that has to be re-assessed when the underlying data grows. But making a new RDD on the fly is easy. It's a "reference" to the data only. (Well, that changes if you cache the results, in which case you very much care about unpersisting the RDD before getting a different reference to all of the same data and more.)

On Wed, Apr 15, 2015 at 8:06 PM, Evo Eftimov <evo.efti...@isecc.com> wrote:
> Hi Sean, well there is certainly a difference between a "batch" RDD and a
> "streaming" RDD, and in the previous reply you have already outlined some.
> Other differences are in the object-oriented model / API of Spark, which
> also matters besides the RDD / Spark cluster platform architecture.
>
> Secondly, in the previous email I clearly described what I mean by
> "update": it is the result of an RDD transformation and hence a new RDD
> derived from the previously joined/unioned/cogrouped one - i.e. not
> "mutating" an existing RDD.
>
> Let's also leave aside the architectural goal of why I want to keep
> updating a batch RDD with new data coming from DStream RDDs - FYI it is
> NOT to "make streaming RDDs long-lived".
>
> Let me now go back to the overall objective - the app context is a Spark
> Streaming job. I want to "update" / "add" the content of incoming
> streaming RDDs (e.g. from a JavaDStream) to an already loaded (e.g. from
> an HDFS file) batch RDD, e.g. a JavaRDD. The only way to union / join /
> cogroup a DStream RDD with a batch RDD is via the "transform" method,
> which always returns a DStream RDD, NOT a batch RDD - check the API.
>
> On a separate note - your suggestion to keep reloading a batch RDD from a
> file may have some applications in other scenarios, so let's drill down
> into it. In the context of a Spark Streaming app, where the driver
> launches a DAG pipeline and then essentially just hangs, I guess the only
> way to keep reloading a batch RDD from a file is from a separate thread
> still using the same Spark context. The thread would reload the batch RDD
> under the same reference, i.e. reassign the reference to the newly
> instantiated/loaded batch RDD. Is that what you mean by reloading a batch
> RDD from a file?
>
> -----Original Message-----
> From: Sean Owen [mailto:so...@cloudera.com]
> Sent: Wednesday, April 15, 2015 7:43 PM
> To: Evo Eftimov
> Cc: user@spark.apache.org
> Subject: Re: adding new elements to batch RDD from DStream RDD
>
> What do you mean by "batch RDD"? They're just RDDs, though they store
> their data in different ways and come from different sources. You can
> union an RDD from an HDFS file with one from a DStream.
>
> It sounds like you want streaming data to live longer than its batch
> interval, but that's not something you can expect the streaming framework
> to provide. It's perfectly possible to save the RDD's data to persistent
> store and use it later.
>
> You can't update RDDs; they're immutable. You can re-read data from
> persistent store by making a new RDD at any time.
>
> On Wed, Apr 15, 2015 at 7:37 PM, Evo Eftimov <evo.efti...@isecc.com> wrote:
>> The only way to join / union / cogroup a DStream RDD with a batch RDD is
>> via the "transform" method, which returns another DStream RDD and
>> hence it gets discarded at the end of the micro-batch.
>>
>> Is there any way to e.g. union a DStream RDD with a batch RDD which
>> produces a new batch RDD containing the elements of both the DStream
>> RDD and the batch RDD?
>>
>> And once such a batch RDD is created in the above way, can it be used by
>> other DStream RDDs to e.g. join with, since this time the result can be
>> another DStream RDD?
>>
>> Effectively the functionality described above would result in periodic
>> updates (additions) of elements to a batch RDD - the additional elements
>> keep coming from the DStream RDDs which keep streaming in with every
>> micro-batch. Also, newly arriving DStream RDDs would be able to join with
>> the thus previously updated batch RDD and produce a resulting DStream RDD.
>>
>> Something almost like that can be achieved with updateStateByKey, but is
>> there a way to do it as described here?
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/adding-new-elements-to-batch-RDD-from-DStream-RDD-tp22504.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
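To tie the thread together, here is a sketch of the pattern Sean describes, under stated assumptions (Scala API; the socket source, HDFS path, and comma-separated key layout are hypothetical): the "batch" RDD is simply re-created inside transform each interval, so it picks up whatever has been appended to persistent storage since the last micro-batch, and no separate reloader thread is needed.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("JoinStreamWithBatch")
    val ssc = new StreamingContext(conf, Seconds(10))
    val sc = ssc.sparkContext

    // Streaming side, keyed by the first comma-separated field (made-up format).
    val stream = ssc.socketTextStream("localhost", 9999)
      .map(line => (line.split(",")(0), line))

    // The function passed to transform runs on the driver once per micro-batch,
    // so the "batch" RDD can be re-created there. It is only lineage metadata
    // until an action runs, and re-creating it picks up newly written files.
    val joined = stream.transform { rdd =>
      val batch = sc.textFile("hdfs:///data/reference/*")
        .map(line => (line.split(",")(0), line))
      rdd.join(batch)
    }

    joined.print()
    ssc.start()
    ssc.awaitTermination()

If the re-created batch RDD is cached, the previously cached copy should be unpersisted first, as noted above. Combined with the first sketch (appending each micro-batch's output to the same store), this gives the effect of a batch dataset that keeps growing with the stream while every new micro-batch can still join against it.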