Another approach not mentioned so far is to use a function to get the RDD that is to be joined. Something like this:
kvDstream.foreachRDD(rdd => {
  val refRdd = getOrUpdateRDD(....params...)
  rdd.join(refRdd)
})

The getOrUpdateRDD() function that you implement will get called every batch interval, and you can decide whether to return the same RDD or an updated one. Once updated, if the RDD is going to be used across multiple batch intervals, you should cache it. Furthermore, if you are going to join it, you should partition it with a partitioner, then cache it, and make sure the same partitioner is used for the join. That is more efficient, as the RDD stays partitioned in memory, minimizing the cost of the join. (Fuller sketches of this approach, and of the DStream-aggregation idea quoted below, appear at the end of this message.)

On Wed, Jun 10, 2015 at 9:08 AM, Evo Eftimov <evo.efti...@isecc.com> wrote:

> It depends on how big the Batch RDD requiring reloading is.
>
> Reloading it for EVERY single DStream RDD would slow down the stream processing in line with the total time required to reload the Batch RDD.
>
> But if the Batch RDD is not that big, then that might not be an issue, especially in the context of the latency requirements for your streaming app.
>
> Another, more efficient and real-time approach may be to represent your Batch RDD as DStream RDDs, keep aggregating them over the lifetime of the Spark Streaming app instance, and keep joining them with the actual DStream RDDs.
>
> You can feed your HDFS file into a message broker topic and consume it from there in the form of DStream RDDs, which you keep aggregating over the lifetime of the Spark Streaming app instance.
>
> From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
> Sent: Wednesday, June 10, 2015 8:36 AM
> To: Ilove Data
> Cc: user@spark.apache.org
> Subject: Re: Join between DStream and Periodically-Changing-RDD
>
> RDDs are immutable, so why not join two DStreams?
>
> Not sure, but you can try something like this also:
>
> kvDstream.foreachRDD(rdd => {
>
>   val file = ssc.sparkContext.textFile("/sigmoid/")
>   val kvFile = file.map(x => (x.split(",")(0), x))
>
>   rdd.join(kvFile)
>
> })
>
> Thanks
> Best Regards
>
> On Tue, Jun 9, 2015 at 7:37 PM, Ilove Data <data4...@gmail.com> wrote:
>
> Hi,
>
> I'm trying to join a DStream with a batch interval of, say, 20s against an RDD loaded from an HDFS folder that changes periodically, say a new file arriving in the folder every 10 minutes.
>
> How should this be done, considering the HDFS files in the folder are periodically changing and new files are being added? Does an RDD automatically detect changes in the HDFS folder it was loaded from and reload itself?
>
> Thanks!
> Rendy
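Here is a minimal sketch of the getOrUpdateRDD() idea described at the top of this thread, with the caching and shared-partitioner advice applied. The names (ReferenceData, refPath, reloadIntervalMs, the number of partitions, the "/path/to/reference" location) are all made up for illustration and assume kvDstream is a DStream[(String, String)]; only the cache + partitionBy + same-partitioner-join pattern is the point.

  import org.apache.spark.{HashPartitioner, SparkContext}
  import org.apache.spark.rdd.RDD

  object ReferenceData {
    // Same partitioner is used for pre-partitioning the reference RDD and for the join.
    val partitioner = new HashPartitioner(8)
    private val reloadIntervalMs = 10 * 60 * 1000L   // refresh at most every 10 minutes
    private var refRdd: RDD[(String, String)] = _
    private var lastLoadTime = 0L

    def getOrUpdateRDD(sc: SparkContext, refPath: String): RDD[(String, String)] = synchronized {
      val now = System.currentTimeMillis()
      if (refRdd == null || now - lastLoadTime > reloadIntervalMs) {
        if (refRdd != null) refRdd.unpersist()          // drop the stale copy
        refRdd = sc.textFile(refPath)
          .map(line => (line.split(",")(0), line))      // key by the first CSV column
          .partitionBy(partitioner)                     // pre-partition once...
          .cache()                                      // ...and keep it in memory across batches
        lastLoadTime = now
      }
      refRdd
    }
  }

  // foreachRDD runs its body on the driver every batch interval,
  // so the reload decision happens there, not on the executors.
  kvDstream.foreachRDD { rdd =>
    val ref = ReferenceData.getOrUpdateRDD(rdd.sparkContext, "/path/to/reference")
    // Passing the same partitioner to join() means the cached side is not reshuffled.
    rdd.join(ref, ReferenceData.partitioner).count()    // replace count() with the real output action
  }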
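And a sketch of the approach in Evo's quoted reply, where the reference data is itself consumed as a DStream (for example from a message broker topic fed from the HDFS files) and accumulated over the lifetime of the app. This uses updateStateByKey to keep the latest reference value per key; refUpdates, the key/value layout, and the checkpoint path are assumptions, and the broker/receiver setup is omitted.

  import org.apache.spark.streaming.dstream.DStream

  def joinWithAccumulatedReference(
      data: DStream[(String, String)],
      refUpdates: DStream[(String, String)]): DStream[(String, (String, String))] = {

    // Keep the most recent reference value seen so far for each key, across batches.
    val refState: DStream[(String, String)] = refUpdates.updateStateByKey[String] {
      (newValues: Seq[String], current: Option[String]) =>
        newValues.lastOption.orElse(current)
    }

    // Every batch, join that batch's data RDD with the accumulated reference state.
    data.join(refState)
  }

  // updateStateByKey requires checkpointing to be enabled, e.g.:
  // ssc.checkpoint("/path/to/checkpoint")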