Hi all,
Recently in our project we have needed to update an RDD using data regularly
received from a DStream. I plan to use the "foreachRDD" API to achieve this:
var MyRDD = ...
dstream.foreachRDD { rdd =>
  MyRDD = MyRDD.join(rdd)...
  ...
}
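For concreteness, here is a minimal runnable sketch of the kind of thing I
mean; the socket source, the "key value" input format, and the
leftOuterJoin/fold logic are just placeholders, not our real data or logic:

import org.apache.spark.SparkContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sc  = new SparkContext("local[2]", "rdd-update-demo")
val ssc = new StreamingContext(sc, Seconds(10))

// Long-lived state: a pair RDD keyed by some id (placeholder data).
var MyRDD = sc.parallelize(Seq(("a", 1), ("b", 2)))

// Each batch arrives as lines of "key value" text (placeholder format).
val dstream = ssc.socketTextStream("localhost", 9999)
  .map(_.split(" "))
  .map(arr => (arr(0), arr(1).toInt))

dstream.foreachRDD { batch =>
  // foreachRDD runs on the driver, so reassigning the driver-side var works,
  // but every reassignment extends MyRDD's lineage by one more join.
  MyRDD = MyRDD
    .leftOuterJoin(batch)
    .mapValues { case (old, update) => old + update.getOrElse(0) }
}

ssc.start()
ssc.awaitTermination()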
Is this usage correct? My concern is that, as I am repeatedly and endlessly
reassigning MyRDD in order to update it, the RDD lineage will grow too long to
process when I want to query MyRDD later on (similar to
https://issues.apache.org/jira/browse/SPARK-4672). Is that a real risk here?
Maybe I should:
1. cache or checkpoint the latest MyRDD and unpersist the old MyRDD every time
a new batch comes in (a rough sketch of what I mean follows this list), or
2. use the unpublished IndexedRDD
(https://github.com/amplab/spark-indexedrdd) to perform efficient RDD
updates.
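If option 1 is the right direction, I imagine something along these lines,
building on the sketch above; the checkpoint directory and the
every-10-batches checkpoint interval are made-up placeholders:

sc.setCheckpointDir("hdfs:///tmp/myrdd-checkpoints")  // placeholder path

var batchCount = 0L

dstream.foreachRDD { batch =>
  val oldRDD = MyRDD

  MyRDD = MyRDD
    .leftOuterJoin(batch)
    .mapValues { case (old, update) => old + update.getOrElse(0) }
    .cache()

  batchCount += 1
  if (batchCount % 10 == 0) {
    // Mark for checkpointing before the first action so the lineage is
    // actually truncated when the RDD is materialized below.
    MyRDD.checkpoint()
  }

  // Force materialization so the cache (and checkpoint, when marked) happen now.
  MyRDD.count()

  // Drop the previous generation from the cache; a no-op if it was never cached.
  oldRDD.unpersist(blocking = false)
}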
As I lack experience with Spark Streaming and IndexedRDD, I am posting here to
make sure my thoughts are on the right track. Your suggestions will be
greatly appreciated.
-----
Feel the sparking Spark!