Re: Join two Spark Streaming

2016-06-07 Thread vinay453
I am working on windowed DStreams wherein each DStream contains 3 RDDs with the following keys: a,b,c / b,c,d / c,d,e / d,e,f. I want to get only the unique keys across all of these RDDs: a,b,c,d,e,f. How can I do this in PySpark Streaming? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.c
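One way to do this in PySpark Streaming is to window the DStream and apply distinct() inside transform(). Below is a minimal sketch; the socket source, the comma-separated input format, and the 10-second batch / 30-second window durations are assumptions for illustration, not from the original post.

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="UniqueKeysAcrossWindow")
    ssc = StreamingContext(sc, 10)                      # 10-second batches (assumed)

    lines = ssc.socketTextStream("localhost", 9999)     # assumed input source
    keys = lines.flatMap(lambda line: line.split(","))

    # Window over the last 30 seconds (3 batches), then keep only the distinct keys,
    # so a,b,c / b,c,d / c,d,e collapses to a,b,c,d,e for that window.
    unique_keys = keys.window(30, 10).transform(lambda rdd: rdd.distinct())
    unique_keys.pprint()

    ssc.start()
    ssc.awaitTermination()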

Re: Join two Spark Streaming

2014-07-11 Thread Tathagata Das
1. Since the RDD of the previous batch is used to create the RDD of the next batch, the lineage of dependencies in the RDDs continues to grow indefinitely. That's not good because it increases fault-recovery times, task sizes, etc. Checkpointing saves the data of an RDD to HDFS and truncates the lineage.
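A small, self-contained PySpark illustration of the problem Tathagata describes and of how checkpointing cuts the lineage (the checkpoint directory, the number of batches, and the every-25-batches interval are illustrative assumptions):

    from pyspark import SparkContext

    sc = SparkContext(appName="LineageCheckpointDemo")
    sc.setCheckpointDir("/tmp/spark-checkpoints")   # assumed path; use an HDFS path in a real job

    running = sc.parallelize(range(0, 10))          # "batch 0"
    for batch in range(1, 100):
        new_data = sc.parallelize(range(batch * 10, (batch + 1) * 10))
        # the new RDD is built from the previous one, so the dependency
        # lineage keeps growing with every batch
        running = running.union(new_data).distinct()
        if batch % 25 == 0:
            running.checkpoint()   # save to the checkpoint dir and truncate the lineage
            running.count()        # an action is needed to actually materialize the checkpoint

    print(running.count())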

Re: Join two Spark Streaming

2014-07-11 Thread Bill Jay
Hi Tathagata, Thanks for the solution. Actually, I will use the number of unique integers in each batch instead of the cumulative number of unique integers. I do have two questions about your code: 1. Why do we need uniqueValuesRDD? Why do we need to call uniqueValuesRDD.checkpoint()? 2. Where is
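For the per-batch case Bill describes (the count of unique integers in each batch rather than the cumulative count), no state needs to be carried across batches, so neither uniqueValuesRDD nor checkpointing is involved. A minimal PySpark sketch, assuming a DStream of integers named dstream_of_integers:

    # count of distinct integers per batch; no state kept across batches
    unique_per_batch = dstream_of_integers.transform(lambda rdd: rdd.distinct()).count()
    unique_per_batch.pprint()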

Re: Join two Spark Streaming

2014-07-10 Thread Tathagata Das
Do you want to continuously maintain the set of unique integers seen since the beginning of the stream?

    var uniqueValuesRDD: RDD[Int] = ...

    dstreamOfIntegers.transform(newDataRDD => {
      // union the new batch with everything seen so far and de-duplicate
      val newUniqueValuesRDD = newDataRDD.union(uniqueValuesRDD).distinct
      uniqueValuesRDD = newUniqueValuesRDD
      // checkpoint to save to HDFS and truncate the growing lineage
      uniqueValuesRDD.checkpoint()
      uniqueValuesRDD
    })