I am working with windowed DStreams, where each DStream contains 3 RDDs with the
following keys:
a,b,c
b,c,d
c,d,e
d,e,f
I want to get only the unique keys across all DStreams:
a,b,c,d,e,f
How can I do this in PySpark Streaming?
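For reference, a rough PySpark Streaming sketch of one way to do this (not from the original thread; the socket source, 1-second batch interval, and 4-batch window are made-up example values):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="UniqueKeysInWindow")
ssc = StreamingContext(sc, 1)  # 1-second batches (example value)

# Hypothetical source: lines like "a,b,c" arriving on a socket.
keys = ssc.socketTextStream("localhost", 9999).flatMap(lambda line: line.split(","))

# Collapse each 4-batch window into its distinct keys.
unique_keys = keys.window(4, 4).transform(lambda rdd: rdd.distinct())
unique_keys.pprint()

ssc.start()
ssc.awaitTermination()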
1. Since the RDD of the previous batch is used to create the RDD of the
next batch, the lineage of dependencies in the RDDs continues to grow
infinitely. That's not good because it increases fault-recovery times,
task sizes, etc. Checkpointing saves the data of an RDD to HDFS and
truncates the lineage.
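As a rough illustration of that point (hypothetical code, not from the thread; the checkpoint directory, loop, and interval are made up):

from pyspark import SparkContext

sc = SparkContext(appName="CheckpointDemo")
sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")  # where checkpointed RDD data is written

rdd = sc.parallelize(range(1000))
for i in range(1, 101):
    # Each iteration derives a new RDD from the previous one, so the lineage keeps growing.
    rdd = rdd.map(lambda x: x + 1)
    if i % 10 == 0:
        rdd.checkpoint()  # mark the RDD to be saved and have its lineage truncated
        rdd.count()       # run an action so the checkpoint actually materializes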
Hi Tathagata,
Thanks for the solution. Actually, I will use the number of unique integers
in the batch instead of the cumulative number of unique integers.
I do have two questions about your code:
1. Why do we need uniqueValuesRDD? Why do we need to call
uniqueValuesRDD.checkpoint()?
2. Where is
Do you want to continuously maintain the set of unique integers seen since
the beginning of the stream?
var uniqueValuesRDD: RDD[Int] = ...

dstreamOfIntegers.transform(newDataRDD => {
  // Merge the new batch into the running set of unique values seen so far.
  val newUniqueValuesRDD = newDataRDD.union(uniqueValuesRDD).distinct
  uniqueValuesRDD = newUniqueValuesRDD
  // Periodically call uniqueValuesRDD.checkpoint() to truncate the growing lineage.
  ...
})
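Since the original question asked about PySpark, here is a rough sketch of the same pattern in PySpark (an illustration under assumptions, not code from the thread; the socket source, checkpoint directory, checkpoint interval, and the driver-side dict holding state are all made up):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="CumulativeUniqueValues")
sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")  # required before calling checkpoint()
ssc = StreamingContext(sc, 1)

integers = ssc.socketTextStream("localhost", 9999).map(int)  # hypothetical source

state = {"unique": None, "batches": 0}  # driver-side running state

def merge_uniques(new_rdd):
    # Union the new batch into everything seen so far and de-duplicate.
    merged = new_rdd if state["unique"] is None else new_rdd.union(state["unique"])
    merged = merged.distinct()
    state["unique"] = merged
    state["batches"] += 1
    if state["batches"] % 10 == 0:
        merged.checkpoint()  # periodically truncate the growing lineage
    return merged

integers.transform(merge_uniques).pprint()

ssc.start()
ssc.awaitTermination()

As in the Scala version, the running RDD has to be checkpointed every so often, otherwise the union/distinct chain grows without bound.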