I want to do a hash-based comparison to find duplicate records. Each record I receive from the stream has hashid and recordid fields in it.
1. I want to keep all the historic records (hashid -> recordid, as key/value pairs) in an in-memory RDD.
2. When a new record arrives in the Spark DStream RDD, I want to compare it against the historic records (hashid, recordid).
3. I also want to add the new records to the existing historic (hashid -> recordid) in-memory RDD.

My thoughts:
1. Join the time-based RDDs and cache them in memory as the historic lookup.
2. When a new RDD arrives, compare each of its records against the historic lookup.

What I have done:
1. I have set up the streaming pipeline and am able to consume the records.
2. But I am not sure how to store the records in memory.

I have the following questions:
1. How can I achieve this, or is there a workaround?
2. Can I do this using MLlib, or is Spark Streaming the right fit for my use case?
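For what it's worth, here is a rough, untested sketch of what I have in mind: keep the historic (hashid -> recordid) lookup as Spark Streaming state via updateStateByKey, then join each incoming batch against that state to flag duplicates. The socket source, port, batch interval, and checkpoint path are placeholders, not part of my actual setup.

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamDedupSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("stream-dedup").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint("/tmp/dedup-checkpoint")  // stateful ops need a checkpoint dir

    // Placeholder source: lines of "hashid,recordid" arriving on a socket.
    val records = ssc.socketTextStream("localhost", 9999)
      .map(_.split(","))
      .map(a => (a(0), a(1)))                // (hashid, recordid)

    // Historic lookup: keep the first recordid ever seen for each hashid.
    val historic = records.updateStateByKey[String] {
      (newValues: Seq[String], state: Option[String]) =>
        state.orElse(newValues.headOption)
    }

    // A record is a duplicate if its hashid already maps to an earlier,
    // different recordid in the historic state.
    val flagged = records.transformWith(historic,
      (batch: RDD[(String, String)], hist: RDD[(String, String)]) =>
        batch.leftOuterJoin(hist).map {
          case (hashid, (recordid, seen)) =>
            (hashid, recordid, seen.exists(_ != recordid))
        })

    flagged.print()
    ssc.start()
    ssc.awaitTermination()
  }
}

I realize updateStateByKey recomputes the full state each batch, so if the historic set grows large, mapWithState (Spark 1.6+) might be a better fit. Does this approach make sense, or is there a more idiomatic way?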