I want to do hash-based comparison to find duplicate records. Each record I
receive from the stream has a hashid and a recordid field in it.

1. I want to keep all the historic records (hashid, recordid --> key, value)
in an in-memory RDD.
2. When a new record arrives in the Spark DStream RDD, I want to compare it
against the historic records (hashid, recordid); see the sketch after this
list.
3. I also want to add the new records to the existing historic records
(hashid, recordid --> key, value) in the in-memory RDD.

My thoughts:


1. Join the time-based RDDs and cache them in memory as a historic lookup.
2. When a new RDD arrives, compare each of its records against the historic
lookup (a sketch of this follows the list).
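
To follow that join/cache idea literally, another option would be to hold
the historic pairs in a driver-side variable and fold each micro-batch into
it with foreachRDD. A rough sketch, assuming the `records` DStream of
(hashid, recordid) pairs from the previous sketch; note that the union
lineage grows every batch, so the lookup would need periodic checkpointing
in practice:

import org.apache.spark.rdd.RDD

// Historic lookup, held as a driver-side var and rebuilt each batch.
var historic: RDD[(String, String)] =
  ssc.sparkContext.emptyRDD[(String, String)]

records.foreachRDD { batch =>
  // Any record whose hashid already exists in the lookup is a duplicate.
  val dups = batch.join(historic) // (hashid, (newRecordId, oldRecordId))
  println(s"duplicates in this batch: ${dups.count()}")

  // Fold only the genuinely new records into the cached lookup.
  val fresh = batch.subtractByKey(historic)
  val updated = historic.union(fresh).cache()
  updated.count()       // materialize before dropping the old cache
  historic.unpersist()
  historic = updated
}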

What I have done:


1. I have created the streaming pipeline and am able to consume the records.
2. But I am not sure how to store them in memory.

I have the following questions:


1. How can I achieve this, or is there a workaround?
2. Can I do this using MLlib, or is Spark Streaming the right fit for my use
case?


