I need to deduplicate records in Spark. My current plan is to generate 
tuples where the key is the dedup criteria and the value is the original 
input, then use reduceByKey to discard duplicate values. If I do that, can I 
simply return the first argument, or should I return a copy of the first 
argument? Is there a better way to do dedup in Spark?
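
A minimal sketch of this plan, assuming PySpark and records stored as dicts. The names dedup_key, keep_first, dedup_rdd, and dedup_local are hypothetical, and the (name, email) criteria is only a placeholder. dedup_local is a plain-Python model of the same steps, so the idea can be tried without a cluster.

```python
def dedup_key(record):
    # Hypothetical dedup criteria: records sharing (name, email) count as duplicates.
    return (record["name"], record["email"])


def keep_first(a, b):
    # Return the first argument as-is; no copy is made in this sketch.
    return a


def dedup_rdd(rdd):
    # Assumes `rdd` is a PySpark RDD of dict records.
    # keyBy builds (key, record) tuples; reduceByKey keeps one record per key.
    return rdd.keyBy(dedup_key).reduceByKey(keep_first).values()


def dedup_local(records):
    # Plain-Python model of the same steps, for illustration only.
    survivors = {}
    for record in records:
        key = dedup_key(record)
        if key in survivors:
            survivors[key] = keep_first(survivors[key], record)
        else:
            survivors[key] = record
    return list(survivors.values())
```

In the local model the first record seen for each key survives. On a cluster, reduceByKey may combine values in any order, so which duplicate is kept is not necessarily the first one in the input.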

-Yao
