Dedup

Ge, Yao (Y.) Wed, 08 Oct 2014 12:38:59 -0700

I need to do deduplication in Spark. The current plan is to generate a tuple 
where the key is the dedup criteria and the value is the original input. I am 
thinking of using reduceByKey to discard duplicate values. If I do that, can I 
simply return the first argument, or should I return a copy of the first 
argument? Is there a better way to do dedup in Spark?


-Yao
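
For illustration, the plan described in the question can be sketched as below. 
This is a minimal sketch, not code from the original post: it emulates the 
reduceByKey merge step locally in plain Python so it runs without a cluster. 
The names keep_first, reduce_by_key, dedup, and key_of, and the sample records, 
are all hypothetical. The reducer returns its first argument unchanged, without 
copying it.

```python
def keep_first(a, b):
    """Reducer for duplicates: keep the first value seen, drop the second."""
    return a


def reduce_by_key(pairs, func):
    """Local stand-in (an assumption of this sketch) for an RDD's reduceByKey:
    merge all values that share a key using func."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return list(merged.items())


def dedup(records, key_of):
    """Pair each record with its dedup key, reduce by key, keep the values."""
    pairs = [(key_of(r), r) for r in records]
    return [value for _, value in reduce_by_key(pairs, keep_first)]


# Hypothetical sample input: two records share the same email.
records = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": "b@example.com"},
    {"id": 3, "email": "a@example.com"},
]

unique = dedup(records, key_of=lambda r: r["email"])

# With PySpark the same shape would be, for an existing RDD named rdd:
#   rdd.map(lambda r: (key_of(r), r)).reduceByKey(keep_first).values()
```

In this local emulation the first record in input order survives for each key. 
On a real cluster the order in which values are merged is not guaranteed, so 
which duplicate survives should be treated as arbitrary unless the reducer 
compares the two values explicitly (for example, by timestamp).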
