I think the question is about copying the argument. If it's an immutable value like a String, then yes: just return the first argument and ignore the second. If you're dealing with a notoriously mutable value like a Hadoop Writable, you need to copy the value you return, since Hadoop reuses Writable instances.
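In code, the "keep the first" reducer is just a two-argument function that ignores its second argument. A minimal sketch of the idea, using plain Python as a stand-in for Spark's reduceByKey (the `reduce_by_key` helper here is illustrative, not a Spark API):

```python
from functools import reduce
from collections import defaultdict

def reduce_by_key(pairs, f):
    # Minimal stand-in for Spark's reduceByKey: group values by key,
    # then fold each group with the supplied reducer.
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return {k: reduce(f, vs) for k, vs in groups.items()}

records = [("k1", "a"), ("k1", "a-dup"), ("k2", "b")]
# Dedup by keeping the first value per key: the reducer ignores y.
deduped = reduce_by_key(records, lambda x, y: x)
# deduped == {"k1": "a", "k2": "b"}
```

With real Spark RDDs this would be `rdd.reduceByKey((x, y) => x)` in Scala; the key point is that the reducer may safely discard its second argument only when values are immutable.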
This works fine, although you will spend a fair bit of time shuffling all of those duplicates together just to discard all but one. If there are lots of duplicates, it would take a bit more work, but would be faster, to do something like this: use mapPartitions to retain one input value per unique dedup key within each partition, output those pairs, and then reduceByKey the result.

On Wed, Oct 8, 2014 at 8:37 PM, Ge, Yao (Y.) <y...@ford.com> wrote:
> I need to do deduplication processing in Spark. The current plan is to
> generate a tuple where the key is the dedup criteria and the value is the
> original input. I am thinking of using reduceByKey to discard duplicate
> values. If I do that, can I simply return the first argument, or should I
> return a copy of the first argument? Is there a better way to do dedup in
> Spark?
>
> -Yao
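A sketch of that two-phase approach, again with plain Python standing in for mapPartitions and reduceByKey (partition layout and names are made up for illustration). Deduplicating within each partition first means duplicates never cross the network; the global merge only reconciles one survivor per key per partition:

```python
def dedup_partition(iterator):
    # What the mapPartitions function would do: retain the first value
    # seen per dedup key, so duplicates never leave the partition.
    seen = {}
    for k, v in iterator:
        if k not in seen:
            seen[k] = v
    return iter(seen.items())

partitions = [
    [("k1", "a"), ("k1", "a-dup"), ("k2", "b")],
    [("k2", "b-dup"), ("k3", "c")],
]

# Per-partition dedup, then a global "keep the first" merge,
# standing in for the final reduceByKey.
merged = {}
for part in partitions:
    for k, v in dedup_partition(part):
        merged.setdefault(k, v)
# merged == {"k1": "a", "k2": "b", "k3": "c"}
```

This is essentially what Spark's combiner machinery does under the hood, but doing the per-partition pass explicitly lets you drop duplicates before they are ever serialized for the shuffle.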