Maybe you could implement something like this (I don't know if something similar already exists in Spark):
http://www.cs.berkeley.edu/~jnwang/papers/icde14_massjoin.pdf

Best,
Flavio

On Oct 8, 2014 9:58 PM, "Nicholas Chammas" <nicholas.cham...@gmail.com> wrote:

> Multiple values may be different, yet still be considered duplicates
> depending on how the dedup criteria are selected. Is that correct? Do you
> care in that case which value you select for a given key?
>
> On Wed, Oct 8, 2014 at 3:37 PM, Ge, Yao (Y.) <y...@ford.com> wrote:
>
>> I need to do deduplication processing in Spark. The current plan is to
>> generate a tuple where the key is the dedup criteria and the value is the
>> original input. I am thinking of using reduceByKey to discard duplicate
>> values. If I do that, can I simply return the first argument, or should I
>> return a copy of the first argument? Is there a better way to do dedup in
>> Spark?
>>
>> -Yao
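
For the reduceByKey approach described above, a minimal sketch in Scala (RDD API) might look like the following. The Record type and dedupKey function are placeholders for whatever dedup criteria you choose, not anything from the original thread; returning the first argument unchanged should be fine as long as the values are immutable.

    import org.apache.spark.{SparkConf, SparkContext}

    object DedupSketch {
      // Hypothetical record type; here the lowercased name is the dedup
      // criteria, so records differing only in case collapse to one.
      case class Record(name: String, payload: String)
      def dedupKey(r: Record): String = r.name.toLowerCase

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("dedup-sketch").setMaster("local[*]"))

        val records = sc.parallelize(Seq(
          Record("Alice", "first"),
          Record("ALICE", "second"),  // duplicate of the record above
          Record("Bob", "third")
        ))

        val deduped = records
          .map(r => (dedupKey(r), r))   // key by the dedup criteria
          .reduceByKey((a, b) => a)     // keep one value per key; returning
                                        // the first argument as-is works for
                                        // immutable values, no copy needed
          .values

        deduped.collect().foreach(println)
        sc.stop()
      }
    }

If you don't care which of the duplicate values survives, any reduce function that returns one of its arguments will do; if you do care (e.g. keep the newest), pick it inside the reduce function instead.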