What is your data like? Are you looking at exact matches, or are you interested in near-duplicate records? Do you need to merge similar records to get a canonical value?
Best Regards,
Sonal
Nube Technologies <http://www.nubetech.co>
<http://in.linkedin.com/in/sonalgoyal>

On Thu, Oct 9, 2014 at 2:31 AM, Flavio Pompermaier <pomperma...@okkam.it> wrote:

> Maybe you could implement something like this (I don't know if something
> similar already exists in Spark):
>
> http://www.cs.berkeley.edu/~jnwang/papers/icde14_massjoin.pdf
>
> Best,
> Flavio
>
> On Oct 8, 2014 9:58 PM, "Nicholas Chammas" <nicholas.cham...@gmail.com> wrote:
>
>> Multiple values may be different, yet still be considered duplicates,
>> depending on how the dedup criteria are selected. Is that correct? Do you
>> care in that case which value you select for a given key?
>>
>> On Wed, Oct 8, 2014 at 3:37 PM, Ge, Yao (Y.) <y...@ford.com> wrote:
>>
>>> I need to do deduplication processing in Spark. The current plan is to
>>> generate a tuple where the key is the dedup criteria and the value is the
>>> original input. I am thinking of using reduceByKey to discard duplicate
>>> values. If I do that, can I simply return the first argument, or should I
>>> return a copy of the first argument? Is there a better way to do dedup in
>>> Spark?
>>>
>>> -Yao
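The pattern Yao describes (map each record to a `(dedup_key, record)` pair, then `reduceByKey` keeping one value per key) can be sketched in plain Python. This is a stand-in for the Spark operation, not Spark code; the `dedup_key` criterion (last name plus lowercased email) and the sample records are hypothetical, chosen only to illustrate the idea. Returning the first argument unchanged is fine here, since the merge function never mutates either record, so no copy is needed:

```python
# Dedup via (dedup_key, record) pairs and a reduceByKey-style merge
# that keeps the first value seen for each key.

def dedup_key(record):
    # Hypothetical dedup criterion: last name + lowercased email.
    return (record["last"], record["email"].lower())

def reduce_by_key(pairs, merge):
    # Mimics RDD.reduceByKey: merge all values that share a key.
    out = {}
    for k, v in pairs:
        out[k] = v if k not in out else merge(out[k], v)
    return list(out.values())

records = [
    {"last": "Ge",    "email": "yao@example.com",   "source": "crm"},
    {"last": "Ge",    "email": "YAO@EXAMPLE.COM",   "source": "web"},
    {"last": "Goyal", "email": "sonal@example.com", "source": "crm"},
]

pairs = [(dedup_key(r), r) for r in records]

# Keep the first record seen for each key; duplicates are discarded.
deduped = reduce_by_key(pairs, lambda a, b: a)
print(len(deduped))  # 2 distinct records remain
```

In Spark itself the equivalent would be `rdd.map(lambda r: (dedup_key(r), r)).reduceByKey(lambda a, b: a)`; as Nicholas notes, which of the differing values survives depends on this merge function, so pick it deliberately if the choice matters.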