Maybe you could implement something like this (I don't know if something
similar already exists in Spark):

http://www.cs.berkeley.edu/~jnwang/papers/icde14_massjoin.pdf

Best,
Flavio
On Oct 8, 2014 9:58 PM, "Nicholas Chammas" <nicholas.cham...@gmail.com>
wrote:

> Multiple values may be different, yet still be considered duplicates
> depending on how the dedup criteria are selected. Is that correct? Do you
> care in that case which value you select for a given key?
>
> On Wed, Oct 8, 2014 at 3:37 PM, Ge, Yao (Y.) <y...@ford.com> wrote:
>
>>  I need to do deduplication processing in Spark. The current plan is to
>> generate a tuple where the key is the dedup criteria and the value is the
>> original input. I am thinking of using reduceByKey to discard duplicate
>> values. If I do that, can I simply return the first argument, or should I
>> return a copy of the first argument? Is there a better way to do dedup in
>> Spark?
>>
>>
>>
>> -Yao
>>
>
>
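Regarding the reduceByKey approach in the quoted question, here is a minimal
sketch. The Record case class and the dedupKey function are made up for
illustration; substitute your real schema and dedup criterion.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object DedupSketch {

  // Hypothetical record type; replace with the real input type.
  case class Record(id: String, name: String, payload: String)

  // Assumption: the dedup criterion is some projection of the record,
  // e.g. a normalized name. Anything with stable equals/hashCode works as a key.
  def dedupKey(r: Record): String = r.name.trim.toLowerCase

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("dedup-sketch"))

    val input: RDD[Record] = sc.parallelize(Seq(
      Record("1", "Alice", "a"),
      Record("2", "alice ", "b"),   // duplicate of "Alice" under dedupKey
      Record("3", "Bob", "c")
    ))

    // Key by the dedup criterion, then keep one representative per key.
    // Returning the first argument as-is is fine: Spark does not mutate the
    // values passed to the reduce function, so no defensive copy is needed.
    val deduped = input
      .map(r => (dedupKey(r), r))
      .reduceByKey((first, _) => first)
      .values

    deduped.collect().foreach(println)
    sc.stop()
  }
}
```

Two notes on this sketch: which representative survives per key is not
deterministic (it depends on partitioning and ordering), so if you care which
value wins, encode that preference in the reduce function instead of always
taking the first argument. And if the dedup criterion is the entire record,
RDD.distinct() already covers that special case.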
