RE: Dedup

2016-01-12 Thread gpmacalalad
>> much Sean! >> >> -Yao >> >> -Original Message- >> From: Sean Owen [mailto:sowen@] >> Sent: Thursday, October 09, 2014 3:04 AM >> To: Ge, Yao (Y.) >> Cc: user@.apache >> Subject: Re: Dedup >> >> I think

RE: Dedup

2014-10-09 Thread Sean Owen
lto:so...@cloudera.com] > Sent: Thursday, October 09, 2014 3:04 AM > To: Ge, Yao (Y.) > Cc: user@spark.apache.org > Subject: Re: Dedup > > I think the question is about copying the argument. If it's an immutable > value like String, yes just return the first argument and ignore the

RE: Dedup

2014-10-09 Thread Ge, Yao (Y.)
much Sean! -Yao -Original Message- From: Sean Owen [mailto:so...@cloudera.com] Sent: Thursday, October 09, 2014 3:04 AM To: Ge, Yao (Y.) Cc: user@spark.apache.org Subject: Re: Dedup I think the question is about copying the argument. If it's an immutable value like String, yes

Re: Dedup

2014-10-09 Thread Sean Owen
I think the question is about copying the argument. If it's an immutable value like String, yes just return the first argument and ignore the second. If you're dealing with a notoriously mutable value like a Hadoop Writable, you need to copy the value you return. This works fine although you will
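A minimal plain-Python sketch of the distinction drawn above, not Spark code: the dict is only a hypothetical stand-in for a mutable, reused object such as a Hadoop Writable, and `keep_first` / `keep_first_copied` are illustrative names. It shows why returning the first argument is safe for an immutable value but can alias a mutable one.

```python
import copy


def keep_first(first, second):
    """Dedup merge for immutable values: keep the first, ignore the second."""
    return first


def keep_first_copied(first, second):
    """Dedup merge for mutable values: return an independent copy of the first."""
    return copy.deepcopy(first)


# Immutable case: a string cannot change after the merge, so no copy is needed.
kept = keep_first("row-1", "row-1")

# Mutable case: a dict stands in for an object the framework might reuse.
record = {"id": 7, "tags": ["x"]}
duplicate = {"id": 7, "tags": ["x"]}

aliased = keep_first(record, duplicate)        # same object as `record`
copied = keep_first_copied(record, duplicate)  # independent snapshot

# Simulate the original object being modified after the merge has run.
record["tags"].append("reused")

print(aliased["tags"])  # reflects the later mutation
print(copied["tags"])   # unaffected by it
```

The copy costs an extra allocation per merge, so it is only worth paying when the values really are mutable and may be reused.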

Re: Dedup

2014-10-08 Thread Akhil Das
If you are looking to eliminate duplicate rows (or similar) then you can define a key from the data and on that key you can do reduceByKey. Thanks Best Regards On Thu, Oct 9, 2014 at 10:30 AM, Sonal Goyal wrote: > What is your data like? Are you looking at exact matching or are you > interested
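The key-then-reduce idea can be sketched in plain Python as follows. This is an illustration under assumptions rather than Spark code: `reduce_by_key` is a hypothetical local helper that mimics merging values per key, and the sample rows and the `dedup_key` rule (case-insensitive name and email) are made up for the example.

```python
def reduce_by_key(pairs, func):
    """Hypothetical local stand-in for a reduce-by-key step.

    Values that share a key are merged pairwise with `func`.
    """
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


def dedup_key(row):
    """Assumed dedup criterion: name and email, compared case-insensitively."""
    name, email, _date = row
    return (name.lower(), email.lower())


rows = [
    ("Alice", "alice@example.com", "2014-10-01"),
    ("alice", "ALICE@example.com", "2014-10-05"),  # duplicate under dedup_key
    ("Bob", "bob@example.com", "2014-10-02"),
]

# Pair each row with its key, then keep the first row seen for every key.
keyed = [(dedup_key(row), row) for row in rows]
deduped = list(reduce_by_key(keyed, lambda first, second: first).values())

for row in deduped:
    print(row)
```

On a PySpark RDD of `(key, row)` pairs the analogous call would be `rdd.reduceByKey(lambda first, second: first)`. There, which duplicate counts as "first" is not guaranteed, so the merge function should pick a winner explicitly if that matters.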

Re: Dedup

2014-10-08 Thread Sonal Goyal
What is your data like? Are you looking at exact matching or are you interested in nearly same records? Do you need to merge similar records to get a canonical value? Best Regards, Sonal Nube Technologies On Thu, Oct 9, 2014 at 2:

Re: Dedup

2014-10-08 Thread Flavio Pompermaier
Maybe you could implement something like this (I don't know if something similar already exists in Spark): http://www.cs.berkeley.edu/~jnwang/papers/icde14_massjoin.pdf Best, Flavio On Oct 8, 2014 9:58 PM, "Nicholas Chammas" wrote: > Multiple values may be different, yet still be considered dup

Re: Dedup

2014-10-08 Thread Nicholas Chammas
Multiple values may be different, yet still be considered duplicates depending on how the dedup criteria is selected. Is that correct? Do you care in that case what value you select for a given key? On Wed, Oct 8, 2014 at 3:37 PM, Ge, Yao (Y.) wrote: > I need to do deduplication processing in S