RE: How to join RDD keyValuePairs efficiently

2015-04-16 Thread Evo Eftimov
Yes, simply look for partitionBy in the javadoc, e.g. for JavaPairRDD.

From: Jeetendra Gangele [mailto:gangele...@gmail.com]
Sent: Thursday, April 16, 2015 9:57 PM
To: Evo Eftimov
Cc: Wang, Ningjun (LNG-NPV); user
Subject: Re: How to join RDD keyValuePairs efficiently

Does this same …
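To illustrate what `partitionBy` with a hash partitioner does conceptually, here is a minimal sketch in plain Python (no Spark required). The function name `hash_partition` is made up for this sketch; in Spark you would call `rdd.partitionBy(new HashPartitioner(n))` on a pair RDD, which routes each key to a partition the same way.

```python
def hash_partition(pairs, num_partitions):
    """Assign each (key, value) pair to a partition by hashing the key,
    roughly what Spark's partitionBy with a HashPartitioner does."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        # same key always hashes to the same partition index
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions
```

The key property, and the reason it helps joins, is that every pair with the same key lands in the same partition.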

Re: How to join RDD keyValuePairs efficiently

2015-04-16 Thread Jeetendra Gangele
… > From: Wang, Ningjun (LNG-NPV) [mailto:ningjun.w...@lexisnexis.com]
> Sent: Thursday, April 16, 2015 9:39 PM
> To: user@spark.apache.org
> Subject: RE: How to join RDD keyValuePairs efficiently
>
> Evo
>
> > partition the large doc RDD based on the hash …

RE: How to join RDD keyValuePairs efficiently

2015-04-16 Thread Evo Eftimov
… RE: How to join RDD keyValuePairs efficiently

Evo

> partition the large doc RDD based on the hash function on the key, i.e. the docid

What API to use to do this? By the way, loading the entire dataset into memory causes an OutOfMemory problem because it is too large (I only have one machine …

RE: How to join RDD keyValuePairs efficiently

2015-04-16 Thread Wang, Ningjun (LNG-NPV)
… Thursday, April 16, 2015 5:02 PM
To: Wang, Ningjun (LNG-NPV)
Cc: user@spark.apache.org
Subject: Re: How to join RDD keyValuePairs efficiently

This would be much, much faster if your set of IDs was simply a Set, and you passed that to a filter() call that just filtered in the docs that matched an ID in the set …

Re: How to join RDD keyValuePairs efficiently

2015-04-16 Thread Akhil Das
You could try repartitioning your RDD using a custom partitioner (HashPartitioner, etc.) and caching the dataset in memory to speed up the joins.

Thanks
Best Regards

On Tue, Apr 14, 2015 at 8:10 PM, Wang, Ningjun (LNG-NPV) <ningjun.w...@lexisnexis.com> wrote:
> I have an RDD that contains millions …
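The point of repartitioning both sides with the same partitioner is that the join then becomes partition-local: matching keys are already on the same partition, so no further shuffle is needed at join time. A minimal sketch of that idea in plain Python (the function names are invented for illustration; it also assumes keys are unique within each side, as with the doc IDs in this thread):

```python
def copartitioned_join(left, right, num_partitions):
    """Join two lists of (key, value) pairs after hash-partitioning both
    sides identically, mimicking a join of two co-partitioned RDDs."""
    def partition(pairs):
        parts = [dict() for _ in range(num_partitions)]
        for k, v in pairs:
            parts[hash(k) % num_partitions][k] = v  # assumes unique keys
        return parts

    left_parts, right_parts = partition(left), partition(right)
    joined = []
    # each partition can be joined independently of the others
    for lp, rp in zip(left_parts, right_parts):
        for k, lv in lp.items():
            if k in rp:
                joined.append((k, (lv, rp[k])))
    return joined
```

Caching (in Spark, `rdd.cache()`) then pays off when the partitioned dataset is joined repeatedly, since the partitioning work is done once.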

RE: How to join RDD keyValuePairs efficiently

2015-04-16 Thread Evo Eftimov
… (LNG-NPV)
Cc: user@spark.apache.org
Subject: Re: How to join RDD keyValuePairs efficiently

This would be much, much faster if your set of IDs was simply a Set, and you passed that to a filter() call that just filtered in the docs that matched an ID in the set.

On Thu, Apr 16, 2015 at 4:51 PM, Wang, …

Re: How to join RDD keyValuePairs efficiently

2015-04-16 Thread Sean Owen
This would be much, much faster if your set of IDs was simply a Set, and you passed that to a filter() call that just filtered in the docs that matched an ID in the set.

On Thu, Apr 16, 2015 at 4:51 PM, Wang, Ningjun (LNG-NPV) wrote:
> Does anybody have a solution for this?
>
> From: Wang …
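The filter-with-a-Set approach Sean describes avoids the shuffle a join would trigger, since each partition only needs local membership tests. A minimal plain-Python sketch (the function name is invented; in Spark this would be something like `docs.filter(...)` over `(id, doc)` pairs, typically with the ID set shipped as a broadcast variable):

```python
def filter_by_ids(docs, wanted_ids):
    """Keep only the (doc_id, doc) pairs whose id appears in wanted_ids."""
    wanted = set(wanted_ids)  # set gives O(1) membership tests
    return [(doc_id, doc) for doc_id, doc in docs if doc_id in wanted]
```

This works well when the set of IDs is small enough to fit comfortably in memory on each worker, which is the usual case for a lookup list.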

RE: How to join RDD keyValuePairs efficiently

2015-04-16 Thread Wang, Ningjun (LNG-NPV)
Does anybody have a solution for this?

From: Wang, Ningjun (LNG-NPV)
Sent: Tuesday, April 14, 2015 10:41 AM
To: user@spark.apache.org
Subject: How to join RDD keyValuePairs efficiently

I have an RDD that contains millions of Document objects. Each document has a unique Id that is a string. I need …
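For reference, the operation being discussed throughout this thread, a pair-RDD inner join, has simple semantics: for each key present on both sides, emit the key with the paired values. A plain-Python sketch of those semantics (the function name is invented; in Spark this is what `leftRdd.join(rightRdd)` computes, at much larger scale):

```python
def inner_join(left, right):
    """Reference semantics of an RDD inner join over (key, value) pairs:
    for every key on both sides, emit (key, (left_value, right_value))."""
    right_index = {}
    for k, v in right:
        right_index.setdefault(k, []).append(v)
    return [(k, (lv, rv)) for k, lv in left for rv in right_index.get(k, [])]
```

The thread's suggestions (co-partitioning, caching, or replacing the join with a Set-based filter) are all ways of computing this same result, or a projection of it, with less data movement.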