Yes, simply look for partitionBy in the Javadoc, e.g. for JavaPairRDD.
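In Spark's Java API the call would be along the lines of docs.partitionBy(new HashPartitioner(n)) on a JavaPairRDD keyed by docid. As a rough illustration of what hash partitioning on the key means, here is a plain-Java sketch (not Spark code). The class and method names are hypothetical, and the mapping shown (a non-negative modulo of the key's hashCode) is an assumption about how a hash partitioner typically behaves, not a copy of Spark's implementation.

```java
// Plain-Java illustration only; HashPartitionSketch and partitionFor are
// hypothetical names and are not part of Spark.
final class HashPartitionSketch {

    // Maps a key to a partition index in the range [0, numPartitions).
    // Math.floorMod keeps the result non-negative even when hashCode() is negative.
    static int partitionFor(Object key, int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        if (key == null) {
            return 0; // assumption: null keys all go to the first partition
        }
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        // The same docid always lands in the same partition.
        for (String docId : new String[] {"doc-1", "doc-2", "doc-3"}) {
            System.out.println(docId + " -> partition " + partitionFor(docId, 4));
        }
    }
}
```

Because the mapping depends only on the key, two datasets partitioned the same way keep matching docids in the same partition, which is what lets a join avoid moving most of the data.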
From: Jeetendra Gangele [mailto:gangele...@gmail.com]
Sent: Thursday, April 16, 2015 9:57 PM
To: Evo Eftimov
Cc: Wang, Ningjun (LNG-NPV); user
Subject: Re: How to join RDD keyValuePairs efficiently
Does this same
> From: Wang, Ningjun (LNG-NPV) [mailto:ningjun.w...@lexisnexis.com]
> Sent: Thursday, April 16, 2015 9:39 PM
> To: user@spark.apache.org
> Subject: RE: How to join RDD keyValuePairs efficiently
>
> Evo
>
> > partition the large doc RDD based on the hash
Subject: RE: How to join RDD keyValuePairs efficiently
Evo
> partition the large doc RDD based on the hash function on the key, i.e.
> the docid
What API to use to do this?
By the way, loading the entire dataset into memory causes an OutOfMemory problem
because it is too large (I only have one mach
Sent: Thursday, April 16, 2015 5:02 PM
To: Wang, Ningjun (LNG-NPV)
Cc: user@spark.apache.org
Subject: Re: How to join RDD keyValuePairs efficiently
This would be much, much faster if your set of IDs was simply a Set, and you
passed that to a filter() call that just filtered in the docs that matched an
ID in the set.
You could try repartitioning your RDD using a custom partitioner
(HashPartitioner, etc.) and caching the dataset in memory to speed up the
joins.
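To make the idea concrete, here is a small plain-Java sketch (not Spark code) of partitioning once, keeping the partitioned data around, and then answering repeated lookups against it. In Spark this would roughly correspond to calling partitionBy(new HashPartitioner(n)) followed by cache() on the pair RDD. The class PartitionedCacheSketch and its methods are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only; all names here are hypothetical.
final class PartitionedCacheSketch {

    // One in-memory map per partition, built once and reused for every join.
    private final List<Map<String, String>> partitions = new ArrayList<>();

    PartitionedCacheSketch(Map<String, String> docsById, int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new HashMap<>());
        }
        // Route every document to the partition chosen by the hash of its id.
        for (Map.Entry<String, String> e : docsById.entrySet()) {
            partitions.get(indexOf(e.getKey())).put(e.getKey(), e.getValue());
        }
    }

    private int indexOf(String key) {
        return Math.floorMod(key.hashCode(), partitions.size());
    }

    // Joins a batch of ids against the cached partitions. Each id is looked up
    // only in the single partition that its hash selects.
    Map<String, String> join(Iterable<String> ids) {
        Map<String, String> result = new HashMap<>();
        for (String id : ids) {
            String doc = partitions.get(indexOf(id)).get(id);
            if (doc != null) {
                result.put(id, doc);
            }
        }
        return result;
    }
}
```

The partitioning cost is paid once in the constructor; each later join only touches the partitions that the requested ids hash to. Whether the cached data fits in memory is a separate question that this sketch does not address.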
Thanks
Best Regards
On Tue, Apr 14, 2015 at 8:10 PM, Wang, Ningjun (LNG-NPV) <
ningjun.w...@lexisnexis.com> wrote:
> I have an RDD that contains milli
This would be much, much faster if your set of IDs was simply a Set,
and you passed that to a filter() call that just filtered in the docs
that matched an ID in the set.
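A minimal plain-Java sketch of that suggestion (not Spark code): hold the wanted IDs in a HashSet and keep only the documents whose ID is in it. In Spark the same predicate would be passed to filter() on the RDD, with the set captured by the function. The names IdSetFilterSketch and filterByIds are hypothetical, and documents are modelled as plain strings for brevity.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Plain-Java illustration only; IdSetFilterSketch and filterByIds are hypothetical names.
final class IdSetFilterSketch {

    // Keeps only the entries whose document id is contained in wantedIds.
    // With a HashSet, each contains() check is constant time on average,
    // so the whole pass is a single scan over the documents.
    static Map<String, String> filterByIds(Map<String, String> docsById, Set<String> wantedIds) {
        return docsById.entrySet().stream()
                .filter(e -> wantedIds.contains(e.getKey()))
                .collect(Collectors.toMap(
                        Map.Entry::getKey,
                        Map.Entry::getValue,
                        (first, second) -> first, // keys are unique, so this merge is never used
                        LinkedHashMap::new));     // preserve the input order
    }
}
```

This avoids a shuffle-style join altogether: the set of IDs is small enough to sit next to the data, and every document is checked against it locally.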
On Thu, Apr 16, 2015 at 4:51 PM, Wang, Ningjun (LNG-NPV)
wrote:
> Does anybody have a solution for this?
>
> From: Wang
Does anybody have a solution for this?
From: Wang, Ningjun (LNG-NPV)
Sent: Tuesday, April 14, 2015 10:41 AM
To: user@spark.apache.org
Subject: How to join RDD keyValuePairs efficiently
I have an RDD that contains millions of Document objects. Each document has a
unique Id that is a string. I n