Re: Sending large objects to specific RDDs

2016-01-17 Thread Daniel Imberman
This is perfect. So I guess my best course of action will be to create a custom partitioner to ensure that the smallest amount of data is shuffled when I join the partitions, and then I really only need to do a map (rather than a mapPartitions), since the inverted index object will be pointed to (rather than …
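For what it's worth, a minimal sketch of that plan, assuming both RDDs are keyed by an Int id (IdPartitioner, vectors, invIndexes and search are illustrative names, not from the thread):

import org.apache.spark.Partitioner

// Illustrative partitioner: records with the same Int id always land in the
// same partition, so the later join moves as little data as possible.
class IdPartitioner(override val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int = {
    val mod = key.hashCode % numPartitions
    if (mod < 0) mod + numPartitions else mod   // keep the result non-negative
  }
}

// Reuse the same instance (or override equals) so Spark can see that the two
// RDDs are co-partitioned and skip the shuffle on the join.
val part = new IdPartitioner(16)
val coVectors = vectors.partitionBy(part)       // RDD[(Int, Vector[Double])]
val coIndexes = invIndexes.partitionBy(part)    // RDD[(Int, InvertedIndex)]

val results = coVectors
  .join(coIndexes, part)
  .map { case (id, (vec, idx)) => (id, search(idx, vec)) }   // a plain map is enough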

Re: Sending large objects to specific RDDs

2016-01-16 Thread Ted Yu
Both groupByKey and join() accept a Partitioner as a parameter. Maybe you can specify a custom Partitioner so that the amount of shuffle is reduced.

On Sat, Jan 16, 2016 at 9:39 AM, Daniel Imberman wrote:
> Hi Ted,
>
> I think I might have figured something out! (Though I haven't tested it at
> scale …
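For reference, a short sketch of passing the same partitioner to both calls (the partition count and RDD names are placeholders):

import org.apache.spark.HashPartitioner

val part = new HashPartitioner(16)

// Passing the same partitioner to both operations makes their outputs agree
// on where every key lives, which is what keeps the subsequent join cheap.
val grouped = vectors.groupByKey(part)        // RDD[(Int, Iterable[Vector[Double]])]
val joined  = grouped.join(invIndexes, part)  // only the un-partitioned side is shuffled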

Re: Sending large objects to specific RDDs

2016-01-16 Thread Daniel Imberman
Hi Koert,

So I actually just mentioned something somewhat similar in the thread (your email actually came through as I was sending it :) ). One question I have: if I do a groupByKey, and I have been smart about my partitioning up to this point, would I have that benefit of not needing to shuffle …
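Concretely, the situation being asked about would look something like this sketch (HashPartitioner and the names are placeholders); when the RDD's existing partitioner equals the one passed to groupByKey, the grouping becomes a narrow dependency:

import org.apache.spark.HashPartitioner

val part = new HashPartitioner(16)

// partitionBy pays the shuffle once; groupByKey then sees that the RDD is
// already partitioned by `part` and does not shuffle again.
val prePartitioned = vectors.partitionBy(part).cache()
val grouped        = prePartitioned.groupByKey(part)   // no additional shuffle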

Re: Sending large objects to specific RDDs

2016-01-16 Thread Daniel Imberman
Hi Ted,

I think I might have figured something out! (Though I haven't tested it at scale yet.) My current thought is that I can do a groupByKey on the RDD of vectors and then do a join with the invertedIndex. It would look something like this:

val InvIndexes: RDD[(Int, InvertedIndex)]
val partitione…
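The snippet is cut off in the archive; a hedged guess at its general shape, where every name other than InvIndexes is a placeholder:

import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

val InvIndexes: RDD[(Int, InvertedIndex)] = ???        // as declared in the message
val vectors: RDD[(Int, Vector[Double])]   = ???        // assumed shape of the vector RDD

val part = new HashPartitioner(InvIndexes.partitions.length)

// Group the vectors per index id, then join each group with its index, so the
// large InvertedIndex is only ever referenced inside its own partition.
val groupedVectors = vectors.groupByKey(part)
val joined = groupedVectors.join(InvIndexes, part)     // (Int, (Iterable[Vector[Double]], InvertedIndex))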

Re: Sending large objects to specific RDDs

2016-01-16 Thread Koert Kuipers
Just doing a join is not an option? If you carefully manage your partitioning then this can be pretty efficient (meaning no extra shuffle, basically a map-side join).

On Jan 13, 2016 2:30 PM, "Daniel Imberman" wrote:
> I'm looking for a way to send structures to pre-determined partitions so
> that they …
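Concretely, a sketch of that carefully-managed partitioning (the key type, partition count, and RDD names are assumed):

import org.apache.spark.HashPartitioner

val part = new HashPartitioner(16)

// Partition and cache both sides identically up front. Every later join keyed
// by the same partitioner is then effectively a map-side join: each output
// partition is built from exactly one partition of each parent, with no shuffle.
val left  = invIndexes.partitionBy(part).cache()
val right = vectors.partitionBy(part).cache()

val joined = left.join(right)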

Re: Sending large objects to specific RDDs

2016-01-15 Thread Ted Yu
My knowledge of XSEDE is limited - I visited the website. If there is no easy way to deploy HBase, an alternative approach (using HDFS?) needs to be considered. I need to do more homework on this :-)

On Thu, Jan 14, 2016 at 3:51 PM, Daniel Imberman wrote:
> Hi Ted,
>
> So unfortunately after looking …
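If HDFS does become the fallback, one possible shape, purely as a sketch (the path, element types, and partition count are placeholders):

// Serialize the keyed index shards to HDFS once...
invIndexes.saveAsObjectFile("hdfs:///path/to/inverted-index")

// ...and read them back in a later job, re-partitioning before any join.
val reloaded = sc.objectFile[(Int, InvertedIndex)]("hdfs:///path/to/inverted-index")
  .partitionBy(new org.apache.spark.HashPartitioner(16))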

Re: Sending large objects to specific RDDs

2016-01-14 Thread Daniel Imberman
Hi Ted,

So unfortunately after looking into the cluster manager that I will be using for my testing (I'm using a supercomputer called XSEDE rather than AWS), it looks like the cluster does not actually come with HBase installed (this cluster is becoming somewhat problematic, as it is essentially …

Re: Sending large objects to specific RDDs

2016-01-13 Thread Daniel Imberman
Thank you Ted! That sounds like it would probably be the most efficient (with the least overhead) way of handling this situation.

On Wed, Jan 13, 2016 at 11:36 AM Ted Yu wrote:
> Another approach is to store the objects in a NoSQL store such as HBase.
>
> Looking up an object should be very fast.
>
> …

Re: Sending large objects to specific RDDs

2016-01-13 Thread Ted Yu
Another approach is to store the objects in a NoSQL store such as HBase.

Looking up an object should be very fast.

Cheers

On Wed, Jan 13, 2016 at 11:29 AM, Daniel Imberman wrote:
> I'm looking for a way to send structures to pre-determined partitions so
> that they can be used by another RDD in a …
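A hedged sketch of that lookup with the plain HBase client inside mapPartitions (vectors, the table name, column family, and the deserialization step are all placeholders):

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
import org.apache.hadoop.hbase.util.Bytes

val withIndexes = vectors.mapPartitions { iter =>
  // One HBase connection per partition rather than per record.
  val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
  val table = conn.getTable(TableName.valueOf("inverted_index"))
  val out = iter.map { case (id, vec) =>
    val result = table.get(new Get(Bytes.toBytes(id)))
    val bytes  = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("data"))
    (id, vec, bytes)               // deserialize `bytes` into the index object as needed
  }.toList                         // materialize so the connection can be closed safely
  table.close()
  conn.close()
  out.iterator
}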