Hi Ted, I think I might have figured something out! (Though I haven't tested it at scale yet.)

My current thought is that I can do a groupByKey on the RDD of vectors and then join the result with the inverted indexes. It would look something like this:

val invIndexes: RDD[(Int, InvertedIndex)]
val partitionedVectors: RDD[(Int, Vector)]

val partitionedTasks: RDD[(Int, (Iterable[Vector], InvertedIndex))] =
  partitionedVectors.groupByKey().join(invIndexes)

val similarities = partitionedTasks.map(/* calculate similarities */)
val maxSim = similarities.reduce(math.max)

(Note that groupByKey produces an Iterable[Vector], not an Iterator[Vector], so that's the type the join carries through.)

While I realize that a groupByKey is usually frowned upon, it seems to me that since I need all of the associated vectors to be local anyway, this repartitioning should not be too expensive.
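To make the /* calculate similarities */ placeholder concrete, I imagine the map step would look something like the following. This is an untested sketch: calculateSimilarity is the method from my original pseudocode below, and I'm assuming it returns a Double and that all we need per key is the best score.

// Untested sketch: score every vector in a group against that group's
// inverted index and keep the best score for the key.
val similarities: RDD[Double] = partitionedTasks.map {
  case (_, (vecs, invIndex)) =>
    vecs.map(v => invIndex.calculateSimilarity(v)).max
}

// Global maximum across all keys.
val maxSim: Double = similarities.reduce(math.max)

Since each key's vectors and its index end up in a single record, the similarity computation itself should be entirely local; only the final reduce moves data.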
Does this seem like a reasonable approach to this problem, or are there any faults I should consider if I approach it this way?

Thank you for your help,

Daniel

On Fri, Jan 15, 2016 at 5:30 PM Ted Yu <yuzhih...@gmail.com> wrote:

> My knowledge of XSEDE is limited - I visited the website.
>
> If there is no easy way to deploy HBase, an alternative approach (using
> HDFS?) needs to be considered.
>
> I need to do more homework on this :-)
>
> On Thu, Jan 14, 2016 at 3:51 PM, Daniel Imberman <
> daniel.imber...@gmail.com> wrote:
>
>> Hi Ted,
>>
>> So unfortunately, after looking into the cluster manager that I will be
>> using for my testing (I'm using a supercomputer called XSEDE rather than
>> AWS), it looks like the cluster does not actually come with HBase
>> installed (this cluster is becoming somewhat problematic, as it is
>> essentially AWS but you have to do your own virtualization scripts). Do
>> you have any other thoughts on how I could go about dealing with this
>> purely using Spark and HDFS?
>>
>> Thank you
>>
>> On Wed, Jan 13, 2016 at 11:49 AM Daniel Imberman <
>> daniel.imber...@gmail.com> wrote:
>>
>>> Thank you Ted! That sounds like it would probably be the most efficient
>>> (with the least overhead) way of handling this situation.
>>>
>>> On Wed, Jan 13, 2016 at 11:36 AM Ted Yu <yuzhih...@gmail.com> wrote:
>>>
>>>> Another approach is to store the objects in a NoSQL store such as
>>>> HBase.
>>>>
>>>> Looking up an object should be very fast.
>>>>
>>>> Cheers
>>>>
>>>> On Wed, Jan 13, 2016 at 11:29 AM, Daniel Imberman <
>>>> daniel.imber...@gmail.com> wrote:
>>>>
>>>>> I'm looking for a way to send structures to pre-determined partitions
>>>>> so that they can be used by another RDD in a mapPartitions.
>>>>>
>>>>> Essentially, I'm given an RDD of SparseVectors and an RDD of inverted
>>>>> indexes. The inverted index objects are quite large.
>>>>>
>>>>> My hope is to do a mapPartitions within the RDD of vectors where I
>>>>> can compare each vector to the inverted index. The issue is that I
>>>>> only NEED one inverted index object per partition (which would have
>>>>> the same key as the values within that partition).
>>>>>
>>>>> val vectors: RDD[(Int, SparseVector)]
>>>>>
>>>>> val invertedIndexes: RDD[(Int, InvIndex)] =
>>>>>   a.reduceByKey(generateInvertedIndex)
>>>>>
>>>>> vectors.mapPartitions { iter =>
>>>>>   val invIndex = invertedIndexes(samePartitionKey)
>>>>>   iter.map(invIndex.calculateSimilarity(_))
>>>>> }
>>>>>
>>>>> How could I go about setting up the partitions such that the specific
>>>>> data structure I need will be present for the mapPartitions, but I
>>>>> won't have the extra overhead of sending over all values (which would
>>>>> happen if I were to make a broadcast variable)?
>>>>> One thought I have been having is to store the objects in HDFS, but
>>>>> I'm not sure if that would be a suboptimal solution (it seems like it
>>>>> could slow down the process a lot).
>>>>>
>>>>> Another thought I am currently exploring is whether there is some way
>>>>> I can create a custom Partition or Partitioner that could hold the
>>>>> data structure, although that might get too complicated and become
>>>>> problematic (a rough sketch of what I mean is below).
>>>>>
>>>>> Any thoughts on how I could attack this issue would be highly
>>>>> appreciated.
>>>>>
>>>>> Thank you for your help!
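>>>>>
>>>>> To be concrete about the Partitioner idea: as far as I can tell a
>>>>> Partitioner can only map keys to partition ids, so rather than
>>>>> actually holding the structure it would just co-locate each inverted
>>>>> index with its vectors. An untested sketch (numKeys, the number of
>>>>> distinct Int keys, is a made-up parameter):
>>>>>
>>>>> import org.apache.spark.Partitioner
>>>>>
>>>>> // Untested sketch: sends each Int key to its own partition so the
>>>>> // vectors and the inverted index that share a key are co-located.
>>>>> // Assumes keys are non-negative and lie in [0, numKeys).
>>>>> class KeyPartitioner(numKeys: Int) extends Partitioner {
>>>>>   override def numPartitions: Int = numKeys
>>>>>   override def getPartition(key: Any): Int = key.asInstanceOf[Int]
>>>>> }
>>>>>
>>>>> // Usage idea: partition both RDDs the same way so they line up:
>>>>> // val p = new KeyPartitioner(numKeys)
>>>>> // val vs = vectors.partitionBy(p)
>>>>> // val idxs = invertedIndexes.partitionBy(p)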