This looks like a really cool feature, and it seems likely it will be
extremely useful for things we are doing. However, I'm not sure it's
quite what I need here. With an inverted index you don't actually look
items up by their keys; instead you try to match against some input
string. So, if I created an inverted index on 1M strings/documents in a
single JVM I could subsequently submit a string-valued query to retrieve
the n-best matches. I'd like to do something like build ten 1M
string/document indexes across ten nodes, submit a string-valued query
to each of the ten indexes and aggregate the n-best matches from the ten
sets of results. Would this be possible with IndexedRDD or some other
feature of Spark?
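To make the pattern concrete, here is a toy sketch in plain Python of what I have in mind: build an index per partition, query every index, then merge the per-partition n-best lists. None of this is Spark code; `build_index`, `query_index`, and the word-overlap scorer are hypothetical stand-ins for a real inverted index.

```python
import heapq

def score(query, doc):
    """Toy relevance score: number of words shared between query and doc.
    (A real inverted index would score via postings lists, e.g. TF-IDF.)"""
    return len(set(query.split()) & set(doc.split()))

def build_index(docs):
    """Stand-in for building a per-partition inverted index."""
    return list(docs)

def query_index(index, query, n):
    """Return the n best-scoring (score, doc) pairs from one partition."""
    return heapq.nlargest(n, ((score(query, d), d) for d in index))

# Two "partitions" of documents, standing in for the ten-node scenario.
partitions = [["spark rdd index", "inverted index demo"],
              ["query string matching", "spark streaming"]]

indexes = [build_index(p) for p in partitions]      # the mapPartitions step
per_partition = [query_index(ix, "inverted index query", 2) for ix in indexes]
# Aggregate: merge the per-partition n-best lists into a global n-best list.
n_best = heapq.nlargest(2, (hit for hits in per_partition for hit in hits))
```

In Spark terms the first two steps would run inside each partition and only the small per-partition n-best lists would come back to the driver for the final merge.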
Thanks,
Philip
On 08/01/2014 04:26 PM, Ankur Dave wrote:
At 2014-08-01 14:50:22 -0600, Philip Ogren <philip.og...@oracle.com> wrote:
It seems that I could do this with mapPartition so that each element in a
partition gets added to an index for that partition.
[...]
Would it then be possible to take a string and query each partition's index
with it? Or better yet, take a batch of strings and query each string in the
batch against each partition's index?
I proposed a key-value store based on RDDs called IndexedRDD that does exactly
what you described. It uses mapPartitions to construct an index within each
partition, then exposes get and multiget methods to allow looking up values
associated with given keys.
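Roughly, the idea looks like the following plain-Python toy. This is not the actual IndexedRDD API; the class, its hash partitioning, and the method bodies here are only illustrative of building one index per partition and serving point lookups from it.

```python
class ToyIndexedRDD:
    """Toy model of the idea: a hash index built once per partition,
    then get/multiget lookups served from those per-partition indexes."""

    def __init__(self, pairs, num_partitions=4):
        # "mapPartitions" step: build a dict (the index) in each partition.
        self.parts = [dict() for _ in range(num_partitions)]
        for k, v in pairs:
            self.parts[hash(k) % num_partitions][k] = v

    def get(self, key):
        # Route the lookup to the one partition that can hold the key.
        return self.parts[hash(key) % len(self.parts)].get(key)

    def multiget(self, keys):
        # Batch lookup: one result dict, missing keys omitted.
        return {k: v for k in keys if (v := self.get(k)) is not None}

rdd = ToyIndexedRDD([("a", 1), ("b", 2), ("c", 3)])
```

The real implementation of course keeps each index inside its RDD partition on the cluster rather than in local dicts.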
It will hopefully make it into Spark 1.2.0. Until then you can try it out by
merging in the pull request locally: https://github.com/apache/spark/pull/1297.
See JIRA for details and slides on how it works:
https://issues.apache.org/jira/browse/SPARK-2365.
Ankur