Suppose I want to take my large text data input and create a distributed
inverted index in Spark over each string in the input (imagine an
in-memory Lucene index - not what I'm doing, but it's analogous). It
seems that I could do this with mapPartitions so that each element in a
partition gets added to an index for that partition. I'm making the
simplifying assumption that the individual indexes do not need to
coordinate any global statistics, so that e.g. tf-idf scores need not be
consistent across these indexes. Would it then be possible to take a
string and query each partition's index with it? Or, better yet, to take
a batch of strings and query each string in the batch against each
partition's index?
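
Here is a rough sketch of what I have in mind, in case it helps clarify
the question. InvertedIndex, buildIndexes, and queryAll are just
placeholder names for illustration, and the toy word-based index stands
in for the real (Lucene-like) index:

    import org.apache.spark.rdd.RDD

    // Stand-in for an in-memory index over one partition's strings
    // (analogous to an in-memory Lucene index; not the real thing).
    class InvertedIndex(docs: Array[String]) extends Serializable {
      // term -> ids of the documents that contain it
      private val postings: Map[String, Array[Int]] =
        docs.zipWithIndex
          .flatMap { case (doc, id) => doc.split("\\s+").map(term => (term, id)) }
          .groupBy(_._1)
          .map { case (term, pairs) => (term, pairs.map(_._2)) }

      // return the documents that contain the query term
      def query(term: String): Array[String] =
        postings.getOrElse(term, Array.empty[Int]).map(docs(_))
    }

    // Build one index per partition with mapPartitions.
    def buildIndexes(input: RDD[String]): RDD[InvertedIndex] =
      input.mapPartitions(iter => Iterator(new InvertedIndex(iter.toArray)))

    // Query every partition's index with a batch of strings,
    // returning (query, matching document) pairs from all partitions.
    def queryAll(indexes: RDD[InvertedIndex], queries: Seq[String]): Array[(String, String)] = {
      val bq = indexes.sparkContext.broadcast(queries)
      indexes.flatMap { idx =>
        bq.value.flatMap(q => idx.query(q).map(hit => (q, hit)))
      }.collect()
    }

Presumably I would cache the index RDD (buildIndexes(input).cache()) so
the per-partition indexes are built once and reused across query batches
rather than rebuilt on every query.
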
Thanks,
Philip