Suppose I want to take my large text-data input and create a distributed inverted index in Spark over each string in the input (imagine an in-memory Lucene index - not what I'm doing, but it's analogous). It seems I could do this with mapPartitions, so that each element in a partition gets added to an index for that partition. I'm making the simplifying assumption that the individual indexes do not need to coordinate any global metrics to keep, e.g., tf-idf scores consistent across these indexes.

Would it then be possible to take a string and query each partition's index with it? Or better yet, take a batch of strings and query each string in the batch against each partition's index?
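The pattern being asked about can be sketched without a cluster. The following is a minimal, Spark-free illustration in plain Python, where ordinary lists stand in for partitions: build_index plays the role of the function handed to mapPartitions (emitting one index per partition), and the query batch plays the role of a value broadcast to each partition's index. All names here are hypothetical, not Spark API.

```python
def build_index(partition):
    """Build a simple inverted index (term -> set of doc ids) for one partition.

    In Spark this would be the function passed to mapPartitions: it consumes
    an iterator of elements and yields a single index for the partition.
    """
    index = {}
    for doc_id, text in partition:
        for term in text.lower().split():
            index.setdefault(term, set()).add(doc_id)
    yield index


def query_index(index, queries):
    """Return, for each query term, the matching doc ids in this partition."""
    return {q: sorted(index.get(q.lower(), set())) for q in queries}


# Two "partitions" of (doc_id, text) pairs standing in for an RDD.
partitions = [
    [(0, "spark builds distributed indexes"), (1, "lucene index in memory")],
    [(2, "query each partition"), (3, "spark query batch")],
]

# Build one index per partition (the mapPartitions analogue).
indexes = [ix for part in partitions for ix in build_index(part)]

# Query a batch of strings against every partition's index, then merge
# the per-partition hits (the collect-and-combine step on the driver).
batch = ["spark", "query"]
merged = {q: [] for q in batch}
for ix in indexes:
    for q, hits in query_index(ix, batch).items():
        merged[q].extend(hits)

print(merged)  # {'spark': [0, 3], 'query': [2, 3]}
```

In actual Spark, the indexed RDD could be persisted so the per-partition indexes are built once, and each query batch could be shipped to executors (e.g. via a broadcast variable) with another mapPartitions pass returning the per-partition hits for merging on the driver.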

Thanks,
Philip
