I have an index containing all students. Now I want to run an index search inside an Apache Hadoop mapper, i.e.

    for each (record from mapper input reader) {
        output = lucene.search("name:" + record.name + " OR id:" + record.id);
        emit(output)
    }

My question is whether I should shard the index (across terms, not splitting the postings list of any single term) or simply replicate it.

The index for the entire dataset is not too big, so it fits on my local disk. I can copy it to every node in the cluster and leave it there permanently, so no copy overhead is incurred.

The only argument in favor of sharding is that a smaller index might be faster. But since an index search takes only O(lg(n)) time, the time saved is probably very small. So will sharding be worth the effort?

Thanks,
Yang
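
To put a rough number on the O(lg(n)) argument, here is a minimal sketch in plain Java. It assumes the term lookup behaves like a binary search over a sorted term dictionary and that terms are spread evenly across shards. Both are simplifications, not a description of how Lucene actually stores its terms. The class name, the method name, and the 10 million term count are made up for illustration.

```java
// Hypothetical helper for back-of-the-envelope estimates; not part of Lucene or Hadoop.
final class ShardLookupEstimate {

    // Approximate number of halving steps needed to find one term when
    // `terms` sorted entries are split evenly across `shards` shards,
    // i.e. ceil(log2(terms / shards)).
    static int comparisons(long terms, int shards) {
        long perShard = Math.max(1, terms / shards);
        // For x >= 1, 64 - numberOfLeadingZeros(x - 1) equals ceil(log2(x)).
        return 64 - Long.numberOfLeadingZeros(perShard - 1);
    }

    public static void main(String[] args) {
        long terms = 10_000_000L; // assumed dictionary size, for illustration only
        for (int shards : new int[] {1, 4, 16, 64}) {
            System.out.println(shards + " shard(s): "
                    + comparisons(terms, shards) + " steps per lookup");
        }
    }
}
```

Under these assumptions, going from 1 shard to 16 shards reduces a lookup from about 24 steps to about 20, so the lookup itself gets only slightly cheaper. The estimate ignores the cost of reading postings and of routing each query to the right shard, which would have to be measured on the real index.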