: Yes, when I say "duplicate" sentences, they are exact copies of the same string.

you still haven't explained how you indexed these sentences. what do you mean by "each lucene document actually contains exactly one sentence"? Did you tokenize the sentence into one field? do you have a field for verbs and a field for nouns? what does the structure of your documents look like?

if (by chance) you have one field in each document that contains the original, untokenized sentence as an indexed keyword, then finding duplicates would be pretty damn easy: iterate over a TermEnum on that field and look for any term that appears in more than one document.

admittedly, that's a pretty contrived case, and most likely that isn't the situation you are in -- but it serves as an example of how understanding your index structure can help people answer your question.

can you send the code you used to index these documents? that might help people spot novel ways of finding likely duplicates.

-Hoss
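
A rough sketch of the keyword-field idea, for illustration only. It does not touch a real index: a sorted map stands in for the term dictionary of an untokenized field, and the class and method names (DuplicateFinder, findDuplicates) are made up for this example. Against an actual index the same walk would presumably go through the TermEnum returned by IndexReader.terms(), checking docFreq() on each term, and it only works if the field really was indexed untokenized.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Lucene.
final class DuplicateFinder {

    /**
     * Returns every sentence that occurs in more than one document, mapped to
     * the ids of the documents containing it. A document id here is simply
     * the position of the sentence in the input list (an assumption of this
     * sketch: one sentence per document).
     */
    static Map<String, List<Integer>> findDuplicates(List<String> sentences) {
        // Stand-in for the term dictionary of an untokenized keyword field:
        // term (the whole sentence) -> ids of the documents containing it.
        // A TreeMap keeps the terms sorted, as a term enumeration would.
        Map<String, List<Integer>> postings = new TreeMap<>();
        for (int docId = 0; docId < sentences.size(); docId++) {
            postings.computeIfAbsent(sentences.get(docId), k -> new ArrayList<>())
                    .add(docId);
        }

        // Walk the terms in order and keep those whose document frequency
        // is greater than one.
        Map<String, List<Integer>> duplicates = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : postings.entrySet()) {
            if (entry.getValue().size() > 1) {
                duplicates.put(entry.getKey(), entry.getValue());
            }
        }
        return duplicates;
    }
}
```

Calling DuplicateFinder.findDuplicates on the four sentences "the cat sat", "a dog barked", "the cat sat", "birds fly" gives a single entry, "the cat sat", with document ids 0 and 2. Because the match is on the whole untokenized string, sentences that differ only in case or whitespace would not be reported.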