Re: Ideas Needed - Finding Duplicate Documents

Chris Hostetter Sun, 12 Jun 2005 16:29:47 -0700

: Yes, when I say "duplicate" sentences, they are exact copies of the same
: string.


you still haven't explained how you indexed these sentences.  what do you
mean by "each lucene document actually contains exactly one sentence"?

Did you tokenize the sentence into one field? do you have a field for verbs
and a field for nouns?  what does the structure of your documents look like?

if (perchance) you have one field in each document that contains the
original, untokenized sentence as an indexed keyword, then finding
duplicates would be pretty damn easy by iterating over a TermEnum on that
field and looking for any term that appears in more than one document.
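
as a rough sketch of that idea (a plain-Java model, not Lucene code): the
TreeMap below stands in for the sorted term dictionary of a single
untokenized keyword field, and the size of each doc id list plays the role
of the term's document frequency.  The class name, the method name, and the
assumption that list position i is the id of document i are all made up for
illustration.  Against a real index the same walk would presumably go
through IndexReader.terms(...) and TermEnum.docFreq() instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Plain-Java model of the idea, NOT Lucene code: the TreeMap stands in for
// the sorted term dictionary of one untokenized keyword field.
class DuplicateSentenceFinder {

    // Hypothetical helper: sentences.get(i) is the exact sentence stored in
    // document i. Returns each duplicated sentence with the ids of the
    // documents that contain it.
    static SortedMap<String, List<Integer>> findDuplicates(List<String> sentences) {
        // "Index" step: each whole sentence is a single term that maps to
        // the ids of the documents holding it.
        SortedMap<String, List<Integer>> postings = new TreeMap<>();
        for (int docId = 0; docId < sentences.size(); docId++) {
            postings.computeIfAbsent(sentences.get(docId), k -> new ArrayList<>())
                    .add(docId);
        }

        // "Enumerate" step: walk the terms in sorted order and keep only
        // those whose document frequency is greater than one.
        SortedMap<String, List<Integer>> duplicates = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : postings.entrySet()) {
            if (entry.getValue().size() > 1) {
                duplicates.put(entry.getKey(), entry.getValue());
            }
        }
        return duplicates;
    }
}
```

note that this only catches exact string matches: two sentences that differ
by a single space or a capital letter end up as different terms, which is
the same behavior you would get from an untokenized keyword field.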

admittedly, that's a pretty contrived case, and most likely that isn't the
situation you are in -- but it serves as an example of how understanding
your index structure can help people answer your question.

can you send the code you used to index these documents?  that
might help people spot novel ways of finding likely duplicates.


-Hoss

