How can I eliminate near duplicates from the index? Someone suggested that I
could look at the TermVectors and compare them to remove the duplicates.
One major problem with this is that the structure of the document is no
longer taken into account. Are there any obvious pitfalls? For example,
Document A could be a subset of Document B, with its terms in no particular
order.
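
To make the concern concrete, here is a minimal sketch in plain Python (not
the Lucene API) of what a term-vector comparison sees. The helper names
term_vector, cosine and containment, the whitespace tokenizer and the sample
texts are all assumptions made for illustration. Cosine similarity is one
common way to compare term-frequency vectors; containment is one possible way
to flag the "A is a subset of B" case.

```python
import math
from collections import Counter


def term_vector(text):
    """Hypothetical helper: bag-of-words term frequencies.

    Word order is discarded, which is exactly the loss of
    structure described in the question.
    """
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def containment(a, b):
    """Fraction of a's term occurrences that also occur in b."""
    total = sum(a.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, b[t]) for t, count in a.items())
    return overlap / total


# Illustrative documents, not taken from any real index.
doc_b = term_vector("the quick brown fox jumps over the lazy dog")
doc_a = term_vector("dog lazy the fox quick")  # subset of B, shuffled
shuffled_b = term_vector("dog lazy the over jumps fox brown quick the")

print(cosine(doc_b, shuffled_b))   # same terms, different order
print(containment(doc_a, doc_b))   # A's terms all appear in B
print(containment(doc_b, doc_a))   # B has terms that A lacks
print(cosine(doc_a, doc_b))
```

In this sketch a reordered copy of a document is indistinguishable from the
original, and a shuffled subset scores full containment in one direction
only, so any threshold on these scores would have to be chosen with both
cases in mind.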

Nutch's DeleteDuplicates class is useful only when the documents are
identical with respect to either URL or the content.
