near duplicates

Find Me Tue, 17 Oct 2006 08:54:39 -0700

How can I eliminate near duplicates from the index? Someone suggested that I
could look at the TermVectors and compare them to remove the duplicates.
One major problem with this is that the structure of the document is no
longer taken into account. Are there any obvious pitfalls? For example:
Document A could be a subset of Document B, but in no particular order.


Nutch's DeleteDuplicates class is useful only when the documents are
identical with respect to either the URL or the content.
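
To make the pitfalls above concrete, here is a minimal sketch of what a
term-vector comparison could look like. It does not use the Lucene or Nutch
APIs: plain term-to-frequency maps stand in for the term vectors, and the
class name, the method names and the whitespace tokenizer are all
hypothetical. It computes cosine similarity, which ignores word order
entirely, and a simple containment score, which is one possible way to
notice that Document A is a subset of Document B.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene or Nutch. The maps below are an
// assumed stand-in for term vectors that would be read from the index.
public class TermVectorSimilarity {

    // Build a term -> frequency map from lowercased, whitespace-split text.
    public static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    // Cosine similarity of two term-frequency vectors. Word order and
    // document structure play no part in the result.
    public static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            int fa = e.getValue();
            normA += (double) fa * fa;
            Integer fb = b.get(e.getKey());
            if (fb != null) {
                dot += (double) fa * fb;
            }
        }
        for (int fb : b.values()) {
            normB += (double) fb * fb;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Fraction of a's term occurrences that also occur in b.
    // A value of 1.0 means every term occurrence in a is covered by b.
    public static double containment(Map<String, Integer> a, Map<String, Integer> b) {
        long total = 0, shared = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            int fa = e.getValue();
            total += fa;
            shared += Math.min(fa, b.getOrDefault(e.getKey(), 0));
        }
        return total == 0 ? 0.0 : (double) shared / total;
    }

    public static void main(String[] args) {
        Map<String, Integer> original = termFrequencies("the quick brown fox");
        Map<String, Integer> reordered = termFrequencies("fox brown quick the");
        Map<String, Integer> small = termFrequencies("quick brown fox");
        Map<String, Integer> large =
            termFrequencies("the lazy dog saw a quick red fox jump brown");

        // Same words in a different order: indistinguishable to cosine.
        System.out.println("cosine(original, reordered) = " + cosine(original, reordered));
        // Subset case: cosine is pulled down by the extra terms in the larger
        // document, while containment still reports full coverage.
        System.out.println("cosine(small, large)        = " + cosine(small, large));
        System.out.println("containment(small, large)   = " + containment(small, large));
    }
}
```

In this sketch the reordered pair gets a perfect cosine score even though
the text differs, while the subset pair gets a low cosine score despite
full containment. So a single cosine threshold would treat a shuffled
document as a duplicate and would probably miss the subset case; whether
either behaviour is acceptable depends on what should count as a near
duplicate.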
