How can I eliminate near-duplicates from the index? Someone suggested that I could look at the TermVectors and do a comparison to remove the duplicates. One major problem with this is that the structure of the document is no longer taken into account: comparing which terms occur, and how often, ignores the order in which they appear. Are there any obvious pitfalls? For example: the terms of Document A being a subset of those of Document B, but in no particular order.
Nutch's DeleteDuplicates class is useful only when the documents are identical with respect to either the URL or the content, so it does not help with near-duplicates.
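
To make the concern concrete, here is a minimal Python sketch of the pitfall, not Lucene or Nutch code. It assumes naive whitespace tokenization, and every function name in it (`term_vector`, `cosine`, `shingles`, `jaccard`, `containment`) is hypothetical. It compares a bag-of-words similarity, which is roughly what a term-frequency comparison would give, against word shingles, a different technique that keeps local word order.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag of words: term -> frequency. Word order is discarded."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def shingles(text, k=3):
    """Set of k-word sequences; this keeps local word order."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Size of the intersection over size of the union of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def containment(a, b):
    """Fraction of the elements of a that also occur in b."""
    if not a:
        return 0.0
    return len(a & b) / len(a)


doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "dog lazy the over jumps fox brown quick the"  # same terms, reversed
doc_c = "quick fox lazy"  # terms are a subset of doc_a, not a contiguous phrase

# Term frequencies are identical, so the cosine is (up to rounding) 1.
print(cosine(term_vector(doc_a), term_vector(doc_b)))
# No 3-word sequence is shared, so the shingle overlap is 0.
print(jaccard(shingles(doc_a), shingles(doc_b)))
# Every term of doc_c occurs in doc_a, yet no shingle of doc_c does.
print(containment(set(term_vector(doc_c)), set(term_vector(doc_a))))
print(containment(shingles(doc_c), shingles(doc_a)))
```

In this toy case the term-based comparison reports the reordered document as a duplicate and the subset document as fully contained, while the shingle-based comparison reports no overlap for either. Whether either behaviour is right for a real index depends on what should count as a near-duplicate, and the shingle size `k=3` is an arbitrary choice here.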