There is a DuplicateFilter class in contrib that works pretty well.

2011/3/5 Grant Ingersoll <gsing...@apache.org>:
> See http://wiki.apache.org/solr/Deduplication. Should be fairly easy to pull
> out if you are doing just Lucene.
>
> On Mar 5, 2011, at 1:49 AM, Mark wrote:
>
>> Is there a way one could detect duplicates (say by using some unique hash of
>> certain fields) and mark a document as a duplicate without removing it?
>>
>> Here is an example:
>>
>> Doc 1) This is my test
>> Doc 2) This is my test
>> Doc 3) Another test
>> Doc 4) This is my test
>>
>> Doc 1 and 3 should be considered unique whereas 2 and 4 should be marked as
>> duplicates (of doc 1).
>>
>> Can this be easily accomplished?
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
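For the question quoted above, here is a rough sketch of the hash-based idea in plain Java, with no Lucene calls at all. The class name DuplicateMarker, the methods signature() and markDuplicates(), and the choice of MD5 over the whole text are all my own assumptions for illustration; they are not part of DuplicateFilter or of the Solr deduplication code. The idea is only to compute a signature per document, remember the first document seen with each signature, and flag later ones instead of dropping them.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not a Lucene or Solr class.
class DuplicateMarker {

    // Hex-encoded MD5 of the text that defines "sameness".
    // Assumption: in a real index this would be built from the chosen fields.
    static String signature(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // For each document, returns -1 if it is the first with its signature,
    // otherwise the index of the earlier document it duplicates.
    static int[] markDuplicates(List<String> docs) {
        Map<String, Integer> firstSeen = new HashMap<>();
        int[] duplicateOf = new int[docs.size()];
        for (int i = 0; i < docs.size(); i++) {
            Integer first = firstSeen.putIfAbsent(signature(docs.get(i)), i);
            duplicateOf[i] = (first == null) ? -1 : first;
        }
        return duplicateOf;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "This is my test", "This is my test", "Another test", "This is my test");
        int[] duplicateOf = markDuplicates(docs);
        for (int i = 0; i < docs.size(); i++) {
            String status = duplicateOf[i] < 0
                    ? "unique"
                    : "duplicate of Doc " + (duplicateOf[i] + 1);
            System.out.println("Doc " + (i + 1) + ": " + status);
        }
    }
}
```

With the four documents from the example, Doc 1 and Doc 3 come out as unique and Doc 2 and Doc 4 as duplicates of Doc 1. In an actual index you would presumably store the signature and the duplicate flag as extra fields when adding each document, so that nothing has to be removed.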
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org