It is indeed very slow, because it collapses across all matched documents. We tackled this problem by collapsing only within the top 100 documents.
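To make the idea concrete, here is a rough sketch in plain Python rather than Lucene or Solr code. Everything in it is an assumption for illustration: the function names (`signature`, `mark_duplicates`, `collapse_top_n`), the use of MD5 over the whole text as the signature, and the dict layout of a document. It is not the field collapsing patch and not our production code. It shows two things: flagging every document after the first one with a given signature, as Mark describes, and collapsing only inside the first n ranked hits so the cost no longer depends on the total number of matches.

```python
import hashlib


def signature(text):
    """Hash of whatever defines a duplicate (assumed here: the full text)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def mark_duplicates(docs):
    """Flag every document after the first one seen with a given signature.

    docs is a list of (doc_id, text) pairs in indexing order. Nothing is
    removed; each document just gets a boolean "duplicate" field that a
    filter query could later exclude.
    """
    seen = set()
    marked = []
    for doc_id, text in docs:
        sig = signature(text)
        marked.append({
            "id": doc_id,
            "text": text,
            "signature": sig,
            "duplicate": sig in seen,
        })
        seen.add(sig)
    return marked


def collapse_top_n(ranked_hits, n=100):
    """Collapse by signature within the first n hits only.

    Hits beyond position n are returned untouched, so the work is bounded
    by n instead of by the number of matched documents.
    """
    seen = set()
    head = []
    for hit in ranked_hits[:n]:
        if hit["signature"] not in seen:
            seen.add(hit["signature"])
            head.append(hit)
    return head + ranked_hits[n:]


# Mark's example from the original question.
docs = [
    (1, "This is my test"),
    (2, "This is my test"),
    (3, "Another test"),
    (4, "This is my test"),
]
marked = mark_duplicates(docs)
unique_ids = [d["id"] for d in marked if not d["duplicate"]]
collapsed_ids = [d["id"] for d in collapse_top_n(marked, n=3)]
print(unique_ids)     # [1, 3]
print(collapsed_ids)  # [1, 3, 4]
```

Note the trade-off in the second call: with n=3, doc 4 sits outside the collapsed window and comes back even though it duplicates doc 1. That is the price of bounding the work, and with n=100 it only shows up when a user pages past the first 100 hits.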
2011/3/6 Mark <static.void....@gmail.com>

> I'm familiar with Deduplication however I do not wish to remove my
> duplicates and my needs are slightly different. I would like to mark the
> first document with signature 'xyz' as unique but the next one as a
> duplicate. This way I can filter out "duplicates" during searching using a
> filter query but still return the original document.
>
> The only thing I know of at the moment is to use field collapsing but I
> tried the patch on 1.4.1 and it was terribly slow.
>
>
> On 3/5/11 4:43 AM, Grant Ingersoll wrote:
>
>> See http://wiki.apache.org/solr/Deduplication. Should be fairly easy to
>> pull out if you are doing just Lucene.
>>
>> On Mar 5, 2011, at 1:49 AM, Mark wrote:
>>
>>> Is there a way one could detect duplicates (say by using some unique hash
>>> of certain fields) and marking a document as a duplicate but not remove it.
>>>
>>> Here is an example:
>>>
>>> Doc 1) This is my test
>>> Doc 2) This is my test
>>> Doc 3) Another test
>>> Doc 4) This is my test
>>>
>>> Doc 1 and 3 should be considered unique whereas 2 and 4 should be marked
>>> as duplicates (of doc 1).
>>>
>>> Can this be easily accomplished?
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>>
>> --------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com/
>>
>> Search the Lucene ecosystem docs using Solr/Lucene:
>> http://www.lucidimagination.com/search