This is possible using contrib's DuplicateFilter. Below is an example of your problem defined as an XML-based test, which I just ran successfully through my test writer/runner. Hopefully it is readable and demonstrates the use of FilteredQuery/DuplicateFilter.
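For anyone who wants the expected behaviour spelled out before reading the XML test, here is a rough plain-Java model of it. This is only a sketch and not Lucene code: the class, the `Doc` type and the `search` method are hypothetical names made up for illustration, the substring check is a crude stand-in for a real analyzed term query, and it assumes the filter keeps the first document in index order for each md5 value, computed over the whole index independently of the query (which is my understanding of how DuplicateFilter behaves in its "first occurrence" mode).

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DuplicateFilterSketch {

    /** Hypothetical stand-in for an indexed document: primary key, text, md5 signature. */
    static final class Doc {
        final String pk;
        final String text;
        final String md5;

        Doc(String pk, String text, String md5) {
            this.pk = pk;
            this.text = text;
            this.md5 = md5;
        }
    }

    /**
     * Returns the pks of documents whose text contains the term, restricted to
     * the first document (in index order) for each distinct md5 value.
     */
    static List<String> search(List<Doc> index, String term) {
        // "Filter" step: walk the whole index once and allow only the first
        // document seen for each md5 value. This does not look at the query.
        Set<String> seenMd5 = new HashSet<>();
        Set<String> allowedPks = new HashSet<>();
        for (Doc d : index) {
            if (seenMd5.add(d.md5)) {
                allowedPks.add(d.pk);
            }
        }

        // "Query" step: keep documents that match the term AND pass the filter.
        List<String> hits = new ArrayList<>();
        String needle = term.toLowerCase();
        for (Doc d : index) {
            boolean matches = d.text.toLowerCase().contains(needle);
            if (matches && allowedPks.contains(d.pk)) {
                hits.add(d.pk);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Same four documents as in the XML test data.
        List<Doc> index = List.of(
                new Doc("1", "This is my test", "abc"),
                new Doc("2", "This is my test", "abc"),
                new Doc("3", "Another test", "def"),
                new Doc("4", "This is my test", "abc"));

        System.out.println(search(index, "test")); // prints [1, 3]
    }
}
```

One consequence of computing the filter independently of the query: if the first document for a given md5 did not match the query, later documents with that md5 would still be filtered out. That is harmless here because documents sharing an md5 have identical text. The actual XML test follows.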

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="test.xsl"?>
<Test description="DuplicateFilter tests">
  <Data>
    <Index name="index1">
      <Analyzers class="org.apache.lucene.analysis.standard.StandardAnalyzer29">
      </Analyzers>
      <Shard name="shard1">
        <Document pk="1">
          <Field name="text">This is my test</Field>
          <Field name="md5">abc</Field>
        </Document>
        <Document pk="2">
          <Field name="text">This is my test</Field>
          <Field name="md5">abc</Field>
        </Document>
        <Document pk="3">
          <Field name="text">Another test</Field>
          <Field name="md5">def</Field>
        </Document>
        <Document pk="4">
          <Field name="text">This is my test</Field>
          <Field name="md5">abc</Field>
        </Document>
      </Shard>
    </Index>
  </Data>
  <Tests>
    <Test description="Eliminate duplicates based on MD5 field">
      <Query>
        <FilteredQuery>
          <Query>
            <UserQuery fieldName="text">test</UserQuery>
          </Query>
          <Filter>
            <DuplicateFilter fieldName="md5"/>
          </Filter>
        </FilteredQuery>
      </Query>
      <ExpectedResults>
        <Result fieldName="pk">1</Result>
        <Result fieldName="pk">3</Result>
      </ExpectedResults>
    </Test>
  </Tests>
</Test>

----- Original Message ----
From: Mark <static.void....@gmail.com>
To: java-user@lucene.apache.org
Sent: Thu, 10 March, 2011 15:35:22
Subject: Re: Detecting duplicates

My understanding is It can mark documents with the same signature
indicating that they are similar however there is no way at query time to
return only 1 "unique" document per signature.

Am I missing something?

Doc 1) This is my test
Doc 2) This is my test
Doc 3) Another test
Doc 4) This is my test

If I run a query for "test" it should return

Doc 1) This is my test
Doc 3) Another test

On 3/10/11 6:25 AM, Grant Ingersoll wrote:
> On Mar 5, 2011, at 8:35 PM, Mark wrote:
>
>> I'm familiar with Deduplication however I do not wish to remove my
>> duplicates and my needs are slightly different. I would like to mark
>> the first document with signature 'xyz' as unique but the next one as
>> a duplicate. This way I can filter out "duplicates" during searching
>> using a filter query but still return the original document.
>
> My understanding is that you can have it mark duplicates.
>
>> The only thing I know of at the moment is to use field collapsing but
>> I tried the patch on 1.4.1 and it was terribly slow.
>>
>> On 3/5/11 4:43 AM, Grant Ingersoll wrote:
>>> See http://wiki.apache.org/solr/Deduplication. Should be fairly easy
>>> to pull out if you are doing just Lucene.
>>>
>>> On Mar 5, 2011, at 1:49 AM, Mark wrote:
>>>
>>>> Is there a way one could detect duplicates (say by using some unique
>>>> hash of certain fields) and marking a document as a duplicate but
>>>> not remove it.
>>>>
>>>> Here is an example:
>>>>
>>>> Doc 1) This is my test
>>>> Doc 2) This is my test
>>>> Doc 3) Another test
>>>> Doc 4) This is my test
>>>>
>>>> Doc 1 and 3 should be considered unique whereas 2 and 4 should be
>>>> marked as duplicates (of doc 1).
>>>>
>>>> Can this be easily accomplished?

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org