My understanding is that it can mark documents with the same signature, indicating that they are similar. However, there is no way at query time to return only one "unique" document per signature. Am I missing something?

Doc 1) This is my test
Doc 2) This is my test
Doc 3) Another test
Doc 4) This is my test

If I run a query for "test", it should return only:

Doc 1) This is my test
Doc 3) Another test
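
For illustration, here is a minimal sketch of one way to get that result on the client side: walk the ranked hits and keep only the first one seen for each signature. It is plain Java with no Lucene or Solr calls. The Hit class, its fields, and the signature values "sigA"/"sigB" are hypothetical; it assumes each hit carries a stored signature value. Because it collapses after the search has run, total hit counts and paging would need separate handling.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SignatureCollapse {
    // Hypothetical holder for one search hit and its stored signature.
    static final class Hit {
        final int id;
        final String signature;
        final String text;

        Hit(int id, String signature, String text) {
            this.id = id;
            this.signature = signature;
            this.text = text;
        }
    }

    // Keep the first hit per signature, preserving the ranked order.
    static List<Hit> firstPerSignature(List<Hit> rankedHits) {
        Map<String, Hit> bySignature = new LinkedHashMap<>();
        for (Hit h : rankedHits) {
            bySignature.putIfAbsent(h.signature, h);
        }
        return new ArrayList<>(bySignature.values());
    }

    public static void main(String[] args) {
        List<Hit> hits = new ArrayList<>();
        hits.add(new Hit(1, "sigA", "This is my test"));
        hits.add(new Hit(2, "sigA", "This is my test"));
        hits.add(new Hit(3, "sigB", "Another test"));
        hits.add(new Hit(4, "sigA", "This is my test"));
        for (Hit h : firstPerSignature(hits)) {
            System.out.println("Doc " + h.id + ") " + h.text);
        }
    }
}
```

With the four example documents, only Doc 1 and Doc 3 survive the collapse.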


On 3/10/11 6:25 AM, Grant Ingersoll wrote:
On Mar 5, 2011, at 8:35 PM, Mark wrote:

I'm familiar with Deduplication; however, I do not wish to remove my duplicates, 
and my needs are slightly different. I would like to mark the first document with 
signature 'xyz' as unique but the next one as a duplicate. This way I can filter 
out "duplicates" during searching using a filter query but still return the 
original document.
My understanding is that you can have it mark duplicates.

The only thing I know of at the moment is to use field collapsing, but I tried 
the patch on 1.4.1 and it was terribly slow.

On 3/5/11 4:43 AM, Grant Ingersoll wrote:
See http://wiki.apache.org/solr/Deduplication.  Should be fairly easy to pull 
out if you are doing just Lucene.

On Mar 5, 2011, at 1:49 AM, Mark wrote:

Is there a way one could detect duplicates (say, by using some unique hash of 
certain fields) and mark a document as a duplicate without removing it?

Here is an example:

Doc 1) This is my test
Doc 2) This is my test
Doc 3) Another test
Doc 4) This is my test

Docs 1 and 3 should be considered unique, whereas 2 and 4 should be marked as 
duplicates (of doc 1).

Can this be easily accomplished?
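
As a rough sketch of the idea in the question (not the Solr Deduplication implementation), the marking step could look like the following in plain Java: hash the chosen fields into a signature and flag a document as a duplicate when that signature has been seen before. The class name DuplicateMarker, the choice of MD5, and the in-memory set of seen signatures are all assumptions made for illustration; a real index would need to persist the seen signatures or look them up in the index.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

class DuplicateMarker {
    // Signatures seen so far (hypothetical in-memory store for this sketch).
    private final Set<String> seen = new HashSet<>();

    // Hex-encoded MD5 over the fields that define "sameness".
    static String signature(String... fields) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            for (String f : fields) {
                md.update(f.getBytes(StandardCharsets.UTF_8));
                md.update((byte) 0); // separator so ("ab","c") differs from ("a","bc")
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns true if an earlier document had the same signature.
    boolean isDuplicate(String... fields) {
        return !seen.add(signature(fields));
    }

    public static void main(String[] args) {
        String[] docs = {
            "This is my test", "This is my test", "Another test", "This is my test"
        };
        DuplicateMarker marker = new DuplicateMarker();
        for (int i = 0; i < docs.length; i++) {
            System.out.println("Doc " + (i + 1) + " duplicate=" + marker.isDuplicate(docs[i]));
        }
    }
}
```

The boolean could then be stored as a field on each document at index time, so that a search can filter on it while the duplicates stay in the index.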

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem docs using Solr/Lucene:
http://www.lucidimagination.com/search

