Re: Detecting duplicates

2011-03-10 Thread mark harwood
-user@lucene.apache.org Sent: Thu, 10 March, 2011 15:35:22 Subject: Re: Detecting duplicates My understanding is It can mark documents with the same signature indicating that they are similar however there is no way at query time to return only 1 "unique" document per signature. Am I missing so

Re: Detecting duplicates

2011-03-10 Thread Alexander Aristov
Did you check http://wiki.apache.org/solr/Deduplication? Best Regards Alexander Aristov On 10 March 2011 18:35, Mark wrote: > My understanding is It can mark documents with the same signature > indicating that they are similar however there is no way at query time to > return only 1 "unique

Re: Detecting duplicates

2011-03-10 Thread Mark
My understanding is that it can mark documents with the same signature, indicating that they are similar; however, there is no way at query time to return only 1 "unique" document per signature. Am I missing something? Doc 1) This is my test Doc 2) This is my test Doc 3) Another test Doc 4) This is my

Re: Detecting duplicates

2011-03-10 Thread Grant Ingersoll
On Mar 5, 2011, at 8:35 PM, Mark wrote: > I'm familiar with Deduplication however I do not wish to remove my duplicates > and my needs are slightly different. I would like to mark the first document > with signature 'xyz' as unique but the next one as a duplicate. This way I > can filter out "

Re: Detecting duplicates

2011-03-08 Thread Otis Gospodnetic
://search-lucene.com/ - Original Message > From: Mark > To: java-user@lucene.apache.org > Sent: Sat, March 5, 2011 8:35:13 PM > Subject: Re: Detecting duplicates > > I'm familiar with Deduplication however I do not wish to remove my > duplicates and my needs

Re: Detecting duplicates

2011-03-05 Thread Li Li
It's indeed very slow, because it does collapsing over all matched documents. We tackled this problem by collapsing only the top 100 documents. 2011/3/6 Mark > I'm familiar with Deduplication however I do not wish to remove my > duplicates and my needs are slightly different. I would like to mark the
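
The top-100 workaround described above can be sketched in a few lines. This is only an illustration, not the poster's code and not a Lucene or Solr API: `collapse_top_n`, the `signature` key and the dict-shaped hits are all hypothetical names chosen for the example.

```python
def collapse_top_n(ranked_hits, n=100, key="signature"):
    """Collapse duplicates only within the first n ranked hits.

    ranked_hits is a list of dicts already sorted by relevance.
    Hits beyond position n are passed through untouched, which is
    what keeps the cost bounded regardless of the total match count.
    """
    head, tail = ranked_hits[:n], ranked_hits[n:]
    seen = set()
    collapsed = []
    for hit in head:
        if hit[key] in seen:
            continue  # a higher-ranked hit with this signature was kept
        seen.add(hit[key])
        collapsed.append(hit)
    return collapsed + tail


hits = [
    {"id": 1, "signature": "a"},
    {"id": 2, "signature": "a"},
    {"id": 3, "signature": "b"},
    {"id": 4, "signature": "a"},
]
result_ids = [h["id"] for h in collapse_top_n(hits, n=3)]
```

The trade-off is that a duplicate ranked below the cut-off can still appear later in the result list.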

Re: Detecting duplicates

2011-03-05 Thread Mark
I'm familiar with Deduplication however I do not wish to remove my duplicates and my needs are slightly different. I would like to mark the first document with signature 'xyz' as unique but the next one as a duplicate. This way I can filter out "duplicates" during searching using a filter query
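
A minimal sketch of the marking scheme described above, done at index-preparation time outside of Lucene. Everything here is an assumption for illustration: the `signature` and `is_duplicate` field names, the choice of MD5, and the helper functions are hypothetical, and the sample documents mirror the ones quoted elsewhere in this thread.

```python
import hashlib


def signature(doc, fields):
    """Hash the chosen fields into a hex signature (MD5 is an arbitrary choice)."""
    joined = "\x1f".join(str(doc.get(f, "")).strip().lower() for f in fields)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


def mark_duplicates(docs, fields):
    """Return copies of docs with 'signature' and 'is_duplicate' added.

    The first document seen with a given signature stays unique;
    every later one is flagged, but nothing is removed.
    """
    seen = set()
    marked = []
    for doc in docs:
        sig = signature(doc, fields)
        marked.append({**doc, "signature": sig, "is_duplicate": sig in seen})
        seen.add(sig)
    return marked


docs = [
    {"id": 1, "body": "This is my test"},
    {"id": 2, "body": "This is my test"},
    {"id": 3, "body": "Another test"},
]
marked = mark_duplicates(docs, ["body"])
unique_only = [d["id"] for d in marked if not d["is_duplicate"]]
```

With the flag stored as an indexed field, a search could exclude flagged documents with a filter while still leaving them retrievable by other queries.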

Re: Detecting duplicates

2011-03-05 Thread Devon H. O'Dell
There is a DuplicateFilter class in contrib that works pretty well. 2011/3/5 Grant Ingersoll : > See http://wiki.apache.org/solr/Deduplication.  Should be fairly easy to pull > out if you are doing just Lucene. > > On Mar 5, 2011, at 1:49 AM, Mark wrote: > >> Is there a way one could detect dupli

Re: Detecting duplicates

2011-03-05 Thread Grant Ingersoll
See http://wiki.apache.org/solr/Deduplication. Should be fairly easy to pull out if you are doing just Lucene. On Mar 5, 2011, at 1:49 AM, Mark wrote: > Is there a way one could detect duplicates (say by using some unique hash of > certain fields) and marking a document as a duplicate but not

Re: Detecting duplicates

2011-03-04 Thread Li Li
It's the problem of near-duplicate detection. There are many papers addressing this problem; methods like simhash are used. 2011/3/5 Mark > Is there a way one could detect duplicates (say by using some unique hash > of certain fields) and marking a document as a duplicate but not remove it. >
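
For readers unfamiliar with simhash, here is a rough sketch of the idea with unweighted whitespace tokens. It is a simplified illustration, not a tuned implementation: the tokenisation, the use of MD5 as the per-token hash, and the function names are all assumptions, and `bits` must not exceed 64 because only 8 bytes of each digest are used.

```python
import hashlib


def simhash(text, bits=64):
    """Compute a simhash fingerprint over whitespace-separated tokens."""
    counts = [0] * bits
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position.
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


fp_a = simhash("This is my test")
fp_b = simhash("this is my test")
distance = hamming_distance(fp_a, fp_b)
```

Documents whose fingerprints differ in only a few bits are treated as near duplicates; the threshold is application-specific and has to be chosen empirically.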