-user@lucene.apache.org
Sent: Thu, 10 March, 2011 15:35:22
Subject: Re: Detecting duplicates
My understanding is that it can mark documents with the same signature,
indicating that they are similar; however, there is no way at query time
to return only 1 "unique" document per signature. Am I missing something?
Did you check this?
http://wiki.apache.org/solr/Deduplication
Best Regards
Alexander Aristov
On 10 March 2011 18:35, Mark wrote:
> My understanding is that it can mark documents with the same signature,
> indicating that they are similar; however, there is no way at query time to
> return only 1 "unique
My understanding is that it can mark documents with the same signature,
indicating that they are similar; however, there is no way at query time
to return only 1 "unique" document per signature. Am I missing something?
Doc 1) This is my test
Doc 2) This is my test
Doc 3) Another test
Doc 4) This is my
On Mar 5, 2011, at 8:35 PM, Mark wrote:
> I'm familiar with Deduplication; however, I do not wish to remove my duplicates,
> and my needs are slightly different. I would like to mark the first document
> with signature 'xyz' as unique but the next one as a duplicate. This way I
> can filter out "
----- Original Message -----
> From: Mark
> To: java-user@lucene.apache.org
> Sent: Sat, March 5, 2011 8:35:13 PM
> Subject: Re: Detecting duplicates
>
> I'm familiar with Deduplication; however, I do not wish to remove my
> duplicates, and my needs
It's indeed very slow, because it does the collapsing over all matched documents.
We tackled this problem by collapsing only the top 100 documents.
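
To illustrate the idea of collapsing only the top hits, here is a minimal Python sketch, not the actual implementation. It assumes the hits are already ranked and each carries a precomputed signature. The function name `collapse_top_n`, the `signature` key, and the choice to pass hits beyond the top n through unchanged are all assumptions made for the example.

```python
def collapse_top_n(hits, n=100, key="signature"):
    """Keep the first hit per signature within the top n ranked hits.

    Hits beyond position n are passed through unchanged (an assumption of
    this sketch), so the cost of collapsing is bounded by n rather than by
    the total number of matches.
    """
    seen = set()
    collapsed = []
    for hit in hits[:n]:
        sig = hit[key]
        if sig not in seen:
            seen.add(sig)
            collapsed.append(hit)
    return collapsed + hits[n:]
```

The trade-off is that duplicates ranked below the cutoff are not collapsed, which is usually acceptable when users only look at the first page or two of results.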
2011/3/6 Mark
> I'm familiar with Deduplication; however, I do not wish to remove my
> duplicates, and my needs are slightly different. I would like to mark the
I'm familiar with Deduplication; however, I do not wish to remove my
duplicates, and my needs are slightly different. I would like to mark the
first document with signature 'xyz' as unique but the next one as a
duplicate. This way I can filter out "duplicates" during searching using
a filter query
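
The marking scheme described above could be sketched roughly as follows. This is plain Python for illustration only, not Lucene or Solr code. The field names `signature` and `is_duplicate`, the helper names, and the use of MD5 over the joined field values are assumptions for the example.

```python
import hashlib


def signature(doc, fields):
    """Hash the chosen fields into a hex digest (MD5 is an arbitrary choice)."""
    joined = "\x1f".join(str(doc.get(f, "")) for f in fields)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


def mark_duplicates(docs, fields):
    """Return copies of docs with 'signature' and 'is_duplicate' added.

    The first document seen with a given signature is marked as unique;
    every later document with the same signature is marked as a duplicate.
    Nothing is removed.
    """
    seen = set()
    marked = []
    for doc in docs:
        sig = signature(doc, fields)
        marked.append({**doc, "signature": sig, "is_duplicate": sig in seen})
        seen.add(sig)
    return marked


docs = [
    {"id": 1, "body": "This is my test"},
    {"id": 2, "body": "This is my test"},
    {"id": 3, "body": "Another test"},
]
marked = mark_duplicates(docs, ["body"])

# Equivalent of filtering on the hypothetical is_duplicate field at search time.
unique_only = [d for d in marked if not d["is_duplicate"]]
```

In an index, the flag would be stored as a field at indexing time so that a filter on it can hide the duplicates while they remain searchable when the filter is off.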
There is a DuplicateFilter class in contrib that works pretty well.
2011/3/5 Grant Ingersoll:
> See http://wiki.apache.org/solr/Deduplication. Should be fairly easy to pull
> out if you are doing just Lucene.
>
> On Mar 5, 2011, at 1:49 AM, Mark wrote:
>
>> Is there a way one could detect dupli
See http://wiki.apache.org/solr/Deduplication. Should be fairly easy to pull
out if you are doing just Lucene.
On Mar 5, 2011, at 1:49 AM, Mark wrote:
> Is there a way one could detect duplicates (say by using some unique hash of
> certain fields) and mark a document as a duplicate but not
This is the problem of near-duplicate detection. There are many papers
addressing this problem; methods like simhash are used.
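
For readers unfamiliar with simhash, here is a minimal sketch of the basic idea in Python. It assumes whitespace tokenization, equal token weights, and the first 8 bytes of an MD5 digest as the per-token hash; these are choices made for the example, not something prescribed in this thread. With this token hash, `bits` must not exceed 64.

```python
import hashlib


def simhash(text, bits=64):
    """Compute a simhash fingerprint of text as an integer of `bits` bits."""
    counts = [0] * bits
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Two documents are then treated as near duplicates when the Hamming distance between their fingerprints falls below a threshold chosen for the collection. Unlike an exact hash, texts that share most of their tokens tend to produce fingerprints that differ in only a few bits.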
2011/3/5 Mark
> Is there a way one could detect duplicates (say by using some unique hash
> of certain fields) and mark a document as a duplicate without removing it?
>