Re: near duplicates

Find Me Tue, 24 Oct 2006 07:51:40 -0700

It doesn't make sense to eliminate near duplicates during search time. But
if you are trying to cluster duplicates together then probably you want to
look at Carrot.


On 10/24/06, Beto Siless <[EMAIL PROTECTED]> wrote:


Hi Andrej!

I'm taking a look to fuzzy signatures for near duplicate detection and
and I have seen your TextProfileSignature. The question is: If I index
the documents with their text signature, is there a way to filter near
duplicates at search time without comparing each document with all other?

Thanks
Beto

Andrzej Bialecki wrote:
> karl wettin wrote:
>>
>> 17 okt 2006 kl. 17.54 skrev Find Me:
>>
>>> How to eliminate near duplicates from the index?
>>
>> I would probably try to measure the Ecludian distance between all
>> documents, computed on terms and their positions. Or perhaps use
>> standard deviation to find the distribution of terms in a document.
>> One would based on the output from that try to find a threashold.
>> Either way it will consume lots of CPU.
>
>
> There are better ways to achieve this. You need to create a fuzzy
> signature of the document, based on term histogram or shingles - take a
> look a the Signature framework in Nutch.
>
> There is a substantial literature on this subject - go to Citeseer and
> run a search for "near duplicate detection".
>

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: near duplicates

Reply via email to