It doesn't make sense to eliminate near duplicates during search time. But
if you are trying to cluster duplicates together then probably you want to
look at Carrot.

On 10/24/06, Beto Siless <[EMAIL PROTECTED]> wrote:

Hi Andrej!

I'm taking a look at fuzzy signatures for near-duplicate detection, and
I have seen your TextProfileSignature. The question is: if I index
the documents with their text signature, is there a way to filter near
duplicates at search time without comparing each document with all the
others?

Thanks
Beto
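
To make the question concrete: if each hit carries a stored signature, one pass over the ranked hits with a set of already-seen signatures avoids any pairwise comparison. The sketch below is only an illustration. The (doc_id, signature) tuple shape and the function name are assumptions, not part of Lucene or Nutch, and it only collapses documents whose signatures are exactly equal.

```python
def collapse_by_signature(hits):
    """Yield the first doc id seen for each signature, keeping rank order.

    `hits` is assumed to be an iterable of (doc_id, signature) pairs,
    already sorted by relevance (a hypothetical shape for illustration).
    """
    seen = set()
    for doc_id, signature in hits:
        if signature in seen:
            # A higher-ranked hit with the same signature was already kept.
            continue
        seen.add(signature)
        yield doc_id


if __name__ == "__main__":
    ranked = [("doc1", "aa11"), ("doc2", "bb22"), ("doc3", "aa11")]
    print(list(collapse_by_signature(ranked)))
```

The cost is linear in the number of hits inspected, but how well it works depends entirely on how often near duplicates really end up with identical signatures.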

Andrzej Bialecki wrote:
> karl wettin wrote:
>>
>> 17 okt 2006 kl. 17.54 skrev Find Me:
>>
>>> How to eliminate near duplicates from the index?
>>
>> I would probably try to measure the Euclidean distance between all
>> documents, computed on terms and their positions. Or perhaps use
>> standard deviation to find the distribution of terms in a document.
>> Based on that output, one would then try to find a threshold.
>> Either way it will consume lots of CPU.
>
>
> There are better ways to achieve this. You need to create a fuzzy
> signature of the document, based on term histogram or shingles - take a
> look at the Signature framework in Nutch.
>
> There is a substantial literature on this subject - go to Citeseer and
> run a search for "near duplicate detection".
>
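
As a rough illustration of the term-histogram idea mentioned above, here is a minimal sketch. It is not the Nutch implementation; the function name, the quant_rate and min_token_len parameters, and the use of MD5 are assumptions chosen for the example. Term counts are rounded down to a coarse step and rare terms are dropped, so small edits usually leave the hash unchanged.

```python
import hashlib
import re
from collections import Counter


def fuzzy_signature(text, quant_rate=0.01, min_token_len=2):
    """Return a hex digest meant to stay stable under small edits to `text`.

    All parameter names and defaults are illustrative assumptions.
    """
    # Lowercase and split on anything that is not an ASCII letter or digit.
    tokens = [t for t in re.split(r"[^a-z0-9]+", text.lower())
              if len(t) >= min_token_len]
    counts = Counter(tokens)
    if not counts:
        return hashlib.md5(b"").hexdigest()

    # Quantization step: counts are rounded down to a multiple of `quant`.
    max_freq = max(counts.values())
    quant = max(2, round(max_freq * quant_rate)) if max_freq > 1 else 1

    profile = []
    for term, freq in counts.items():
        quantized = (freq // quant) * quant
        if quantized >= quant:  # drop terms that fall below one step
            profile.append((quantized, term))

    # Deterministic order: most frequent first, ties broken alphabetically.
    profile.sort(key=lambda p: (-p[0], p[1]))
    data = "\n".join(f"{term} {q}" for q, term in profile)
    return hashlib.md5(data.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    a = "lucene index search " * 10 + "extra footer text"
    b = "lucene index search " * 10 + "different trailing words"
    print(fuzzy_signature(a) == fuzzy_signature(b))
```

Two documents whose dominant terms have the same quantized counts hash to the same value, so exact matching on the signature is enough to group them. The weakness is the boundary case: a term whose count sits right at a quantization step can flip the signature after a one-word edit.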


