karl wettin wrote:

On 17 Oct 2006, at 17:54, Find Me wrote:

How to eliminate near duplicates from the index?

I would probably try to measure the Euclidean distance between all documents, computed on terms and their positions. Or perhaps use the standard deviation to characterize the distribution of terms in a document. Based on that output, one would then try to find a threshold. Either way it will consume lots of CPU.
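To make the first idea concrete, here is a minimal sketch of the Euclidean-distance approach on term-frequency histograms (ignoring positions for brevity; the tokenization and the comparison in main are illustrative, not part of any Lucene API):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class TermDistance {
    // Build a simple term-frequency histogram from whitespace-tokenized text.
    static Map<String, Integer> histogram(String text) {
        Map<String, Integer> h = new HashMap<>();
        for (String t : text.toLowerCase().split("\\s+")) {
            h.merge(t, 1, Integer::sum);
        }
        return h;
    }

    // Euclidean distance between two term histograms: sqrt of the sum of
    // squared per-term frequency differences over the union of terms.
    static double distance(Map<String, Integer> a, Map<String, Integer> b) {
        Set<String> terms = new HashSet<>(a.keySet());
        terms.addAll(b.keySet());
        double sum = 0;
        for (String t : terms) {
            int d = a.getOrDefault(t, 0) - b.getOrDefault(t, 0);
            sum += (double) d * d;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        Map<String, Integer> d1 = histogram("the quick brown fox");
        Map<String, Integer> d2 = histogram("the quick brown fox jumps");
        Map<String, Integer> d3 = histogram("completely different text here");
        // Near-duplicates end up closer than unrelated documents.
        System.out.println(distance(d1, d2) < distance(d1, d3));
    }
}
```

Note that comparing all documents pairwise is O(n^2), which is exactly why this consumes so much CPU on a large index.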


There are better ways to achieve this. You need to create a fuzzy signature of the document, based on a term histogram or shingles - take a look at the Signature framework in Nutch.
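As a rough illustration of the shingle idea (this is only a sketch of the general technique, not the actual Nutch Signature API; the shingle size and the number of retained hashes are made-up parameters):

```java
import java.util.TreeSet;

public class ShingleSignature {
    // Hash each window of SHINGLE_SIZE consecutive tokens and keep the
    // NUM_KEPT smallest hashes as the document's fuzzy signature - a
    // simple min-hash-style sketch. Parameters are illustrative.
    static final int SHINGLE_SIZE = 3;
    static final int NUM_KEPT = 8;

    static TreeSet<Long> signature(String text) {
        String[] tokens = text.toLowerCase().split("\\s+");
        TreeSet<Long> hashes = new TreeSet<>();
        for (int i = 0; i + SHINGLE_SIZE <= tokens.length; i++) {
            StringBuilder sb = new StringBuilder();
            for (int j = 0; j < SHINGLE_SIZE; j++) {
                sb.append(tokens[i + j]).append(' ');
            }
            hashes.add((long) sb.toString().hashCode());
            // Retain only the smallest NUM_KEPT hashes.
            while (hashes.size() > NUM_KEPT) hashes.pollLast();
        }
        return hashes;
    }

    // Jaccard overlap of two signatures; near-duplicates score close to 1.
    static double overlap(TreeSet<Long> a, TreeSet<Long> b) {
        TreeSet<Long> inter = new TreeSet<>(a);
        inter.retainAll(b);
        TreeSet<Long> union = new TreeSet<>(a);
        union.addAll(b);
        return union.isEmpty() ? 0 : (double) inter.size() / union.size();
    }

    public static void main(String[] args) {
        TreeSet<Long> s1 = signature("the quick brown fox jumps over the lazy dog today");
        TreeSet<Long> s2 = signature("the quick brown fox jumps over the lazy dog yesterday");
        TreeSet<Long> s3 = signature("an entirely unrelated document about search engines and indexing");
        // The one-word edit still shares most shingles; the unrelated text shares none.
        System.out.println(overlap(s1, s2) > overlap(s1, s3));
    }
}
```

The advantage over pairwise distance is that the small, fixed-size signature can be stored per document and compared (or even indexed) cheaply, instead of comparing full term vectors for every document pair.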

There is a substantial literature on this subject - go to Citeseer and run a search for "near duplicate detection".

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


