karl wettin wrote:
On 17 Oct 2006, at 17:54, Find Me wrote:
How to eliminate near duplicates from the index?
I would probably try to measure the Euclidean distance between all
documents, computed on terms and their positions. Or perhaps use the
standard deviation to find the distribution of terms in each document.
Based on that output one would then try to find a threshold.
Either way it will consume lots of CPU.
There are better ways to achieve this. You need to create a fuzzy
signature of the document, based on a term histogram or shingles - take
a look at the Signature framework in Nutch.
There is a substantial literature on this subject - go to Citeseer and
run a search for "near duplicate detection".
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]