Beto Siless wrote:
Hi Andrej!
I'm taking a look to fuzzy signatures for near duplicate detection and
and I have seen your TextProfileSignature. The question is: If I index
the documents with their text signature, is there a way to filter near
duplicates at search time without comparing each document with all other?
Well, Nutch uses DeleteDuplicates tool for this, which basically sorts
and compares these signatures and removes all duplicates from the index.
Then, when searching, there are simply no duplicates anymore because
they have been already deleted.
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]