It doesn't make sense to eliminate near duplicates at search time. But if you are trying to cluster duplicates together, then you probably want to look at Carrot.
On 10/24/06, Beto Siless <[EMAIL PROTECTED]> wrote:
Hi Andrej! I'm taking a look at fuzzy signatures for near-duplicate detection, and I have seen your TextProfileSignature. The question is: if I index the documents with their text signature, is there a way to filter near duplicates at search time without comparing each document with all the others?

Thanks,
Beto

Andrzej Bialecki wrote:
> karl wettin wrote:
>>
>> On 17 Oct 2006, at 17:54, Find Me wrote:
>>
>>> How to eliminate near duplicates from the index?
>>
>> I would probably try to measure the Euclidean distance between all
>> documents, computed on terms and their positions. Or perhaps use
>> standard deviation to find the distribution of terms in a document.
>> One would, based on the output from that, try to find a threshold.
>> Either way it will consume lots of CPU.
>
> There are better ways to achieve this. You need to create a fuzzy
> signature of the document, based on term histogram or shingles - take a
> look at the Signature framework in Nutch.
>
> There is a substantial literature on this subject - go to Citeseer and
> run a search for "near duplicate detection".
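
For readers following the thread, here is a minimal sketch of the term-histogram signature idea quoted above. It is only loosely inspired by that description and is not Nutch's actual TextProfileSignature code. The function names (`profile_signature`, `group_by_signature`), the parameters `min_token_len` and `quant_rate`, and the choice of MD5 as the digest are all assumptions made for illustration.

```python
import hashlib
import re
from collections import Counter, defaultdict


def profile_signature(text, min_token_len=2, quant_rate=0.01):
    """Return a hex digest intended to stay stable under small edits.

    Hypothetical helper: a simplified term-histogram signature,
    not the Nutch implementation.
    """
    # Lowercase and keep alphanumeric tokens of a minimum length.
    tokens = [t for t in re.findall(r"[a-z0-9]+", text.lower())
              if len(t) >= min_token_len]
    counts = Counter(tokens)
    if not counts:
        return hashlib.md5(b"").hexdigest()

    # Quantization step: frequencies are rounded down to a multiple of
    # `quant`, so small changes in a count land in the same bucket.
    max_freq = max(counts.values())
    quant = round(max_freq * quant_rate)
    if quant < 2:
        quant = 2 if max_freq > 1 else 1

    # Drop terms rarer than the step; they are the most edit-sensitive.
    profile = [((freq // quant) * quant, term)
               for term, freq in counts.items() if freq >= quant]

    # Sort by descending quantized frequency, then by term, so the
    # serialized profile does not depend on word order in the text.
    profile.sort(key=lambda item: (-item[0], item[1]))
    serialized = "\n".join(f"{term} {freq}" for freq, term in profile)
    return hashlib.md5(serialized.encode("utf-8")).hexdigest()


def group_by_signature(docs):
    """Group document ids by signature; `docs` maps id -> text."""
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        groups[profile_signature(text)].append(doc_id)
    return dict(groups)


if __name__ == "__main__":
    docs = {
        "a": "the quick brown fox jumps over the lazy dog the fox",
        "b": "The quick brown fox jumps over the lazy cat. The fox!",
        "c": "lucene index lucene index index",
    }
    for signature, ids in group_by_signature(docs).items():
        print(signature, ids)
```

Under these assumptions, documents whose quantized term profiles match get an identical signature, so if the signature is stored as an indexed field at indexing time, candidates can be found with an exact-match lookup on that field instead of comparing every document with every other one.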