Re: near duplicates

2006-10-24 Thread Andrzej Bialecki
Beto Siless wrote: Hi Andrzej! I'm taking a look at fuzzy signatures for near-duplicate detection and I have seen your TextProfileSignature. The question is: If I index the documents with their text signature, is there a way to filter near duplicates at search time without comparing

Re: near duplicates

2006-10-24 Thread Find Me
It doesn't make sense to eliminate near duplicates during search time. But if you are trying to cluster duplicates together then probably you want to look at Carrot. On 10/24/06, Beto Siless <[EMAIL PROTECTED]> wrote: Hi Andrzej! I'm taking a look at fuzzy signatures fo

Re: near duplicates

2006-10-24 Thread Beto Siless
Hi Andrzej! I'm taking a look at fuzzy signatures for near-duplicate detection and I have seen your TextProfileSignature. The question is: If I index the documents with their text signature, is there a way to filter near duplicates at search time without comparing each document wit
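
As a rough illustration of the idea behind a fuzzy text signature: the sketch below hashes a quantized term-frequency profile, so that documents differing only in a few rare words get the same hash and can be grouped by an exact match on the signature instead of pairwise comparison. This is not Nutch's TextProfileSignature code. The function names, the tokenizer, and the `quant_rate` and `min_token_len` parameters are all assumptions made for the example.

```python
import hashlib
import re
from collections import Counter, defaultdict


def profile_signature(text, quant_rate=0.01, min_token_len=2):
    """Hash a quantized term-frequency profile of `text` (hypothetical helper)."""
    tokens = [t for t in re.findall(r"[a-z0-9]+", text.lower())
              if len(t) >= min_token_len]
    counts = Counter(tokens)
    if not counts:
        return hashlib.md5(b"").hexdigest()
    max_freq = max(counts.values())
    # Quantization step (assumed rule): at least 2, so terms seen once are dropped.
    quant = max(2, round(max_freq * quant_rate))
    profile = []
    for term, freq in counts.items():
        q = (freq // quant) * quant  # round the count down to a multiple of quant
        if q > 0:
            profile.append((term, q))
    # Sort so the signature does not depend on the order of words in the text.
    profile.sort(key=lambda item: (-item[1], item[0]))
    joined = "\n".join(f"{term} {q}" for term, q in profile)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


def group_by_signature(docs):
    """Map each signature to the ids of the documents that share it."""
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        groups[profile_signature(text)].append(doc_id)
    return dict(groups)
```

If the signature is stored as an indexed field, hits with the same value can be collapsed with a plain equality check. Note that in this sketch very short texts, where no term repeats, all end up with the same empty profile.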

Re: near duplicates

2006-10-24 Thread Beto Siless
Hi Karl! I'm interested in near-duplicate detection based on termFreqVectors. Now I'm comparing all documents with each other (calculating the angle)... Is there a way to avoid that? Thanks! Beto karl wettin wrote: On 17 Oct 2006 at 17:54, Find Me wrote: How to eliminate near
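
For reference, the angle between two term-frequency vectors mentioned above can be computed as in the sketch below. It assumes plain whitespace tokenization and in-memory `Counter` objects rather than Lucene term vectors; `term_freqs`, `cosine_similarity` and `angle_degrees` are names made up for the example.

```python
import math
from collections import Counter


def term_freqs(text):
    """Term-frequency vector of a text, using naive whitespace tokenization."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def angle_degrees(a, b):
    """Angle between the vectors: 0 means same direction, 90 means no shared terms."""
    cos = min(1.0, max(-1.0, cosine_similarity(a, b)))  # guard against rounding
    return math.degrees(math.acos(cos))
```

Comparing every document against every other one this way costs on the order of n squared similarity computations, which is the problem being asked about.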

Re: near duplicates

2006-10-18 Thread John Casey
On 10/18/06, Isabel Drost <[EMAIL PROTECTED]> wrote: Find Me wrote: > How to eliminate near duplicates from the index? Someone suggested that I > could look at the TermVectors and do a comparison to remove the > duplicates. As an alternative you could also have a look at the p

Re: near duplicates

2006-10-18 Thread karl wettin
On 17 Oct 2006 at 18:55, Andrzej Bialecki wrote: You need to create a fuzzy signature of the document, based on a term histogram or shingles - take a look at the Signature framework in Nutch. There is a substantial literature on this subject - go to Citeseer and run a search for "near duplicate
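
To make the shingle option concrete, here is a minimal sketch using word-level shingles and Jaccard similarity. It is a generic illustration, not the Nutch Signature framework. The shingle size `k=3` and the function names are assumptions chosen for the example.

```python
def shingles(text, k=3):
    """Set of overlapping k-word sequences (shingles) in the text."""
    words = text.lower().split()
    if len(words) < k:
        # Too short for a full shingle: treat the whole text as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Share of shingles common to both sets, between 0.0 and 1.0."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

Unlike a bag-of-words comparison, shingles keep some of the word order, so two documents with the same vocabulary but a different structure score lower.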

Re: near duplicates

2006-10-17 Thread Andrzej Bialecki
karl wettin wrote: On 17 Oct 2006 at 17:54, Find Me wrote: How to eliminate near duplicates from the index? I would probably try to measure the Euclidean distance between all documents, computed on terms and their positions. Or perhaps use standard deviation to find the distribution of terms

Re: near duplicates

2006-10-17 Thread karl wettin
On 17 Oct 2006 at 17:54, Find Me wrote: How to eliminate near duplicates from the index? Oh, one more thing. You should probably look at the norms in order to avoid comparing all documents to each other.
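
One possible reading of this hint, sketched under assumptions: if "norm" is taken to mean the Euclidean length of a document's term-frequency vector (which is not the same thing as Lucene's stored field norms), then two vectors can only be within distance d of each other if their lengths differ by at most d. Sorting by length therefore lets most pairs be skipped. `candidate_pairs` and `threshold` are names invented for the example.

```python
import math
from collections import Counter


def norm(vec):
    """Euclidean length of a sparse term-frequency vector."""
    return math.sqrt(sum(v * v for v in vec.values()))


def candidate_pairs(docs, threshold):
    """Pairs of doc ids whose vector lengths differ by at most `threshold`.

    `docs` maps a doc id to a Counter of term frequencies. Only the returned
    pairs still need a full distance computation; every other pair is already
    further apart than `threshold`.
    """
    ordered = sorted((norm(vec), doc_id) for doc_id, vec in docs.items())
    pairs = []
    for i, (n_i, id_i) in enumerate(ordered):
        for n_j, id_j in ordered[i + 1:]:
            if n_j - n_i > threshold:
                break  # lengths are sorted, so all later ones are further away
            pairs.append((id_i, id_j))
    return pairs
```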

Re: near duplicates

2006-10-17 Thread karl wettin
On 17 Oct 2006 at 17:54, Find Me wrote: How to eliminate near duplicates from the index? I would probably try to measure the Euclidean distance between all documents, computed on terms and their positions. Or perhaps use standard deviation to find the distribution of terms in a document
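
A minimal sketch of the Euclidean distance part, assuming each document is reduced to a `Counter` of term frequencies. Term positions, which the message also mentions, are left out to keep the example short; `euclidean_distance` is a name made up for the example.

```python
import math
from collections import Counter


def euclidean_distance(a, b):
    """Euclidean distance between two sparse term-frequency vectors.

    A Counter returns 0 for a missing term, so terms that occur in only
    one of the documents contribute their full count to the distance.
    """
    terms = a.keys() | b.keys()
    return math.sqrt(sum((a[t] - b[t]) ** 2 for t in terms))
```

Raw counts make the distance sensitive to document length, so a long and a short copy of similar text can still end up far apart.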

near duplicates

2006-10-17 Thread Find Me
How to eliminate near duplicates from the index? Someone suggested that I could look at the TermVectors and do a comparison to remove the duplicates. One major problem with this is that the structure of the document is no longer important. Are there any obvious pitfalls? For example: Document A being