subject:"near duplicates"

Re: near duplicates

2006-10-24 Thread Andrzej Bialecki

Beto Siless wrote: Hi Andrzej! I'm taking a look at fuzzy signatures for near-duplicate detection, and I have seen your TextProfileSignature. The question is: if I index the documents with their text signature, is there a way to filter near duplicates at search time without comparing

Re: near duplicates

2006-10-24 Thread Find Me

It doesn't make sense to eliminate near duplicates at search time. But if you are trying to cluster duplicates together, then you probably want to look at Carrot. On 10/24/06, Beto Siless <[EMAIL PROTECTED]> wrote: Hi Andrzej! I'm taking a look at fuzzy signatures fo

Re: near duplicates

2006-10-24 Thread Beto Siless

Hi Andrzej! I'm taking a look at fuzzy signatures for near-duplicate detection, and I have seen your TextProfileSignature. The question is: if I index the documents with their text signature, is there a way to filter near duplicates at search time without comparing each document wit

Re: near duplicates

2006-10-24 Thread Beto Siless

Hi Karl! I'm interested in near-duplicate detection based on termFreqVectors. Right now I'm comparing all documents with each other (calculating the angle)... Is there a way to avoid that? Thanks! Beto karl wettin wrote: On 17 Oct 2006 at 17:54, Find Me wrote: How to eliminate near
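
The pairwise comparison described in this message (the angle between two term frequency vectors) can be sketched as follows. This is only an illustration in plain Python, not Lucene code: `term_freq_vector`, `cosine_similarity` and `angle_degrees` are hypothetical helper names, and whitespace tokenization is an assumption made to keep the example short.

```python
import math
from collections import Counter


def term_freq_vector(text):
    """Hypothetical helper: lowercase, split on whitespace, count terms."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


def angle_degrees(a, b):
    """Angle between the vectors; 0 means identical term distributions."""
    cos = max(-1.0, min(1.0, cosine_similarity(a, b)))  # guard against rounding
    return math.degrees(math.acos(cos))
```

Because the vectors ignore word order, two documents containing the same words in a different sequence get an angle of 0, and comparing every document against every other one this way costs a number of comparisons that grows quadratically with the size of the index.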

Re: near duplicates

2006-10-18 Thread John Casey

On 10/18/06, Isabel Drost <[EMAIL PROTECTED]> wrote: Find Me wrote: > How to eliminate near duplicates from the index? Someone suggested that I > could look at the TermVectors and do a comparison to remove the > duplicates. As an alternative you could also have a look at the p

Re: near duplicates

2006-10-18 Thread karl wettin

On 17 Oct 2006 at 18:55, Andrzej Bialecki wrote: You need to create a fuzzy signature of the document, based on term histogram or shingles - take a look at the Signature framework in Nutch. There is a substantial literature on this subject - go to Citeseer and run a search for "near duplicate
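
A minimal sketch of a term-histogram signature, to show why a fuzzy signature allows exact-match lookup instead of pairwise comparison. It is not the Nutch TextProfileSignature implementation: the function name `fuzzy_signature`, the parameters `quant_rate` and `min_token_len`, the tokenization and the choice of MD5 are all assumptions made for illustration.

```python
import hashlib
import re
from collections import Counter


def fuzzy_signature(text, quant_rate=0.01, min_token_len=2):
    """Hash a coarsened term histogram so small edits leave the hash unchanged."""
    tokens = [t for t in re.findall(r"[a-z0-9]+", text.lower())
              if len(t) >= min_token_len]
    counts = Counter(tokens)
    if not counts:
        return hashlib.md5(b"").hexdigest()

    # Quantization step: at least 2, so terms seen only once are ignored.
    quant = max(2, round(max(counts.values()) * quant_rate))

    profile = []
    for term, count in counts.items():
        quantized = (count // quant) * quant  # round down to a multiple of quant
        if quantized > 0:
            profile.append((term, quantized))

    # Deterministic order: most frequent first, ties broken alphabetically.
    profile.sort(key=lambda p: (-p[1], p[0]))
    payload = "\n".join(f"{q} {term}" for term, q in profile)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()
```

Documents whose quantized histograms match produce the same string, so near duplicates under this definition can be grouped with an ordinary dictionary or an indexed keyword field. The trade-off is that similarity becomes all-or-nothing: a difference large enough to move one term across a quantization boundary yields an unrelated hash.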

Re: near duplicates

2006-10-17 Thread Andrzej Bialecki

karl wettin wrote: On 17 Oct 2006 at 17:54, Find Me wrote: How to eliminate near duplicates from the index? I would probably try to measure the Euclidean distance between all documents, computed on terms and their positions. Or perhaps use standard deviation to find the distribution of terms

Re: near duplicates

2006-10-17 Thread karl wettin

On 17 Oct 2006 at 17:54, Find Me wrote: How to eliminate near duplicates from the index? Oh, one more thing. You should probably look at the norms in order to avoid comparing all documents to each other.
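
One possible reading of this hint, not necessarily what the author had in mind, is to use a cheap per-document number as a pre-filter and run the expensive comparison only on documents whose numbers are close. The sketch below uses the Euclidean norm of a term-count vector as that number; it does not read Lucene's stored norms, and `vector_norm`, `candidate_pairs` and `max_ratio` are hypothetical names.

```python
import math
from collections import Counter


def vector_norm(counts):
    """Euclidean length of a term-count vector."""
    return math.sqrt(sum(c * c for c in counts.values()))


def candidate_pairs(docs, max_ratio=1.1):
    """Return pairs of document ids whose norms differ by at most max_ratio.

    docs maps a document id to a Counter of term counts. Only the returned
    pairs would then be passed to a full similarity comparison.
    """
    ranked = sorted((vector_norm(vec), doc_id) for doc_id, vec in docs.items())
    pairs = []
    for i, (norm_i, id_i) in enumerate(ranked):
        for norm_j, id_j in ranked[i + 1:]:
            if norm_j > norm_i * max_ratio:
                break  # sorted by norm, so every later document is even longer
            pairs.append((id_i, id_j))
    return pairs
```

This is a heuristic: it assumes near duplicates have similar lengths, and it still degenerates to comparing everything when most documents have nearly the same norm.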

Re: near duplicates

2006-10-17 Thread karl wettin

On 17 Oct 2006 at 17:54, Find Me wrote: How to eliminate near duplicates from the index? I would probably try to measure the Euclidean distance between all documents, computed on terms and their positions. Or perhaps use standard deviation to find the distribution of terms in a document

near duplicates

2006-10-17 Thread Find Me

How to eliminate near duplicates from the index? Someone suggested that I could look at the TermVectors and do a comparison to remove the duplicates. One major problem with this is that the structure of the document is no longer important. Are there any obvious pitfalls? For example: Document A being

10 matches