Beto Siless wrote:
Hi Andrej!
I'm taking a look at fuzzy signatures for near duplicate detection and
I have seen your TextProfileSignature. The question is: If I index
the documents with their text signature, is there a way to filter near
duplicates at search time without comparing
It doesn't make sense to eliminate near duplicates during search time. But
if you are trying to cluster duplicates together then probably you want to
look at Carrot.
On 10/24/06, Beto Siless <[EMAIL PROTECTED]> wrote:
Hi Andrej!
I'm taking a look at fuzzy signatures fo
Hi Andrej!
I'm taking a look at fuzzy signatures for near duplicate detection and
I have seen your TextProfileSignature. The question is: If I index
the documents with their text signature, is there a way to filter near
duplicates at search time without comparing each document wit
Hi Karl!
I'm interested in near duplicate detection based on termFreqVectors. Now
I'm comparing all documents with each other (calculating the angle)...
Is there a way to avoid that?
Thanks!
Beto
karl wettin wrote:
17 okt 2006 kl. 17.54 skrev Find Me:
How to eliminate near
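The pairwise angle comparison described in the message above can be sketched as follows. This is only an illustration, not code from the thread: the whitespace tokenizer, the function names, and the default threshold are assumptions, and a plain `Counter` stands in for a stored term frequency vector. It shows the all-pairs cost the poster wants to avoid, since every document is compared with every other one.

```python
import math
from collections import Counter


def term_freq_vector(text):
    """Build a sparse term-frequency vector from lowercased whitespace tokens.

    Hypothetical stand-in for a term vector read back from an index.
    """
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Return the cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is not similar to anything
    return dot / (norm_a * norm_b)


def near_duplicate_pairs(docs, threshold=0.9):
    """Compare every pair of documents: O(n^2) comparisons for n documents."""
    vectors = {doc_id: term_freq_vector(text) for doc_id, text in docs.items()}
    ids = sorted(vectors)
    pairs = []
    for i, left in enumerate(ids):
        for right in ids[i + 1:]:
            if cosine_similarity(vectors[left], vectors[right]) >= threshold:
                pairs.append((left, right))
    return pairs
```

A cosine close to 1 means a small angle, so a higher threshold demands closer matches. The nested loop is what makes this approach expensive on a large index.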
On 10/18/06, Isabel Drost <[EMAIL PROTECTED]> wrote:
Find Me wrote:
> How to eliminate near duplicates from the index? Someone suggested that I
> could look at the TermVectors and do a comparison to remove the
> duplicates.
As an alternative you could also have a look at the p
17 okt 2006 kl. 18.55 skrev Andrzej Bialecki:
You need to create a fuzzy signature of the document, based on term
histogram or shingles - take a look at the Signature framework in
Nutch.
There is a substantial literature on this subject - go to Citeseer
and run a search for "near duplicate
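As a rough illustration of a fuzzy signature built from a term histogram, here is a minimal sketch. It is not the Nutch Signature framework or its TextProfileSignature; the function name, the tokenizer, the `min_token_len` and `quant_rate` parameters, and the use of SHA-256 are all assumptions made for the example. The idea is that term counts are rounded down into coarse buckets before hashing, so small edits tend to leave the signature unchanged and near duplicates can be found by looking up equal signatures instead of comparing documents pairwise.

```python
import hashlib
import re
from collections import Counter


def fuzzy_signature(text, min_token_len=2, quant_rate=0.01):
    """Hash a quantized term histogram (hypothetical helper, not a Nutch API)."""
    tokens = [
        t for t in re.findall(r"[a-z0-9]+", text.lower())
        if len(t) >= min_token_len
    ]
    counts = Counter(tokens)
    if not counts:
        return hashlib.sha256(b"").hexdigest()

    # Bucket size grows with the most frequent term, but is never below 2,
    # so terms that occur only once are always dropped as noise.
    max_freq = max(counts.values())
    quant = max(2, round(max_freq * quant_rate))

    # Round each count down to a multiple of the bucket size.
    profile = {}
    for term, freq in counts.items():
        quantized = (freq // quant) * quant
        if quantized > 0:
            profile[term] = quantized

    # Sort deterministically so equal profiles always hash to the same value.
    ordered = sorted(profile.items(), key=lambda item: (-item[1], item[0]))
    payload = "\n".join(f"{term} {freq}" for term, freq in ordered)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Two documents that differ only in a few rare terms produce the same signature under these assumptions. A count sitting right at a bucket boundary can still flip the signature, so this kind of scheme trades some recall for a cheap equality check.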
karl wettin wrote:
17 okt 2006 kl. 17.54 skrev Find Me:
How to eliminate near duplicates from the index?
I would probably try to measure the Euclidean distance between all
documents, computed on terms and their positions. Or perhaps use
standard deviation to find the distribution of terms
17 okt 2006 kl. 17.54 skrev Find Me:
How to eliminate near duplicates from the index?
Oh, one more thing. You should probably look at the norms in order to
avoid comparing all documents to each other.
17 okt 2006 kl. 17.54 skrev Find Me:
How to eliminate near duplicates from the index?
I would probably try to measure the Euclidean distance between all
documents, computed on terms and their positions. Or perhaps use
standard deviation to find the distribution of terms in a document
How to eliminate near duplicates from the index? Someone suggested that I
could look at the TermVectors and do a comparison to remove the duplicates.
One major problem with this is the structure of the document is no longer
important. Are there any obvious pitfalls? For example: Document A being
10 matches