Re: TextProfileSigature using deduplication

2008-11-20 Thread Andrzej Bialecki
Mark Miller wrote: Thanks for sharing, Marc, that's very nice to know. I'll take your experience as a starting point for some wiki recommendations. Sounds like we should add a switch to order alpha as well. On the general note of near-duplicate detection ... I found this paper in the proceedin

Re: TextProfileSigature using deduplication

2008-11-20 Thread Mark Miller
Thanks for sharing, Marc, that's very nice to know. I'll take your experience as a starting point for some wiki recommendations. Sounds like we should add a switch to order alpha as well. Marc Sturlese wrote: Hey there, I found a couple of solutions that work fine for my case (it is not exactly what

Re: TextProfileSigature using deduplication

2008-11-20 Thread Marc Sturlese
re
> (due to small text changes) the frequency of a term moves between quantized bands. This then changes the über hash that you get from combining all terms, but with 10 or so bands we still get some matches on the hashes from the individual bands.
>
> The "find potentially similar files" uses a simple Lucene scoring function, based on the number of matching fingerprint values.
>
> -- Ken
> --
> Ken Krugler
> Krugle, Inc.
> +1 530-210-6378
> "If you can't find it, you can't fix it"
>
--
View this message in context: http://www.nabble.com/TextProfileSigature-using-deduplication-tp20559155p20600118.html
Sent from the Solr - User mailing list archive at Nabble.com.
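A rough sketch of the banding idea Ken describes, for readers who want something concrete. Everything here is an assumption made for illustration: the function names, the way terms are assigned to bands, and the use of MD5 are not taken from Ken's code, and the plain set intersection only stands in for the Lucene scoring he mentions.

```python
import hashlib
from collections import Counter, defaultdict


def band_fingerprints(tokens, bands=10):
    """Return one hash per non-empty frequency band (hypothetical scheme)."""
    counts = Counter(tokens)
    if not counts:
        return set()
    max_freq = max(counts.values())
    grouped = defaultdict(list)
    for term, freq in counts.items():
        # Assumed banding rule: scale the frequency into 0..bands-1.
        band = (freq * bands) // (max_freq + 1)
        grouped[band].append(term)
    fingerprints = set()
    for band, terms in grouped.items():
        # Hash the sorted terms of each band separately, so a change in one
        # band leaves the hashes of the other bands untouched.
        payload = f"{band}:{' '.join(sorted(terms))}".encode("utf-8")
        fingerprints.add(hashlib.md5(payload).hexdigest())
    return fingerprints


def matching_fingerprints(tokens_a, tokens_b, bands=10):
    """Count band hashes shared by two documents (stand-in for a score)."""
    return len(band_fingerprints(tokens_a, bands) & band_fingerprints(tokens_b, bands))
```

With this scheme, a small edit that only affects low-frequency terms changes one band hash, while the hashes of the other bands still match.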

Re: TextProfileSigature using deduplication

2008-11-18 Thread Ken Krugler
Marc Sturlese wrote: Hey there, I've been testing and checking the source of the TextProfileSignature.java to avoid similar entries at indexing time. What I understood is that it is useful for huge text where the frequency of the tokens (the words in lowercase just with number and letters in that

Re: TextProfileSigature using deduplication

2008-11-18 Thread Andrzej Bialecki
Marc Sturlese wrote: Hey there, I've been testing and checking the source of the TextProfileSignature.java to avoid similar entries at indexing time. What I understood is that it is useful for huge text where the frequency of the tokens (the words in lowercase just with number and letters in that

Re: TextProfileSigature using deduplication

2008-11-18 Thread Marc Sturlese
>>> so it works really slow...
>>>
> What are you doing for the String comparison? Not exact right?

Re: TextProfileSigature using deduplication

2008-11-18 Thread Mark Miller
I have my own duplication system to detect that but I use String comparison so it works really slow... What are you doing for the String comparison? Not exact right?

Re: TextProfileSigature using deduplication

2008-11-18 Thread Mark Miller
Have you tried the tuning params for TextProfileSignature? I probably have to update the dedupe wiki. You can set the quantRate and the minTokenLength. Those are the variable names and you set them right along with signatureClass, signatureField, fields, etc. Whether or not you can tune it to me
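To get a feel for what the two tuning parameters do, here is a minimal Python sketch of a text-profile style signature. It is not the Solr implementation: the function name, the default values, the rounding, and the alphabetical tie-break when sorting are assumptions made for illustration. quant_rate and min_token_len stand in for the quantRate and minTokenLength settings mentioned above.

```python
import hashlib
import re
from collections import Counter


def text_profile_signature(text, quant_rate=0.01, min_token_len=2):
    """Hash a quantized term-frequency profile of the text (illustrative only)."""
    # Tokens are lowercase runs of letters and digits; short ones are dropped.
    tokens = [t for t in re.findall(r"[a-z0-9]+", text.lower())
              if len(t) >= min_token_len]
    counts = Counter(tokens)
    if not counts:
        return hashlib.md5(b"").hexdigest()
    max_freq = max(counts.values())
    # The quantization step grows with the highest term frequency.
    quant = round(max_freq * quant_rate)
    if quant < 2:
        quant = 2 if max_freq > 1 else 1
    profile = []
    for token, freq in counts.items():
        quantized = (freq // quant) * quant
        # Terms whose quantized frequency falls below the step are ignored,
        # so rare words do not change the signature.
        if quantized >= quant:
            profile.append((token, quantized))
    # Assumed ordering: by decreasing frequency, then alphabetically.
    profile.sort(key=lambda item: (-item[1], item[0]))
    payload = "\n".join(f"{token} {freq}" for token, freq in profile)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()
```

In this sketch a larger quant_rate makes the frequency bands coarser, so more small edits map to the same signature, while a larger min_token_len ignores more short words.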

TextProfileSigature using deduplication

2008-11-18 Thread Marc Sturlese
in advance