Mark Miller wrote:
Thanks for sharing, Marc, that's very nice to know. I'll take your
experience as a starting point for some wiki recommendations.
Sounds like we should add a switch to order alphabetically as well.
On the general note of near-duplicate detection ... I found this paper
in the proceedin
Marc Sturlese wrote:
Hey there,
I found a couple of solutions that work fine for my case (it is not exactly what
re
> (due to small text changes) the frequency of a
> term moves between quantized bands. This then
> changes the über hash that you get from combining
> all terms, but with 10 or so bands we still get
> some matches on the hashes from the individual
> bands.
>
> The "find potentially similar files" uses a
> simple Lucene scoring function, based on the
> number of matching fingerprint values.
>
> -- Ken
> --
> Ken Krugler
> Krugle, Inc.
> +1 530-210-6378
> "If you can't find it, you can't fix it"
>
>
--
View this message in context:
http://www.nabble.com/TextProfileSigature-using-deduplication-tp20559155p20600118.html
Sent from the Solr - User mailing list archive at Nabble.com.
Marc Sturlese wrote:
Hey there, I've been testing and checking the source of the
TextProfileSignature.java to avoid similar entries at indexing time.
What I understood is that it is useful for huge text where the frequency of
the tokens (the words in lowercase, with just numbers and letters, in that
>>> so it works really slow...
>>>
> What are you doing for the String comparison? Not exact right?
>
>
--
View this message in context:
http://www.nabble.com/TextProfileSigature-using-deduplication-tp20559155p20560828.html
I have my own duplication system to detect that but I use String comparison
so it works really slow...
What are you doing for the String comparison? Not exact right?
Have you tried the tuning params for TextProfileSignature? I probably
have to update the dedupe wiki.
You can set the quantRate and the minTokenLength. Those are the
variable names, and you set them right alongside signatureClass,
signatureField, fields, etc.
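
To make the effect of those two parameters concrete, here is a rough sketch of a profile-style signature in Java. It is not the TextProfileSignature source: the class name ProfileSketch, the method names, the quantization floor, and the alphabetical tie-break are all assumptions made for illustration. Only quantRate and minTokenLength come from the discussion above.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of a quantized term-frequency signature.
public class ProfileSketch {

    // Builds the text profile: one "token freq" line per surviving token.
    static String profile(String text, float quantRate, int minTokenLength) {
        Map<String, Integer> counts = new HashMap<>();
        // Lowercase, then split on anything that is not a letter or a digit.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!token.isEmpty() && token.length() >= minTokenLength) {
                counts.merge(token, 1, Integer::sum);
            }
        }

        int maxFreq = 0;
        for (int c : counts.values()) {
            maxFreq = Math.max(maxFreq, c);
        }

        // Quantization step. Assumed floor: 2 when any token repeats, else 1.
        int quant = Math.round(maxFreq * quantRate);
        if (quant < 2) {
            quant = (maxFreq > 1) ? 2 : 1;
        }

        // Round each frequency down to a multiple of quant; drop tokens below it.
        Map<String, Integer> quantized = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            int q = (e.getValue() / quant) * quant;
            if (q >= quant) {
                quantized.put(e.getKey(), q);
            }
        }

        // Most frequent first; ties broken alphabetically (an assumption here)
        // so that the profile is deterministic.
        List<String> tokens = new ArrayList<>(quantized.keySet());
        tokens.sort((a, b) -> {
            int byFreq = Integer.compare(quantized.get(b), quantized.get(a));
            return byFreq != 0 ? byFreq : a.compareTo(b);
        });

        StringBuilder sb = new StringBuilder();
        for (String t : tokens) {
            sb.append(t).append(' ').append(quantized.get(t)).append('\n');
        }
        return sb.toString();
    }

    // MD5 of the profile, hex encoded.
    static String signature(String text, float quantRate, int minTokenLength) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(
                profile(text, quantRate, minTokenLength).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }
}
```

In this sketch a larger quantRate widens the frequency bands, so more near-duplicates collapse to the same signature, while a larger minTokenLength ignores more short words. On short texts almost every token falls below the quantization step, which fits the observation that this style of signature suits long documents better.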
Whether or not you can tune it to me
in advance
--
View this message in context:
http://www.nabble.com/TextProfileSigature-using-deduplication-tp20559155p20559155.html