between two Strings
I would create 1-5 ngram sized shingles and measure the distance using
Tanimoto coefficient. That would probably work out just fine. You
might want to add more weight the greater the size of the shingle.
There are shingle filters in lucene/java/contrib/analyzers and there
is a Tanimoto dist
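
For what it's worth, the shingle idea above could be sketched in plain Java roughly like this. It is only an illustration and does not use the Lucene contrib classes: the class and method names (ShingleSimilarity, shingles, tanimoto) are made up, it assumes word-level shingles split on whitespace, and it assumes the extra weight for bigger shingles is simply the number of words in the shingle.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Lucene.
final class ShingleSimilarity {

    // Build word shingles of 1..maxSize words from the lower-cased text.
    // Each shingle maps to its size in words, which is used as its weight (assumption).
    static Map<String, Integer> shingles(String text, int maxSize) {
        Map<String, Integer> out = new HashMap<>();
        String trimmed = text.trim().toLowerCase(Locale.ROOT);
        if (trimmed.isEmpty()) {
            return out;
        }
        String[] words = trimmed.split("\\s+");
        for (int n = 1; n <= maxSize; n++) {
            for (int i = 0; i + n <= words.length; i++) {
                String shingle = String.join(" ", Arrays.copyOfRange(words, i, i + n));
                out.put(shingle, n);
            }
        }
        return out;
    }

    // Weighted Tanimoto coefficient: weight of shared shingles / weight of all distinct shingles.
    static double tanimoto(Map<String, Integer> a, Map<String, Integer> b) {
        int shared = 0;
        int total = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            total += e.getValue();
            if (b.containsKey(e.getKey())) {
                shared += e.getValue();
            }
        }
        for (Map.Entry<String, Integer> e : b.entrySet()) {
            if (!a.containsKey(e.getKey())) {
                total += e.getValue();
            }
        }
        // Two empty strings are treated as identical (assumption).
        return total == 0 ? 1.0 : (double) shared / total;
    }

    // Similarity of two strings using shingles of 1 to 5 words.
    static double similarity(String x, String y) {
        return tanimoto(shingles(x, 5), shingles(y, 5));
    }

    public static void main(String[] args) {
        System.out.println(similarity("Lucene in Action", "lucene in action"));
        System.out.println(similarity("Lucene in Action", "Hibernate Search in Action"));
    }
}
```

Scores run from 0.0 (nothing shared) to 1.0 (same shingles), so you would still have to pick a cutoff that counts as "similar" for your data.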
Googling for "java string similarity" throws up some stuff you might
find useful.
--
Ian.
On Wed, Sep 3, 2008 at 11:58 PM, Thiago Moreira <[EMAIL PROTECTED]> wrote:
>
> Well, the similar definition that I'm looking for is the number 2, maybe
> the number 3, but to start the number 2 is enou
More details may change my opinion (not quite sure how others feel
yet), but with the way you've described it so far, it seems like all
you need is a basic string matcher:
For every message:
- if message.subject is found in the pool, then this
message is "similar to" the message in the poo
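
A rough sketch of that kind of basic matcher, in case it helps. The names here (SubjectPool, findSimilarOrAdd) are invented for illustration, the comparison ignores case and surrounding whitespace, and adding an unmatched subject to the pool is my own assumption about what should happen when nothing is found.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical pool of subjects seen so far, keyed by normalized subject.
final class SubjectPool {

    // normalized subject -> id of the first message that used it
    private final Map<String, String> pool = new HashMap<>();

    // Ignore case and leading/trailing whitespace when comparing subjects.
    static String normalize(String subject) {
        return subject.trim().toLowerCase(Locale.ROOT);
    }

    // Returns the id of the earlier message with the same normalized subject,
    // or null if there is none. In the latter case this message is added to the
    // pool (assumption, not something stated in the thread).
    String findSimilarOrAdd(String messageId, String subject) {
        return pool.putIfAbsent(normalize(subject), messageId);
    }

    public static void main(String[] args) {
        SubjectPool p = new SubjectPool();
        System.out.println(p.findSimilarOrAdd("m1", "Similarity between two Strings"));
        System.out.println(p.findSimilarOrAdd("m2", "similarity between two strings"));
    }
}
```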
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
I don't know how much of this is a Lucene problem, but -- as I'm sure
you will inevitably hear from others on the list -- it depends on
what your definition of "similar" is.
By similar, do you mean:
1. Identical, except for variations in case (upper/lower)
2. Allow 1., but also allow prefix