Re: [TEXT] Distance vs. Metric vs. Similarity

Bruno P. Kinoshita Sun, 14 Dec 2014 12:41:45 -0800

Hello Benedikt!
> Metric feels like it's something more general, but I'm not sure.
You're right. Metric was supposed to be a general interface, representing the 
String Metric from the Wikipedia article.
>  and the interface from StringMetric to StringDistance.
I'm reading the Myers paper, and already have a local branch with the Myers 
algorithm from [collections] ported to [text]. 
Perhaps we could move the StringMetric interface to the o.a.c.text package, and 
create a StringDistance or EditDistance interface in o.a.c.text.distance.
This way we can have String Metrics as in Wikipedia, that is, a way of giving a 
value for comparing two strings. We would have the edit distances in the 
distance package, and the diff algorithms in a separate diff package, all of 
them being String Metrics.
What do you think?
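
A rough sketch of how that layering could look, only to make the proposal concrete. The names StringMetric, EditDistance and the compare method follow the discussion above, but the exact signatures, the nesting in one class, and the Levenshtein implementation are assumptions for illustration, not the code in [text]:

```java
public class MetricSketch {

    /** General contract: any function that maps two character sequences to a value. */
    interface StringMetric<R> {
        R compare(CharSequence left, CharSequence right);
    }

    /** Narrower contract for edit distances: the value is a number of edits. */
    interface EditDistance extends StringMetric<Integer> {
    }

    /** Two-row Levenshtein distance, here only to have one concrete implementation. */
    static final class Levenshtein implements EditDistance {
        @Override
        public Integer compare(CharSequence left, CharSequence right) {
            int[] prev = new int[right.length() + 1];
            int[] curr = new int[right.length() + 1];
            // Distance from the empty prefix of left to each prefix of right.
            for (int j = 0; j <= right.length(); j++) {
                prev[j] = j;
            }
            for (int i = 1; i <= left.length(); i++) {
                curr[0] = i;
                for (int j = 1; j <= right.length(); j++) {
                    int cost = left.charAt(i - 1) == right.charAt(j - 1) ? 0 : 1;
                    int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                    curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
                }
                // Reuse the two rows instead of allocating a full matrix.
                int[] tmp = prev;
                prev = curr;
                curr = tmp;
            }
            return prev[right.length()];
        }
    }

    public static void main(String[] args) {
        // Callers that only need "some metric" can depend on the general interface.
        StringMetric<Integer> metric = new Levenshtein();
        System.out.println(metric.compare("kitten", "sitting"));
    }
}
```

In the real layout the two interfaces would live in different packages (o.a.c.text and o.a.c.text.distance); they are nested in one class here only so the sketch compiles as a single file.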
> > I think we should consider renaming everything to distance, since the
> > implemented algorithms all end on *Distance. So we would change the
> > package name from o.a.c.text.similarity to o.a.c.text.distance and the
> > interface from StringMetric to StringDistance.
> Looking at the code again, it seems like the algorithms all really return a
> similarity score and not a distance. For example FuzzyDistance JavaDoc
> states: "A higher score indicates a higher similarity". If this is the case,
> maybe it makes more sense to rename everything to Similarity?
I'm in favor of dropping score and similarity, and adopting distance in the 
package, classes and javadocs, as it is used in other tools (e.g. Solr, Talend, 
Informatica IIR, etc).
All the best,
Bruno


 
From: Benedikt Ritter <[email protected]>
To: Commons Developers List <[email protected]>
Sent: Sunday, December 14, 2014 6:20 PM
Subject: Re: [TEXT] Distance vs. Metric vs. Similarity
   
2014-12-14 21:08 GMT+01:00 Benedikt Ritter <[email protected]>:
>
> Hi,
>
> currently the wording in commons text is a bit confusing. We have the
> three terms:
>
> - distance
> - similarity
> - metric
>
> Distance and similarity seem to be just opposites of the same thing. A
> great distance indicates a small similarity between two character
> sequences. Metric feels like it's something more general, but I'm not sure.
>
> I think we should consider renaming everything to distance, since the
> implemented algorithms all end on *Distance. So we would change the package
> name from o.a.c.text.similarity to o.a.c.text.distance and the interface
> from StringMetric to StringDistance.
>

Looking at the code again, it seems like the algorithms all really return a
similarity score and not a distance. For example FuzzyDistance JavaDoc
states: "A higher score indicates a higher similarity". If this is the case,
maybe it makes more sense to rename everything to Similarity?
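
To make the direction of the two terms concrete, here is a small sketch. It is not taken from the [text] code base: the class name, the method names and the normalisation (1 minus distance divided by the longer length) are assumptions chosen only for illustration. A distance grows as the inputs differ more, while a similarity score shrinks:

```java
public class DistanceVsSimilarity {

    /** Levenshtein edit distance: 0 for equal inputs, larger for more different inputs. */
    static int distance(CharSequence a, CharSequence b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /**
     * Hypothetical normalisation to the range [0, 1]:
     * 1.0 for equal inputs, smaller for more different inputs.
     */
    static double similarity(CharSequence a, CharSequence b) {
        int longest = Math.max(a.length(), b.length());
        if (longest == 0) {
            return 1.0; // two empty sequences are treated as identical
        }
        return 1.0 - (double) distance(a, b) / longest;
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", "sitting"));
        System.out.println(similarity("kitten", "sitting"));
    }
}
```

Whichever name is chosen, the JavaDoc should say which direction the returned value runs in.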


>
> WDYT?
>
> Benedikt
>
> --
> http://people.apache.org/~britter/
> http://www.systemoutprintln.de/
> http://twitter.com/BenediktRitter
> http://github.com/britter


>


-- 
http://people.apache.org/~britter/
http://www.systemoutprintln.de/
http://twitter.com/BenediktRitter
http://github.com/britter
