Hello Benedikt!

> Metric feels like it's something more general, but I'm not sure.

You're right. Metric was supposed to be a general interface, representing the String Metric from the Wikipedia article.

> and the interface from StringMetric to StringDistance.

I'm reading the Myers paper, and already have a local branch with the Myers algorithm from [collections] ported to [text]. Perhaps we could move the StringMetric interface to the o.a.c.text package, and create a StringDistance or EditDistance interface in o.a.c.text.distance.

This way we can have String Metrics as in Wikipedia, that is, a way of giving a value for comparing two strings. We would have the edit distances in the distance package, and the diff algorithms in another diff package, all of them being String Metrics. What do you think?

> > I think we should consider renaming everything to distance, since the
> > implemented algorithms all end on *Distance. So we would change the package
> > name from o.a.c.text.similarity to o.a.c.text.distance and the interface
> > from StringMetric to StringDistance.
>
> Looking at the code again, it seems like the algorithms all really return a
> similarity score and not a distance. For example FuzzyDistance JavaDoc
> states: "A higher score indicates a higher similarity". If this is the case,
> maybe it makes more sense to rename everything to Similarity?

I'm in favor of dropping score and similarity, and adopting distance in the package, classes and javadocs, as it is used in other tools (e.g. Solr, Talend, Informatica IIR, etc).

All the best,
Bruno
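
For illustration only, a rough sketch of how the layout proposed above could look. The generic parameter, the method name `apply`, and the class `LevenshteinSketch` are assumptions made for this example; only the interface names StringMetric and EditDistance come from the thread, and nothing here is meant as the actual [text] code. Each type would live in its own file and package (noted in the comments); they are shown together here so the sketch compiles as one unit.

```java
// Hypothetical sketch: names follow the proposal in this thread,
// not any released Commons Text API.

/** Proposed for o.a.c.text: any way of giving a value for comparing two strings. */
interface StringMetric<R> {
    R apply(CharSequence left, CharSequence right);
}

/** Proposed for o.a.c.text.distance: a metric counting edits; lower means more alike. */
interface EditDistance extends StringMetric<Integer> {
}

/** Example implementation (assumed name): plain Levenshtein with two rows. */
final class LevenshteinSketch implements EditDistance {
    @Override
    public Integer apply(CharSequence left, CharSequence right) {
        if (left == null || right == null) {
            throw new IllegalArgumentException("Strings must not be null");
        }
        int[] prev = new int[right.length() + 1];
        int[] curr = new int[right.length() + 1];
        // Distance from the empty prefix of left to each prefix of right.
        for (int j = 0; j <= right.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= left.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= right.length(); j++) {
                int cost = left.charAt(i - 1) == right.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            // Swap rows so prev always holds the most recently completed row.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[right.length()];
    }
}
```

With this shape, a diff algorithm in a separate diff package could implement StringMetric with a different result type, while everything in the distance package shares EditDistance.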

From: Benedikt Ritter <brit...@apache.org>
To: Commons Developers List <dev@commons.apache.org>
Sent: Sunday, December 14, 2014 6:20 PM
Subject: Re: [TEXT] Distance vs. Metric vs. Similarity

2014-12-14 21:08 GMT+01:00 Benedikt Ritter <brit...@apache.org>:

> Hi,
>
> currently the wording in commons text is a bit confusing. We have the
> three terms:
>
> - distance
> - similarity
> - metric
>
> Distance and similarity seem to be just opposites of the same thing. A
> great distance indicates a small similarity between two character
> sequences. Metric feels like it's something more general, but I'm not sure.
>
> I think we should consider renaming everything to distance, since the
> implemented algorithms all end on *Distance. So we would change the package
> name from o.a.c.text.similarity to o.a.c.text.distance and the interface
> from StringMetric to StringDistance.
>

Looking at the code again, it seems like the algorithms all really return a similarity score and not a distance. For example FuzzyDistance JavaDoc states: "A higher score indicates a higher similarity". If this is the case, maybe it makes more sense to rename everything to Similarity?

> WDYT?
>
> Benedikt
>
> --
> http://people.apache.org/~britter/
> http://www.systemoutprintln.de/
> http://twitter.com/BenediktRitter
> http://github.com/britter
>

--
http://people.apache.org/~britter/
http://www.systemoutprintln.de/
http://twitter.com/BenediktRitter
http://github.com/britter