Hi Benedikt,

> Very nice! Maybe we can even come up with a generic class that calculates
> a distance based on a similarity score.

Hmmm, that's a good idea. We probably want to keep that idea in an issue for
later :-) [1]

I'll use my next development cycle on [text] to review the code and reports,
and to write the user guide with what we already have in the project. Do you
think we would need anything else before trying a 1.0 release?

There are two TODO marks in the test, but I plan to get rid of them in the
next few days too. They don't seem like a blocker right now anyway.

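To pin the idea down, here is a rough sketch of what such a generic class
might look like. It is only an illustration under stated assumptions: the
names NormalizedSimilarity, Distance, SimilarityBasedDistance and the toy
positionalSimilarity are made up for this example and are not the [text] API,
and the sketch assumes the similarity score is normalized to [0, 1], which
would not hold for every score.

```java
import java.util.Objects;

final class SimilarityDistanceSketch {

    /** Hypothetical contract: a score in [0, 1], where higher means more similar. */
    public interface NormalizedSimilarity {
        double similarity(CharSequence left, CharSequence right);
    }

    /** Hypothetical contract: 0 means identical, larger means further apart. */
    public interface Distance {
        double distance(CharSequence left, CharSequence right);
    }

    /** Generic adapter (assumption of this sketch): distance = 1 - similarity. */
    public static final class SimilarityBasedDistance implements Distance {
        private final NormalizedSimilarity similarity;

        public SimilarityBasedDistance(NormalizedSimilarity similarity) {
            this.similarity = Objects.requireNonNull(similarity, "similarity");
        }

        @Override
        public double distance(CharSequence left, CharSequence right) {
            double score = similarity.similarity(left, right);
            // Written this way so that NaN is rejected as well.
            if (!(score >= 0.0 && score <= 1.0)) {
                throw new IllegalStateException("similarity outside [0, 1]: " + score);
            }
            return 1.0 - score;
        }
    }

    /**
     * Toy similarity used only to exercise the adapter: the fraction of
     * positions holding the same character, relative to the longer input.
     */
    public static double positionalSimilarity(CharSequence left, CharSequence right) {
        int longer = Math.max(left.length(), right.length());
        if (longer == 0) {
            return 1.0; // two empty sequences are treated as identical
        }
        int shorter = Math.min(left.length(), right.length());
        int matches = 0;
        for (int i = 0; i < shorter; i++) {
            if (left.charAt(i) == right.charAt(i)) {
                matches++;
            }
        }
        return (double) matches / longer;
    }

    public static void main(String[] args) {
        Distance distance = new SimilarityBasedDistance(SimilarityDistanceSketch::positionalSimilarity);
        System.out.println(distance.distance("commons", "commons")); // prints 0.0
        System.out.println(distance.distance("abcd", "abxy"));       // prints 0.5
    }
}
```

With something along these lines, callers would only depend on the distance
contract (0 for identical input, higher for less similar input), so the
underlying algorithm could be swapped without changing their code. Scores that
are not bounded would need their own normalization first.
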
Thanks,
Bruno

[1] https://issues.apache.org/jira/browse/SANDBOX-495

From: Benedikt Ritter <brit...@apache.org>
To: Commons Developers List <dev@commons.apache.org>
Sent: Wednesday, April 15, 2015 11:03 PM
Subject: Re: [TEXT] Distance vs. Metric vs. Similarity

Hi Bruno

2015-04-15 12:14 GMT+02:00 Bruno P. Kinoshita <brunodepau...@yahoo.com.br>:

> Hi Benedikt,
>
> After playing more with [text] and some edit distances, I think we can
> pick this conversation up again and hopefully fix SANDBOX-488 [1].
>
> I've created a branch SANDBOX-488 in git [2] with the following
> modifications:
>
> * The StringMetric interface has been renamed to EditDistance
> * We have the following edit distances available: Levenshtein,
>   JaroWinkler, Hamming ([lang]) and Cosine. Others might be added in the
>   future, such as Jaccard and QGram
> * When an edit distance returns 0, it means both strings are identical or
>   at least very similar. Conversely, returning 1 or a higher value means
>   that the strings are less close to each other
> * There are other classes that can be used for text similarity, such as
>   the FuzzyScore ([lang]), and the CosineSimilarity (used by the Cosine
>   edit distance). Others might be added later, such as the Jaccard Index.
>   The behaviour of each of these classes varies
>
> I think it is simpler, and users will quickly understand the API. Once one
> understands what an edit distance is, s/he can guess the behaviour of any
> of its implementations.
>
> What do you think? If you agree I'd like to merge the branch and fix the
> issue.
>

Very nice! Maybe we can even come up with a generic class that calculates a
distance based on a similarity score.

Benedikt

>
> TL;DR: the similarity package contains code to work on text similarity,
> such as edit distances, but also scores / indexes and other algorithms.
> The StringMetric interface has been renamed to EditDistance, and only edit
> distances implement it
>
> TIA
> Bruno
>
> [1] https://issues.apache.org/jira/browse/SANDBOX-488
> [2]
> https://git1-us-west.apache.org/repos/asf?p=commons-text.git;a=tree;f=src/main/java/org/apache/commons/text/similarity;h=a2de9f0196b543f50c6d2c28376feb311f46eeda;hb=refs/heads/SANDBOX-488
>
> ------------------------------
> *From:* Benedikt Ritter <brit...@apache.org>
> *To:* Commons Developers List <dev@commons.apache.org>; Bruno P.
> Kinoshita <brunodepau...@yahoo.com.br>
> *Sent:* Friday, December 19, 2014 2:35 AM
> *Subject:* Re: [TEXT] Distance vs. Metric vs. Similarity
>
> 2014-12-14 23:10 GMT+01:00 Bruno P. Kinoshita <brunodepau...@yahoo.com.br>:
>
> > > Sounds good, although I'm not sure I understand where you are going
> > > with the marker interface. What is its purpose?
> >
> > Let's then keep the StringMetric interface and update its Javadoc.
> > Thinking again, that other marker interface seems to be unnecessary.
> >
> > > Okay, but we need to make sure all algorithms really return a distance
> > > then. As I said, FuzzyDistance currently really returns a similarity
> > > score. An algorithm returning a distance should return a higher number
> > > for higher distances.
> >
> > I had a look at the code, and I think I understand what you are saying
> > now. In FuzzyDistance, the higher the score, the closer the strings are.
> > That is different from what the other algorithms return.
> >
> > I believe I found why I named that package similarity. Probably it was
> > because I saw that in the stringmetric library [1]. There, Levenshtein,
> > Jaccard and other algorithms are suffixed with "Metric".
> >
> > How about we keep the package as similarity and simply rename the classes
> > to [Algo]Metric too? This way we will be able to accommodate other metrics
> > such as the Sorensen-Dice coefficient, where the higher the coefficient,
> > the more similar two strings are.
> >
> > WDYT?
>
> Hey Bruno,
>
> yes we can do it that way.
> What I want to avoid is that users have to check the Javadoc every time
> they use an algorithm. To me it would make sense to have a number of
> distance algorithms and they all return a distance. Or we have Similarity
> algorithms and they all return a similarity. That way users can swap out
> the underlying algorithms without changing their code.
>
> Benedikt
>
> > Cheers
> > Bruno
> >
> > [1] https://github.com/rockymadden/stringmetric
> >
> > From: Benedikt Ritter <brit...@apache.org>
> > To: Commons Developers List <dev@commons.apache.org>; Bruno P. Kinoshita
> > <brunodepau...@yahoo.com.br>
> > Sent: Sunday, December 14, 2014 6:45 PM
> > Subject: Re: [TEXT] Distance vs. Metric vs. Similarity
> >
> > Hi Bruno,
> >
> > 2014-12-14 21:37 GMT+01:00 Bruno P. Kinoshita <brunodepau...@yahoo.com.br>:
> >
> > > Hello Benedikt!
> > >
> > > > Metric feels like it's something more general, but I'm not sure.
> > >
> > > You're right. Metric was supposed to be a general interface,
> > > representing the String Metric from the Wikipedia article.
> > >
> > > > and the interface from StringMetric to StringDistance.
> > >
> > > I'm reading the Myers paper, and already have a local branch with the
> > > Myers algorithm from [collections] ported to [text].
> > >
> > > Perhaps we could move the StringMetric interface to the o.a.c.text
> > > package, and create a StringDistance or EditDistance interface in
> > > o.a.c.text.distance.
> > >
> > > This way we can have String Metrics as in Wikipedia, as being a way of
> > > giving a value for comparing two strings. We would have the edit
> > > distances in the distance package, and the diff algorithms in another
> > > diff package. All of them being String Metrics.
> > >
> > > What do you think?
> >
> > Sounds good, although I'm not sure I understand where you are going with
> > the marker interface. What is its purpose?
> >
> > > > I think we should consider renaming everything to distance, since the
> > > > implemented algorithms all end on *Distance.
> > > > So we would change the package name from o.a.c.text.similarity to
> > > > o.a.c.text.distance and the interface from StringMetric to
> > > > StringDistance.
> > > >
> > > > Looking at the code again, it seems like the algorithms all really
> > > > return a similarity score and not a distance. For example, the
> > > > FuzzyDistance Javadoc states: "A higher score indicates a higher
> > > > similarity". If this is the case, maybe it makes more sense to rename
> > > > everything to Similarity?
> > >
> > > I'm in favor of dropping score and similarity, and adopting distance in
> > > the package, classes and javadocs, as it is used in other tools (e.g.
> > > Solr, Talend, Informatica IIR, etc).
> >
> > Okay, but we need to make sure all algorithms really return a distance
> > then. As I said, FuzzyDistance currently really returns a similarity
> > score. An algorithm returning a distance should return a higher number
> > for higher distances.
> >
> > Benedikt
> >
> > > All the best,
> > > Bruno
> > >
> > > From: Benedikt Ritter <brit...@apache.org>
> > > To: Commons Developers List <dev@commons.apache.org>
> > > Sent: Sunday, December 14, 2014 6:20 PM
> > > Subject: Re: [TEXT] Distance vs. Metric vs. Similarity
> > >
> > > 2014-12-14 21:08 GMT+01:00 Benedikt Ritter <brit...@apache.org>:
> > >
> > > > Hi,
> > > >
> > > > currently the wording in commons text is a bit confusing. We have the
> > > > three terms:
> > > >
> > > > - distance
> > > > - similarity
> > > > - metric
> > > >
> > > > Distance and similarity seem to be just opposites of the same thing. A
> > > > great distance indicates a small similarity between two character
> > > > sequences. Metric feels like it's something more general, but I'm not
> > > > sure.
> > > >
> > > > I think we should consider renaming everything to distance, since the
> > > > implemented algorithms all end on *Distance. So we would change the
> > > > package name from o.a.c.text.similarity to o.a.c.text.distance and the
> > > > interface from StringMetric to StringDistance.
> > >
> > > Looking at the code again, it seems like the algorithms all really
> > > return a similarity score and not a distance. For example, the
> > > FuzzyDistance Javadoc states: "A higher score indicates a higher
> > > similarity". If this is the case, maybe it makes more sense to rename
> > > everything to Similarity?
> > >
> > > > WDYT?
> > > >
> > > > Benedikt
> > > >
> > > > --
> > > > http://people.apache.org/~britter/
> > > > http://www.systemoutprintln.de/
> > > > http://twitter.com/BenediktRitter
> > > > http://github.com/britter