2015-04-16 13:38 GMT+02:00 Bruno P. Kinoshita <brunodepau...@yahoo.com.br>:
> Hi Benedikt
>
> > Very nice! Maybe we can even come up with a generic class that calculates
> > a distance based on a similarity score.
>
> Hmmm, that's a good idea. We probably want to keep that idea in an issue
> for later :-) [1]
>
> I'll use my next development cycle on [text] to review the code and
> reports, and to write the user guide with what we have already in the
> project.
>
> Do you think we would need anything else before trying a 1.0 release?
> There are two TODO marks in the test, but I plan to get rid of them in the
> next few days too. But they don't seem like a blocker right now anyway.

Release early, release often. Better come up with a small feature set in 1.0
and add stuff in the next releases than try to push everything into 1.0.

I'd like to do a little review cycle of the code myself. I hope to find the
time this weekend. After polishing up, we can go for 1.0.

Keep up the good work!
Benedikt

> Thanks
> Bruno
>
> [1] https://issues.apache.org/jira/browse/SANDBOX-495
>
> From: Benedikt Ritter <brit...@apache.org>
> To: Commons Developers List <dev@commons.apache.org>
> Sent: Wednesday, April 15, 2015 11:03 PM
> Subject: Re: [TEXT] Distance vs. Metric vs. Similarity
>
> Hi Bruno
>
> 2015-04-15 12:14 GMT+02:00 Bruno P. Kinoshita <brunodepau...@yahoo.com.br>:
>
> > Hi Benedikt,
> >
> > After playing more with [text] and some edit distances, I think we can
> > resume this conversation and hopefully fix SANDBOX-488 [1].
> >
> > I've created a branch SANDBOX-488 in git [2] with the following
> > modifications:
> >
> > * The StringMetric interface has been renamed to EditDistance
> > * We have the following edit distances available: Levenshtein,
> >   JaroWinkler, Hamming ([lang]) and Cosine. Others might be added in the
> >   future, such as Jaccard and QGram
> > * When an edit distance returns 0, it means both strings are identical or
> >   at least very similar.
> >   The opposite is true: returning 1, or higher values, means that the
> >   strings are less close to each other
> > * There are other classes that can be used for text similarity, such as
> >   the FuzzyScore ([lang]), and the CosineSimilarity (used by the Cosine
> >   edit distance). Others might be added later, such as the Jaccard Index.
> >   The behaviour of each of these classes varies
> >
> > I think it is simpler, and users will quickly understand the API. Once one
> > understands what an edit distance is, s/he can guess the behaviour of any
> > of its implementations.
> >
> > What do you think? If you agree I'd like to merge the branch and fix the
> > issue.
>
> Very nice! Maybe we can even come up with a generic class that calculates a
> distance based on a similarity score.
>
> Benedikt
>
> > TL;DR: the similarity package contains code to work on text similarity,
> > such as edit distances, but also scores / indexes and other algorithms.
> > The StringMetric interface has been renamed to EditDistance, and only edit
> > distances implement it
> >
> > TIA
> > Bruno
> >
> > [1] https://issues.apache.org/jira/browse/SANDBOX-488
> > [2] https://git1-us-west.apache.org/repos/asf?p=commons-text.git;a=tree;f=src/main/java/org/apache/commons/text/similarity;h=a2de9f0196b543f50c6d2c28376feb311f46eeda;hb=refs/heads/SANDBOX-488
> >
> > ------------------------------
> > *From:* Benedikt Ritter <brit...@apache.org>
> > *To:* Commons Developers List <dev@commons.apache.org>; Bruno P.
> > Kinoshita <brunodepau...@yahoo.com.br>
> > *Sent:* Friday, December 19, 2014 2:35 AM
> > *Subject:* Re: [TEXT] Distance vs. Metric vs. Similarity
> >
> > 2014-12-14 23:10 GMT+01:00 Bruno P. Kinoshita <brunodepau...@yahoo.com.br>:
> >
> > > Sounds good, although I'm not sure I understand where you are going
> > > with the marker interface. What is its purpose?
> >
> > Let's then keep the StringMetric interface and update its Javadoc.
> > Thinking again, that other marker interface seems to be unnecessary.
> >
> > > Okay, but we need to make sure all algorithms really return a distance
> > > then. As I said, FuzzyDistance currently really returns a similarity
> > > score. An algorithm returning a distance should return a higher number
> > > for higher distances.
> >
> > I had a look at the code, and I think I understand what you are saying
> > now. In FuzzyDistance, the higher the score, the closer the strings are.
> > That is different from what the other algorithms return.
> >
> > I believe I found why I named that package similarity. Probably it was
> > because I saw that in the stringmetric library [1]. There, Levenshtein,
> > Jaccard and other algorithms are suffixed with "Metric".
> >
> > How about we keep the package as similarity and simply rename the classes
> > to [Algo]Metric too? This way we will be able to accommodate other metrics
> > such as the Sorensen-Dice coefficient, where the higher the coefficient,
> > the more similar two strings are.
> >
> > WDYT?
> >
> > Hey Bruno,
> >
> > yes we can do it that way. What I want to avoid is that the users have to
> > check the JavaDoc every time they use an algorithm. To me it would make
> > sense to have a number of distance algorithms and they all return a
> > distance. Or we have Similarity algorithms and they all return a
> > similarity. That way users can swap out the underlying algorithms without
> > changing their code.
> >
> > Benedikt
> >
> > Cheers
> > Bruno
> >
> > [1] https://github.com/rockymadden/stringmetric
> >
> > From: Benedikt Ritter <brit...@apache.org>
> > To: Commons Developers List <dev@commons.apache.org>; Bruno P. Kinoshita
> > <brunodepau...@yahoo.com.br>
> > Sent: Sunday, December 14, 2014 6:45 PM
> > Subject: Re: [TEXT] Distance vs. Metric vs. Similarity
> >
> > Hi Bruno,
> >
> > 2014-12-14 21:37 GMT+01:00 Bruno P. Kinoshita <brunodepau...@yahoo.com.br>:
> >
> > > Hello Benedikt!
> > > > Metric feels like it's something more general, but I'm not sure.
> > >
> > > You're right. Metric was supposed to be a general interface,
> > > representing the String Metric from the Wikipedia article.
> > >
> > > > and the interface from StringMetric to StringDistance.
> > >
> > > I'm reading the Myers paper, and already have a local branch with the
> > > Myers algorithm from [collections] ported to [text].
> > >
> > > Perhaps we could move the StringMetric interface to the o.a.c.text
> > > package, and create a StringDistance or EditDistance interface in
> > > o.a.c.text.distance.
> > >
> > > This way we can have String Metrics as in Wikipedia, as being a way of
> > > giving a value for comparing two strings. We would have the edit
> > > distances in the distance package, and the diff algorithms in another
> > > diff package. All of them being String Metrics.
> > >
> > > What do you think?
> >
> > Sounds good, although I'm not sure I understand where you are going with
> > the marker interface. What is its purpose?
> >
> > > > > I think we should consider renaming everything to distance, since
> > > > > the implemented algorithms all end on *Distance. So we would change
> > > > > the package name from o.a.c.text.similarity to o.a.c.text.distance
> > > > > and the interface from StringMetric to StringDistance.
> > > >
> > > > Looking at the code again, it seems like the algorithms all really
> > > > return a similarity score and not a distance. For example, the
> > > > FuzzyDistance JavaDoc states: "A higher score indicates a higher
> > > > similarity". If this is the case, maybe it makes more sense to rename
> > > > everything to Similarity?
> > >
> > > I'm in favor of dropping score and similarity, and adopting distance in
> > > the package, classes and javadocs, as it is used in other tools (e.g.
> > > Solr, Talend, Informatica IIR, etc).
> >
> > Okay, but we need to make sure all algorithms really return a distance
> > then.
> > As I said, FuzzyDistance currently really returns a similarity score.
> > An algorithm returning a distance should return a higher number for
> > higher distances.
> >
> > Benedikt
> >
> > > All the best,
> > > Bruno
> > >
> > > From: Benedikt Ritter <brit...@apache.org>
> > > To: Commons Developers List <dev@commons.apache.org>
> > > Sent: Sunday, December 14, 2014 6:20 PM
> > > Subject: Re: [TEXT] Distance vs. Metric vs. Similarity
> > >
> > > 2014-12-14 21:08 GMT+01:00 Benedikt Ritter <brit...@apache.org>:
> > >
> > > > Hi,
> > > >
> > > > currently the wording in commons text is a bit confusing. We have the
> > > > three terms:
> > > >
> > > > - distance
> > > > - similarity
> > > > - metric
> > > >
> > > > Distance and similarity seem to be just opposites of the same thing. A
> > > > great distance indicates a small similarity between two character
> > > > sequences. Metric feels like it's something more general, but I'm not
> > > > sure.
> > > >
> > > > I think we should consider renaming everything to distance, since the
> > > > implemented algorithms all end on *Distance. So we would change the
> > > > package name from o.a.c.text.similarity to o.a.c.text.distance and the
> > > > interface from StringMetric to StringDistance.
> > >
> > > Looking at the code again, it seems like the algorithms all really
> > > return a similarity score and not a distance. For example, the
> > > FuzzyDistance JavaDoc states: "A higher score indicates a higher
> > > similarity". If this is the case, maybe it makes more sense to rename
> > > everything to Similarity?
> > >
> > > > WDYT?
> > > >
> > > > Benedikt
> > > >
> > > > --
> > > > http://people.apache.org/~britter/
> > > > http://www.systemoutprintln.de/
> > > > http://twitter.com/BenediktRitter
> > > > http://github.com/britter

--
http://people.apache.org/~britter/
http://www.systemoutprintln.de/
http://twitter.com/BenediktRitter
http://github.com/britter
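
The idea floated in this thread, a generic class that calculates a distance
based on a similarity score, could be sketched roughly as below. This is only
an illustration under stated assumptions, not the [text] API: the names
SimilarityDistanceSketch, SimilarityScore, StringDistance and fromSimilarity
are hypothetical, the similarity is assumed to be normalised to [0, 1], and a
character-set Jaccard index stands in for whichever similarity one would
actually plug in.

```java
import java.util.HashSet;
import java.util.Set;

public class SimilarityDistanceSketch {

    /** Hypothetical interface: a higher result means the inputs are more alike. */
    public interface SimilarityScore {
        double apply(CharSequence left, CharSequence right);
    }

    /** Hypothetical interface: 0 means identical, higher means further apart. */
    public interface StringDistance {
        double apply(CharSequence left, CharSequence right);
    }

    /**
     * Generic adapter: distance = 1 - similarity.
     * Assumption: the similarity is normalised to the range [0, 1].
     */
    public static StringDistance fromSimilarity(SimilarityScore similarity) {
        return (left, right) -> 1.0 - similarity.apply(left, right);
    }

    /** Jaccard index over the sets of characters: |A n B| / |A u B|. */
    public static double jaccardIndex(CharSequence left, CharSequence right) {
        Set<Character> a = charSet(left);
        Set<Character> b = charSet(right);
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0; // two empty inputs are treated as identical
        }
        Set<Character> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<Character> union = new HashSet<>(a);
        union.addAll(b);
        return (double) intersection.size() / union.size();
    }

    /** Collects the distinct characters of a sequence. */
    private static Set<Character> charSet(CharSequence s) {
        Set<Character> result = new HashSet<>();
        for (int i = 0; i < s.length(); i++) {
            result.add(s.charAt(i));
        }
        return result;
    }

    public static void main(String[] args) {
        StringDistance jaccardDistance =
                fromSimilarity(SimilarityDistanceSketch::jaccardIndex);
        // Identical inputs have similarity 1, so the distance is 0.0.
        System.out.println(jaccardDistance.apply("night", "night"));
        // Less similar inputs yield a larger distance.
        System.out.println(jaccardDistance.apply("night", "nacht"));
    }
}
```

With an adapter like this, callers could code against the distance contract
only (higher value, further apart) and swap the underlying algorithm without
re-reading the Javadoc, which is the concern raised earlier in the thread.
Unbounded scores such as the FuzzyDistance score would need a normalisation
step first.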