2015-04-16 13:38 GMT+02:00 Bruno P. Kinoshita <brunodepau...@yahoo.com.br>:

> Hi Benedikt
>
> > Very nice! Maybe we can even come up with a generic class that calculates
> > a distance based on a similarity score.
> Hmmm, that's a good idea. We probably want to keep that idea in an issue
> for later :-) [1] I'll use my next development cycle on [text] to review
> the code and reports, and to write the user guide with what we have already
> in the project.
> Do you think we would need anything else before trying a 1.0 release?
> There are two TODO marks in the test, but I plan to get rid of them in the
> next few days too. They don't seem like a blocker right now anyway.
>

Release early, release often. Better to come up with a small feature set in
1.0 and add stuff in the next releases than to try to push everything into
1.0. I'd like to do a little review cycle of the code myself. I hope to find
the time this weekend. After polishing up, we can go for 1.0.

keep up the good work!
Benedikt


>
> Thanks
> Bruno
>
> [1] https://issues.apache.org/jira/browse/SANDBOX-495
>
>
>       From: Benedikt Ritter <brit...@apache.org>
>  To: Commons Developers List <dev@commons.apache.org>
>  Sent: Wednesday, April 15, 2015 11:03 PM
>  Subject: Re: [TEXT] Distance vs. Metric vs. Similarity
>
> Hi Bruno
>
> 2015-04-15 12:14 GMT+02:00 Bruno P. Kinoshita <brunodepau...@yahoo.com.br>:
>
> > Hi Benedikt,
> >
> > After playing more with [text] and some edit distances, I think we can
> > retake this conversation and hopefully fix SANDBOX-488 [1].
> >
> > I've created a branch SANDBOX-488 in git [2] with the following
> > modifications:
> >
> > * The StringMetric interface has been renamed to EditDistance
> > * We have the following edit distances available: Levenshtein,
> > JaroWinkler, Hamming ([lang]) and Cosine. Others might be added in the
> > future, such as Jaccard and QGram
> > * When an edit distance returns 0, it means both strings are identical or
> > at least very similar. Conversely, returning 1 or a higher value means
> > that the strings are less close to each other
> > * There are other classes that can be used for text similarity, such as
> > the FuzzyScore ([lang]), and the CosineSimilarity (used by the Cosine
> > edit distance). Others might be added later, such as the Jaccard Index.
> > The behaviour of each of these classes varies
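
For readers of the archive, the contract described in the list above (0 for identical strings, larger values for strings that are further apart) could look roughly like the sketch below. It is only an illustration: the names EditDistanceSketch, EditDistance, Levenshtein and distance are assumptions made for this example and are not taken from the SANDBOX-488 branch.

```java
import java.util.Objects;

/** Illustrative sketch only; type and method names are assumptions, not the [text] API. */
final class EditDistanceSketch {

    /** Contract from the thread: 0 means identical, larger values mean further apart. */
    interface EditDistance {
        int distance(CharSequence left, CharSequence right);
    }

    /** Two-row dynamic-programming Levenshtein distance. */
    static final class Levenshtein implements EditDistance {
        @Override
        public int distance(CharSequence left, CharSequence right) {
            Objects.requireNonNull(left, "left");
            Objects.requireNonNull(right, "right");
            int[] previous = new int[right.length() + 1];
            int[] current = new int[right.length() + 1];
            for (int j = 0; j <= right.length(); j++) {
                previous[j] = j; // "" -> first j chars of right: j insertions
            }
            for (int i = 1; i <= left.length(); i++) {
                current[0] = i; // first i chars of left -> "": i deletions
                for (int j = 1; j <= right.length(); j++) {
                    int cost = left.charAt(i - 1) == right.charAt(j - 1) ? 0 : 1;
                    current[j] = Math.min(
                            Math.min(current[j - 1] + 1, previous[j] + 1),
                            previous[j - 1] + cost);
                }
                int[] swap = previous; // reuse the two rows instead of allocating
                previous = current;
                current = swap;
            }
            return previous[right.length()];
        }
    }

    public static void main(String[] args) {
        EditDistance levenshtein = new Levenshtein();
        System.out.println(levenshtein.distance("kitten", "kitten"));  // 0
        System.out.println(levenshtein.distance("kitten", "sitting")); // 3
    }
}
```

Any other implementation of the same hypothetical interface would be read the same way by a caller: smaller is closer.
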
> >
> > I think it is simpler, and users will quickly understand the API. Once
> > one understands what an edit distance is, s/he can guess the behaviour of
> > any of its implementations.
> >
> > What do you think? If you agree I'd like to merge the branch and fix the
> > issue.
> >
>
> Very nice! Maybe we can even come up with a generic class that calculates a
> distance based on a similarity score.
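
One possible shape for such a generic class, as a rough sketch under a strong assumption: the similarity score is normalized to the range [0, 1], with 1 meaning identical, so that 1 - similarity can serve as the distance. All names here (SimilarityScore, SimilarityBasedDistance, CharJaccard) are made up for the example, and the character-set Jaccard index only stands in for a real score.

```java
import java.util.HashSet;
import java.util.Set;

/** Illustrative sketch only; every name here is an assumption, not the [text] API. */
final class SimilarityDistanceSketch {

    /** Assumed contract: a value in [0, 1], where 1 means identical. */
    interface SimilarityScore {
        double similarity(CharSequence left, CharSequence right);
    }

    /** Derives a distance from any normalized similarity as 1 - similarity. */
    static final class SimilarityBasedDistance {
        private final SimilarityScore score;

        SimilarityBasedDistance(SimilarityScore score) {
            this.score = score;
        }

        double distance(CharSequence left, CharSequence right) {
            double s = score.similarity(left, right);
            if (!(s >= 0.0 && s <= 1.0)) { // also rejects NaN
                throw new IllegalStateException("similarity outside [0, 1]: " + s);
            }
            return 1.0 - s;
        }
    }

    /** Stand-in score: Jaccard index over the sets of distinct characters. */
    static final class CharJaccard implements SimilarityScore {
        @Override
        public double similarity(CharSequence left, CharSequence right) {
            Set<Character> a = distinctChars(left);
            Set<Character> b = distinctChars(right);
            if (a.isEmpty() && b.isEmpty()) {
                return 1.0; // two empty inputs are treated as identical
            }
            Set<Character> intersection = new HashSet<>(a);
            intersection.retainAll(b);
            Set<Character> union = new HashSet<>(a);
            union.addAll(b);
            return (double) intersection.size() / union.size();
        }

        private static Set<Character> distinctChars(CharSequence text) {
            Set<Character> chars = new HashSet<>();
            for (int i = 0; i < text.length(); i++) {
                chars.add(text.charAt(i));
            }
            return chars;
        }
    }

    public static void main(String[] args) {
        SimilarityBasedDistance d = new SimilarityBasedDistance(new CharJaccard());
        System.out.println(d.distance("abc", "abc")); // 0.0
        System.out.println(d.distance("abc", "xyz")); // 1.0
    }
}
```

An unbounded score such as the fuzzy score discussed in this thread would not fit this adapter without first being normalized.
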
>
> Benedikt
>
>
> >
> > TL;DR: the similarity package contains code to work on text similarity,
> > such as edit distances, but also scores / indexes and other algorithms.
> > The StringMetric interface has been renamed to EditDistance, and only edit
> > distances implement it
> >
> > TIA
> > Bruno
> >
> > [1] https://issues.apache.org/jira/browse/SANDBOX-488
> > [2]
> >
> https://git1-us-west.apache.org/repos/asf?p=commons-text.git;a=tree;f=src/main/java/org/apache/commons/text/similarity;h=a2de9f0196b543f50c6d2c28376feb311f46eeda;hb=refs/heads/SANDBOX-488
> >
> >  ------------------------------
> >  *From:* Benedikt Ritter <brit...@apache.org>
> > *To:* Commons Developers List <dev@commons.apache.org>; Bruno P.
> > Kinoshita <brunodepau...@yahoo.com.br>
> > *Sent:* Friday, December 19, 2014 2:35 AM
> >
> > *Subject:* Re: [TEXT] Distance vs. Metric vs. Similarity
> >
> >
> >
> > 2014-12-14 23:10 GMT+01:00 Bruno P. Kinoshita <brunodepau...@yahoo.com.br>:
> >
> > > Sounds good, although I'm not sure I understand where you are going
> > > with the marker interface. What is its purpose?
> >
> > Let's then keep the StringMetric interface and update its Javadoc.
> > Thinking again, that other marker interface seems to be unnecessary.
> >
> > > Okay, but we need to make sure all algorithms really return a
> > > distance then. As I said, FuzzyDistance currently really returns a
> > > similarity score. An algorithm returning a distance should return a
> > > higher number for higher distances.
> >
> > I had a look at the code, and I think I understand what you are saying
> > now. In FuzzyDistance, the higher the score, the closer strings are.
> > This is different from what the other algorithms return.
> >
> > I believe I found why I named that package similarity. Probably it was
> > because I saw that in the stringmetric library [1]. There, Levenshtein,
> > Jaccard and other algorithms are suffixed with "Metric".
> >
> > How about we keep the package as similarity and simply rename the classes
> > to [Algo]Metric too? This way we will be able to accommodate other
> > metrics such as the Sorensen-Dice coefficient, where the higher the
> > coefficient, the more similar two strings are.
> >
> > WDYT?
> >
> >
> >
> > Hey Bruno,
> >
> > yes, we can do it that way. What I want to avoid is that users have to
> > check the JavaDoc every time they use an algorithm. To me it would make
> > sense to have a number of distance algorithms that all return a
> > distance. Or we have similarity algorithms that all return a
> > similarity. That way users can swap out the underlying algorithms without
> > changing their code.
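
To make the "swap out the underlying algorithms" point concrete, here is a small sketch of calling code that depends only on a shared interface. The interface, the closest helper and the case-insensitive variant are hypothetical names chosen for this illustration; only the idea (every implementation returns a distance, smaller is closer) comes from the paragraph above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Illustrative sketch only; the names are assumptions, not the [text] API. */
final class SwapAlgorithmsSketch {

    /** Shared contract: every implementation returns a distance, smaller is closer. */
    interface EditDistance {
        int distance(CharSequence left, CharSequence right);
    }

    /** Hamming distance: number of differing positions in equal-length inputs. */
    static final class Hamming implements EditDistance {
        @Override
        public int distance(CharSequence left, CharSequence right) {
            if (left.length() != right.length()) {
                throw new IllegalArgumentException("Hamming distance needs equal-length inputs");
            }
            int mismatches = 0;
            for (int i = 0; i < left.length(); i++) {
                if (left.charAt(i) != right.charAt(i)) {
                    mismatches++;
                }
            }
            return mismatches;
        }
    }

    /** Caller code: works with any EditDistance because "smaller is closer" always holds. */
    static String closest(String query, List<String> candidates, EditDistance metric) {
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (String candidate : candidates) {
            int d = metric.distance(query, candidate);
            if (d < bestDistance) { // first candidate wins on ties
                bestDistance = d;
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("karolin", "kathrin", "kerstin");
        EditDistance hamming = new Hamming();
        // A second algorithm, swapped in without touching closest().
        EditDistance ignoreCase = (l, r) -> hamming.distance(
                l.toString().toLowerCase(Locale.ROOT), r.toString().toLowerCase(Locale.ROOT));

        System.out.println(closest("karotin", names, hamming));    // karolin
        System.out.println(closest("KERSTIN", names, ignoreCase)); // kerstin
    }
}
```

If some implementations returned a similarity instead, closest() would silently pick the worst match, which is the JavaDoc-checking burden the paragraph above wants to avoid.
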
> >
> > Benedikt
> >
> >
> > Cheers
> > Bruno
> > [1] https://github.com/rockymadden/stringmetric
> >
> >
> >
> >      From: Benedikt Ritter <brit...@apache.org>
> >  To: Commons Developers List <dev@commons.apache.org>; Bruno P. Kinoshita
> > <brunodepau...@yahoo.com.br>
> >  Sent: Sunday, December 14, 2014 6:45 PM
> >  Subject: Re: [TEXT] Distance vs. Metric vs. Similarity
> >
> > Hi Bruno,
> >
> >
> >
> > 2014-12-14 21:37 GMT+01:00 Bruno P. Kinoshita <
> brunodepau...@yahoo.com.br
> > >:
> > >
> > > Hello Benedikt!
> > > > Metric feels like it's something more general, but I'm not sure.
> > > You're right. Metric was supposed to be a general interface,
> > > representing the String Metric from the Wikipedia article.
> > > >  and the interface from StringMetric to StringDistance.
> > > I'm reading the Myers paper, and already have a local branch with the
> > > Myers algorithm from [collections] ported to [text].
> > > > Perhaps we could move the StringMetric interface to the o.a.c.text
> > > > package, and create a StringDistance or EditDistance interface in
> > > > o.a.c.text.distance.
> > > > This way we can have String Metrics as in Wikipedia, as being a way of
> > > > giving a value for comparing two strings. We would have the edit
> > > > distances in the distance package, and the diff algorithms in another
> > > > diff package. All of them being String Metrics.
> > > What do you think?
> > >
> >
> > Sounds good, although I'm not sure I understand where you are going with
> > > the marker interface. What is its purpose?
> >
> >
> > > > > > I think we should consider renaming everything to distance, since
> > > > > > the implemented algorithms all end on *Distance. So we would change
> > > > > > the package name from o.a.c.text.similarity to o.a.c.text.distance
> > > > > > and the interface from StringMetric to StringDistance.
> > > > >
> > > > > Looking at the code again, it seems like the algorithms all really
> > > > > return a similarity score and not a distance. For example, the
> > > > > FuzzyDistance JavaDoc states: "A higher score indicates a higher
> > > > > similarity". If this is the case, maybe it makes more sense to rename
> > > > > everything to Similarity?
> > > >
> > > > I'm in favor of dropping score and similarity, and adopting distance in
> > > > the package, classes and javadocs, as it is used in other tools (e.g.
> > > > Solr, Talend, Informatica IIR, etc).
> > >
> >
> > > Okay, but we need to make sure all algorithms really return a distance
> > > then. As I said, FuzzyDistance currently really returns a similarity
> > > score. An algorithm returning a distance should return a higher number
> > > for higher distances.
> >
> > Benedikt
> >
> >
> > > > All the best,
> > > > Bruno
> > >
> > >
> > >      From: Benedikt Ritter <brit...@apache.org>
> > >  To: Commons Developers List <dev@commons.apache.org>
> > >  Sent: Sunday, December 14, 2014 6:20 PM
> > >  Subject: Re: [TEXT] Distance vs. Metric vs. Similarity
> > >
> > > 2014-12-14 21:08 GMT+01:00 Benedikt Ritter <brit...@apache.org>:
> > > >
> > > > Hi,
> > > >
> > > > currently the wording in commons text is a bit confusing. We have the
> > > > three terms:
> > > >
> > > > - distance
> > > > - similarity
> > > > - metric
> > > >
> > > > > Distance and similarity seem to be just opposites of the same thing.
> > > > > A great distance indicates a small similarity between two character
> > > > > sequences. Metric feels like it's something more general, but I'm not
> > > > > sure.
> > > > >
> > > > > I think we should consider renaming everything to distance, since the
> > > > > implemented algorithms all end on *Distance. So we would change the
> > > > > package name from o.a.c.text.similarity to o.a.c.text.distance and the
> > > > > interface from StringMetric to StringDistance.
> > > >
> > >
> > > > Looking at the code again, it seems like the algorithms all really
> > > > return a similarity score and not a distance. For example, the
> > > > FuzzyDistance JavaDoc states: "A higher score indicates a higher
> > > > similarity". If this is the case, maybe it makes more sense to rename
> > > > everything to Similarity?
> > >
> > >
> > > >
> > > > WDYT?
> > > >
> > > > Benedikt
> > > >
> > > > --
> > > > http://people.apache.org/~britter/
> > > > http://www.systemoutprintln.de/
> > > > http://twitter.com/BenediktRitter
> > > > http://github.com/britter


-- 
http://people.apache.org/~britter/
http://www.systemoutprintln.de/
http://twitter.com/BenediktRitter
http://github.com/britter
