Hi Bruno,


2014-12-14 21:37 GMT+01:00 Bruno P. Kinoshita <brunodepau...@yahoo.com.br>:
>
> Hello Benedikt!
>
> > Metric feels like it's something more general, but I'm not sure.
>
> You're right. Metric was supposed to be a general interface,
> representing the String Metric from the Wikipedia article.
>
> > and the interface from StringMetric to StringDistance.
>
> I'm reading the Myers paper, and already have a local branch with the
> Myers algorithm from [collections] ported to [text].
>
> Perhaps we could move the StringMetric interface to the o.a.c.text package,
> and create a StringDistance or EditDistance interface in o.a.c.text.distance.
> This way we can have String Metrics as in Wikipedia, as a way of
> giving a value for comparing two strings. We would have the edit distances
> in the distance package, and the diff algorithms in another diff package,
> all of them being String Metrics.
>
> What do you think?
>

Sounds good, although I'm not sure I understand where you are going with
the marker interface. What is its purpose?


> > > I think we should consider renaming everything to distance, since the
> > > implemented algorithms all end on *Distance. So we would change the
> > > package name from o.a.c.text.similarity to o.a.c.text.distance and the
> > > interface from StringMetric to StringDistance.
> >
> > Looking at the code again, it seems like the algorithms all really
> > return a similarity score and not a distance. For example, the
> > FuzzyDistance JavaDoc states: "A higher score indicates a higher
> > similarity". If this is the case, maybe it makes more sense to rename
> > everything to Similarity?
>
> I'm in favor of dropping score and similarity, and adopting distance in
> the package, classes and javadocs, as it is used in other tools (e.g. Solr,
> Talend, Informatica IIR, etc).
>

Okay, but we need to make sure all algorithms really return a distance
then. As I said, FuzzyDistance currently really returns a similarity score.
An algorithm returning a distance should return a higher number for higher
distances.
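
A minimal sketch of that contract, assuming a hypothetical StringDistance
interface (the name comes from the proposal in this thread, not from existing
[text] code). It uses plain Levenshtein edit distance purely as an
illustration: identical inputs give 0, and the value grows as the inputs
diverge, which is the opposite of what the FuzzyDistance JavaDoc describes.

```java
public class DistanceContractSketch {

    /** Hypothetical interface; the name is taken from the proposal, not from [text]. */
    public interface StringDistance {
        int distance(CharSequence left, CharSequence right);
    }

    /** Textbook Levenshtein edit distance, used here only to illustrate the contract. */
    public static int levenshteinDistance(CharSequence left, CharSequence right) {
        int[] prev = new int[right.length() + 1];
        int[] curr = new int[right.length() + 1];
        // Turning the empty prefix of 'left' into a prefix of 'right' takes j insertions.
        for (int j = 0; j <= right.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= left.length(); i++) {
            curr[0] = i; // i deletions to reach the empty prefix of 'right'
            for (int j = 1; j <= right.length(); j++) {
                int cost = left.charAt(i - 1) == right.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[right.length()];
    }

    public static final StringDistance LEVENSHTEIN = DistanceContractSketch::levenshteinDistance;

    public static void main(String[] args) {
        // Identical strings: distance 0. More differences: larger distance.
        System.out.println(LEVENSHTEIN.distance("kitten", "kitten"));
        System.out.println(LEVENSHTEIN.distance("kitten", "sitten"));
        System.out.println(LEVENSHTEIN.distance("kitten", "sitting"));
    }
}
```

An algorithm whose result goes up as the inputs become more alike would not
fit this shape and would be better described as a similarity score.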

Benedikt


> All the best, Bruno
>
>
>       From: Benedikt Ritter <brit...@apache.org>
>  To: Commons Developers List <dev@commons.apache.org>
>  Sent: Sunday, December 14, 2014 6:20 PM
>  Subject: Re: [TEXT] Distance vs. Metric vs. Similarity
>
> 2014-12-14 21:08 GMT+01:00 Benedikt Ritter <brit...@apache.org>:
> >
> > Hi,
> >
> > currently the wording in commons text is a bit confusing. We have the
> > three terms:
> >
> > - distance
> > - similarity
> > - metric
> >
> > Distance and similarity seem to be just opposites of the same thing. A
> > great distance indicates a small similarity between two character
> > sequences. Metric feels like it's something more general, but I'm not
> > sure.
> >
> > I think we should consider renaming everything to distance, since the
> > implemented algorithms all end on *Distance. So we would change the
> > package name from o.a.c.text.similarity to o.a.c.text.distance and the
> > interface from StringMetric to StringDistance.
> >
>
> Looking at the code again, it seems like the algorithms all really return a
> similarity score and not a distance. For example, the FuzzyDistance JavaDoc
> states: "A higher score indicates a higher similarity". If this is the case,
> maybe it makes more sense to rename everything to Similarity?
>
>
> >
> > WDYT?
> >
> > Benedikt
> >
> > --
> > http://people.apache.org/~britter/
> > http://www.systemoutprintln.de/
> > http://twitter.com/BenediktRitter
> > http://github.com/britter
>

-- 
http://people.apache.org/~britter/
http://www.systemoutprintln.de/
http://twitter.com/BenediktRitter
http://github.com/britter
