Re: [Text] JaccardSimilarity

Alex Herbert Thu, 07 Mar 2019 16:05:32 -0800

> On 8 Mar 2019, at 00:01, Bruno P. Kinoshita 
> <brunodepau...@yahoo.com.br.INVALID> wrote:
> 
>> I’d favour dropping the round and adding it to the Changes.xml via a Jira 
>> ticket so it is noted if someone upgrades. They can always restore 
>> functionality to as-it-was by doing a round on the output of the class. 
> +1
>> I’ve already made the test using the python distance.jaccard function from 
>> the distance library in the PR for Text-155. So changing the test is simple. 
>> It’s just the decision on whether to do it.
> I think we can aim at implementing this for 1.7 (which from the looks of it 
> will have several bug fixes & improvements!).
> Cheers,
> Bruno

I'll put the changes into a Jira and PR.
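
For anyone who depends on the old two-decimal output, the workaround quoted
above (round the result yourself) might look roughly like the sketch below. The
class and helper names are hypothetical, and the rounding expression is an
assumption about how two-decimal rounding would be done with Math.round, not a
copy of the Commons Text source. The closeEnough check only mirrors the idea
behind assertEquals(expected, actual, threshold).

```java
final class RestoreRoundingSketch {

    /** Hypothetical helper: round a similarity score to two decimal places. */
    static double roundToTwoPlaces(double score) {
        // Assumed form of the rounding; scale up, round to a long, scale down.
        return Math.round(score * 100d) / 100d;
    }

    /** Tolerance comparison in the spirit of assertEquals(expected, actual, delta). */
    static boolean closeEnough(double expected, double actual, double delta) {
        return Math.abs(expected - actual) <= delta;
    }

    public static void main(String[] args) {
        // Stand-in for an unrounded score: 1 shared character, 3 in the union.
        double unrounded = 1.0 / 3.0;

        // Caller-side rounding restores the previous two-decimal behaviour.
        double asBefore = roundToTwoPlaces(unrounded);
        System.out.println(unrounded + " -> " + asBefore);

        // A test against a reference value would compare with a tolerance.
        System.out.println(closeEnough(0.3333, unrounded, 1e-4));
    }
}
```

With a tolerance this tight, a test passes against the unrounded score but not
against a value rounded to two places, which is the point of changing the
tests.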

Alex


> 
> 
>    On Friday, 8 March 2019, 10:54:32 am NZDT, Alex Herbert 
> <alex.d.herb...@gmail.com> wrote:  
> 
> Hi Bruno,
> 
>> On 7 Mar 2019, at 21:18, Bruno P. Kinoshita <ki...@apache.org> wrote:
>> 
>> Hi Alex,
>> Can't recall why it was done that way. When the initial code for the edit 
>> distances was created, some Java libraries like Simmetrics, 
>> java-string-similarity, Lucene, and also R/Python code were used to verify 
>> the output of the edit distances.
>> Maybe we used Math.round just to get a test passing, which I agree should 
>> have been documented.
>> But even better if we just drop the Math.round and instead update the tests 
>> with that assertEquals(expected, actual, threshold) method, with a good 
>> enough threshold.
>> What do you think?
> 
> I’d favour dropping the round and adding it to the Changes.xml via a Jira 
> ticket so it is noted if someone upgrades. They can always restore 
> functionality to as-it-was by doing a round on the output of the class. 
> 
> If I understand the metric correctly (intersect over union) to have a 
> difference in the 3rd decimal place would require the union of the two 
> character sets to be above 200, i.e. a string containing over 200 unique 
> characters, e.g. 
> 
> A) 0/200 = 0
> B) 1/200 = 0.005
> C) 2/200 = 0.01
> 
> In this case results A and C can be distinguished, but B and C cannot, 
> because B rounds up to 0.01.
> 
> So in practical terms it would not make a difference unless using a large 
> character set. For ASCII strings there is no difference.
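
The A/B/C case above can be checked with a few lines of Java. This is only a
sketch: the class and method names are made up for illustration, and
Math.round(x * 100d) / 100d is assumed as the two-decimal rounding rather than
taken from the library source.

```java
final class RoundingCollisionSketch {

    /** Assumed stand-in for the two-decimal rounding under discussion. */
    static double roundTo2(double value) {
        return Math.round(value * 100d) / 100d;
    }

    /** Jaccard index from set sizes: |intersection| / |union|. */
    static double jaccard(int intersection, int union) {
        return (double) intersection / union;
    }

    public static void main(String[] args) {
        int union = 200;
        // Cases A, B and C: 0, 1 and 2 shared characters out of 200.
        for (int intersection = 0; intersection <= 2; intersection++) {
            double exact = jaccard(intersection, union);
            System.out.printf("%d/%d exact=%.3f rounded=%.2f%n",
                    intersection, union, exact, roundTo2(exact));
        }
        // B and C differ before rounding but collapse to the same value after it.
        System.out.println(roundTo2(jaccard(1, union)) == roundTo2(jaccard(2, union)));
    }
}
```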
> 
> I’ve already made the test using the python distance.jaccard function from 
> the distance library in the PR for Text-155. So changing the test is simple. 
> It’s just the decision on whether to do it.
> 
> Alex
> 
> 
>> Cheers,
>> Bruno
>> 
>>     On Friday, 8 March 2019, 4:49:52 am NZDT, Alex Herbert 
>> <alex.d.herb...@gmail.com> wrote:  
>> 
>> A quick question about the JaccardSimilarity class:
>> 
>> Q. Why does it round the similarity to 2 decimal places?
>> 
>> This is not documented.
>> 
>> It is also done in the complementary JaccardDistance class.
>> 
>> Looking at the history in git it seems to have always been that way. 
>> First commit was 2016-11-27.
>> 
>> Thanks,
>> 
>> Alex
>> 
>> 
>> 
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
>> For additional commands, e-mail: dev-h...@commons.apache.org
>> 
> 

