> On 8 Mar 2019, at 00:01, Bruno P. Kinoshita
> <brunodepau...@yahoo.com.br.INVALID> wrote:
>
>> I’d favour dropping the round and adding it to the Changes.xml via a Jira
>> ticket so it is noted if someone upgrades. They can always restore
>> functionality to as-it-was by doing a round on the output of the class.
> +1
>> I’ve already made the test using the python distance.jaccard function from
>> the distance library in the PR for Text-155. So changing the test is simple.
>> It’s just the decision on whether to do it.
> I think we can aim at implementing this for 1.7 (which from the looks of it
> will have several bug fixes & improvements!).
> Cheers
> Bruno
I'll put the changes into a Jira and PR.
Alex
>
>
> On Friday, 8 March 2019, 10:54:32 am NZDT, Alex Herbert
> <alex.d.herb...@gmail.com> wrote:
>
> Hi Bruno,
>
>> On 7 Mar 2019, at 21:18, Bruno P. Kinoshita <ki...@apache.org> wrote:
>>
>> Hi Alex,
>> Can't recall why it was done that way. When the initial code for the edit
>> distances was created, some Java libraries like Simmetrics,
>> java-string-similarity, Lucene, and also R/Python code were used to verify
>> the output of the edit distances.
>> Maybe we used Math.round just to get a test passing, which I agree should
>> have been documented.
>> But even better if we just drop the Math.round and instead update the tests
>> with that assertEquals(expected, actual, threshold) method, with a good
>> enough threshold.
>> What do you think?
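A minimal sketch of what dropping the round and comparing against a threshold could look like. JaccardSketch, similarity and closeTo are hypothetical names, not the Commons Text API; closeTo stands in for JUnit's assertEquals(expected, actual, delta) so the block runs without a test framework. Returning 0 for two empty inputs is a choice made for this sketch only.

```java
import java.util.HashSet;
import java.util.Set;

class JaccardSketch {

    // Collect the unique characters of the input.
    static Set<Character> chars(CharSequence s) {
        Set<Character> set = new HashSet<>();
        for (int i = 0; i < s.length(); i++) {
            set.add(s.charAt(i));
        }
        return set;
    }

    // Intersection over union of the two character sets, with no rounding.
    static double similarity(CharSequence left, CharSequence right) {
        Set<Character> l = chars(left);
        Set<Character> r = chars(right);
        Set<Character> union = new HashSet<>(l);
        union.addAll(r);
        if (union.isEmpty()) {
            return 0.0; // assumption of this sketch: two empty inputs score 0
        }
        Set<Character> intersection = new HashSet<>(l);
        intersection.retainAll(r);
        return (double) intersection.size() / union.size();
    }

    // Same idea as assertEquals(expected, actual, threshold), as a boolean.
    static boolean closeTo(double expected, double actual, double threshold) {
        return Math.abs(expected - actual) <= threshold;
    }

    public static void main(String[] args) {
        // "night" and "nacht" share n, h, t out of 7 distinct characters.
        double actual = similarity("night", "nacht");
        System.out.println(closeTo(3.0 / 7.0, actual, 1e-12));
    }
}
```

With the round removed, an expected value taken from an external reference can be compared at a tight threshold such as 1e-12 instead of being limited to two decimal places.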
>
> I’d favour dropping the round and adding it to the Changes.xml via a Jira
> ticket so it is noted if someone upgrades. They can always restore
> functionality to as-it-was by doing a round on the output of the class.
>
> If I understand the metric correctly (intersection over union), a difference
> in the 3rd decimal place would require the union of the two character sets
> to be above 200, i.e. a string containing over 200 unique characters, e.g.
>
> A) 0/200 = 0
> B) 1/200 = 0.005
> C) 2/200 = 0.01
>
> In this case results A and C can be distinguished, but B and C cannot,
> because B rounds up to 0.01.
>
> So in practical terms it would not make a difference unless using a large
> character set. For ASCII strings there is no difference.
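The A/B/C arithmetic can be checked with a short sketch. It assumes the class rounds via Math.round(value * 100d) / 100d, which is a guess at the implementation rather than something confirmed in this thread; RoundingDemo and roundTo2 are hypothetical names.

```java
class RoundingDemo {

    // Round to two decimal places (assumed form of the existing rounding).
    static double roundTo2(double value) {
        return Math.round(value * 100d) / 100d;
    }

    public static void main(String[] args) {
        int union = 200;
        // Cases A, B and C: 0, 1 and 2 shared characters out of a union of 200.
        for (int intersection = 0; intersection <= 2; intersection++) {
            double raw = (double) intersection / union;
            System.out.println(intersection + "/" + union
                    + " raw=" + raw + " rounded=" + roundTo2(raw));
        }
    }
}
```

Once the round is dropped from the class, a caller who wants the old two-decimal values back could apply a helper like roundTo2 to the returned similarity.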
>
> I’ve already made the test using the python distance.jaccard function from
> the distance library in the PR for Text-155. So changing the test is simple.
> It’s just the decision on whether to do it.
>
> Alex
>
>
>> Cheers
>> Bruno
>>
>> On Friday, 8 March 2019, 4:49:52 am NZDT, Alex Herbert
>> <alex.d.herb...@gmail.com> wrote:
>>
>> A quick question about the JaccardSimilarity class:
>>
>> Q. Why does it round the similarity to 2 decimal places?
>>
>> This is not documented.
>>
>> It is also done in the complementary JaccardDistance class.
>>
>> Looking at the history in git it seems to have always been that way.
>> First commit was 2016-11-27.
>>
>> Thanks,
>>
>> Alex
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
>> For additional commands, e-mail: dev-h...@commons.apache.org
>>
>