On 22.11.2012 17:01, Julian Foad wrote:
>> Author: brane
>> Modified: subversion/trunk/subversion/include/private/svn_string_private.h
>> +/**
>> + * Computes the similarity score STRA and STRB, given as the ratio
>> + * between the length of their longest common subsequence and the
>> + * length of the strings, normalized to the range [0..1000].
> "longest common non-contiguous subsequence"

Yes, we know it is, but that's not what the name of the algorithm is.

> "average length of the two strings"

This is true; however, saying "average" always sounds, to me, as if
we're dealing with a whole bunch of values, rather than just two.

> Just use 'float' and [0..1]?  I'm thinking we may want to use this for 
> comparing diff hunks, and there may be more than 500 characters in each hunk 
> and it matters whether just one character has been changed.

Could do that, too. Harder to test though. :) It's trivial to extend the
range to [0..10**6] or even 10**9 if that's a problem. Frankly it feels
nice to keep the whole thing integral and not leave floating-point
messes around.

> [...]
>
>> Modified: subversion/trunk/subversion/tests/libsvn_subr/string-test.c
>> +static svn_error_t *
>> +test_string_similarity(apr_pool_t *pool)
>> +{
>> +  const struct sim_score_test_t
>> +  {
>> +    const char *stra;
>> +    const char *strb;
>> +    apr_size_t lcs;
>> +    int score;
>> +  } tests[] =
>> +      {
>> +#define SCORE(lcs, len) ((2000 * (lcs) + (len)/2) / (len))
>> +
>> +        /* Equality */
>> +        {"",       "",          0, 1000},
>> +        {"quoth",  "quoth",     5, SCORE(5, 5+5)},
>> +
>> +        /* Deletion at start */
>> +        {"quoth",  "uoth",      4, SCORE(4, 5+4)},
>> +        {"uoth",   "quoth",     4, SCORE(4, 4+5)},
>> +
>> +        /* Deletion at end */
>> +        {"quoth",  "quot",      4, SCORE(4, 5+4)},
>> +        {"quot",   "quoth",     4, SCORE(4, 4+5)},
>> +
>> +        /* Insertion at start */
>> +        {"quoth",  "Xquoth",    5, SCORE(5, 5+6)},
>> +        {"Xquoth", "quoth",     5, SCORE(5, 6+5)},
>> +
>> +        /* Insertion at end */
>> +        {"quoth",  "quothX",    5, SCORE(5, 5+6)},
>> +        {"quothX", "quoth",     5, SCORE(5, 6+5)},
> Half of the examples in each of these four pairs are redundant.  
> (The word "deletion" implies an ordering of string-A -> string-B, so
>  by that terminology "uoth" -> "quoth" is an Insertion, same as 
> "quoth" -> "Xquoth".)

Theoretically and academically speaking, yes, they're redundant.
Practically, however, this helped me find a bug that only popped up when
exactly one of these examples was removed. I love symmetric algorithms
but my bugs are always asymmetric and some of them break causality, too. :)

So in order to avoid falling into the incompleteness trap, I just
mirrored all the test cases. It's less of a problem to have duplicate
tests than it is to miss one.

-- Brane

-- 
Branko Čibej
Director of Subversion | WANdisco | www.wandisco.com

Reply via email to