On 22.11.2012 17:01, Julian Foad wrote: >> Author: brane >> Modified: subversion/trunk/subversion/include/private/svn_string_private.h >> +/** >> + * Computes the similarity score STRA and STRB, given as the ratio >> + * between the length of their longest common subsequence and the >> + * length of the strings, normalized to the range [0..1000]. > "longest common non-contiguous subsequence"
Yes, we know it is, but that's not what the name of the algorithm is. > "average length of the two strings" This is true; however, saying "average" always sounds, to me, as if we're dealing with a whole bunch of values, rather than just two. > Just use 'float' and [0..1]? I'm thinking we may want to use this for > comparing diff hunks, and there may be more than 500 characters in each hunk > and it matters whether just one character has been changed. Could do that, too. Harder to test though. :) It's trivial to extend the range to [0..10**6] or even 10**9 if that's a problem. Frankly it feels nice to keep the whole thing integral and not leave floating-point messes around. > [...] > >> Modified: subversion/trunk/subversion/tests/libsvn_subr/string-test.c >> +static svn_error_t * >> +test_string_similarity(apr_pool_t *pool) >> +{ >> + const struct sim_score_test_t >> + { >> + const char *stra; >> + const char *strb; >> + apr_size_t lcs; >> + int score; >> + } tests[] = >> + { >> +#define SCORE(lcs, len) ((2000 * (lcs) + (len)/2) / (len)) >> + >> + /* Equality */ >> + {"", "", 0, 1000}, >> + {"quoth", "quoth", 5, SCORE(5, 5+5)}, >> + >> + /* Deletion at start */ >> + {"quoth", "uoth", 4, SCORE(4, 5+4)}, >> + {"uoth", "quoth", 4, SCORE(4, 4+5)}, >> + >> + /* Deletion at end */ >> + {"quoth", "quot", 4, SCORE(4, 5+4)}, >> + {"quot", "quoth", 4, SCORE(4, 4+5)}, >> + >> + /* Insertion at start */ >> + {"quoth", "Xquoth", 5, SCORE(5, 5+6)}, >> + {"Xquoth", "quoth", 5, SCORE(5, 6+5)}, >> + >> + /* Insertion at end */ >> + {"quoth", "quothX", 5, SCORE(5, 5+6)}, >> + {"quothX", "quoth", 5, SCORE(5, 6+5)}, > Half of the examples in each of these four pairs are redundant. > (The word "deletion" implies an ordering of string-A -> string-B, so > by that terminology "uoth" -> "quoth" is an Insertion, same as > "quoth" -> "Xquoth".) Theoretically and academically speaking, yes, they're redundant. Practically, however, this helped me find a bug that only popped up when exactly one of these examples was removed. I love symmetric algorithms but my bugs are always asymmetric and some of them break causality, too. :) So in order to avoid falling into the incompleteness trap, I just mirrored all the test cases. It's less of a problem to have duplicate tests than it is to miss one. -- Brane -- Branko Čibej Director of Subversion | WANdisco | www.wandisco.com