On March 7, 2016 3:57:16 PM GMT+01:00, David Malcolm <dmalc...@redhat.com> wrote: >On Sat, 2016-03-05 at 23:46 +0100, Bernhard Reutner-Fischer wrote: >[...] > >> diff --git a/gcc/fortran/misc.c b/gcc/fortran/misc.c >> index 405bae0..72ed311 100644 >> --- a/gcc/fortran/misc.c >> +++ b/gcc/fortran/misc.c >[...] > >> @@ -274,3 +275,41 @@ get_c_kind(const char *c_kind_name,teropKind_tki >> nds_table[]) >> >> return ISOCBINDING_INVALID; >> } >> + >> + >> +/* For a given name TYPO, determine the best candidate from >> CANDIDATES >> + perusing Levenshtein distance. Frees CANDIDATES before >> returning. */ >> + >> +const char * >> +gfc_closest_fuzzy_match (const char *typo, char **candidates) >> +{ >> + /* Determine closest match. */ >> + const char *best = NULL; >> + char **cand = candidates; >> + edit_distance_t best_distance = MAX_EDIT_DISTANCE; >> + >> + while (cand && *cand) >> + { >> + edit_distance_t dist = levenshtein_distance (typo, *cand); >> + if (dist < best_distance) >> + { >> + best_distance = dist; >> + best = *cand; >> + } >> + cand++; >> + } >> + /* If more than half of the letters were misspelled, the >> suggestion is >> + likely to be meaningless. */ >> + if (best) >> + { >> + unsigned int cutoff = MAX (strlen (typo), strlen (best)) / 2; >> + >> + if (best_distance > cutoff) >> + { >> + XDELETEVEC (candidates); >> + return NULL; >> + } >> + XDELETEVEC (candidates); >> + } >> + return best; >> +} > >FWIW, there are two overloaded variants of levenshtein_distance in >gcc/spellcheck.h, the first of which takes a pair of strlen values; >your patch uses the second one: > >extern edit_distance_t >levenshtein_distance (const char *s, int len_s, > const char *t, int len_t); > >extern edit_distance_t >levenshtein_distance (const char *s, const char *t); > >So one minor tweak you may want to consider here is to calculate > strlen (typo) >once at the top of gfc_closest_fuzzy_match, and then pass it in to the >4-arg variant of levenshtein_distance, which would avoid recalculating >strlen (typo) for every candidate.
I've pondered this back then but came to the conclusion to use the variant without len because to use the 4 argument variant I would have stored the candidates strlen in the vector too and was not convinced about the memory footprint for that would be justified. Maybe it is, but I would prefer the following tweak in the 4 argument variant: If you would amend the 4 argument variant with a if (len_t == -1) len_t = strlen (t); before the if (len_s == 0) return len_t; if (len_t == 0) return len_s; checks then I'd certainly use the 4 arg variant :) WDYT? > >I can't comment on the rest of the patch (I'm not a Fortran expert), >though it seems sane to > >Hope this is constructive It is, thanks for your thoughts! cheers,