On 2013-10-30 19:28, Roy Smith wrote:
> For example, it's reasonable to consider any vowel (or string of
> vowels, for that matter) to be closer to another vowel than to a
> consonant. A great example is the word, "bureaucrat". As far as
> I'm concerned, it's spelled {b, vowels, r, vowels, c, r, a, t}. It
> usually takes me three or four tries to get auto-correct to even
> recognize what I'm trying to type and fix it for me.
[glad I'm not the only one who has trouble spelling "bureaucrat"]

Steven D'Aprano wisely mentioned elsewhere in the thread that "The
right solution to that is to treat it no differently from other fuzzy
searches. A good search engine should be tolerant of spelling errors
and alternative spellings for any letter, not just those with
diacritics."

Often the Levenshtein distance is used for calculating closeness, and
the off-the-shelf algorithm assigns a cost of one per difference
(addition, change, or removal). It doesn't sound like it would be that
hard[1] to assign varying costs based on what character was
added/changed/removed. A diacritic might have a cost of N while a
similar character (vowel->vowel or consonant->consonant, or a
consonant-cluster shift) might have a cost of 2N, and a totally
arbitrary character shift might have a cost of 3N (or higher). A rough
sketch of what that might look like is tacked on below.

Unfortunately, the Levenshtein algorithm is already O(M*N) slow and
can't be reasonably precalculated without knowing both strings, so
this just ends up heaping additional lookups/comparisons atop
already-slow code.

-tkc

[1] http://en.wikipedia.org/wiki/Levenshtein_distance#Possible_modifications
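
For what it's worth, here's a rough sketch of what those weighted
costs might look like. The vowel/consonant grouping and the 1/2/3
cost values are just placeholders (and it assumes lower-cased input);
plug in whatever weighting scheme you actually want:

  import unicodedata

  VOWELS = set("aeiouy")

  def strip_diacritics(c):
      # reduce "e" vs. "e-with-accent" to the same base character
      return unicodedata.normalize("NFD", c)[0]

  def sub_cost(a, b):
      # cheap for diacritic-only or same-class changes,
      # expensive for arbitrary ones
      if a == b:
          return 0
      if strip_diacritics(a) == strip_diacritics(b):
          return 1
      if (a in VOWELS) == (b in VOWELS):
          return 2
      return 3

  def weighted_levenshtein(s, t, indel=3):
      # the standard O(M*N) dynamic-programming table, with
      # sub_cost() plugged in where the textbook version charges
      # a flat 1 per substitution
      prev = [j * indel for j in range(len(t) + 1)]
      for i, a in enumerate(s, 1):
          cur = [i * indel]
          for j, b in enumerate(t, 1):
              cur.append(min(prev[j] + indel,                # delete a
                             cur[j - 1] + indel,             # insert b
                             prev[j - 1] + sub_cost(a, b)))  # substitute
          prev = cur
      return prev[-1]

Making the insert/delete cost as expensive as the worst substitution
means a vowel-for-vowel slip in "bureaucrat" scores as a near-miss,
while dropping in an unrelated consonant doesn't. It's still the same
O(M*N) table, though, just with pricier per-cell work.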