Olive wrote:

> One feature that seems to be missing in the re module (or any tools that I
> know for searching text) is "diacritical insensitive search". I would like
> to have a match for something like this:
>
> re.match("franc", "français")
>
> in about the same way we can have a case insensitive search:
>
> re.match("(?i)fran", "Français").
>
> Another related and more general problem (in the sense that it could
> easily be used to solve the first problem) would be to translate a string
> removing any diacritical mark:
>
> nodiac("Français") -> "Francais"
>
> The algorithm to write such a function is trivial but there are a lot of
> marks we can put on a letter. It would be necessary to have the list of
> "a"'s with something on it, i.e. "à,á,ã", etc., and this for every letter.
> Trying to make such a list by hand would inevitably lead to some symbols
> being forgotten (and would be tedious).

[Python 3.3]

>>> import unicodedata
>>> unicodedata.normalize("NFKD", "Français").encode("ascii", "ignore").decode()
'Francais'

import sys
from collections import defaultdict
from unicodedata import normalize

# Map each ASCII letter to every character whose NFKD form starts with it.
d = defaultdict(list)
for i in range(sys.maxunicode + 1):  # + 1 so the last code point is included
    c = chr(i)
    n = normalize("NFKD", c)[0]
    if ord(n) < 128 and n.isalpha():  # optional
        d[n].append(c)

for k, v in d.items():
    if len(v) > 1:
        print(k, "".join(v))

See also <http://effbot.org/zone/unicode-convert.htm>

PS: Be warned that experiments on the console may be misleading:

>>> unicodedata.normalize("NFKD", "ç")
'c'
>>> ascii(_)
"'c\\u0327'"

The combining cedilla (U+0327) is still part of the result, even though the
console does not show it as a separate character.
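
For the nodiac() function asked about above, here is a minimal sketch that
keeps non-ASCII characters instead of dropping them with
encode("ascii", "ignore"). The names nodiac() and match_nodiac() are made up
for this example, and it uses NFD plus unicodedata.combining() rather than
the NFKD one-liner, so treat it as one possible approach, not the canonical
one:

```python
import re
import unicodedata


def nodiac(text):
    """Return text with combining marks removed (hypothetical helper)."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns a non-zero class for combining marks.
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Recompose whatever is left so the result is in NFC form.
    return unicodedata.normalize("NFC", stripped)


def match_nodiac(pattern, text, flags=0):
    """Like re.match(), but on stripped strings (hypothetical helper).

    Positions in the returned match object refer to the stripped text.
    """
    return re.match(nodiac(pattern), nodiac(text), flags)


print(nodiac("Français"))
print(bool(match_nodiac("franc", "français")))
print(bool(match_nodiac("(?i)franc", "Français")))
```

Letters such as "ø" or "ł" have no Unicode decomposition, so this sketch
leaves them unchanged; they would still need a hand-written mapping.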