Richie Hindle wrote:
> [Serge]
> > def search_key(s):
> >     de_str = unicodedata.normalize("NFD", s)
> >     return ''.join(cp for cp in de_str if not
> >                    unicodedata.category(cp).startswith('M'))
>
> Lovely bit of code - thanks for posting it!

Well, it is not so good. Please read my next message to Luis.

> You might want to use "NFKD" to normalize things like LATIN SMALL
> LIGATURE FI and subscript/superscript characters as well as diacritics.

IMHO it is perfectly acceptable to declare that you don't interpret those
symbols. After all, they are called *compatibility* code points. I tried
a "quarter" symbol: Google and MSN don't interpret it, and Yahoo doesn't
support it at all.

The NFKD form is also trickier to use. It loses the semantics of
characters. For example, if you have the character "digit two" followed by
"superscript digit two", they look like 2 to the power of 2, but NFKD will
convert them into 22 (twenty-two), which is wrong. So if you want to use
NFKD for search, you will have to preprocess your data, for example by
inserting a space between the twos.
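
To make the "2 power 2" problem concrete, here is a rough sketch rather than a finished solution. It assumes Python 3 strings. The names strip_marks and search_key_nfkd are made up for this illustration, and putting a space in front of each run of superscript/subscript characters is only one possible way to do the preprocessing.

```python
import unicodedata

def strip_marks(s):
    # Drop combining marks (general categories Mn, Mc, Me).
    return ''.join(cp for cp in s
                   if not unicodedata.category(cp).startswith('M'))

def search_key_nfkd(s):
    # Hypothetical helper: put a space in front of each run of
    # superscript/subscript characters so that NFKD does not glue
    # them onto the preceding text.
    pieces = []
    in_script = False
    for ch in s:
        decomp = unicodedata.decomposition(ch)
        is_script = decomp.startswith(('<super>', '<sub>'))
        if is_script and not in_script and pieces:
            pieces.append(' ')
        pieces.append(ch)
        in_script = is_script
    return strip_marks(unicodedata.normalize('NFKD', ''.join(pieces)))

two_squared = '2\u00b2'  # DIGIT TWO followed by SUPERSCRIPT TWO

print(unicodedata.normalize('NFD', two_squared) == two_squared)  # True
print(unicodedata.normalize('NFKD', two_squared))                # 22
print(search_key_nfkd(two_squared))                              # 2 2
```

NFD leaves the superscript alone, plain NFKD produces "22", and the preprocessed version keeps the two digits apart. Whether a space is the right separator depends on how your search tokenizes its input.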