-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Richie Hindle wrote:
> [Serge]
>> def search_key(s):
>>     de_str = unicodedata.normalize("NFD", s)
>>     return ''.join(cp for cp in de_str if not
>>                    unicodedata.category(cp).startswith('M'))
>
> Lovely bit of code - thanks for posting it!
>
> You might want to use "NFKD" to normalize things like LATIN SMALL
> LIGATURE FI and subscript/superscript characters as well as diacritics.
>
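
For anyone comparing the two forms, here is a minimal sketch of the difference described in the quoted suggestion. It assumes Python 3 and text that is already decoded to str; the helper name strip_marks and the sample string are made up for illustration.

```python
import unicodedata

def strip_marks(s, form):
    """Decompose s with the given normalization form, then drop combining marks."""
    decomposed = unicodedata.normalize(form, s)
    return ''.join(ch for ch in decomposed
                   if not unicodedata.category(ch).startswith('M'))

# Sample text: LATIN SMALL LIGATURE FI, an accented e, and SUPERSCRIPT TWO.
sample = "\ufb01n caf\u00e9 x\u00b2"

# NFD only applies canonical decompositions, so the accent is removed
# but the ligature and the superscript are left as they are.
nfd_key = strip_marks(sample, "NFD")

# NFKD also applies compatibility decompositions, so the ligature becomes
# "fi" and the superscript becomes a plain "2".
nfkd_key = strip_marks(sample, "NFKD")
```

Printing both keys with ascii() makes it easy to see which characters survived each form.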
Thank you very much for the information. It's a very good approach.

When I used the "NFD" option, I got many errors on these codes, and
possibly others: \xba, \xc9, \xcd.

I tried "NFKD" instead, and the number of errors dropped to about half
a dozen, out of a universe of 600000+ names, on code \xbf.

It looks like I have to do a search and substitute using regular
expressions for these cases. Or is there a better way to do it?

Luis P. Mendes
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFEYINaHn4UHCY8rB8RAqLKAJ0cN7yRlzJSpmH7jlrWoyhUH1990wCgkxCW
9d7f/FyHXoSfRUrbES0XKvU=
=eAuO
-----END PGP SIGNATURE-----
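
One possible alternative to regular expressions, offered as a sketch rather than a definitive answer: keep the NFKD step and add a small translation table for the few characters that have no Unicode decomposition at all. This assumes Python 3 and that the names are Latin-1 encoded bytes (the post only shows byte values, so the encoding is a guess). Under that assumption \xbf is the inverted question mark, which NFKD cannot decompose, while \xba (the masculine ordinal indicator) decomposes to "o" only under NFKD. The FALLBACK table and the leftovers helper are hypothetical names added for illustration.

```python
import unicodedata

# Hypothetical fallback table for characters that have no Unicode
# decomposition; extend it with whatever else turns up in the data.
FALLBACK = {
    0xBF: None,   # INVERTED QUESTION MARK: drop it
}

def search_key(raw, encoding="latin-1"):
    """Return an accent-free key for a name given as bytes or str."""
    # Assumption: byte input is Latin-1; the post does not state the encoding.
    text = raw.decode(encoding) if isinstance(raw, bytes) else raw
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = ''.join(ch for ch in decomposed
                       if not unicodedata.category(ch).startswith('M'))
    # Apply the explicit substitutions for characters NFKD leaves untouched.
    return stripped.translate(FALLBACK)

def leftovers(key):
    """List any non-ASCII characters still present, for inspection."""
    return sorted({ch for ch in key if ord(ch) > 127})
```

Running leftovers() over all the generated keys would show which characters still need an entry in the table, so the table can grow from the data instead of from guesswork.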