Re: ascii to latin1

Luis P. Mendes Tue, 09 May 2006 05:02:18 -0700

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Richie Hindle wrote:
> [Serge]
>> def search_key(s):
>>     de_str = unicodedata.normalize("NFD", s)
>>     return ''.join(cp for cp in de_str if not
>>                    unicodedata.category(cp).startswith('M'))
> 
> Lovely bit of code - thanks for posting it!
> 
> You might want to use "NFKD" to normalize things like LATIN SMALL
> LIGATURE FI and subscript/superscript characters as well as diacritics.
>


Thank you very much for your info.  It's a very good approach.

When I used the "NFD" form, I got many errors on these codes, and
possibly on others: \xba, \xc9, \xcd.

When I used "NFKD" instead, only about half a dozen errors remained,
out of 600000+ names, all on code \xbf.
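
To see what each form does to those codes, a small sketch may help.  It
assumes Python 3 and that the codes are Latin-1 bytes (neither is
stated above), and strip_marks is only a hypothetical name for the
quoted function with the normalization form made a parameter.

```python
import unicodedata

def strip_marks(s, form="NFD"):
    """Decompose s with the given form, then drop combining marks (category M*)."""
    decomposed = unicodedata.normalize(form, s)
    return "".join(cp for cp in decomposed
                   if not unicodedata.category(cp).startswith("M"))

# Assumption: the problem codes are Latin-1 bytes, so decode them first.
for raw in (b"\xba", b"\xc9", b"\xcd", b"\xbf"):
    ch = raw.decode("latin-1")
    print(unicodedata.name(ch),
          repr(strip_marks(ch, "NFD")),
          repr(strip_marks(ch, "NFKD")))
```

U+00BA (masculine ordinal indicator) has only a compatibility
decomposition, so only "NFKD" turns it into "o".  U+00BF (inverted
question mark) has no decomposition at all, so neither form changes it.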

It looks like I have to do a search and substitute using regular
expressions for these cases.  Or is there a better way to do it?
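
One possible alternative to regular expressions, sketched here under
the assumption of Python 3 and already-decoded strings, is a small
translation table applied after normalization.  The FALLBACK table and
the leftovers helper are hypothetical names, and the entries are only
examples of characters that have no Unicode decomposition.

```python
import unicodedata

# Hypothetical fallback table for characters with no decomposition.
# Mapping to None deletes the character; extend the table as needed.
FALLBACK = str.maketrans({
    "\u00bf": None,   # inverted question mark: drop it
    "\u00df": "ss",   # sharp s
})

def search_key(s):
    """NFKD-decompose s, drop combining marks, then apply the fallback table."""
    decomposed = unicodedata.normalize("NFKD", s)
    stripped = "".join(cp for cp in decomposed
                       if not unicodedata.category(cp).startswith("M"))
    return stripped.translate(FALLBACK)

def leftovers(s):
    """Return the non-ASCII characters that survive search_key, sorted."""
    return sorted({cp for cp in search_key(s) if ord(cp) > 127})
```

Running leftovers over all the names would list the characters the
table does not yet cover, so the table can be grown from the data
instead of guessed in advance.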


Luis P. Mendes
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFEYINaHn4UHCY8rB8RAqLKAJ0cN7yRlzJSpmH7jlrWoyhUH1990wCgkxCW
9d7f/FyHXoSfRUrbES0XKvU=
=eAuO
-----END PGP SIGNATURE-----