[issue4610] Unicode case mappings are incorrect

Jeff Senn Wed, 14 Oct 2009 12:03:20 -0700

Jeff Senn <s...@users.sourceforge.net> added the comment:

> Feel free to upload it here. I'm fairly skeptical that it is
> possible to implement casing "correctly" in a locale-independent
> way.


Ok. I will try to find time to complete it enough to be readable.

The Unicode standard (see section 3.13) specifies the casing of Unicode
strings pretty completely -- i.e. it gives "Default Casing" rules to be
used when no locale-specific "tailoring" is available.  The only
locale dependencies in the special casing rules are for Turkish, Azeri,
and Lithuanian, and you only need to know which language is in use, no
other details.  So I'm sure that a complete implementation is possible
without resorting to a lot of locale munging -- at least for .lower(),
.upper() and .title().
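
To make the "you only need to know the language" point concrete, here is
a rough sketch (assuming Python 3, where str.upper() and str.lower()
already apply the default full case mappings).  The function names and
the `lang` argument are made up for illustration, and only the simple
Turkish/Azeri dotted/dotless i rule is covered; the combining-dot
conditions and the Lithuanian rules are left out.

```python
def upper_tailored(text, lang=None):
    """Uppercase with the default mappings plus a Turkish/Azeri tailoring.

    Hypothetical helper: `lang` is a plain language code such as "tr" or "az".
    """
    if lang in ("tr", "az"):
        # In Turkish and Azeri, i uppercases to dotted capital I (U+0130).
        text = text.replace("i", "\u0130")
    # Everything else follows the locale-independent default casing.
    return text.upper()


def lower_tailored(text, lang=None):
    """Lowercase with the default mappings plus a Turkish/Azeri tailoring."""
    if lang in ("tr", "az"):
        # Dotted capital I (U+0130) lowercases to plain i, and plain
        # capital I lowercases to dotless i (U+0131).
        text = text.replace("\u0130", "i").replace("I", "\u0131")
    return text.lower()


if __name__ == "__main__":
    print(ascii(upper_tailored("istanbul", "tr")))
    print(ascii(upper_tailored("istanbul")))
    print(ascii(lower_tailored("ISI", "tr")))
```

Without a language code both helpers fall straight through to the
default casing, which is the behaviour argued for above.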

.swapcase() is just ...err... dumb^h^h^h^h questionably useful. 

However .capitalize() is a bit weird; and I'm not sure it isn't 
incorrectly implemented now:

It UPPERCASES the first character, rather than TITLECASING, which is 
probably wrong in the very few cases where it makes a difference:
e.g. (using Croatian ligatures)

>>> u'\u01c5amonjna'.title()
u'\u01c4amonjna'
>>> u'\u01c5amonjna'.capitalize()
u'\u01c5amonjna'

"Capitalization" is not precisely defined (by the Unicode standard) -- 
the current Python implementation doesn't even do what the docs say:
"makes the first character have upper case" (it also lower-cases all
other characters!).  However, I might argue that a more useful
implementation "makes the first character have titlecase..."
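
A minimal sketch of that alternative, assuming Python 3 (where calling
str.title() on a single character applies its titlecase mapping).  The
name capitalize_titlecase is hypothetical; this is not the existing
.capitalize() implementation, and it treats the first code point as
"the first character", ignoring combining sequences.

```python
def capitalize_titlecase(text):
    """Titlecase the first character and lowercase the rest."""
    if not text:
        return text
    # Titlecasing a one-character string yields its titlecase form,
    # e.g. U+01C6 (dz-caron digraph, lowercase) becomes U+01C5.
    return text[0].title() + text[1:].lower()


if __name__ == "__main__":
    word = "\u01c6AMONJNA"
    print(ascii(capitalize_titlecase(word)))
    # For contrast: the uppercase form of the same digraph is U+01C4.
    print(ascii(word[0].upper()))
```

The difference only shows up for the handful of characters whose
titlecase and uppercase forms differ, such as the digraphs above.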

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue4610>
_______________________________________