[issue4610] Unicode case mappings are incorrect
Jeff Senn added the comment:

Has there been any action on this? A PEP?

I disagree that using ICU is a good way to simply get proper Unicode casing. (A heavy hammer for a small task...) I agree that locales are a different issue. I would prefer optional arguments to the unicode object's casing methods; any future sort of locale object could then use them to handle casing correctly, but the methods would not rely on such an object.

Most of the special casing rules can be accomplished by a decomposition (or recursive decomposition) of the character followed by casing the result, so NO new table is necessary, only marking up the characters so implicated. (There are extra unused bits in the char type table that could be used for this purpose, so no additional space is needed there either.) What remains is a tiny handful of cases that need to be handled in code.

I have a half-finished implementation of this, in case anyone is interested.

--
nosy: +senn
___
Python tracker <http://bugs.python.org/issue4610>
___
___
Python-bugs-list mailing list
Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
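The decompose-then-case idea can be illustrated with a minimal sketch. This is not the half-finished implementation mentioned in the message; the function name `upper_via_decomposition` is hypothetical, the code uses Python 3 syntax, and it assumes canonical decomposition (NFD) is the decomposition meant. Note that `str.upper()` is only applied to the already-decomposed code points here.

```python
import unicodedata

def upper_via_decomposition(ch):
    """Uppercase one character by decomposing it first (illustrative only)."""
    # NFD applies canonical decomposition recursively.
    decomposed = unicodedata.normalize("NFD", ch)
    # Case each resulting code point on its own; combining marks are
    # caseless and pass through unchanged.
    return "".join(c.upper() for c in decomposed)

# U+01F0 (j with caron) has no single-code-point uppercase form.
print(ascii(upper_via_decomposition("\u01f0")))  # 'J\u030c'
# U+1E96 (h with line below) works the same way.
print(ascii(upper_via_decomposition("\u1e96")))  # 'H\u0331'
```

Characters without a decomposition, such as U+00DF (sharp s), come back from NFD unchanged, so they would belong to the handful of cases handled in code.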
[issue4610] Unicode case mappings are incorrect
Jeff Senn added the comment:

> Feel free to upload it here. I'm fairly skeptical that it is
> possible to implement casing "correctly" in a locale-independent
> way.

OK. I will try to find time to complete it enough to be readable.

Unicode (see sec. 3.13) specifies the casing of Unicode strings pretty completely, i.e. it gives "Default Casing" rules to be used when no locale-specific "tailoring" is available. The only dependencies on locale in the special casing rules are for Turkish, Azeri, and Lithuanian, and you only need to know which language it is, no other details. So I'm sure that a complete implementation is possible without resorting to a lot of locale munging, at least for .lower(), .upper() and .title().

.swapcase() is just ...err... dumb^h^h^h^h questionably useful.

However, .capitalize() is a bit weird, and I'm not sure it isn't incorrectly implemented now: it UPPERCASES the first character rather than TITLECASING it, which is probably wrong in the very few cases where it makes a difference, e.g. (using Croatian ligatures):

>>> u'\u01c5amonjna'.title()
u'\u01c4amonjna'
>>> u'\u01c5amonjna'.capitalize()
u'\u01c5amonjna'

"Capitalization" is not precisely defined by the Unicode standard. The current Python implementation doesn't even do what the docs say, "makes the first character have upper case" (it also lowercases all other characters!). However, I might argue that a more useful implementation "makes the first character have titlecase...".
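The titlecasing variant of capitalization argued for above could look roughly like the following. This is a sketch only: `capitalize_titlecase` is a hypothetical name, the code is written for Python 3, and it keeps the existing behaviour of lowercasing the rest of the string.

```python
def capitalize_titlecase(s):
    """Titlecase the first character and lowercase the rest (illustrative)."""
    # On a one-character string, str.title() applies the titlecase mapping.
    return s[:1].title() + s[1:].lower()

# U+01C6 is the lowercase dz-with-caron digraph; its titlecase form is
# U+01C5, while its uppercase form is U+01C4.
print(ascii(capitalize_titlecase("\u01c6amonjna")))  # '\u01c5amonjna'
```

For the vast majority of characters the titlecase and uppercase mappings coincide, so the two definitions differ only for digraphs like the one shown.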
[issue4610] Unicode case mappings are incorrect
Jeff Senn added the comment:

Yikes! I just noticed that u''.title() is really broken! It doesn't really pay attention to word breaks, only to characters that "have case". So when a word contains (caseless) combining characters, the result is wrong, e.g.

>>> u'n\u0303on\u0303e'.title()
u'N\u0303On\u0303E'

That is (where '~' is a combining tilde over the preceding letter), n~on~e title-cases to N~On~E.
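One way to avoid this is to treat combining marks as part of the current word rather than as a word break. The sketch below assumes that any character in a Unicode "Mark" category (Mn, Mc, Me) continues the word; it is not a full implementation of the Unicode word-boundary rules, the name `title_with_marks` is hypothetical, and the code is Python 3.

```python
import unicodedata

def title_with_marks(s):
    """Title-case s without letting combining marks end a word (illustrative)."""
    out = []
    in_word = False
    for ch in s:
        if unicodedata.category(ch).startswith("M"):
            # Combining marks are caseless but belong to the preceding letter,
            # so they leave the in_word state untouched.
            out.append(ch)
        elif ch.isalpha():
            out.append(ch.lower() if in_word else ch.title())
            in_word = True
        else:
            out.append(ch)
            in_word = False
    return "".join(out)

print(ascii(title_with_marks("n\u0303on\u0303e")))  # 'N\u0303on\u0303e'
```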
[issue6412] Titlecase as defined in Unicode Case Mappings not followed
Jeff Senn added the comment:

I referred to this issue from issue 4610; anyone following this might want to look there as well.

--
nosy: +senn
___
Python tracker <http://bugs.python.org/issue6412>
___
[issue6412] Titlecase as defined in Unicode Case Mappings not followed
Jeff Senn added the comment:

So, is it not considered a bug that:

>>> "This isn't right".title()
"This Isn'T Right"

!?!?!?
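The behaviour follows from title() treating every run of cased letters as a separate word, so the apostrophe starts a new one. A regex-based workaround is sketched below. It is an assumption-laden illustration, not a proposed fix for the issue: the helper name `titlecase_apostrophes` is hypothetical and the pattern only covers ASCII letters with a single embedded apostrophe.

```python
import re

def titlecase_apostrophes(s):
    """Title-case ASCII words, keeping an apostrophe inside a word (illustrative)."""
    # Match a run of letters, optionally followed by an apostrophe and more
    # letters, and capitalize each match as a single word.
    return re.sub(r"[A-Za-z]+('[A-Za-z]+)?",
                  lambda mo: mo.group(0).capitalize(),
                  s)

print(titlecase_apostrophes("This isn't right"))  # This Isn't Right
```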