[issue4610] Unicode case mappings are incorrect

2009-10-13 Thread Jeff Senn

Jeff Senn  added the comment:

Has there been any action on this? a PEP?

I disagree that using ICU is good way to simply get proper
unicode casing. (A heavy hammer for a small task...)

I agree locales are a different issue (and would prefer
optional arguments to the unicode object casing methods -- 
that could then be used within any future sort of locale object 
to handle correct casing -- but don't rely on such.)

Most of the special casing rules can be accomplished by 
a decomposition (or recursive decomposition) on the character
followed by casing the result -- so NO new table is necessary
-- only marking up the characters so implicated (there are
extra unused bits in the char type table that could be used 
for this purpose -- so no additional space needed there either).  

What remains are a tiny handful of cases that need to be handled
in code.

I have a half finished implementation of this, in case anyone
is interested.

--
nosy: +senn

___
Python tracker 
<http://bugs.python.org/issue4610>
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue4610] Unicode case mappings are incorrect

2009-10-14 Thread Jeff Senn

Jeff Senn  added the comment:

> Feel free to upload it here. I'm fairly skeptical that it is
> possible to implement casing "correctly" in a locale-independent
> way.

Ok. I will try to find time to complete it enough to be readable.

Unicode (see sec 3.13) specifies the casing of unicode strings pretty 
completely -- i.e. it gives "Default Casing" rules to be used when no 
locale specific "tailoring" is available.  The only dependencies on 
locale for the special casing rules are for Turkish, Azeri, and 
Lithuanian.  And you only need to know that that is the language, no 
other details.  So I'm sure that a complete implementation is possible 
without resort to a lot of locale munging -- at least for .lower() 
.upper() and .title().

.swapcase() is just ...err... dumb^h^h^h^h questionably useful. 

However .capitalize() is a bit weird; and I'm not sure it isn't 
incorrectly implemented now:

It UPPERCASES the first character, rather than TITLECASING, which is 
probably wrong in the very few cases where it makes a difference:
e.g. (using Croatian ligatures)

>>> u'\u01c5amonjna'.title()
u'\u01c4amonjna'
>>> u'\u01c5amonjna'.capitalize()
u'\u01c5amonjna'

"Capitalization" is not precisely defined (by the Unicode standard) -- 
the currently python implementation doesn't even do what the docs say: 
"makes the first character have upper case" (it also lower-cases all 
other characters!), however I might argue that a more useful 
implementation "makes the first character have titlecase..."

--

___
Python tracker 
<http://bugs.python.org/issue4610>
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue4610] Unicode case mappings are incorrect

2009-10-14 Thread Jeff Senn

Jeff Senn  added the comment:

Yikes! I just noticed that u''.title() is really broken! 

It doesn't really pay attention to word breaks -- 
only characters that "have case".  
Therefore when there are (caseless)
combining characters in a word it's really broken e.g.

>>> u'n\u0303on\u0303e'.title()
u'N\u0303On\u0303E'

That is (where '~' is combining-tilde-over)
n~on~e -title-cases-to-> N~On~E

--

___
Python tracker 
<http://bugs.python.org/issue4610>
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue6412] Titlecase as defined in Unicode Case Mappings not followed

2009-10-14 Thread Jeff Senn

Jeff Senn  added the comment:

Referred to this from issue 4610... anyone following this might want to 
look there as well.

--
nosy: +senn

___
Python tracker 
<http://bugs.python.org/issue6412>
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue6412] Titlecase as defined in Unicode Case Mappings not followed

2009-10-14 Thread Jeff Senn

Jeff Senn  added the comment:

So, is it not considered a bug that:

>>> "This isn't right".title()
"This Isn'T Right"

!?!?!?

--

___
Python tracker 
<http://bugs.python.org/issue6412>
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com