[issue6412] Titlecase as defined in Unicode Case Mappings not followed

Terry J. Reedy Wed, 04 Aug 2010 10:49:37 -0700

Terry J. Reedy <[email protected]> added the comment:

Christoph's reply above responds to a previous version of this message, which 
drew an erroneous conclusion from a misreading of his original message.


The proposed patch makes this issue overlap #7008, which had some contentious 
discussion, so I am adding some people from that to this nosy list so they may 
opine here. Otherwise starting over:

3.1 has the same bug.

3.1.2
>>> 'H\u0301ngh'.istitle()
False
>>> 'H\u0301ngh'=='H\u0301ngh'.title()
False
>>> 'H\u0301ngh'.title()
'H́Ngh' # in IDLE, the accent is over the H

The problem is that .title() treats the combining acute accent '\u0301', which 
looks like an apostrophe, as if it were an apostrophe "'". Apostrophes are 
documented as forming word boundaries, as in

>>> "De'souza".title()
"De'Souza"
>>> "O'brian".title()
"O'Brian"

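To make the distinction concrete, here is a minimal sketch (not the patch
attached to this issue) of a titlecasing loop that passes combining marks
through without ending the current word. The helper name title_keeping_marks
is hypothetical, and the "letter" test is simply str.isalpha(), which is an
assumption and not the Unicode word-breaking algorithm.

```python
import unicodedata

# U+0301 is a combining mark, not a letter, so a "groups of consecutive
# letters" rule sees it as a break between "H" and "ngh".
unicodedata.name("\u0301")      # 'COMBINING ACUTE ACCENT'
unicodedata.category("\u0301")  # 'Mn'


def title_keeping_marks(text):
    """Hypothetical helper: titlecase text, keeping combining marks in-word."""
    out = []
    in_word = False
    for ch in text:
        if unicodedata.category(ch).startswith("M"):
            # Combining mark: emit it and leave the word state unchanged.
            out.append(ch)
        elif ch.isalpha():
            # First letter of a word is titlecased, the rest lowercased.
            out.append(ch.lower() if in_word else ch.title())
            in_word = True
        else:
            # Anything else, including "'", still ends the word as documented.
            out.append(ch)
            in_word = False
    return "".join(out)
```

With this sketch, title_keeping_marks('H\u0301ngh') leaves the string unchanged,
while "de'souza" still becomes "De'Souza".
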
Here is the beginning of the 3.1.2 title() doc:
"str.title() 
Return a titlecased version of the string where words start with an uppercase 
character and the remaining characters are lowercase.

The algorithm uses a simple language-independent definition of a word as groups 
of consecutive letters. The definition works in many contexts but it means that 
apostrophes in contractions and possessives form word boundaries, which may not 
be the desired result:"

That means that

>>> "This Isn'T Right".istitle()
True

is correct as documented.
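
For callers who do not want the capital after the apostrophe, one possible
workaround is a small regular-expression helper. This is only a sketch: the
name titlecase_apostrophes is hypothetical and the pattern handles ASCII
letters only. It also shows the trade-off, since names like De'Souza lose
their second capital.

```python
import re


def titlecase_apostrophes(s):
    """Hypothetical helper: keep an apostrophe plus letters inside one word."""
    # Match a run of ASCII letters, optionally followed by "'" and more
    # letters, and capitalize the whole match as a single word.
    return re.sub(r"[A-Za-z]+('[A-Za-z]+)?",
                  lambda mo: mo.group(0).capitalize(), s)


titlecase_apostrophes("this isn't right")  # "This Isn't Right"
titlecase_apostrophes("de'souza")          # "De'souza", not "De'Souza"
```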

I interpret the conclusion of #7008, based on Guido's msg93242, as saying that 
this should be left alone. But I interpret previous messages and the test in 
unicodeobject.titlecase.3.diff as saying this would become False. Such a 
change would badly affect the prior examples where the capital after the ' *is* 
wanted. That is why that change was rejected in #7008. So I think ' should be 
removed from the current patch. I do not know about the other chars that are 
hard-coded.

With or without that, there is the issue of whether the current behavior really 
contradicts the somewhat vague doc and whether a change would break enough code 
that this issue should be treated as a feature change for 3.2 only.

Reading this from msg93265 
"As I said, the patch is only a second best solution, as the correct
path would be implementing the word breaking algorithm as described in
the newest standard. This patch is just an improvement over the current
situation."
makes me wonder whether .title and .istitle should be left alone until the 
right solution is implemented.

----------
nosy: +pitrou, r.david.murray, rhettinger
versions: +Python 3.1, Python 3.2

_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue6412>
_______________________________________