[issue12737] string.title() is overzealous by upcasing combining marks inappropriately

Tom Christiansen Thu, 11 Aug 2011 15:40:09 -0700

New submission from Tom Christiansen <tchr...@perl.com>:

Python's string.title() function claims it titlecases the first letter in each 
word and lowercases the rest.  However, this is not true.  It is not using 
either of the two word detection algorithms that Unicode provides.  One allows 
you to use a legacy \w+, where \w means any Alphabetic, Mark, Decimal Number, 
or Connector Punctuation (see UTS#18 Annex C: Compatibility Properties), and 
the other uses the more sophisticated word-break provided by the Word_Break 
properties such as Word_Break=MidNumLet.

Python uses neither of these, so it gets the wrong answer.
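
To make the first of those definitions concrete, here is a minimal sketch of
titlecasing over legacy \w+ runs. It is not what Python does and not a
proposed patch. The helper names is_word_char and title_words are made up for
illustration, and because unicodedata does not expose the Alphabetic property,
the general categories L*, M*, Nd, Nl and Pc are used as an approximation.

```python
import unicodedata
from itertools import groupby


def is_word_char(ch):
    """Approximate the UTS#18 legacy \\w: letters, marks, decimal digits,
    connector punctuation. Nl is included as a rough stand-in for the
    non-letter part of Alphabetic (an assumption, not the exact property)."""
    cat = unicodedata.category(ch)
    return cat[0] in ("L", "M") or cat in ("Nd", "Nl", "Pc")


def title_words(text):
    """Titlecase the first character of each \\w+ run, lowercase the rest."""
    pieces = []
    for is_word, run in groupby(text, key=is_word_char):
        chunk = "".join(run)
        if is_word:
            # Marks belong to the run, so they never start a new word.
            # Context-sensitive lowercasing is only as good as str.lower()
            # applied to the tail of the run.
            chunk = chunk[0].title() + chunk[1:].lower()
        pieces.append(chunk)
    return "".join(pieces)


if __name__ == "__main__":
    # NFD inputs, written with escapes so the combining marks are visible.
    for sample in ("de\u0301me un cafe\u0301", "i\u0307stanbul"):
        print(ascii(sample), "->", ascii(title_words(sample)))
```

Under these assumptions a combining mark stays inside its word, so NFD input
is split into the same words as NFC input. Apostrophes still split words here;
handling those is what the Word_Break rules (MidNumLet and friends) are for.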

titlecase of déme un café should be Déme Un Café not DéMe Un Café
titlecase of i̇stanbul should be İstanbul not İStanbul
titlecase of ᾲ στο διάολο should be Ὰͅ Στο Διάολο not ᾺΙ Στο ΔιάΟλο

Because those are in NFD form, you get different answers than if they were in
NFC. That is not right: canonically equivalent strings should give the same
answer. The bug is that title() is not using the right definition of \w, so a
combining mark is treated as a word break and the letter after it is upcased.
This is likely related to issue 12731.
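
A small way to check the NFC/NFD claim yourself, separate from the attached
tester: the sketch below titlecases the same text in both normalization forms
and compares the results after renormalizing. The function name
title_in_both_forms is made up for illustration, and what it reports depends
on the Python version you run it under.

```python
import unicodedata


def title_in_both_forms(text):
    """Return (title() of the NFC form, title() of the NFD form), with the
    second result renormalized to NFC so the two can be compared directly."""
    nfc = unicodedata.normalize("NFC", text)
    nfd = unicodedata.normalize("NFD", text)
    return nfc.title(), unicodedata.normalize("NFC", nfd.title())


if __name__ == "__main__":
    samples = (
        "d\u00e9me un caf\u00e9",   # déme un café
        "i\u0307stanbul",           # i + COMBINING DOT ABOVE
    )
    for sample in samples:
        from_nfc, from_nfd = title_in_both_forms(sample)
        # If title() respected canonical equivalence, this would be True.
        print(ascii(sample), "same result:", from_nfc == from_nfd)
```

Note that i + COMBINING DOT ABOVE has no precomposed lowercase form, so that
sample contains a combining mark in NFC as well.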

In the enclosed tester file, which fails 4 out of its 6 tests, there is also a
bug shown with this failed result:

titlecase of 𐐼𐐯𐑅𐐨𐑉𐐯𐐻 should be 𐐔𐐯𐑅𐐨𐑉𐐯𐐻 not 𐐼𐐯𐑅𐐨𐑉𐐯𐐻

That one is related to issue 12730.

See the attached tester, which was run under Python 3.2. As far as I can tell,
these bugs exist in all Python versions.

----------
files: titletest.python
messages: 141929
nosy: tchrist
priority: normal
severity: normal
status: open
title: string.title() is overzealous by upcasing combining marks inappropriately
type: behavior
versions: Python 3.2
Added file: http://bugs.python.org/file22884/titletest.python

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue12737>
_______________________________________