New submission from Tom Christiansen <tchr...@perl.com>:

Python's string.title() function claims it titlecases the first letter in each word and lowercases the rest. However, this is not true. It is not using either of the two word detection algorithms that Unicode provides. One allows you to use a legacy \w+, where \w means any Alphabetic, Mark, Decimal Number, or Connector Punctuation (see UTS#18 Annex C: Compatibility Properties), and the other uses the more sophisticated word-break provided by the Word_Break properties such as Word_Break=MidNumLet.

Python is using neither of these, so it gets the wrong answer.

    titlecase of déme un café should be Déme Un Café not DéMe Un Café
    titlecase of i̇stanbul should be İstanbul not İStanbul
    titlecase of ᾲ στο διάολο should be Ὰͅ Στο Διάολο not ᾺΙ Στο ΔιάΟλο

Because those are in NFD form, you get different answers than if they are in NFC. That is not right. You should get the same answer. The bug is that you aren't using the right definition for \w, and so get screwed up. This is likely related to issue 12731.

In the enclosed tester file, which fails 4 out of its 6 tests, there is also a bug shown with this failed result:

    titlecase of 𐐼𐐯𐑅𐐨𐑉𐐯𐐻 should be 𐐔𐐯𐑅𐐨𐑉𐐯𐐻 not 𐐼𐐯𐑅𐐨𐑉𐐯𐐻

That one is related to issue 12730. See the attached tester, which was run under Python 3.2. As far as I can tell, these bugs exist in all Python versions.

----------
files: titletest.python
messages: 141929
nosy: tchrist
priority: normal
severity: normal
status: open
title: string.title() is overzealous by upcasing combining marks inappropriately
type: behavior
versions: Python 3.2
Added file: http://bugs.python.org/file22884/titletest.python

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue12737>
_______________________________________
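
For illustration only, here is a minimal sketch of the first approach mentioned in the report, the legacy \w+ definition in which combining marks count as part of a word. It is not the attached tester and not a proposed patch. The names is_word_char, title_words, and same_after_normalization are hypothetical, and the Alphabetic property is only approximated with the letter and letter-number general categories, since the unicodedata module does not expose Alphabetic directly.

```python
import unicodedata


def is_word_char(ch):
    # Approximation of the UTS#18 compatibility notion of a word character:
    # letters and letter numbers stand in for Alphabetic (an assumption,
    # the full property is not available from unicodedata), plus any Mark,
    # Decimal Number, or Connector Punctuation.
    cat = unicodedata.category(ch)
    return cat[0] in "LM" or cat in ("Nl", "Nd", "Pc")


def title_words(text):
    # Hypothetical helper: titlecase the first character of each run of
    # word characters and lowercase the remaining characters of the run.
    out = []
    in_word = False
    for ch in text:
        if is_word_char(ch):
            out.append(ch.lower() if in_word else ch.title())
            in_word = True
        else:
            out.append(ch)
            in_word = False
    return "".join(out)


def same_after_normalization(func, text):
    # True if func gives canonically equivalent results for the NFC and
    # NFD spellings of the same text.
    nfc = unicodedata.normalize("NFC", text)
    nfd = unicodedata.normalize("NFD", text)
    return (unicodedata.normalize("NFC", func(nfc))
            == unicodedata.normalize("NFC", func(nfd)))


sample = "de\u0301me un cafe\u0301"  # NFD spelling of the first example
print(title_words(sample) == "De\u0301me Un Cafe\u0301")
print(same_after_normalization(title_words, sample))
```

Because a combining mark is treated as a word character here, it does not end the word, so the letter after it is lowercased instead of titlecased. The same_after_normalization check expresses the report's requirement that NFC and NFD input should give the same answer; it can be pointed at any titlecasing function, including str.title, to compare behavior on a given Python version.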