Christoph Burgmer <cburg...@ira.uka.de> added the comment:

> * U+0027 APOSTROPHE hardcoded (see below)
> * U+00AD SOFT HYPHEN (SHY) has the "Format (Cf)" property and thus is included automatically
> * U+2019 RIGHT SINGLE QUOTATION MARK hardcoded (see below)

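As a rough illustration of the three quoted cases (a sketch only, not the code from the patch): a character counts as included if it is in a hardcoded set or if its general category is Cf. The names HARDCODED and is_included are hypothetical, and the set holds only the two hardcoded characters named in the quote.

```python
import unicodedata

# Hypothetical subset for illustration: only the two hardcoded characters
# named in the quoted list, not the full set used in the patch.
HARDCODED = {"\u0027", "\u2019"}


def is_included(ch):
    """Return True if ch is hardcoded or has general category Cf (Format)."""
    return ch in HARDCODED or unicodedata.category(ch) == "Cf"


# Show the three quoted characters plus one that should not be included.
for ch in ("\u0027", "\u00ad", "\u2019", "-"):
    print("U+%04X %s: %s" % (ord(ch), unicodedata.name(ch), is_included(ch)))
```

U+00AD needs no entry in the set because unicodedata already reports it as Cf, which is what "included automatically" refers to.
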
I hardcoded some characters into Tools/unicode/makeunicodedata.py:

>>> print ' '.join([u':', u'\xb7', u'\u0387', u'\u05f4', u'\u2027', u'\ufe13', u'\ufe55', u'\uff1a'] + [u"'", u'.', u'\u2018', u'\u2019', u'\u2024', u'\ufe52', u'\uff07', u'\uff0e'])
: · · ״ ‧ ︓ ﹕ ： ' . ‘ ’ ․ ﹒ ＇ ．

These cannot currently be extracted automatically, because neither DerivedCoreProperties.txt nor the source file for the property "Word_Break(C) = MidLetter or MidNumLet" is provided to the script.

As I said, the patch is only a second-best solution; the correct approach would be to implement the word breaking algorithm as described in the newest standard. Still, the patch improves on the current situation.

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue6412>
_______________________________________