Christoph Burgmer <cburg...@ira.uka.de> added the comment:

> * U+0027 APOSTROPHE
hardcoded (see below)
> * U+00AD SOFT HYPHEN (SHY)
has the "Format (Cf)" property and thus is included automatically
> * U+2019 RIGHT SINGLE QUOTATION MARK
hardcoded (see below)
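
The categories of the three characters quoted above can be checked with the standard unicodedata module. This is only a minimal sketch, written for Python 3; the variable names are illustrative and not part of the patch.

```python
import unicodedata

# The three characters from the quoted list.
chars = ["\u0027", "\u00ad", "\u2019"]

# Map each code point to its general category, e.g. "Cf" for Format.
categories = {"U+%04X" % ord(c): unicodedata.category(c) for c in chars}

for code_point, category in categories.items():
    print(code_point, category)
```

Only U+00AD falls into the Format (Cf) category; the other two are punctuation, which is why they need separate handling.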

I hardcoded some characters into Tools/unicode/makeunicodedata.py:
>>> print ' '.join([u':', u'\xb7', u'\u0387', u'\u05f4', u'\u2027',
...     u'\ufe13', u'\ufe55', u'\uff1a'] + [u"'", u'.', u'\u2018', u'\u2019',
...     u'\u2024', u'\ufe52', u'\uff07', u'\uff0e'])
: · · ״ ‧ ︓ ﹕ : ' . ‘ ’ ․ ﹒ ' .
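
To make the hardcoded list easier to review, the characters can be printed with their code points and official names. The sketch below is for Python 3. The split into two lists, and the names mid_letter and mid_num_let, assume that the two halves of the expression correspond to the MidLetter and MidNumLet values; that reading is an assumption, not something taken from the patch itself.

```python
import unicodedata

# Assumed grouping: first half MidLetter, second half MidNumLet.
mid_letter = [":", "\xb7", "\u0387", "\u05f4", "\u2027",
              "\ufe13", "\ufe55", "\uff1a"]
mid_num_let = ["'", ".", "\u2018", "\u2019",
               "\u2024", "\ufe52", "\uff07", "\uff0e"]


def describe(chars):
    """Return (code point, character name) pairs for the given characters."""
    return [("U+%04X" % ord(c), unicodedata.name(c)) for c in chars]


for code_point, name in describe(mid_letter) + describe(mid_num_let):
    print(code_point, name)
```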

These characters cannot currently be extracted automatically, because
neither DerivedCoreProperties.txt nor the source file for the property
"Word_Break(C) = MidLetter or MidNumLet" is available to the script.
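
If such a property file were made available to the script, the characters could be collected instead of hardcoded. The following is a rough sketch in Python 3, assuming the usual Unicode Character Database layout of "code point or range ; value # comment". The function name parse_property_lines and the sample lines are hypothetical; the sample is illustrative and not copied from a real data file.

```python
def parse_property_lines(lines, wanted):
    """Collect code points whose property value is in `wanted`."""
    result = set()
    for line in lines:
        # Drop trailing comments and surrounding whitespace.
        data = line.split("#", 1)[0].strip()
        if not data:
            continue
        fields = [field.strip() for field in data.split(";")]
        if len(fields) < 2 or fields[1] not in wanted:
            continue
        # The first field is a single code point or a "first..last" range.
        first, _, last = fields[0].partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        result.update(range(start, end + 1))
    return result


# Illustrative sample only (assumed layout, not real file contents).
sample = """\
# comment line
0027          ; MidNumLet # APOSTROPHE
003A          ; MidLetter # COLON
0041..005A    ; ALetter   # LATIN CAPITAL LETTER A..Z
2018..2019    ; MidNumLet # LEFT/RIGHT SINGLE QUOTATION MARK
"""

code_points = parse_property_lines(sample.splitlines(),
                                   {"MidLetter", "MidNumLet"})
print(sorted("U+%04X" % cp for cp in code_points))
```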

As I said, the patch is only a second-best solution: the correct
approach would be to implement the word breaking algorithm as described
in the newest standard. Still, the patch is an improvement over the
current situation.
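
To illustrate what the full algorithm would add, here is a heavily simplified sketch in Python 3 of the idea behind rules WB6 and WB7 of Unicode Standard Annex #29: no break between two letters that are separated by a single MidLetter or MidNumLet character. It is not the patch and not a complete implementation. It assumes str.isalpha() as a stand-in for the ALetter property, reuses the hardcoded characters as the set MID, and ignores all other rules; split_words is a hypothetical helper name.

```python
# Characters treated as MidLetter or MidNumLet (the hardcoded list).
MID = set(":\xb7\u0387\u05f4\u2027\ufe13\ufe55\uff1a"
          "'.\u2018\u2019\u2024\ufe52\uff07\uff0e")


def split_words(text):
    """Split text into words, keeping letter + MID + letter together."""
    words, current = [], []
    for i, ch in enumerate(text):
        if ch.isalpha():
            current.append(ch)
        elif (ch in MID and current and current[-1].isalpha()
              and i + 1 < len(text) and text[i + 1].isalpha()):
            # A MID character between two letters does not end the word.
            current.append(ch)
        else:
            # Anything else ends the current word, if there is one.
            if current:
                words.append("".join(current))
                current = []
    if current:
        words.append("".join(current))
    return words


print(split_words("they're here. 'quoted' a.b"))
```

With this approach "they're" stays one word, while a leading or trailing apostrophe or full stop still ends the word.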

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue6412>
_______________________________________