Hi all,

I'd like to ask about a surprising possibility I found while investigating the new Unicode 6.0 standard for use in Python. As the Python 2 series won't be updated in this regard ( http://bugs.python.org/issue10400 ), I tried my "poor man's approach" of compiling the needed pyd file with the recent Unicode data (cf. the older post http://mail.python.org/pipermail/python-list/2010-March/1240002.html ).

While checking the changed format, I found, to my big surprise, that it is possible to generate the header files using the py3 makeunicodedata.py, which has already been updated for Unicode 6.0. This is even much more comfortable than with the previous versions, as the needed data are downloaded automatically.
http://svn.python.org/view/python/branches/py3k/Tools/unicode/makeunicodedata.py?view=markup&pathrev=85371

It turned out that the resulting headers are accepted by MS Visual C++ Express along with the py2.7 source files, and that the generated unicodedata.pyd seems to be working, at least in the cases I tested so far.
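A spot check along these lines could be sketched roughly as follows. The helper name probe and the choice of sample code points are only illustrative assumptions, not anything provided by unicodedata itself; the only library calls used are unicodedata.category, unicodedata.name (with a default value) and unicodedata.unidata_version.

```python
import unicodedata


def probe(ch):
    """Return (category, name) for one character.

    name is None when the database has no name for the character.
    (Hypothetical helper, for illustration only.)
    """
    return unicodedata.category(ch), unicodedata.name(ch, None)


# Sample code points (an arbitrary selection for illustration).
samples = [
    u"\U0002A700",  # start of CJK Unified Ideographs Extension C
    u"\U0002B740",  # start of CJK Unified Ideographs Extension D
]

print("Unicode data version: %s" % unicodedata.unidata_version)
for ch in samples:
    category, name = probe(ch)
    # Show the character as an ASCII escape so the output is console-safe.
    label = ch.encode("unicode_escape").decode("ascii")
    print("%s: category=%s name=%s" % (label, category, name))
```

Passing a default to unicodedata.name avoids the ValueError for unnamed characters, so a missing name shows up as None instead of stopping the check.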
Is it intended, or even guaranteed, that these generated files are compatible across py2.7 and py3, or am I going to be bitten by some less obvious issues later?

The newly added ranges and characters are available; only in CJK Unified Ideographs Extension D are the character names not present (while the categories are), but this appears to be the same in the original unicodedata with 5.2 for CJK Unified Ideographs Extension C.

>>> unicodedata.unidata_version
'6.0.0'
>>> unicodedata.name(u"\U0002B740") # 0x2B740-0x2B81F; CJK Unified Ideographs Extension D # unicode 6.0 addition
Traceback (most recent call last):
  File "<input>", line 1, in <module>
ValueError: no such name
>>> unicodedata.category(u"\U0002B740")
'Lo'
>>> ###########################
>>> unicodedata.unidata_version
'5.2.0'
>>> unicodedata.name(u"\U0002A700") # 0x2A700-0x2B73F; CJK Unified Ideographs Extension C
Traceback (most recent call last):
  File "<input>", line 1, in <module>
ValueError: no such name
>>> unicodedata.category(u"\U0002A700")
'Lo'

Could anybody please confirm whether this way of updating the unicodedata for 2.7 is generally viable, or point out possible problems it may lead to?

Many thanks in advance,
Vlastimil Brom