2011/6/22 Saul Spatz <saul.sp...@gmail.com>:
> Thanks. I agree with you about the generator. Using your first suggestion,
> code points above U+FFFF get separated into two "surrogate pair" characters
> from UTF-16. So instead of U+10FFFF I get U+DBFF and U+DFFF.
> --
> http://mail.python.org/mailman/listinfo/python-list

Hi,
If you really need wide unicode functionality on a narrow unicode Python build, and you only need string indices in which a surrogate pair counts as one item, you can build a list of single characters or surrogate pairs, e.g.:
>>> surrog_txt = u"a𐌰 𐌱 𐌲 𐌳"
>>> surrog_txt
u'a\U00010330 \U00010331 \U00010332 \U00010333'
>>> print surrog_txt
a𐌰 𐌱 𐌲 𐌳
>>> list(surrog_txt)
[u'a', u'\ud800', u'\udf30', u' ', u'\ud800', u'\udf31', u' ', u'\ud800', u'\udf32', u' ', u'\ud800', u'\udf33']
>>> import re
>>> re.findall(ur"(?s)(?:[\ud800-\udbff][\udc00-\udfff])|.", surrog_txt)
[u'a', u'\U00010330', u' ', u'\U00010331', u' ', u'\U00010332', u' ', u'\U00010333']
>>>

This way, indexing, slicing and len() work on the resulting list as you would expect for a normal string; however, it probably won't be very efficient for longer texts.

Note that surrogates are not the only mismatch between code points, characters and glyphs (to take the visual representation into account): there are also combining diacritical marks, in various combinations with precomposed characters with diacritics, multiple normalisation forms, etc.

regards,
vbr
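P.S. To illustrate the point about combining marks, here is a minimal sketch using the standard unicodedata module. The helper name group_combining is made up for this example, and the grouping rule (attach every character with a non-zero canonical combining class to the preceding item) is a simplification, not full grapheme segmentation. It assumes a build where iterating over a string yields whole code points; on a narrow build you would first group the surrogate pairs as shown above.

```python
import unicodedata

# Two spellings of the same visible character "é":
decomposed = u"e\u0301"    # 'e' followed by COMBINING ACUTE ACCENT
precomposed = u"\u00e9"    # LATIN SMALL LETTER E WITH ACUTE

# They differ in length and compare unequal until normalised.
assert len(decomposed) == 2 and len(precomposed) == 1
assert decomposed != precomposed
assert unicodedata.normalize("NFC", decomposed) == precomposed
assert unicodedata.normalize("NFD", precomposed) == decomposed


def group_combining(text):
    """Return a list in which combining marks stay attached to their base.

    Hypothetical helper: a character counts as a combining mark if
    unicodedata.combining() returns a non-zero class for it.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch      # attach the mark to the previous item
        else:
            clusters.append(ch)     # start a new item
    return clusters


items = group_combining(u"e\u0301a")
# Indexing and len() now treat "e + accent" as a single item.
assert len(items) == 2
assert items[0] == decomposed
```

As with the surrogate list, this trades efficiency for convenient indexing, so normalising to NFC first may be enough when precomposed forms exist for your text.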