In mid-October 2004, Jeff Epler helped me here with this string iterator:

def chars(s):
    """
    This generator function helps iterate over the characters in a
    string. When the string is unicode and a surrogate pair is
    encountered, the pair is returned together, regardless of whether
    Python was built with UCS-4 ('wide') or UCS-2 code values for its
    internal representation of unicode. This function will raise a
    ValueError if it detects an illegal surrogate pair.
    """
    if isinstance(s, str):
        for i in s:
            yield i
        return
    s = iter(s)
    for i in s:
        if u'\ud800' <= i < u'\udc00':
            try:
                j = s.next()
            except StopIteration:
                raise ValueError("Bad pair: string ends after %r" % i)
            if u'\udc00' <= j < u'\ue000':
                yield i + j
            else:
                raise ValueError("Bad pair: %r (bad second half)" % (i+j))
        elif u'\udc00' <= i < u'\ue000':
            raise ValueError("Bad pair: %r (no first half)" % i)
        else:
            yield i

I have since discovered that I can't use it on Python 2.2 on Windows because of some weird module import bug caused by the surrogate code values expressed in the Python code as u'\ud800' and u'\udc00' -- apparently the string literals are being coerced to UTF-8 internally, which results in an invalid byte sequence upon import of the module containing this function.

A simpler test case demonstrates the symptom:

    C:\dev\test>echo x = u'\ud800' > testd800.py

    C:\dev\test>cat testd800.py
    x = u'\ud800'

    C:\dev\test>python -c "import testd800"

    C:\dev\test>python -c "import testd800"
    Traceback (most recent call last):
      File "<string>", line 1, in ?
    UnicodeError: UTF-8 decoding error: unexpected code byte

    C:\dev\test>python testd800.py

    C:\dev\test>python testd800.py

Very strange how it only shows up after the 1st import attempt seems to succeed, and it doesn't ever show up if I run the code directly or run the code in the command-line interpreter.

The error does not occur with u'\ud800\udc00' or u'\ue000' or any other valid sequence.

In my function I can use "if u'\ud7ff' > i ..." to work around the d800 case, but I can't use the same trick for the dc00 case. I will have to go back to calling ord(i) and comparing against integers. IIRC the explicit ord() call slowed things down a bit, though, so I'd like to avoid it if I can.

Can anyone tell me what's causing this, or point me to a reference to show when it was fixed? I'm using 2.2.1 and I couldn't find mention of it in any release notes up through 2.3.

Any other comments/suggestions (besides "stop supporting narrow unicode builds of Py 2.2") would be appreciated, too.

Thanks :)

-Mike
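
For reference, the ord()-based fallback mentioned above might look roughly like the sketch below. It keeps surrogate literals out of the source entirely by comparing integer code values. This is only an illustration under a few assumptions: it is written for a current Python 3 interpreter rather than 2.2 (so it uses next(it) where the original uses s.next()), it handles text strings only and omits the byte-string fast path, and the name chars_ord and the three constants are made up here.

```python
# Surrogate range boundaries as plain integers, so that no surrogate
# literal such as u'\ud800' ever appears in the module source.
# (These constant names are hypothetical, not from the original post.)
HIGH_START = 0xD800   # first high (leading) surrogate
LOW_START = 0xDC00    # first low (trailing) surrogate
LOW_END = 0xE000      # one past the last low surrogate


def chars_ord(s):
    """Yield the characters of text string s, keeping surrogate pairs together.

    Hypothetical variant of chars() that compares ord() values
    instead of comparing against surrogate string literals.
    """
    it = iter(s)
    for i in it:
        c = ord(i)
        if HIGH_START <= c < LOW_START:
            # A high surrogate must be followed by a low surrogate.
            try:
                j = next(it)
            except StopIteration:
                raise ValueError("Bad pair: string ends after %r" % i)
            if LOW_START <= ord(j) < LOW_END:
                yield i + j
            else:
                raise ValueError("Bad pair: %r (bad second half)" % (i + j))
        elif LOW_START <= c < LOW_END:
            # A low surrogate with no preceding high surrogate.
            raise ValueError("Bad pair: %r (no first half)" % i)
        else:
            yield i
```

The trade-off is the one already noted: each character costs an extra ord() call, so this may be somewhat slower than the literal comparisons.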