> Very strange how it only shows up after the 1st import attempt seems to succeed, and it doesn't ever show up if I run the code directly or run the code in the command-line interpreter.
The reason for that is that the Python byte code stores the Unicode literal in UTF-8. The first time, the byte code is generated, and an unpaired surrogate is written to disk. The next time, the compiled byte code is read back in, and the codec complains about the unpaired surrogate.
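To make the write-then-read asymmetry concrete, here is a minimal sketch. It is written for Python 3 rather than the 2.2 interpreter under discussion, and it uses the "surrogatepass" error handler as a stand-in for a lenient writer; the helper name strict_decode_fails is made up for the illustration. It shows the general mechanism only, not the actual byte code loader.

```python
# Python 3 illustration (assumption: the thread itself concerns Python 2.2).
# A lone surrogate is built at run time, so no surrogate literal is involved.
lone = chr(0xD800)  # unpaired high surrogate

# "surrogatepass" stands in for a writer that emits the surrogate as-is.
raw = lone.encode("utf-8", "surrogatepass")


def strict_decode_fails(data):
    """Return True if a strict UTF-8 decoder rejects the given bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


# Writing succeeded above; reading the same bytes back strictly does not.
print(repr(raw), strict_decode_fails(raw))
```

The lenient write goes through, and the failure only appears when the bytes are read back with a strict decoder, which matches the "second import" symptom.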
> Can anyone tell me what's causing this, or point me to a reference to show when it was fixed?
In Misc/NEWS, we have, for 2.3a1:
- The UTF-8 codec will now encode and decode Unicode surrogates correctly and without raising exceptions for unpaired ones.
Essentially, Python now allows surrogates to occur in UTF-8 encodings.
> I'm using 2.2.1 and I couldn't find mention of it in any release notes up through 2.3. Any other comments/suggestions (besides "stop supporting narrow unicode builds of Py 2.2") would be appreciated, too. Thanks :)
I see two options. One is to compile the code with exec, avoiding byte code generation. Put
exec """
before the code, and
"""
after it. The other option is to use variables instead of literals:
    surr1 = unichr(0xd800)
    surr2 = unichr(0xdc00)
    surr3 = unichr(0xe000)

    def chars(s, surr1=surr1, surr2=surr2, surr3=surr3):
        ...
        if surr1 <= i < surr2:
            ...
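As a self-contained sketch of the same idea (boundaries built at run time, so no surrogate literal ends up in the compiled file), something like the following could work. The names HIGH_START, LOW_START, SURR_END and classify are hypothetical, and the code is written for Python 3; on Python 2 you would use unichr in place of chr.

```python
# Range boundaries are computed at run time instead of written as literals.
# All names here are made up for the illustration.
HIGH_START = chr(0xD800)  # first high (leading) surrogate
LOW_START = chr(0xDC00)   # first low (trailing) surrogate
SURR_END = chr(0xE000)    # first code point past the surrogate range


def classify(ch, high=HIGH_START, low=LOW_START, end=SURR_END):
    """Classify a single character as 'high', 'low', or 'other'."""
    if high <= ch < low:
        return "high"
    if low <= ch < end:
        return "low"
    return "other"


print(classify(chr(0xD800)), classify(chr(0xDC00)), classify("a"))
```

Binding the boundaries as default arguments, as in the snippet above, keeps the lookups local to the function.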
I would personally go with "stop supporting Py 2.2". Unless you have the time machine, you can't fix the bugs in old Python releases, and it is a waste of time (IMO) to uglify the code just to work around limitations in older interpreter versions.
Regards,
Martin