Re: unicode surrogates in py2.2/win

"Martin v. Löwis" Tue, 08 Mar 2005 00:20:05 -0800

Mike Brown wrote:

> Very strange how it only shows up after the 1st import attempt seems
> to succeed, and it doesn't ever show up if I run the code directly or
> run the code in the command-line interpreter.


The reason for that is that the Python byte code stores the Unicode
literal in UTF-8. The first time, the byte code is generated, and an
unpaired surrogate is written to disk. The next time, the compiled byte
code is read back in, and the codec complains about the unpaired
surrogate.
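
The write-then-read failure can be imitated on a current interpreter. This is only a sketch in Python 3, not a reproduction of the 2.2 byte code machinery: Python 3's strict UTF-8 codec rejects lone surrogates, and the "surrogatepass" error handler is used here as a stand-in for a writer that lets them through.

```python
# Python 3 sketch; the Python 2.2 compiler and .pyc format are not involved.
lone = "\ud800"  # an unpaired high surrogate

# Stand-in for the first import: the literal is serialized as UTF-8.
# "surrogatepass" imitates a writer that does not reject lone surrogates.
stored = lone.encode("utf-8", "surrogatepass")

# Stand-in for the second import: a strict decoder reads the bytes back
# and complains about the unpaired surrogate.
try:
    stored.decode("utf-8")
    reload_failed = False
except UnicodeDecodeError:
    reload_failed = True

# A decoder that tolerates surrogates restores the original string.
restored = stored.decode("utf-8", "surrogatepass")
```

The point is only that the bytes can be written without complaint and then refused on the way back in, which matches the "works once, fails on the next import" symptom.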

> Can anyone tell me what's causing this, or point me to a reference to
> show when it was fixed?


In Misc/NEWS, we have, for 2.3a1:

- The UTF-8 codec will now encode and decode Unicode surrogates
  correctly and without raising exceptions for unpaired ones.

Essentially, Python now allows surrogates to occur in UTF-8 encodings.

> I'm using 2.2.1 and I couldn't find mention of it in any release notes
> up through 2.3. Any other comments/suggestions (besides "stop
> supporting narrow unicode builds of Py 2.2") would be appreciated,
> too. Thanks :)


I see two options. One is to compile the code with exec, avoiding byte
code generation. Put

exec """

before the code, and

"""

after it. The other option is to use variables instead of literals:

surr1 = unichr(0xd800)
surr2 = unichr(0xdc00)
surr3 = unichr(0xe000)
def chars(s, surr1=surr1, surr2=surr2, surr3=surr3):
    ...
    if surr1 <= i < surr2:
        ...
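
For illustration, here is a self-contained sketch of the variables approach. The function name classify and its return strings are hypothetical and not taken from the original code; the unichr fallback is there only so the sketch also runs on Python 3.

```python
try:
    unichr  # exists on Python 2
except NameError:
    unichr = chr  # Python 3 spelling

# Build the boundary characters at run time, so no surrogate literal
# ever has to be stored in the compiled byte code.
SURR1 = unichr(0xd800)  # first high surrogate
SURR2 = unichr(0xdc00)  # first low surrogate
SURR3 = unichr(0xe000)  # first code point after the surrogate range


def classify(ch, surr1=SURR1, surr2=SURR2, surr3=SURR3):
    """Return 'high', 'low' or 'other' for a single code unit."""
    if surr1 <= ch < surr2:
        return "high"
    if surr2 <= ch < surr3:
        return "low"
    return "other"
```

Binding the boundaries as default arguments, as in the snippet above, keeps the lookups local to the function.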

I would personally go with "stop supporting Py 2.2". Unless you have the
time machine, you can't fix the bugs in old Python releases, and it is
a waste of time (IMO) to uglify the code just to work around limitations
in older interpreter versions.

Regards,
Martin
--
http://mail.python.org/mailman/listinfo/python-list
