Amaury Forgeot d'Arc <amaur...@gmail.com> added the comment: Actually, this fails on 2.6 and 2.7 on wide unicode builds, and passes with narrow unicode builds (on my 64bit Linux box).
In pyexpat.c, PyUnknownEncodingHandler accesses 256 characters of a unicode buffer, without checking its length... which happens to be 192 chars long. So buffers overflow, etc. The function has a comment "supports only 8bit encodings"; indeed. Versions 3.2 and 3.3 happen to pass the test, probably by pure luck. Supporting multibytes codecs won't be easy: pyexpat requires to fill an array which specifies the number of bytes needed by each start byte (for example, in utf-8, 0xc3 starts a 2-bytes sequence, 0xef starts a 3-bytes sequence). Our codecs framwork does not provide this information, and some codecs (gb18030 for example) need the second char to determine whether it will need 4 bytes. ---------- nosy: +amaury.forgeotdarc _______________________________________ Python tracker <rep...@bugs.python.org> <http://bugs.python.org/issue13612> _______________________________________ _______________________________________________ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com