Amaury Forgeot d'Arc <amaur...@gmail.com> added the comment:

Actually, this fails on 2.6 and 2.7 with wide unicode builds, and passes with 
narrow unicode builds (on my 64-bit Linux box).

In pyexpat.c, PyUnknownEncodingHandler reads 256 characters from a unicode 
buffer without checking its length... which happens to be 192 chars long.
So the read runs past the end of the buffer.  The function has a comment 
"supports only 8bit encodings"; indeed.
Versions 3.2 and 3.3 happen to pass the test, probably by pure luck.
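
To see the length mismatch from Python, here is a rough sketch of what the 
handler appears to do: decode the 256 byte values with the "replace" error 
handler and look at how many characters come back.  The function names are 
made up for illustration, the codec list is arbitrary, and the exact lengths 
for multibyte codecs may differ between Python versions.

```python
def decoded_template_length(encoding_name):
    """Decode bytes 0x00..0xff in one go and return the character count."""
    template = bytes(range(256))
    # "replace" turns undecodable input into U+FFFD instead of raising.
    decoded = template.decode(encoding_name, "replace")
    return len(decoded)


def fits_256_entry_map(encoding_name):
    """True when every byte value decoded to exactly one character."""
    return decoded_template_length(encoding_name) == 256


if __name__ == "__main__":
    # Single-byte codecs give 256 characters; multibyte codecs may give
    # fewer because several bytes can collapse into one character.
    for name in ("latin-1", "cp1252", "gb18030", "shift_jis"):
        print(name, decoded_template_length(name), fits_256_entry_map(name))
```

A length check like fits_256_entry_map() before indexing the decoded buffer 
would be enough to avoid the out-of-bounds read, although it would not make 
multibyte codecs work.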

Supporting multibyte codecs won't be easy: pyexpat has to fill an array that 
specifies the number of bytes needed by each start byte (for example, in 
utf-8, 0xc3 starts a 2-byte sequence, 0xef starts a 3-byte sequence).  Our 
codecs framework does not provide this information, and some codecs (gb18030 
for example) need the second byte to determine whether a character will need 
4 bytes.
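
As an illustration of the kind of table meant here, the sketch below builds 
one by hand for utf-8, where the start byte alone determines the sequence 
length.  The convention used (the code point for a single-byte character, -n 
for the start byte of an n-byte sequence, -1 for a byte that cannot start a 
character) is my reading of the map in expat's XML_Encoding and should be 
treated as an assumption; utf8_start_byte_map is a hypothetical name.  Expat 
already handles utf-8 natively, so this only shows the shape of the data.

```python
def utf8_start_byte_map():
    """Return a 256-entry table describing each possible start byte."""
    table = []
    for byte in range(256):
        if byte < 0x80:
            table.append(byte)   # ASCII: the byte is the code point
        elif 0xC2 <= byte <= 0xDF:
            table.append(-2)     # starts a 2-byte sequence
        elif 0xE0 <= byte <= 0xEF:
            table.append(-3)     # starts a 3-byte sequence
        elif 0xF0 <= byte <= 0xF4:
            table.append(-4)     # starts a 4-byte sequence
        else:
            table.append(-1)     # continuation or invalid byte
    return table


if __name__ == "__main__":
    table = utf8_start_byte_map()
    print(hex(0xC3), table[0xC3])
    print(hex(0xEF), table[0xEF])
```

No such table can be written for gb18030 from the start byte alone, since the 
same start byte can begin either a 2-byte or a 4-byte sequence.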

----------
nosy: +amaury.forgeotdarc

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue13612>
_______________________________________