STINNER Victor added the comment:

Oh, set_encoding.patch is wrong:

+            offset = self._decoded_chars_used - len(next_input)

self._decoded_chars_used counts Unicode characters, whereas len(next_input) 
counts bytes. The subtraction is only correct for single-byte encodings like 
ascii or latin1, not for multibyte encodings like utf8 or ucs-4.
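
A small illustration of the mismatch (not taken from the patch; the sample 
string and variable names are made up for this example). It compares the 
character count of a string with its byte length in several encodings:

```python
# Illustrative only: the string and names below are not from the patch.
text = "héllo wörld"

char_count = len(text)                          # Unicode characters
latin1_bytes = len(text.encode("latin-1"))      # 1 byte per character
utf8_bytes = len(text.encode("utf-8"))          # 1 to 4 bytes per character
utf32_bytes = len(text.encode("utf-32-le"))     # 4 bytes per character

for name, size in (("latin-1", latin1_bytes),
                   ("utf-8", utf8_bytes),
                   ("utf-32-le", utf32_bytes)):
    # The difference is zero only for the single-byte encoding.
    print(name, "chars:", char_count, "bytes:", size,
          "difference:", size - char_count)
```

Only for latin-1 do the two counts agree, so an offset computed by mixing 
them is wrong as soon as a character takes more than one byte.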

> peeking into the underlying buffer would be enough to
> handle encoding detection.

I wrote a new patch using this idea. It does not work (yet?) with non-seekable 
streams: the raw read buffer (a bytes string) is not stored in the _snapshot 
attribute if the stream is not seekable. This could be changed to solve the issue.
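
To make the peeking idea concrete, here is a rough sketch at the public API 
level. It is not what set_encoding-2.patch does (the patch works inside 
TextIOWrapper); detect_bom_encoding is a hypothetical helper, and it only 
looks for a BOM. It relies on io.BufferedReader.peek(), which returns 
buffered bytes without advancing the stream position:

```python
import codecs
import io

def detect_bom_encoding(buffered, default="utf-8"):
    """Hypothetical helper: guess an encoding from a BOM without consuming data."""
    # peek() may return more or fewer bytes than requested; keep at most 4.
    head = buffered.peek(4)[:4]
    # Check UTF-32 first: its little-endian BOM starts with the UTF-16-LE BOM.
    candidates = (
        (codecs.BOM_UTF32_LE, "utf-32"),
        (codecs.BOM_UTF32_BE, "utf-32"),
        (codecs.BOM_UTF8, "utf-8-sig"),
        (codecs.BOM_UTF16_LE, "utf-16"),
        (codecs.BOM_UTF16_BE, "utf-16"),
    )
    for bom, name in candidates:
        if head.startswith(bom):
            return name
    return default

# Usage: the bytes are still in the buffer, so the text layer sees all of them.
raw = io.BytesIO(codecs.BOM_UTF8 + "héllo".encode("utf-8"))
buffered = io.BufferedReader(raw)
encoding = detect_bom_encoding(buffered)
text = io.TextIOWrapper(buffered, encoding=encoding).read()
```

Because nothing is consumed before the TextIOWrapper is created, this 
approach does not need seek() at all, which is why it is attractive for 
non-seekable streams.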

set_encoding-2.patch is still a work in progress. For example, it does not 
patch the _io module.

----------
Added file: http://bugs.python.org/file26750/set_encoding-2.patch

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue15216>
_______________________________________