Re: Python 3.0 automatic decoding of UTF16

Terry Reedy Sun, 07 Dec 2008 01:16:25 -0800

John Machin wrote:

Here's the scoop: It's a bug in the newline handling (in io.py, class
IncrementalNewlineDecoder, method decode). It reads text files in 128-
byte chunks. Converting CR LF to \n requires special case handling
when '\r' is detected at the end of the decoded chunk n in case
there's an LF at the start of chunk n+1. Buggy solution: prepend b'\r'
to the chunk n+1 bytes and decode that -- suddenly with a 2-bytes-per-
char encoding like UTF-16 we are 1 byte out of whack. Better (IMVH[1]
O) solution: prepend '\r' to the result of decoding the chunk n+1
bytes. Each of the OP's files has \r on a 64-character boundary.
Note: They would exhibit the same symptoms if encoded in utf-16LE
instead of utf-16BE. With the better solution applied, the first file
[the truncated one] gave the expected error, and the second file [the
apparently OK one] gave sensible looking output.
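
For illustration only, here is a minimal sketch of the two approaches described above. It is not the actual io.py code: the function name, the `fix` flag, and the sample text are made up for this example, and the 128-byte chunk size is taken from the description. It holds back a trailing '\r' at the end of a chunk and then either prepends b'\r' to the next chunk's bytes (the buggy variant) or prepends '\r' to the next chunk's decoded text (the better variant).

```python
import codecs

CHUNK_SIZE = 128  # bytes per read, as described in the post


def decode_translating_newlines(data, encoding, fix=True):
    """Decode `data` chunk by chunk, translating CR LF to '\\n'.

    Hypothetical helper for illustration, not the real
    IncrementalNewlineDecoder. fix=False mimics the buggy approach,
    fix=True the suggested one.
    """
    # errors="replace" keeps the buggy path from raising on the odd byte.
    decoder = codecs.getincrementaldecoder(encoding)(errors="replace")
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
    pieces = []
    pending_cr = False
    for index, chunk in enumerate(chunks):
        final = index == len(chunks) - 1
        if pending_cr and not fix:
            # Buggy: push the held-back CR back as ONE byte, which shifts
            # every following 2-byte code unit by one byte.
            chunk = b"\r" + chunk
        text = decoder.decode(chunk, final)
        if pending_cr and fix:
            # Better: put the CR back after decoding, as a character.
            text = "\r" + text
        pending_cr = False
        if text.endswith("\r") and not final:
            # A LF may follow at the start of the next chunk, so hold it.
            text = text[:-1]
            pending_cr = True
        pieces.append(text.replace("\r\n", "\n"))
    return "".join(pieces)


# 63 characters plus '\r' fill exactly 64 characters (128 bytes in UTF-16),
# so the '\r' is the last character of the first chunk.
original = "a" * 63 + "\r\n" + "second line\r\n"
expected = original.replace("\r\n", "\n")

for enc in ("utf-16-be", "utf-16-le"):
    raw = original.encode(enc)
    fixed = decode_translating_newlines(raw, enc, fix=True)
    buggy = decode_translating_newlines(raw, enc, fix=False)
    print(enc, "fixed ok:", fixed == expected, "buggy ok:", buggy == expected)
```

With this sample the better variant should round-trip for both byte orders, while the buggy variant decodes the second chunk one byte out of step.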


[1] I thought it best to be Very Humble given what you see when you
do:
   import io
   print(io.__author__)
Hope my surge protector can cope with this :-)
^%!//()
NO CARRIER

Please post this on the tracker so it can get included with other io work for 3.0.1.


--
http://mail.python.org/mailman/listinfo/python-list