On Dec 5, 2008, at 11:36 AM, Johannes Bauer wrote:
I suspect that '?' after \n (\u0a00) indicates not 'question-mark'
but 'uninterpretable as a utf16 character'. The traceback below
confirms that. It should be an end-of-file marker and should not be
passed to Python. I strongly suspect that whatever wrote the file
screwed up the (OS-specific) end-of-file marker. I have seen this
occasionally on DOS/Windows with ASCII byte files, with the same
symptom of reading random garbage past the end of the file. Or perhaps
end-of-file does not work right with utf16.
So UTF-16 has an explicit EOF marker within the text?
No, it does not. I don't know what Terry's thinking of there, but
text files do not have any EOF marker. They start at the beginning
(sometimes including a byte-order mark), and go till the end of the
file, period.
I cannot find one in the original file, only some kind of starting
sequence, I suppose (0xfeff).
That's your byte-order mark (BOM).
The last characters of the file are 0x00 0x0d 0x00 0x0a,
a simple \r\n line ending.
Sounds like a perfectly normal file to me.
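For what it's worth, here is a rough sketch of why those bytes look fine to
me. The sample content ("hi") is made up for illustration; only the BOM and
the trailing \r\n bytes are taken from your description, and I'm assuming the
file is big-endian since 0xfeff comes first.

```python
import codecs

# Hypothetical sample data: a big-endian BOM, some text, then \r\n.
# Only the BOM and the line ending mirror the file described above.
data = codecs.BOM_UTF16_BE + "hi\r\n".encode("utf-16-be")

# The file starts with the byte-order mark 0xfe 0xff ...
starts_with_bom = data[:2] == b"\xfe\xff"

# ... and ends with 0x00 0x0d 0x00 0x0a, i.e. \r\n in UTF-16 BE.
last_four = data[-4:]

# The generic "utf-16" codec reads the BOM to pick the byte order
# and does not include it in the decoded text.
text = data.decode("utf-16")
```

Decoded that way, the text simply stops after the \r\n; there is nothing
after it that could be an end-of-file marker.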
It's hard to imagine, but it looks to me like you've found a bug.
Best,
- Joe