Re: Python 3.0 automatic decoding of UTF16

2008-12-07 Thread John Machin
On Dec 8, 2:05 am, Johannes Bauer <[EMAIL PROTECTED]> wrote:
> John Machin wrote:
> > He did. Ugly stuff using readline() :-) Should still work, though.
>
> Well, well, I'm a C kinda guy used to while (fgets(b, sizeof(b), f))
> kinda loops :-)
>
> But, seriously - I find that whole "while True:

Re: Python 3.0 automatic decoding of UTF16

2008-12-07 Thread D'Arcy J.M. Cain
On Sun, 07 Dec 2008 16:05:53 +0100 Johannes Bauer <[EMAIL PROTECTED]> wrote:
> But, seriously - I find that whole "while True:" and "if line == """
> construct ugly as hell, too. How can reading a file line by line be
> achieved in a more pythonic kind of way?

for line in open(filename):

-- D
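The idiom suggested above can be fleshed out roughly as follows. This is only a sketch, assuming a current Python 3 release; the helper name read_lines and the temporary demo file are illustrative and do not come from the thread.

```python
import os
import tempfile


def read_lines(filename, encoding="utf-16"):
    """Return the lines of a text file without their trailing newline."""
    lines = []
    # Iterating over the file object yields one decoded line at a time,
    # so no explicit readline() / "while True" loop is needed.
    with open(filename, "r", encoding=encoding) as f:
        for line in f:
            lines.append(line.rstrip("\n"))
    return lines


# Demo data (hypothetical): a small UTF-16 file with CR LF line endings.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write("first\r\nsecond\r\n".encode("utf-16"))
    demo_path = tmp.name

demo_lines = read_lines(demo_path)
os.remove(demo_path)
```

The with block also closes the file, which the bare "for line in open(filename):" form leaves to the garbage collector.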

Re: Python 3.0 automatic decoding of UTF16

2008-12-07 Thread Johannes Bauer
John Machin wrote:
> He did. Ugly stuff using readline() :-) Should still work, though.

Well, well, I'm a C kinda guy used to while (fgets(b, sizeof(b), f)) kinda loops :-)

But, seriously - I find that whole "while True:" and "if line == """ construct ugly as hell, too. How can reading a file

Re: Python 3.0 automatic decoding of UTF16

2008-12-07 Thread John Machin
On Dec 7, 8:15 pm, Terry Reedy <[EMAIL PROTECTED]> wrote:
> John Machin wrote:
> > Here's the scoop: It's a bug in the newline handling (in io.py, class
> > IncrementalNewlineDecoder, method decode). It reads text files in
> > 128-byte chunks. Converting CR LF to \n requires special case handling

Re: Python 3.0 automatic decoding of UTF16

2008-12-07 Thread Terry Reedy
John Machin wrote:
> Here's the scoop: It's a bug in the newline handling (in io.py, class IncrementalNewlineDecoder, method decode). It reads text files in 128-byte chunks. Converting CR LF to \n requires special case handling when '\r' is detected at the end of the decoded chunk n in case there
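The special case described above (a '\r' at the very end of one decoded chunk whose '\n' may only arrive with the next chunk) can be sketched like this. It is not the io.py code under discussion; CRLFTranslator and translate_in_chunks are hypothetical names, and the sketch works on already-decoded text for simplicity.

```python
class CRLFTranslator:
    """Toy incremental translator: turns '\\r\\n' and lone '\\r' into '\\n'."""

    def __init__(self):
        # True when the previous chunk ended with a '\r' that was held back.
        self.pending_cr = False

    def feed(self, text, final=False):
        if self.pending_cr:
            # Re-attach the '\r' held back from the previous chunk.
            text = "\r" + text
            self.pending_cr = False
        if text.endswith("\r") and not final:
            # The matching '\n' may open the next chunk, so wait for it.
            text = text[:-1]
            self.pending_cr = True
        return text.replace("\r\n", "\n").replace("\r", "\n")


def translate_in_chunks(text, chunk_size):
    """Feed text through the translator chunk_size characters at a time."""
    translator = CRLFTranslator()
    pieces = []
    for start in range(0, len(text), chunk_size):
        chunk = text[start:start + chunk_size]
        final = start + chunk_size >= len(text)
        pieces.append(translator.feed(chunk, final=final))
    return "".join(pieces)
```

Whatever the chunk size, the result should match a one-shot translation of the whole text; a translator that forgets the held-back '\r' would instead emit a spurious extra newline whenever a CR LF pair straddles a chunk boundary.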

Re: Python 3.0 automatic decoding of UTF16

2008-12-06 Thread John Machin
On Dec 7, 9:34 am, John Machin <[EMAIL PROTECTED]> wrote:
> On Dec 7, 9:01 am, David Bolen <[EMAIL PROTECTED]> wrote:
>
> > Johannes Bauer <[EMAIL PROTECTED]> writes:
> > > This is very strange - when using "utf16", endianness should be detected
> > > automatically. When I simply truncate the trail

Re: Python 3.0 automatic decoding of UTF16

2008-12-06 Thread John Machin
On Dec 7, 9:01 am, David Bolen <[EMAIL PROTECTED]> wrote:
> Johannes Bauer <[EMAIL PROTECTED]> writes:
> > This is very strange - when using "utf16", endianness should be detected
> > automatically. When I simply truncate the trailing zero byte, I receive:
>
> Any chance that whatever you used to "

Re: Python 3.0 automatic decoding of UTF16

2008-12-06 Thread David Bolen
Johannes Bauer <[EMAIL PROTECTED]> writes:
> This is very strange - when using "utf16", endianness should be detected
> automatically. When I simply truncate the trailing zero byte, I receive:

Any chance that whatever you used to "simply truncate the trailing zero byte" also removed the BOM at th
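To make the BOM question concrete, here is a small sketch of how Python's "utf-16" codec behaves with and without a byte-order mark. The sample text and variable names are made up for illustration and say nothing about the contents of the original poster's file.

```python
import codecs

text = "hi\r\n"

# The same text in both byte orders, each prefixed with the matching BOM.
le_with_bom = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
be_with_bom = codecs.BOM_UTF16_BE + text.encode("utf-16-be")

# With a BOM present, the generic "utf-16" codec detects the byte order
# and drops the BOM from the decoded result.
decoded_le = le_with_bom.decode("utf-16")
decoded_be = be_with_bom.decode("utf-16")

# Once the BOM has been stripped, the byte order must be named explicitly.
be_without_bom = be_with_bom[len(codecs.BOM_UTF16_BE):]
decoded_explicit = be_without_bom.decode("utf-16-be")

# Cutting off a single byte leaves an incomplete 16-bit code unit,
# which the decoder rejects.
try:
    le_with_bom[:-1].decode("utf-16")
    truncated_decodes = True
except UnicodeDecodeError:
    truncated_decodes = False
```

So automatic endianness detection depends entirely on the BOM surviving whatever tool was used to edit the file.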

Re: Python 3.0 automatic decoding of UTF16

2008-12-06 Thread John Machin
On Dec 7, 6:20 am, "Mark Tolonen" <[EMAIL PROTECTED]> wrote:
> "Johannes Bauer" <[EMAIL PROTECTED]> wrote in message
> news:[EMAIL PROTECTED]
>
> >John Machin wrote:
> >> On Dec 6, 5:36 am, Johannes Bauer <[EMAIL PROTECTED]> wrote:
> >>> So UTF-16 has an explicit EOF marker within the text?

Re: Python 3.0 automatic decoding of UTF16

2008-12-06 Thread Mark Tolonen
"Johannes Bauer" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED] John Machin schrieb: On Dec 6, 5:36 am, Johannes Bauer <[EMAIL PROTECTED]> wrote: So UTF-16 has an explicit EOF marker within the text? I cannot find one in original file, only some kind of starting sequence I suppose

Re: Python 3.0 automatic decoding of UTF16

2008-12-06 Thread MRAB
Johannes Bauer wrote:
> [EMAIL PROTECTED] wrote:
>> 2 problems: endianness and trailing zero byte.
>> This works for me:
>
> This is very strange - when using "utf16", endianness should be detected automatically. When I simply truncate the trailing zero byte, I receive:
>
> Traceback (most recent call last

Re: Python 3.0 automatic decoding of UTF16

2008-12-06 Thread Johannes Bauer
John Machin wrote:
> On Dec 6, 5:36 am, Johannes Bauer <[EMAIL PROTECTED]> wrote:
>> So UTF-16 has an explicit EOF marker within the text? I cannot find one
>> in original file, only some kind of starting sequence I suppose
>> (0xfeff). The last characters of the file are 0x00 0x0d 0x00 0x0a,
>>

Re: Python 3.0 automatic decoding of UTF16

2008-12-06 Thread Johannes Bauer
[EMAIL PROTECTED] wrote:
> 2 problems: endianness and trailing zero byte.
> This works for me:

This is very strange - when using "utf16", endianness should be detected automatically. When I simply truncate the trailing zero byte, I receive:

Traceback (most recent call last):
  File "./modify.py

Re: Python 3.0 automatic decoding of UTF16

2008-12-05 Thread MRAB
John Machin wrote:
> On Dec 6, 10:35 am, Steven D'Aprano <[EMAIL PROTECTED] cybersource.com.au> wrote:
>> On Fri, 05 Dec 2008 12:00:59 -0700, Joe Strout wrote:
>>>> So UTF-16 has an explicit EOF marker within the text?
>>> No, it does not. I don't know what Terry's thinking of there, but text files do not

Re: Python 3.0 automatic decoding of UTF16

2008-12-05 Thread John Machin
On Dec 6, 10:35 am, Steven D'Aprano <[EMAIL PROTECTED] cybersource.com.au> wrote:
> On Fri, 05 Dec 2008 12:00:59 -0700, Joe Strout wrote:
> >> So UTF-16 has an explicit EOF marker within the text?
>
> > No, it does not. I don't know what Terry's thinking of there, but text
> > files do not have an

Re: Python 3.0 automatic decoding of UTF16

2008-12-05 Thread Steven D'Aprano
On Fri, 05 Dec 2008 12:00:59 -0700, Joe Strout wrote:
>> So UTF-16 has an explicit EOF marker within the text?
>
> No, it does not. I don't know what Terry's thinking of there, but text
> files do not have any EOF marker. They start at the beginning
> (sometimes including a byte-order mark), an
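The point that a text file has no end-of-file marker can be checked with a few lines of Python. This is a sketch on an in-memory byte stream; the sample text is made up, and the BOM written by the "utf-16" codec follows the platform's native byte order.

```python
import codecs
import io

# Encoding with the generic "utf-16" codec writes a BOM followed by the
# 16-bit code units of the text, and nothing after that.
raw = "a\r\n".encode("utf-16")
stream = io.BytesIO(raw)

bom = stream.read(2)      # the optional byte-order mark at the start
payload = stream.read()   # the three code units for 'a', '\r', '\n'
at_eof = stream.read()    # further reads just return an empty result

# End of file is signalled by the empty read, not by any byte in the data.
reached_end = (at_eof == b"")
```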

Re: Python 3.0 automatic decoding of UTF16

2008-12-05 Thread John Machin
On Dec 6, 5:36 am, Johannes Bauer <[EMAIL PROTECTED]> wrote:
> So UTF-16 has an explicit EOF marker within the text? I cannot find one
> in original file, only some kind of starting sequence I suppose
> (0xfeff). The last characters of the file are 0x00 0x0d 0x00 0x0a,
> simple \r\n line ending. S

Re: Python 3.0 automatic decoding of UTF16

2008-12-05 Thread MRAB
Joe Strout wrote:
On Dec 5, 2008, at 11:36 AM, Johannes Bauer wrote:
I suspect that '?' after \n (\u0a00) indicates not 'question-mark' but 'uninterpretable as a utf16 character'. The traceback below confirms that. It should be an end-of-file marker and should not be passed to Python. I s

Re: Python 3.0 automatic decoding of UTF16

2008-12-05 Thread info
On Dec 5, 3:25 pm, Johannes Bauer <[EMAIL PROTECTED]> wrote:
> Hello group,
>
> I'm having trouble reading a utf-16 encoded file with Python3.0. This is
> my (complete) code:
>
> #!/usr/bin/python3.0
>
> class AddressBook():
>         def __init__(self, filename):
>                 f = open(filenam

Re: Python 3.0 automatic decoding of UTF16

2008-12-05 Thread Joe Strout
On Dec 5, 2008, at 11:36 AM, Johannes Bauer wrote:
I suspect that '?' after \n (\u0a00) indicates not 'question-mark' but 'uninterpretable as a utf16 character'. The traceback below confirms that. It should be an end-of-file marker and should not be passed to Python. I strongly suspect tha

Re: Python 3.0 automatic decoding of UTF16

2008-12-05 Thread Johannes Bauer
Terry Reedy wrote:
> Johannes Bauer wrote:
>> Hello group,
>>
>> I'm having trouble reading a utf-16 encoded file with Python3.0. This is
>> my (complete) code:
>
> what OS. This is often critical when you have a problem interacting
> with the OS.

It's a 64-bit Linux, currently running: Linux

Re: Python 3.0 automatic decoding of UTF16

2008-12-05 Thread Terry Reedy
Johannes Bauer wrote:
> Hello group,
>
> I'm having trouble reading a utf-16 encoded file with Python3.0. This is
> my (complete) code:

What OS? This is often critical when you have a problem interacting with the OS.

> #!/usr/bin/python3.0
>
> class AddressBook():
>         def __init__(self, filename):

Re: Python 3.0 automatic decoding of UTF16

2008-12-05 Thread Richard Brodie
"J Kenneth King" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED] > It probably means what it says: that the input file contains characters > it cannot read using the specified encoding. That was my first thought. However it appears that there is an off by one error somewhere in the

Re: Python 3.0 automatic decoding of UTF16

2008-12-05 Thread Johannes Bauer
J Kenneth King wrote:
> It probably means what it says: that the input file contains characters
> it cannot read using the specified encoding.

No, it doesn't. The file is just fine, just as the example.

> Are you generating the file from python using a file object with the
> same encoding? If

Re: Python 3.0 automatic decoding of UTF16

2008-12-05 Thread J Kenneth King
Johannes Bauer <[EMAIL PROTECTED]> writes:
> Traceback (most recent call last):
>   File "./modify.py", line 12, in <module>
>     a = AddressBook("2008_11_05_Handy_Backup.txt")
>   File "./modify.py", line 7, in __init__
>     line = f.readline()
>   File "/usr/local/lib/python3.0/io.py", line 1807, in r

Python 3.0 automatic decoding of UTF16

2008-12-05 Thread Johannes Bauer
Hello group,

I'm having trouble reading a utf-16 encoded file with Python3.0. This is my (complete) code:

#!/usr/bin/python3.0

class AddressBook():
        def __init__(self, filename):
                f = open(filename, "r", encoding="utf16")
                while True: