Re: Unicode string handling problem

2006-09-07 Thread John Machin
Richard Schulman wrote: > It turns out that the Unicode input files I was working with (from MS > Word and MS Notepad) were indeed creating eol sequences of \r\n, not > \n\n as I had originally thought. The file reading statement that I > was using, with unpredictable results, was > > #in_file = >

Re: Unicode string handling problem

2006-09-07 Thread Richard Schulman
Many thanks for your help, John, in giving me the tools to work successfully in Python with Unicode from here on out. It turns out that the Unicode input files I was working with (from MS Word and MS Notepad) were indeed creating eol sequences of \r\n, not \n\n as I had originally thought. The fil
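
A minimal sketch of the \r\n behaviour described above, not the code from the thread (the original file-reading statement is cut off in this excerpt). It assumes a modern Python where io.open accepts encoding and newline arguments; the sample rows and the temporary file are made up for illustration.

```python
import codecs
import io
import os
import tempfile

# Hypothetical sample data: two rows ending in \r\n, stored as
# UTF-16 little-endian with a BOM (the layout Notepad calls "Unicode").
data = codecs.BOM_UTF16_LE + u"row one\r\nrow two\r\n".encode("utf-16-le")

fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(data)

try:
    # Default text mode: universal newlines translate \r\n to \n.
    with io.open(path, "r", encoding="utf-16") as f:
        translated = f.readlines()

    # newline="" still splits on \r\n but leaves the ending untranslated.
    with io.open(path, "r", encoding="utf-16", newline="") as f:
        untranslated = f.readlines()
finally:
    os.remove(path)

# Stripping both characters works whichever way the file was opened.
rows = [line.rstrip(u"\r\n") for line in untranslated]
```

Either way each \r\n pair counts as one row terminator, so no "double read" is needed.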

Re: Unicode string handling problem

2006-09-05 Thread John Machin
Richard Schulman wrote: [big snip] > > The BOM is little-endian, I believe. Correct. > >in_file = codecs.open(filepath, mode, encoding="utf16???") > > Right you are. Here is the output produced by so doing: You don't say which encoding you used, but I guess that you used utf_16_le. > > > u'
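
One likely reason the choice of codec matters here, sketched under the assumption that the file starts with a little-endian BOM as stated above: the generic "utf-16" codec consumes the BOM and uses it to pick the byte order, while "utf-16-le" treats it as an ordinary character and leaves U+FEFF at the front of the first string. The sample bytes are hypothetical.

```python
import codecs

# Hypothetical sample: BOM followed by two Chinese characters, some
# English text and a \r\n terminator, all UTF-16 little-endian.
raw = codecs.BOM_UTF16_LE + u"\u4e2d\u6587 text\r\n".encode("utf-16-le")

# "utf-16" reads the BOM, picks the byte order from it, and drops it.
bom_aware = raw.decode("utf-16")

# "utf-16-le" assumes the byte order and keeps the BOM as U+FEFF.
explicit_le = raw.decode("utf-16-le")

# Removing the stray U+FEFF by hand makes the two results agree.
cleaned = explicit_le.lstrip(u"\ufeff")
```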

Re: Unicode string handling problem

2006-09-05 Thread Richard Schulman
On Wed, 06 Sep 2006 03:55:18 GMT, Richard Schulman <[EMAIL PROTECTED]> wrote: >...I'm now using the codec with >improved results, but am still puzzled as to how to handle the row >termination of \n\n, which is being interpreted as two rows instead of >one. Of course, I could do a double read on e

Re: Unicode string handling problem

2006-09-05 Thread Richard Schulman
On 5 Sep 2006 19:50:27 -0700, "John Roth" <[EMAIL PROTECTED]> wrote: >> [T]he file I actually want to process is Unicode (utf-16 encoding). >>... >> in_file = open("c:\\pythonapps\\in-graf1.my","rU") >>... John Roth: >You're not detecting the file encoding and then >using it in the open statement
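
A rough sketch of what "detecting the file encoding" could look like when the file carries a BOM. The function name sniff_encoding and the "ascii" fallback are assumptions made for illustration, not anything proposed in the thread, and a file without a BOM cannot be identified this way.

```python
import codecs

# Longer BOMs come first: the UTF-32 LE BOM begins with the same two
# bytes as the UTF-16 LE BOM, so the order of the checks matters.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_encoding(head, default="ascii"):
    """Guess a codec name from the first few bytes of a file.

    Returns a BOM-aware codec name when a BOM is present and
    `default` otherwise.
    """
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return default
```

The returned name would then be passed as the encoding when the file is opened, after reading the first four bytes in binary mode.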

Re: Unicode string handling problem

2006-09-05 Thread Richard Schulman
Thanks for your excellent debugging suggestions, John. See below for my follow-up: Richard Schulman: >> The following program fragment works correctly with an ascii input >> file. >> >> But the file I actually want to process is Unicode (utf-16 encoding). >> The file must be Unicode rather than AS

Re: Unicode string handling problem

2006-09-05 Thread John Roth
Richard Schulman wrote: > The following program fragment works correctly with an ascii input > file. > > But the file I actually want to process is Unicode (utf-16 encoding). > The file must be Unicode rather than ASCII or Latin-1 because it > contains mixed Chinese and English characters. > > Whe

Re: Unicode string handling problem (revised)

2006-09-05 Thread John Machin
Richard Schulman wrote: [snip] > in_line = in_file.readline() [snip] We'd already deduced that that line was incorrectly published. Please don't start new threads like this; if you want to make a correction, do a couple-of-lines reply to your original message. Now please leave this new thread

Unicode string handling problem (revised)

2006-09-05 Thread Richard Schulman
The appended program fragment works correctly with an ascii input file. But the file I actually want to process is Unicode (utf-16 encoding). This file must be Unicode rather than ASCII or Latin-1 because it contains mixed Chinese and English characters. When I run the program I get an attribute_c

Re: Unicode string handling problem

2006-09-05 Thread John Machin
Richard Schulman wrote: > The following program fragment works correctly with an ascii input > file. > > But the file I actually want to process is Unicode (utf-16 encoding). > The file must be Unicode rather than ASCII or Latin-1 because it > contains mixed Chinese and English characters. > > When

Unicode string handling problem

2006-09-05 Thread Richard Schulman
The following program fragment works correctly with an ascii input file. But the file I actually want to process is Unicode (utf-16 encoding). The file must be Unicode rather than ASCII or Latin-1 because it contains mixed Chinese and English characters. When I run the program below I get an attr