Thanks for your excellent debugging suggestions, John. See below for my follow-up:
Richard Schulman:
>> The following program fragment works correctly with an ASCII input
>> file.
>>
>> But the file I actually want to process is Unicode (UTF-16
>> encoding). The file must be Unicode rather than ASCII or Latin-1
>> because it contains mixed Chinese and English characters.
>>
>> When I run the program below I get an attribute_count of zero,
>> which is incorrect for the input file, which should give a value of
>> fifteen or sixteen. In other words, the count function isn't
>> recognizing the '",' characters in the line being read. Here's the
>> program:
>> ...

John Machin:
> Insert
>     print type(in_line)
>     print repr(in_line)
> here [also make the appropriate changes to get the same info from
> the first line], run it again, copy/paste what you get, show us what
> you see.

Here's the revised program, per your suggestion:

=====================================================
# This program processes a UTF-16 input file that is
# to be loaded later into a MySQL table. The input file
# is not yet ready for prime time. The purpose of this
# program is to ready it.

in_file = open("c:\\pythonapps\\in-graf1.my", "rU")
try:
    # The first line read is a SQL INSERT statement; no
    # processing will be required.
    in_line = in_file.readline()
    print type(in_line)    # For debugging
    print repr(in_line)    # For debugging

    # The second line read is the first data row.
    in_line = in_file.readline()
    print type(in_line)    # For debugging
    print repr(in_line)    # For debugging

    # For this and subsequent rows, we must count all
    # the '",' character pairs in a given line/row.
    # This provides an n-1 measure of the attributes
    # for a SQL INSERT of this row. All rows must have
    # sixteen attributes, but some don't yet.
    attribute_count = in_line.count('",')
    print attribute_count
finally:
    in_file.close()
=====================================================

The output of this program, which I ran at the command line, had to be
copied by hand and abridged, but I think I have included the relevant
information:

C:\pythonapps>python graf_correction.py
<type 'str'>
'\xff\xfeI\x00N\x00S...
    [the beginning of a SQL INSERT statement]
...\x00U\x00E\x00S\x00\n'
    [the VALUES keyword at the end of the row, followed by an
    end-of-line]
<type 'str'>
'\x00\n'
    [uh-oh! For the second row, all we're seeing is an end-of-line
    character. Is that from the first row? Wasn't the "rU" mode
    supposed to handle that?]
0
    [the counter value. It's hardly surprising it's only zero, given
    that most of the row never got loaded, just an EOL mark]

J.M.:
> If you're coy about that, then you'll have to find out yourself if
> it has a BOM at the front, and if not whether it's little/big/endian.

The BOM is little-endian, I believe.

R.S.:
>> Any suggestions?

J.M.:
> 1. Read the Unicode HOWTO.
> 2. Read the docs on the codecs module ...
>
> You'll need to use
>
> in_file = codecs.open(filepath, mode, encoding="utf16???????")

Right you are. Here is the output produced by so doing:

<type 'unicode'>
u'\ufeffINSERT INTO [...] VALUES\n'
<type 'unicode'>
u'\n'
0
    [the counter value]

> It would also be a good idea to get into the habit of using unicode
> constants like u'",'

Right.

> HTH,
> John

Yes, it did. Many thanks! Now I've got to figure out the best way to
handle that \n\n at the end of each row, which the program is
interpreting as two rows. That represents two surprises: first, I
thought that Microsoft files ended in \r\n; second, I thought that
Python mode "rU" was supposed to be the universal EOL handler and
would treat the \r\n as a single mark.

Richard Schulman
--
http://mail.python.org/mailman/listinfo/python-list
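[A follow-up sketch of what produced the phantom '\x00\n' "line" above.
In UTF-16-LE, the two characters \r\n are stored as the four bytes
\r\x00\n\x00, so a byte-oriented reader in "rU" mode translates and
splits on the raw \r and \n bytes, stranding the \x00 halves of the
code units. This is a crude illustrative model of that translation,
not Python 2's actual "rU" implementation:]

```python
# One logical line with a Windows line ending, encoded as UTF-16-LE:
raw = 'VALUES\r\n'.encode('utf-16-le')
print(raw)  # b'V\x00A\x00L\x00U\x00E\x00S\x00\r\x00\n\x00'

# A byte-oriented "rU" reader maps \r\n and lone \r to \n, then
# splits on every \n byte -- even ones that are halves of UTF-16
# code units:
translated = raw.replace(b'\r\n', b'\n').replace(b'\r', b'\n')
pieces = translated.split(b'\n')
print(pieces)  # [b'V\x00A\x00L\x00U\x00E\x00S\x00', b'\x00', b'\x00']
```

[The middle b'\x00' piece, plus the \n that readline() keeps, is
exactly the '\x00\n' that appeared as a spurious second row.]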
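[On the codec choice: the u'\ufeff' at the front of the decoded first
line suggests an endian-specific codec such as "utf-16-le" was used,
which leaves the BOM in the data. The plain "utf-16" codec detects and
consumes a leading BOM instead. A minimal illustration (the sample
bytes are made up):]

```python
import codecs

# A little-endian BOM followed by UTF-16-LE text, as in the input file:
raw = codecs.BOM_UTF16_LE + 'INSERT'.encode('utf-16-le')

# The endian-specific codec keeps the BOM as a U+FEFF character:
print(repr(raw.decode('utf-16-le')))  # '\ufeffINSERT'

# The generic codec consumes the BOM and uses it to pick endianness:
print(repr(raw.decode('utf-16')))     # 'INSERT'
```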
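[Putting the pieces together: decode with the UTF-16 codec first and
let universal newlines fold \r\n afterwards, and the count comes out
as expected. A sketch, written for modern Python with io.open, using a
made-up sixteen-attribute row standing in for the real in-graf1.my
data (filename, table name, and values are all illustrative):]

```python
import io
import os
import tempfile

# A stand-in for the real input file: a header line plus one data row
# with sixteen double-quoted attributes, Windows line endings, UTF-16
# with BOM.
header = 'INSERT INTO graf VALUES\r\n'
row = '("' + '","'.join('v%d' % i for i in range(16)) + '")\r\n'

fd, path = tempfile.mkstemp(suffix='.my')
os.close(fd)
# newline='' stops the text layer rewriting our explicit \r\n;
# encoding='utf-16' writes a BOM at the start of the file.
with io.open(path, 'w', encoding='utf-16', newline='') as f:
    f.write(header + row)

# Reading back: encoding='utf-16' consumes the BOM, and the default
# universal-newline mode folds each \r\n into a single \n, so no
# phantom blank rows appear.
with io.open(path, 'r', encoding='utf-16') as f:
    first = f.readline()
    data_row = f.readline()
os.remove(path)

print(repr(first))           # 'INSERT INTO graf VALUES\n'
print(data_row.count('",'))  # 15 -- the n-1 attribute count
```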