Richard Schulman wrote:
[big snip]
>
> The BOM is little-endian, I believe.

Correct.

> >in_file = codecs.open(filepath, mode, encoding="utf16???????")
>
> Right you are. Here is the output produced by so doing:

You don't say which encoding you used, but I guess that you used utf_16_le.

>
> <type 'unicode'>
> u'\ufeffINSERT INTO [...] VALUES\N'

Use utf_16 -- it will strip off the BOM for you.

> <type 'unicode'>
> u'\n'
> 0 [The counter value]
> [snip]
> Yes, it did. Many thanks! Now I've got to figure out the best way to
> handle that \n\n at the end of each row, which the program is
> interpreting as two rows.

Well we don't know yet exactly what you have there. We need a byte dump of the first few bytes of your file. Get into the interactive interpreter and do this:

open('yourfile', 'rb').read(200)

(the 'b' is for binary, in case you are on Windows)
That will show us exactly what's there, without *any* EOL interpretation at all.

> That represents two surprises: first, I
> thought that Microsoft files ended as \n\r ;

Nah. Wrong on two counts. In text mode, Microsoft *lines* end in \r\n (not \n\r); *files* may end in ctrl-Z aka chr(26) -- an inheritance from CP/M.

Ummmm ... are you saying the file has \n\r at the end of each row?? How did you know that if you didn't know what if any BOM it had??? Who created the file????

> second, I thought that
> Python mode "rU" was supposed to be the universal eol handler and
> would handle the \n\r as one mark.

Nah again. It contemplates only \n, \r, and \r\n as end of line. See the docs. Thus \n\r becomes *two* newlines when read with "rU".

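If eyeballing the byte dump requested above is awkward, the first two bytes can also be checked in code. What follows is only a rough sketch, and it rests on assumptions: it is written for Python 3 (the session in this thread is Python 2), and `sniff_utf16_bom` and `sample` are names made up for illustration. The `codecs.BOM_UTF16_LE` and `codecs.BOM_UTF16_BE` constants are in the standard library.

```python
import codecs

def sniff_utf16_bom(head):
    """Guess a UTF-16 codec name from the leading bytes, or return None.

    `head` is a bytes object such as open(path, 'rb').read(200).
    Hypothetical helper: it checks only the two UTF-16 BOMs, so a
    UTF-32-LE file (which also starts with FF FE) would be misreported.
    """
    if head.startswith(codecs.BOM_UTF16_LE):   # b'\xff\xfe'
        return 'utf_16_le'
    if head.startswith(codecs.BOM_UTF16_BE):   # b'\xfe\xff'
        return 'utf_16_be'
    return None

# Made-up sample standing in for the first bytes of the real file.
sample = codecs.BOM_UTF16_LE + 'abc\n\rdef'.encode('utf_16_le')

print(sniff_utf16_bom(sample))
# Plain 'utf_16' consumes the BOM; 'utf_16_le' leaves it in as u'\ufeff'.
print(repr(sample.decode('utf_16')))
print(repr(sample.decode('utf_16_le')))
```

The helper only reports what the BOM says. Decoding with plain utf_16 is still the simplest way to have the BOM stripped for you.
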
Having "\n\r" at the end of each row does fit with your symptoms:

| >>> bom = u"\ufeff"
| >>> guff = '\n\r'.join(['abc', 'def', 'ghi'])
| >>> guffu = unicode(guff)
| >>> import codecs
| >>> f = codecs.open('guff.utf16le', 'wb', encoding='utf_16_le')
| >>> f.write(bom+guffu)
| >>> f.close()
| >>> open('guff.utf16le', 'rb').read() #### see exactly what we've got
| '\xff\xfea\x00b\x00c\x00\n\x00\r\x00d\x00e\x00f\x00\n\x00\r\x00g\x00h\x00i\x00'
| >>> codecs.open('guff.utf16le', 'r', encoding='utf_16').read()
| u'abc\n\rdef\n\rghi' ######### Look, Mom, no BOM!
| >>> codecs.open('guff.utf16le', 'rU', encoding='utf_16').read()
| u'abc\n\ndef\n\nghi' #### U means \r -> \n
| >>> codecs.open('guff.utf16le', 'rU', encoding='utf_16_le').read()
| u'\ufeffabc\n\ndef\n\nghi' ######### reproduces your second experience
| >>> open('guff.utf16le', 'rU').readlines()
| ['\xff\xfea\x00b\x00c\x00\n', '\x00\n', '\x00d\x00e\x00f\x00\n', '\x00\n', '\x00g\x00h\x00i\x00']
| >>> f = open('guff.utf16le', 'rU')
| >>> f.readline()
| '\xff\xfea\x00b\x00c\x00\n'
| >>> f.readline()
| '\x00\n' ######### reproduces your first experience
| >>> f.readline()
| '\x00d\x00e\x00f\x00\n'
| >>>

If that file is a one-off, you can obviously fix it by throwing away every second line. Otherwise, if it's an ongoing exercise, you need to talk sternly to the file's creator :-)

HTH,
John
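
P.S. For the one-off fix, here is a rough sketch of one alternative to throwing away every second line: read the text with newline translation switched off and collapse each "\n\r" pair before splitting. It assumes Python 3 (the session above is Python 2) and that every row really does end in "\n\r", which the byte dump still has to confirm. `read_rows` and `demo_path` are hypothetical names.

```python
import codecs
import os
import tempfile

def read_rows(path, encoding='utf_16'):
    """Return the rows of a file whose rows are separated by "\\n\\r".

    Hypothetical helper. newline='' disables universal-newline
    translation, so the "\\n\\r" pairs arrive intact and can be
    collapsed in one step instead of being seen as two line ends.
    """
    with open(path, 'r', encoding=encoding, newline='') as f:
        text = f.read()
    return text.replace('\n\r', '\n').split('\n')

# Rebuild the same test file as in the session above: BOM + UTF-16-LE text.
demo_path = os.path.join(tempfile.mkdtemp(), 'guff.utf16le')
with open(demo_path, 'wb') as f:
    f.write(codecs.BOM_UTF16_LE + 'abc\n\rdef\n\rghi'.encode('utf_16_le'))

rows = read_rows(demo_path)
print(rows)  # ['abc', 'def', 'ghi']
```

If the file also ends with "\n\r", the last element of the list will be an empty string, which the caller would want to drop.
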