Richard Schulman wrote:
> The following program fragment works correctly with an ascii input
> file.
>
> But the file I actually want to process is Unicode (utf-16 encoding).
> The file must be Unicode rather than ASCII or Latin-1 because it
> contains mixed Chinese and English characters.
>
> When I run the program below I get an attribute_count of zero, which
> is incorrect for the input file, which should give a value of fifteen
> or sixteen. In other words, the count function isn't recognizing the
> ", characters in the line being read. Here's the program:
>
> in_file = open("c:\\pythonapps\\in-graf1.my","rU")
> try:
>     # Skip the first line; make the second available for processing
>     in_file.readline()
>     in_line = readline()

You mean in_line = in_file.readline(), I hope. Do please copy/paste
actual code, not what you think you ran.

>     attribute_count = in_line.count('",')
>     print attribute_count

Insert
    print type(in_line)
    print repr(in_line)
here [also make the appropriate changes to get the same info from the
first line], run it again, copy/paste what you get, show us what you
see.

If you're coy about that, then you'll have to find out yourself if it
has a BOM at the front, and if not whether it's little- or big-endian.

> finally:
>     in_file.close()
>
> Any suggestions?
>

1. Read the Unicode HOWTO.
2. Read the docs on the codecs module ...

You'll need to use
    in_file = codecs.open(filepath, mode, encoding="utf16???????")

It would also be a good idea to get into the habit of using unicode
constants like u'",'

HTH,
John
-- 
http://mail.python.org/mailman/listinfo/python-list
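
For anyone following along, here is a minimal sketch of the approach suggested above: check for a BOM, then open the file with an explicit UTF-16 encoding and count on the decoded text. Several things here are assumptions and not from the original post: the sample data and temporary file stand in for the real input, the helper names detect_utf16_bom and count_in_second_line are made up for illustration, and io.open is used in place of codecs.open (it takes the same encoding argument). The sketch is written to run on Python 3, not the Python 2 of the quoted fragment.

```python
import codecs
import io
import os
import tempfile


def detect_utf16_bom(raw):
    """Return 'utf-16-le' or 'utf-16-be' if raw starts with that BOM, else None."""
    if raw.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if raw.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    return None


def count_in_second_line(path, needle=u'",', encoding="utf-16"):
    """Skip the first line, then count needle in the decoded second line."""
    # encoding="utf-16" consumes a leading BOM and uses it to pick the byte order.
    with io.open(path, "r", encoding=encoding) as in_file:
        in_file.readline()
        in_line = in_file.readline()
    return in_line.count(needle)


# Hypothetical sample data: a header line, then mixed Chinese/English fields.
sample = u'header\n"\u4e2d\u6587","english","mixed \u4e2d en",\n'

fd, sample_path = tempfile.mkstemp(suffix=".my")
os.close(fd)
try:
    # Write little-endian UTF-16 with an explicit BOM so the demo is deterministic.
    with io.open(sample_path, "wb") as out_file:
        out_file.write(codecs.BOM_UTF16_LE + sample.encode("utf-16-le"))

    with io.open(sample_path, "rb") as raw_file:
        raw = raw_file.read()

    bom = detect_utf16_bom(raw)
    # In UTF-16 each of these ASCII characters is followed (or preceded) by a
    # zero byte, so the two-byte sequence '",' never appears in the raw data.
    raw_count = raw.count(b'",')
    decoded_count = count_in_second_line(sample_path)
finally:
    os.remove(sample_path)

print(bom)
print(raw_count)
print(decoded_count)
```

With this sample the raw byte count is zero while the decoded count is three, which matches the symptom described in the original question. If the real file turns out to have no BOM, the encoding would have to be given explicitly as "utf-16-le" or "utf-16-be".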