Zhongjian Lu wrote:
> Hi Guys,
>
> I was processing a UTF-16 coded file with BOM and was not aware of the
> codecs package at first. I wrote the following code:
> ===== Code 1============================
> for i in open("d:\python24\lzjtest.xml", 'r').readlines():
>     i = i.decode("utf-16")
>     print i
> =======================================
> Output was:
> Traceback (most recent call last):
>   File "D:\Python24\testutf-16.py", line 4, in -toplevel-
>     i = i.decode("utf-16")
>   File "D:\Python24\lib\encodings\utf_16.py", line 16, in decode
>     return codecs.utf_16_decode(input, errors, True)
> UnicodeDecodeError: 'utf16' codec can't decode byte 0x0a in position
> 84: truncated data
>
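The failure in Code 1 can be reproduced without the original file. This is a minimal sketch in Python 3 syntax (the thread itself uses Python 2.4); the two-line sample text is made up, and io.BytesIO is an assumption standing in for the file read as raw bytes.

```python
import codecs
import io

# Hypothetical sample standing in for the XML file: two short lines,
# encoded as little-endian UTF-16 with a BOM and Windows line endings.
text = "<a>\r\n<b>\r\n"
data = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# readlines() on a byte stream cuts after every 0x0a byte and knows
# nothing about two-byte characters.
chunks = io.BytesIO(data).readlines()
chunk_lengths = [len(chunk) for chunk in chunks]

# Decoding the first chunk on its own fails, as in the traceback.
try:
    chunks[0].decode("utf-16")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

print(chunk_lengths, decode_failed)
```

The first chunk comes out with an odd number of bytes, which is what the codec reports as truncated data.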
UTF-16 uses two bytes for most characters (four for characters outside the Basic Multilingual Plane). In the little-endian form your file uses, '\r\n' is therefore stored as:

    '\r\x00\n\x00'

readlines() knows nothing about the encoding. It splits after every '\n' byte, so each line ends in the middle of a character and you get something like:

    '...\r\x00\n', '\x00...\r\x00\n', '\x00'

The first chunk has an odd number of bytes, so its last byte (0x0a) is only half of a character. That is the 'truncated data' in the traceback: the data has been split on bytes instead of characters.

> I searched google and found an article on the similar problem saying to use
> split(). I had not quite caught the meaning of the article and recode as:
> ==== Code 2==============================
> for i in open("d:\python24\lzjtest.xml", 'r').read().split('\r\n'):
>     i = i.decode("utf-16")
>     print i
> =======================================
> Then it worked (echo the file).
>

You will probably find that the byte pair '\r\n' never occurs in the byte-string, because a '\x00' sits between the '\r' and the '\n'. So split('\r\n') returns the whole file as a single element, the loop runs only once, and the decode succeeds because it sees *all* of the data at once.

HTH

All the best,

Fuzzyman
http://www.voidspace.org.uk/python/index.shtml

> Later I get to know codecs and write the following code:
>
> ==== Code 3 =============================
> import codecs
> for i in codecs.open("d:\python24\lzjtesttvs2.xml", 'r',
> 'utf-16').readlines():
>     print i
> =======================================
> It worked and echo the file.
>
> I am wondering what is the problem with the first code and why the bug
> is fixed in
> the second.
>
> Thanks in advance.
>
> -Zhongjian
--
http://mail.python.org/mailman/listinfo/python-list
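A minimal sketch of why Code 2 and Code 3 both work, in Python 3 syntax (the thread uses Python 2.4). The sample text is made up, and io.BytesIO together with codecs.getreader are assumptions standing in for the file and for codecs.open; this is an illustration, not the original poster's code.

```python
import codecs
import io

# Hypothetical sample standing in for the XML file: two short lines,
# encoded as little-endian UTF-16 with a BOM and Windows line endings.
text = "<a>\r\n<b>\r\n"
data = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# Code 2 style: the bytes 0x0d and 0x0a are never adjacent here, so
# split() finds nothing to split on and returns the data in one piece.
pieces = data.split(b"\r\n")
whole = pieces[0].decode("utf-16")  # decodes everything at once

# Code 3 style: a UTF-16 stream reader decodes the bytes first and only
# then splits the resulting text into lines.
reader = codecs.getreader("utf-16")(io.BytesIO(data))
lines = reader.readlines()

print(len(pieces), repr(whole), lines)
```

Only the second approach actually yields the file line by line; the first happens to work because nothing is split before decoding.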