[EMAIL PROTECTED] wrote:
> I have two files:
>
> test.py:
> --------------------------------------------------
> # -*- encoding : utf8 -*-
> print 'in this file', repr('中文')
>
> # tt.txt is saved as utf8 encoding
> f = file('tt.txt')
> line1 = f.readline().strip()
> print 'another file', repr(line1)
> -------------------------------------------------------
>
> tt.txt:
> ----------------------------------------------------
> 中文
> test
> -------------------------------------------------------
> run test.py and I get the following output:
> in this file '\xe4\xb8\xad\xe6\x96\x87'
> another file '\xef\xbb\xbf\xe4\xb8\xad\xe6\x96\x87'
>
> and I cann't encode line1 like:
> line1.decode('utf8').encode('gbk')
> get this error:
> UnicodeEncodeError: 'gbk' codec can't encode character u'\ufeff' in position 0: illegal multibyte sequence
>
> why did I get the different repr values?

Because whatever you used to "save as" that file retained or inserted a BOM (byte order mark, U+FEFF) at the start of the file before encoding it as UTF-8. That is the '\xef\xbb\xbf' at the start of line1, and once decoded it is also the u'\ufeff' that is giving the gbk codec indigestion, since gbk has no encoding for that character. The string literal in test.py has no BOM in front of it, which is why the two repr values differ.

You can remove it in your script before converting to gbk.

HTH
John
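
P.S. A minimal sketch of one way to remove it, assuming the line is still a byte string as it is in test.py. The helper name strip_utf8_bom is made up for illustration; codecs.BOM_UTF8 is the standard-library constant for '\xef\xbb\xbf'. It is written for Python 3 here; the same calls should also work on a plain str in Python 2.6 and later.

```python
import codecs


def strip_utf8_bom(raw):
    """Return raw bytes without a leading UTF-8 BOM, if one is present."""
    # Hypothetical helper, not part of the original script.
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):]
    return raw


# The bytes reported for the first line of tt.txt in the original post.
line1 = b'\xef\xbb\xbf\xe4\xb8\xad\xe6\x96\x87'

# Drop the BOM first, then decode and re-encode as before.
text = strip_utf8_bom(line1).decode('utf8')
gbk_bytes = text.encode('gbk')
```

Alternatively, the 'utf-8-sig' codec (Python 2.5 and later) drops a leading BOM while decoding, so line1.decode('utf-8-sig').encode('gbk') should do the same thing in one step.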