On Jun 7, 11:25 pm, John Machin <sjmac...@lexicon.net> wrote:
> On Jun 7, 10:55 pm, higer <higerinbeij...@gmail.com> wrote:
>
> > My file contains such strings :
> > \xe6\x97\xa5\xe6\x9c\x9f\xef\xbc\x9a
>
> Are you sure? Does that occupy 9 bytes in your file or 36 bytes?
>

It was saved in a file, so it occupies 36 bytes. If I just put this string in a variable, it certainly produces the correct result, but how do I get the right answer when reading from a file?

> > I want to read the content of this file and transfer it to the
> > corresponding gbk code,a kind of Chinese character encode style.
> > Everytime I was trying to transfer, it will output the same thing no
> > matter which method was used.
> > It seems like that when Python reads it, Python will take '\' as a
> > common char and this string at last will be represented as
> > "\\xe6\\x97\\xa5\\xe6\\x9c\\x9f\\xef\\xbc\\x9a" , then the "\" can be
> > 'correctly' output,but that's not what I want to get.
>
> > Anyone can help me?
>
> try this:
>
> utf8_data = your_data.decode('string-escape')
> unicode_data = utf8_data.decode('utf8')
> # unicode derived from your sample looks like this 日期: is that what
> you expected?

You are right, the result is 日期, which is just what I expected. If you save the string in a variable, you surely can get the correct result. But it is just a sample, so I gave a short string; what if there are many characters in a file?

> gbk_data = unicode_data.encode('gbk')
>

I have tried the method you just told me, but unfortunately it does not work (the output is garbled).

> If that "doesn't work", do three things:
> (1) give us some unambiguous hard evidence about the contents of your
> data:
> e.g.
> # assuming Python 2.x

My Python version is 2.5.2

> your_data = open('your_file.txt', 'rb').read(36)
> print repr(your_data)
> print len(your_data)
> print your_data.count('\\')
> print your_data.count('x')
>

The result is:

'\\xe6\\x97\\xa5\\xe6\\x9c\\x9f\\xef\\xbc\\x9a'
36
9
9

> (2) show us the source of the script that you used

def UTF8ToChnWords():
    f = open("123.txt","rb")
    content=f.read()
    print repr(content)
    print len(content)
    print content.count("\\")
    print content.count("x")
    pass

if __name__ == '__main__':
    UTF8ToChnWords()

> (3) Tell us what "doesn't work" means in this case

It doesn't work because, no matter how we deal with it, we keep getting a 36-byte string rather than a 9-byte one. Thus, we cannot get the correct answer.

>
> Cheers,
> John

Thank you very much,
higer
--
http://mail.python.org/mailman/listinfo/python-list
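
For readers following the thread, here is a minimal sketch of the decoding chain John suggested (unescape, then decode as UTF-8, then encode as GBK), applied to the 36 bytes that repr() reported above. It is written for Python 3, where the 'string-escape' codec used in the thread no longer exists; as an assumption, the 'unicode_escape' codec followed by a 'latin-1' encode is swapped in for that first step. The function name escaped_utf8_to_gbk is hypothetical and does not come from the thread.

```python
# Sketch for Python 3. On Python 2 (as used in the thread), step 1 would be
# your_data.decode('string-escape') instead.

def escaped_utf8_to_gbk(raw):
    """Convert literal backslash-x escape text (given as bytes) to text and GBK bytes.

    Hypothetical helper for illustration only.
    """
    # Step 1: turn each 4-character sequence such as \xe6 into the single
    # byte 0xE6. unicode_escape yields code point 0xE6; latin-1 maps that
    # code point back to the byte with the same value.
    utf8_bytes = raw.decode('unicode_escape').encode('latin-1')
    # Step 2: those bytes are UTF-8, so decode them to text.
    text = utf8_bytes.decode('utf-8')
    # Step 3: re-encode the text as GBK.
    return text, text.encode('gbk')


# The same 36 bytes shown by repr() in the thread: 9 backslashes, 9 'x' characters.
sample = b'\\xe6\\x97\\xa5\\xe6\\x9c\\x9f\\xef\\xbc\\x9a'

text, gbk_bytes = escaped_utf8_to_gbk(sample)
print(len(sample), sample.count(b'\\'), sample.count(b'x'))
print(repr(text), repr(gbk_bytes))
```

To apply this to a whole file rather than a short sample, the bytes returned by open(filename, 'rb').read() could be passed in place of sample, assuming the file holds only escape text of this form.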