"higer" <higerinbeij...@gmail.com> wrote in message
news:0c786326-1651-42c8-ba39-4679f3558...@r13g2000vbr.googlegroups.com...
On Jun 7, 11:25 pm, John Machin <sjmac...@lexicon.net> wrote:
On Jun 7, 10:55 pm, higer <higerinbeij...@gmail.com> wrote:
> My file contains such strings :
> \xe6\x97\xa5\xe6\x9c\x9f\xef\xbc\x9a
Are you sure? Does that occupy 9 bytes in your file or 36 bytes?
It was saved in a file, so it occupies 36 bytes. If I just put this
string in a variable, it certainly gives the correct result, but how
do I get the right answer when reading it from a file?
Did you create this file? If it is 36 characters, it contains literal
backslash characters, not the 9 bytes that would correctly encode as UTF-8.
If you created the file yourself, show us the code.
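To make the 9-versus-36 distinction concrete, here is a minimal sketch. It is written for Python 3 (an assumption; the thread itself uses Python 2), and the names real_utf8 and escaped_text are made up for illustration.

```python
# Python 3 sketch (assumption: the thread itself runs Python 2.5).

# Nine raw bytes: the actual UTF-8 encoding of the three characters.
real_utf8 = b"\xe6\x97\xa5\xe6\x9c\x9f\xef\xbc\x9a"

# Thirty-six ASCII characters: a literal backslash, 'x', and two hex
# digits, repeated nine times. The raw-bytes prefix (rb) keeps the
# backslashes as they are instead of interpreting them.
escaped_text = rb"\xe6\x97\xa5\xe6\x9c\x9f\xef\xbc\x9a"

print(len(real_utf8))             # number of bytes in the real encoding
print(len(escaped_text))          # number of bytes in the escaped text
print(escaped_text.count(b"\\"))  # literal backslashes in the escaped text

# Only the 9-byte form decodes directly as UTF-8.
text = real_utf8.decode("utf-8")
print(ascii(text))                # printed as escapes to avoid console issues
```

A file holding the second form has to have its escapes interpreted before any UTF-8 decoding can succeed.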
> I want to read the content of this file and convert it to the
> corresponding GBK code, a Chinese character encoding.
> Every time I tried to convert it, the output was the same no
> matter which method I used.
> It seems that when Python reads it, Python takes '\' as an
> ordinary character, so the string ends up represented as
> "\\xe6\\x97\\xa5\\xe6\\x9c\\x9f\\xef\\xbc\\x9a"; the "\" is then output
> 'correctly', but that's not what I want.
> Can anyone help me?
try this:
utf8_data = your_data.decode('string-escape')
unicode_data = utf8_data.decode('utf8')
# unicode derived from your sample looks like this 日期:
# is that what you expected?
You are right, the result is 日期, which is just what I expected. If you
save the string in a variable, you can surely get the correct result.
But it is just a sample, so I gave a short string; what if there are
many characters in a file?
gbk_data = unicode_data.encode('gbk')
I have tried the method you just told me, but unfortunately it
does not work (mess code).
How are you determining this is 'mess code'? How are you viewing the
result? You'll need to use a viewer that understands GBK encoding, such as
Notepad on Chinese Windows.
If that "doesn't work", do three things:
(1) give us some unambiguous hard evidence about the contents of your
data:
e.g. # assuming Python 2.x
My Python version is 2.5.2
your_data = open('your_file.txt', 'rb').read(36)
print repr(your_data)
print len(your_data)
print your_data.count('\\')
print your_data.count('x')
The result is:
'\\xe6\\x97\\xa5\\xe6\\x9c\\x9f\\xef\\xbc\\x9a'
36
9
9
(2) show us the source of the script that you used
def UTF8ToChnWords():
    f = open("123.txt","rb")
    content=f.read()
    print repr(content)
    print len(content)
    print content.count("\\")
    print content.count("x")
Try:
utf8data = content.decode('string-escape')
unicodedata = utf8data.decode('utf8')
gbkdata = unicodedata.encode('gbk')
print len(gbkdata),repr(gbkdata)
open("456.txt","wb").write(gbkdata)
The print should give:
6 '\xc8\xd5\xc6\xda\xa3\xba'
This is correct for GBK encoding. 456.txt should contain the 6 bytes of GBK
data. View the file with a program that understands GBK encoding.
-Mark
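
For readers on Python 3, where the 'string-escape' codec used above no longer exists, the same steps can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name escaped_utf8_to_gbk is hypothetical, and the 'unicode_escape' plus 'latin-1' round trip stands in for 'string-escape'. It assumes the data contains only \xNN style escapes and plain ASCII.

```python
# Python 3 sketch of the Python 2 steps in this thread.
# Assumption: the input holds only ASCII text and \xNN escapes.

def escaped_utf8_to_gbk(escaped):
    """Convert bytes holding literal \\xNN escapes of UTF-8 data to GBK bytes.

    The function name is hypothetical, chosen for this example.
    """
    # Interpret each literal "\xNN" as the code point U+00NN, then map
    # those code points back to single bytes via Latin-1. This plays the
    # role that decode('string-escape') plays in Python 2.
    utf8_bytes = escaped.decode("unicode_escape").encode("latin-1")
    # The recovered bytes are real UTF-8, so they decode to text.
    text = utf8_bytes.decode("utf-8")
    # Re-encode the text as GBK.
    return text.encode("gbk")


sample = rb"\xe6\x97\xa5\xe6\x9c\x9f\xef\xbc\x9a"  # the 36 bytes from the file
gbk_data = escaped_utf8_to_gbk(sample)
print(len(gbk_data), repr(gbk_data))

# A check that needs no GBK-aware viewer: decoding the result as GBK
# should give back the same text as decoding the original UTF-8 bytes.
round_trip = gbk_data.decode("gbk")
print(ascii(round_trip))
```

The function takes the whole file content at once, so the length of the file does not matter; the content can be read with open("123.txt", "rb").read() and the result written with open("456.txt", "wb").write(...), as in the script above.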