I use Chinese characters as an example here.

>>> s1 = '你好吗'
>>> repr(s1)
"'\\xc4\\xe3\\xba\\xc3\\xc2\\xf0'"
>>> b1 = s1.decode('GBK')

My first question is: what strategy does 'decode' use to work out where one character ends and the next begins? Since s1 is a multi-byte character string, how does it decide whether the next character takes 2 bytes or 1 byte?

My second question is: has anyone tested MBCS decoding on very long input? Yesterday I tried to decode a long (20+ MB) XML file, and the result was strange enough that SAX failed to parse the decoded string. However, when I used a text editor to convert the file to UTF-8, SAX parsed the content successfully. I'm not sure whether some particular byte sequence or the sheer length of the text caused this problem. Or could it be a bug in Python 2.5?
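
To make the first question concrete, here is a rough sketch of how a GBK-style decoder could pick character boundaries. It is not CPython's actual implementation. `split_gbk` is a hypothetical helper, and it assumes well-formed input: a byte below 0x80 is a single ASCII character, and a byte in the range 0x81-0xFE is a lead byte that takes the following byte with it (the trail byte may itself fall in the ASCII range, so bytes cannot be classified independently). The second part decodes the same bytes in two pieces with `codecs.getincrementaldecoder`, which might be a way to check whether a long input misbehaves only when a character is split across chunks. The `b''` literals and `bytearray` need Python 2.6 or later.

```python
import codecs

# The GBK bytes of s1 from the example, written as escapes.
raw = b'\xc4\xe3\xba\xc3\xc2\xf0'


def split_gbk(data):
    """Hypothetical helper: split well-formed GBK bytes per character."""
    data = bytearray(data)
    chunks = []
    i = 0
    while i < len(data):
        # Lead bytes 0x81-0xFE start a two-byte character;
        # anything else here is treated as one ASCII byte.
        width = 2 if 0x81 <= data[i] <= 0xFE else 1
        chunks.append(bytes(data[i:i + width]))
        i += width
    return chunks


# Mixed input: three two-byte characters followed by two ASCII bytes.
chunks = split_gbk(raw + b'ok')
decoded = [c.decode('gbk') for c in chunks]

# Incremental decoding: the first piece deliberately ends in the
# middle of the second character, and the decoder keeps the
# dangling lead byte until the next call.
decoder = codecs.getincrementaldecoder('gbk')()
pieces = [decoder.decode(raw[:3]), decoder.decode(raw[3:], final=True)]
streamed = u''.join(pieces)
```

If `streamed` matches `raw.decode('gbk')` for the real file as well, the length of the input is probably not the cause, and a stray invalid byte sequence becomes the more likely suspect.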