>> Is there any way to solve this better?
>> I mean if I shouldn't convert the GBK string to unicode string, what
>> should I do to make SAX work?
>
> Decode it and then encode it to utf-8 before feeding it to the parser.

The tricky part is that you also need to change the encoding
declaration in doing so, but in this case, it should be fairly simple:

    unicode_doc = original_doc.decode("gbk")
    unicode_doc = unicode_doc.replace('gbk','utf-8',1)
    utf8_doc = unicode_doc.encode("utf-8")

This assumes that the string "gbk" occurs in the encoding declaration
as

    <?xml version="1.0" encoding="gbk"?>

If the encoding name has a different spelling (e.g. GBK), you need to
cater for that as well. You might want to try replacing the entire
XML declaration (i.e. everything between <? and ?>), or just the
encoding= parameter. Notice that the encoding declaration may include
' instead of ", and may have additional spaces, e.g.

    <?xml version = '1.0' encoding= 'gbK' ?>

HTH,
Martin
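
As a rough sketch of the "replace just the encoding= parameter" option mentioned above, a regular expression can cope with different spellings, either quote style, and extra spaces. This is only an illustration, not a definitive implementation: the names recode_to_utf8 and _ENCODING_RE are made up for this example, and it is written for Python 3, where original_doc is assumed to be a bytes object (the snippet in the message uses Python 2 string semantics). It also assumes the XML declaration, if present, sits at the very start of the document.

```python
import re

# Hypothetical helper pattern: matches the encoding pseudo-attribute of an
# XML declaration at the very start of the document. It accepts ' or " as
# the quote character and optional whitespace around "=". The name part
# follows the EncName shape: a letter, then letters, digits, ".", "_", "-".
_ENCODING_RE = re.compile(
    r"""\A(<\?xml\b[^>]*?\bencoding\s*=\s*)(["'])[A-Za-z][A-Za-z0-9._-]*\2"""
)


def recode_to_utf8(original_doc, source_encoding="gbk"):
    """Return original_doc (bytes) re-encoded as UTF-8 bytes.

    The encoding name in the XML declaration, whatever its spelling,
    is rewritten to "utf-8". Text after the declaration is left alone,
    and a document without a declaration is only re-encoded.
    """
    unicode_doc = original_doc.decode(source_encoding)
    # Keep the prefix (group 1) and the original quote character (group 2);
    # swap only the encoding name.
    unicode_doc = _ENCODING_RE.sub(r"\1\2utf-8\2", unicode_doc, count=1)
    return unicode_doc.encode("utf-8")
```

The resulting bytes can then be handed to the parser, for instance with xml.sax.parseString(utf8_doc, handler), where handler is your own ContentHandler. Unlike a plain replace('gbk', 'utf-8', 1), the pattern is anchored to the declaration, so the string "gbk" appearing elsewhere in the document is not touched.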