On Jan 28, 2:53 pm, glacier <[EMAIL PROTECTED]> wrote:
>
> Thanks,John.
> It's no doubt that you proved SAX didn't support GBK encoding.
> But can you give some suggestion on how to make SAX parse some GBK
> string?

Yes, the same suggestion as was given to you by others very early in
this thread, the same as I demonstrated in the middle of proving that
SAX doesn't support a GBK-encoded input file.

Suggestion: Recode your input from GBK to UTF-8. Ensure that the XML
declaration doesn't have an unsupported encoding. Your handler will get
data encoded as UTF-8. Recode that to GBK if needed.

Here's a cut-down version of the previous script, focussed on
demonstrating that the recoding strategy works.

C:\junk>type gbksax2.py
import xml.sax, xml.sax.saxutils
import cStringIO

unistr = u''.join(unichr(0x4E00+i) + unichr(ord('W')+i) for i in range(4))
gbkstr = unistr.encode('gbk')
print 'This is a GBK-encoded string: %r' % gbkstr
utf8str = gbkstr.decode('gbk').encode('utf8')
print 'Now recoded as UTF-8 to be fed to a SAX parser: %r' % utf8str
xml_template = """<?xml version="1.0" encoding="%s"?><data>%s</data>"""
utf8doc = xml_template % ('utf-8', unistr.encode('utf8'))
f = cStringIO.StringIO()
handler = xml.sax.saxutils.XMLGenerator(f, encoding='utf8')
xml.sax.parseString(utf8doc, handler)
result = f.getvalue()
f.close()
start = result.find('<data>') + 6
end = result.find('</data>')
mydata = result[start:end]
print "SAX output (UTF-8): %r" % mydata
print "SAX output recoded to GBK: %r" % mydata.decode('utf8').encode('gbk')

C:\junk>gbksax2.py
This is a GBK-encoded string: '[EMAIL PROTECTED]'
Now recoded as UTF-8 to be fed to a SAX parser: '\xe4\xb8\x80W\xe4\xb8\x81X\xe4\xb8\x82Y\xe4\xb8\x83Z'
SAX output (UTF-8): '\xe4\xb8\x80W\xe4\xb8\x81X\xe4\xb8\x82Y\xe4\xb8\x83Z'
SAX output recoded to GBK: '[EMAIL PROTECTED]'

HTH,
John
--
http://mail.python.org/mailman/listinfo/python-list
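
For readers on Python 3, the same recoding strategy might be sketched roughly as follows. This is an illustration under stated assumptions, not the script from this thread: the names TextCollector and recode_gbk_xml_to_utf8 are hypothetical, the regular expression only handles a simple XML declaration at the start of the document, and in Python 3 the handler receives str objects rather than UTF-8 bytes, so the final step is a plain encode to GBK.

```python
import re
import xml.sax
import xml.sax.handler


class TextCollector(xml.sax.handler.ContentHandler):
    """Hypothetical handler that gathers all character data it is given."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        # In Python 3, content arrives as str (already decoded).
        self.chunks.append(content)


def recode_gbk_xml_to_utf8(gbk_bytes):
    """Decode a GBK document and return it as UTF-8 bytes.

    Also rewrites the encoding named in the XML declaration (if present)
    so that it matches the new byte encoding.
    """
    text = gbk_bytes.decode('gbk')
    text = re.sub(
        r'(<\?xml[^>]*?encoding=["\'])[^"\']+(["\'])',
        r'\g<1>utf-8\g<2>',
        text,
        count=1,
    )
    return text.encode('utf-8')


# Same test characters as the script above: U+4E00..U+4E03 mixed with W..Z.
unistr = ''.join(chr(0x4E00 + i) + chr(ord('W') + i) for i in range(4))
gbk_doc = ('<?xml version="1.0" encoding="gbk"?><data>%s</data>'
           % unistr).encode('gbk')

# Step 1: recode the input and fix the declaration.
utf8_doc = recode_gbk_xml_to_utf8(gbk_doc)

# Step 2: parse the UTF-8 document with SAX.
handler = TextCollector()
xml.sax.parseString(utf8_doc, handler)
parsed = ''.join(handler.chunks)

# Step 3: recode the parsed text to GBK if the rest of the program needs it.
gbk_result = parsed.encode('gbk')
```

Doing the recoding before the parser sees the data means the parser never has to know about GBK at all; only the first and last steps touch the legacy encoding.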