On Jan 28, 7:47 am, "Mark Tolonen" <[EMAIL PROTECTED]> wrote:
> >"John Machin" <[EMAIL PROTECTED]> wrote in message
> >news:[EMAIL PROTECTED]
> >On Jan 27, 9:17 pm, glacier <[EMAIL PROTECTED]> wrote:
> >> On Jan 24, 3:29 pm, "Gabriel Genellina" <[EMAIL PROTECTED]>
> >> wrote:
>
> >*IF* the file is well-formed GBK, then the codec will not mess up when
> >decoding it to Unicode. The usual cause of mess is a combination of a
> >human and a text editor :-)
>
> SAX uses the expat parser. From the pyexpat module docs:
>
> Expat doesn't support as many encodings as Python does, and its repertoire
> of encodings can't be extended; it supports UTF-8, UTF-16, ISO-8859-1
> (Latin1), and ASCII. If encoding is given it will override the implicit or
> explicit encoding of the document.
>
> --Mark

Thank you for pointing out where that list of encodings had been
cunningly concealed. However, the relevance of dropping it in as an
apparent response to my answer to the OP's question about decoding
possibly butchered GBK strings is .... what?

In any case, expat seems to support other 8-bit encodings, e.g.
iso-8859-2 and koi8-r ...

C:\junk>type gbksax.py
import xml.sax, xml.sax.saxutils
import cStringIO

unistr = u''.join(unichr(0x4E00+i) + unichr(ord('W')+i) for i in range(4))
print 'unistr=%r' % unistr
gbkstr = unistr.encode('gbk')
print 'gbkstr=%r' % gbkstr
unistr2 = gbkstr.decode('gbk')
assert unistr2 == unistr
print "latin1 FF -> utf8 = %r" % '\xff'.decode('iso-8859-1').encode('utf8')
print "latin2 FF -> utf8 = %r" % '\xff'.decode('iso-8859-2').encode('utf8')
print "koi8r FF -> utf8 = %r" % '\xff'.decode('koi8-r').encode('utf8')

xml_template = """<?xml version="1.0" encoding="%s"?><data>%s</data>"""

asciidoc = xml_template % ('ascii', 'The quick brown fox etc')
utf8doc = xml_template % ('utf-8', unistr.encode('utf8'))
latin1doc = xml_template % ('iso-8859-1', 'nil illegitimati carborundum' + '\xff')
latin2doc = xml_template % ('iso-8859-2', 'duo secundus' + '\xff')
koi8rdoc = xml_template % ('koi8-r', 'Moskva' + '\xff')
gbkdoc = xml_template % ('gbk', gbkstr)

# Push each document through SAX/expat and re-serialize it as UTF-8.
for doc in (asciidoc, utf8doc, latin1doc, latin2doc, koi8rdoc, gbkdoc):
    f = cStringIO.StringIO()
    handler = xml.sax.saxutils.XMLGenerator(f, encoding='utf8')
    xml.sax.parseString(doc, handler)
    result = f.getvalue()
    f.close()
    print repr(result[result.find('<data>'):])

C:\junk>gbksax.py
unistr=u'\u4e00W\u4e01X\u4e02Y\u4e03Z'
gbkstr='[EMAIL PROTECTED]'
latin1 FF -> utf8 = '\xc3\xbf'
latin2 FF -> utf8 = '\xcb\x99'
koi8r FF -> utf8 = '\xd0\xaa'
'<data>The quick brown fox etc</data>'
'<data>\xe4\xb8\x80W\xe4\xb8\x81X\xe4\xb8\x82Y\xe4\xb8\x83Z</data>'
'<data>nil illegitimati carborundum\xc3\xbf</data>'
'<data>duo secundus\xcb\x99</data>'
'<data>Moskva\xd0\xaa</data>'
Traceback (most recent call last):
  File "C:\junk\gbksax.py", line 27, in <module>
    xml.sax.parseString(doc, handler)
  File "C:\Python25\lib\xml\sax\__init__.py", line 49, in parseString
    parser.parse(inpsrc)
  File "C:\Python25\lib\xml\sax\expatreader.py", line 107, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "C:\Python25\lib\xml\sax\xmlreader.py", line 123, in parse
    self.feed(buffer)
  File "C:\Python25\lib\xml\sax\expatreader.py", line 211, in feed
    self._err_handler.fatalError(exc)
  File "C:\Python25\lib\xml\sax\handler.py", line 38, in fatalError
    raise exception
xml.sax._exceptions.SAXParseException: <unknown>:1:30: unknown encoding

C:\junk>

-- 
http://mail.python.org/mailman/listinfo/python-list
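
One possible way around the "unknown encoding" failure is to not hand
expat the GBK bytes at all: decode them in Python, re-encode as UTF-8,
and adjust the XML declaration to match before parsing. What follows is
only a sketch under stated assumptions. It is written for Python 3 rather
than the Python 2.5 used in the session above, transcode_to_utf8 is a
hypothetical helper (not part of xml.sax), and the regex assumes the
declaration carries a plain quoted encoding= attribute.

```python
import io
import re
import xml.sax
from xml.sax.saxutils import XMLGenerator

# Matches the encoding="..." part of an XML declaration (assumed simple form).
_DECL_ENCODING = re.compile(r"""(<\?xml[^>]*?encoding=)(["'])[^"']*\2""")


def transcode_to_utf8(raw, source_encoding):
    """Hypothetical helper: return `raw` re-encoded as UTF-8 bytes, with the
    encoding named in the XML declaration rewritten to utf-8."""
    # Raises UnicodeDecodeError if the bytes are not valid in source_encoding.
    text = raw.decode(source_encoding)
    text = _DECL_ENCODING.sub(r"\g<1>\g<2>utf-8\g<2>", text, count=1)
    return text.encode("utf-8")


# Same test string as in gbksax.py: four CJK characters interleaved with W..Z.
unistr = "".join(chr(0x4E00 + i) + chr(ord("W") + i) for i in range(4))
gbkdoc = ('<?xml version="1.0" encoding="gbk"?><data>%s</data>' % unistr).encode("gbk")

utf8doc = transcode_to_utf8(gbkdoc, "gbk")

# Parse the transcoded bytes and re-serialize them, as the original loop does.
out = io.StringIO()
xml.sax.parseString(utf8doc, XMLGenerator(out, encoding="utf-8"))
result = out.getvalue()
data = result[result.find("<data>"):]
print(ascii(data))
```

If the input is not well-formed GBK, the decode step raises
UnicodeDecodeError before expat ever sees the data, which at least shows
where the mess was made.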