On Jan 28, 5:50 am, John Machin <[EMAIL PROTECTED]> wrote:
> On Jan 28, 7:47 am, "Mark Tolonen" <[EMAIL PROTECTED]>
> wrote:
>
> > >"John Machin" <[EMAIL PROTECTED]> wrote in message
> > >news:[EMAIL PROTECTED]
> > >On Jan 27, 9:17 pm, glacier <[EMAIL PROTECTED]> wrote:
> > >> On Jan 24, 3:29 pm, "Gabriel Genellina" <[EMAIL PROTECTED]>
> > >> wrote:
>
> > >*IF* the file is well-formed GBK, then the codec will not mess up when
> > >decoding it to Unicode. The usual cause of mess is a combination of a
> > >human and a text editor :-)
>
> > SAX uses the expat parser. From the pyexpat module docs:
>
> > Expat doesn't support as many encodings as Python does, and its repertoire
> > of encodings can't be extended; it supports UTF-8, UTF-16, ISO-8859-1
> > (Latin1), and ASCII. If encoding is given it will override the implicit or
> > explicit encoding of the document.
>
> > --Mark
>
> Thank you for pointing out where that list of encodings had been
> cunningly concealed. However the relevance of dropping it in as an
> apparent response to my answer to the OP's question about decoding
> possibly butchered GBK strings is .... what?
>
> In any case, it seems to support other 8-bit encodings e.g. iso-8859-2
> and koi8-r ...
>
> C:\junk>type gbksax.py
> import xml.sax, xml.sax.saxutils
> import cStringIO
>
> unistr = u''.join(unichr(0x4E00+i) + unichr(ord('W')+i) for i in range(4))
> print 'unistr=%r' % unistr
> gbkstr = unistr.encode('gbk')
> print 'gbkstr=%r' % gbkstr
> unistr2 = gbkstr.decode('gbk')
> assert unistr2 == unistr
>
> print "latin1 FF -> utf8 = %r" % '\xff'.decode('iso-8859-1').encode('utf8')
> print "latin2 FF -> utf8 = %r" % '\xff'.decode('iso-8859-2').encode('utf8')
> print "koi8r FF -> utf8 = %r" % '\xff'.decode('koi8-r').encode('utf8')
>
> xml_template = """<?xml version="1.0" encoding="%s"?><data>%s</data>"""
>
> asciidoc = xml_template % ('ascii', 'The quick brown fox etc')
> utf8doc = xml_template % ('utf-8', unistr.encode('utf8'))
> latin1doc = xml_template % ('iso-8859-1', 'nil illegitimati carborundum' + '\xff')
> latin2doc = xml_template % ('iso-8859-2', 'duo secundus' + '\xff')
> koi8rdoc = xml_template % ('koi8-r', 'Moskva' + '\xff')
> gbkdoc = xml_template % ('gbk', gbkstr)
>
> for doc in (asciidoc, utf8doc, latin1doc, latin2doc, koi8rdoc, gbkdoc):
>     f = cStringIO.StringIO()
>     handler = xml.sax.saxutils.XMLGenerator(f, encoding='utf8')
>     xml.sax.parseString(doc, handler)
>     result = f.getvalue()
>     f.close
>     print repr(result[result.find('<data>'):])
>
> C:\junk>gbksax.py
> unistr=u'\u4e00W\u4e01X\u4e02Y\u4e03Z'
> gbkstr='[EMAIL PROTECTED]'
> latin1 FF -> utf8 = '\xc3\xbf'
> latin2 FF -> utf8 = '\xcb\x99'
> koi8r FF -> utf8 = '\xd0\xaa'
> '<data>The quick brown fox etc</data>'
> '<data>\xe4\xb8\x80W\xe4\xb8\x81X\xe4\xb8\x82Y\xe4\xb8\x83Z</data>'
> '<data>nil illegitimati carborundum\xc3\xbf</data>'
> '<data>duo secundus\xcb\x99</data>'
> '<data>Moskva\xd0\xaa</data>'
> Traceback (most recent call last):
>   File "C:\junk\gbksax.py", line 27, in <module>
>     xml.sax.parseString(doc, handler)
>   File "C:\Python25\lib\xml\sax\__init__.py", line 49, in parseString
>     parser.parse(inpsrc)
>   File "C:\Python25\lib\xml\sax\expatreader.py", line 107, in parse
>     xmlreader.IncrementalParser.parse(self, source)
>   File "C:\Python25\lib\xml\sax\xmlreader.py", line 123, in parse
>     self.feed(buffer)
>   File "C:\Python25\lib\xml\sax\expatreader.py", line 211, in feed
>     self._err_handler.fatalError(exc)
>   File "C:\Python25\lib\xml\sax\handler.py", line 38, in fatalError
>     raise exception
> xml.sax._exceptions.SAXParseException: <unknown>:1:30: unknown encoding
>
> C:\junk>

Thanks, John. There's no doubt you have shown that SAX doesn't support the
GBK encoding. But could you suggest how to get SAX to parse a GBK string?

--
http://mail.python.org/mailman/listinfo/python-list
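
One workaround that may fit the question above, offered as a sketch under assumptions and not as anything confirmed in this thread: since Python's own gbk codec round-trips the test string in the quoted script, the document can be decoded in Python first, its XML declaration rewritten to say utf-8, and the re-encoded bytes handed to SAX. The names transcode_to_utf8, TextCollector and parse_text are hypothetical helpers, and the code is written for Python 3, whereas the quoted script targets Python 2.5.

```python
import re
import xml.sax

# Matches the encoding pseudo-attribute inside a leading XML declaration.
_DECL_ENCODING = re.compile(r'^(<\?xml[^>]*?encoding\s*=\s*)(["\'])[^"\']*\2')


def transcode_to_utf8(raw, source_encoding="gbk"):
    """Return UTF-8 bytes for an XML document given as bytes in source_encoding."""
    # Python's codec does the decoding here, so expat never sees GBK.
    text = raw.decode(source_encoding)
    # Rewrite the declaration so it agrees with the bytes handed to the parser.
    text = _DECL_ENCODING.sub(r"\g<1>\g<2>utf-8\g<2>", text, count=1)
    return text.encode("utf-8")


class TextCollector(xml.sax.ContentHandler):
    """Hypothetical handler that only gathers character data."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


def parse_text(raw, source_encoding="gbk"):
    """Parse a document in source_encoding and return its character data."""
    handler = TextCollector()
    xml.sax.parseString(transcode_to_utf8(raw, source_encoding), handler)
    return "".join(handler.chunks)


# Same test string as the quoted gbksax.py script, built the Python 3 way.
unistr = "".join(chr(0x4E00 + i) + chr(ord("W") + i) for i in range(4))
gbkdoc = ('<?xml version="1.0" encoding="gbk"?><data>%s</data>' % unistr).encode("gbk")
assert parse_text(gbkdoc) == unistr
```

This assumes the whole document fits in memory and really is well-formed GBK; a badly edited file would still fail at the decode step, before the parser is involved.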