Re: Some questions about decode/encode

glacier Sun, 27 Jan 2008 19:56:46 -0800

On Jan 28, 5:50 am, John Machin <[EMAIL PROTECTED]> wrote:
> On Jan 28, 7:47 am, "Mark Tolonen" <[EMAIL PROTECTED]>
> wrote:
>
> > >"John Machin" <[EMAIL PROTECTED]> wrote in message
> > >news:[EMAIL PROTECTED]
> > >On Jan 27, 9:17 pm, glacier <[EMAIL PROTECTED]> wrote:
> > >> On Jan 24, 3:29 pm, "Gabriel Genellina" <[EMAIL PROTECTED]>
> > >> wrote:
>
> > >*IF* the file is well-formed GBK, then the codec will not mess up when
> > >decoding it to Unicode. The usual cause of mess is a combination of a
> > >human and a text editor :-)
>
> > SAX uses the expat parser.  From the pyexpat module docs:
>
> > Expat doesn't support as many encodings as Python does, and its repertoire
> > of encodings can't be extended; it supports UTF-8, UTF-16, ISO-8859-1
> > (Latin1), and ASCII. If encoding is given it will override the implicit or
> > explicit encoding of the document.
>
> > --Mark
>
> Thank you for pointing out where that list of encodings had been
> cunningly concealed. However the relevance of dropping it in as an
> apparent response to my answer to the OP's question about decoding
> possibly butchered GBK strings is .... what?
>
> In any case, it seems to support other 8-bit encodings e.g. iso-8859-2
> and koi8-r ...
>
> C:\junk>type gbksax.py
> import xml.sax, xml.sax.saxutils
> import cStringIO
>
> unistr = u''.join(unichr(0x4E00+i) + unichr(ord('W')+i) for i in range(4))
> print 'unistr=%r' % unistr
> gbkstr = unistr.encode('gbk')
> print 'gbkstr=%r' % gbkstr
> unistr2 = gbkstr.decode('gbk')
> assert unistr2 == unistr
>
> print "latin1 FF -> utf8 = %r" % '\xff'.decode('iso-8859-1').encode('utf8')
> print "latin2 FF -> utf8 = %r" % '\xff'.decode('iso-8859-2').encode('utf8')
> print "koi8r FF -> utf8 = %r" % '\xff'.decode('koi8-r').encode('utf8')
>
> xml_template = """<?xml version="1.0" encoding="%s"?><data>%s</data>"""
>
> asciidoc = xml_template % ('ascii', 'The quick brown fox etc')
> utf8doc = xml_template % ('utf-8', unistr.encode('utf8'))
> latin1doc = xml_template % ('iso-8859-1', 'nil illegitimati carborundum' + '\xff')
> latin2doc = xml_template % ('iso-8859-2', 'duo secundus' + '\xff')
> koi8rdoc = xml_template % ('koi8-r', 'Moskva' + '\xff')
> gbkdoc = xml_template % ('gbk', gbkstr)
>
> for doc in (asciidoc, utf8doc, latin1doc, latin2doc, koi8rdoc, gbkdoc):
>     f = cStringIO.StringIO()
>     handler = xml.sax.saxutils.XMLGenerator(f, encoding='utf8')
>     xml.sax.parseString(doc, handler)
>     result = f.getvalue()
>     f.close()
>     print repr(result[result.find('<data>'):])
>
> C:\junk>gbksax.py
> unistr=u'\u4e00W\u4e01X\u4e02Y\u4e03Z'
> gbkstr='[EMAIL PROTECTED]'
> latin1 FF -> utf8 = '\xc3\xbf'
> latin2 FF -> utf8 = '\xcb\x99'
> koi8r FF -> utf8 = '\xd0\xaa'
> '<data>The quick brown fox etc</data>'
> '<data>\xe4\xb8\x80W\xe4\xb8\x81X\xe4\xb8\x82Y\xe4\xb8\x83Z</data>'
> '<data>nil illegitimati carborundum\xc3\xbf</data>'
> '<data>duo secundus\xcb\x99</data>'
> '<data>Moskva\xd0\xaa</data>'
> Traceback (most recent call last):
>   File "C:\junk\gbksax.py", line 27, in <module>
>     xml.sax.parseString(doc, handler)
>   File "C:\Python25\lib\xml\sax\__init__.py", line 49, in parseString
>     parser.parse(inpsrc)
>   File "C:\Python25\lib\xml\sax\expatreader.py", line 107, in parse
>     xmlreader.IncrementalParser.parse(self, source)
>   File "C:\Python25\lib\xml\sax\xmlreader.py", line 123, in parse
>     self.feed(buffer)
>   File "C:\Python25\lib\xml\sax\expatreader.py", line 211, in feed
>     self._err_handler.fatalError(exc)
>   File "C:\Python25\lib\xml\sax\handler.py", line 38, in fatalError
>     raise exception
> xml.sax._exceptions.SAXParseException: <unknown>:1:30: unknown encoding
>
> C:\junk>


Thanks, John.
There's no doubt you have shown that SAX doesn't support the GBK encoding.
But can you suggest how to make SAX parse a GBK string?
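
The only workaround I can think of is to transcode the document to UTF-8 before handing it to SAX, since UTF-8 is on expat's list. Below is a minimal sketch of that idea, not a confirmed answer from this thread. It assumes the input really is well-formed GBK, it is written for Python 3 rather than the 2.5 used above, and the helper names transcode_to_utf8 and parse_gbk are my own inventions. The regular expression only handles a simple XML declaration at the very start of the document.

```python
import io
import re
import xml.sax
import xml.sax.saxutils

# Matches the encoding pseudo-attribute of an XML declaration at the very
# start of the document (simplified; hypothetical helper pattern).
_DECL_ENCODING = re.compile(
    r'''^(<\?xml[^>]*?encoding\s*=\s*)(["'])[^"']*\2''')


def transcode_to_utf8(raw, source_encoding='gbk'):
    """Decode raw bytes from source_encoding and return UTF-8 bytes
    whose XML declaration names utf-8 instead."""
    # Raises UnicodeDecodeError if the input is not well-formed.
    text = raw.decode(source_encoding)
    text = _DECL_ENCODING.sub(r'\g<1>\g<2>utf-8\g<2>', text, count=1)
    return text.encode('utf-8')


def parse_gbk(raw):
    """Run the transcoded document through SAX and return the
    regenerated XML as text."""
    out = io.StringIO()
    handler = xml.sax.saxutils.XMLGenerator(out)
    xml.sax.parseString(transcode_to_utf8(raw), handler)
    return out.getvalue()


# Same test data as John's script: four CJK characters interleaved with W..Z.
unistr = ''.join(chr(0x4E00 + i) + chr(ord('W') + i) for i in range(4))
template = '<?xml version="1.0" encoding="%s"?><data>%s</data>'
gbkdoc = (template % ('gbk', unistr)).encode('gbk')

result = parse_gbk(gbkdoc)
# ascii() keeps the output readable on consoles that cannot show CJK.
print(ascii(result[result.find('<data>'):]))
```

The declaration has to be rewritten as well as the bytes, because otherwise expat would still see encoding="gbk" and give up the same way. Is there a cleaner way than this?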
-- 
http://mail.python.org/mailman/listinfo/python-list
