New submission from Dan Callaghan:

Python 2.7.3 (default, Jul 24 2012, 10:05:38)
[GCC 4.7.0 20120507 (Red Hat 4.7.0-5)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> c = u'\u65e5\u672c\u8a9e'
>>> import xml.dom.minidom

Encoded as UTF-8, everything is fine:

>>> xml.dom.minidom.parseString('<?xml version="1.0" encoding="UTF-8" ?><x>%s</x>' % c.encode('UTF-8'))
<xml.dom.minidom.Document instance at 0x7f310d27dcf8>

but not ISO-2022-JP:

>>> xml.dom.minidom.parseString('<?xml version="1.0" encoding="ISO-2022-JP" ?><x>%s</x>' % c.encode('ISO-2022-JP'))
Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
  File "/usr/lib64/python2.7/site-packages/_xmlplus/dom/minidom.py", line 1925, in parseString
    return expatbuilder.parseString(string)
  File "/usr/lib64/python2.7/site-packages/_xmlplus/dom/expatbuilder.py", line 942, in parseString
    return builder.parseString(string)
  File "/usr/lib64/python2.7/site-packages/_xmlplus/dom/expatbuilder.py", line 223, in parseString
    parser.Parse(string, True)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 48

lxml can handle it fine though:

>>> import lxml.etree
>>> lxml.etree.fromstring('<?xml version="1.0" encoding="ISO-2022-JP" ?><x>%s</x>' % c.encode('ISO-2022-JP'))
<Element x at 0x7f310d284960>
>>> _.text == c
True

----------
components: XML
messages: 169974
nosy: dcallagh
priority: normal
severity: normal
status: open
title: xml.dom.minidom cannot parse ISO-2022-JP
versions: Python 2.7

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue15877>
_______________________________________