Bugs item #1074200, was opened at 2004-11-27 14:58
Message generated for change (Comment added) made by effbot
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1074200&group_id=5470

Category: Unicode
Group: Python 2.3
>Status: Closed
>Resolution: Wont Fix
Priority: 5
Submitted By: Peer Janssen (peerjanssen)
Assigned to: Nobody/Anonymous (nobody)
Summary: xml.dom.minidom produces errors with certain unicode chars

Initial Comment:
(note: I tried to file this before, but it didn't show up in
the list, so I try again.)

In a XML document generated by Trados Translators Workbench
(a TMX V 1.1 Translation Memory), the Unicode characters
U+0001 ("START OF HEADING", see
http://www.fileformat.info/info/unicode/char/0001/index.htm)
and SINGLE LOW-9 QUOTATION MARK (U+201A, see
http://www.fileformat.info/info/unicode/char/201a/index.htm)
produce errors when parsing it from a file with
"xml.dom.minidom".

The first one (0001) produces this output:

Traceback (most recent call last):
  File "G:\_Prog\TMworks\domtree.py", line 7, in ?
    dom=parse(tm)
  File "C:\Python23\lib\xml\dom\minidom.py", line 1919, in parse
    return expatbuilder.parse(file)
  File "C:\Python23\lib\xml\dom\expatbuilder.py", line 928, in parse
    result = builder.parseFile(file)
  File "C:\Python23\lib\xml\dom\expatbuilder.py", line 207, in parseFile
    parser.Parse(buffer, 0)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 420, column 106

The second one (201A) produces this output:

Traceback (most recent call last):
  File "G:\_Prog\TMworks\domtree.py", line 7, in ?
    dom=parse(tm)
  File "C:\Python23\lib\xml\dom\minidom.py", line 1919, in parse
    return expatbuilder.parse(file)
  File "C:\Python23\lib\xml\dom\expatbuilder.py", line 928, in parse
    result = builder.parseFile(file)
  File "C:\Python23\lib\xml\dom\expatbuilder.py", line 207, in parseFile
    parser.Parse(buffer, 0)
xml.parsers.expat.ExpatError: mismatched tag: line 624, column 2

Deleting these two characters in the whole document produces
the desired result.

I don't see why these characters should be of any problem,
especially the quotation mark.

----------------------------------------------------------------------

>Comment By: Fredrik Lundh (effbot)
Date: 2004-12-03 12:29

Message:
Logged In: YES
user_id=38376

Closing; see leogah's reply for background.

----------------------------------------------------------------------

Comment By: Richard Brodie (leogah)
Date: 2004-12-03 01:37

Message:
Logged In: YES
user_id=356893

I don't think there are any bugs here: at least not Python ones.

U+0001 (SOH) isn't an allowed character in XML 1.0:
http://www.w3.org/International/questions/qa-controls

U+201A (SINGLE LOW-9 QUOTATION MARK) should be fine, except
that \x1A is converted to EOF on Windows; then expat chokes on
all the unclosed tags. Open the file 'rb'.

RB.

----------------------------------------------------------------------

Comment By: Peer Janssen (peerjanssen)
Date: 2004-11-27 15:29

Message:
Logged In: YES
user_id=896722

The file.

----------------------------------------------------------------------

Comment By: Peer Janssen (peerjanssen)
Date: 2004-11-27 15:27

Message:
Logged In: YES
user_id=896722

Here is a zip file with a test program domtree.py and two
test files.

I noticed that the first test file produces its bug only on
my windows box, but the second test file produces an error on
both my windows and my linux box.

The windows python version is:
Python 2.3.4 (#53, May 25 2004, 21:17:02) [MSC v.1200 32 bit (Intel)] on win32

The linux python version is:
Python 2.3.3. (#2, Feb 17, 2004, 11:45:40) [GCC 3.3.2 (Mandrake Linux 10.0 3.3.2-6mdk)] on linux2

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2004-11-27 15:02

Message:
Logged In: YES
user_id=38388

Please provide an example that lets us reproduce the error.

Unassigning, since I'm not an expert for minidom.

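
The two points in Richard Brodie's reply can be sketched roughly as follows. This is only an illustration, not the code from the attached domtree.py: the helper names (strip_illegal_xml_chars, parse_binary, is_well_formed) are made up for this example, it is written for a current Python 3 rather than the 2.3 interpreter in the report, and it assumes the TMX file is UTF-16 encoded, which would explain how U+201A puts a 0x1A byte into the file.

```python
import re
from xml.dom.minidom import parse, parseString
from xml.parsers.expat import ExpatError

# C0 control characters other than tab, LF and CR, plus U+FFFE and U+FFFF.
# XML 1.0 does not allow these characters in a document.
_ILLEGAL_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ufffe\uffff]")


def strip_illegal_xml_chars(text):
    """Remove characters that XML 1.0 forbids (hypothetical helper).

    This silently drops data, so it is a workaround for broken input,
    not a fix for whatever tool produced the file.
    """
    return _ILLEGAL_XML_CHARS.sub("", text)


def parse_binary(path):
    """Parse an XML file opened in binary mode (hypothetical helper).

    In 'rb' mode every byte reaches expat unchanged, including 0x1A,
    and expat works out the encoding from the BOM or XML declaration.
    """
    with open(path, "rb") as f:
        return parse(f)


def is_well_formed(text):
    """Report whether expat accepts text as a complete XML document."""
    try:
        parseString(text)
    except ExpatError:
        return False
    return True
```

With these definitions, is_well_formed("<seg>a\x01b</seg>") should be false and become true once the string has gone through strip_illegal_xml_chars, while U+201A is left alone by the filter because it is a legal XML character. A file that contains U+201A in UTF-16 can then be read with parse_binary(path) instead of handing minidom a file object opened in text mode.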