On May 5, 12:11 am, "Barak, Ron" <ron.ba...@lsi.com> wrote:
> > -----Original Message-----
> > From: Stefan Behnel [mailto:stefan...@behnel.de]
> > Sent: Tuesday, May 04, 2010 10:24 AM
> > To: python-l...@python.org
> > Subject: Re: How to get xml.etree.ElementTree not bomb on
> > invalid characters in XML file ?
> >
> > Barak, Ron, 04.05.2010 09:01:
> > > I'm parsing XML files using ElementTree from xml.etree (see code
> > > below (and attached xml_parse_example.py)).
> > >
> > > However, I'm coming across input XML files (attached an example:
> > > tmp.xml) which include invalid characters, that produce the
> > > following traceback:
> > >
> > > $ python xml_parse_example.py
> > > Traceback (most recent call last):
> > > xml.parsers.expat.ExpatError: not well-formed (invalid token): line 6, column 34
> >
> > I hope you are aware that this means that the input you are
> > parsing is not XML. It's best to reject the file and tell the
> > producers that they are writing broken output files. You
> > should always fix the source, instead of trying to make sense
> > out of broken input in fragile ways.
> >
> > > I read the documentation for xml.etree.ElementTree and see
> > > that it may take an optional parser parameter, but I don't know
> > > what this parser should be - to ignore the invalid characters.
> > >
> > > Could you suggest a way to call ElementTree, so it won't
> > > bomb on these invalid characters ?
> >
> > No. The parser in lxml.etree has a 'recover' option that lets
> > it try to recover from input errors, but in general, XML
> > parsers are required to reject non well-formed input.
> >
> > Stefan
>
> Hi Stefan,
> The XML file seems to be valid XML (all XML viewers I tried were able to read
> it).
> You can verify this by trying to read the XML example I attached to the
> original message (attached again here).
> Actually, when trying to view the file with an XML viewer, these offensive
> characters are not shown.
> It's just that some of the fields include characters that the parser used by
> ElementTree seems to choke on.
> Bye,
> Ron.
Have a look at your file with e.g. a hex editor or just Python repr() -- see below. You will see that there are four cases of

<tag>good_data\x00garbage</tag>

where "garbage" is repeated \x00 or just random line noise or uninitialised memory.

<m_sanApiName1>"MainStorage_snap\x00\x00*SNIP*\x00\x00"</m_sanApiName1>

<m_detail>"BROLB21\x00\xee"\x00\x00\x00\x90,\x02G\xdc\xfb\x04P\xdc\xfb\x04\x01a\xfc>(\xe8\xfb\x04"</m_detail>

It's a toss-up whether the > in there is accidental or a deliberate attempt to sanitise the garbage !-)

<m_detail>"Alstom\x00\x00o\x00m\x00\x00*SNIP*\x00\x00"</m_detail>

<m_sanApiVersion>"V5R1.28.1 [R - LA]\x00\x00*SNIP*\x00\x00"</m_sanApiVersion>

The garbage in the 2nd case is such as to make the initial declaration encoding="UTF-8" an outright lie and I'm curious as to how the XML parser managed to get as far as it did -- it must decode a line at a time.

As already advised: it's much better to reject that rubbish outright than to attempt to repair it. Repair should be contemplated only if it's a one-off exercise AND you can't get a fixed copy from the source.

And while we're on the subject of rubbish: """The XML file seems to be valid XML (all XML viewers I tried were able to read it).""" The conclusion from that is that all XML viewers that you tried are rubbish.

--
http://mail.python.org/mailman/listinfo/python-list
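
For anyone who wants to repeat the repr() inspection described above, here is a minimal sketch. It is not the script used for the post; the function names (find_illegal_bytes, first_bad_utf8, report) are made up for illustration, and it assumes Python 3 and a file whose declared encoding is UTF-8. It only locates the bad bytes so you can show them to whoever produces the file; it makes no attempt at repair.

```python
import re

# Control bytes that cannot appear in well-formed XML 1.0 text when the
# encoding is ASCII-compatible: everything below 0x20 except tab, LF and CR.
ILLEGAL_RUN = re.compile(rb"[\x00-\x08\x0b\x0c\x0e-\x1f]+")


def find_illegal_bytes(data, context=20):
    """Yield (offset, snippet) for each run of illegal control bytes.

    The snippet includes up to `context` bytes on either side of the run,
    so that repr() of it shows which element the garbage sits in.
    """
    for match in ILLEGAL_RUN.finditer(data):
        start = max(match.start() - context, 0)
        end = min(match.end() + context, len(data))
        yield match.start(), data[start:end]


def first_bad_utf8(data):
    """Return the offset of the first byte that is not valid UTF-8, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start
    return None


def report(path):
    """Print every suspicious spot in the file at `path` (hypothetical helper)."""
    with open(path, "rb") as handle:  # binary mode: we want the raw bytes
        data = handle.read()
    for offset, snippet in find_illegal_bytes(data):
        print(offset, repr(snippet))
    bad = first_bad_utf8(data)
    if bad is not None:
        print("first invalid UTF-8 byte at offset", bad)
```

Calling report("tmp_small.xml") on a file like the one discussed should list each \x00 run together with its surrounding tag. A clean file prints nothing.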