ElementTree XML parsing problem

Mike Wed, 27 Apr 2011 11:33:25 -0700

I'm using ElementTree to parse an XML file, but it stops at the second record (id = 002), which contains a non-ASCII character, ä. Here's the XML:


<?xml version="1.0"?>
<snapshot time="Mon Apr 25 08:47:23 PDT 2011">
<records>
<record id="001" education="High School" employment="7 yrs" />
<record id="002" education="Universität Bremen" employment="3 years" />
<record id="003" education="River College" employment="5 yrs" />
</records>
</snapshot>


The complaint offered up by the parser is

Unexpected error opening simple_fail.xml: not well-formed (invalid token): line 5, column 40

and if I change the line to eliminate the ä, everything is wonderful. The parser is perfectly happy with this modification:


<record id="002" education="University Bremen" employment="3 yrs" />

I can't find anything in the ElementTree docs about allowing additional text characters or converting non-ASCII characters to Unicode.

Is there a way to coerce the text so it doesn't cause the parser to raise an exception?

Here's my test script (simple_fail contains the offending line, and simple_pass contains the line that passes).


import xml.etree.ElementTree as ET

def main():
    xml_files = ['simple_fail.xml', 'simple_pass.xml']
    for xml_file in xml_files:
        print('')
        print('XML file: %s' % xml_file)

        # Report a parse failure and move on to the next file.
        try:
            tree = ET.parse(xml_file)
        except Exception as inst:
            print("Unexpected error opening %s: %s" % (xml_file, inst))
            continue

        # Print the id and education attributes of each <record>.
        root = tree.getroot()
        records = root.find('records')
        for record in records:
            print('%s %s' % (record.attrib['id'], record.attrib['education']))

if __name__ == "__main__":
    main()
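
A possible explanation, offered as an assumption because the post does not say how simple_fail.xml was saved: if the file is stored as Latin-1 (ISO-8859-1) and its XML declaration names no encoding, the parser treats the bytes as UTF-8, and the single byte for ä is not valid UTF-8. The sketch below is written for Python 3 and rebuilds a shortened version of the document in memory rather than reading the real files. The names BODY and try_parse are made up for illustration. It shows the failure and two ways around it: declaring the encoding, or transcoding the bytes to UTF-8 before parsing.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the document from the post (one record only).
BODY = (
    '<snapshot time="Mon Apr 25 08:47:23 PDT 2011">\n'
    '<records>\n'
    '<record id="002" education="Universit\u00e4t Bremen" employment="3 years" />\n'
    '</records>\n'
    '</snapshot>\n'
)

def try_parse(data):
    """Return the education attribute, or the parser's error message."""
    try:
        root = ET.fromstring(data)
    except ET.ParseError as err:
        return 'error: %s' % err
    return root.find('records')[0].attrib['education']

# Assumption: the failing file is Latin-1 with no encoding declaration.
# The parser then assumes UTF-8, and the lone byte 0xE4 is rejected.
latin1_undeclared = ('<?xml version="1.0"?>\n' + BODY).encode('latin-1')
result_undeclared = try_parse(latin1_undeclared)

# Option 1: keep the Latin-1 bytes but declare the encoding.
latin1_declared = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>\n' + BODY
).encode('latin-1')
result_declared = try_parse(latin1_declared)

# Option 2: transcode the bytes to UTF-8 before handing them to the parser.
utf8_bytes = latin1_undeclared.decode('latin-1').encode('utf-8')
result_transcoded = try_parse(utf8_bytes)

print(result_undeclared)
print(result_declared)
print(result_transcoded)
```

If the real file turns out to use a different encoding (Windows-1252, for example), the same two options should apply with that encoding name substituted.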


Thanks,

-- Mike --

