On Wed, Apr 27, 2011 at 2:26 PM, Mike <Mike@invalid.invalid> wrote:
> I'm using ElementTree to parse an XML file, but it stops at the second
> record (id = 002), which contains a non-standard ascii character, ä. Here's
> the XML:
>
> <?xml version="1.0"?>
> <snapshot time="Mon Apr 25 08:47:23 PDT 2011">
> <records>
> <record id="001" education="High School" employment="7 yrs" />
> <record id="002" education="Universität Bremen" employment="3 years" />
> <record id="003" education="River College" employment="5 yrs" />
> </records>
> </snapshot>
>
> The complaint offered up by the parser is
>
> Unexpected error opening simple_fail.xml: not well-formed (invalid token):
> line 5, column 40
>
> and if I change the line to eliminate the ä, everything is wonderful. The
> parser is perfectly happy with this modification:
>
> <record id="002" education="University Bremen" employment="3 yrs" />
>
> I can't find anything in the ElementTree docs about allowing additional text
> characters or coercing strange ascii to Unicode.
>
> Is there a way to coerce the text so it doesn't cause the parser to raise an
> exception?
>

Have you tried specifying the file encoding? ä is not "strange ascii". It's
outside the ASCII range, so if the parser expects ASCII (or UTF-8, which is
what an XML parser assumes when the file declares no encoding) and the file
was saved in some other encoding, it will get confused.

> Here's my test script (simple_fail contains the offending line, and
> simple_pass contains the line that passes).
>
> import sys
> import xml.etree.ElementTree as ET
>
> def main():
>
>     xml_files = ['simple_fail.xml', 'simple_pass.xml']
>     for xml_file in xml_files:
>
>         print
>         print 'XML file: %s' % (xml_file)
>
>         try:
>             tree = ET.parse(xml_file)
>         except Exception, inst:
>             print "Unexpected error opening %s: %s" % (xml_file, inst)
>             continue
>
>         root = tree.getroot()
>         records = root.find('records')
>         for record in records:
>             print record.attrib['id'], record.attrib['education']
>
> if __name__ == "__main__":
>     main()
>
>
> Thanks,
>
> -- Mike --

--
http://mail.python.org/mailman/listinfo/python-list
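
To illustrate what "specifying the file encoding" could look like, here is a
minimal sketch in Python 3. It assumes the file was saved as Latin-1
(ISO-8859-1), which the original post does not state, so check what your
editor actually wrote. The helper name educations() and the trimmed-down
sample data are made up for the example. It shows the same bytes failing
without a declared encoding, then parsing once the encoding is either
declared in the XML header or passed to the parser.

```python
import xml.etree.ElementTree as ET

# One record from the question, trimmed down for the example.
RECORD = '<records><record id="002" education="Universität Bremen" /></records>'

# ASSUMPTION: the file on disk is Latin-1, so "ä" is the single byte 0xE4.
latin1_body = RECORD.encode("latin-1")

# Same bytes, with and without an encoding declaration in the XML header.
undeclared = b'<?xml version="1.0"?>\n' + latin1_body
declared = b'<?xml version="1.0" encoding="ISO-8859-1"?>\n' + latin1_body


def educations(xml_bytes, parser=None):
    """Hypothetical helper: return the education attribute of every record."""
    root = ET.fromstring(xml_bytes, parser=parser)
    return [rec.attrib["education"] for rec in root.iter("record")]


# Without a declaration the parser assumes UTF-8, and 0xE4 followed by a
# space is not a valid UTF-8 sequence, so parsing fails.
try:
    educations(undeclared)
    undeclared_ok = True
except ET.ParseError as err:
    undeclared_ok = False
    print("undeclared encoding failed:", err)

# Option 1: declare the encoding in the XML header.
from_header = educations(declared)

# Option 2: leave the file alone and tell the parser which encoding to use.
from_parser = educations(undeclared, parser=ET.XMLParser(encoding="ISO-8859-1"))

print(from_header, from_parser)
```

If the file turns out to be in a different encoding, the same two options
apply with that encoding's name. Fixing the header is usually the better
choice, since every other XML tool will then read the file correctly too.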