On 2011-04-27, Mike <Mike@invalid.invalid> wrote:
> I'm using ElementTree to parse an XML file, but it stops at the
> second record (id = 002), which contains a non-standard ascii
> character, ä. Here's the XML:
>
><?xml version="1.0"?>
><snapshot time="Mon Apr 25 08:47:23 PDT 2011">
><records>
><record id="001" education="High School" employment="7 yrs" />
><record id="002" education="Universität Bremen" employment="3 years" />
><record id="003" education="River College" employment="5 yrs" />
></records>
></snapshot>
>
> The complaint offered up by the parser is
>
> Unexpected error opening simple_fail.xml: not well-formed
> (invalid token): line 5, column 40

It seems to be an invalid XML document, as another poster
indicated: with no encoding declaration the parser assumes
UTF-8, and the ä was apparently saved in a different encoding.

> and if I change the line to eliminate the ä, everything is
> wonderful. The parser is perfectly happy with this
> modification:
>
> <record id="002" education="University Bremen" employment="3
> yrs" />
>
> I can't find anything in the ElementTree docs about allowing
> additional text characters or coercing strange ascii to
> Unicode.

If you're not the one generating that bogus file, then you can
specify the encoding yourself by declaring an XMLParser.

import xml.etree.ElementTree as etree

# Open in binary mode so the parser, not open(), decodes the bytes.
with open('file.xml', 'rb') as xml_file:
    parser = etree.XMLParser(encoding='ISO-8859-1')
    root = etree.parse(xml_file, parser=parser).getroot()

-- 
Neil Cerutti
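
P.S. For anyone who wants to reproduce this without the original file, here is a rough self-contained sketch. It assumes the file was saved as ISO-8859-1 (a guess on my part, not something the original poster confirmed). The sample data is a trimmed-down copy of the records above, and the helper names parse_with_default_encoding and parse_as_latin1 are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Sample document encoded as ISO-8859-1 with no encoding declaration,
# so the parser will assume UTF-8 unless told otherwise.
RAW_XML = (
    '<?xml version="1.0"?>\n'
    '<records>\n'
    '<record id="001" education="High School" />\n'
    '<record id="002" education="Universit\u00e4t Bremen" />\n'
    '</records>\n'
).encode('iso-8859-1')


def parse_with_default_encoding(data):
    """Return the root element, or None if the parser rejects the bytes."""
    try:
        return ET.parse(io.BytesIO(data)).getroot()
    except ET.ParseError:
        return None


def parse_as_latin1(data):
    """Parse the bytes with the encoding forced to ISO-8859-1."""
    parser = ET.XMLParser(encoding='ISO-8859-1')
    return ET.parse(io.BytesIO(data), parser=parser).getroot()


default_root = parse_with_default_encoding(RAW_XML)
latin1_root = parse_as_latin1(RAW_XML)
educations = [rec.get('education') for rec in latin1_root.iter('record')]

print(default_root is None)
print(educations)
```

If the real file turns out to be in some other encoding (windows-1252, say), the same approach should apply with a different encoding name, provided the underlying parser recognises it.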