On 4/27/2011 12:24 PM, Neil Cerutti wrote:
On 2011-04-27, Mike <Mike@invalid.invalid> wrote:
I'm using ElementTree to parse an XML file, but it stops at the
second record (id = 002), which contains a non-ASCII
character, ä. Here's the XML:
<?xml version="1.0"?>
<snapshot time="Mon Apr 25 08:47:23 PDT 2011">
<records>
<record id="001" education="High School" employment="7 yrs" />
<record id="002" education="Universität Bremen" employment="3 years" />
<record id="003" education="River College" employment="5 yrs" />
</records>
</snapshot>
The complaint offered up by the parser is
Unexpected error opening simple_fail.xml: not well-formed
(invalid token): line 5, column 40
It seems to be an invalid XML document, as another poster
indicated.
and if I change the line to eliminate the ä, everything is
wonderful. The parser is perfectly happy with this
modification:
<record id="002" education="University Bremen" employment="3 yrs" />
I can't find anything in the ElementTree docs about allowing
additional text characters or coercing non-ASCII characters to
Unicode.
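For anyone who wants to reproduce the "invalid token" complaint quoted above without the original file, here is a minimal sketch. It assumes the ä was stored as the single ISO-8859-1 byte 0xE4, which is a guess about how the file was written; XML_BYTES and try_parse are made-up names for illustration. Since the document has no encoding declaration, the parser treats it as UTF-8, and a lone 0xE4 is not a valid UTF-8 sequence.

```python
import xml.etree.ElementTree as etree

# Hypothetical in-memory copy of the file. The 'ä' is assumed to be the
# single ISO-8859-1 byte 0xE4; the real file may differ.
XML_BYTES = (
    b'<?xml version="1.0"?>\n'
    b'<snapshot time="Mon Apr 25 08:47:23 PDT 2011">\n'
    b'<records>\n'
    b'<record id="001" education="High School" employment="7 yrs" />\n'
    b'<record id="002" education="Universit\xe4t Bremen" employment="3 years" />\n'
    b'<record id="003" education="River College" employment="5 yrs" />\n'
    b'</records>\n'
    b'</snapshot>\n'
)


def try_parse(data):
    """Return the ParseError raised while parsing data, or None on success."""
    try:
        etree.fromstring(data)
    except etree.ParseError as exc:
        return exc
    return None


# No encoding declaration, so the bytes are read as UTF-8 and 0xE4 is rejected.
error = try_parse(XML_BYTES)
print(error)

# Replacing the offending byte with plain ASCII makes the document parse.
print(try_parse(XML_BYTES.replace(b'\xe4', b'a')))
```

The position attribute of the exception carries the line and column the parser stopped at, which is handy for locating the bad byte in a larger file.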
If you're not the one generating that bogus file, then you can
specify the encoding yourself by creating an XMLParser.
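import xml.etree.ElementTree as etree

# Open in binary mode so the parser, not open(), decodes the bytes.
with open('file.xml', 'rb') as xml_file:
    # Override the UTF-8 default with the encoding the file appears to use.
    parser = etree.XMLParser(encoding='ISO-8859-1')
    root = etree.parse(xml_file, parser=parser).getroot()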
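The same approach can be tried without touching the disk. This is only a sketch: it assumes the file really is ISO-8859-1 (so the ä is the byte 0xE4), and XML_BYTES is a hypothetical stand-in for the file's contents, trimmed to the records element.

```python
import io
import xml.etree.ElementTree as etree

# Hypothetical stand-in for the file, with 'ä' as the ISO-8859-1 byte 0xE4.
XML_BYTES = (
    b'<?xml version="1.0"?>\n'
    b'<records>\n'
    b'<record id="001" education="High School" employment="7 yrs" />\n'
    b'<record id="002" education="Universit\xe4t Bremen" employment="3 years" />\n'
    b'<record id="003" education="River College" employment="5 yrs" />\n'
    b'</records>\n'
)

# The encoding passed to XMLParser overrides whatever the document declares
# (here it declares nothing, so the parser would otherwise assume UTF-8).
parser = etree.XMLParser(encoding='ISO-8859-1')
root = etree.parse(io.BytesIO(XML_BYTES), parser=parser).getroot()

# Attribute values come back as ordinary Unicode strings.
educations = [record.get('education') for record in root.iter('record')]
print(educations)
```

Note that the override only helps if the guess is right. If the file were actually in some other single-byte encoding, parsing would still succeed but the non-ASCII characters could come out wrong.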
Thanks, Neil. I'm not generating the file, just trying to parse it. Your
solution is precisely what I was looking for, even if I didn't quite ask
correctly. I appreciate the help!
-- Mike --