Fredrik Lundh wrote:
> Steven Bethard wrote:
>
>> Hmm... I downloaded the newest cElementTree (and I already had the
>> newest ElementTree), and here's what I get:
>>
>> >>> tree = myparser(filename, 'gbk')
>> Traceback (most recent call last):
>>   File "<interactive input>", line 1, in ?
>>   File "<interactive input>", line 8, in myparser
>> SyntaxError: not well-formed (invalid token): line 8, column 6
>>
>> FWIW, the file used above doesn't have an <?xml encoding?> header:
>>
>> >>> open(filename).read()
>> '<DOC>\n<DOCID>ART242</DOCID>\n<HEADER>\n
>> <DATE></DATE>\n</HEADER>\n<BODY>\n<HEADLINE>\n<S ID=2566>
>
> <S ID=2655> isn't a valid XML tag (the attribute value must be quoted)
>
> if I recode the file into UTF-8 and fix the two S tags, the result displays
> just fine in IE and Firefox (I get a few boxes/question marks, but I assume
> that's a font problem).

Thanks (to both Fredrik and Just). You stare at XML too long and you
start to miss the obvious things too. =) Everything works great now:

>>> text = open(filename).read()
>>> text = re.sub(r'<S ID=(\w+)', r'<S ID="\1"', text)
>>> text = text.decode('gbk').encode('utf-8')
>>> et.fromstring(text)
<Element 'DOC' at 00A2AF38>

=)

Steve
-- 
http://mail.python.org/mailman/listinfo/python-list
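
For anyone trying this on Python 3, where str no longer has a .decode
method, the same three steps (decode from gbk, quote the bare S ID
values, hand UTF-8 to the parser) might look roughly like the sketch
below. The function name parse_gbk_doc and the sample bytes are made up
for illustration, and the standard-library xml.etree.ElementTree stands
in for the separately installed ElementTree/cElementTree; only the regex
and the gbk-to-UTF-8 recoding come from the session above.

```python
import re
import xml.etree.ElementTree as et


def parse_gbk_doc(raw_bytes, encoding="gbk"):
    """Hypothetical helper: decode, quote bare S ID values, parse as XML."""
    # Decode the raw file bytes into text first (Python 3 reads bytes in 'rb' mode).
    text = raw_bytes.decode(encoding)
    # Quote unquoted attribute values, e.g. <S ID=2566> becomes <S ID="2566">.
    text = re.sub(r'<S ID=(\w+)', r'<S ID="\1"', text)
    # With no XML declaration present, the parser treats byte input as UTF-8.
    return et.fromstring(text.encode("utf-8"))


# Made-up sample in the same shape as the file discussed above.
sample = (
    "<DOC>\n<DOCID>ART242</DOCID>\n<BODY>\n"
    "<S ID=2566>\u4e2d\u6587</S>\n</BODY>\n</DOC>"
).encode("gbk")

root = parse_gbk_doc(sample)
print(root.tag, root.find("BODY/S").get("ID"))
```

With a real file, the bytes would come from open(filename, 'rb').read().
Note that the regex only handles the S tags seen here; any other tag
with an unquoted attribute would still trip the parser.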