(This question was also posted in the devshed Python forum: http://forums.devshed.com/python-programming-11/parsing-xml-with-elementtree-unicode-problem-461518.html )
-----------------------------
(It's a bit long, but I hope I've included all the relevant information.)

1. Here is my problem: I'm trying to parse an XML file (saved locally) using ElementTree.parse, but I get the following error:

    xml.parsers.expat.ExpatError: not well-formed (invalid token): line 13, column 327

Apparently the problem is caused by the token 'Saunière', because of the accented character 'è'. The thing is, I'm sure that Python (the ElementTree module and its parse() function) can handle this kind of character data, since I originally obtain the XML from the web by opening it like this:

    from elementtree import ElementTree
    from urllib import urlopen

    query = r'http://ecs.amazonaws.com/onca/xml?Service=AWSECommerceService&AWSAccessKeyId=189P5TE3VP7N9MN0G302&Operation=ItemLookup&ItemId=1400079179&ResponseGroup=Reviews&ReviewPage=166'
    root = ElementTree.parse(urlopen(query))

where query is a query to AWS, and this specific query has 'Saunière' in the response. (You can simply open the query in a web browser and see the XML.)

I then create a local version of the XML file, containing only the tags that are of interest. My file looks something like this (I replaced some of the content with 'bla bla' to make it fit here):

    <ReviewBatch>
      <Review>
        <ID>805</ID>
        <Rating>3</Rating>
        <HelpfulVotes>5</HelpfulVotes>
        <TotalVotes>6</TotalVotes>
        <Date>2004-04-03</Date>
        <Summary>Not as good as Angels and Demons</Summary>
        <Content>I found that this book was not as good and thrilling as Angels and Demons. bla bla.</Content>
      </Review>
      <Review>
        <ID>827</ID>
        <Rating>4</Rating>
        <HelpfulVotes>2</HelpfulVotes>
        <TotalVotes>8</TotalVotes>
        <Date>2004-04-01</Date>
        <Summary>The Da Vinci Code, a master piece of words</Summary>
        <Content>The Da Vinci Code by Dan Brown is a well-written bla bla. The story starts out in Paris, France with a murder of Jacque Saunière, the head curator at Le Louvre.bla bla </Content>
      </Review>
    </ReviewBatch>

But when I then try:

    fIn = open(file, 'r')  # or even 'import codecs' and fIn = codecs.open(file, encoding='utf-8')
    tree = ElementTree.parse(fIn)

where file is the saved file, I get the error above (xml.parsers.expat.ExpatError: not well-formed (invalid token): line 13, column 327).

So what's the difference? How come parsing works in the first case but fails in the second? Please advise.

2. There is another problem that might be related: I get a similar error if the content of the locally saved XML has special characters such as '&', for example 'angels & demons' (vs. 'angels and demons'). Is it the same problem? Same solution?

Thanks!
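
A minimal sketch that may reproduce both errors. It rests on assumptions the post does not confirm: that the local file was written in a single-byte encoding such as Latin-1, and that it has no XML declaration, in which case the parser falls back to UTF-8. It uses the Python 3 standard-library xml.etree.ElementTree (which raises ParseError) rather than the standalone elementtree package from the post; the sample strings and the parses() helper are made up for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def parses(data):
    """Return True if data (str or bytes) is accepted by the XML parser."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False


# Hypothetical sample content containing the accented character from the post.
text = "Jacque Sauni\u00e8re, the head curator"
element = "<Content>%s</Content>" % text

# Case 1a: Latin-1 bytes with no XML declaration. The parser assumes UTF-8,
# and the lone 0xE8 byte is not a valid UTF-8 sequence.
latin1_bytes = element.encode("latin-1")

# Case 1b: the same characters encoded as UTF-8, still without a declaration.
utf8_bytes = element.encode("utf-8")

# Case 1c: Latin-1 bytes, but with a declaration naming the encoding.
declared_bytes = b'<?xml version="1.0" encoding="iso-8859-1"?>' + latin1_bytes

# Case 2: a bare '&' is markup in XML and must be written as '&amp;'.
bare_amp = "<Summary>angels & demons</Summary>"
escaped_amp = "<Summary>%s</Summary>" % escape("angels & demons")

for label, data in [
    ("latin-1, no declaration", latin1_bytes),
    ("utf-8, no declaration", utf8_bytes),
    ("latin-1, declared", declared_bytes),
    ("bare ampersand", bare_amp),
    ("escaped ampersand", escaped_amp),
]:
    print(label, "->", "ok" if parses(data) else "not well-formed")
```

If this is what is happening, the two problems are different: the first would be an encoding mismatch introduced when the local copy was written (the response fetched from the web either is UTF-8 or declares its encoding), and the second would be unescaped markup in the text. Opening the file with codecs.open would not help in the first case if the bytes on disk are not actually UTF-8.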