Sorry for the top posting - I found out that the problem I encountered was not something new in Python 3.0.
Here's a test program:
============
import xml.etree.ElementTree
ElementTree = xml.etree.ElementTree
import htmlentitydefs

class XmlParser(ElementTree.ElementTree):
    def __init__(self, file=None):
        ElementTree.ElementTree.__init__(self)
        parser = ElementTree.XMLTreeBuilder(
            target=ElementTree.TreeBuilder(ElementTree.Element))
        parser.entity = htmlentitydefs.entitydefs
        self.parse(source=file, parser=parser)
        return

f = open('test.html')
tree = XmlParser(f)
tree.write('test_out.html', encoding='utf-8')
======

This program should be run with the following test file (test.html):
=====
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
<title>test</title>
</head>
<body>
<p>&Alpha;</p>
</body>
</html>
======

If run as such, it will print out the following:
------
<html>
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type" />
<title>test</title>
</head>
<body>
<p>&amp;#913;</p>
</body>
</html>
-------

Notice how it is &amp;#913; that appears instead of &Alpha;

This is the behaviour with both Python 3.0 and 2.5. (When I was
running with Python 2.5, I was always preprocessing the files with
BeautifulSoup, which removed many problems.)

If I use "my_htmlentitiesdef.py" described in a previous message, I do
get an Alpha printed out (admittedly, not the character entity). I
would prefer to find a way to process such files and get &Alpha;
instead... (or even, to process files with hard-coded characters, e.g.
é instead of &eacute;, and have them processed properly...).

unicode-challengedly-yrs,
André

On Dec 26, 9:14 pm, "André" <[EMAIL PROTECTED]> wrote:
> On Dec 26, 8:53 pm, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
>
> > > Without an additional parser, I was getting the following error
> > > message:
> > [...]
> > > xml.parsers.expat.ExpatError: undefined entity é: line 401, column 11
>
> > To understand that problem better, it would have been helpful to see
> > what line 401, column 11 of the input file actually says. AFAICT,
> > it must have been something like "&é;" which would be really puzzling
> > to have in an XML file (usually, people restrict themselves to ASCII
> > for entity names).
>
> No, that one was &eacute; (testing with my own name that appeared in
> a file).
>
> > > for entity in ent:
> > >     if entity not in parser.entity:
> > >         parser.entity[entity] = ent[entity]
>
> > This looks fine to me.
>
> > > The output was "wrong". For example, one of the tests I used was to
> > > process a copy of the main dict of htmlentitydefs.py inside an html page.
> > > A few of the characters came out OK, but I got things like:
>
> > > '&#913;': 0x0391, # greek capital letter alpha, U+0391
>
> > Why do you think this is wrong?
>
> Sorry, that was just cut-and-pasted from the browser (not the source);
> in the source of the processed html page, it is
> '&amp;#913;': 0x0391, # greek capital letter alpha, U+0391
>
> i.e. the "&" was transformed into "&amp;" in a number of places (all
> places above ascii 127 I believe).
>
> Here are a few more lines extracted from the html file that was
> processed:
> =============
> 'Â': 0x00c2, # latin capital letter A with circumflex, U+00C2
> ISOlat1
> 'À': 0x00c0, # latin capital letter A with grave = latin capital
> letter A grave, U+00C0 ISOlat1
> '&amp;#913;': 0x0391, # greek capital letter alpha, U+0391
> 'Å': 0x00c5, # latin capital letter A with ring above = latin
> capital letter A ring, U+00C5 ISOlat1
> 'Ã': 0x00c3, # latin capital letter A with tilde, U+00C3 ISOlat1
> 'Ä': 0x00c4, # latin capital letter A with diaeresis, U+00C4
> ISOlat1
> '&amp;#914;': 0x0392, # greek capital letter beta, U+0392
> 'Ç': 0x00c7, # latin capital letter C with cedilla, U+00C7
> ISOlat1
> '&amp;#935;': 0x03a7, # greek capital letter chi, U+03A7
> '&amp;#8225;': 0x2021, # double dagger, U+2021 ISOpub
> '&amp;#916;': 0x0394, # greek capital letter delta, U+0394
> ISOgrk3
> ============
>
> > > When using my modified version, I got the following (which may not be
> > > transmitted properly by email...)
> > > 'Α': 0x0391, # greek capital letter alpha, U+0391
>
> > > It does look like a Greek capital letter alpha here.
>
> > Sure, however, your first version ALSO has the Greek capital letter
> > alpha there; it is just spelled as &#913; (which *is* a valid spelling
> > for that letter in XML).
>
> Agreed that it would be... However that was not how it was
> transformed, see above; sorry if I was not clear about what was
> happening (I should not have cut-and-pasted from the browser window).
>
> > > I hope the above is of some help.
>
> > Thanks; I now think that htmlentitydefs is just as fine as it always
> > was - I don't see any problem here.
>
> You may well be right in that the problem may lie elsewhere. But as
> making the change I mentioned "fixed" the problem at my end, I figured
> this was where the problem was located - and thought I should at least
> report it here.
>
> Regards,
> André
>
> > Regards,
> > Martin
--
http://mail.python.org/mailman/listinfo/python-list
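
P.S. For anyone reading this thread later, here is a minimal sketch of what I think is going on, written for a current Python 3, where the module is html.entities rather than htmlentitydefs. It does not parse anything. It only shows how ElementTree serializes character data when an entity's replacement text is a character reference string, compared with the real character. The names char_entities and serialize are made up for this illustration, and whether char_entities is a suitable value for parser.entity is an assumption I have not tested here.

```python
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# Hypothetical replacement table: each HTML entity name maps to the
# character itself, never to a "&#NNN;" reference string.
char_entities = {name: chr(code) for name, code in name2codepoint.items()}


def serialize(text, encoding="us-ascii"):
    """Serialize <p>text</p> the way ElementTree writes character data."""
    p = ET.Element("p")
    p.text = text
    return ET.tostring(p, encoding=encoding)


# Replacement text that is itself a character reference: the serializer
# treats it as ordinary text and escapes its ampersand.
as_reference = serialize("&#913;")

# Replacement text that is the real character: it is written as a numeric
# reference for an ASCII target, or as encoded bytes for a UTF-8 target.
as_char_ascii = serialize(char_entities["Alpha"])
as_char_utf8 = serialize(char_entities["Alpha"], encoding="utf-8")

print(as_reference)
print(as_char_ascii)
print(as_char_utf8.decode("utf-8"))
```

If this reading is right, the doubled ampersand comes from the replacement text being a reference string, not from the writer itself, which would explain why swapping in a table of real characters "fixed" the output for me.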